
Weaknesses and Vulnerabilities in Modern AI: Integrity, Confidentiality, and Governance


In the development of AI systems for mission applications, it is essential to recognize the kinds of weaknesses and vulnerabilities unique to modern AI models. This recognition is important for the design, implementation, and test and evaluation (T&E) of AI models and AI-based systems. The October 2023 Executive Order on AI highlights the importance of red teams, and we can expect that these weaknesses and vulnerabilities will be a focus of attention for any T&E activity.

This blog post examines a number of specific weaknesses and vulnerabilities associated with modern artificial intelligence (AI) models that are based on neural networks. These neural models include machine learning (ML) and generative AI, particularly large language models (LLMs). We focus on three aspects:

  • Triggers, including both attack vectors for deliberate adversarial action (exploiting vulnerabilities) and intrinsic limitations due to the statistical nature of the models (manifestations of weaknesses)
  • The nature of operational consequences, including the kinds of potential failures or harms in operations
  • Methods to mitigate them, including both engineering and operational actions

This is the second installment in a four-part series of blog posts focused on AI for critical systems where trustworthiness—based on checkable evidence—is essential for operational acceptance. The four parts are relatively independent of each other and address this challenge in stages:

  • Part 1: What are appropriate concepts of security and safety for modern neural-network-based AI, including ML and generative AI, such as LLMs? What are the AI-specific challenges in developing safe and secure systems? What are the limits to trustworthiness with modern AI, and why are these limits fundamental?
  • Part 2 (this part): What are examples of the kinds of risks specific to modern AI, including risks associated with confidentiality, integrity, and governance (the CIG framework), with and without adversaries? What are the attack surfaces, and what kinds of mitigations are currently being developed and employed for these weaknesses and vulnerabilities?
  • Part 3: How can we conceptualize T&E practices appropriate to modern AI? How, more generally, can frameworks for risk management (RMFs) be conceptualized for modern AI analogous to those for cyber risk? How can a practice of AI engineering address challenges in the near term, and how does it interact with software engineering and cybersecurity considerations?
  • Part 4: What are the benefits of looking beyond the purely neural network models of modern AI towards hybrid approaches? What are current examples that illustrate the potential benefits, and how, looking ahead, can these approaches advance us beyond the fundamental limits of modern AI? What are prospects in the near and longer terms for hybrid AI approaches that are verifiably trustworthy and that can support highly critical applications?

The sections below identify specific examples of weaknesses and vulnerabilities, organized according to three categories of consequences—integrity, confidentiality, and governance. This builds on a number of NIST touchstones, including the AI Risk Management Framework (AI RMF), which includes an AI RMF playbook, a draft generative AI RMF profile, a model-focused categorization of adversarial ML attacks, and a testbed for evaluation and experimentation. The NIST RMF organizes actions into four categories: govern (cultivate risk-aware organizational culture), map (recognize usage context), measure (identify, analyze, and assess risks), and manage (prioritize and act). CIG builds on these NIST touchstones, with a focus on consequences of both attacks (enabled by vulnerabilities) and adverse accidental outcomes (enabled by weaknesses), with an intent to anticipate hybrid AI approaches that can safely—and verifiably—support highly critical applications.

Risks, Part 1: Integrity

In the context of modern neural-network-based AI, including ML and generative AI, integrity risks refer to the potential for attacks that could cause systems to produce results not intended by designers, implementers, and evaluators. We note that, because specifications of intent—beyond curation of the corpus of training data—are difficult or infeasible for many neural-network models, the concept of “intended results” has only informal meaning.

The paragraphs below identify several kinds of integrity attacks against neural networks and the nature of the weaknesses and vulnerabilities that are exploited, along with some discussion of potential mitigations.

Data poisoning. In data poisoning attacks, an adversary interferes with the data that an ML algorithm is trained on, for example by injecting additional data elements during the training process. (Poisoning can be effective in supervised as well as unsupervised and reinforcement learning settings.) These attacks enable an adversary to interfere with test-time and runtime behaviors of the trained algorithm, either by degrading overall effectiveness (accuracy) or by causing the algorithm to produce incorrect results in specific situations. Research has shown that a surprisingly small amount of manipulated training data, even just a handful of samples, can lead to large changes in the behavior of the trained model. Data poisoning attacks are of particular concern when the quality of the training data cannot be readily ascertained; this difficulty can be amplified by the need to continuously retrain algorithms with newly acquired data.
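
As a minimal illustration of the untargeted form of this attack, the sketch below (assuming scikit-learn and a synthetic toy dataset as stand-ins for a real training pipeline) flips the labels of a small fraction of training samples and measures the effect on held-out accuracy. Targeted backdoor attacks reported in the literature can succeed with far fewer manipulated samples.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy stand-in for a real training pipeline.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def accuracy_with_poisoning(flip_fraction):
    """Flip the labels of a fraction of training samples, then train and score."""
    y_poisoned = y_train.copy()
    n_flip = int(flip_fraction * len(y_poisoned))
    idx = np.random.default_rng(0).choice(len(y_poisoned), size=n_flip, replace=False)
    y_poisoned[idx] = 1 - y_poisoned[idx]  # label flipping on a binary task
    model = LogisticRegression(max_iter=1000).fit(X_train, y_poisoned)
    return model.score(X_test, y_test)

for frac in (0.0, 0.05, 0.20):
    print(f"poisoned fraction {frac:.2f}: held-out accuracy {accuracy_with_poisoning(frac):.3f}")
```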

Relevant to national security and health domains, poisoning attacks can occur in federated learning, where a collection of organizations jointly train an algorithm without directly sharing the data that each organization possesses. Because the training data isn’t shared, it can be difficult for any party to determine the quality of the overall corpus of data. There are similar risks with public data, where adversaries can readily deploy adversarial training inputs. Related attacks can affect transfer learning methods, where a new model is derived from a previously trained model. It may be impossible to ascertain what data sources were used to train the source model, which would cloak any adversarial training affecting the derived model. (A number of hypotheses attempt to explain the surprising level of transferability across models, including, for larger models, commonality of data in the training corpus and in fine-tuning for alignment.)

Misdirection and evasion attacks. Evasion attacks are characterized by an adversary attempting to cause a trained model to produce incorrect outputs during the operation of a system. Examples of such incorrect outputs include misidentifying an object in an image, misclassifying risks in advising bank loan officers, and incorrectly judging the likelihood that a patient would benefit from a particular treatment. These attacks are accomplished by the adversary’s manipulation of an input or query given to the trained model. Evasion attacks are often categorized as either untargeted (the adversary’s goal is to trick the model into producing any incorrect answer) or targeted (the adversary’s goal is to trick the model into producing a specific incorrect answer). One example of an attack involves misdirecting neural networks for face recognition by placing colored dots on eyeglass frames. In many evasion attacks, it is important for the attacker-manipulated or attacker-provided input to appear benign, such that a cursory examination of the input by a human expert won’t reveal the attack. A well-known example is the placement of stickers on a stop sign. The stickers are unlikely to be noticed by human drivers—since many stop signs have stickers and other defacements—but carefully placed stickers function as patches that can reliably misdirect a sign-classification network into seeing a speed limit sign. This kind of spoofing has a relatively low work factor and indeed has been the subject of undergraduate homework assignments.
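
One widely studied family of evasion techniques perturbs inputs by following the gradient of the model’s loss. The sketch below shows the fast gradient sign method (FGSM) in PyTorch; the untrained toy classifier and the random "image" are placeholders, so the output here is illustrative only, but against a trained image classifier a small perturbation of this kind can often flip the predicted class.

```python
import torch
import torch.nn as nn

# Toy, untrained classifier as a stand-in for a trained image model.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
model.eval()

def fgsm_perturb(x, label, epsilon=0.1):
    """Fast gradient sign method: step the input in the direction that raises the loss."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(x_adv), label)
    loss.backward()
    return (x_adv + epsilon * x_adv.grad.sign()).clamp(0, 1).detach()

x = torch.rand(1, 1, 28, 28)  # placeholder "image"
label = torch.tensor([3])     # placeholder true label
x_adv = fgsm_perturb(x, label)
print("prediction before:", model(x).argmax(dim=1).item(),
      "after:", model(x_adv).argmax(dim=1).item())
```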

In evaluating the susceptibility of models to evasion attacks, a key consideration is to define what it means for a model’s output to be correct. For many applications, correctness could be defined as always giving the answer that a human would give. Needless to say, this can be difficult to test with any degree of comprehensiveness. Additionally, there are applications where this criterion may not be sufficient. For example, we may want to prohibit outputs that are accurate but harmful, such as detailed instructions on how to make an explosive or commit credit-card fraud.

One of the principal challenges in evaluation, as noted above, is defining design intent regarding system function and quality attributes, analogous to a traditional software specification. It remains a research problem to develop effective means to specify intent for many kinds of ML models and LLMs. How can the outputs of models be comprehensively verified against some ground truth to guard against misinformation or disinformation? Given that full specifications are rarely possible, the three CIG categories are not crisply delineated, and indeed this kind of attack poses both integrity and confidentiality risks.

Inexactness. The fundamental weakness shared by all modern AI technologies derives from the statistical nature of neural networks and their training: The results of neural network models are statistical predictions. Results fall within a distribution, and both memorization and hallucination lie within the bounds of that distribution. Research is leading to rapid improvement: Model designs are improving, training corpora are increasing in scale, and more computational resources are being applied to training processes. It is nonetheless essential to keep in mind that the resulting neural-network models are stochastically based and therefore are inexact predictors.

Generative AI hallucinations. The statistical modeling that is characteristic of LLM neural network architectures can lead to generated content that conflicts with input training data or that is inconsistent with facts. We say that this conflicting and incorrect content is hallucinated. Hallucinations can be representative elements generated from within a category of responses. This is why there is often a blurry similarity with the actual facts—called aleatoric uncertainty in the context of uncertainty quantification (UQ) modeling mitigation techniques (see below).

Reasoning failures. Corollary to the statistical inexactness is the fact that neural-network models do not have intrinsic capacity to plan or reason. As Yann LeCun noted, “[The models’] understanding of the world is very superficial, in large part because they are trained purely on text” and “auto-regressive LLMs have very limited reasoning and planning abilities.” The operation of LLMs, for example, is an iteration of predicting the next word in a text, building on the context of a prompt and the previous text string that the model has produced. LLMs can be prompted to lay out intermediate steps and, in so doing, often give better predictions that create an appearance of reasoning. One of the prompt techniques to accomplish this is called chain-of-thought (CoT) prompting. This creates a simulacrum of planning and reasoning (in a kind of Kahneman “fast-thinking” style), but it has unavoidably inexact results, which become more evident once reasoning chains scale up even to a small extent. A recent study suggested that chains longer than even a dozen steps are generally not faithful to the reasoning done without CoT. Among the many metrics on mechanical reasoning systems and computation generally, two are particularly pertinent in this comparison: (1) capacity for external checks on the soundness of the reasoning structures produced by an LLM, and (2) the number of steps of reasoning and/or computation undertaken.
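
As a purely illustrative sketch of the technique (ask_llm is a hypothetical stand-in for a call to any chat-style LLM API), CoT prompting amounts to appending an instruction that elicits intermediate steps:

```python
def ask_llm(prompt: str) -> str:
    """Hypothetical stand-in for a call to any chat-style LLM API."""
    return "[model response]"

question = "A train travels 120 miles in 2 hours. How far does it travel in 5 hours?"
direct_prompt = question
cot_prompt = question + "\nThink step by step, showing each intermediate step before the final answer."

print(ask_llm(direct_prompt))  # single-shot prediction
print(ask_llm(cot_prompt))     # elicits intermediate steps (an appearance of reasoning)
```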

Examples of Approaches to Mitigation

In addition to the approaches mentioned in the above sampling of weaknesses and vulnerabilities, there are a number of approaches being explored that have the potential to mitigate a broad range of weaknesses and vulnerabilities.

Uncertainty quantification. Uncertainty quantification, in the context of ML, focuses on identifying the kinds of statistical predictive uncertainties that arise in ML models, with a goal of modeling and measuring those uncertainties. A distinction is made between uncertainties relating to inherently random statistical effects (so-called aleatoric uncertainty) and uncertainties relating to insufficiencies in the representation of knowledge in a model (so-called epistemic uncertainty). Epistemic uncertainty can be reduced through additional training and improved network architecture. Aleatoric uncertainty relates to the statistical association of inputs and outputs and can be irreducible. UQ approaches depend on precise specifications of the statistical features of the problem.
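
One commonly cited way to estimate the epistemic component is Monte Carlo dropout: keep dropout active at inference time and treat the spread of repeated stochastic predictions as an uncertainty signal. The sketch below uses a toy, untrained PyTorch network as a stand-in to show the mechanics; aleatoric uncertainty is typically modeled differently, for example by predicting a variance term.

```python
import torch
import torch.nn as nn

# Toy, untrained network as a stand-in; the mechanics are what matter here.
model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(64, 3))

def mc_dropout_predict(x, n_samples=50):
    """Keep dropout active at inference and summarize the spread of predictions."""
    model.train()  # leaves dropout enabled
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(n_samples)])
    return probs.mean(dim=0), probs.std(dim=0)

x = torch.randn(1, 10)
mean, spread = mc_dropout_predict(x)
print("mean class probabilities:", mean.numpy().round(3))
print("per-class spread (epistemic proxy):", spread.numpy().round(3))
```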

UQ approaches are less useful in ML applications where adversaries have access to ML attack surfaces. There are UQ methods that attempt to detect samples that are not in the central portion of a probability distribution of expected inputs. These are also susceptible to attacks.

Many ML models can be equipped with the ability to express confidence or, inversely, the likelihood of failure. This enables the effects of failures to be modeled at the system level so that they can be mitigated during deployment. It is done through a combination of quantifying the uncertainty in ML models and building software frameworks that reason with that uncertainty and safely handle the cases where ML models are uncertain.
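
A minimal sketch of this pattern, under the assumption of a scikit-learn classifier and an application-chosen confidence threshold, is a simple reject option: the system acts on a prediction only when the model’s confidence clears the threshold and otherwise defers, for example to a human reviewer.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.1, random_state=2)
X_train, X_test, y_train, _ = train_test_split(X, y, random_state=2)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

def decide(x, threshold=0.8):
    """Act on the prediction only if the model's confidence clears the threshold."""
    proba = model.predict_proba(x.reshape(1, -1))[0]
    if proba.max() < threshold:
        return "defer to human review"
    return f"act on predicted class {proba.argmax()}"

for sample in X_test[:5]:
    print(decide(sample))
```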

Retrieval augmented generation (RAG). Some studies suggest building in a capacity for the LLM to check the consistency of outputs against sources expected to represent ground truth, such as knowledge bases and certain curated websites like Wikipedia. Retrieval augmented generation (RAG) refers to this idea of using external databases to verify and correct LLM outputs. RAG is a potential mitigation for both evasion attacks and generative AI hallucinations, but it is imperfect because the retrieval results are still processed by the neural network.
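
The sketch below shows the basic RAG pattern under simplifying assumptions: a TF-IDF retriever stands in for a production vector store, and generate is a hypothetical placeholder for a call to an LLM API.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Tiny stand-in corpus for a knowledge base or curated website.
documents = [
    "The Executive Order on AI was issued in October 2023.",
    "Differential privacy adds calibrated noise to protect individual records.",
    "Federated learning trains a shared model without pooling raw data.",
]

def retrieve(question, k=2):
    """Return the k documents most similar to the question (TF-IDF retrieval)."""
    vectorizer = TfidfVectorizer().fit(documents + [question])
    scores = cosine_similarity(vectorizer.transform([question]),
                               vectorizer.transform(documents))[0]
    return [documents[i] for i in scores.argsort()[::-1][:k]]

def generate(prompt):
    """Hypothetical stand-in for a call to an LLM API."""
    return f"[LLM answer grounded in a prompt of {len(prompt)} characters]"

question = "When was the Executive Order on AI issued?"
context = "\n".join(retrieve(question))
print(generate(f"Answer using only this context:\n{context}\n\nQuestion: {question}"))
```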

Representation engineering. Raising the level of abstraction in a white-box analysis can potentially improve understanding of a range of undesirable behaviors in models, including hallucination, biases, and harmful response generation. There are a number of approaches that attempt feature extraction. This form of testing requires white-box access to model internals, but there are preliminary results suggesting that similar effects may be possible in black-box testing scenarios by optimizing prompts that target the same key internal representations. This is a small step toward piercing the veil of opacity that is associated with larger neural-network models. More recent work, under the rubric of automated interpretability, has taken initial steps toward automating an iterative process of experimentation to identify concepts latent in neural networks and then give them names.
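
One elementary form of this feature-extraction idea is a linear probe: record a model’s internal activations for labeled inputs and fit a linear classifier to test whether a concept is linearly decodable. The sketch below uses a toy network and a synthetic "concept" purely to show the mechanics, and it assumes white-box access to the internals.

```python
import torch
import torch.nn as nn
from sklearn.linear_model import LogisticRegression

torch.manual_seed(0)
hidden = nn.Sequential(nn.Linear(20, 64), nn.ReLU())  # stands in for an internal layer

X = torch.randn(500, 20)
concept_labels = (X[:, 0] > 0).long()  # synthetic "concept" for illustration only

with torch.no_grad():
    activations = hidden(X)  # white-box access to internal activations

probe = LogisticRegression(max_iter=1000).fit(activations.numpy(), concept_labels.numpy())
print("probe accuracy on activations:",
      round(probe.score(activations.numpy(), concept_labels.numpy()), 3))
```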

Risks, Part 2: Confidentiality

For modern AI systems, confidentiality risks relate to unintended revelation of training data or architectural features of the neural model. These include so-called “jailbreak” attacks (not to be confused with iOS jailbreaking) that induce LLMs to produce results that cross boundaries set by the LLM designers to prevent certain kinds of dangerous responses—that is, to defy guardrail capabilities that inhibit dissemination of harmful content. (It could, of course, also be argued that these are integrity attacks. Indeed, the statistical derivation of neural-network-based modern AI models makes them resistant to comprehensive technical specification, and so the three CIG categories are not crisply delineated.)

A principal confidentiality risk is privacy breaches. There is a common supposition, for example, that models trained on large corpora of private or sensitive data, such as health or financial records, can be counted on not to reveal that data when they are applied to recognition or classification tasks. This is now understood to be incorrect. Diverse kinds of privacy attacks have been demonstrated, and in many contexts and missions these attacks have security-related significance.

Manual LLM jailbreak and transfer. As noted above, there are methods for developing prompt injection or jailbreak attacks that subvert the LLM guardrails that are typically integrated into LLMs through fine-tuning cycles. Indeed, Carnegie Mellon collaborated in developing a universal attack method that is transferable among LLM models including, very recently, Meta’s Llama generative model. There are also methods for adapting manual jailbreak techniques so they are robust (i.e., applicable across multiple public LLM model APIs and open source LLM models) and often transferable to proprietary-model APIs. Attackers may fine-tune a set of open source models to mimic the behavior of targeted proprietary models and then attempt a black-box transfer using the fine-tuned models. New jailbreak techniques continue to be developed, and they are readily accessible to low-resource attackers. More recent work has evolved the fine-tuning used for the jailbreak into prompts that appear as natural language text. Some of these jailbreak techniques include role assignment, where an LLM is asked to put itself into a certain role, such as a bad actor, and in that guise may reveal information otherwise protected using guardrails.

Model inversion and membership inference. Is it possible for an adversary who has only limited access to a trained ML model (e.g., through a public website or API) to obtain elements of training data by querying the model? Early work identified model inversion attacks that exploit confidence information produced by models. For example: Did a particular respondent to a lifestyle survey admit to cheating on their partner? Or: Is a particular person’s data in a dataset of Alzheimer’s disease patients? It is also possible that an adversary might seek to re-create or reproduce a model that was expensive to create from scratch.
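
The sketch below is a highly simplified, illustrative version of a confidence-thresholding membership inference test, assuming scikit-learn and a synthetic dataset: a model that overfits tends to assign higher confidence to the samples it was trained on than to unseen samples, so thresholding confidence leaks a membership signal.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=1)
X_in, X_out, y_in, _ = train_test_split(X, y, test_size=0.5, random_state=1)

# Deliberately overfit-prone model: fully grown trees memorize much of the training set.
model = RandomForestClassifier(random_state=1).fit(X_in, y_in)

conf_members = model.predict_proba(X_in).max(axis=1)      # samples used in training
conf_nonmembers = model.predict_proba(X_out).max(axis=1)  # samples never seen

threshold = 0.9
print("flagged as members:",
      f"true members {np.mean(conf_members > threshold):.2f},",
      f"non-members {np.mean(conf_nonmembers > threshold):.2f}")
```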

LLM memorization. In contrast with the hallucination problem cited above, memorization of training data takes place when LLM users expect synthesized new results but instead receive an exact reproduction of input data. This overfitting can create unexpected privacy breaches as well as unwanted intellectual property appropriation and copyright violations.

Black-box search. If a proprietary model exposes an API that provides probabilities for a set of potential outputs, then an enhanced black-box discrete search can effectively generate adversarial prompts that bypass training intended to improve alignment. This vulnerability may be accessible to an attacker with no GPU resources who only makes repeated calls to the API to identify successful prompts. Techniques called leakage prompts have also been documented to elicit confidence scores from models whose designers intend for those scores to be protected. These scores also facilitate model inversion, noted above.

Potential Mitigations

Differential privacy. Technical approaches to privacy protection such as differential privacy are forcing AI engineers to weigh tradeoffs between privacy and accuracy. The techniques of differential privacy are one tool in the toolkit of statistically based techniques called privacy-preserving analytics (PPAs), which can be used to safeguard private data while supporting analysis. PPA techniques also include blind signatures, k-anonymity, and federated learning. PPA techniques are a subset of privacy-enhancing technologies (PETs), which also include zero-knowledge (ZK) proofs, homomorphic encryption (HE), and secure multiparty computation (MPC). Experiments are underway that integrate these ideas into LLM models for the purpose of enhancing privacy.

Differential privacy techniques involve perturbation of training data or of the outputs of a model for the purpose of limiting the ability of model users to draw conclusions about particular elements of the training data from observed outputs. This kind of defense has a cost in the accuracy of results, however, and it illustrates a recurring pattern in ML risk mitigation: defensive actions typically interfere with the accuracy of the trained models.
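
The core accuracy-for-privacy tradeoff can be seen in the basic building block of differential privacy, the Laplace mechanism, sketched below for a simple counting query. The data and epsilon values are illustrative placeholders; production systems protecting model training typically use approaches such as differentially private stochastic gradient descent rather than this direct form.

```python
import numpy as np

def laplace_count(data, predicate, epsilon, rng):
    """Counting query with Laplace noise calibrated to sensitivity / epsilon."""
    true_count = sum(predicate(x) for x in data)
    sensitivity = 1.0  # adding or removing one record changes a count by at most 1
    return true_count + rng.laplace(scale=sensitivity / epsilon)

rng = np.random.default_rng(0)
records = rng.integers(18, 90, size=1000)  # placeholder ages

for epsilon in (0.1, 1.0, 10.0):
    noisy = laplace_count(records, lambda age: age > 65, epsilon, rng)
    print(f"epsilon={epsilon:>4}: noisy count {noisy:.1f}")
```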

Unlearning techniques. A number of techniques have been advanced in support of removing the influence of certain training examples that may have harmful content or that might compromise privacy through membership inference. In an effort to accelerate this work, in June 2023 Google initiated a Machine Unlearning Challenge, as did the NeurIPS community. One well-known experiment in the literature involved attempting to get an LLM to unlearn Harry Potter. A year later, researchers concluded that machine unlearning remained challenging for practical use due to the extent to which models became degraded. This degradation is analogous to the effects of differential privacy techniques, as noted above.

Risks, Part 3: Governance and Accountability

Harmful incidents involving modern AI are amply documented through several AI incident repositories. Examples include the AI Incident Database from the Responsible AI Collaborative, the similarly-named AI Incident Database from the Partnership on AI, the Organisation for Economic Co-operation and Development (OECD) AI Incidents Monitor, and the AI, Algorithmic, and Automation Incidents and Controversies (AIAAIC) Repository of incidents and controversies. Success in mitigation requires an awareness of not just the kinds of weaknesses and vulnerabilities noted above, but also of the principles of AI governance, which is the practice by organizations of developing, regulating, and managing accountability of AI-supported operational workflows.

Stakeholders and accountability. Governance can involve an ecosystem that includes AI elements and systems as well as human and organizational stakeholders. These stakeholders are diverse and can include workflow designers, system developers, deployment teams, institutional leadership, end users and decision makers, data providers, operators, legal counsel, and evaluators and auditors. Collectively, they are responsible for decisions related to choices of capabilities assigned to particular AI technologies in a given application context, as well as choices regarding how an AI-based system is integrated into operational workflows and decision-making processes. They are also responsible for architecting models and curating training data, including alignment of training data with intended operational context. And, of course, they are responsible for metrics, risk tradeoffs, and accountability, informed by risk assessment and modeling. Allocating accountability among those involved in the design, development, and use of AI systems is non-trivial. In applied ethics, this is called the problem of many hands. This challenge is amplified by the opacity and inscrutability of modern AI models—often even to their own creators. As Sam Altman, founder of OpenAI, noted, “We certainly have not solved interpretability.” In the context of data science, more broadly, developing effective governance structures that are cognizant of the special features of modern AI is crucial to success.

Pacing. Governance challenges also derive from the speed of technology development. This includes not only core AI technologies, but also ongoing progress in identifying and understanding vulnerabilities and weaknesses. Indeed, this pacing is leading to a continuous escalation of aspirations for operational mission capability.

Business considerations. An additional set of governance complications derives from business considerations including trade secrecy and protection of intellectual property, such as choices regarding model architecture and training data. A consequence is that in many cases, information about models in a supply chain may be deliberately restricted. Importantly, however, many of the attacks noted above can succeed despite these black-box restrictions when attack surfaces are sufficiently exposed. Indeed, one of the conundrums of cyber risk is that, due to trade secrecy, adversaries may know more about the engineering of systems than the organizations that evaluate and operate those systems. This is one of many reasons why open source AI is widely discussed, including by proprietary developers.

Responsible AI. There are many examples of published responsible AI (RAI) guidelines, and certain principles commonly appear in these documents: fairness, accountability, transparency, safety, validity, reliability, security, and privacy. In 2022, the Defense Department published a well-regarded RAI strategy along with an associated toolkit. Many major firms are also developing RAI strategies and guidelines.

There are diverse technical challenges related to governance:

Deepfakes. Because they can operate in multiple modalities (text, images, video, voice, sound, and others), generative AI tools can produce convincing deepfake material online, for example simulacra of newscasts and video recordings. There is considerable research and literature on deepfake detection, as well as on generation augmented by watermarking and other kinds of signatures. ML and generative AI can be used both to generate deepfake outputs and to analyze inputs for deepfake signatures, which places modern AI technology on both sides of the ever-escalating battle between the creation and detection of disinformation.

Overfitting. In ML models, it is possible to train the model in a manner that leads to overfitting, where incremental improvements in the success rate on the training corpus eventually lead to incremental degradation in the quality of results on the testing corpus. The term overfitting derives from the broader context of mathematical modeling, where it describes models that fail to robustly capture the salient characteristics of the data, for example by overcompensating for sampling errors. As noted above, memorization is a form of overfitting. We treat overfitting as a governance risk, since it involves choices made in the design and training of models.
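
A minimal sketch of the phenomenon, using polynomial regression on synthetic data as a stand-in for model capacity or training duration: as capacity grows, the fit to the training set typically keeps improving while the fit to held-out data degrades.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(60, 1))
y = np.sin(3 * X[:, 0]) + rng.normal(scale=0.2, size=60)  # noisy target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

for degree in (1, 5, 25):  # increasing model capacity
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_train, y_train)
    print(f"degree {degree:2d}: train R^2 {model.score(X_train, y_train):.2f}, "
          f"test R^2 {model.score(X_test, y_test):.2f}")
```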

Bias. Bias is often understood to result from a mismatch between training data and operational input data, where the training data are not aligned with chosen application contexts. Bias can also be built into training data even when the input sampling process is intended to be aligned with operational use cases, due to the lack of availability of suitable data. For this reason, bias may be difficult to correct: unbiased training corpora may simply not be available. For example, gender bias has been observed in the word embedding vectors of LLMs, where the vector for female lies closer to nurse while the vector for male lies closer to engineer. The issue of bias in AI system decisions is related to active conversations in industry around fair ranking of results in deployed search and recommender systems.
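
The embedding-distance probe mentioned above can be computed as sketched below. The embeddings dictionary is filled with random placeholder vectors purely to make the example runnable; a real analysis would load vectors from a pretrained word-embedding model, and the similarities printed here carry no meaning.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Placeholder vectors so the example runs; a real probe would load pretrained embeddings.
rng = np.random.default_rng(0)
embeddings = {word: rng.normal(size=50) for word in ("female", "male", "nurse", "engineer")}

for occupation in ("nurse", "engineer"):
    for gendered in ("female", "male"):
        sim = cosine(embeddings[gendered], embeddings[occupation])
        print(f"similarity({gendered}, {occupation}) = {sim:+.3f}")
```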

Toxic text. Generative AI models may be trained on both the best and the worst content of the Internet. Broadly accessible generative AI models may use tools to filter training data, but the filtering may be imperfect. Even when training data are not explicitly toxic, subsequent fine-tuning can enable generation of adverse material (as noted above). It is also important to recognize that there are no universal definitions of toxicity, and its designation is often highly dependent on audience and context; distinctions of use and mention, for example, may bear significantly on decisions regarding appropriateness. Most remedies involve filters on training data, fine-tuning inputs, prompts, and outputs, often complemented by reinforcement learning from human feedback (RLHF). At this point, none of these approaches has been fully successful in eliminating toxicity harms, especially where the harmful signals are covert.
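
As a drastically simplified illustration of an output filter (real systems rely on learned classifiers and RLHF-tuned models rather than keyword lists, and, as noted, none fully eliminates covert toxicity):

```python
BLOCKLIST = {"example_slur", "example_threat"}  # placeholder terms, not a real lexicon

def filter_output(text: str) -> str:
    """Withhold a generated response if it contains any blocked term."""
    lowered = text.lower()
    if any(term in lowered for term in BLOCKLIST):
        return "[response withheld by content filter]"
    return text

print(filter_output("A perfectly benign generated sentence."))
```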

Traditional cyber risks. It is important to note, indeed it cannot be overstated, that traditional cyber attacks involving supply chain modalities are a significant risk with modern ML models. This includes black-box and open source models whose downloads include unwanted payloads, just as other kinds of software downloads can. It also includes risks associated with larger cloud-based models accessed through poorly designed APIs. These are traditional software supply chain risks, but the complexity and opacity of AI models can create advantages for attackers. Examples have been identified, such as on the Hugging Face AI platform, including both altered models and models with cyber vulnerabilities.
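
One piece of basic supply-chain hygiene that carries over directly from traditional software is verifying a downloaded model artifact against a known-good digest before loading it. The sketch below uses placeholder file names and digests; real pipelines would also pin sources, avoid unsafe serialization formats, and verify signatures.

```python
import hashlib
from pathlib import Path

EXPECTED_SHA256 = "0000placeholder0000"  # digest published by the model provider (placeholder)

def verify_artifact(path: Path) -> bool:
    """Compare the artifact's SHA-256 digest against the published value."""
    return hashlib.sha256(path.read_bytes()).hexdigest() == EXPECTED_SHA256

artifact = Path("model.safetensors")  # placeholder file name
if artifact.exists() and verify_artifact(artifact):
    print("digest matches; proceed to load")
else:
    print("missing file or digest mismatch; do not load")
```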

Looking Ahead: AI Risks and Test and Evaluation for AI

The next installment in this series explores how frameworks for AI risk management can be conceptualized following the pattern of cyber risk. This includes some consideration of how we can develop T&E practices appropriate to modern AI, as well as of AI approaches that have potential for verifiable trustworthiness, which are the subject of the fourth installment. We also consider how a practice of AI engineering can help address challenges in the near term and the ways it must incorporate software engineering and cybersecurity considerations.

Additional Resources

Read the first post in this series, Weaknesses and Vulnerabilities in Modern AI: Why Security and Safety Are so Challenging.
