Weaknesses and Vulnerabilities in Modern AI: Why Security and Safety Are so Challenging

In the excitement to create systems that build on modern AI, including neural-network-based machine learning (ML) and generative AI models, it is easy to overlook the weaknesses and vulnerabilities that make these models susceptible to misdirection, confidentiality breaches, and other kinds of failures. Indeed, weaknesses and vulnerabilities in ML and generative AI, including large language models (LLMs), create risks with characteristics that are different from those typically considered in software and cybersecurity analyses, and so they merit special attention in the design and evaluation of AI-based systems and their surrounding workflows. Even developing suitable definitions for safety and security that can guide design and evaluation is a significant challenge for AI-based systems. This challenge is amplified when we consider roles for modern AI in critical application domains where there will be mission-focused criteria related to effectiveness, safety, security, and resiliency, such as articulated in the NIST AI Risk Management Framework (RMF).

This is the first part of a four-part series of blog posts focused on AI for critical systems where trustworthiness—based on checkable evidence—is essential for operational acceptance. The four parts are relatively independent of each other, and address this challenge in stages:

Part 1: What are appropriate concepts of security and safety for modern neural-network-based AI, including ML and generative AI, such as LLMs? What are the AI-specific challenges in developing safe and secure systems? What are the limits to trustworthiness with modern AI, and why are these limits fundamental?
Part 2: What are examples of the kinds of risks specific to modern AI, including risks associated with confidentiality, integrity, and governance (the CIG framework), with and without adversaries? What are the attack surfaces, and what kinds of mitigations are currently being developed and employed for these weaknesses and vulnerabilities?
Part 3: How can we conceptualize test and evaluation (T&E) practices appropriate to modern AI? How, more generally, can frameworks for risk management (RMFs) be conceptualized for modern AI analogous to cyber risk? How can a practice of AI engineering address challenges in the near term, and how does it link in software engineering and cybersecurity considerations (noting that these are the three principal areas of competency at the SEI)?
Part 4: What are the benefits of looking beyond the purely neural network models of modern AI towards hybrid approaches? What are current examples that illustrate the potential benefits, and how, looking ahead, can these approaches advance us beyond the fundamental limits of modern AI? What are the prospects in the near and longer term?

A Taxonomy of Risks

This post focuses on security and safety in the context of AI applied to the development of critical systems, leading to an examination of specific examples of weaknesses and vulnerabilities in modern AI. We organize these following a taxonomy analogous to the confidentiality, integrity, and availability (CIA) attributes familiar in the context of cyber risks:

Integrity risks—Results from an AI model are incorrect, either unintentionally or through deliberate manipulation by adversaries.
Confidentiality risks—Results from an AI model reveal elements of input data that designers had intended to keep confidential.
Governance risks—Results from an AI model, or the usage of that model in a system, may have adverse impacts in the context of applications—often even when model results are correct with respect to training.

We recognize that risk management for AI encompasses modeling and assessment at three levels: (1) the core AI capabilities of individual neural network models, (2) choices made in how those core capabilities are incorporated in the engineering of AI-based systems and, importantly, (3) how those systems are integrated into application-focused operational workflows. These workflows can include both autonomous applications and those that have roles for human action-takers. This broad scoping is important because modern AI can lead not only to significant increases in productivity and mission effectiveness within established organizational frameworks but also to new capabilities based on transformative restructurings of mission- and operations-focused workplace activity.

Considerations Particular to Modern AI

The stochastically derived nature of modern AI models, combined with a near opacity with respect to interrogation and analysis, makes them difficult to specify, test, analyze, and monitor. What we perceive as similarity among inputs to a model does not necessarily correspond with closeness in the way the model responds. That is, in training, distinctions can be made based on details we see as accidental. A famous example is a wolf being distinguished from other dogs not because of morphology, but because there is snow in the background, as revealed by saliency maps. The metrology of modern AI, in other words, is barely nascent. Leading AI researchers acknowledge this. (A recent NeurIPS Test of Time award presentation, for example, describes the alchemy of ML.) The history of vehicle autonomy reflects this, where the combination of poor evaluation capabilities and strong business imperatives has led to entire fleets being approved and subsequently withdrawn from use due to unexpected behaviors. In commercial applications, bias has been reported in predictive algorithms for credit underwriting, recruiting, and health claims processing. These are all reasons why adversarial ML is so readily possible.

Mission Perspective

Modern AI models, trained on data, are most often included as subordinate components or services within mission systems, and, as noted, these systems are constituents of operational workflows supporting an application within a mission context. The scope of consideration in measurement and evaluation must consequently encompass all three levels: component, system, and workflow. Issues of bias, for example, can be a result of a mismatch of the scope of the data used to train the model with the reality of inputs within the mission scope of the application. This means that, in the context of T&E, it is essential to characterize and assess at the three levels of consideration noted earlier: (1) the characteristics of embedded AI capabilities, (2) the way those capabilities are used in AI-based systems, and (3) how those systems are intended to be integrated into operational workflows. The UK National Cyber Center has issued guidelines for secure AI system development that focus on security in design, development, deployment, and operation and maintenance.

Conflation of Code and Data

Modern AI technology is not like traditional software: The traditional separation between code and data, which is central to reasoning about software security, is absent from AI models, and, instead, all processed data can act as instructions to an AI model, analogous to code injection in software security. Indeed, the often hundreds of billions of parameters that control the behavior of AI models are derived from training data but in a form that is generally opaque to analysis. The current best practice of instilling this separation, for example by fine tuning in LLMs for alignment, has proved inadequate in the presence of adversaries. These AI systems can be controlled by maliciously crafted inputs. Indeed, safety guardrails for an LLM can be “jailbroken” after just 10 fine-tuning examples.

Unfortunately, developers do not have a rigorous way to patch these vulnerabilities, much less reliably identify them, so it is crucial to measure the effectiveness of systems-level and operational-level best-effort safeguards. The practice of AI engineering, discussed in the third post in this series, offers design considerations for systems and workflows to mitigate these difficulties. This practice is analogous to the engineering of highly reliable systems that are constructed from unavoidably less reliable components, but the AI-focused patterns of engineering are very different from traditional fault-tolerant design methodologies. Much of the traditional practice of fault-tolerant design builds on assumptions of statistical independence among faults (i.e., transient, intermittent, permanent) and typically employs redundancy in system elements to reduce probabilities as well as internal checking to catch errors before they propagate into failures, to reduce consequences or hazards.

The Importance of Human-system Interaction Design

Many familiar use cases involve AI-based systems serving entirely in support or advisory roles with respect to human members of an operational team. Radiologists, pathologists, fraud detection teams, and imagery analysts, for example, have long relied on AI assistance. There are other use cases where AI-based systems operate semi-autonomously (e.g., screening job applicants). These patterns of human interaction can introduce unique risks (e.g., the applicant-screening system may be more autonomous with regard to rejections, even as it remains more advisory with regard to acceptances). In other words, there is a spectrum of degrees of shared control, and the nature of that sharing must itself be a focus of the risk assessment process. A risk-informed intervention might involve humans evaluating proposed rejections and acceptances as well as employing a monitoring scheme to enhance accountability and provide feedback to the system and its designers.

Another element of human-system interaction relates to a human weakness rather than a system weakness, which is our natural tendency to anthropomorphize on the basis of the use of human language and voice. An early and well-known example is the Eliza program written in the 1960s by Joseph Weizenbaum at MIT. Roughly speaking, Eliza “conversed” with its human user using typed-in text. Eliza’s 10 pages of code mainly did just three things: respond in patterned ways to a few trigger words, occasionally reflect past inputs back to a user, and turn pronouns around. Eliza thus seemed to understand, and people spent hours conversing with it despite the extreme simplicity of its operation. More recent examples are Siri and Alexa, which—despite human names and friendly voices—are primarily pattern-matching gateways to web search. We nonetheless impute personality characteristics and grant them gendered pronouns. The point is that humans tend to confer meanings and depth of understanding to texts, whereas LLM texts are a sequence of statistically derived next-word predictions.

Attack Surfaces and Analyses

Another set of challenges in developing safe and secure AI-based systems is the rich and diverse set of attack surfaces associated with modern AI models. The exposure of these attack surfaces to adversaries is determined by choices in AI engineering as well as in the crafting of human-AI interactions and, more generally, in the design of operational workflows. In this context, we define AI engineering as the practice of architecting, designing, developing, testing, and evaluating not just AI components, but also the systems that contain them and the workflows that embed the AI capabilities in mission operations.

Depending on the application of AI-based systems—and how they are engineered—adversarial actions can come as direct inputs from malicious users, but also in the form of training cases and retrieval augmentations (e.g., uploaded files, retrieved websites, or responses from a plugin or subordinate tool such as web search). They can also be provided as part of the user’s query as data not meant to be interpreted as an instruction (e.g., a document given by the user for the model to summarize). These attack surfaces are, arguably, similar to other kinds of cyber exposures. With modern AI, the difference is that it is more difficult to predict the impact of small changes in inputs—through any of these attack surfaces—on outcomes. There is the familiar cyber asymmetry—adjusted for the peculiarities of neural-network models—in that defenders seek comprehensive predictability across the entire input domain, whereas an adversary needs predictability only for small segments of the input domain. With adversarial ML, that particular predictability is more readily feasible, conferring advantage to attackers. Ironically, this feasibility of successful attacks on models is achieved through the use of other ML models constructed for the purpose.

There are also ample opportunities for supply chain attacks exploiting the sensitivity of model training outcomes to choices made in the curation of data in the training process. The robustness of a model and its associated safeguards must be measured with regard to each of several types of attack. Each of these attack types calls for new methods for analysis, testing, and metrology generally. A key challenge is how to design evaluation schemes that are broadly encompassing in relation to the (rapidly evolving) state of the art in what is known about attack methods, examples of which are summarized below. Comprehensiveness in this sense is likely to remain elusive, since new vulnerabilities, weaknesses, and attack vectors continue to be discovered.

Innovation Tempo

Mission concepts are often in a state of rapid evolution, driven by changes both in the strategic operating environment and in the development of new technologies, including AI algorithms and their computing infrastructures, but also sensors, communications, etc. This evolution creates additional challenges in the form of ongoing pressure to update algorithms, computing infrastructure, corpora of training data, and other technical elements of AI capabilities. Rapidly evolving mission concepts also drive a move-to-the-left approach for test and evaluation, where development stakeholders are engaged earlier on in the process timeline (hence “move to the left”) and in an ongoing manner. This enables system designs to be selected to enhance testability and for engineering processes and tools to be configured to produce not just deployable models but also associated bodies of evidence intended to support an ongoing process of affordable and confident test and evaluation as systems evolve. Earlier engagement in the system lifecycle with T&E activity in defense systems engineering has been advocated for more than a decade.

Looking Ahead with Core AI

From the standpoint of designing, developing, and operating AI-based systems, the inventory of weaknesses and vulnerabilities is daunting, but even more so is the current state of mitigations. There are few cures, aside from careful attention to AI engineering practices and judicious choices to constrain operational scope. It is important to note, however, that the evolution of AI is continuing, and that there are many hybrid AI approaches that are emerging in specific application areas. These approaches create the possibility of core AI capabilities that can offer an intrinsic and verifiable trustworthiness with respect to particular categories of technical risks. This is significant because intrinsic trustworthiness is in general not possible with pure neural-network-based modern AI. We elaborate on these possibly controversial points in part 4 of this series where we examine benefits beyond the purely neural-network models of modern AI towards hybrid approaches.

A great strength of modern AI based on neural networks is exceptional heuristic capability, but, as noted, confident T&E is difficult because the models are statistical in nature, fundamentally inexact, and generally opaque to analysis. Symbolic reasoning systems, on the other hand, offer greater transparency, explicit repeatable reasoning, and the potential to manifest domain expertise in a checkable manner. But they are generally weak on heuristic capability and are sometimes perceived to lack flexibility and scalability.

Combining Statistical Models

A number of research teams have recognized this complementarity and successfully combined multiple statistical approaches for advanced heuristic applications. Examples include combining ML with game theory and optimization to support applications involving multi-adversary strategy, with multi-player poker and anti-poaching ranger tactics as exemplars. There are also now undergraduate course offerings on this topic. Physics Informed Neural Networks (PINNs) are another kind of heuristic hybrid, where partial differential equation models influence the mechanism of the neural-network learning process.

Symbolic-statistical Hybrids

Other teams have hybridized statistical and symbolic approaches to enable development of systems that can reliably plan and reason, and to do so while benefiting from modern AI as a sometimes-unreliable heuristic oracle. These systems tend to target specific application domains, including those where expertise needs to be made reliably manifest. Note that these symbolic-dominant systems are fundamentally different from the use of plug-ins in LLMs. Hybrid approaches to AI are routine for robotic systems, speech understanding, and game-playing. AlphaGo, for example, makes use of a hybrid of ML with search structures.

Symbolic hybrids where LLMs are subordinate are starting to benefit some areas of software development, including defect repair and program verification. It is important to note that modern symbolic AI has broken many of the scaling barriers that have, since the 1990s, been perceived as fundamental. This is evident from multiple examples in leading industry practice including the Google Knowledge Graph, which is heuristically informed but human-checkable; the verification of security properties at Amazon AWS using scaled-up theorem proving techniques; and, in academic research, a symbolic/heuristic combination has been used to develop mathematical proofs for long-standing open mathematical problems. These examples give a hint that similar hybrid approaches could deliver a level of trustworthiness for many other applications domains where trustworthiness is important. Advancing from these specific examples to more general-purpose trustworthy AI is a significant research challenge. These challenges are considered in greater depth in Part 4 of this blog.

Looking Ahead: Three Categories of Vulnerabilities and Weaknesses in Modern AI

The second part of this blog highlights specific examples of vulnerabilities and weaknesses for modern, neural-net AI systems including ML, generative AI, and LLMs. These risks are organized into categories of confidentiality, integrity, and governance, which we call the CIG model. The third post in this series focuses more closely on how to conceptualize AI-related risks, and the fourth and last part takes a more speculative look at possibilities for symbolic-dominant systems in support of critical applications such as faster-than-thought autonomy where trustworthiness and resiliency are essential.

Software Engineering Institute

SEI Blog