
Weaknesses and Vulnerabilities in Modern AI: AI Risk, Cyber Risk, and Planning for Test and Evaluation

Bill Scherlis

Modern artificial intelligence (AI) systems pose new kinds of risks, many of which are both consequential and not well understood. Despite this, AI-based systems are being accelerated into deployment, creating great urgency to develop effective test and evaluation (T&E) practices for them.

This blog post explores potential strategies for framing T&E practices on the basis of a holistic approach to AI risk. In developing such an approach, it is instructive to build on lessons learned in the decades of struggle to develop analogous practices for modeling and assessing cyber risk. Cyber risk assessments are imperfect and continue to evolve, but they provide significant benefit nonetheless. They are strongly advocated by the Cybersecurity and Infrastructure Security Agency (CISA), and the costs and benefits of various approaches are much discussed in the business media. About 70% of internal audits for large firms include cyber risk assessments, as do mandated stress tests for banks.

Risk modeling and assessments for AI are less well understood from both technical and legal perspectives, but there is urgent demand from both enterprise adopters and vendor providers nonetheless. The industry-led Coalition for Secure AI launched in July 2024 to help advance industry norms around enhancing the security of modern AI implementations. The NIST AI Risk Management Framework (RMF) is leading to proposed practices. Methodologies based on the framework are still a work in progress, with uncertain costs and benefits, and so AI risk assessments are less often applied than cyber risk assessments.

Risk modeling and assessment are important not only in guiding T&E, but also in informing engineering practices, as we are seeing with cybersecurity engineering and in the emerging practice of AI engineering. AI engineering, importantly, encompasses not just individual AI elements in systems but also the overall design of resilient AI-based systems, along with the workflows and human interactions that enable operational tasks.

AI risk modeling, even in its current nascent stage, can have beneficial influence in both T&E and AI engineering practices, ranging from overall design choices to specific risk mitigation steps. AI-related weaknesses and vulnerabilities have unique characteristics (see examples in the prior blog posts), but they also overlap with cyber risks. AI system elements are software components, after all, so they often have vulnerabilities unrelated to their AI functionality. However, their unique and often opaque features, both within the models and in the surrounding software structures, can make them especially attractive to cyber adversaries.

This is the third installment in a four-part series of blog posts focused on AI for critical systems where trustworthiness—based on checkable evidence—is essential for operational acceptance. The four parts are relatively independent of each other and address this challenge in stages:

  • Part 1: What are appropriate concepts of security and safety for modern neural-network-based AI, including machine learning (ML) and generative AI, such as large language models (LLMs)? What are the AI-specific challenges in developing safe and secure systems? What are the limits to trustworthiness with modern AI, and why are these limits fundamental?
  • Part 2: What are examples of the kinds of risks specific to modern AI, including risks associated with confidentiality, integrity, and governance (the CIG framework), with and without adversaries? What are the attack surfaces, and what kinds of mitigations are currently being developed and employed for these weaknesses and vulnerabilities?
  • Part 3 (this part): How can we conceptualize T&E practices appropriate to modern AI? How, more generally, can frameworks for risk management (RMFs) be conceptualized for modern AI analogous to those for cyber risk? How can a practice of AI engineering address challenges in the near term, and how does it interact with software engineering and cybersecurity considerations?
  • Part 4: What are the benefits of looking beyond the purely neural-network models of modern AI towards hybrid approaches? What are current examples that illustrate the potential benefits, and how, looking ahead, can these approaches advance us beyond the fundamental limits of modern AI? What are prospects in the near and longer terms for hybrid AI approaches that are verifiably trustworthy and that can support highly critical applications?

Assessments for Functional and Quality Attributes

Functional and quality assessments help us gain confidence that systems will perform tasks correctly and reliably. Correctness and reliability are not absolute concepts, however. They must be framed in the context of intended purposes for a component or system, including operational limits that must be respected. Expressions of intent necessarily encompass both functionality—what the system is intended to accomplish—and system qualities—how the system is intended to operate, including security and reliability attributes. These expressions of intent, or systems specifications, may be scoped for both the system and its role in operations, including expectations regarding stressors such as adversary threats.

Modern AI-based systems pose significant technical challenges in all these aspects, ranging from expressing specifications to acceptance evaluation and operational monitoring. What does it mean, for example, to specify intent for a trained ML neural network, beyond inventorying the training and testing data?

We must consider, in other words, the behavior of a system or an associated workflow under both expected and unexpected inputs, where those inputs may be particularly problematic for the system. It is challenging, however, even to frame the question of how to specify behaviors for expected inputs that are not exactly matched in the training set. A human observer may have an intuitive notion of similarity of new inputs with training inputs, but there is no assurance that this aligns with the actual featuring—the salient parameter values—internal to a trained neural network.
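
To make this concrete, one common surrogate for input similarity is distance in an embedding space: flag new inputs whose nearest training neighbor is unusually far away. The sketch below is a minimal illustration of that idea, assuming embeddings from some feature extractor are already available (the arrays and the quantile threshold are illustrative assumptions, not a standard); crucially, it measures similarity only in the chosen embedding space, which may or may not align with the features the trained network actually relies on.

```python
import numpy as np

def flag_dissimilar_inputs(train_embeddings, new_embeddings, quantile=0.99):
    """Flag new inputs whose nearest-neighbor distance to the training set
    exceeds what is typical within the training set itself.

    This is only a surrogate for similarity: distances are computed in a
    chosen embedding space, which may not match the internal featuring of
    the trained network.
    """
    # Pairwise distances among training embeddings (leave-one-out nearest neighbor).
    train_d = np.linalg.norm(
        train_embeddings[:, None, :] - train_embeddings[None, :, :], axis=-1)
    np.fill_diagonal(train_d, np.inf)
    threshold = np.quantile(train_d.min(axis=1), quantile)

    # Distance from each new input to its nearest training embedding.
    new_d = np.linalg.norm(
        new_embeddings[:, None, :] - train_embeddings[None, :, :], axis=-1).min(axis=1)
    return new_d > threshold
```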

We must, additionally, consider assessments from a cybersecurity perspective. An informed and motivated attacker may deliberately manipulate operational inputs, training data, and other aspects of the system development process to create circumstances that impair correct operation of a system or its use within a workflow. In both cases, the absence of traditional specifications muddies the notion of “correct” behavior, further complicating the development of effective and affordable practices for AI T&E. This specification difficulty suggests another commonality with cyber risk: side channels, which are potential attack surfaces that arise incidentally from implementation choices and may not be part of any specification.

Three Dimensions of Cyber Risk

The alignment of these emerging requirements for AI-focused T&E with techniques for cybersecurity evaluation is evident when comparing NIST’s AI risk management playbook with the more mature NIST Cybersecurity Framework, which encompasses a huge diversity of methods. At the risk of oversimplification, we can usefully frame these methods in the context of three dimensions of cyber risk.

  • Threat concerns the potential access and actions of adversaries against the system and its broader operational ecosystem.
  • Consequence relates to the magnitude of impact on an organization or mission should an attack on a system be successful.
  • Vulnerability relates to intrinsic design weaknesses and flaws in the implementation of a system.

Both threat and consequence depend closely on the operational context in which a system is used, and they are largely extrinsic to the system itself. Vulnerability, by contrast, is characteristic of the system, including its architecture and implementation. The modeling of attack surface—apertures into a system that are exposed to adversary actions—encompasses threat and vulnerability, because access to vulnerabilities is a consequence of the operational environment. It is a particularly useful element of cyber risk analysis.
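
As a toy illustration of how these three dimensions might be recorded together, consider a minimal qualitative risk register. The entries, ordinal levels, and scoring rule below are purely illustrative assumptions, not an established methodology:

```python
from dataclasses import dataclass

LEVELS = {"low": 1, "medium": 2, "high": 3}

@dataclass
class RiskEntry:
    name: str            # short description of the risk scenario
    threat: str          # adversary access and intent (largely extrinsic)
    consequence: str     # mission impact if the attack succeeds (extrinsic)
    vulnerability: str   # intrinsic weakness in design or implementation

    def priority(self) -> int:
        # Ordinal triage only: these dimensions are not probabilities,
        # so this score is a ranking aid, not an actuarial estimate.
        return LEVELS[self.threat] * LEVELS[self.consequence] * LEVELS[self.vulnerability]

register = [
    RiskEntry("poisoned training data in a shared public corpus", "medium", "high", "high"),
    RiskEntry("prompt injection via retrieved documents", "high", "medium", "high"),
]
for entry in sorted(register, key=lambda e: e.priority(), reverse=True):
    print(entry.priority(), entry.name)
```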

Cyber risk modeling is unlike traditional probabilistic actuarial risk modeling, primarily because each of the three dimensions is generally nonstochastic, especially when threats and missions are consequential. Threat, for example, is driven by the operational significance of the system and its workflow, as well as by potential adversary intents and the state of their knowledge. Consequence, similarly, is determined by choices regarding the placement of a system in operational workflows. Adjustments to workflows—and to human roles—are thus a mitigation strategy for the consequence dimension of risk. Risks can be elevated when there are hidden correlations: for cyber risk, these could include common elements with common vulnerabilities buried in supply chains; for AI risk, common sources within large bodies of training data. These correlations are part of the reason why some attacks on LLMs are portable across models and providers.

CISA, MITRE, OWASP, and others offer convenient inventories of cyber weaknesses and vulnerabilities. OWASP, CISA, and the Software Engineering Institute also provide inventories of safe practices. Many of the commonly used evaluation criteria derive, in a bottom-up manner, from these inventories. For weaknesses and vulnerabilities at a coding level, software development environments, automated tools, and continuous-integration/continuous-delivery (CI/CD) workflows often include analysis capabilities that can detect insecure coding as developers type it or compile it into executable components. Because of this fast feedback, these tools can enhance productivity. There are many examples of standalone tools, such as from Veracode, Sonatype, and Synopsys.

Importantly, cyber risk is just one element in the overall evaluation of a system’s fitness for use, whether or not it is AI-based. For many integrated hardware-software systems, acceptance evaluation will also include, for example, traditional probabilistic reliability analyses that model (1) kinds of physical faults (intermittent, transient, permanent), (2) how those faults can trigger internal errors in a system, (3) how the errors may propagate into various kinds of system-level failures, and (4) what kinds of hazards or harms (to safety, security, effective operation) could result in operational workflows. This latter approach to reliability has a long history, going back to John von Neumann’s work in the 1950s on the synthesis of reliable mechanisms from unreliable components. Interestingly, von Neumann cites research in probabilistic logics that derive from models developed by McCulloch and Pitts, whose neural-net models from the 1940s are precursors of the neural-network designs central to modern AI.
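
As a concrete reminder of how this classic line of analysis works, the following minimal sketch computes the reliability of a triple-modular-redundancy arrangement in the spirit of von Neumann's reliable-mechanisms-from-unreliable-components result. It assumes three independent replicas and a perfect majority voter, both simplifying assumptions:

```python
def tmr_reliability(r: float) -> float:
    """Reliability of a majority-vote (2-out-of-3) arrangement of three
    independent components, each with reliability r, assuming the voter
    itself never fails: 3*r^2*(1 - r) + r^3 = 3r^2 - 2r^3."""
    return 3 * r**2 - 2 * r**3

# Majority voting helps only when each component is better than a coin flip (r > 0.5).
for r in (0.50, 0.90, 0.99):
    print(f"component reliability {r:.2f} -> voted system {tmr_reliability(r):.4f}")
```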

Applying These Ideas to Framing AI Risk

Framing AI risk can be considered as an analog to framing cyber risk, despite major technical differences in all three aspects—threat, consequence, and vulnerability. When adversaries are in the picture, AI consequences can include misdirection, unfairness and bias, reasoning failures, and so on. AI threats can include tampering with training data, patch attacks on inputs, prompt and fine-tuning attacks, and so on. Vulnerabilities and weaknesses, such as those inventoried in the CIG categories (see Part 2), generally derive from the intrinsic limitations of the architecture and training of neural networks as statistically derived models. Even in the absence of adversaries, there are a variety of consequences that can arise due to the particular weaknesses intrinsic to neural-network models.

From the perspective of traditional risk modeling, there is also the difficulty, as noted above, of unexpected correlations across models and platforms. For example, there can be similar consequences due to diversely sourced LLMs sharing foundation models or just having substantial overlap in training data. These unexpected correlations can thwart attempts to apply techniques such as diversity by design as a means to improve overall system reliability.

We must also consider the specific attribute of system resilience. Resilience is the capacity of a system that has sustained an attack or a failure to nonetheless continue to operate safely, though perhaps in a degraded manner. This characteristic is sometimes called graceful degradation or the ability to operate through attacks and failures. In general, it is extremely challenging, and often infeasible, to add resilience to an existing system, because resilience is an emergent property that follows from system-level architectural decisions. The architectural goal is to reduce the potential for internal errors—triggered by internal faults, compromises, or inherent ML weaknesses—to cause system failures with costly consequences. Traditional fault-tolerant engineering is an example of design for resilience. Resilience is a consideration for both cyber risk and AI risk. In the case of AI engineering, resilience can be enhanced through system-level and workflow-level design decisions that, for example, limit exposure of vulnerable internal attack surfaces, such as ML inputs, to potential adversaries. Such designs can include imposing active checks on the inputs and outputs of the neural-network models within a system.
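
A minimal sketch of the kind of active checking mentioned above wraps a model with input and output guards. All names here (model_fn, validate_input, validate_output, fallback) are hypothetical placeholders for whatever checks and safe behaviors a particular system requires:

```python
def guarded_inference(model_fn, validate_input, validate_output, fallback):
    """Wrap a neural-network model with explicit input and output checks so
    that out-of-policy inputs or implausible outputs are diverted to a safe
    fallback instead of propagating into the rest of the workflow."""
    def run(x):
        if not validate_input(x):        # e.g., schema, range, or dissimilarity checks
            return fallback(x, reason="input rejected")
        y = model_fn(x)
        if not validate_output(x, y):    # e.g., plausibility or policy checks
            return fallback(x, reason="output rejected")
        return y
    return run
```

The point of placing such guards at the architectural level is that they constrain the consequences of model weaknesses without requiring any claims about the internal behavior of the model itself.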

As noted in Part 2 of this blog series, an additional challenge to AI resilience is the difficulty (or perhaps impossibility) of unlearning training data. If it is discovered that a subset of training data has been used to insert a vulnerability or back door into an AI system, it is difficult to remove that trained behavior from the model. In practice, this remains difficult and can necessitate retraining without the malicious data. A related issue is the opposite phenomenon of unwanted unlearning—called catastrophic forgetting—which refers to new training data unintentionally impairing the quality of predictions based on previous training data.
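
In the absence of reliable unlearning techniques, the fallback is blunt: remove the flagged examples and retrain from scratch. A minimal sketch, where train_fn, dataset, and flagged_indices are hypothetical placeholders:

```python
def retrain_without_flagged_data(train_fn, dataset, flagged_indices):
    """Drop training examples identified as malicious (or otherwise unwanted)
    and retrain from scratch, since selectively removing their influence from
    an already-trained model remains an open problem."""
    flagged = set(flagged_indices)
    cleaned = [example for i, example in enumerate(dataset) if i not in flagged]
    return train_fn(cleaned)
```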

Industry Concerns and Responses Regarding AI Risk

There is a broad recognition among mission stakeholders and firms of the dimensionality and difficulty of framing and evaluating AI risk, despite rapid growth in AI-related business activities. Researchers at Stanford University produced a 500-page comprehensive business and technical analysis of AI-related activities that states that funding for generative AI alone reached $25.2 billion in 2023. This is juxtaposed against a seemingly endless inventory of new kinds of risks associated with ML and generative AI. Illustrative of this is a joint study by the MIT Sloan Management Review and the Boston Consulting Group that indicates that firms are having to expand organizational risk management capabilities to address AI-related risks, and that this situation is likely to persist due to the pace of technological advance. A separate survey indicated that only 9 percent of firms said they were prepared to handle the risks. There are proposals to advance mandatory assessments to assure guardrails are in place. This is stimulating the service sector to respond, with independent estimates of a market for AI model risk management worth $10.5 billion by 2029.

Enhancing Risk Management within AI Engineering Practice

As the community advances risk management practices for AI, it is important to take into account both the diverse aspects of risk, as illustrated in the previous post of this series, and the feasibility of the different approaches to mitigation. It is not a straightforward process: Evaluations need to be done at multiple levels of abstraction and structure as well as at multiple stages in the lifecycles of mission planning, architecture design, systems engineering, deployment, and evolution. The many levels of abstraction can make this process difficult. At the highest level are workflows, human-interaction designs, and system architectural designs. Choices made regarding each of these aspects influence the risk elements: attractiveness to threat actors, the nature and extent of consequences of potential failures, and the potential for vulnerabilities due to design decisions. Then there are the architecting and training of individual neural-network models, the fine-tuning and prompting of generative models, and the potential exposure of attack surfaces of these models. Below this are, for example, the specific mathematical algorithms and individual lines of code. Finally, when attack surfaces are exposed, there can be risks associated with choices in the supporting computing firmware and hardware.

Although NIST has taken initial steps toward codifying frameworks and playbooks, many challenges remain in developing common elements of AI engineering practice—design, implementation, T&E, evolution—that could mature into beneficial norms and achieve wide adoption, driven by validated and usable metrics for return on effort. Arguably, there is a good opportunity now, while AI engineering practices are still nascent, to quickly develop an integrated, full-lifecycle approach that couples system design and implementation with a shift-left T&E practice supported by evidence production. This contrasts with the practice of secure coding, which arrived late in the broader software development community. Secure coding has led to effective analyses and tools and, indeed, to many features of modern memory-safe languages. These are great benefits, but secure coding’s late arrival has the unfortunate consequence of an enormous legacy of unsafe and often vulnerable code that may be too burdensome to update.

Importantly, the persistent difficulty of directly assessing the security of a body of code hinders not just the adoption of best practices but also the creation of incentives for their use. Developers and evaluators make decisions based on their practical experience, for example, recognizing that guided fuzzing correlates with improved security. In many of these cases, the most feasible approaches to assessment relate not to the actual degree of security of a code base but rather to the extent of compliance with a process of applying various design and development techniques. Actual outcomes remain difficult to assess in current practice. As a consequence, adherence to codified practices such as the secure development lifecycle (SDL) and compliance with the Federal Information Security Modernization Act (FISMA) have become essential to cyber risk management.

Adoption can also be driven by incentives that are unrelated but aligned. For example, there are clever designs for languages and tools that enhance security but whose adoption is driven by developers’ interest in improving productivity, without extensive training or preliminary setup. One example from web development is the open source TypeScript language as a safe alternative to JavaScript. TypeScript is nearly identical in syntax and execution performance, but it also supports static checking, which can flag errors almost immediately as developers type code, rather than letting them surface much later when the code is executing, perhaps in operations. Developers may thus adopt TypeScript on the basis of productivity, with security benefits along for the ride.

Potential positive alignment of incentives will be important for AI engineering, given the difficulty of developing metrics for many aspects of AI risk. It is challenging to develop direct measures for general cases, so we must also develop useful surrogates and best practices derived from experience. Surrogates can include degree of adherence to engineering best practices, careful training strategies, tests and analyses, choices of tools, and so on. Importantly, these engineering techniques include development and evaluation of architecture and design patterns that enable creation of more trustworthy systems from less trustworthy elements.

The cyber risk realm offers a hybrid approach of surrogacy and selective direct measurement via the National Information Assurance Partnership (NIAP) Common Criteria: Designs are evaluated in depth, but direct assays on lower-level code are done by sampling, not comprehensively. Another example is the more broadly scoped Building Security In Maturity Model (BSIMM) project, which includes a process of ongoing enhancement to its norms of practice. Of course, any use of surrogates must be accompanied by aggressive research both to continually assess validity and to develop direct measures.

Evaluation Practices: Looking Ahead

Lessons for AI Red Teaming from Cyber Red Teaming

The October 2023 Executive Order 14110 on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence highlights the use of red teaming for AI risk evaluation. In the military context, a typical approach is to use red teams in a capstone training engagement to simulate highly capable adversaries. In the context of cyber risks or AI risks, however, red teams will often engage throughout a system lifecycle, from initial mission scoping, concept exploration, and architectural design through to engineering, operations, and evolution.

A key question is how to achieve this kind of integration when expertise is a scarce resource. One of the lessons of cyber red teaming is that it is better to integrate security expertise into development teams—even on a part-time or rotating basis—than to mandate attention to security issues. Studies suggest that this can be effective when there are cross-team security experts directly collaborating with development teams.

For AI red teams, this suggests that larger organizations could maintain a cross-team body of experts who understand the inventory of potential weaknesses and vulnerabilities and the state of play regarding measures, mitigations, tools, and associated practices. These experts would be temporarily integrated into agile teams so they could influence operational choices and engineering decisions. Their goals are both to maximize benefits from the use of AI and to minimize risks by making choices that support confident T&E outcomes.

There may be lessons for the Department of Defense, which faces particular challenges in integrating AI risk management practices into the systems engineering culture, as noted by the Congressional Research Service.

AI red teams and cyber red teams both address the risks and challenges posed by adversaries. AI red teams must also address risks associated with AI-specific weaknesses, including all three CIG categories of weaknesses and vulnerabilities: confidentiality, integrity, and governance. Red team success will depend on full awareness of all dimensions of risk as well as access to appropriate tools and capabilities to support effective and affordable assessments.

At the current stage of development, there is not yet a standardized practice for AI red teams. Tools, training, and activities have not been fully defined or operationalized. Indeed, it can be argued that the authors of Executive Order 14110 were wise not to await technical clarity before issuing the EO! Defining AI red team concepts of operation is an enormous, long-term challenge that combines technical, training, operational, policy, market, and many other aspects, and it is likely to evolve rapidly as the technology evolves. The NIST RMF is an important first step in framing this dimensionality.

Potential Practices for AI Risk

A broad diversity of technical practices is needed for the AI red team toolkit. Analogously with security and quality evaluations, AI stakeholders can expect to rely on a mix of process compliance and product examination. They can also be presented with diverse kinds of evidence ranging from full transparency with detailed technical analyses to self-attestation by providers, with choices complicated by business considerations relating to intellectual property and liability. This extends to supply chain management for integrated systems, where there may be varying levels of transparency. Liability is a changing landscape for cybersecurity and, we can expect, also for AI.

Process compliance for AI risk can relate, for example, to adherence to AI engineering practices. These practices can range from design-level evaluations of how AI models are encapsulated within a systems architecture to compliance with best practices for data handling and training. They can also include use of mechanisms for monitoring behaviors of both systems and human operators during operations. We note that process-focused regimes in cyber risk, such as the highly mature body of work from NIST, can involve hundreds of criteria that may be applied in the development and evaluation of a system. Systems designers and evaluators must select and prioritize among the many criteria to develop aligned mission assurance strategies.

We can expect that with a maturing of techniques for AI capability development and AI engineering, proactive practices will emerge that, when adopted, tend to result in AI-based operational capabilities that minimize key risk attributes. Direct analysis and testing can be complex and costly, so there can be real benefits to using validated process-compliance surrogates. But this can be challenging in the context of AI risks. For example, as noted in Part 1 of this series, notions of test coverage and input similarity criteria familiar to software developers do not transfer well to neural-network models.

Product examination can pose significant technical difficulties, especially with increasing scale, complexity, and interconnection. It can also pose business-related difficulties, due to issues of intellectual property and liability. In cybersecurity, certain aspects of products are now becoming more readily accessible as areas for direct evaluation, including use of external sourcing in supply chains and the management of internal access gateways in systems. This is in part a consequence of a cyber-policy focus that advances small increments of transparency, what we could call translucency, such as has been directed for software bills of materials (SBOM) and zero trust (ZT) architectures. There are, of course, tradeoffs relating to transparency of products to evaluators, and this is a consideration in the use of open source software for mission systems.

Ironically, for modern AI systems, even full transparency of a model with billions of parameters may not yield much useful information to evaluators. This relates to the conflation of code and data in modern AI models noted at the outset of this series. There is significant research, however, in extracting associational maps from LLMs by looking at patterns of neuron activations. Conversely, black box AI models may reveal far more about their design and training than their creators may intend. The perceived confidentiality of training data can be broken through model inversion attacks for ML and memorized outputs for LLMs.
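
A related and widely studied illustration of such leakage is loss-based membership inference: examples the model fits unusually well are more likely to have been in its training set. The sketch below is a minimal, hedged version of that idea (predict_proba is a hypothetical stand-in for whatever interface exposes the model's class probabilities), offered as an illustration of the general leakage phenomenon rather than a description of the specific attacks mentioned above:

```python
import numpy as np

def membership_scores(predict_proba, examples, labels):
    """Score each example by the model's loss on it; unusually low loss is a
    signal that the example may have been memorized during training.
    (Loss-thresholding membership inference, a simple illustration of how
    training-data confidentiality can leak from a trained model.)"""
    probs = predict_proba(examples)                 # shape: (n_examples, n_classes)
    per_example_loss = -np.log(probs[np.arange(len(labels)), labels] + 1e-12)
    return per_example_loss                         # lower loss => stronger membership signal
```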

To be clear, direct evaluation of neural-network models will remain a significant technical challenge. This gives additional impetus to AI engineering and the application of appropriate principles to the development and evaluation of AI-based systems and the workflows that use them.

Incentives

The proliferation of process- and product-focused criteria, as just noted, can be a challenge for leaders seeking to maximize benefit while operating affordably and efficiently. The balancing of choices can be highly particular to the operational circumstances of a planned AI-based system as well as to the technical choices made regarding the internal design and development of that system. This is one reason why incentive-based approaches can often be preferable to detailed process-compliance mandates. Indeed, incentive-based approaches can offer more degrees of freedom to engineering leaders, enabling risk reduction through adaptations to operational workflows as well as to engineered systems.

Incentives can be both positive and negative. Positive incentives could be offered, for example, in development contracts when assertions relating to AI risks are backed by evidence or accountability. Evidence could relate to a wide range of early AI-engineering choices, ranging from systems architecture and operational workflows to model design and internal guardrails.

An incentive-based approach also has the advantage of enabling confident systems engineering—based on emerging AI engineering principles—to evolve in particular contexts of systems and missions even as we continue to work to advance the development of more general techniques. The March 2023 National Cybersecurity Strategy highlights the importance of accountability regarding data and software, suggesting one important possible framing for incentives. The challenge, of course, is how to develop reliable frameworks of criteria and metrics that can inform incentives for the engineering of AI-based systems.

Here is a summary of lessons for current evaluation practice for AI risks:

  1. Prioritize mission-relevant risks. Based on the specific mission profile, identify and prioritize potential weaknesses and vulnerabilities. Do this as early as possible in the process, ideally before systems engineering is initiated. This is analogous to the Department of Defense strategy of mission assurance.
  2. Identify risk-related goals. For those risks deemed relevant, identify goals for the system along with relevant system-level measures.
  3. Assemble the toolkit of technical measures and mitigations. For those same risks, identify technical measures, potential mitigations, and associated practices and tools. Track the development of emerging technical capabilities.
  4. Adjust top-level operational and engineering choices. For the higher priority risks, identify adjustments to first-order operational and engineering choices that could lead to likely risk reductions. This can include adapting operational workflow designs to limit potential consequences, for example by elevating human roles or reducing attack surface at the level of workflows. It could also include adapting system architectures to reduce internal attack surfaces and to constrain the impact of weaknesses in embedded ML capabilities.
  5. Identify methods to assess weaknesses and vulnerabilities. Where direct measures are lacking, surrogates must be employed. These methods could range from use of NIST-playbook-style checklists to adoption of practices such as DevSecOps for AI. It could also include semi-direct evaluations at the level of specifications and designs analogous to Common Criteria.
  6. Look for aligned attributes. Seek positive alignments of risk mitigations with possibly unrelated attributes that offer better measures. For example, productivity and other measurable incentives can drive adoption of practices favorable to reduction of certain categories of risks. In the context of AI risks, this could include use of design patterns for resilience in technical architectures as a way to localize any adverse effects of ML weaknesses.

The next post in this series examines the potential benefits of looking beyond the purely neural-network models towards approaches that link neural-network models with symbolic methods. Put simply, the goal of these hybridizations is to achieve a kind of hybrid vigor that combines the heuristic and linguistic virtuosity of modern neural networks with the verifiable trustworthiness characteristic of many symbolic approaches.
