
OpenAI Collaboration Yields 14 Recommendations for Evaluating LLMs for Cybersecurity

Jeffrey Gennari and Shing-hon Lau

Large language models (LLMs) have shown a remarkable ability to ingest, synthesize, and summarize knowledge while simultaneously demonstrating significant limitations in completing real-world tasks. One notable domain that presents both opportunities and risks for leveraging LLMs is cybersecurity. LLMs could empower cybersecurity experts to be more efficient or effective at preventing and stopping attacks. However, adversaries could also use generative artificial intelligence (AI) technologies in kind. We have already seen evidence of actors using LLMs to aid in cyber intrusion activities (e.g., WormGPT and FraudGPT). Such misuse raises many important questions about cybersecurity capabilities, including:

  • Can an LLM like GPT-4 write novel malware?
  • Will LLMs become critical components of large-scale cyber-attacks?
  • Can we trust LLMs to provide cybersecurity experts with reliable information?

The answers to these questions depend on the analytic methods chosen and the results they provide. Unfortunately, current methods and techniques for evaluating the cybersecurity capabilities of LLMs are not comprehensive. Recently, a team of researchers in the SEI CERT Division worked with OpenAI to develop better approaches for evaluating LLM cybersecurity capabilities. This SEI Blog post, excerpted from a recently published paper that we coauthored with OpenAI researchers Joel Parish and Girish Sastry, summarizes 14 recommendations to help assessors accurately evaluate LLM cybersecurity capabilities.

The Challenge of Using LLMs for Cybersecurity Tasks

Real cybersecurity tasks are often complex and dynamic and require broad context to be assessed fully. Consider a traditional network intrusion where an attacker seeks to compromise a system. In this scenario, there are two competing roles: attacker and defender, each with different goals, capabilities, and expertise. Attackers may repeatedly change tactics based on defender actions and vice versa. Depending on the attackers’ goals, they may emphasize stealth or attempt to quickly maximize damage. Defenders may choose simply to observe the attack to learn adversary tendencies or gather intelligence, or they may immediately expel the intruder. All the variations of attack and response are impossible to enumerate in isolation.

There are many considerations for using an LLM in this type of scenario. Could the LLM make suggestions or take actions on behalf of the cybersecurity expert that stop the attack more quickly or more effectively? Could it suggest or take actions that do unintended harm or prove to be ruinous?

These types of concerns speak to the need for thorough and accurate assessment of how LLMs work in a cybersecurity context. However, understanding the cybersecurity capabilities of LLMs to the point that they can be trusted for use in sensitive cybersecurity tasks is hard, partly because many current evaluations are implemented as simple benchmarks that tend to be based on information retrieval accuracy. Evaluations that focus only on the factual knowledge LLMs may have already absorbed, such as having artificial intelligence systems take cybersecurity certification exams, may skew results towards the strengths of the LLM.

Without a clear understanding of how an LLM performs on applied and realistic cybersecurity tasks, decision makers lack the information they need to assess opportunities and risks. We contend that practical, applied, and comprehensive evaluations are required to assess cybersecurity capabilities. Realistic evaluations reflect the complex nature of cybersecurity and provide a more complete picture of cybersecurity capabilities.

Recommendations for Cybersecurity Evaluations

To properly judge the risks and appropriateness of using LLMs for cybersecurity tasks, evaluators need to carefully consider the design, implementation, and interpretation of their assessments. Tests based on practical and applied cybersecurity knowledge are preferable to general fact-based assessments. However, creating these types of assessments can be a formidable task that encompasses infrastructure, task/question design, and data collection. The following list of recommendations is meant to help assessors craft meaningful and actionable evaluations that accurately capture LLM cybersecurity capabilities. The expanded list of recommendations is outlined in our paper.

Define the real-world task that you would like your evaluation to capture.

Starting with a clear definition of the task helps clarify decisions about complexity and assessment. The following recommendations are meant to help define real-world tasks:

  1. Consider how humans do it: Starting from first principles, think about how the task you would like to evaluate is accomplished by humans, and write down the steps involved. This process will help clarify the task.
  2. Use caution with existing datasets: Current evaluations within the cybersecurity domain have largely leveraged existing datasets, which can influence the type and quality of tasks evaluated.
  3. Define tasks based on intended use: Carefully consider whether you are interested in autonomy or human-machine teaming when planning evaluations. This distinction will have significant implications for the type of assessment that you conduct.

Represent tasks appropriately.

Most tasks worth evaluating in cybersecurity are too nuanced or complex to be represented with simple queries, such as multiple-choice questions. Rather, queries need to reflect the nature of the task without being unintentionally or artificially limiting. The following guidelines ensure evaluations incorporate the complexity of the task:

  1. Define an appropriate scope: While subtasks of complex tasks are usually easier to represent and measure, their performance does not always correlate with the larger task. Ensure that you do not represent the real-world task with a narrow subtask.
  2. Develop an infrastructure to support the evaluation: Practical and applied tests will generally require significant infrastructure support, particularly in supporting interactivity between the LLM and the test environment (see the sketch following this list).
  3. Incorporate affordances to humans where appropriate: Ensure your assessment mirrors real-world affordances and accommodations given to humans.
  4. Avoid affordances to humans where inappropriate: Evaluation formats designed for humans, such as higher-education exams and professional certifications, often ignore real-world complexity; avoid importing those simplifications into LLM assessments.
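The infrastructure mentioned in item 2 above should, at a minimum, let the model act over multiple turns against an isolated environment rather than answer a single static question. Below is a minimal sketch of such a harness, assuming a command-execution style task; query_model, run_in_sandbox, and the FLAG{ success check are placeholders for whatever model API, sandbox, and grading criterion a real evaluation would use.

from dataclasses import dataclass, field

@dataclass
class EpisodeLog:
    task_id: str
    turns: list = field(default_factory=list)  # (command, output) pairs
    solved: bool = False

def query_model(messages: list) -> str:
    """Placeholder: call your LLM of choice and return its next proposed command."""
    raise NotImplementedError

def run_in_sandbox(command: str) -> str:
    """Placeholder: execute the command in an isolated VM or container and return
    its output. Never run untrusted model output on a production host."""
    raise NotImplementedError

def run_episode(task_prompt: str, task_id: str, max_turns: int = 10) -> EpisodeLog:
    """Let the model interact with the environment for up to max_turns steps and
    record the full transcript for later scoring."""
    log = EpisodeLog(task_id=task_id)
    messages = [
        {"role": "system", "content": "You are assisting with a security exercise."},
        {"role": "user", "content": task_prompt},
    ]
    for _ in range(max_turns):
        command = query_model(messages)
        output = run_in_sandbox(command)
        log.turns.append((command, output))
        messages.append({"role": "assistant", "content": command})
        messages.append({"role": "user", "content": output})
        if "FLAG{" in output:  # example success criterion for a CTF-style task
            log.solved = True
            break
    return log

Logging the full transcript, not just the final outcome, also supports the grading and framing recommendations that follow.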

Make your evaluation robust.

Use care when designing evaluations to avoid spurious results. Assessors should consider the following guidelines when creating assessments:

  1. Use preregistration: Consider how you will grade the task ahead of time.
  2. Apply realistic perturbations to inputs: Changing the wording, ordering, or names in a question would have minimal effect on a human but can result in dramatic shifts in LLM performance. These changes must be accounted for in assessment design (see the perturbation sketch following this list).
  3. Beware of training data contamination: LLMs are frequently trained on large corpora, including security news, vulnerability feeds, Common Vulnerabilities and Exposures (CVE) websites, and code and online discussions of security. These data may make some tasks artificially easy for the LLM.
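As one way to operationalize the perturbation guidance in item 2, the sketch below re-runs a multiple-choice question under superficial changes, shuffled answer order and renamed hosts, and reports the average score. The question format, the evaluate callback, and the specific perturbations are illustrative assumptions, not a prescribed method.

import random
from statistics import mean

def shuffle_choices(question: dict, rng: random.Random) -> dict:
    """Return a copy of a multiple-choice question with the answer order shuffled."""
    choices = question["choices"][:]
    rng.shuffle(choices)
    return {**question, "choices": choices}

def rename_entities(text: str, mapping: dict) -> str:
    """Swap superficial names (hosts, users) that should not affect the answer."""
    for old, new in mapping.items():
        text = text.replace(old, new)
    return text

def score_with_perturbations(question: dict, evaluate, n_trials: int = 20) -> float:
    """Run `evaluate` (a placeholder that queries the model and returns 1.0 or 0.0
    for a graded answer) over many perturbed variants and report the mean score."""
    rng = random.Random(0)
    scores = []
    for _ in range(n_trials):
        variant = shuffle_choices(question, rng)
        variant["prompt"] = rename_entities(
            variant["prompt"], {"web01": f"srv{rng.randint(10, 99)}"}
        )
        scores.append(evaluate(variant))
    return mean(scores)

A large gap between the unperturbed score and the perturbed mean is a sign that the original result was brittle or possibly contaminated.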

Frame results appropriately.

Evaluations with a sound methodology can still misleadingly frame results. Consider the following guidelines when interpreting results:

  1. Avoid overgeneralized claims: Avoid making sweeping claims about capabilities from the task or subtask evaluated. For example, strong model performance in an evaluation measuring vulnerability identification in a single function does not mean that a model is good at discovering vulnerabilities in a real-world web application, where resources, such as access to source code, may be restricted.
  2. Estimate best-case and worst-case performance: LLMs may have wide variations in evaluation performance due to different prompting strategies or because they use additional test-time compute techniques (e.g., chain-of-thought prompting). Best-case/worst-case scenarios will help constrain the range of outcomes (see the sketch following this list).
  3. Be careful with model selection bias: Any conclusions drawn from evaluations should be put into the proper context. If possible, run tests on a variety of contemporary models, or qualify claims appropriately.
  4. Clarify whether you are evaluating risk or evaluating capabilities: A judgment about the risk of models requires a threat model. In general, however, the capability profile of the model is only one source of uncertainty about the risk. Task-based evaluations can help understand the capability of the model.
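One way to report the best-case/worst-case range described in item 2 is to run the same task suite under several prompting strategies and publish the spread rather than a single number. The sketch below assumes a hypothetical run_eval function and illustrative strategy prompts; the specific strategies are examples, not an exhaustive list.

STRATEGIES = {
    "zero_shot": "Answer the question directly.",
    "few_shot": "Here are three worked examples ... Now answer the question.",
    "chain_of_thought": "Think step by step before giving a final answer.",
}

def run_eval(strategy_prompt: str) -> float:
    """Placeholder: run the full task suite with this prompting strategy and
    return the fraction of tasks solved."""
    raise NotImplementedError

def performance_bounds() -> tuple:
    """Report the observed range across strategies instead of a single score."""
    scores = {name: run_eval(prefix) for name, prefix in STRATEGIES.items()}
    worst, best = min(scores.values()), max(scores.values())
    print(f"Observed range across strategies: {worst:.2%} to {best:.2%}")
    return worst, best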

Wrapping Up and Looking Ahead

AI and LLMs have the potential to be an asset to cybersecurity professionals, but they can also be a boon to malicious actors unless risks are managed properly. To better understand and assess the cybersecurity capabilities and risks of LLMs, we propose developing evaluations that are grounded in real and complex scenarios with competing goals. Assessments based on standard, factual knowledge skew towards the type of reasoning LLMs are inherently good at (i.e., factual information recall).

To get a more complete sense of cybersecurity expertise, evaluations should consider applied security concepts in realistic scenarios. This recommendation is not to say that a basic command of cybersecurity knowledge is not valuable to evaluate; rather, more realistic and robust assessments are required to judge cybersecurity expertise accurately and comprehensively. Understanding how an LLM performs on real cybersecurity tasks will provide policy and decision makers with a clearer sense of capabilities and the risks of using these technologies in such a sensitive context.

Additional Resources

Considerations for Evaluating Large Language Models for Cybersecurity Tasks by Jeffrey Gennari, Shing-hon Lau, Samuel Perl, Joel Parish (OpenAI), and Girish Sastry (OpenAI)
