Considerations for Evaluating Large Language Models for Cybersecurity Tasks
White Paper
Publisher: Software Engineering Institute
Abstract
Generative artificial intelligence (AI) and large language models (LLMs) have taken the world by storm. The ability of LLMs to perform tasks seemingly on par with humans has led to rapid adoption in a variety of domains, including cybersecurity. However, caution is needed when using LLMs in a cybersecurity context because of the high-stakes consequences of errors and the technical particularities of the domain. Current approaches to LLM evaluation tend to focus on factual knowledge rather than applied, practical tasks, yet cybersecurity tasks often require more than factual recall to complete. Humans performing cybersecurity tasks are often assessed in part on their ability to apply concepts to realistic situations and to adapt to changing circumstances. This paper contends that the same approach is necessary to accurately evaluate the capabilities and risks of using LLMs for cybersecurity tasks. To enable the creation of better evaluations, we identify key criteria to consider when designing LLM cybersecurity assessments. We further refine these criteria into a set of recommendations for assessing LLM performance on cybersecurity tasks, including properly scoping tasks, designing tasks based on real-world cybersecurity phenomena, minimizing spurious results, and ensuring that results are not misinterpreted.