Improving Assessments for Cybersecurity Training

The CERT Cyber Workforce Development Directorate conducts training in cyber operations for the DoD and other government customers as part of its commitment to strengthen the nation's cybersecurity workforce. Part of this work is to develop capabilities that better enable DoD cyber forces to "train as you fight," such as setting up high-fidelity simulation environments in which cyber forces can practice skills including network defense, incident response, and digital forensics. However, cybersecurity is a challenging domain in which to train because it is a dynamic discipline that changes rapidly and requires those working in the field to regularly learn and practice new skills.

We build simulation environments that allow people to practice skills such as setting up and defending networks. If we can record informative traces of activity in these online environments and draw accurate inferences about trainee capabilities, then we can provide evidence-based guidance on performance, assess mission readiness, optimize training schedules, and refine training modules. In this blog post, I discuss efforts by CERT to develop this new approach to assessing the skills of the cybersecurity workforce.

In our research, we are looking at the actions that users take while they are working through practice activities and what those actions can tell us about the user's acquisition of skills. We believe that by watching the choices that a person makes during a simulation activity, we can infer whether he or she is pursuing an expert strategy, a novice strategy, or something in between. Our goal in this research is to improve the fidelity of assessments that are intended to gauge the level of understanding and future success in the skills being trained.

Currently, the most common way of assessing how good operators are at the skills they acquire in training is to ask them multiple-choice questions at different points during the practice. However, the quality of a multiple-choice question depends on the quality of the wrong answers, or distractors. In a standard SAT question, for example, there are four options and one of them is usually clearly wrong. So, there may be only three viable options, and in many cases only two that are good ones. If there are only two good options, test takers may not be sure which answer is better, but even if they have no idea what the correct answer is, they can guess and have a 50-50 chance of being correct.
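The guessing arithmetic above can be made concrete with a short sketch. The function names, the 10-question quiz, and the 7-correct passing threshold are hypothetical choices for illustration only; they do not come from any actual assessment.

```python
from fractions import Fraction
from math import comb


def guess_probability(total_options: int, eliminated: int) -> Fraction:
    """Chance of a correct answer when guessing uniformly among the
    options left after ruling out `eliminated` distractors."""
    remaining = total_options - eliminated
    if remaining < 1:
        raise ValueError("at least one option must remain")
    return Fraction(1, remaining)


def chance_of_at_least(correct_needed: int, questions: int, p: Fraction) -> Fraction:
    """Binomial probability of getting at least `correct_needed` of
    `questions` right when each guess succeeds with probability `p`."""
    return sum(
        (comb(questions, k) * p**k * (1 - p) ** (questions - k)
         for k in range(correct_needed, questions + 1)),
        Fraction(0),
    )


# Four options, two of them easy to rule out: a coin flip per question.
p = guess_probability(total_options=4, eliminated=2)
print("chance per question:", p)

# Hypothetical quiz: 10 questions, 7 correct needed to pass.
print("chance of passing by guessing alone:", chance_of_at_least(7, 10, p))
```

Under these assumed numbers, a test taker with no knowledge of the material still passes a meaningful fraction of the time, which is the sense in which weak distractors reduce the information a score carries.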

This example illustrates that the administrators of a test do not get nearly as much information about the test taker's skills from a multiple-choice or short-answer test as they would from observing the steps the test taker took to arrive at an answer. A well-known study from several years ago looked at a math tutor program for middle-school students in an online computer-math environment. In the study, the researchers kept data not just on whether students solved problems correctly, but also on whether students asked for help, consulted a glossary, and so on. The study showed that combining an extra-assistance score with the simple scores for solving or failing to solve problems was more predictive of final performance and final learning in the class than the simple scores alone. That study demonstrated the value of using extra available information in addition to information about right and wrong test answers.

Challenges of Assessing Cybersecurity Training

Cybersecurity training has unique qualities that make assessing skills challenging. Assessing the viability of training is not a problem that is unique to the field of cybersecurity, but the dynamism of the field is. Algebra, for example, has not changed much in the past 100 years, and the field of calculus has been relatively stable for the past 60. Cybersecurity, on the other hand, requires that those working in the field regularly learn and practice new skills. As a result, researchers in cybersecurity don't have 20 years to research optimal ways of teaching people how to do things.

We cannot directly measure knowledge inside a trainee's head. Instead, we must infer what knowledge they possess, based on things that we can observe. Drawing accurate inferences from performance has three requirements:

a domain model. A domain model is the critical mapping between what we can observe (e.g., what commands the trainee entered) and what we want to infer (e.g., that they can secure a network in the future). The domain model is a prerequisite to inform what relevant data looks like. Without an explicit and accurate domain model for cybersecurity task performance, we risk drawing inaccurate conclusions, possibly leading to inaccurate assessments of mission readiness.
relevant data. We cannot distinguish between expert and novice performance if we do not record data on the ways in which they differ. Both novices and experts might be able to complete a task, but the experts are generally faster and use more efficient strategies. If we record only right/wrong, we cannot distinguish performance. If we can record solution time and other features of performance, however, we can then start to model what expert performance looks like.
statistical modeling to compare performance data to expert patterns. Everyone will occasionally "slip" and not give a correct response, even when they know the answer to a problem. (Have you ever made a typo?) To account for these kinds of errors, it is important to probabilistically compare actual performance to expected performance patterns. A good statistical model will explicitly capture the idea that an expert should be more likely than a novice to arrive at a quality solution.
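The third requirement, comparing performance probabilistically while allowing for slips, can be illustrated with a minimal sketch. It assumes a deliberately simple model that is not the model used in this research: each observed step is either efficient or not, an expert takes the efficient step with probability 0.9 (so slips 10 percent of the time), and a novice takes it with probability 0.4. Those numbers, the equal prior, and the function names are all assumptions made for illustration.

```python
import math


def log_likelihood(observations, p_efficient):
    """Log-probability of a sequence of True/False 'efficient step'
    observations when each step is efficient with probability p_efficient."""
    return sum(
        math.log(p_efficient if efficient else 1.0 - p_efficient)
        for efficient in observations
    )


def posterior_expert(observations, p_expert=0.9, p_novice=0.4, prior_expert=0.5):
    """Posterior probability that the trainee is an expert, via Bayes' rule.

    p_expert, p_novice, and prior_expert are assumed values, not estimates
    from real training data.
    """
    log_e = math.log(prior_expert) + log_likelihood(observations, p_expert)
    log_n = math.log(1.0 - prior_expert) + log_likelihood(observations, p_novice)
    # Subtract the larger log-weight before exponentiating for stability.
    peak = max(log_e, log_n)
    weight_e = math.exp(log_e - peak)
    weight_n = math.exp(log_n - peak)
    return weight_e / (weight_e + weight_n)


# Nine efficient steps and one slip still look like expert performance.
mostly_efficient = [True] * 9 + [False]
print(round(posterior_expert(mostly_efficient), 3))
```

Because the model assigns experts a nonzero slip probability, a single mistake lowers the posterior only slightly instead of ruling out expertise altogether.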

SEI's Simulation, Training & Exercise Platform (STEP) is designed to be a high-fidelity environment where trainees can practice cybersecurity skills. However, one challenge is that the current data-collection method from our simulation environment has limited flexibility in recording events. Specifically, the system can run sequential checks to see if certain events have occurred and store that information, but we must specify in advance what events are of interest, and if events happen out of order, they may not be recorded. Having an accurate domain model will allow us to refine the training modules to cover the right activities and allow us to record the events necessary to assess an operative's skill accurately.
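To show why sequential checks can lose out-of-order events, here is a small sketch. It is not STEP's implementation; the two functions and the event names are hypothetical stand-ins that contrast an ordered checker with a simple order-independent log.

```python
def sequential_check(events, expected_order):
    """Record expected events only when they arrive in the listed order.

    Hypothetical model of a checker that waits for one event at a time.
    """
    recorded = []
    next_index = 0
    for event in events:
        if next_index < len(expected_order) and event == expected_order[next_index]:
            recorded.append(event)
            next_index += 1
    return recorded


def order_independent_log(events, events_of_interest):
    """Record every event of interest with its position in the trace."""
    return [
        (position, event)
        for position, event in enumerate(events)
        if event in events_of_interest
    ]


# Hypothetical event names; the trainee updates the firewall before scanning.
expected = ["scan_network", "update_firewall", "restart_service"]
trainee_trace = ["update_firewall", "scan_network", "restart_service"]

print(sequential_check(trainee_trace, expected))
print(order_independent_log(trainee_trace, set(expected)))
```

In this toy trace the ordered checker keeps only one of the three events, while the positional log keeps all three along with the order in which they occurred, which is the kind of information needed to compare strategies.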

Another challenge in this research is that the existing tools for building domain models are inadequate for our needs. Think-aloud verbal protocols, used to build domain models in algebra, physics, and other domains, are time- and labor-intensive and cannot keep pace with the rapid rate of change in cybersecurity. Cultural consensus theory (CCT) is a model designed to discover the "right" answers when the truth is unknown, as in anthropological studies, but CCT can handle only binary "Yes/No" questions, not the complex data available through logs of interactions within a learning environment. Deep learning and neural networks are not viable for small data sets; they need tens of thousands of observations to produce reliable results. Gaining access to that many data points is not feasible for us, given that we are training smaller numbers of people in specific and specialized skills. Our goal in this research is therefore to craft a method that can find patterns of behavior in relatively small sample sizes.

To this end, we are developing an unsupervised statistical-learning model to analyze the actions people take to solve a task. We will then use this method to build domain models from a small sample of performance traces as students work through the cybersecurity training exercise. We are building on mixed membership models (MMMs), a class of model that includes latent Dirichlet allocation (LDA) as probably the best-known example. MMMs can discover performance profiles in some situations with data from as few as 50 people. MMMs have also been used for natural-language processing and for modeling genetic migration.

I have worked with MMMs for other data and succeeded in finding misconception profiles in a couple of K-12 cases. For example, for physics data, there is a test called the Force Concept Inventory that measures mastery of concepts commonly taught in a first semester of physics. When we tested MMMs on physics data, we were able not only to discover an expert profile that recovered the answer key for the test, but also to find a profile of common misconceptions. One of the key topics in Physics I is the laws of motion, and it is well known that students struggle and have many misconceptions regarding momentum and force; we showed that these misconceptions are all related to each other. The fact that the MMM was able to find both the knowledgeable profile (the answer key) and a set of common misconceptions gives us confidence that we can do the same thing with our cybersecurity test data.

People may try lots of different approaches during the cybersecurity exercises, so our research involves judgments about what is expert behavior. The algorithm can find partial matches to patterns, so there may be a person who does some things that match the novice pattern and others that match the expert pattern. For example, if we're thinking about using a bash command line, novices might use the up-arrow or tab-completion to speed up their work, while experts are much more likely to use history expansion features (e.g., ls -l !tar:2). No one will use a ! or a $ in every command, but some usage of these features is probabilistically an indicator of higher expertise. In general, the results of this algorithm allow us to characterize expert performance as "by and large" looking one way and novice performance as "by and large" looking another way, with individual people falling somewhere in between.
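The idea of a partial match can be sketched as a membership weight between 0 and 1. The example below is a simplification and not the model under development: a real mixed membership model learns the profiles from data, whereas here the two profiles are fixed, made-up probability tables over four hypothetical shell behaviors, and only one trainee's weight toward the expert profile is estimated, using a basic expectation-maximization loop.

```python
# Made-up profiles: probability of each behavior under each pattern.
# Every behavior must have a positive probability in both profiles.
EXPERT_PROFILE = {
    "up_arrow": 0.20,
    "tab_completion": 0.30,
    "history_expansion": 0.40,
    "retyped_command": 0.10,
}
NOVICE_PROFILE = {
    "up_arrow": 0.40,
    "tab_completion": 0.30,
    "history_expansion": 0.05,
    "retyped_command": 0.25,
}


def estimate_expert_membership(actions, expert=EXPERT_PROFILE,
                               novice=NOVICE_PROFILE, iterations=200):
    """Estimate the share of a trainee's actions drawn from the expert
    profile, treating each action as coming from a two-profile mixture."""
    if not actions:
        raise ValueError("need at least one observed action")
    theta = 0.5  # start undecided between the two profiles
    for _ in range(iterations):
        # E-step: how strongly each action points to the expert profile.
        responsibilities = []
        for action in actions:
            from_expert = theta * expert[action]
            from_novice = (1.0 - theta) * novice[action]
            responsibilities.append(from_expert / (from_expert + from_novice))
        # M-step: the new weight is the average responsibility.
        theta = sum(responsibilities) / len(responsibilities)
    return theta


mixed_trace = ["history_expansion"] * 2 + ["retyped_command"] * 2 + ["up_arrow"] * 2
print(round(estimate_expert_membership(mixed_trace), 2))
```

A trace dominated by history expansion pushes the weight toward 1, a trace of retyped commands pushes it toward 0, and a mixed trace such as the one above lands in between, which mirrors the "by and large" characterization in the text.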

The intent of our research is to find patterns in what people did and then use those patterns as a basis of comparison for anyone who goes through that exercise in the future. The model we develop will enable us to identify clusters of behavior patterns that distinguish expert performance from novice performance, providing an empirical basis for the domain models of different tasks.

Goals of Our Research

We believe that our research can move us toward being able to conduct better assessments. It will build our capability for making inferences from the actions that people take when performing tasks. If people go through training and can answer a couple of questions correctly but can't really perform the task when called on to do so, our assessment of whether they were mission ready will have been incorrect. Our goal is to be more accurate and to have greater confidence in saying that someone is mission ready. This research will also help us to make training more efficient by enabling us to identify and fix trainee misconceptions when they arise.

Our next step will be to incorporate this methodology into the development of new learning modules and platforms, so that instruction and assessment target the key differences between experts and novices. After that, the model could be implemented as a supervised learning method to assess mission readiness and provide detailed feedback during training.

Additional Resources

Learn about SEI on-demand training.

Read about Cyber Guard and Cyber Flag exercises.

Read about Cyber Command's annual exercise, Cyber Flag.

Read U.S. Cyber Command Conducts Tactical Cyber Exercise.

Software Engineering Institute

SEI Blog