Generating Realistic Non-Player Characters for Training Cyberteams

Since 2010, researchers in the SEI CERT Division have emphasized the crucial need for realism within cyberteam training and exercise events. Our approach to the construction and execution of these events has led to the publication of a design framework for cyberwarfare exercises that we call Realistic - Environment, Adversary, Communications, Tactics, and Roles (R-EACTR), which provides guidance on how to produce realistic training and exercise events.

In this blog post, we describe efforts underway to improve the realism of non-player characters (NPCs) in training exercises with new software we have created called ANIMATOR. ANIMATOR's ability to increase the realism of NPCs will be relevant and useful to anyone tasked with developing training for cyberteams. Moreover, as we describe below, highly realistic NPCs could also be applied to training machine-learning algorithms, honeypot payloads, insider-threat modeling, and social-network and relationship modeling.

Unrealistic scenarios that do not match real-world operations are unengaging for participants. To construct a comprehensive and optimally beneficial exercise, we want participants to work in an environment that resembles situations they will encounter in the real world. Realism extends beyond network topology to include other areas, such as scenarios, workflows, and behaviors. Building this experience requires replicating many things for the purposes of training and exercise—networks, workstations, organizations, groups, users, events, intelligence, reports, etc.

For many of these things, we have proper DevOps processes in place to create the artifacts, documents, and other materials that we need for an engagement of any size. This automation spans everything from the construction of the range network itself to its routers, switches, servers, workstations, and other machines. It also includes components of the scenario that participants will operate or interact with, such as the road to war, intel specific to scenario-threat types, and the NPCs that have a role to play within the exercise.

One existing platform that we use often is CERT GHOSTS, an NPC simulation-and-orchestration platform for realistic network behavior and the resulting traffic. We use this software to bring users to life on a computer network and have them perform the activities that our cybersecurity-professional participants see on their work networks. In practice, however, we have always been a bit disappointed in the tools available to generate the personas of these NPCs, including names, addresses, email addresses, and other datapoints. The results never feel quite real enough, and often the value generated for one datapoint does not correspond in any way to another, already-generated value, even when the two clearly bear on one another. For example, we started to ask questions such as the following:

If we generate a six-foot-tall NPC, how much should they weigh?
What is the probability of their having blood type O positive?
How many social-media accounts should they have?
Based on their age, what types of career positions have they had?
If an NPC is in the military, what rank could they be?
What unit would they serve with, and in what capacity?
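
One way to make such datapoints bear on one another is to derive the later value from the earlier one instead of sampling each independently. The Python sketch below is illustrative only and is not ANIMATOR code (ANIMATOR itself is written in C#). It samples a body-mass index and converts it to a weight for a height that has already been generated. The function name, the clamp range, and the BMI mean and spread are placeholder assumptions, not sourced statistics.

```python
import random

def weight_for_height_kg(height_m, rng, bmi_mean=26.0, bmi_sd=4.0):
    """Derive a weight from an already-generated height via a sampled BMI.

    bmi_mean and bmi_sd are PLACEHOLDER values chosen for illustration;
    a real generator would take them from a sourced dataset.
    """
    bmi = rng.gauss(bmi_mean, bmi_sd)
    bmi = min(max(bmi, 16.0), 45.0)        # clamp to an assumed plausible range
    return round(bmi * height_m ** 2, 1)   # BMI = kg / m^2, so kg = BMI * m^2

rng = random.Random(42)                    # seeded so runs are repeatable
six_feet_m = 1.8288                        # six feet expressed in meters
print(weight_for_height_kg(six_feet_m, rng))
```

Because the weight is computed from the height, a taller NPC drawn with the same BMI always comes out heavier, which is the kind of correspondence the questions above are asking for.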

ANIMATOR Software for Generating Realistic NPC Data

To better address the questions we found ourselves asking, we set out to build our own software, ANIMATOR, which generates more realistic NPC data for use in simulation, training, and exercises. One of our early ideas was to add robust support for military personnel with regard to rank, unit, billet, and military occupational specialty (MOS) code. Another idea was to factor in education, career, and event history to enable detailed analysis of insider-threat potential. We also added the types of accounts and security measures (such as PGP keys and certificates) that we might use during an exercise.

For each datapoint that ANIMATOR generates, we tried to follow a public reference so that the engine's output matches the percentage breakdown of that metric in the real world. For example, if we generate an individual at random within the U.S. military, how do we determine the branch to which they are likely to belong? Here we follow guidance from the Department of Defense directly. Each NPC has more than 25 categories of associated details and more than 100 pieces of metadata defining who they are. Each piece of information is generated using sourced datasets to distribute characteristics realistically.

Applying ANIMATOR Data Beyond Cyberteam Training

The data generated by ANIMATOR can be leveraged in many ways, but is particularly applicable in four key areas:

Training machine-learning algorithms: ANIMATOR creates large sets of realistic user data and could easily be leveraged to generate datasets used for training machine-learning (ML) algorithms. This capability enables the rapid training of anthropology-related ML algorithms leveraging one or more of the 100-plus datapoints generated by ANIMATOR.
Honeypot payloads: NPC details generated by ANIMATOR make the user data convincingly real while still being completely fabricated. Therefore, the data is ideal for use in applications like honeypots, where the goal is to trick attackers into thinking they are compromising an asset with real user data.
Insider-threat modeling: Each ANIMATOR NPC is given an insider-threat profile. This profile determines how likely an NPC is to be an insider threat by incorporating the Center for Development of Security Excellence's (CDSE’s) insider-threat potential risk indicators. As we continue developing ANIMATOR, it will be possible to configure NPCs so they are more or less likely to be insider threats according to factors like finance, criminal history, foreign contact, and mental health.
Social-network and relationship modeling: ANIMATOR can establish relationships between the NPCs it generates. As we increase the fidelity of these relationships, ANIMATOR NPCs form larger and more realistic social networks. Because ANIMATOR can generate thousands of interrelated NPCs quickly, it can readily be used for social-network modeling and research.
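
To make the social-network idea concrete, here is a minimal Python sketch of one way interrelated NPCs could be represented as a graph. It is not how ANIMATOR models relationships. The function name, the fixed group size, and the number of cross-group ties are all hypothetical choices made for illustration.

```python
import random
from collections import defaultdict

def build_relationships(npc_ids, rng, group_size=5, cross_links=2):
    """Return an undirected adjacency map {npc_id: set of related npc_ids}.

    NPCs are shuffled into small groups (think family or work unit) whose
    members all know each other, then a few random cross-group ties are added.
    """
    ids = list(npc_ids)
    rng.shuffle(ids)
    graph = defaultdict(set)
    for start in range(0, len(ids), group_size):
        group = ids[start:start + group_size]
        for a in group:
            for b in group:
                if a != b:
                    graph[a].add(b)        # the reverse tie is added when the loops swap a and b
    if len(ids) >= 2:
        for _ in range(cross_links):
            a, b = rng.sample(ids, 2)      # two distinct NPCs
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)

network = build_relationships(range(20), random.Random(0))
print(len(network), "NPCs in the network")
```

A plain adjacency map like this can be handed to graph-analysis tooling or walked directly, which is enough for simple experiments in social-network modeling.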

From a technical perspective, we layered our approach in the hope that others could choose the use case that best suits their own projects. ANIMATOR provides a common library, written in C# for .NET Core, that other projects can use to access its generation capabilities. Moreover, individual NPCs can be connected to others and to a larger group-of-groups NPC chain via an API that is distributed as a buildable web application or directly as a Docker container.

For example, for a request to create a new NPC, ANIMATOR does the following:

Once ANIMATOR receives a request to create NPCs, it starts by creating an empty NPC profile.
ANIMATOR then iterates through all 100+ datapoints for the NPC and generates synthetic data for each one. Example datapoints are name, address, mental health, career, finances, and family members. Datapoints are generated either purely at random or using weighted randomization. Weighted randomization leverages verified datasets to shape the distribution of randomly generated datapoints so that it matches reality much more closely. Our primary goal in ANIMATOR is to make our data as realistic as possible by using weighted randomization for every datapoint for which we can find a dataset.
ANIMATOR repeats this process for as many NPCs as the request specifies. The resulting information can be exported through the API or stored in a local database. ANIMATOR currently stores NPC data in a local MongoDB database, and this feature is still being actively improved and expanded.
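
The three steps above can be sketched in a few lines of Python. This is a simplified illustration, not ANIMATOR's implementation (which is a C# library): the names `GENERATORS`, `weighted_choice`, and `create_npcs` are hypothetical, only two datapoints are shown, and the branch weights are placeholders rather than Department of Defense figures.

```python
import random

FIRST_NAMES = ["Alex", "Jordan", "Morgan", "Taylor"]   # sample values only

# PLACEHOLDER weights for illustration; a real generator would load these
# from a sourced dataset rather than hard-coding them.
BRANCH_WEIGHTS = {"Army": 40, "Navy": 25, "Air Force": 25, "Marine Corps": 10}

def weighted_choice(weights, rng):
    """Pick one key with probability proportional to its weight."""
    options = list(weights)
    return rng.choices(options, weights=[weights[o] for o in options], k=1)[0]

# One generator per datapoint: some purely random, some weighted.
GENERATORS = {
    "first_name": lambda rng: rng.choice(FIRST_NAMES),                    # uniform random
    "military_branch": lambda rng: weighted_choice(BRANCH_WEIGHTS, rng),  # weighted random
}

def create_npcs(count, rng):
    npcs = []
    for _ in range(count):
        profile = {}                              # step 1: start with an empty profile
        for name, generate in GENERATORS.items():
            profile[name] = generate(rng)         # step 2: fill in every datapoint
        npcs.append(profile)
    return npcs                                   # step 3: one profile per requested NPC

for npc in create_npcs(3, random.Random(7)):
    print(npc)
```

Keeping each datapoint behind its own generator function means a uniform-random generator can later be swapped for a weighted one as soon as a suitable dataset is found, without touching the loop.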

We continue to work on ANIMATOR, fixing issues and adding new features as they are identified. Some of these enhancements are driven by our own internal use of ANIMATOR for the many exercises we build and execute for our customers, but we also strive to respond quickly and proactively to requests from users and potential users of the CERT GHOSTS platform, and to ask the community for feedback and improvements. This strategy has served us well for other GHOSTS projects hosted on GitHub. We welcome your feedback as we continue to move forward on this and other projects in the exercise-realism space.