
Improving Data Extraction from Cybersecurity Incident Reports

Samuel J. Perl

This post is also authored by Matt Sisk, the lead author of each of the tools detailed in this post (bulk query, autogeneration, and all regex).

The number of cyber incidents affecting federal agencies has continued to grow, increasing about 1,300 percent from fiscal year 2006 to fiscal year 2015, according to a September 2016 GAO report. For example, in 2015, agencies reported more than 77,000 incidents to US-CERT, up from 67,000 in 2014 and 61,000 in 2013. These incident reports come from a diverse community of federal agencies, and each may contain observations of problematic activity by a particular reporter. As a result, reports vary in content, context, and in the types of data they contain. Reports are stored in the form of 'tickets' that assign and track progress toward closure.

This blog post is the first in a two-part series on our work with US-CERT to discover and make better use of data in cyber incident tickets, which can be notoriously diverse. Specifically, this post focuses on work we have done to improve useful data extraction from cybersecurity incident reports.

Current State of Cyber Incident Reporting

US-CERT is responsible for "analyzing and reducing cyber threats, vulnerabilities, disseminating cyber threat warning information, and coordinating incident response activities" for more than 100 civilian government agencies. Each of these organizations has its own procedures and, in some cases, unique processes for reporting cyber incidents. Each federal agency served by US-CERT has a different organizational structure and is involved in different business activities. For example, the Department of Commerce and the Department of Interior each have a unique online footprint in terms of their internet activity and the types of systems they have. The unique nature of each organization increases the challenge of trying to compare problems across the whole spectrum.

Agencies usually report cyber incidents one at a time, and each report varies in content and reporting style. Information that could help analysts identify signs of coordinated or related attacks across multiple federal agencies is sometimes locked inside the workflow system (i.e., the system used to track and manage the ticket), making such information hard to identify when it is spread over long periods of time.

Information Extraction Goals

In prior work, we used a set of regular expressions (regex) to find common incident data types in the free text of incident reports, including IP addresses (v4 and v6), domain names ("original" top-level domains (TLDs)), email addresses, file names, file paths, and file hash values. This set allowed us to extract many values from the text of incident tickets, but we were never entirely sure how many values our extraction process was missing.
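To make the idea of regex-based extraction concrete, here is a minimal Python sketch. The three patterns and the sample report text are simplified assumptions made up for illustration; they are not the expressions used in the work described in this post.

```python
import re

# Illustrative patterns only: simplified assumptions, not the expressions
# used in the work described in this post.
PATTERNS = {
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "md5": re.compile(r"\b[a-fA-F0-9]{32}\b"),
}


def extract(text):
    """Return {data_type: sorted unique matches} for one report's free text."""
    return {name: sorted(set(rx.findall(text))) for name, rx in PATTERNS.items()}


# Made-up report text for the example.
report = (
    "Host 10.1.2.3 received mail from bad.actor@example.com with an "
    "attachment (MD5 d41d8cd98f00b204e9800998ecf8427e)."
)
found = extract(report)
for data_type, values in found.items():
    print(data_type, values)
```

Patterns this simple cannot say how many values they miss, which is the gap the benchmarking work below is meant to close.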

In our current work, we wanted to improve extraction of new data types from the free text of incident reports and conversations. We also wanted to improve the readability of our current regex and establish extraction benchmarks to measure performance.

Our goals can be summarized as follows:

  • Identify useful information in the cyber incident reports that we do not currently collect and write new regular expressions to extract it. Specifically, we looked for data types that were already being reported in free text (even if they were not on the incident reporting form) and that are valuable for situational awareness and cross-report attack pattern recognition.
  • Make existing extraction methods, and our library of expressions, more readable and manageable.
  • Measure the performance of each modified and each new expression that we write.

Establishing Ground Truth

Our first step was to explore the reports for new information and to establish a ground truth test set containing all the values we would want to extract from a given sample of reports. For our test set, we randomly collected 50 sample cyber incident reports that were rich in content and usually contained one or two data types and values. We then added atypical or edge-case reports (such as a report with multiple IP addresses, filenames, log file extracts, and a threat analysis by the cyber incident reporter).

For each sample, we manually reviewed the text and recorded data types and values of interest into a separate test dataset. We used this test set in a variety of ways, including evaluating our results for false positives and false negatives.
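The comparison against a hand-labeled test set can be sketched in a few lines of Python. The function name, the sample values, and the precision and recall figures are assumptions added for illustration; the post itself only states that false positives and false negatives were evaluated.

```python
def score(extracted, ground_truth):
    """Compare a set of extracted values with a hand-labeled set."""
    extracted, ground_truth = set(extracted), set(ground_truth)
    true_pos = extracted & ground_truth
    false_pos = extracted - ground_truth   # extracted but not labeled
    false_neg = ground_truth - extracted   # labeled but missed
    # Precision and recall are added here as a convenience summary.
    precision = len(true_pos) / len(extracted) if extracted else 0.0
    recall = len(true_pos) / len(ground_truth) if ground_truth else 0.0
    return {
        "false_positives": false_pos,
        "false_negatives": false_neg,
        "precision": precision,
        "recall": recall,
    }


# Hypothetical values: "2.5.4.3" stands for a version number mistaken for an IP.
truth = {"10.1.2.3", "192.0.2.7", "evil.example.com"}
got = {"10.1.2.3", "evil.example.com", "2.5.4.3"}
result = score(got, truth)
print(result["false_positives"], result["false_negatives"])
```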

Building this sample was an iterative process, and we realized we needed a tool to explore the data faster and find the right reports for our sample. The question we had, "What data types are present in the entire free-text corpus?", was not supported by existing tools. To build a sample of reports with unknown data types, or with values that our existing expressions were missing (false negatives), we needed to use more permissive search expressions.

Developing a Bulk Query Tool

To address this need, our team developed a bulk query tool for targeted sample selection, discovering unknown edge cases, and testing regex changes. The bulk query tool tests regexes over a random selection of reports from the corpus. To start building this tool, we manually reviewed our sample reports and noted the many different ways that people express the same data type. We then wrote an automated process to recognize those different notations and return them in a standardized way. Having a more permissive search allowed us to find new reports repeatedly, with variations on how artifacts of the same type were being expressed.

The development of a bulk query tool allowed us to go through the entire corpus of incident reports and pass in arbitrary regexes. We could adjust our original regexes (for instance, to make them more permissive) and see the resulting impact on performance. Usually, that would produce many more false positives, but at the same time we discovered edge cases we missed on our initial pass. Then we could tweak our regexes to keep only the true positives.
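The following Python sketch shows one way a bulk query of this kind could work. It is an illustration under stated assumptions, not the tool itself: the `bulk_query` function, the in-memory `reports` dictionary, the ticket IDs, and both patterns are hypothetical.

```python
import random
import re
from collections import Counter


def bulk_query(reports, pattern, sample_size=None, seed=None):
    """Run one regex over an (optionally random) sample of report texts.

    `reports` maps a report id to its free text; both are stand-ins for
    whatever a real ticketing system would provide.
    """
    rx = re.compile(pattern, re.VERBOSE | re.IGNORECASE)
    ids = sorted(reports)
    if sample_size is not None and sample_size < len(ids):
        ids = random.Random(seed).sample(ids, sample_size)
    hits = Counter()
    by_report = {}
    for rid in ids:
        matches = [m.group(0) for m in rx.finditer(reports[rid])]
        if matches:
            by_report[rid] = matches
            hits.update(matches)
    return by_report, hits


# Made-up report texts for the example.
reports = {
    "T-1": "Beaconing to 203.0.113.9 and 203[.]0[.]113[.]10 observed.",
    "T-2": "User reported phishing email; no indicators attached.",
    "T-3": "Scan from 198.51.100.4 (also written 198 dot 51 dot 100 dot 4).",
}
strict = r"\d{1,3}(?:\.\d{1,3}){3}"
permissive = r"\d{1,3}(?:\s?[\[(]?(?:\.|dot)[\])]?\s?\d{1,3}){3}"
_, strict_hits = bulk_query(reports, strict)
_, loose_hits = bulk_query(reports, permissive)
print(len(strict_hits), len(loose_hits))
```

In this toy corpus the permissive pattern also returns the bracketed and spelled-out addresses that the strict pattern misses, which is the kind of edge case the tool was built to surface.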

Our bulk query tool also measures our extraction statistics between runs. If we make a change to our set of regular expressions and then run them against the corpus, the tool will save and compare the extraction results against the previous run. This feature allowed us to immediately determine whether

  • we caught the edge cases that we were hoping to catch with our regex modification
  • our modifications dropped artifacts that were being captured on previous runs
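A run-to-run comparison like this reduces to set differences. The sketch below assumes each run is stored as a mapping from data type to the set of extracted values; that storage format and the `compare_runs` name are assumptions for illustration, not a description of the actual tool.

```python
def compare_runs(previous, current):
    """Diff two extraction runs, each a {data_type: set_of_values} mapping."""
    report = {}
    for dtype in sorted(set(previous) | set(current)):
        before = previous.get(dtype, set())
        after = current.get(dtype, set())
        report[dtype] = {
            "gained": after - before,    # newly captured, e.g. edge cases
            "dropped": before - after,   # regressions to investigate
            "count_before": len(before),
            "count_after": len(after),
        }
    return report


# Hypothetical results from two runs over the same corpus.
previous = {"ipv4": {"10.1.2.3", "192.0.2.7"}, "fqdn": {"evil.example.com"}}
current = {
    "ipv4": {"10.1.2.3", "198.51.100.4"},
    "fqdn": {"evil.example.com", "bad.example.org"},
}
diff = compare_runs(previous, current)
for dtype, stats in diff.items():
    print(dtype, "gained:", stats["gained"], "dropped:", stats["dropped"])
```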

Using our bulk query tool, we identified many new data types that appeared frequently in the reports but that we were not yet extracting. Here are some examples of new categories that we now extract (the full list of extraction types appears below):

  • top-level domains (other than .com, .edu, etc.) such as .pineapple, .biz, .academy, .bank, .aero, .museum, .mobi, and many more
  • threat and malware names such as Trojan.Win32.VBKrypt.ovip, Win32.FAREIT.A/Zeus
  • countries and their adjectivals, for instance, 'Switzerland' and 'Swiss'

The Challenge of Extracting Defanged Items

One challenge we encountered in our work is defanging. The computer security community has adopted this term to help defenders avoid accidentally running malicious code while performing analysis or when passing information about attacks to other defenders. Defanging obfuscates indicators into a safer representation so that a user doesn't accidentally click on a malicious URL or accidentally run malicious code.
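As a small illustration, the Python sketch below defangs an indicator in one common style and reverses a few common notations. The `defang` and `refang` functions are written for this example only; they are assumptions and do not reproduce any particular library or the notations handled in this work.

```python
import re


def defang(indicator):
    """Rewrite an indicator so it is no longer clickable (one common style)."""
    out = re.sub(r"^http", "hxxp", indicator, flags=re.IGNORECASE)
    return out.replace(".", "[.]")


def refang(text):
    """Undo a few common defanging notations (bracketed '.' or 'dot')."""
    out = re.sub(r"\bhxxp", "http", text, flags=re.IGNORECASE)
    out = re.sub(
        r"\s*[\[(]\s*(?:\.|dot)\s*[\])]\s*", ".", out, flags=re.IGNORECASE
    )
    return out


safe = defang("http://evil.example.com/a.exe")
print(safe)
print(refang(safe))
print(refang("evil(dot)example(dot)com"))
```

Because defanging has no single standard, a reverse mapping like `refang` only covers the notations it was written for.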

There is no universal standard for defanging, although there are some common methods. There is even a Python module to defang certain data types. Typical types of defanged data include IP addresses, fully qualified domain names (FQDN), email, and file extensions. The defanging we have observed in our reports includes several ways of writing the dot separator, such as "[.]", "[.", and "dot".

While tackling the challenge of recognizing different defanging notations, we realized our goal of regex modularity and readability was particularly important. Modular regexes are much easier to read, and they are also easier for others to reuse and maintain.

Here is an example of a regex for an IPv4 address that is not modular, followed by the same expression written in modular form.

Not modular:

ipv4 = r"""
  (?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])
  (?:[\[\(<{](?:\.|dot|DOT)[\]\)>}]?|[\[\(<{]?(?:\.|dot|DOT)[\]\)>}]|[\[\(<{][dD][\]\)>}]|\s(?:\.|dot|DOT)\s|(?:\.|dot|DOT))
  (?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])
  (?:[\[\(<{](?:\.|dot|DOT)[\]\)>}]?|[\[\(<{]?(?:\.|dot|DOT)[\]\)>}]|[\[\(<{][dD][\]\)>}]|\s(?:\.|dot|DOT)\s|(?:\.|dot|DOT))
  (?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])
  (?:[\[\(<{](?:\.|dot|DOT)[\]\)>}]?|[\[\(<{]?(?:\.|dot|DOT)[\]\)>}]|[\[\(<{][dD][\]\)>}]|\s(?:\.|dot|DOT)\s|(?:\.|dot|DOT))
  (?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])
"""

Modular:

ipv4 = r"""
  %(quad)s %(dot)s %(quad)s %(dot)s %(quad)s %(dot)s %(quad)s
""" % primitives
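To show how the modular form might be assembled and used, here is a self-contained Python sketch. The `quad` and `dot` fragments are taken from the expressions in this post; compiling with `re.VERBOSE`, the `IPV4` name, and the test strings are assumptions made for illustration.

```python
import re

# Reusable building blocks. Whitespace and "#" comments inside the
# fragments are ignored because the final pattern uses re.VERBOSE.
primitives = {
    "quad": r"(?: 25[0-5] | 2[0-4][0-9] | 1[0-9][0-9] | [1-9][0-9] | [0-9] )",
    "dot": r"""
        (?:
            # asymmetric brackets ok
            [\[\(<{] (?: \. | dot | DOT ) [\]\)>}]?
          | [\[\(<{]? (?: \. | dot | DOT ) [\]\)>}]
          | [\[\(<{] [dD] [\]\)>}]
            # spaces must both be present
          | \s (?: \. | dot | DOT ) \s
            # plain dot has to come last
          | (?: \. | dot | DOT )
        )
    """,
}

ipv4 = r"""
    %(quad)s %(dot)s %(quad)s %(dot)s %(quad)s %(dot)s %(quad)s
""" % primitives

IPV4 = re.compile(ipv4, re.VERBOSE)

for candidate in ["192.168.1.1", "192[.]168[.]1[.]1",
                  "10 dot 0 dot 0 dot 1", "192.168.1"]:
    print(candidate, bool(IPV4.fullmatch(candidate)))
```

Each fragment is defined once, so a new defanging notation only has to be added to the `dot` entry to take effect everywhere it is used.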

Our bulk query tool also helped us identify many variations of defanging notation.

Optimized Regular Expressions

We also made use of a solution for category types that autogenerates optimized regexes from a list of raw tokens. The resulting regex is based on a trie (prefix tree) data structure. Here is an example of how an optimized regular expression is constructed:

Given a list of tokens you want to find and extract, for example:

dog dingo cat doggo

A naive regex (manually created):


If your list is small, regex optimization can be done manually, especially if you are good with regexes, but it is not so easy when the list comprises hundreds of tokens.
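A trie-based generator for the four tokens above can be sketched in Python as follows. This is a minimal illustration of the general technique, not the autogeneration tool itself; the function names and the naive baseline are assumptions.

```python
import re


def build_trie(tokens):
    """Build a nested-dict prefix tree; the key "" marks the end of a token."""
    root = {}
    for token in tokens:
        node = root
        for ch in token:
            node = node.setdefault(ch, {})
        node[""] = {}
    return root


def trie_to_regex(node):
    """Serialize a trie node into a regex fragment with shared prefixes."""
    optional = "" in node  # a token may end at this node
    branches = [re.escape(ch) + trie_to_regex(child)
                for ch, child in sorted(node.items()) if ch != ""]
    if not branches:
        return ""
    if len(branches) == 1 and not optional:
        return branches[0]
    if len(branches) == 1 and len(branches[0]) == 1:
        return branches[0] + "?"
    body = "(?:" + "|".join(branches) + ")"
    return body + "?" if optional else body


tokens = ["dog", "dingo", "cat", "doggo"]
# A naive baseline (an assumption): plain alternation, longest tokens first.
naive = "(?:" + "|".join(sorted(tokens, key=len, reverse=True)) + ")"
optimized = trie_to_regex(build_trie(tokens))
print(naive)
print(optimized)  # (?:cat|d(?:ingo|og(?:go)?))
```

Shared prefixes such as the "d" in "dog", "dingo", and "doggo" are written once, so the regex engine has fewer alternatives to try at each position.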

For example, here is a small portion of the regex for matching approximately 1,300 top-level domains:

primitives["tld"] = """
(?:(?:a(?:c(?:c(?:ountants?|enture)|t(?:ive|or)|ademy|o)?|l(?:i(?:baba|pay)|l(?:finanz|y)|sace)?|b(?:b(?:ott|vie)?|udhabi|ogado)|u(?:t(?:hor|os?)|ction|dio?)?|n(?:alytics|droid|quan)|r(?:amco|chi|my|pa|te)?|i(?:r(?:force|tel)|g)?|p(?:artments|p(?:le)?)|m(?:sterdam|ica)?|s(?:sociates|ia)?|g(?:akhan|ency)?|d(?:ult|ac|s)?|q(?:uarelle)?|t(?:torney)?|e(?:ro|g)?|a(?:rp|a)|z(?:ure)?|vianca|fl?|kdn|ws?|xa?|o))
|(?:b(?:a(?:r(?:c(?:lay(?:card|s)|elona)|efoot|gains)?|n(?:d|k)|uhaus|yern|idu|by)?|o(?:s(?:tik|ch)|o(?:ts|k)?|ehringer|utique|ats|nd|m|t)?|r(?:o(?:adway|ther|ker)|idgestone|adesco|ussels)?|u(?:ild(?:ers)?|dapest|siness|gatti|zz|y)|l(?:ack(?:friday)?|oomberg|ue)|e(?:ntley|rlin|ats|er|st|t)?|i(?:ngo?|ble|ke|d|o|z)?|n(?:pparibas|l)?|b(?:va|c)?|h(?:arti)?|m(?:s|w)?|c(?:g|n)|zh?|d|f|g|j|s|t|v|w|y))
|(?:c(?:o(?:m(?:p(?:a(?:ny|re)|uter)|m(?:unity|bank)|sec)?|n(?:s(?:truction|ulting)|t(?:ractors|act)|dos)|u(?:pons?|ntry|rses)|l(?:lege|ogne)|o(?:king|l|p)|rsica|ffee|ach|des)?|a(?:r(?:e(?:ers?)?|avan|tier|ds|s)?|n(?:cerresearch|on)|p(?:etown|ital)|s(?:ino|a|h)|t(?:ering)?|m(?:era|p)|ll?|fe)?|l(?:i(?:ni(?:que|c)|ck)|o(?:thing|ud)|ub(?:med)?|eaning|aims)?|h(?:a(?:n(?:nel|el)|se|t)|r(?:istmas|ome)|urch|eap|loe)?|r(?:edit(?:union|card)?|icket|uises|own|s)?|i(?:t(?:y(?:eats)?|ic)|priani|rcle|sco)?|e(?:nter|rn|b|o)|u(?:isinella)?|y(?:mru|ou)?|f(?:a|d)?|b(?:a|n)|sc|c|d|g|k|m|n|v|w|x|z))
|(?:d(?:e(?:l(?:ivery|oitte|ta|l)|nt(?:ist|al)|al(?:er|s)|si(?:gn)?|mocrat|gree|v)?|i(?:rect(?:ory)?|amonds|scount|gital|et)|a(?:t(?:ing|sun|e)|bur|nce|d|y)|o(?:wnload|mains|cs|ha|g)?|u(?:rban|bai)|rive|clk|vag|ds|np|j|k|m|z))
|(?:e(?:x(?:p(?:osed|ress|ert)|traspace|change)|n(?:gineer(?:ing)?|terprises|ergy)|d(?:u(?:cation)?|eka)|u(?:rovision|s)?|ve(?:rbank|nts)|m(?:erck|ail)|s(?:tate|q)?|a(?:rth|t)|quipment|r(?:ni)?|pson|c|e|g|t))
|(?:f(?:i(?:na(?:nc(?:ial|e)|l)|r(?:estone|mdale)|sh(?:ing)?|t(?:ness)?|lm)?|a(?:i(?:rwinds|th|l)|s(?:hion|t)|mily|ns?|ge|rm)|o(?:r(?:sale|ex|um|d)|o(?:tball)?|undation|x)?|l(?:i(?:ghts|ckr|r)|o(?:rist|wers)|smidth|y)|r(?:o(?:ntier|gans)|esenius|l)?|u(?:rniture|tbol|nd)|e(?:edback|rrero)|tr|yi|j|k|m))
|(?:g(?:o(?:l(?:d(?:point)?|f)|o(?:g(?:le)?)?|p|t|v)|r(?:a(?:inger|phics|tis)|een|ipe|oup)?|u(?:i(?:tars|de)|ardian|cci|ge|ru)?|a(?:l(?:l(?:ery|up|o))?......

Using the optimized versions of the regexes yielded far better performance than using traditional baseline regexes. Based on our metrics, we gained 40 percent more efficiency.

Combining auto-generation with our modular approach allows us to write expressions that

  • match large lists of input but remain relatively easy to read
  • cover multiple defanging use cases, including ".", "dot", "[.]", "[.", ".]", and more

For example, here's how we can combine the top-level domain regex with a regex for defanged dots in order to match fully qualified domain names (fqdn):

primitives["dot"] = r"""
  (?:
    # asymmetric brackets ok
    [\[\(<{] (?: \. | dot | DOT ) [\]\)>}]?
    | [\[\(<{]? (?: \. | dot | DOT ) [\]\)>}]
    | [\[\(<{] [dD] [\]\)>}]
    # spaces must both be present
    | \s (?: \. | dot | DOT ) \s
    # plain dot has to come last
    | (?: \. | dot | DOT )
  )
"""

# primitives["tld"]: see the definition above

fqdn = r"""
  (?:
    # subdomains
    (?: [a-zA-Z0-9][a-zA-Z0-9\-_]* %(dot)s )+
    # TLD
    (?: %(tld)s )
  )
""" % primitives

Final Results

To measure our improvement, we ran our bulk query tool on the set of pre-selected and manually parsed reports and measured false positives and false negatives between runs. Here's what we found:

  • We identified and successfully extracted 14 new incident data types, bringing our total to 24 data types.
  • The number of observables extracted from three years of report data rose from 380,000 to 1,800,000.
  • We now recognize and extract many cases of defanging in our data.
  • Our regex tools are now much more modular and readable. They are easier to maintain and extend in the future.
  • We also developed tools for manually interrogating our corpus of incident reports, and we now have repeatable methods for identifying new incident data types in the future.

The 24 types of incident data that we extract now include the following:

  • IPv4
  • IPv4 CIDR
  • IPv4 range
  • IPv6
  • IPv6 CIDR
  • IPv6 range
  • MD5
  • SHA1
  • SHA256
  • ssdeep
  • FQDN
  • email address
  • URL
  • user agent
  • filename
  • filepath
  • registry key
  • ASN
  • CVE
  • country
  • ISP
  • ASN owner
  • malware
  • attack type

Looking Ahead

The work described in this post is one piece of a larger information discovery project aimed at determining (1) what type of metrics a computer security incident response team (CSIRT) should collect and (2) how to apply statistical analysis techniques to incident management (to identify metrics to collect as part of an organization's data analysis operations). This broader project focuses on the following three activities:

  • exploratory analysis. As outlined in this post, using time-tested analytical methods from various fields, we are working to demonstrate methods for improving the quality of incident data and corresponding collection and analysis processes. Specifically, we want to improve incident/indicator analysis capabilities. (See the upcoming SEI Blog post by Zach Kurtz and Sam Perl on Measuring Similarity Between Cyber Security Incident Reports.)
  • general metrics research. We also perform research to identify the current use and effectiveness of incident management metrics and measurements, taxonomies, and incident classifications within communities, such as CSIRTs and general security teams.
  • collaboration and support of other organizations external to US-CERT. These tools are useful to any organization receiving, storing, and analyzing high volumes of cyber security incident reports.

This post has focused on information extraction work to better enable statistical analysis techniques on incident management, which is part of the exploratory analysis task.

We have only scratched the surface of creating a catalog of data types for categories such as malware and attack types. There is considerable opportunity to extract more structure about those items when they are already present in the reports. The same is true for pasted excerpts of log files and for other contextual analysis.

These reports are not only a rich source of observables (which we extract using regex); they also often contain significant additional context from the cyber incident reporter and their analysis teams. Some reporting organizations have dedicated malware reverse engineers, network response experts, or other security professionals who already include high-quality notes in their reports. But this information is often reported in a unique way by each team. Normalizing this information into a common structure for analysis that supports key objectives for defense and situational awareness is challenging.

Applying new methods to extract existing knowledge that has already been reported over long periods of time allows us to directly link analysis already performed to a set of threat observables. We hope this work can help analysts quickly identify information about the extent of attacks, enabling better and faster response and mitigation. This work also provides new tools and methods for collecting indicators to help prevent, detect, and contain malicious activity.

Additional Resources

We will be presenting on this work at FloCon 2018, which will be held January 8-11, in Tucson, Ariz.

We plan to publish a follow-up blog post from Zach Kurtz and Sam Perl on Measuring Similarity Between Cyber Security Incident Reports.
