New SEI CERT Tool Extracts Artifacts from Free Text for Incident Report Analysis

The CERT Division of the Software Engineering Institute (SEI) at Carnegie Mellon University recently released the Cyobstract Python library as an open source tool. You can use it to quickly and efficiently extract artifacts from free text in a single report, a collection of incident reports, threat assessment summaries, or any other textual source.

Cybersecurity teams need to collect and process data from incident reports, blog posts, news feeds, threat reports, IT tickets, incident tickets, and more. Incident analysts often look for technical artifacts to help them investigate a threat and develop a mitigation. This work is typically done manually by cutting and pasting between sources and tools. Cyobstract helps extract key artifact values from any kind of text, allowing incident analysts to focus their attention on processing the data once it's extracted. Once artifact values are extracted, they can be more easily correlated across multiple datasets.

After Cyobstract extracts security-relevant data types, analysts can use the data for higher-level downstream analysis and investigation of the source incident reports. The resulting data can also be loaded into a database, such as an indicator database. Analysts can then use the extracted artifacts to quickly find commonality and similarity across reports, potentially revealing patterns that might otherwise have remained hidden.
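To make the idea of finding commonality across reports concrete, here is a minimal sketch that uses only the Python standard library. It assumes extraction has already happened and that the results are held in a plain dictionary keyed by report ID; the function name `shared_artifacts` and the sample data are hypothetical and are not part of Cyobstract.

```python
from collections import defaultdict

def shared_artifacts(extracted_by_report):
    """Return artifact values that appear in more than one report.

    `extracted_by_report` maps a report ID to an iterable of extracted
    artifact values (a hypothetical structure chosen for this sketch).
    """
    seen_in = defaultdict(set)
    for report_id, artifacts in extracted_by_report.items():
        for value in artifacts:
            seen_in[value].add(report_id)
    # Keep only values observed in at least two distinct reports.
    return {value: sorted(ids) for value, ids in seen_in.items() if len(ids) > 1}

# Hypothetical, already-extracted artifacts from three reports.
extracted = {
    "report-1": ["192.0.2.10", "CVE-2017-0144"],
    "report-2": ["198.51.100.7", "CVE-2017-0144"],
    "report-3": ["192.0.2.10"],
}
overlap = shared_artifacts(extracted)
```

In practice the same lookup could be done with a query against an indicator database; the in-memory dictionary is only the simplest way to show the correlation step.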

At its core, the Cyobstract library is built around a collection of robust regular expressions that target 24 specific security-relevant data types of potential interest, including the following:

IP addresses: IPv4, IPv4 CIDR, IPv4 range, IPv6, IPv6 CIDR, and IPv6 range
Hashes: MD5, SHA1, SHA256, and ssdeep
Internet and system-related strings: FQDN, URL, user agent strings, email addresses, filenames, file paths, and registry keys
Internet infrastructure values: ASN, ASN owner, country, and ISP
Security analysis values: CVE, malware, and attack type
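As a rough illustration of regex-based extraction for a few of these types, the sketch below uses Python's standard `re` module. The three patterns are simplified stand-ins written for this example, not the expressions that ship with Cyobstract, and the names `PATTERNS` and `extract_artifacts` are hypothetical.

```python
import re

# Simplified illustrative patterns; Cyobstract's own expressions are more thorough.
OCTET = r"(?:25[0-5]|2[0-4]\d|1?\d?\d)"
PATTERNS = {
    "ipv4": re.compile(r"\b(?:" + OCTET + r"\.){3}" + OCTET + r"\b"),
    "md5": re.compile(r"\b[a-fA-F0-9]{32}\b"),
    "cve": re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE),
}

def extract_artifacts(text):
    """Return a dict mapping each artifact type to its sorted unique matches."""
    return {name: sorted(set(pattern.findall(text))) for name, pattern in PATTERNS.items()}

report = (
    "Host 192.0.2.10 dropped a file with MD5 "
    "d41d8cd98f00b204e9800998ecf8427e while exploiting CVE-2017-0144."
)
artifacts = extract_artifacts(report)
```

Each pattern uses only non-capturing groups, so `findall` returns whole matches rather than sub-groups.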

Cyobstract can extract these artifacts even if they are malformed in some way. For example, in the incident response community, it's standard practice for many teams to "defang" indicators of compromise before storing them or sending them to another team. Defanged indicator values can be difficult for automated solutions to extract. There is no standard practice for defanging, so there are many ways it can be done. Cyobstract was built from a large collection of real incident reports from many organizations, so it can handle many ways of defanging that CERT researchers have observed in the field.
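One way to picture the defanging problem is a normalization pass that rewrites a few commonly seen defanged forms back to their original shape before extraction. The sketch below takes that approach with the standard `re` module; it is an assumption made for illustration and may differ from how Cyobstract handles defanged values internally. The rule list and the `refang` name are hypothetical and cover only a handful of styles.

```python
import re

# A few commonly seen defanging styles; real reports contain many more variants.
# Order matters: "[:]" is restored before the "hxxp://" rule runs.
REFANG_RULES = [
    (re.compile(r"\[\.\]|\(\.\)|\[dot\]|\(dot\)", re.IGNORECASE), "."),
    (re.compile(r"\[at\]|\(at\)|\[@\]", re.IGNORECASE), "@"),
    (re.compile(r"\[:\]"), ":"),
    (re.compile(r"\bhxxp(s?)://", re.IGNORECASE), r"http\1://"),
]

def refang(text):
    """Apply each rewrite rule in turn and return the normalized text."""
    for pattern, replacement in REFANG_RULES:
        text = pattern.sub(replacement, text)
    return text

url = refang("hxxp://evil[.]example[.]com/a")
secure_url = refang("hxxps[:]//198.51.100[.]7")
email = refang("admin[at]example(dot)org")
```

A rule list like this is easy to extend, but it also shows why the problem is hard: every new defanging convention seen in the field needs another rule.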

In addition to the core extraction library, Cyobstract includes developer tools that teams can use to craft their own regular expressions and capture custom security data types. Analysts typically capture such types by matching against a list of names; over time, however, that list can grow large and become slow to match against. Using Cyobstract, analysts can input their lists of terms to match (such as a list of common malware names) and get an optimized regex as output. Matching with this expression is significantly faster than checking the text against each name on the list.
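A common technique for turning a term list into a single compact expression is to build a prefix tree (trie) and emit a regex that factors out shared prefixes. The sketch below shows that technique with the standard library only; it is not Cyobstract's actual tool, and the function names and the sample malware names are chosen for illustration.

```python
import re

def build_trie(terms):
    """Build a nested-dict trie; the empty-string key marks the end of a term."""
    trie = {}
    for term in terms:
        node = trie
        for ch in term:
            node = node.setdefault(ch, {})
        node[""] = {}
    return trie

def trie_to_regex(node):
    """Recursively convert a trie node into a regex fragment."""
    if "" in node and len(node) == 1:
        return ""  # a term ends here and nothing follows
    optional = "" in node  # a term ends here, but longer terms continue
    alternatives = [
        re.escape(ch) + trie_to_regex(node[ch])
        for ch in sorted(key for key in node if key != "")
    ]
    if len(alternatives) == 1 and not optional:
        return alternatives[0]
    body = "(?:" + "|".join(alternatives) + ")"
    return body + "?" if optional else body

def build_term_regex(terms):
    """Compile one case-insensitive, word-bounded regex matching any term."""
    if not terms:
        raise ValueError("at least one term is required")
    return re.compile(r"\b" + trie_to_regex(build_trie(terms)) + r"\b", re.IGNORECASE)

malware = build_term_regex(["zeus", "zbot", "zeroaccess", "emotet"])
found = malware.findall("Seen: Zeus, Emotet and zeroaccess; not zebra")
```

Because shared prefixes are factored out, the regex engine examines each leading character once instead of retrying every name from the start, which is where the speedup over a flat list comes from.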

The library also includes benchmarking tools that can track the effect of changes to regular expressions and present the analyst with feedback on the overall effectiveness of the individual change (e.g., this regex change found three more reports than the previous version).
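The kind of feedback described above can be sketched as a comparison of two versions of a pattern over the same set of reports. The example below is a hypothetical stand-in written with the standard library, not Cyobstract's benchmarking tool; the function names and the sample reports are assumptions.

```python
import re

def reports_matched(pattern, reports):
    """Return the set of report indices where the pattern matches at least once."""
    return {i for i, text in enumerate(reports) if pattern.search(text)}

def compare_patterns(old, new, reports):
    """Summarize how a regex change alters which reports are matched."""
    old_hits = reports_matched(old, reports)
    new_hits = reports_matched(new, reports)
    return {
        "old_count": len(old_hits),
        "new_count": len(new_hits),
        "gained": sorted(new_hits - old_hits),
        "lost": sorted(old_hits - new_hits),
    }

reports = [
    "Exploits CVE-2017-0144",
    "Patched cve-2014-0160 last week",
    "Tracks CVE-2018-1000001",
    "No identifiers here",
]
old = re.compile(r"\bCVE-\d{4}-\d{4}\b")
new = re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE)
summary = compare_patterns(old, new, reports)
```

Reporting both the gained and the lost reports matters, because a change that finds new matches can also silently drop ones the previous version caught.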

The Cyobstract library can be downloaded from GitHub at https://github.com/cmu-sei/cyobstract.

Software Engineering Institute

SEI Blog

Matthew Sisk and Samuel J. Perl

September 26, 2018
