
A Fighting Chance: Arming the Analyst in the Age of Big Data

The 2017 SEI Year in Review highlights the work of the institute undertaken from October 1, 2016, to September 30, 2017. This blog post, which was published in the 2017 Year in Review, highlights the work of three SEI researchers who work to help military analysts in the age of big data.

Security and defense often come down to a numbers game. The Department of Defense (DoD) needs more trained analysts, and the analysts it already fields must process an exploding volume of sensor and intelligence data. The SEI is working on several fronts to help develop qualified analysts, to help the DoD get more out of every analyst it fields, and to help DoD analysts operate at the pace of the adversary.

Tactical Analytics

An SEI team led by Edwin Morris is developing an edge analytics pipeline for streaming situational awareness. The team first created a platform for building and testing data analytics for streaming textual data. It then tested this platform by analyzing social media and other communications in large public-safety settings, such as multi-day music festivals and sporting events.

Building on this work, Morris's team began developing algorithms for extracting patterns of life (or "scripts") from video and streaming data. A script represents a stereotypical sequence of events and interactions in a particular context. Scripts help analysts relate emerging situations, captured in large volumes of streaming data, to what is already known (such as the typical sequence of events observed when ISIS takes over a village). "Our long-term goal," said Morris, "is to build the pipeline to recognize events in streaming data, determine the credibility of those events, and extract the scripts for interpolation and extrapolation by analysts."

Prioritizing Alerts from Static Analysis

The SEI's Lori Flynn leads a project that uses classification models to help analysts and coders prioritize which static analysis alerts to address. This is a tough problem for analysts, who must validate the alerts generated by one or more static analysis tools, each of which can flag many potential code flaws. Manually auditing every alert and repairing every confirmed flaw is often more than one analyst, or even a group of analysts, can do within a project's budget and schedule.

Flynn's approach to this problem draws on past work in the field in areas such as code contextual information, alert type selection, data fusion, machine learning, and mathematical methods for sorting true and false alerts. It also builds on work in the areas of dynamic detection, graph theory, and model checking. The goal is to produce a statistical classifier that will enable software analysts and coders to prioritize which alerts to address by automatically

    • calculating the confidence that an alert is true or false
    • partitioning alerts into three categories: expected true positive, expected false positive, and indeterminate
    • ordering the alerts in the indeterminate category using a confidence metric
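
The three steps above can be illustrated with a minimal Python sketch. It is not the SEI's classifier: the `Alert` type, the `confidence_true` score, and both threshold values are hypothetical, and the sketch assumes that some upstream model has already assigned each alert a confidence between 0 and 1 that it is a true positive.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    # Hypothetical record; the field names are illustrative, not from the SEI tooling.
    alert_id: str
    confidence_true: float  # assumed model output: confidence the alert is a true positive


def partition_alerts(alerts, true_threshold=0.95, false_threshold=0.05):
    """Split alerts into expected true, expected false, and indeterminate.

    The thresholds are assumptions chosen for illustration only.
    """
    expected_true, expected_false, indeterminate = [], [], []
    for alert in alerts:
        if alert.confidence_true >= true_threshold:
            expected_true.append(alert)
        elif alert.confidence_true <= false_threshold:
            expected_false.append(alert)
        else:
            indeterminate.append(alert)
    # Order the indeterminate alerts so the most likely true positives come first.
    indeterminate.sort(key=lambda a: a.confidence_true, reverse=True)
    return expected_true, expected_false, indeterminate
```

In this sketch, an analyst would manually audit only the indeterminate list, working from the top, while the other two lists are handled according to their expected label.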

To test their approach, Flynn's team is collaborating with three DoD organizations that must address static analysis alerts generated for their codebases, which exceed 100 million significant lines of code. "We expect these organizations to collectively generate approximately 662,000 alerts," said Flynn. "Our goal is to classify 90 percent of flagged anomalies as true and false positives with 95 percent accuracy."

Education and Training

In 2016, the U.S. Army Cyber Command (ARCYBER) tasked the SEI's Cyber Workforce Development (CWD) team with creating training courses for the Big Data Platform (BDP) of the Defense Information Systems Agency (DISA). The BDP consists of multiple services and tools, including popular open source big data and cloud computing solutions, such as Hadoop and Spark. It also includes custom services and applications. Hands-on interaction with the BDP interface is required to properly train ARCYBER computer protection teams and operations research systems analysts.

The CWD hosts and maintains its own multi-node training instance of the BDP. The services on the BDP are closely integrated and must be monitored to ensure that the BDP is running properly. The SEI's training instance of the BDP and the course material are updated to align with the latest version of the BDP.

"The BDP gives the Army and other branches of the military resources for leveraging big data for cybersecurity applications, but the Army has a gap in personnel trained in big data analytics and data science," said the SEI's Sarah Vinksi. "Our modules and training instance of the BDP are being used to address this issue."

The first course developed by the SEI is an introduction to the BDP. It consists of video lessons, hands-on labs that use the training instance of the BDP, and an analyst workstation students can use for each lesson. A second course focused on data science, R Shiny applications, and Spark analytics is in development.

By developing these tools, methods, and educational resources, the SEI is helping the DoD get the most out of its analysts as they work to keep up with an ever-changing environment and growing volumes of data.

Download the 2017 Year in Review.
