In early 2012, a backdoor Trojan malware named Flame was discovered in the wild. When fully deployed, Flame proved very hard for malware researchers to analyze. In December of that year, Wired magazine reported that before Flame had been unleashed, samples of the malware had been lurking, undiscovered, in repositories for at least two years. As Wired also reported, this was not an isolated event. Every day, major anti-virus companies and research organizations are inundated with new malware samples.
Although estimates vary, according to an article published in the October 2013 issue of IEEE Spectrum, approximately 150,000 new malware strains are released each day. Not enough manpower exists to manually address the sheer volume of new malware samples that arrive daily in analysts' queues. Malware analysts instead need an approach that allows them to sort out samples in a fundamental way so they can assign priority to the most malicious of binary files. This blog post describes research I am conducting with fellow researchers at the Carnegie Mellon University (CMU) Software Engineering Institute (SEI) and CMU's Robotics Institute. This research is aimed at developing an approach to prioritizing malware samples in an analyst's queue (allowing them to home in on the most destructive malware first) based on the file's execution behavior.
Existing Approaches to Prioritizing Malware Analysis
Before beginning work on an approach for prioritizing malware analysis, our team of researchers examined existing approaches and found very few. Most institutions in academia, government, and industry analyze malware by randomly selecting samples, ordering them alphabetically, or analyzing a binary file in response to a request for a specific file based on its MD5, SHA-1, or SHA-2 cryptographic hash value.
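To make the hash-based request workflow concrete, here is a minimal Python sketch of looking up a sample by any of its digests. The function names and the in-memory `repository` dictionary are hypothetical and purely illustrative; SHA-256 stands in for the SHA-2 family, and a real repository would index digests rather than recompute them on each request.

```python
import hashlib


def file_digests(data: bytes) -> dict:
    """Return the MD5, SHA-1, and SHA-256 hex digests of a sample's bytes."""
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
    }


def find_sample(repository: dict, requested_hash: str):
    """Return the name of the sample whose digest matches, or None.

    `repository` is a hypothetical mapping of sample name -> raw bytes.
    """
    wanted = requested_hash.strip().lower()
    for name, data in repository.items():
        if wanted in file_digests(data).values():
            return name
    return None
```

Note that a lookup like this only answers "do we have this exact file?"; it says nothing about which sample in the queue deserves attention first, which is the gap our research addresses.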
Our team decided to take a systematic approach that takes incoming samples and analyzes them using runtime analysis. At a high level, this approach collects and categorizes salient features. Using a clustering algorithm, our approach then ideally places each malware sample in an analyst's queue as soon as it arrives in the repository, at a priority based on the analyst's description of the type of malware they want to analyze.
A key idea in our approach involved the use of dynamic analysis to measure the maliciousness of a malware sample based on its execution behavior. The assumption is that all malware samples have certain malicious events they must perform to carry out nefarious deeds. The implementation of these deeds is captured at runtime and may prove to be useful assessment characteristics.
Salient and Inferred Features
When we initially began this research, we extracted more than two dozen features that were a mix of
It's important to note that malware doesn't typically commit all of its nefarious deeds with just one running process. In examining features, we also created malware infection trees to gain a better understanding of the processes and files created by the malware. The paper, "Building Malware Infection Trees," which I co-authored, allowed us to view the malware sample as "a directed tree structure" (see Figure 1 below) with each node representing a file or process that the malware had created and each edge representing file creation, process creation, self-replication or dynamic code injection behavior.
Figure 1. A Malware Infection Tree for Poison Malware.
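As a rough illustration of the directed tree structure described above, the following Python sketch models nodes as files or processes and labels each edge with one of the four behaviors. The class, field names, and example file names are assumptions made for this sketch and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional

# Edge labels named in the post; the string spellings are our own.
EDGE_TYPES = {"file_creation", "process_creation",
              "self_replication", "code_injection"}


@dataclass
class Node:
    name: str
    kind: str                               # "file" or "process"
    edge_from_parent: Optional[str] = None  # None for the root sample
    children: List["Node"] = field(default_factory=list)

    def add_child(self, name: str, kind: str, edge: str) -> "Node":
        """Attach a file or process that this node created or injected into."""
        if edge not in EDGE_TYPES:
            raise ValueError(f"unknown edge type: {edge}")
        child = Node(name, kind, edge)
        self.children.append(child)
        return child


def walk(node: Node) -> Iterator[Node]:
    """Yield every node in the tree, parents before children."""
    yield node
    for child in node.children:
        yield from walk(child)


# Hypothetical example: a sample drops a file, which then runs as a process.
root = Node("sample.exe", "process")
dropped = root.add_child("dropped.exe", "file", "file_creation")
dropped.add_child("dropped.exe", "process", "process_creation")
root.add_child("explorer.exe", "process", "code_injection")

processes = [n.name for n in walk(root) if n.kind == "process"]
```

Walking the tree this way gives the full set of processes whose behavior would need to be monitored, rather than only the process that was originally launched.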
Our research focused on the following three areas of suspicious behavior. All of these behaviors were recorded for the executing malware process and for any related processes in its malware infection tree. We assume that when any member of the malware infection tree performs these behaviors, the resulting observations can be used to assess the malware sample:
A Technical Dive into Our Approach
Once we determined the features we wanted to look at, we turned our attention to creating a training set. To do so, our team compiled a set of malware samples classified as advanced persistent threats (APTs), botnets, and Trojans. We also included known malware that was ranked in the Top 5 most dangerous and most persistent in the wild in the past five years, according to Kaspersky's securelist.org website. We submitted, executed, and analyzed each sample in our runtime analysis framework and extracted our chosen features.
Once we created our training set, we assembled 11,000 malware samples that we clustered based on the training set. The training set gave us an initial understanding of what our algorithms could tell us about how to prioritize the malware. We examined execution behavior at the user and kernel levels during a three-minute runtime analysis.
We then determined, for the whole set, whether behaviors were identified mostly from user-level or kernel-level collected data. Next, we created a set of characteristics from the most pervasive observed execution behaviors, repeated our experiment using a larger mixed malware sample set, and clustered the results based on our set of characteristics. At the end of the analysis, we collected the feature sets for our training set and our test set of 11,000 samples and submitted them for analysis with various machine learning algorithms to determine which one is best for prioritizing malware samples based on our features.
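The idea of grouping new samples around a labeled training set can be sketched in a few lines of Python. This post does not name the algorithms we evaluated, so the nearest-centroid assignment below is only a simple stand-in, and the two-element feature vectors and labels are invented for illustration.

```python
import math
from collections import defaultdict


def centroids(training):
    """Average the feature vectors of each labeled group.

    `training` is a list of (label, feature_vector) pairs.
    """
    groups = defaultdict(list)
    for label, vec in training:
        groups[label].append(vec)
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in groups.items()
    }


def nearest_label(vec, cents):
    """Assign a new sample to the group with the closest centroid."""
    return min(cents, key=lambda label: math.dist(vec, cents[label]))


# Invented features: (network connections opened, files created).
training = [
    ("botnet", [10, 0]), ("botnet", [12, 0]),
    ("trojan", [0, 5]), ("trojan", [0, 7]),
]
cents = centroids(training)
label = nearest_label([9, 1], cents)
```

In practice the feature vectors would hold the behaviors extracted during runtime analysis, and the choice of algorithm is exactly what the comparison described above is meant to settle.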
Dr. Jeff Schneider, a researcher at the Robotics Institute in CMU's School of Computer Science and an expert on classification and clustering, agreed to analyze our feature sets and create a classifier to prioritize malware samples. We are also working with software engineers within the CERT Division's Digital Intelligence and Investigations Directorate Software Engineering group who wrote code for us including the feature extraction code.
The goal of our research was to allow analysts greater efficiency in establishing a priority queue for malware analysis. These priorities can vary based on whether the malware analyst works in the financial industry and is interested in Distributed Denial of Service (DDoS) attacks, botnets, or Trojans, or whether the analyst works in the DoD's cyber command and is interested in APTs and protecting national assets.
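One way to picture an analyst-specific priority queue is to weight each sample by how much the analyst cares about its predicted malware type. The sketch below uses Python's standard `heapq` module; the per-sample score, the predicted types, and the interest weights are all hypothetical placeholders rather than outputs of our system.

```python
import heapq


def build_queue(samples, interest):
    """Build a max-priority queue of samples for one analyst.

    `samples` is a list of (name, predicted_type, score) tuples and
    `interest` maps a malware type to that analyst's weight for it.
    Types the analyst did not list get a weight of 0.
    """
    heap = []
    for order, (name, mtype, score) in enumerate(samples):
        priority = score * interest.get(mtype, 0.0)
        # heapq is a min-heap, so negate the priority; `order` breaks ties
        # in favor of the sample that arrived first.
        heapq.heappush(heap, (-priority, order, name))
    return heap


def drain(heap):
    """Yield sample names from highest to lowest priority."""
    while heap:
        _, _, name = heapq.heappop(heap)
        yield name


samples = [("a.bin", "botnet", 0.5), ("b.bin", "apt", 0.9), ("c.bin", "trojan", 0.8)]
financial_analyst = {"botnet": 1.0, "trojan": 0.5}
ordered = list(drain(build_queue(samples, financial_analyst)))
```

With these weights the analyst sees the botnet sample first and the APT sample last, even though the APT sample has the highest raw score; a different analyst with different weights would get a different ordering from the same feed.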
If our approach works as planned, we will accelerate efforts to formalize an automated prioritization system based on our established set of salient and inferred features. Such a system could be integrated into current analysis frameworks and take a live, continuous malware feed as input.
We welcome your feedback on our approach in the comments section below.
To read the paper, "Building Malware Infection Trees," by Jose Andre Morales, Michael Main, Weiliang Luo, Shouhuai Xu, and Ravi Sandhu, please visit
To read about other malware research initiatives at the SEI, please visit
Visit the SEI Digital Library for other publications by Jose