search menu icon-carat-right cmu-wordmark

Using Machine Learning to Detect Malware Similarity

Sagar Chaki

Malware, which is short for "malicious software," consists of programming aimed at disrupting or denying operation, gathering private information without consent, gaining unauthorized access to system resources, and other inappropriate behavior. Malware infestation is of increasing concern to government and commercial organizations. For example, according to the Global Threat Report from Cisco Security Intelligence Operations, there were 287,298 "unique malware encounters" in June 2011, double the number of incidents that occurred in March. To help mitigate the threat of malware, researchers at the SEI are investigating the origin of executable software binaries that often take the form of malware. This posting augments a previous posting describing our research on using classification (a form of machine learning) to detect "provenance similarities" in binaries, which means that they have been compiled from similar source code (e.g., differing by only minor revisions) and with similar compilers (e.g., different versions of Microsoft Visual C++ or different levels of optimization).

Evidence shows that a majority of malware families generate from the same origin. For example, a 2006 Microsoft Security Intelligence report revealed that the 25 most common families of malware account for more than 75 percent of the detected malware instances. Compounding this problem is the fact that the current cadre of malware analysis tools consists of either manual techniques (requiring extensive time and effort on the part of malware analysts) or automated techniques that are not as accurate (they produce high false-positive or false-negative rates) or are inefficient. In contrast, our approach involves

  1. creating a training set using a sample of binaries
  2. using the training set to learn (or train) a classifier
  3. using the classifier to predict similarity of other binaries

I, along with my colleagues--Arie Gurfinkel, who works with me in the SEI's Research, Technology, and System Solutions Program, and Cory Cohen, a malware analyst with CERT--felt that classification was appropriate for evaluating a binary similarity checker because this form of machine learning is particularly appropriate in instances where closed-form solutions are hard to develop, and a solver can be "trained" using a training set composed of positive and negative examples.

While malware classification is a major aim of provenance-similarity research, there are two main hurdles to applying classification directly to malware binary similarity checking:

  • Classification must be applied to parts of the malware where similarity is expected to manifest most directly. For this research, we decided to apply classification to functions. Intuitively, a function is a fragment of a binary obtained by compiling a source-level procedure or method. Functions are the smallest externally identifiable units of behavior generated by compilers. Similarity at the function level is an indicator of overall similarity between two binaries. For example, malware that originated from the same family will rarely be identical everywhere. Instead they will share important functions.
  • It is hard to develop training sets from malware due to the lack of information on source code and generative compilers. Our research therefore focuses on evaluating open-source software. We believe that a classifier that effectively detects provenance-similarity in open-source functions will also be effective on malware functions because the variation we are targeting (due to changes in source code and compilers) is largely independent of the software itself. For example, the variation introduced by a different compiler version (e.g., introducing stack canaries to detect buffer overflows at runtime) is the same, regardless of whether the source code being compiled is malware or open-source.

More specifically, we selected approximately a dozen C/C++ open-source projects from and compiled them to binaries using Microsoft Visual C++ 2005, 2008, and 2010. We then extracted functions from the binaries using Idapro, which is a state-of-the-art dissembler, and constructed a training set and a testing set from the functions using a tool that we developed atop the Rose compiler infrastructure. Next, we learned a classifier from the training set using the Weka framework. When it comes to classification, the following two main decisions must be considered:

  • What classifier are you going to use?
  • What kind of attribute are you going to use?

We measured the effectiveness of a classifier in terms of two quantities: (1) its F-measure, which is a real number between 0 and 1 that indicates the overall classifier accuracy, and (2) the time required to train the classifier. There is a tradeoff between the two quantities: an F-measure can be increased by using a larger training set, but the training time also increases. We empirically found that the RandomForest classifier was the most effective Weka classifier for our purposes since it has the best F-measure for the same training time.

We repeated the experiment several times with different randomly constructed training and testing sets. To determine the robustness of our results, we repeated our experiments using a different set of open-source software and different versions of Microsoft Visual C++. The results were consistent in all cases, with the F-measures being around 0.95 for RandomForest. This finding is encouraging since it indicates that a provenance-similarity detector based on RandomForest will produce the correct result in more than 95 percent of the cases. We believe that this accuracy is sufficient for use in practical malware analysis situations.

Next, we experimented with various parameters of RandomForest to observe how these parameters affect the tradeoff between its F-measure and its training time. In particular, we focused on two important parameters: the number of trees and the number of attributes. With each attribute, we experimented with different values and measured how the F-measure vs. training time tradeoff changed.

To further improve and evaluate our approach, we developed a suite of the following types of attributes:

  • Semantic attributes, which capture the effect of a binary's execution on specific components of the hardware state, register, and memory locations.
  • Syntactic attributes, which are derived from n-grams and n-perms and represent groups of instruction opcodes that occur contiguously in the library.

We re-evaluated the effectiveness of the classifier using these two types of attributes and concluded that semantic attributes yield better F-measures, but are more expensive to compute than syntactic attributes. Attribute extraction is inherently parallelizable, however, since it is done independently for each function. A rough estimate is that a modern CPU can extract semantic attributes from about 10,000 functions in the CERT catalog every day. Based on this estimate, extracting attributes from malware samples as they are discovered each day is feasible with a modestly sized CPU farm.

We had several false steps along the way. For example, we originally used text files for all of our input and output, which was slow and unwieldy. We therefore decided to store inputs and outputs in a database, which simplified our tools and accelerated our experiments. Another lesson learned was to handle statistical issues and randomness carefully. Since the set of all possible training and testing samples is large, we had to pick random subsets for our experiments. In some cases, we also had to label the samples in a random--yet deterministic--manner so that each sample had a randomly assigned label that stayed the same across all experiments. Constructing a labeling scheme that was both random and deterministic required extra care.

While determining the similarities between binary functions remains a challenge, the preliminary results from our research were presented in a well-received paper at the 2011 Knowledge, Discovery, and Data Mining Conference. Our malware research has also studied fuzzy hashing and sparse representation. Our future research will explore other ways of detecting similarities between functions, including the use of static analysis.

Additional Resources:

For additional details, or to download benchmarks and tools that we have developed and are using as part of our project, please visit

To listen to the CERT podcast, Building a Malware Analysis Capability, please visit

To read other SEI blog posts relating to our malware research, please visit

Get updates on our latest work.

Each week, our researchers write about the latest in software engineering, cybersecurity and artificial intelligence. Sign up to get the latest post sent to your inbox the day it's published.

Subscribe Get our RSS feed