
Learning a Portfolio-Based Checker for Provenance-Similarity of Binaries

Sagar Chaki

As software becomes an ever-increasing part of our daily lives, organizations find themselves relying on software that originates from unknown and untrusted sources. The vast majority of such software is available only as executables, known as "binaries." Many binaries--such as malware or different versions and builds of a software package--are simply minor variants of old programs (or in some cases exact copies) that have been run through a different compiler.

This blog post explains how the ability to detect similarities among binaries is an important tool in malware detection and a growing area of research. For example, the Defense Advanced Research Projects Agency (DARPA)'s Cyber Genome Project recently requested new research that would identify the lineage and provenance of digital artifacts. Binary similarity has also become a key objective for national defense and law enforcement officials, who use it to discover whether a suspected virus is malicious by determining its commonalities with confirmed malware. This information helps analysts safeguard the cyber infrastructure and prevent future attacks.

There are various techniques for checking the similarities between binaries, including:

  • Signatures - digital fingerprints that uniquely define the binary
  • Feature Vectors - popular in the machine-learning community, these allow users to extract the functions from a program
  • Semantic Meaning of Binaries - examines function versus form, e.g., there are many ways to erase a file from a computer, but semantically they all perform the same function
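
To make the first two techniques concrete, here is a minimal Python sketch. It is an illustration only, not the checkers used in our project: it assumes a SHA-256 digest as the signature and counts of overlapping byte pairs as the feature vector, and the two byte strings are hypothetical stand-ins for two builds of the same function. The semantic technique is not sketched here because it requires real program analysis.

```python
import hashlib
import math
from collections import Counter


def signature(binary: bytes) -> str:
    """Exact-match fingerprint: SHA-256 digest of the raw bytes (an assumed choice)."""
    return hashlib.sha256(binary).hexdigest()


def ngram_vector(binary: bytes, n: int = 2) -> Counter:
    """Feature vector: counts of every overlapping byte n-gram (an assumed feature)."""
    return Counter(binary[i:i + n] for i in range(len(binary) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors, from 0.0 to 1.0."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical byte strings standing in for two builds of one function.
build_a = bytes.fromhex("5589e583ec108b45088b550c01d0c9c3")
build_b = bytes.fromhex("5589e583ec188b45088b550c01d0c9c3")  # one byte differs

# The signature check is all-or-nothing; the feature vectors give a graded score.
same_signature = signature(build_a) == signature(build_b)
similarity = cosine_similarity(ngram_vector(build_a), ngram_vector(build_b))
```

A one-byte change is enough to make the signatures disagree, while the feature-vector score stays high. That difference in behavior is one reason no single technique is reliable in every case.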

I, along with my colleagues--Arie Gurfinkel, who works with me in the SEI's Research Technology and System Solutions program, and Cory Cohen, a malware analyst with CERT--believe that no one technique is completely robust and foolproof, i.e., none can claim the absence of false positives or negatives with absolute certainty all the time. We hypothesize that a better approach may be a portfolio-based binary provenance similarity checker that combines a suite of existing checkers to examine the origins of two strings of code for similar lineage.

To validate our hypothesis, we are creating a framework for the portfolio of techniques that analysts can use to more accurately determine whether a binary contains fragments compiled from specific source code. Similar to an investment portfolio, in which a single investment instrument uses other investment instruments, we will develop a single tool for malware analysis that uses other tools.

In our portfolio-based framework, each technique may not perform perfectly when used in isolation. If we combine several of them so that their deficiencies are canceled out, however, the portfolio should perform better than each of the parts. Our framework is based on "supervised learning," which uses classification techniques to allow the framework to learn and adapt to the changing environment and fruitfully apply the many programs that exist to detect binary similarity.
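
The idea of a learned portfolio can be sketched in a few lines of Python. This is a toy under stated assumptions, not our framework: the three checkers (`exact_match`, `ngram_jaccard`, `length_ratio`), the logistic-regression combiner, and every byte string below are hypothetical choices made for illustration. Each checker scores a pair of binaries, and the supervised learner fits weights over those scores from labeled pairs.

```python
import math
from typing import Callable, List, Sequence, Tuple

Checker = Callable[[bytes, bytes], float]


def exact_match(a: bytes, b: bytes) -> float:
    """Signature-style checker: 1.0 only for identical byte strings."""
    return 1.0 if a == b else 0.0


def ngram_jaccard(a: bytes, b: bytes, n: int = 2) -> float:
    """Feature-style checker: Jaccard overlap of the sets of byte n-grams."""
    grams_a = {a[i:i + n] for i in range(len(a) - n + 1)}
    grams_b = {b[i:i + n] for i in range(len(b) - n + 1)}
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


def length_ratio(a: bytes, b: bytes) -> float:
    """Crude checker: ratio of the shorter length to the longer one."""
    longest = max(len(a), len(b))
    return min(len(a), len(b)) / longest if longest else 1.0


PORTFOLIO: List[Checker] = [exact_match, ngram_jaccard, length_ratio]


def scores(a: bytes, b: bytes) -> List[float]:
    """Run every checker in the portfolio on one pair of binaries."""
    return [checker(a, b) for checker in PORTFOLIO]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train(pairs: Sequence[Tuple[bytes, bytes, int]],
          epochs: int = 2000, rate: float = 0.5) -> Tuple[List[float], float]:
    """Fit logistic-regression weights over checker scores by gradient descent."""
    examples = [(scores(a, b), label) for a, b, label in pairs]
    weights, bias = [0.0] * len(PORTFOLIO), 0.0
    for _ in range(epochs):
        for x, label in examples:
            error = sigmoid(bias + sum(w * s for w, s in zip(weights, x))) - label
            weights = [w - rate * error * s for w, s in zip(weights, x)]
            bias -= rate * error
    return weights, bias


def predict(weights: Sequence[float], bias: float, a: bytes, b: bytes) -> float:
    """Estimated probability that a and b share provenance."""
    return sigmoid(bias + sum(w * s for w, s in zip(weights, scores(a, b))))


# Hypothetical byte strings; label 1 marks pairs assumed to share provenance.
A = bytes.fromhex("5589e583ec108b45088b550c01d0c9c3")
A2 = bytes.fromhex("5589e583ec188b45088b550c01d0c9c3")
B = bytes.fromhex("31c0408b5c240431db4bcd80ebfe9090")
B2 = bytes.fromhex("31c0408b5c240831db4bcd80ebfe9090")
C = bytes.fromhex("b8010000004883c4085dc3")
TRAINING_PAIRS = [(A, A2, 1), (A, A, 1), (B, B2, 1),
                  (A, B, 0), (A, C, 0), (B, C, 0)]

weights, bias = train(TRAINING_PAIRS)

# Held-out pairs that were not seen during training.
D = bytes.fromhex("4889e54883ec20897dfc8b45fcc9c3")
D2 = bytes.fromhex("4889e54883ec30897dfc8b45fcc9c3")
verdict_variant = predict(weights, bias, D, D2)
verdict_unrelated = predict(weights, bias, D, B)
```

Because the combiner is learned from labeled pairs rather than hand-tuned, adding a new checker only means appending it to `PORTFOLIO` and retraining. A real portfolio would use far more training data and much stronger individual checkers than these.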

We plan to produce a tool that lets analysts make predictions about binaries with statistically better accuracy than any existing tool. What purpose could such a tool serve? First and foremost, it could help CERT malware analysts develop the source artifacts of a large collection of malicious software that they have collected. This portfolio-based checker will not only help them catalog malware, but also determine its origin more effectively. Improved binary similarity comparison techniques can also help malware analysts prioritize how to spend limited analysis resources by leveraging existing knowledge about previously analyzed malware. In addition to determining the origin, this tool will also help analysts determine whether there has been a violation of intellectual property law.

This research is one of eight exploratory research projects funded in fiscal year 2011 by the SEI, the results of which will help determine what areas of research should become priorities for the SEI. I will be blogging about the progress of this project throughout the year.

For additional details, or to download benchmarks and tools that we have developed, and are using as part of our project, please visit
