Learning a Portfolio-Based Checker for Provenance-Similarity of Binaries

As software becomes an ever-increasing part of our daily lives, organizations find themselves relying on software that originates from unknown and untrusted sources. The vast majority of such software is available only as executables, known as "binaries." Many binaries--such as malware or different versions and builds of a software package--are simply minor variants of old programs (or in some cases exact copies) that have been run through a different compiler.

This blog post explains why the ability to detect similarities among binaries is an important tool in malware detection and a growing area of research. For example, the Defense Advanced Research Projects Agency (DARPA) Cyber Genome Project recently requested new research that would identify the lineage and provenance of digital artifacts. Binary similarity has also become a key objective for national defense and law enforcement officials, who use it to determine whether a suspected virus is malicious by identifying its commonalities with confirmed malware. This information helps analysts safeguard the cyber infrastructure and prevent future attacks.

There are various techniques for checking the similarities between binaries, including:

- Signatures - digital fingerprints that uniquely define the binary
- Feature Vectors - popular in the machine-learning community, these allow users to extract the functions from a program
- Semantic Meaning of Binaries - examines function versus form; for example, there are many ways to erase a file from a computer, but semantically they all perform the same function
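
To make the first two techniques concrete, here is a minimal Python sketch. It is an illustration only, not the tooling used in this project: the choice of SHA-256 as the signature, byte bigrams as the features, cosine similarity as the comparison, and all function names and sample byte strings are assumptions made for the example.

```python
import hashlib
import math
from collections import Counter


def signature(binary: bytes) -> str:
    """Exact-match fingerprint: any single-byte change yields a new digest."""
    return hashlib.sha256(binary).hexdigest()


def ngram_features(binary: bytes, n: int = 2) -> Counter:
    """Feature vector: counts of each n-byte sequence in the binary."""
    return Counter(binary[i:i + n] for i in range(len(binary) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two count vectors (0.0 to 1.0)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up byte strings standing in for real executables (hypothetical data).
original = bytes.fromhex("5589e583ec10") * 20
variant = original[:50] + b"\x90" + original[51:]  # one byte changed
unrelated = bytes(range(256))

print(signature(original) == signature(variant))
print(cosine_similarity(ngram_features(original), ngram_features(variant)))
print(cosine_similarity(ngram_features(original), ngram_features(unrelated)))
```

The signature comparison reports the one-byte variant as a different binary, while the feature-vector comparison still scores it as close to the original. This illustrates why neither technique is sufficient on its own: signatures miss minor variants, and feature-vector scores require a threshold that can produce false positives or negatives.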

I, along with my colleagues--Arie Gurfinkel, who works with me in the SEI's Research Technology and System Solutions program, and Cory Cohen, a malware analyst with CERT--believe that no one technique is completely robust and foolproof, i.e., none can claim the absence of false positives or negatives with absolute certainty all the time. We hypothesize that a better approach may be a portfolio-based binary provenance similarity checker that combines a suite of existing checkers to examine the origins of two strings of code for similar lineage.

To validate our hypothesis, we are creating a framework for the portfolio of techniques that analysts can use to more accurately determine if a binary contains fragments compiled from specific source code. Just as an investment portfolio is a single investment instrument built from other investment instruments, we will develop a single tool for malware analysis that uses other tools.

In our portfolio-based framework, each technique may not perform perfectly when used in isolation. If we combine several of them so that their deficiencies are canceled out, however, the portfolio should perform better than each of the parts. Our framework is based on "supervised learning," which uses classification techniques to allow the framework to learn and adapt to the changing environment and fruitfully apply the many programs that exist to detect binary similarity.
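
The idea of learning how to combine checkers can be sketched as follows. This is a hedged illustration, not our framework: it assumes three hypothetical checkers that each emit a score between 0 and 1 for a pair of binaries, uses logistic regression as one possible classification technique, and trains on made-up labeled pairs. All names and numbers are invented for the example.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_portfolio(score_rows, labels, epochs=2000, lr=0.5):
    """Learn one weight per checker (plus a bias) by stochastic gradient descent."""
    weights = [0.0] * len(score_rows[0])
    bias = 0.0
    for _ in range(epochs):
        for scores, label in zip(score_rows, labels):
            pred = sigmoid(bias + sum(w * s for w, s in zip(weights, scores)))
            error = pred - label  # gradient of the log loss w.r.t. the logit
            weights = [w - lr * error * s for w, s in zip(weights, scores)]
            bias -= lr * error
    return weights, bias


def predict_similar(weights, bias, scores) -> bool:
    """Portfolio verdict: True if the pair is predicted to share provenance."""
    return sigmoid(bias + sum(w * s for w, s in zip(weights, scores))) >= 0.5


# Hypothetical training data. Each row holds the scores of three assumed
# checkers for one pair of binaries: (signature, feature-vector, semantic).
TRAINING_SCORES = [
    (1.0, 0.99, 0.95), (0.0, 0.90, 0.85), (0.0, 0.40, 0.90), (0.0, 0.85, 0.30),
    (0.0, 0.20, 0.10), (0.0, 0.55, 0.15), (0.0, 0.10, 0.45), (0.0, 0.35, 0.30),
]
TRAINING_LABELS = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = same provenance, 0 = unrelated

weights, bias = train_portfolio(TRAINING_SCORES, TRAINING_LABELS)
print(predict_similar(weights, bias, (0.0, 0.95, 0.90)))
print(predict_similar(weights, bias, (0.0, 0.05, 0.10)))
```

In the made-up data, the signature checker fires on only one pair and each of the other two checkers misses at least one true match, yet the learned combination can still separate the two classes. That is the sense in which the deficiencies of individual checkers can cancel out.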

We plan to produce a tool that analysts can use to make predictions about binaries that are statistically better than those of any existing tool. What purpose could such a tool serve? First and foremost, it could help CERT malware analysts develop the source artifacts of the large body of malicious software that they have collected. This portfolio-based checker will help them not only to catalog malware, but also to determine its origin more effectively. Improved binary similarity comparison techniques can also help malware analysts prioritize how to spend limited analysis resources by leveraging existing knowledge about previously analyzed malware. In addition to determining the origin, this tool will help analysts determine if there has been a violation of intellectual property law.

This research is one of eight exploratory research projects funded in fiscal year 2011 by the SEI, the results of which will help determine what areas of research should become priorities for the SEI. I will be blogging about the progress of this project throughout the year.

For additional details, or to download the benchmarks and tools that we have developed and are using as part of our project, please visit http://www.contrib.andrew.cmu.edu/~schaki/binsim/index.html.

Software Engineering Institute

SEI Blog

Sagar Chaki

February 14, 2011
