A New Approach to Modeling Malware using Sparse Representation

Malicious software (known as "malware") is increasingly pervasive with a constant influx of new, increasingly complex strains that wreak havoc by exploiting computers or personal and business information stored therein for malicious or criminal purposes. Examples include code that is designed to pilfer personal and digital credentials; plunder sensitive information from government or business enterprises; or interrupt, misdirect, or render inoperable computer hardware and computer-controlled equipment. This post describes our work to create a rapid search capability that allows analysts to quickly analyze a new piece of malware.

Through our work in cyber security, we have amassed approximately 13 million pieces of malicious software for analysis in a large malware database called the Artifact Catalog. This post explores our research into using Suffix Trees, Zero-suppressed Decision Diagrams (ZDDs), and Sparse Representation Modeling to better represent this corpus of malware for search and retrieval, so that analysts can access it more easily. One recent addition to the database is a group of 60,000 pieces of code recovered from the Zeus/Zbot malware, a type of Trojan horse designed to steal banking information. We are conducting a large-scale analysis to identify similarities among these pieces, but it is a difficult and time-consuming task that consumes substantial resources, including examination and assessment by analysts.

Ideally, with each new piece of malware, analysts would learn more about the data and interpret the software's functionality via reverse engineering, an analytic process that provides an understanding of how the code works logically and mechanically by interpreting coded functions. Reverse engineering a single piece of software can take weeks. Reverse engineering 60,000 pieces of software calls for an analysis process that automates all repetitive tasks, including the identification of similarities in the data. Automation frees the human analyst to focus on interpretation, decision making, and communication of findings. The aim is for a single analyst to comprehend and track the activity of a malware author even when the number of pieces is very large and obfuscation techniques are used.

While the design of any one piece of malware is often unknown, we know that many pieces share functionality. In fact, a majority are derived from a limited number of sources. In the case of the Zeus/Zbot data set, for example, it appears that all pieces may have been generated by a small number of builder programs. In recent years, malware analysis has focused on identifying the source of the code, zeroing in on the similarities within existing sources.

Along with fellow senior researchers Jeffrey S. Havrilla and Charles Hines, I am working to create a rapid search capability that allows analysts to quickly compare a new piece of malware against the millions of entries in the Artifact Catalog database. The goal of our research is to reduce the time needed to identify similar pieces and relevant findings to minutes, and we have identified the following three techniques to help us achieve this goal:

Suffix Trees, which compactly represent the data by collapsing recurring features found in the dataset. This technique provides a map of the malware landscape and is a novel representation.

ZDDs, which compactly represent a set of sets and support quick operations for set algebras. We will develop an application of the ZDD to support the modeling of the software corpora as a set of distinct families.

Sparse Representation Modeling, which is a model for data storage that exploits principal features of the data that can be informed by analyst findings and may support multiple and potentially competing data-driven speculations.
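To make the "set of sets" idea behind ZDDs concrete, here is a minimal sketch of a zero-suppressed decision diagram in Python. It is an illustration under stated assumptions, not our implementation: each sample is assumed to be reduced to a set of integer feature IDs (for example, IDs of distinct code blocks), a family is a set of such samples, and all function names (`node`, `from_set`, `union`, `intersect`, `count`, `members`) are hypothetical.

```python
from functools import lru_cache

# Terminal nodes: EMPTY is the empty family {}, BASE is the family {{}}.
EMPTY, BASE = 0, 1

_unique = {}  # (var, lo, hi) -> node id, so identical sub-diagrams are shared
_nodes = {}   # node id -> (var, lo, hi)

def node(var, lo, hi):
    """Return the node for var, applying the zero-suppression rule."""
    if hi == EMPTY:              # no set under this node contains var: skip it
        return lo
    key = (var, lo, hi)
    if key not in _unique:
        nid = len(_unique) + 2   # ids 0 and 1 are reserved for the terminals
        _unique[key] = nid
        _nodes[nid] = key
    return _unique[key]

def top(z):
    """Top variable of a diagram; terminals sort after every real variable."""
    return _nodes[z][0] if z > BASE else float("inf")

def from_set(items):
    """Build the diagram for the family that contains exactly one set."""
    z = BASE
    for var in sorted(items, reverse=True):  # smallest variable ends up on top
        z = node(var, EMPTY, z)
    return z

@lru_cache(maxsize=None)
def union(a, b):
    if a == EMPTY:
        return b
    if b == EMPTY or a == b:
        return a
    va, vb = top(a), top(b)
    if va < vb:                  # only a mentions va
        _, lo, hi = _nodes[a]
        return node(va, union(lo, b), hi)
    if vb < va:                  # only b mentions vb
        _, lo, hi = _nodes[b]
        return node(vb, union(a, lo), hi)
    _, alo, ahi = _nodes[a]
    _, blo, bhi = _nodes[b]
    return node(va, union(alo, blo), union(ahi, bhi))

@lru_cache(maxsize=None)
def intersect(a, b):
    if a == EMPTY or b == EMPTY:
        return EMPTY
    if a == b:
        return a
    va, vb = top(a), top(b)
    if va < vb:                  # sets containing va cannot occur in b
        return intersect(_nodes[a][1], b)
    if vb < va:
        return intersect(a, _nodes[b][1])
    _, alo, ahi = _nodes[a]
    _, blo, bhi = _nodes[b]
    return node(va, intersect(alo, blo), intersect(ahi, bhi))

def count(z):
    """Number of sets in the family."""
    if z <= BASE:
        return z
    _, lo, hi = _nodes[z]
    return count(lo) + count(hi)

def members(z):
    """Expand a diagram back into an explicit set of frozensets."""
    if z == EMPTY:
        return set()
    if z == BASE:
        return {frozenset()}
    var, lo, hi = _nodes[z]
    return members(lo) | {s | {var} for s in members(hi)}

def family(samples):
    """Fold a list of feature-ID sets into one diagram."""
    z = EMPTY
    for s in samples:
        z = union(z, from_set(s))
    return z

# Two hypothetical families that share two samples.
family_a = family([{1, 2, 3}, {1, 2, 4}, {2, 5}])
family_b = family([{1, 2, 4}, {2, 5}, {6}])
shared = intersect(family_a, family_b)
print(count(shared), sorted(sorted(s) for s in members(shared)))
```

Because structurally identical sub-diagrams are stored once, families whose samples share many features can take far less space than an explicit list of sets, and union and intersection work directly on the compressed form.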

Although some of these techniques have been used in other domains, such as bioinformatics and genomic research, our project is the first to combine them to automate large-scale malware analysis. As part of our research, we are collaborating with Ravi Sachidanandam's laboratory at Mt. Sinai School of Medicine. Sachidanandam and fellow researcher James Gurtowski have developed applications of suffix trees to organize deep sequencing datasets for bioinformatics research. Their expertise with the use of suffix arrays has advanced the field of bioinformatics research, and we hope to have similar results with malware analysis.
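As a rough illustration of why suffix-based structures help find shared content, the sketch below sorts the suffixes of two byte strings together and compares neighbors that come from different inputs. It is a deliberately naive suffix-array-style construction (it materializes every suffix, which is far too slow for a real corpus), and the function names are hypothetical, not taken from our tooling or from the Sachidanandam laboratory's software.

```python
def common_prefix_len(x: bytes, y: bytes) -> int:
    """Length of the longest common prefix of x and y."""
    n = 0
    for cx, cy in zip(x, y):
        if cx != cy:
            break
        n += 1
    return n

def longest_shared_run(a: bytes, b: bytes) -> bytes:
    """Return a longest byte run that occurs in both a and b.

    Suffixes of both inputs are sorted together. The longest match between
    the two inputs always shows up as the common prefix of some pair of
    adjacent suffixes that come from different inputs.
    """
    suffixes = sorted(
        [(a[i:], 0) for i in range(len(a))] +
        [(b[j:], 1) for j in range(len(b))]
    )
    best = b""
    for (s1, origin1), (s2, origin2) in zip(suffixes, suffixes[1:]):
        if origin1 != origin2:
            n = common_prefix_len(s1, s2)
            if n > len(best):
                best = s1[:n]
    return best

# Two made-up byte strings that share a seven-byte run.
sample_1 = b"\x55\x8b\xec\x83\xec\x10\xe8AAAA\xc3"
sample_2 = b"\x90\x90\x55\x8b\xec\x83\xec\x10\xe8BBBB\xc3"
print(longest_shared_run(sample_1, sample_2).hex())
```

A production system would use a linear-time suffix tree or suffix array construction over the whole corpus instead of sorting raw suffixes pair by pair.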

If our research succeeds, CERT malware analysts investigating Zeus-related malware will be able to use this tool to winnow out code that has already been confirmed as malware and pinpoint blocks that are different and may require deeper analysis. The end result is a more effective response, including assigning a signature to a piece or family to mitigate it or prevent further deployment. In recent years, CERT has established a reputation of excellence in the field of malware analysis, and the enhanced analysis would give us one more tool to help in cyber-crime investigations.
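The winnowing step described above can be sketched in a few lines. This is only an illustration under simplifying assumptions: the catalog is modeled as a set of SHA-256 digests of fixed-size blocks, and the block size and the names `block_digests` and `novel_blocks` are hypothetical rather than part of any CERT tool.

```python
import hashlib

BLOCK_SIZE = 16  # hypothetical block size, chosen only for the example

def block_digests(data: bytes, size: int = BLOCK_SIZE):
    """Split data into fixed-size blocks and return (offset, digest) pairs."""
    return [
        (offset, hashlib.sha256(data[offset:offset + size]).hexdigest())
        for offset in range(0, len(data), size)
    ]

def novel_blocks(sample: bytes, known_digests: set):
    """Return offsets of blocks in sample whose digest is not in the catalog."""
    return [offset for offset, digest in block_digests(sample)
            if digest not in known_digests]

# A stand-in catalog built from one already-analyzed sample.
known_sample = bytes(range(64))
catalog = {digest for _, digest in block_digests(known_sample)}

# A new sample that differs from the known one in a single block.
new_sample = known_sample[:32] + b"\xff" * 16 + known_sample[48:]
print(novel_blocks(new_sample, catalog))  # → [32]
```

Fixed-offset blocks are fragile: inserting a single byte shifts every later block and makes them all look new. That fragility is one reason representations that index every substring, such as the suffix-based structures discussed in this post, are attractive for this kind of comparison.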

To send us feedback directly about this post, or to obtain information on accessing the limited distribution reports listed below, please email mc-feedback+blog@cert.org.

For more information:

Diversity Characteristics in the Zeus Family of Malware
By William Casey, Cory Cohen, David French, Chuck Hines, Jeff Havrilla, & Ross Kinder
December 2010
SPECIAL REPORT (Limited Distribution)

Application of Code Comparison Techniques Characterizing the Aliser Malware Family
By William Casey, Charles Hines, David French, Cory Cohen & Jeffrey Havrilla
July 28, 2010
SPECIAL REPORT (Limited Distribution)

Function Hashing for Malicious Code Analysis
By Cory Cohen & Jeffrey Havrilla
2009 CERT Research Annual Report, pages 26-27
www.cert.org/research/2009research-report.pdf

Malware Clustering Based on Entry Points
By Cory Cohen & Jeffrey Havrilla
2008 CERT Research Annual Report, page 80
www.cert.org/research/2008/research-report.pdf

Software Engineering Institute

SEI Blog

William Casey

March 21, 2011
