
A New Approach to Modeling Malware using Sparse Representation

William Casey

Malicious software (known as "malware") is increasingly pervasive, with a constant influx of new, increasingly complex strains that exploit computers, or the personal and business information stored on them, for malicious or criminal purposes. Examples include code designed to pilfer personal and digital credentials; plunder sensitive information from government or business enterprises; or interrupt, misdirect, or render inoperable computer hardware and computer-controlled equipment. This post describes our work to create a rapid search capability that allows analysts to quickly analyze a new piece of malware.

Through our work in cyber security, we have amassed approximately 13 million pieces of malicious software for analysis in a large malware database called the Artifact Catalog. This post explores our research efforts to use Suffix Trees, ZDDs, and Sparse Representation Modeling to better represent the corpus of malware for search and retrieval, so that analysts can access it more easily. One recent addition to the database is a group of 60,000 pieces of code recovered from the Zeus/Zbot malware, a type of Trojan horse designed to steal banking information. We are conducting a large-scale analysis to identify similarities among these pieces, but it is a difficult and time-consuming task that consumes massive amounts of resources, including examinations and assessments by analysts.

Ideally, with each new piece of malware, analysts would learn more about the data and interpret the software's functionality via reverse engineering, an analytic process that provides an understanding of how the code works logically and mechanically by interpreting coded functions. Reverse engineering a single piece of software can take weeks, so reverse engineering 60,000 pieces by hand is impractical. A collection of that size calls for an analysis process that automates all repetitive tasks, including the identification of similarities in the data. Automation frees the human analyst to focus on interpretation, decision making, and communication of findings. Ideally, a single analyst could comprehend and track the activity of a malware author even when the number of pieces is very large and obfuscation techniques are used.

While the design of any one piece of malware is often unknown, we know that many pieces share functionality. In fact, a majority are derived from a limited number of sources. In the case of the Zeus/Zbot data set, for example, it appears that all pieces may have been generated from a small number of builder programs. In recent years, malware analysis has focused on identifying the source of the code, zeroing in on the similarities within existing sources.

Along with fellow senior researchers Jeffrey S. Havrilla and Charles Hines, I am working to create a rapid search capability that allows analysts to quickly compare a new piece of malware against the millions of entries in the Artifact Catalog database. The goal of our research is to reduce the time needed to identify similar pieces and relevant findings to minutes, and we've identified the following three techniques to help us achieve this goal:

  • Suffix Trees, which compactly represent the data by collapsing recurring features found in the dataset. This technique provides a map of the malware landscape and is a novel representation.

  • ZDDs, which compactly represent a set of sets and support quick operations for set algebras. We will develop an application of the ZDD to support the modeling of the software corpora as a set of distinct families.

  • Sparse Representation Modeling, which is a model for data storage that exploits principal features of the data that can be informed by analyst findings and may support multiple and potentially competing data-driven speculations.
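
To make the "set of sets" idea concrete, here is a minimal sketch of a ZDD in Python. It is only an illustration and not the implementation used in our project: the class name, the method names, and the integer feature ids (which could stand for function hashes, for example) are all assumptions made for this example. Each sample is a set of feature ids, and a family is a set of such samples.

```python
# Minimal ZDD sketch (illustrative only; all names here are hypothetical).
# A ZDD node is (var, lo, hi): "lo" holds the sets without var,
# "hi" holds the sets that contain var (with var removed).
EMPTY, BASE = 0, 1  # EMPTY = no sets at all, BASE = only the empty set


class ZDD:
    def __init__(self):
        self.nodes = {}   # node id -> (var, lo, hi)
        self.unique = {}  # (var, lo, hi) -> node id, so equal families share one node

    def node(self, var, lo, hi):
        # Zero-suppression rule: a node whose hi branch is EMPTY is redundant.
        if hi == EMPTY:
            return lo
        key = (var, lo, hi)
        if key not in self.unique:
            nid = len(self.nodes) + 2  # ids 0 and 1 are reserved for terminals
            self.nodes[nid] = key
            self.unique[key] = nid
        return self.unique[key]

    def single(self, items):
        # Family containing exactly one set; smallest variable ends up at the root.
        n = BASE
        for v in sorted(set(items), reverse=True):
            n = self.node(v, EMPTY, n)
        return n

    def union(self, a, b):
        if a == EMPTY:
            return b
        if b == EMPTY or a == b:
            return a
        if b == BASE:
            a, b = b, a
        if a == BASE:
            var, lo, hi = self.nodes[b]
            return self.node(var, self.union(BASE, lo), hi)
        va, la, ha = self.nodes[a]
        vb, lb, hb = self.nodes[b]
        if va < vb:
            return self.node(va, self.union(la, b), ha)
        if vb < va:
            return self.node(vb, self.union(a, lb), hb)
        return self.node(va, self.union(la, lb), self.union(ha, hb))

    def intersect(self, a, b):
        if a == EMPTY or b == EMPTY:
            return EMPTY
        if a == b:
            return a
        if b == BASE:
            a, b = b, a
        if a == BASE:
            # Only the empty set can survive, so follow the lo branch.
            return self.intersect(BASE, self.nodes[b][1])
        va, la, ha = self.nodes[a]
        vb, lb, hb = self.nodes[b]
        if va < vb:
            return self.intersect(la, b)
        if vb < va:
            return self.intersect(a, lb)
        return self.node(va, self.intersect(la, lb), self.intersect(ha, hb))

    def count(self, n):
        # Number of sets in the family rooted at n.
        if n < 2:
            return n
        _, lo, hi = self.nodes[n]
        return self.count(lo) + self.count(hi)

    def sets(self, n):
        # Enumerate the family as a list of frozensets.
        if n == EMPTY:
            return []
        if n == BASE:
            return [frozenset()]
        var, lo, hi = self.nodes[n]
        return self.sets(lo) + [s | {var} for s in self.sets(hi)]


z = ZDD()
family_a = z.union(z.single({1, 2, 3}), z.single({1, 2}))
family_b = z.union(z.single({1, 2}), z.single({4}))
shared = z.intersect(family_a, family_b)  # samples present in both families
```

Because every distinct (var, lo, hi) triple is stored once, two families that contain the same samples end up as the same node id, so equality checks between families are a single integer comparison. This sketch uses plain recursion without memoization; a practical version would cache the results of `union` and `intersect`.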

Although some of these techniques have been used in other domains, such as bioinformatics and genomic research, our project is the first to combine them to automate large-scale malware analysis. As part of our research, we are collaborating with Ravi Sachidanandam's laboratory at Mt. Sinai School of Medicine. Sachidanandam and fellow researcher James Gurtowski have developed applications of suffix trees to organize deep sequencing datasets for bioinformatics research. Their expertise with the use of suffix arrays has advanced the field of bioinformatics research, and we hope to have similar results with malware analysis.
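
As a rough illustration of how suffix-based indexing exposes recurring content, the sketch below finds the longest byte run shared by two samples. It is a simplification under stated assumptions, not our tooling: it uses a naively built suffix array (sorting all suffixes) rather than a suffix tree, the function name `longest_shared_run` is hypothetical, and the byte strings are made up for the example.

```python
def longest_shared_run(sample_a: bytes, sample_b: bytes) -> bytes:
    """Longest byte run that occurs in both samples (naive suffix-array sketch)."""
    # Tag every suffix with the sample it came from, then sort them together.
    # Sorting places suffixes that share a long prefix next to each other.
    suffixes = sorted(
        [(sample_a[i:], 0) for i in range(len(sample_a))]
        + [(sample_b[i:], 1) for i in range(len(sample_b))]
    )
    best = b""
    for (s, src_s), (t, src_t) in zip(suffixes, suffixes[1:]):
        if src_s == src_t:
            continue  # only neighbors from different samples indicate sharing
        # Length of the common prefix of the two neighboring suffixes.
        n = 0
        while n < min(len(s), len(t)) and s[n] == t[n]:
            n += 1
        if n > len(best):
            best = s[:n]
    return best


# Two made-up byte strings that share a five-byte run.
run = longest_shared_run(
    b"\x55\x8b\xec\x83\xec\x10\x90\x90",
    b"\xcc\x55\x8b\xec\x83\xec\x20",
)
```

Materializing every suffix costs quadratic memory, so this only works for small inputs. Indexing a corpus the size of the Artifact Catalog would need a linear-time suffix tree or suffix array construction.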

If we are successful, CERT malware analysts who investigate Zeus-related malware will be able to use this tool to winnow out code that has already been confirmed as malware and to pinpoint blocks that are different and may require deeper analysis. The end result is a more effective response, including assigning a signature to a piece or family to mitigate it or prevent further deployment. In recent years, CERT has established a reputation for excellence in the field of malware analysis, and the enhanced analysis would give us one more tool to help in cyber-crime investigations.
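
The winnowing step can be pictured with a very small sketch. This is an assumption-laden illustration rather than a description of the tool: it splits samples into fixed-size, fixed-alignment blocks and hashes them with SHA-256, and the block size and the helper names `block_hashes` and `novel_offsets` are hypothetical.

```python
import hashlib

BLOCK = 16  # hypothetical block size in bytes


def block_hashes(data: bytes, size: int = BLOCK) -> list:
    """SHA-256 digest of each consecutive fixed-size block of data."""
    return [
        hashlib.sha256(data[i:i + size]).hexdigest()
        for i in range(0, len(data), size)
    ]


def novel_offsets(sample: bytes, known: set, size: int = BLOCK) -> list:
    """Offsets of blocks in sample whose hash has not been seen before."""
    return [
        i * size
        for i, digest in enumerate(block_hashes(sample, size))
        if digest not in known
    ]


# Blocks already confirmed in earlier samples (made-up data).
known = set(block_hashes(b"A" * 16 + b"B" * 16))

# A new sample that reuses both known blocks and adds one new block.
new_sample = b"A" * 16 + b"C" * 16 + b"B" * 16
to_review = novel_offsets(new_sample, known)
```

Here only the block at offset 16 would be handed to an analyst. Fixed alignment is a deliberate simplification: inserting a single byte shifts every later block, which is one reason suffix-based indexing is attractive for real code.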

To send us feedback directly about this post, or to obtain information on accessing the limited distribution reports listed below, please email

For more information:

Diversity Characteristics in the Zeus Family of Malware
By William Casey, Cory Cohen, David French, Chuck Hines, Jeff Havrilla, & Ross Kinder
December 2010
SPECIAL REPORT (Limited Distribution)

Application of Code Comparison Techniques Characterizing the Aliser Malware Family
By William Casey, Charles Hines, David French, Cory Cohen & Jeffrey Havrilla
July 28, 2010
SPECIAL REPORT (Limited Distribution)

Function Hashing for Malicious Code Analysis
By Cory Cohen & Jeffrey Havrilla
2009 CERT Research Annual Report, pages 26-27

Malware Clustering Based on Entry Points
By Cory Cohen & Jeffrey Havrilla
2008 CERT Research Annual Report, page 80
