search menu icon-carat-right cmu-wordmark

Fuzzy Hashing Techniques in Applied Malware Analysis

David French

Malware--generically defined as software designed to access a computer system without the owner's informed consent--is a growing problem for government and commercial organizations. In recent years, research into malware focused on similarity metrics to decide whether two suspected malicious files are similar to one another. Analysts use these metrics to determine whether a suspected malicious file bears any resemblance to already verified malicious files. Using these metrics allows analysts to potentially save time, by identifying opportunities to leverage previous analysis. This post will describe our efforts to develop a technique (known as fuzzy hashing) to help analysts determine whether two pieces of suspected malware are similar.

When investigating a security incident involving malware, analysts will create a report documenting their findings. To denote the identity of a malicious binary or executable, analysts often use cryptographic hashing, which computes a hash value on a block of data, such that an accidental or intentional change to the data will change the hash value. In communication science, cryptographic hashes are frequently used to determine the integrity of a message sent through a communication channel. In malware research, they are useful for positively identifying a piece of software. If a suspected file has the same cryptographic hash as a known file, an analyst is reasonably confident that the files are identical. Modifying even a single bit of a malicious file, however, will alter its cryptographic hash. The result is that inconsequential changes to malicious files will prevent analysts from rapidly observing that a suspected file is identical to a file they have already seen.

To counter this behavior, analysts seek improved ways of assessing whether two files are similar. One such method is known as fuzzy hashing. Fuzzy hashes and other block/rolling hash methods provide a continuous stream of hash values for a rolling window over the binary. These methods produce hash values that allow analysts to assign a percentage score that indicates the amount of content that the two files have in common. A recent type of fuzzy hashing, known as context triggered piecewise hashing, has gained enormous popularity in malware detection and analysis in the form of an open-source tool called ssdeep.

One reason for ssdeep's broad appeal is it helps analysts quickly determine whether a suspected piece of malware is similar to a known malware sample. It also allows analysts to compare a program on disk with the program as it has been loaded into memory (for example, by Microsoft Windows operating systems). Since the operating system changes portions of the executable program upon loading into memory (for example, modifying memory references), if that program's memory image is then dumped to disk and compared to the original file, the cryptographic hashes of the memory image and the original file will most likely be different, whereas fuzzy hashes might show a significant amount of similarity.

For these scenarios, fuzzy hashing is generally useful, though not without problems. One problem is that fuzzy hashes interpret changes differently than cryptographic hashes. If two files have the same cryptographic hash, it's reasonably certain they are the same file. If two files have a relation indicated by their fuzzy hashes, there's less certainty. Moreover, it's hard to identify what differences there may be unless each and every byte in the two files is compared, which is extremely time consuming and may prove fruitless if the amount of similarity indicated by the fuzzy hash comparison is relatively low. Analyzing fuzzy hashes, therefore, becomes more expensive and less precise.

Another problem is that ssdeep was derived from a technique used to detect spam in email messages. As a result, the ssdeep hash generation and hash comparison algorithms have some properties that make sense when applied to generative textual content (i.e., spam email), but may not translate well when applied to the binary or executable files (i.e., malware). For example, the ssdeep hash comparison algorithm is dependent on the block sizes for the hashes of two binary files, which are derived from the overall size of the file being hashed. It'strivial to make the size of a particular executable file far larger than another that shares identical header and section data by simply appending data to the end of the file, which can force the hashing algorithm to adopt a different block size and thus prevent meaningful comparison. Such an attack would not alter the execution of such a modified program. While ssdeep is a valuable tool for malware analysis, published literature on this approach makes it clear that an examination of this technique would benefit the community of malware analysts. Working with William Casey, also a senior scientist at CERT, I plan to identify and describe the particular cases in which fuzzy hashing is applicable in malware analysis and what significance hashes play in those cases. Conversely, we also plan to identify the instances in which fuzzy hashes do not work and should not be applied in malware analysis. We are interested in comparing fuzzy hash techniques in the broader context of approximate string matching and in discovering best practices. We will publish these findings in a document and share it with the broader malware analysis community. As part of our research, we will collaborate with author of ssdeep, Jesse Kornblum, and work with him to improve the accuracy and effectiveness of fuzzy hashing for malware analysis.

This research is one of eight exploratory research projects funded in fiscal year 2011 by the SEI. The results will help determine what areas of work should become priorities for future SEI research and development.

I will be blogging about the progress of this project throughout the year.

This post is the first installment in a series exploring David's research in fuzzy hashing. To read the second post in the series, Fuzzy Hashing Against Different Types of Malware, plesae click here.

Get updates on our latest work.

Each week, our researchers write about the latest in software engineering, cybersecurity and artificial intelligence. Sign up to get the latest post sent to your inbox the day it's published.

Subscribe Get our RSS feed