search menu icon-carat-right cmu-wordmark

Fuzzy Hashing Against Different Types of Malware

David French

Malware, which is short for "malicious software," is a growing problem for government and commercial organizations since it disrupts or denies important operations, gathers private information without consent, gains unauthorized access to system resources, and other inappropriate behaviors. A previous blog postdescribed the use of "fuzzy hashing" to determine whether two files suspected of being malware are similar, which helps analysts potentially save time by identifying opportunities to leverage previous analysis of malware when confronted with a new attack. This posting continues our coverage of fuzzy hashing by discussing types of malware against which similarity measures of any kind (including fuzzy hashing) may be applied.

Fuzzy hashes provide a continuous stream of hash values for a rolling window over the malware binary, thereby allowing analysts to assign a percentage score that indicates the degree of similarity between two malware programs. When considering how fuzzy hashing works against malware, it is useful first to consider why malware programs would ever be similar to each other. For the purposes of this discussion we focus on prevalent Microsoft Portable Executable (PE) formatted files, although this description can be generalized to any executable code stored in any format. We further consider similarity as a measure of file structure--rather than program behavior--since fuzzy hashing generally applies to the bytes comprising a file, rather than an observation of the semantics of a program in some other space.

Malware is software combining three elements: (1) code, whether compiled source code written in a high-level language or hand-crafted assembly, (2) data, which is some set of numerical, textual, or other types of discrete values intended to drive the logic of the code in specific ways, and (3) process, which is loosely a set of operations (for example, compiling and linking) applied to the code and data that ultimately produce an executable sequence of bytes in a particular format, subject to specific operating constraints. Given a distinct set of code, data, and consistent processes applied thereto, it is reasonable to conclude that--barring changes to any of these--we will produce an identical executable file every time we apply the process to the code and data (where identity is measured using a cryptographic hash, such as MD5). We now consider how the permutation of any of these components will affect the resulting executable file.

First, let us consider the effect of modifying the data used to drive a particular executable. With respect to malicious software, such data may include remote access information (such as IP addresses, hostnames, usernames and passwords, commands, etc.), installation and configuration information (such as registry keys, temporary filenames, mutexes, etc.), or any other values which cause the malware to execute in specific ways. Generally speaking, changing the values of these data may cause different behavior in the malware at runtime but should have little impact on the structure of the malware.

Malware authors may modify their source code to use different data values for each new program instance or may construct their program to access these data values outside the context of the compiled program (for example, by embedding the data within or at the end of the PE file). In the case of malicious code, data may also include bytes whose presence does not alter the behavior of the code in any way, and whose purpose is to confuse analysis. Regardless, the expected changes to the resulting executable file are directly proportional to the amount of data changed. Since we only changed the values of data--not the way in they are referenced (in particular, we have not changed the code)--we can expect that the structure of the output file is modified only to support any different storage requirements for the new data.

Similarly, let us consider the effect of modifying the code found in a particular executable. The code defines the essential logic of the malware and describes the behavior of the program under specified conditions. To modify program behavior, the code must generally be modified. The expected changes to the resulting executable file are proportional to the amount of code changed, much as we expect when changing data. However, code--especially compiled code--differs from data in that the representation of the code in its final form is often drastically different from its original form. Compiling and linking source code represents a semantic transformation, with the resulting product intended for consumption by a processor, not a human reader.

To accomplish semantic transformation most effectively, the compiler and linker may perform all manner of permutations, such as rewriting blocks of code to execute more efficiently, reordering code in memory to take up less space, and even removing code that is not referenced within the original source. If we assume that the process to create the executable remains constant (for example, that optimization settings are not changed between compilations), we must still allow that minor changes in the original source code may have unpredictably large changes in the resulting executable. As a consequence, code changes are more likely to produce executables with larger structural differences between revisions than executables where only data changes.

Thus, we have described two general cases in which structurally different files (measured by cryptographic hashing, such as MD5) may be produced from a common source. We refer to malware families whose primary or sole permutation is in their data as generative malware, and use the analogy of a malware factory cranking out different MD5s by modifying data bytes in some way. We refer to malware families whose primary permutation is in their code as evolutionary malware, in that the behavior of the program evolves over time. When considering the effects of similarity measurements such as fuzzy hashing, we may expect that fuzzy hashing will perform differently against these different general types of malware.

As an example of using fuzzy hashing against generative malware, consider the malware family BackDoor-DUG.a (also referenced here) also known as Trojan.Scraze by ClamAV and W32/ScreenBlaze.A2 by F-Prot (ClamAV and F-Prot are antivirus vendors and it's important to note that the same family is known by several different names). The two files referenced from the McAfee site are Delphi programs, comprising 4,185 functions at distinct addresses as observed by disassembling each program using IDA-Pro v6.1. If we consider each function as a sequence of bytes and consider the cryptographic hashes of each function's bytes using a technique called function hashing, when we observe that these programs have approximately 3,321 unique functions each, per their position independent code (PIC) function hashes. Of these 3,321 functions distinct to each program, we observe that 3,292 of these functions are shared (meaning their bytes are exactly the same) between the programs, and that each program has 29 functions not shared with the other program.

Inspecting each of the 29 functions in each of two files (for a total of 58 functions) in IDA-Pro, we discover that for all 29 pairs of functions found at the same address across the two files, the functions at the same addresses only differ by large blocks of seemingly non-executed data, which is jumped around by the code bytes. Otherwise, code bytes for each of the 29 function pairs at corresponding addresses are identical. In this way, we can observe that the two programs are materially identical, except for seemingly non-executed bytes, which we generically call data. By performing ssdeep comparison of these two files we produce the following fuzzy hashes and their associated comparison score:



70212f8f88865f4f9bb919383aabc029.ex_ matches 6f83ac65223e2ac7837bfe3068da411c.ex_ (85)

Matching these files using ssdeep corroborates our findings using analysis of these files by function data, in that they are highly similar. These two files thus provide a good example of generative malware.

When considering how code changes can affect fuzzy hashing, we consider briefly non-malicious software for which we have full source code. The Nullsoft Scriptable Installation System (NSIS) is an open-source installation system used to create Windows-based installation programs. Although NSIS is not malicious software it can be used to install many different types of programs on Windows computers, including malicious and non-malicious programs alike.

The project page for NSIS provides several revisions; we examined the two most recent Versions 2.45 (MD5 sum af193ccc547ca83a63eedf6a2d9d644d) and 2.46 (MD5 sum 0e5d08a1daa8d7a49c12ca7d14178830), for which Windows binaries are available. The two files comprise 6,038 and 6,040 functions at distinct addresses, respectively, with 2,564 unique functions (as measured by their PIC function hashes). These two programs have 2,544 identical functions, with 20 different functions each. The differing functions have changes that range from identical functions using different constants to entirely new functions with no overlapping behavior. Regardless, the vast majority of the behavior of these two programs is identical. We perform ssdeep comparison of these two files, and produce the following fuzzy hashes, and their associated comparison score:



nsis-2.45/NSIS.exe matches nsis-2.46/NSIS.exe (0)

As seen from the score of zero from ssdeep, fuzzy hashing does not detect any relationship between these two files even though function analysis revealed that the majority of the behavior of these two files is the same. This result is borne out by reading the release notes for V2.46 from the NSIS website, which documents relatively minor changes. When we compare two files whose known changes are relatively few, we can see that, although the evolution of these two programs is relatively minor in terms of absolute number of changes to functionality, their structure is clearly different enough that fuzzy hashing such as ssdeep was completely unable to detect similarity. This highlights the challenging problem of similarity measurements in malicious code, and underscores the need to understand the underlying reasons that similarity would ever present to any particular technique.

Future blog entries will consider alternate fuzzy hashing approaches and tools, and discuss some of the challenges of performing fuzzy hashing at scale.

This post is the second in a series exploring David's research in fuzzy hashing. To read the first post in the series, Fuzzy Hashing Techniques in Applied Malware Analysis, please click here.

Additional Resources:

More information about CERT research in malicious code and development is available in the 2010 CERT Research Report, which may be viewed online at

Get updates on our latest work.

Each week, our researchers write about the latest in software engineering, cybersecurity and artificial intelligence. Sign up to get the latest post sent to your inbox the day it's published.

Subscribe Get our RSS feed