search menu icon-carat-right cmu-wordmark

Writing Effective YARA Signatures to Identify Malware

David French

In previous blog posts, I have written about applying similarity measures to malicious code to identify related files and reduce analysis expense. Another way to observe similarity in malicious code is to leverage analyst insights by identifying files that possess some property in common with a particular file of interest. One way to do this is by using YARA, an open-source project that helps researchers identify and classify malware. YARA has gained enormous popularity in recent years as a way for malware researchers and network defenders to communicate their knowledge about malicious files, from identifiers for specific families to signatures capturing common tools, techniques, and procedures (TTPs). This blog post provides guidelines for using YARA effectively, focusing on selection of objective criteria derived from malware, the type of criteria most useful in identifying related malware (including strings, resources, and functions), and guidelines for creating YARA signatures using these criteria.

Benefits of Reverse Engineering for Malware Analysis

Reverse engineering is arguably the most expensive form of analysis to apply to malicious files. It is also the process by which the greatest insights can be made against a particular malicious file. Since analysis time is so expensive, however, we constantly seek ways to reduce this cost or to leverage the benefits beyond the initially analyzed file. When classifying and identifying malware, therefore, it is useful to group related files together to cut down on analysis time and leverage analysis of one file against many files. To express such relationships between files, we use the concept of a "malware family", which is loosely defined as "a set of files related by objective criteria derived from the files themselves." Using this definition, we can apply different criteria to different sets of files to form a family.

For example, as I described in my last blog post, Scraze is considered a malware family. In this case, the objective criteria forming the family are the functions that Scraze files have in common. Likewise, NSIS files can also be considered a family, though a reasonable person might conclude that a common installation method for multiple different programs (such as NSIS provides) is an irrelevant relationship between those programs. In this case, we may also use function sharing as the objective criteria forming the family.

Whatever the criteria selected to identify a malware family, we generally find that it is desirable for these criteria to have the following properties:

  • The criteria should be necessary to the behavior of the malware. Ideal candidates are structural properties (such as a particular section layout or resources needed for the program to run) or behavioral properties (such as function bytes for important malware behavior).
  • The criteria should also be sufficient to distinguish the malware family from other families. We often find that malicious files accomplish similar things but perhaps in different ways. For example, many malicious files detect that they are running in a virtual environment, and there have been many published techniques on how to implement this behavior. Since these techniques are published on the Internet, it is trivial for a malware author to incorporate them into his own programs, and we expect this to occur. Thus, using these published examples as criteria indicative of one particular malware family is probably not sufficient to distinguish that family.

Applying YARA Signatures to Malware

Analyst experience and intuition can guide the selection of good criteria, as discussed in my prior work on the process of selecting and applying criteria. The goal of developing these criteria is to use them to identify related files. In practice, we generally find that good criteria for distinguishing and identifying malware families are excellent targets for creating signatures that identify the families.

One way to encode signatures that identify malware families is by using the open source tool YARA. YARA provides a robust language (based on Perl Compatible Regular Expressions) for creating signatures with which to identify malware. These signatures are encoded as text files, which makes them easy to read and communicate with other malware analysts. Since YARA applies static signatures to binary files, the criteria statically derived from malicious files are the easiest and most effective criteria to convert into YARA signatures. We have found three different types of criteria are most suitable for YARA signature development: strings, resources, and function bytes.

The simplest usage of YARA is to encode strings that appear in malicious files. The usefulness of matching strings, however, is highly dependent on which strings are chosen. For example, selecting strings that represent unique configuration items, or commands for a remote access tool, are likely to be indicative of, and specific to, a particular malware family. Conversely, strings in a malicious file that result from the way the file was created (such as version information stored by the Microsoft MSVC++ compiler) are generally poor candidates for YARA signatures.

Here is an example of a YARA signature for the malware family Scraze, based on strings derived from the malware:

rule Scraze
$strval1 = "C:\Windows\ScreenBlazeUpgrader.bat"
$strval2 = "\ScreenBlaze.exe "
all of them

Another effective use of YARA is to encode resources that are stored in malicious files. These resources may include things like distinctive icons, configuration information, or even other files. To encode these resources as YARA signatures, we first extract the resources (using any available tool, for example Resource Hacker) and then convert the bytes of the resource (or a portion thereof) to a hexadecimal string that can be represented directly in a YARA signature. Icons in particular make excellent targets for such signatures. However, resources are easily modified in a program. For example, Microsoft Windows allows a program's icon to change by simple copy/paste operations in Windows Explorer. The same care must be taken in selecting program resources as is taken when selecting strings.

Finally, an effective use of YARA signatures is to encode bytes implementing a function called by the malicious program. Functions tend to satisfy our desire for criteria both necessary and sufficient to describe the malware family. The best functions to encode are those that perform some action deemed indicative of the overall character of the malware.

For example, the best function for malware whose primary purpose is to download another file may be one that performs the download or processes downloaded files. Likewise, for remote access tools, it may be a function to encrypt network communications or to process received commands. For viruses, it may be code (which may not even be an actual function) involved with decrypting a payload or re-infecting additional files. We may also identify packers by encoding bytes representing the unpack stub.

Here is an example of a YARA signature encoding part of a function used to check the integrity of an NSIS installer created with NSIS version 2.46. This basic block has had address bytes wildcarded based on the PIC algorithm (described in the article "Function Hashing for Malicious Code Analysis" in the 2009 CERT research report).

rule NSIS_246
$NSIS_246_CheckIntegrity = { 57 53 E8 ?? ?? ?? ?? 85 C0 0F 84 ?? ?? ?? ?? 83 3D ?? ?? ?? ?? 00 75 ?? 6A 1C 8D 45 D8 53 50 E8 ?? ?? ?? ?? 8B 45 D8 A9 F0 FF FF FF 75 ?? 81 7D DC EF BE AD DE 75 ?? 81 7D E8 49 6E 73 74 75 ?? 81 7D E4 73 6F 66 74 75 ?? 81 7D E0 4E 75 6C 6C 75 ?? 09 45 08 8B 45 08 8B 0D ?? ?? ?? ?? 83 E0 02 09 05 ?? ?? ?? ?? 8B 45 F0 3B C6 89 0D ?? ?? ?? ?? 0F 8F ?? ?? ?? ?? F6 45 08 08 75 ?? F6 45 08 04 75 ?? }

Caveats in Creating YARA Signatures

Regardless of the criteria used to create YARA signature, there are always caveats, especially for criteria derived from program data, such as strings. For example, malware with statically coded password lists may have a large number of strings, including those that may seem to unique to a family. Moreover, since strings are easily mutable, a string (such as the filename or path into which a malicious file is installed, or common encoding strings used in HTTP requests) considered as indicative may change at any time. The analyst must assess the significance of the presence or absence of a particular string, rather than delegate responsibility of understanding to a YARA signature.

Likewise, when using code instead of data to create YARA signatures, we must also be aware of the caveats. It is important to use functions that are not provided from a common implementation detail, such as common libraries that have been statically linked into the program. It is also important to account for differences in malicious files due to process changes, such as recompilation. One way to sift out these differences is to use algorithms that normalize address references (for example, the PIC algorithm) when selecting function bytes.

Malicious code may change over time, so particular functions may come and go. It is therefore better to select a number of functions to encode as signatures and derive the "best" signatures by surveying matching files and refining the analysis as the importance of the signatures is revealed. This method is the essence of family analysis, and is an important application of YARA to malware analysis.

Open Issues and Future Work

While YARA continues to gain in popularity, there aren't many guidelines on how to use it most effectively. The issue of false positives continues to vex malware analysts. For example, YARA signatures may match files in which the specified criteria exist and yet do not possess the same semantics as expressed in the original file.

Another issue with using YARA signatures is that they can be fragile in the face of changing codebases. A malware author can and does change his/her code to suit an ever-shifting set of goals, and keeping up with these changes is particularly challenging. While YARA signatures are a powerful means of capturing and communicating analyst insights, care must be taken that they do not drift too far away from the current reality of a particular malware family.

To address these open issues, our future work is to continue refining the process by which malware families may be most reliably identified. This work includes developing metrics for the best type of criteria to define families, measuring the resilience of these criteria in the face of changes, and evaluating the cost of developing the criteria vs. the cost to change the malware. We are also developing techniques to prioritize malware analysis based on such metrics, as well as triage systems to associate related files by automatically producing and testing high-quality signatures.

Additional Resources:

Function Hashing for Malicious Code Analysis, CERT Research Report, pp 26-29

Get updates on our latest work.

Each week, our researchers write about the latest in software engineering, cybersecurity and artificial intelligence. Sign up to get the latest post sent to your inbox the day it's published.

Subscribe Get our RSS feed