Semantic Code Analysis for Malware Code Deobfuscation

In 2012, Symantec blocked more than 5.5 billion malware attacks (an 81 percent increase over 2010) and reported a 41 percent increase in new variants of malware, according to January 2013 Computer World article. To prevent detection and delay analysis, malware authors often obfuscate their malicious programs with anti-analysis measures. Obfuscated binary code prevents analysts from developing timely, actionable insights by increasing code complexity and reducing the effectiveness of existing tools. This blog post describes research we are conducting at the SEI to improve manual and automated analysis of common code obfuscation techniques used in malware.

Obfuscation Example and Impact on Malware Analysis

When shown the following example of obfuscated assembly code, an experienced malware analyst at CERT took 550 seconds (more than 9 minutes) to determine its basic functionality.

While this example was created explicitly for this demonstration, it shows several techniques commonly encountered in malware and represents a realistic scenario. In particular, the net effect of this code is equivalent to a single instruction that is easily understood:

In the Department of Defense (DoD) and other large analysis environments, several hundred-fold increases in the effort required to analyze malware raises computer and network defense costs and delays the necessary insights from being generated within reasonable time and budget constraints.

Our Approach to Deobfuscation

Deobfuscation is a method for reverse-engineering obfuscated code. Our approach to the deobfuscation problem applies multiple semantic transformations to an obfuscated program to produce a new program that is functionally equivalent to the obfuscated program but easier to analyze. Some example obfuscation techniques and rough descriptions of the transformations include

Complex hexadecimal arithmetic. Malware authors frequently add extraneous addition, subtraction, and bitwise logical operations to hide important program addresses and constants. When possible, the instructions performing these operations should be replaced with immediate values in simpler instructions.
Stack pointer abuses. Malware authors will often have many unnecessary PUSH and POP instructions to make the code harder to reason about. When used in conjunction with CALL and RET instructions, control flow can be obscured as well. Such instructions should be removed to minimize stack changes.
Control flow obfuscation. Unnecessary jumps in intra-procedural flow control make it hard for a human analyst to follow overall program flow control. Unconditional and always-true conditional JMP instructions should be removed.
Dead stores. Instructions that write to registers and memory that are never subsequently read do not contribute meaningfully to the net effect of the function. Instructions performing dead stores can be detected by definition and usage analysis and subsequently removed.

These types of transformations may be run in multiple passes and in varying orders to make it easier to defeat obfuscations that involve the application of techniques in sequence. A modular approach to the transformation should improve code maintenance and simplify the addition of new techniques.

Prototype Deobfuscation Tool

Early in this research effort, we used the ROSE compiler infrastructure to build a functioning prototype of a general deobfuscation tool that was specifically targeted at removing dead code. ROSE provides facilities for disassembly, instruction emulation, and control flow graph representation. We have applied ROSE in our prior research on Using Machine Learning to Detect Malware Similarity and Semantic Comparison of Malware Functions.

In our current effort, we expanded upon the core ROSE capabilities by adding definition and usage analysis, dead code elimination, and other deobfuscation-specific techniques. We also implemented a rudimentary method for generating new executables as an output format by building on ROSE's assembly features. This prototype was tested against several members of the Ramnit family, using a transformation to eliminate dead code.

Applying Our Prototype Tool to String Deobfuscation

While working on the prototype tool, our team was presented with an operational need to combat a specific type of string obfuscation. String obfuscation is used by various malware families to prevent important strings from being easily recovered from the executable. This technique involves moving immediate bytes into a local stack variable to construct a normal C-style null terminated string.

By emulating instructions and collecting memory writes to the stack, we were able to extract the deobfuscated strings from the code. This transformation allowed us to automatically process many more files than would have been possible using a manual approach. The deobfuscated strings assisted in the development of a malware family report. We also produced a catalog of common obfuscation techniques, with specific reference to the addresses of several files demonstrating each of the techniques.

At this point in our research, we had demonstrated a working prototype and operational relevance, but needed to implement more transformations to have a complete and generally applicable tool. The primary question we still faced was "which transformations should be implemented next?" We decided to conduct an obfuscation technique prevalence study to gain some basic facts about the prevalence and distribution of the techniques in our catalog.

Analyzing the Prevalence and Distribution of Obfuscation Techniques

We chose to implement six tests for our initial study. Each test looked for the occurrence of a common obfuscated code pattern, such as a dead store, an opaque predicate, or an obfuscated control flow. We counted how often these patterns appeared in several datasets, containing a total of 150,000 malware files. The number of detected obfuscations was fairly low, with only about 12 percent of functions having a detected pattern, and we detected only a single test in most of those functions.

This distribution suggested we might be encountering an occasional single false positive in many functions, while truly obfuscated functions were routinely positive for many detections in several tests. Adjusted to account for these probable false positives, the percentage of functions in which we detected obfuscated code declined to about 1 percent. Using the adjusted criteria, only about 23 percent of files contain an obfuscated function.

Some of the likely false positives filtered by the revised criteria appear to have been caused by incorrect disassembly. In several cases we examined manually, we found that arbitrary bytes had generated nonsense instructions that legitimately contained the obfuscated code patterns that we were searching for. It is hard to build firm program analysis conclusions on unreliable disassembly foundations, so we've started a new research effort to develop improved disassembly methods and objective metrics for the assessment of disassembly correctness.

Collaborations

Charles Hines from the CERT Division and Wesley Jin, a Ph.D. candidate in CMU's Department of Electrical and Computer Engineering, have been actively involved in this research effort. Sagar Chaki and Arie Gurfinkel, both senior members of the technical staff in the Software Solutions Division have contributed significantly as well.

Additional Resources

To read more about this research and the results of other exploratory research projects, download the SEI technical report, Results of SEI Line-Funded Exploratory New Starts Projects: FY 2012, at
https://resources.sei.cmu.edu/library/asset-view.cfm?assetID=57592

Software Engineering Institute

SEI Blog