
Big-Data Malware: Preparation and Messaging

Headshot of Brent Frye

Part one of this series of blog posts on the collection and analysis of malware and storage of malware-related data in enterprise systems reviewed practices for collecting malware, storing it, and storing data about it. This second post in the series discusses practices for preparing malware data for analysis and issues related to messaging between big data framework components.

To those familiar with processing data and data warehousing, the extract, transform, load (ETL) process is probably already well known. The preparation module is where "transform" begins, possibly to be extended into the analysis module.

Data Cleaning and Conditioning

Some standard data-cleaning and conditioning steps that can be performed within the preparation module include

  • data validation (e.g., checksum validation, formatting, etc.)
  • cleansing (e.g., removing or correcting bad records or fields)
  • addressing missing values and clerical errors
  • deduplication
  • schema transformation and standardization
  • outlier identification/removal
  • reformatting/normalizing
  • encapsulating business rules
  • handling special values
  • handling overloaded database fields

Data validation in most datasets involves making sure each field for each data row is in the proper format. Malware analysis is adversarial in nature, so we must be even more vigilant for malicious data. SQL injection, command injection, and cross-site scripting attacks are easy to incorporate in something normally innocent such as a file name. For example, "readme.txt; rm -rf *" is a valid filename that, if not handled correctly by a shell script that uses it, can lead to the removal of data. However tempting it might be to remove this record because of potential future harm, if the goal of your collection is to accurately reflect the data provided, you will need to handle maliciously formed records as well as the typical ones. The OWASP Guide Project and the SEI CERT Coding Standards can help provide guidance to avoid these and other similar data-poisoning attacks.
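As a minimal sketch of handling such a filename safely, the Python example below keeps the hostile name intact as data instead of dropping the record. It assumes two common defenses: quoting with shlex.quote when a shell command string is unavoidable, and passing the name as a single argument-list element so that no shell ever parses it. The function names and the child command are hypothetical and chosen only for illustration.

```python
import shlex
import subprocess
import sys


def quoted_command(filename: str) -> str:
    """Build a shell command string in which the filename is one quoted word."""
    # "--" tells ls that everything after it is a file name, not an option.
    return "ls -l -- " + shlex.quote(filename)


def echo_argument(filename: str) -> str:
    """Pass the filename as a single argv element; no shell is involved."""
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.argv[1])", filename],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.rstrip("\n")


hostile = "readme.txt; rm -rf *"
print(quoted_command(hostile))
print(echo_argument(hostile))
```

In both cases the semicolon and the wildcard reach the child process as ordinary characters of the name, so the record can be stored exactly as it was provided.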

Cleansing is removing or correcting bad records or fields found during the data-validation process. The data may not be bad; it may just be unexpected. For example, the filename might be in a character set that doesn't present well in ASCII, or it might contain bytes that are invalid if stored directly as UTF-8.

Your predefined business rules should help determine what to do with records that are malformed. You might need a process to convert to a different character set or to convert all names into hexadecimal values to address these issues. Alternatively, it might be desirable to remove the field or record if there is not enough value in the remaining data to keep it. If expected values are missing, then either the business rules must provide a method to acquire or regenerate the data, the record can be kept in its incomplete state, or the record can be dropped.
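The hexadecimal option can be sketched in a few lines of Python. This example assumes that filenames arrive as raw bytes and that the business rule is to store a lossless hex form alongside a readable display form; the function names are hypothetical.

```python
def filename_to_hex(raw_name: bytes) -> str:
    """Encode the raw filename bytes as hexadecimal text (lossless)."""
    return raw_name.hex()


def hex_to_filename(hex_name: str) -> bytes:
    """Recover the original filename bytes from the hexadecimal form."""
    return bytes.fromhex(hex_name)


def display_name(raw_name: bytes) -> str:
    """Decode for display, escaping bytes that are not valid UTF-8."""
    return raw_name.decode("utf-8", errors="backslashreplace")


# A Latin-1 encoded name: the byte 0xE9 is not valid UTF-8 on its own.
raw = b"caf\xe9.exe"
print(filename_to_hex(raw))
print(display_name(raw))
```

The hex form can be stored in any text column without character-set concerns, and the original bytes remain recoverable for later analysis.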

Handling duplicate records was mentioned in Part One in the discussion of collection and storage, but it shows up here as a typical step in the preparation phase.

Schema transformation and standardization is a typical ETL action: transforming the data from the input format into the format of the system where you are storing the data. This transformation may involve normalization of the data by splitting the data into multiple tables that are connected by means of a foreign key. However, most big-data systems eschew data normalization in favor of keeping all data for a record within the record as a logical document. Many NoSQL systems do not have a fast SQL JOIN method to merge normalized data back into a single record.

Outliers are data values that fall outside some predefined range. There's nothing inherently wrong with including outliers, but they may indicate there is a problem with the file compared with more typical data. In the case of malware, two common outlier ranges are the filesize and the archival/obfuscation depth. If a file is too big or too small to be interesting, or is not a type that the organization wants to handle, then it is an outlier. Similarly, if the file is an archive that contains another archive, which contains another archive, and so on, then at some point the archives must stop, but you need to decide how deep you want to go and what you want to do with the intermediate archives. With malware, there are obfuscation techniques that can be used to make an executable harder to analyze. UPX is a commonly used packer for this purpose, and just as a file may be several layers deep in an archive, a packer may be used multiple times to further complicate unpacking.
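One way to apply such ranges is to flag records instead of discarding them, leaving the decision to the business rules. The Python sketch below does this under stated assumptions: the thresholds, the record fields, and the flag names are all hypothetical, and an organization would choose its own values.

```python
from dataclasses import dataclass

# Hypothetical thresholds; each organization sets its own ranges.
MIN_SIZE_BYTES = 64
MAX_SIZE_BYTES = 100 * 1024 * 1024
MAX_NESTING_DEPTH = 5


@dataclass
class SampleRecord:
    sha256: str
    size_bytes: int
    nesting_depth: int  # number of archive or packer layers removed so far


def outlier_flags(record: SampleRecord) -> list:
    """Return the outlier conditions a record meets, without dropping it."""
    flags = []
    if record.size_bytes < MIN_SIZE_BYTES:
        flags.append("too_small")
    if record.size_bytes > MAX_SIZE_BYTES:
        flags.append("too_large")
    if record.nesting_depth > MAX_NESTING_DEPTH:
        flags.append("too_deep")
    return flags


print(outlier_flags(SampleRecord("0" * 64, 10, 9)))
```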

Special values appear in many forms. In some datasets, something like "-1" (when real results are always non-negative) or "\N" or "None" may indicate that the data value is empty. There are certain "magic values" that help determine the type of file being examined. Portable executable (PE) files, the kind typically run on a Microsoft Windows OS, have a timestamp representing compile time. One particular special value that should be considered when dealing with time duration is "0". Any activity performed on a computer system takes non-zero time, but if the smallest time unit available is the second, then a "0" for this time may cause problems later.
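A short Python sketch of special-value handling follows. It assumes a feed in which "-1", "\N", "None", and the empty string all mean "no value", and it illustrates one problem a zero duration can cause, namely division by zero when computing a rate. The marker set and function names are hypothetical.

```python
# Markers that this hypothetical feed uses for "no value". Treating "-1" as
# missing is only safe because real counts here are never negative.
MISSING_MARKERS = {"", "-1", "\\N", "None"}


def normalize_count(raw: str):
    """Convert a raw text field to an int, or None when it marks a missing value."""
    text = raw.strip()
    if text in MISSING_MARKERS:
        return None
    return int(text)


def bytes_per_second(byte_count: int, duration_seconds):
    """Return a rate, or None when the duration is missing or rounded down to 0."""
    if duration_seconds is None or duration_seconds == 0:
        return None
    return byte_count / duration_seconds


print(normalize_count("\\N"), normalize_count("42"))
print(bytes_per_second(100, 0), bytes_per_second(100, 4))
```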

Finally, database fields may be overloaded, so that the value in one column may have a different meaning depending on the value of another column. This overloading can happen in an attempt to save space in the output format. As long as the conditions are known and addressed in the business logic of the preparation phase, overloading shouldn't present a problem.
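To make the idea concrete, the Python sketch below decodes a hypothetical feed in which a single "detail" column holds a member count when the filetype is "archive" and a packer name when the filetype is "executable". The column names and rules are assumptions for illustration only.

```python
def decode_detail(row: dict) -> dict:
    """Split an overloaded 'detail' column into explicitly named fields."""
    record = {"sha256": row["sha256"], "filetype": row["filetype"]}
    if row["filetype"] == "archive":
        # For archives, 'detail' carries the number of member files.
        record["member_count"] = int(row["detail"])
    elif row["filetype"] == "executable":
        # For executables, 'detail' carries a packer name (empty if none).
        record["packer"] = row["detail"] or None
    else:
        # Unknown conditions are surfaced rather than silently misread.
        raise ValueError(f"no rule for filetype {row['filetype']!r}")
    return record


print(decode_detail({"sha256": "0" * 64, "filetype": "archive", "detail": "12"}))
```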

While not all these steps may be relevant to a particular data set, more are relevant to malware data than might be immediately apparent. If all you are able to collect from a source is the malware file and a cryptographic hash, you might think that many of these don't apply. But there are many tasks that can be performed in consideration of business rules. For example, every file has non-changing characteristics that can be acquired, including filesize, filetype, mimetype, and hashes (e.g., MD5, SHA1, SHA256, and ssdeep are common).
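Several of these characteristics can be derived with the Python standard library alone, as the sketch below suggests. The filetype check is a deliberately small assumption: it looks only at a handful of well-known leading bytes, where real identification tools use far larger rule sets. ssdeep is omitted because it requires a third-party library, and the label strings are hypothetical.

```python
import hashlib

# A few well-known leading byte sequences ("magic values").
MAGIC_PREFIXES = [
    (b"MZ", "dos-or-pe-executable"),
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip"),
    (b"\x1f\x8b", "gzip"),
]


def guess_filetype(content: bytes) -> str:
    """Guess a coarse filetype from the first bytes of the content."""
    for prefix, label in MAGIC_PREFIXES:
        if content.startswith(prefix):
            return label
    return "unknown"


def file_characteristics(content: bytes) -> dict:
    """Collect characteristics that never change for a given file."""
    return {
        "size_bytes": len(content),
        "filetype": guess_filetype(content),
        "md5": hashlib.md5(content).hexdigest(),
        "sha1": hashlib.sha1(content).hexdigest(),
        "sha256": hashlib.sha256(content).hexdigest(),
    }


print(file_characteristics(b"MZ\x90\x00"))
```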

Additional Preparation Steps

In addition to cleaning and conditioning the data, the preparation module may include efforts to assist future data-analysis efforts. Such tasks may include

  • data aggregation
  • surface analysis and hashing
  • indexing to support fast lookup
  • archive extraction

Data Aggregation

Data aggregation involves merging data sets, possibly from different data providers, to enhance the data set beyond what each original data source provided. For example, it is possible to use a hash (typically MD5) of a malware file to look up results from anti-virus scans performed by a third party, or to see if the file is associated with a CVE, which would indicate that it is associated with a known vulnerability. This aggregation is a form of enrichment, and in this phase enrichment is usually limited to simple lookups or counting features. Further enrichment may be performed during the analysis phase as some analysis processes are completed.
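A simple lookup-and-count enrichment of this kind might look like the following Python sketch. The record layout, the anti-virus result table, and the detection labels are invented for illustration; a real deployment would query a third-party service or a local copy of its data.

```python
def enrich(records: list, av_results: dict) -> list:
    """Attach third-party scan results to each record, keyed by MD5."""
    enriched = []
    for record in records:
        detections = av_results.get(record["md5"], [])
        enriched.append(
            {
                **record,
                "av_detections": detections,
                "av_detection_count": len(detections),
            }
        )
    return enriched


# Hypothetical inputs: two collected samples and one provider's scan results.
records = [
    {"md5": "a" * 32, "size_bytes": 1024},
    {"md5": "b" * 32, "size_bytes": 2048},
]
av_results = {"a" * 32: ["Vendor1:Trojan.Generic", "Vendor2:Malicious"]}

for row in enrich(records, av_results):
    print(row["md5"][:8], row["av_detection_count"])
```

Records with no match keep an empty detection list, so a missing lookup result is distinguishable from a record that was never enriched.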

Surface Analysis and Hashing

Surface analysis collects information about the file that is relatively easy to determine and that does not require executing the files. While some might think that all analysis should happen during the analysis phase, surface analysis is a form of enrichment that doesn't take many resources, time, or CPU cycles. The information it provides is limited to an identification of what a file is.

Hashing algorithms, such as MD5, SHA1, and SHA256, identify a malware file according to its entire contents; these hashes will change if even a single bit is changed. A fuzzy hashing system, such as ssdeep, can discern that a file is close to some other file. The filetype and mimetype are usually identified by heuristics based on what a file looks like, and the filesize never changes for a given file, so filetype, mimetype, and filesize all provide useful information.
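The sensitivity of cryptographic hashes to a single bit can be demonstrated with a short Python sketch. The sample bytes and the helper name are hypothetical.

```python
import hashlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    changed = bytearray(data)
    changed[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(changed)


original = b"sample contents"
modified = flip_bit(original, 0)

# Same size, one bit of difference, yet unrelated digests.
print(len(original) == len(modified))
print(hashlib.sha256(original).hexdigest())
print(hashlib.sha256(modified).hexdigest())
```

This is why an exact hash is a good identifier but a poor similarity measure, and why a fuzzy hash is useful as a complement.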

Looking for known strings can also be included here. For example, a tag can be added to note the existence of strings of interest. Pushing the definition a bit are actions like section hashing, where each section in a PE executable file (or other files that contain clearly defined sections, like PDF) is hashed individually.
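Tagging for known strings can be sketched as a plain byte search, as in the Python example below. The tag names and the list of strings are assumptions chosen for illustration; an organization would maintain its own list.

```python
# Hypothetical strings of interest, keyed by the tag to record when found.
STRINGS_OF_INTEREST = {
    "upx_marker": b"UPX!",
    "http_url": b"http://",
    "cmd_exe": b"cmd.exe",
}


def tag_strings(content: bytes) -> list:
    """Return the sorted tags whose byte string occurs in the content."""
    return sorted(
        tag for tag, needle in STRINGS_OF_INTEREST.items() if needle in content
    )


sample = b"\x00\x00UPX!\x00\x00http://example.invalid/payload"
print(tag_strings(sample))
```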


Indexing to Support Fast Lookup

After preparation is completed, the next step typically involves searching the data that have been collected for information of interest. When looking for a specific file or set of files, a list of MD5s or SHA256s can be presented to the storage system or searched in a database that stores metadata from acquisition or from surface analysis. It is useful to index commonly queried fields to reduce the search time for that data. The tradeoff with any index is that adding new records will take slightly longer and require additional disk space, but the improvements to search speed are generally worth the cost.
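As a small illustration of indexing a commonly queried field, the Python sketch below uses an in-memory SQLite database as a stand-in for the metadata store. The table layout, index name, and sample hashes are hypothetical, and a production system would likely use a different database.

```python
import sqlite3


def build_store(rows: list) -> sqlite3.Connection:
    """Create a metadata table with an index on the commonly queried hash."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE samples (sha256 TEXT, size_bytes INTEGER, filetype TEXT)"
    )
    conn.execute("CREATE INDEX idx_samples_sha256 ON samples (sha256)")
    conn.executemany("INSERT INTO samples VALUES (?, ?, ?)", rows)
    return conn


def lookup(conn: sqlite3.Connection, hashes: list) -> list:
    """Fetch the metadata rows for a list of SHA256 values."""
    placeholders = ", ".join("?" for _ in hashes)
    query = (
        "SELECT sha256, size_bytes, filetype FROM samples "
        f"WHERE sha256 IN ({placeholders}) ORDER BY sha256"
    )
    return conn.execute(query, list(hashes)).fetchall()


def uses_index(conn: sqlite3.Connection) -> bool:
    """Ask SQLite whether a lookup by hash would use the index."""
    plan = conn.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM samples WHERE sha256 = ?", ("x",)
    ).fetchall()
    return any("idx_samples_sha256" in row[-1] for row in plan)


store = build_store([("a" * 64, 10, "zip"), ("b" * 64, 20, "pdf")])
print(lookup(store, ["b" * 64]))
print(uses_index(store))
```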

Metadata indexing is not always a workable solution, especially when searching for arbitrary strings in files that may not be known at the time of collection. When there are hundreds of millions of files, as in the results of a network scan infecting thousands of files, full-text searching may be the only option. For example, if there is a long string that might be an array of data used for encryption or for computing a hash, or something unique to a particular family of malware, the files could be searched for that string.

Searching of this kind by using something like the UNIX command grep would take days or even weeks to go through terabytes or hundreds of terabytes of data. At CERT, we created a system called CERT BigGrep that indexes files so that searches for specific strings run more quickly. BigGrep uses a probabilistic N-gram-based approach to balance index size and search speed. On a single host, this index can reduce search time to minutes or hours; if multiple processes perform the search in parallel, this time can be correspondingly reduced.
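The general idea behind N-gram indexing can be shown with a toy Python sketch. This is not BigGrep's implementation: it assumes an exact, in-memory trigram index rather than a probabilistic one, and it keeps file contents in memory only so that candidates can be verified in the example. The index narrows the search to files that contain every trigram of the search string, and only those candidates are then scanned in full.

```python
from collections import defaultdict

N = 3  # n-gram length used by this toy index


def ngrams(data: bytes, n: int = N) -> set:
    """Return the set of all n-byte substrings of data."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


class NgramIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # n-gram -> ids of files containing it
        self.files = {}                   # file id -> content (demo only)

    def add(self, file_id: str, content: bytes) -> None:
        self.files[file_id] = content
        for gram in ngrams(content):
            self.postings[gram].add(file_id)

    def candidates(self, needle: bytes) -> set:
        """Files containing every n-gram of the needle (may include false hits)."""
        grams = ngrams(needle)
        if not grams:
            return set(self.files)  # needle too short to use the index
        return set.intersection(*(self.postings.get(g, set()) for g in grams))

    def search(self, needle: bytes) -> list:
        """Verify each candidate with a full scan to remove false hits."""
        return sorted(f for f in self.candidates(needle) if needle in self.files[f])


index = NgramIndex()
index.add("a", b"xxabcdxx")
index.add("b", b"abc--bcd")  # has both trigrams of "abcd" but not the string
index.add("c", b"nothing")
print(sorted(index.candidates(b"abcd")), index.search(b"abcd"))
```

File "b" shows why the verification pass is needed: an N-gram index can report candidates that contain all the pieces of the search string without containing the string itself.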

Archive Extraction

Archive extraction is particular to file data. Many users are familiar with file archives that have been collected and compressed to make it easy to transfer them. These files are often called "tarballs" or "zip files" and may include one or more files merged and compressed in a way that a standard tool can reconstruct the files to their original formats.
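Recursive extraction with a depth limit can be sketched as follows. The Python example assumes ZIP archives only and uses the standard zipfile module on in-memory data; the function names and the depth limit are hypothetical. A production extractor would also need limits on total extracted size, since archives can be crafted to expand enormously.

```python
import io
import zipfile


def make_zip(members: dict) -> bytes:
    """Build an in-memory ZIP archive from a name -> content mapping."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, content in members.items():
            archive.writestr(name, content)
    return buffer.getvalue()


def extract_nested(data: bytes, max_depth: int, depth: int = 0, path: str = ""):
    """Yield (path, depth, content) for each item once unpacking stops.

    Archives are opened until max_depth is reached; anything at that depth,
    archive or not, is yielded as-is so the caller can decide what to do.
    """
    if depth < max_depth and zipfile.is_zipfile(io.BytesIO(data)):
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            for info in archive.infolist():
                if info.is_dir():
                    continue
                yield from extract_nested(
                    archive.read(info), max_depth, depth + 1,
                    f"{path}/{info.filename}",
                )
    else:
        yield path, depth, data


inner = make_zip({"payload.bin": b"hello"})
outer = make_zip({"inner.zip": inner})
for item_path, item_depth, content in extract_nested(outer, max_depth=5):
    print(item_path, item_depth, len(content))
```

The depth recorded for each item is the same value that an outlier rule on archival depth would inspect.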

In addition to compression, some malware is packed to obfuscate what it does. Before the malware can be analyzed or reverse engineered, the system needs to have some way of unpacking it so that its behavior can be observed and mitigated. In a compressed state, it is harder to reverse engineer malware. UPX is the most prevalent packer for malware. It is a simple matter to use upx -d to restore the original file, although it is possible that the file may have been packed more than once.

Messaging Module

The messaging module is one of the Big Data Framework Provider modules that form the backbone of the Big Data system. It is involved with reliably queuing, transmitting, and delivering data between the other modules, or to various system components used by a module. Big data solutions consist of collections of different products that are often distributed and decentralized, and the nature of big data requires handling of a large volume of data. It is therefore important that the underlying messaging system perform optimally and remain resilient in the face of errors.

The point-to-point transfer model and the store-and-forward model are two of the most common architectural patterns used for messaging. One important detail with big data systems is that you should always assume something will go wrong with transmission. Therefore, some form of delivery guarantee and error handling is essential.

With a point-to-point transfer system, data can flow either as a push or a pull from the source component to the target component. The difference is in which component initiates the transfer transaction. With a data push, the source component determines which target component to send the data to; if there is a choice then it often involves either sharding the data using a key or using a basic load balancing system. With a data pull, the target component(s) request the next data object(s) from the source.
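The difference between push and pull can be sketched in Python as follows. The example assumes that records are sharded by a hash of their SHA256 field for the push case and that the source exposes a simple in-process queue for the pull case; the function names and record layout are hypothetical, and real components would communicate over a network.

```python
import hashlib
import queue


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard number that is stable across runs and hosts."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def push(records: list, targets: list) -> None:
    """Push: the source picks the target for each record by sharding on a key."""
    for record in records:
        targets[shard_for(record["sha256"], len(targets))].append(record)


def pull(source: queue.Queue, batch_size: int) -> list:
    """Pull: the target asks the source for up to batch_size items."""
    items = []
    while len(items) < batch_size:
        try:
            items.append(source.get_nowait())
        except queue.Empty:
            break
    return items


targets = [[], [], []]
push([{"sha256": "aa"}, {"sha256": "bb"}, {"sha256": "aa"}], targets)
print([len(t) for t in targets])

source = queue.Queue()
for number in range(5):
    source.put(number)
print(pull(source, 3), pull(source, 3))
```

Sharding on a stable key sends every record with the same key to the same target, while pulling lets each target set its own pace.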

The store-and-forward pattern works similarly. However, it adds an intermediate broker system to mediate the data transfer. A publication and subscription model (aka pub/sub) or multicast system can be used for data transfer as long as support for recovery (retransmit) in the case of a failure is included.
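A minimal sketch of a store-and-forward broker with recovery is shown below in Python. It assumes an at-least-once delivery scheme in which the broker keeps each delivered message until the consumer acknowledges it and requeues anything left unacknowledged; the class and method names are hypothetical, and real brokers add persistence, timeouts, and multiple subscribers.

```python
from collections import deque


class Broker:
    """Toy in-memory broker: store, forward, and retransmit unacknowledged messages."""

    def __init__(self):
        self.pending = deque()  # messages stored and waiting for delivery
        self.in_flight = {}     # delivered but not yet acknowledged
        self.next_id = 0

    def publish(self, payload) -> int:
        """Store a message from a producer and return its id."""
        message_id = self.next_id
        self.next_id += 1
        self.pending.append((message_id, payload))
        return message_id

    def deliver(self):
        """Forward the next message, remembering it until it is acknowledged."""
        if not self.pending:
            return None
        message_id, payload = self.pending.popleft()
        self.in_flight[message_id] = payload
        return message_id, payload

    def ack(self, message_id: int) -> None:
        """The consumer confirms successful processing."""
        self.in_flight.pop(message_id, None)

    def requeue_unacked(self) -> None:
        """Recovery step: schedule every unacknowledged message for redelivery."""
        for message_id, payload in sorted(self.in_flight.items()):
            self.pending.append((message_id, payload))
        self.in_flight.clear()


broker = Broker()
broker.publish("sample-a")
broker.publish("sample-b")
first = broker.deliver()
second = broker.deliver()
broker.ack(first[0])      # "sample-a" was processed
broker.requeue_unacked()  # the consumer failed before acknowledging "sample-b"
print(broker.deliver())
```

Because a message can be delivered more than once under this scheme, consumers need to tolerate duplicates, which is one more reason the deduplication step matters.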

Now that I have reviewed practices for preparing malware files for analysis and supporting messaging between components, I'll turn next to analysis in part three, coming soon.

Additional Resources

Read the first post in this series, Big Data Malware: Collection and Storage.

Watch my webinar Building and Scaling a Malware Analysis System.

Listen to the podcast DNS Blocking to Disrupt Malware.

Read other blog posts about malware.
