Big-Data Malware: Collection and Storage

Brent Frye

The growth of big data has affected many fields, including malware analysis. Increased computational power and storage capacities have made it possible for big-data processing systems to handle the increased volume of data being collected. Alongside large-scale collection, new ways of analyzing and visualizing malware have been developed. In this blog post, the first in a series on using a big-data framework for malware collection and analysis, I will review various options and tradeoffs for dealing with malware collection and storage at scale.


Since 2001, the CERT Division of the Software Engineering Institute (SEI) has collected malware in a repository called the Artifact Catalog. This data supports malware analysis research that helps government sponsors understand the threats posed by individual malware samples, as well as families of malicious code.

In 2005, after a few years of gentle growth, the volume of data collected in the Artifact Catalog began growing at an exponential rate. We quickly went from collecting hundreds of files a week to collecting millions of files a week. Manual collection and manual analysis were no longer feasible because these methods just didn't scale. Such rapid growth fueled efforts by the SEI to automate our malware collection and analysis processes.

Although the primary focus of these posts is on malware, most concepts presented here apply to any type of file collection, including images, sound recordings, video, security logs, or text.

Big Data Platform

Several reference architectures exist for big-data frameworks, including NIST SP 1500-6. An extension of the NIST framework was described in a 2016 paper entitled A Reference Architecture for Big Data Systems in the National Security Domain by John Klein at the SEI and others. This paper organized various components into three categories: application provider, framework provider, and cross-cutting modules.

The application-provider modules provide application-level business logic and functionality, split into modules:

  • collection
  • preparation
  • analysis
  • visualization
  • access
  • application orchestration

The framework-provider modules include the software middleware, storage, and compute platform that are used by the application provider; these modules are defined as:

  • processing
  • messaging
  • data storage
  • infrastructure

The third group includes cross-cutting modules that address concerns that impact nearly all of the other modules:

  • security
  • management
  • federation

The activities covered by the various big-data platform modules may already exist within traditional data-processing systems. The data volume, velocity, variety, and variability are what set the big-data platform activities apart.

File Collection

In the big-data platform, the collection module is an application-provider module that handles the interface with the data provider.

There are a number of ways to acquire malware for analysis, which are subject to organizational policies on email services, file downloads, and web hosting:

  • Set up and monitor a honeypot or a honeynet to collect data.
  • Import all incoming files quarantined as malware by the enterprise anti-virus system.
  • Solicit user submissions, either from a submission form or from requested email submission.
  • Sign up with a commercial vendor that provides malware samples.

When collecting malware, no matter the source, you need a clear understanding of what to do under all possible circumstances. Likewise, when dealing with big data, there are many possibilities you must account for. For example, what do you do if you get the same sample multiple times from the same source? Should you keep a record of all attempts to acquire the file or only the first? Should the same file acquired from another source be treated as a new file, or should the earlier insertion be treated as the authoritative record, with a note added that the new source also submitted the file? Recording that the same file was collected from multiple independent sources also allows statistical methods to approximate the size of the malware population.
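One way to answer these questions is to keep a single authoritative record per file and log every sighting against it. The sketch below does that and then applies Chapman's variant of the Lincoln-Petersen capture-recapture estimator to two sources. The class, the function names, and the choice of estimator are assumptions made for illustration, not the method used for the Artifact Catalog. The estimate is only meaningful if the two sources really are independent.

```python
from collections import defaultdict


class SightingLog:
    """Keeps one entry per file hash plus every sighting of that hash."""

    def __init__(self):
        # sha256 hex digest -> list of (source, timestamp) sightings
        self.sightings = defaultdict(list)

    def record(self, sha256, source, timestamp):
        """Log a sighting; return True if the hash is new to the collection."""
        is_new = sha256 not in self.sightings
        self.sightings[sha256].append((source, timestamp))
        return is_new

    def hashes_from(self, source):
        """Distinct hashes seen at least once from the given source."""
        return {h for h, seen in self.sightings.items()
                if any(s == source for s, _ in seen)}


def chapman_estimate(log, source_a, source_b):
    """Estimate population size from the overlap between two sources."""
    a = log.hashes_from(source_a)
    b = log.hashes_from(source_b)
    overlap = len(a & b)
    # Chapman's form stays defined even when the overlap is zero.
    return (len(a) + 1) * (len(b) + 1) / (overlap + 1) - 1
```

Here a repeat submission never creates a second file record, but no sighting is discarded, so the decision about which attempts matter can be deferred to analysis time.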

What about partial records? It is possible that a file submission is either incomplete or damaged. Should the partial data be treated as a full record? Should there be a record of the collection attempt and the problem experienced during collection? Just how persistent should a bad record be?
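If the source supplies an expected size or hash (an assumption; not every source does), a collection attempt can be classified before anything is stored, and the problem can be kept alongside the attempt. The following is a minimal sketch with hypothetical field and status names.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class Attempt:
    source: str
    status: str                    # "complete", "truncated", or "corrupt"
    sha256: str                    # hash of the bytes actually received
    problem: Optional[str] = None  # human-readable note for bad records


def classify(data, source, expected_size=None, expected_sha256=None):
    """Compare received bytes with whatever the source said to expect."""
    digest = hashlib.sha256(data).hexdigest()
    if expected_size is not None and len(data) < expected_size:
        note = f"got {len(data)} of {expected_size} bytes"
        return Attempt(source, "truncated", digest, note)
    if expected_sha256 is not None and digest != expected_sha256.lower():
        note = "hash does not match source-supplied value"
        return Attempt(source, "corrupt", digest, note)
    # No supplied expectation was violated.
    return Attempt(source, "complete", digest)
```

Whether a "truncated" or "corrupt" attempt is stored, retried, or expired after some period remains a policy decision; the sketch only makes the condition visible.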

Storing the malware and dealing with record duplication and truncation both assume that the collection system can keep up with the stream of malware coming from the source. Ideally, the source allows the collection system to run multiple parallel processes, possibly on multiple hosts, that acquire files and metadata and maximize bandwidth utilization. Alternatively, if the process is forced to use a single thread for part of the acquisition (e.g., a periodic archive that contains all of the files for a specified period), follow-up collection processes may be parallelized to improve the rate at which records are extracted from that archive and added to the collection system.
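The periodic-archive case can be sketched as a single-threaded read stage feeding a pool of workers. The example below assumes a ZIP archive held in memory and uses hypothetical function names; a real collector would more likely stream from disk and bound how many files are in flight at once.

```python
import hashlib
import io
import zipfile
from concurrent.futures import ThreadPoolExecutor


def iter_members(archive_bytes):
    """Single-threaded stage: yield (name, data) for each archived file."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        for info in zf.infolist():
            if not info.is_dir():
                yield info.filename, zf.read(info)


def ingest(item):
    """Parallel stage: hash one file and build its collection record."""
    name, data = item
    return {
        "sha256": hashlib.sha256(data).hexdigest(),
        "source_name": name,
        "size": len(data),
    }


def ingest_archive(archive_bytes, workers=4):
    """Read the archive in this thread, then fan the files out to workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so records line up with the archive.
        return list(pool.map(ingest, iter_members(archive_bytes)))
```

Threads are used here because hashing large buffers releases Python's global interpreter lock; for heavier per-file work, a process pool or multiple hosts would be the closer match to the parallelism described above.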

Tradeoffs exist for each method used for responding to collection issues. Only the collectors can adequately address these tradeoffs.

Acquisition Metadata

The process of collecting or acquiring malware files generates acquisition metadata: data about the collection of the sample that cannot be derived from the sample itself. It answers questions such as

  • What process collected it?
  • When did we get it?
  • Where or how did we get it (e.g., service, website)?
  • How many did we get?
  • Which samples were part of a set?

Stakeholders must decide what acquisition metadata will be retained during the collection process. If this metadata is not collected as part of the collection process, there is generally no way to reacquire it later. The tradeoff, however, is that not all acquisition metadata will be useful to all (or possibly any) analysts. If the decision is made to keep some or all of the possible acquisition metadata, treat this data with some suspicion: it is a possible attack vector, so standard input-validation methods should be applied.
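As a minimal sketch of what such a record and its validation might look like, the example below maps the questions above onto fields and rejects or sanitizes untrusted values before they are stored. The field names, the 256-character cap, and the specific checks are illustrative assumptions, not a complete validation policy.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timezone

SHA256_RE = re.compile(r"[0-9a-f]{64}")
MAX_TEXT = 256  # assumed cap on free-text field length


def clean_text(value, limit=MAX_TEXT):
    """Drop non-printable characters and truncate untrusted text."""
    if not isinstance(value, str):
        raise ValueError("expected text")
    return "".join(ch for ch in value if ch.isprintable())[:limit]


@dataclass(frozen=True)
class Acquisition:
    sha256: str            # which sample this record describes
    collector: str         # what process collected it
    acquired_at: datetime  # when we got it
    source: str            # where or how we got it
    batch_id: str          # which set the sample arrived in

    @classmethod
    def from_untrusted(cls, raw):
        """Build a record from a dict of unvalidated values."""
        sha256 = str(raw.get("sha256", "")).lower()
        if not SHA256_RE.fullmatch(sha256):
            raise ValueError("sha256 must be 64 hex characters")
        # fromisoformat raises ValueError on malformed timestamps.
        acquired_at = datetime.fromisoformat(str(raw.get("acquired_at")))
        if acquired_at.tzinfo is None:
            # Assumption: timestamps without an offset are UTC.
            acquired_at = acquired_at.replace(tzinfo=timezone.utc)
        return cls(
            sha256,
            clean_text(raw.get("collector")),
            acquired_at,
            clean_text(raw.get("source")),
            clean_text(raw.get("batch_id", "")),
        )
```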

In addition to these collection-time metadata elements, the data provider of the file samples may include additional metadata that may also require storing. Information such as the original file name, prior analysis results, anti-virus detections or other matched signatures, and other source-supplied data should be stored (or not) according to predefined rules. Just like the acquisition metadata, these data values may not be available after the file has been added to the collection, and the existence (or absence) of this supporting data might vary over time. Consequently, decisions about what to store and what to drop must be made up front. Failure to do so risks the loss of potentially important information or might result in the collection of information having no future value.

File Storage

It is possible to store metadata and files using the same system. It is often better, however, to treat them differently when dealing with large-scale systems. For example, use one system for files and another for structured (or unstructured) metadata. First, let's look at storing files.

The Cloud

One option is to use a cloud-based system to collect and store malware samples. Use of the cloud takes the network overhead away from internal facilities and may be the only way to gather all the malware to analyze in one place at an acceptable data rate. This option makes sense if the intended storage location is also the cloud-based system on which the analysis will be performed.

Cloud-based storage solutions are not for everyone, however. Security concerns arise whenever a system is not under full control of the organization or individual using it. Moreover, if the analysis is not done on the cloud system itself, moving files into and out of the storage system could increase latency and impose a performance penalty. In addition, there is often a cost associated with moving data into and out of a cloud-based system.

NAS and SAN Products

A variety of network-attached storage (NAS) and storage area network (SAN) products exist for large systems. Unfortunately, such systems generally aren't built for storing many small files, which is typically what is needed when storing malware. NAS and SAN systems also incur a performance penalty once they start getting full. Some systems, such as Gluster or NetApp, can be configured as either SAN or NAS, enabling different levels of quality of service.

Distributed Data Store

A third category for data storage is the distributed data store, which includes several options, such as Hadoop Distributed File System (HDFS), Ceph, and S3. One advantage of Hadoop is that the data-storage nodes also serve as compute nodes, so adding storage also adds processing capacity; processing time stays relatively consistent as the collection grows and may even improve. These systems, and others like them, have their own set of quirks, so there is no silver bullet that solves all data needs.

When not using cloud storage, server-room constraints can be a concern. Machine rooms and data centers have limits in terms of the square footage available, the power distribution, and cooling. These factors impose physical constraints on how much the size of a data collection can be increased over time.

Metadata Storage

When considering storage, system designs must account for how to store the information collected about the files (the metadata) in a way that allows users to access the processed results. Flat files can be useful for storing information as one or more files alongside the file being described. While getting information for a known file is easy with a flat-file system, finding data can be problematic after the system grows to hundreds of thousands of files, much less millions or billions.

A traditional relational database system may be a reasonable approach for storing this metadata. PostgreSQL or MySQL can handle hundreds of millions of rows, whether those rows describe a single file or all the files combined. Such an approach is valuable when the structure of the incoming data is already known and tables can be built around that structure.
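When the structure is known, the tables can be quite small. The sketch below uses sqlite3 from the Python standard library purely as a stand-in for PostgreSQL or MySQL, and the table and column names are hypothetical. It keeps one row per file and one row per acquisition, so repeat sightings add acquisitions without duplicating the file record.

```python
import sqlite3

SCHEMA = """
CREATE TABLE files (
    sha256 TEXT PRIMARY KEY,
    size   INTEGER NOT NULL
);
CREATE TABLE acquisitions (
    sha256      TEXT NOT NULL REFERENCES files(sha256),
    source      TEXT NOT NULL,
    acquired_at TEXT NOT NULL
);
CREATE INDEX acquisitions_by_source ON acquisitions(source);
"""


def open_db(path=":memory:"):
    """Create the schema in a new database and return the connection."""
    db = sqlite3.connect(path)
    db.executescript(SCHEMA)
    return db


def add_sighting(db, sha256, size, source, acquired_at):
    """First sighting creates the file row; repeats only add an acquisition."""
    # INSERT OR IGNORE is SQLite syntax; other engines spell this differently.
    db.execute("INSERT OR IGNORE INTO files VALUES (?, ?)", (sha256, size))
    db.execute(
        "INSERT INTO acquisitions VALUES (?, ?, ?)",
        (sha256, source, acquired_at),
    )


def files_from(db, source):
    """Distinct file hashes ever acquired from the given source."""
    rows = db.execute(
        "SELECT DISTINCT sha256 FROM acquisitions "
        "WHERE source = ? ORDER BY sha256",
        (source,),
    )
    return [row[0] for row in rows]
```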

Relational databases, however, are poorly suited for unstructured data, which is data that arrives in multiple variations, even if it is individually well-structured. A NoSQL solution is typically better at handling unstructured data because it provides different rows and labels for each data point without needing to know ahead of time what those labels are. Because a NoSQL store is not restricted to a predefined schema, new fields can be added to the database without additional maintenance effort.

Whatever solution you choose, the file name or primary key used to actually store the files must be easy to determine so that the file can be associated with its metadata. Relying on the source of the data to name the file can cause problems, especially if the naming scheme conflicts with some other source or uses some character set that is not wanted on the system. For that reason, the SHA-256 hash is often used as the key to store files.
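A minimal sketch of hash-keyed storage on a local file system follows. The function name and the two-level directory fan-out are assumptions for illustration; the fan-out is one common way to avoid very large directories, and it ignores concerns such as concurrent writers.

```python
import hashlib
from pathlib import Path


def store_sample(data, root):
    """Write bytes under root at a path derived from their SHA-256 digest."""
    digest = hashlib.sha256(data).hexdigest()
    # Fan out as ab/cd/abcd... so no single directory grows too large.
    path = Path(root) / digest[:2] / digest[2:4] / digest
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)
    return digest, path
```

Because the key is computed from the content, the same file submitted by two sources lands at the same path, and the digest doubles as the key that ties the file to its metadata records.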

Looking Ahead

In Part Two of this blog series, I will discuss processing the collected malware so that it can be analyzed.

Additional Resources

Watch my webinar, Building and Scaling a Malware Analysis System.

Listen to the podcast, DNS Blocking to Disrupt Malware.

Read other blog posts about malware.
