Part one of this series of blog posts on the collection and analysis of malware and storage of malware-related data in enterprise systems reviewed practices for collecting malware, storing it, and storing data about it. This second post in the series discusses practices for preparing malware data for analysis and discuss issues related to messaging between big data framework components.
The growth of big data has affected many fields, including malware analysis. Increased computational power and storage capacities have made it possible for big-data processing systems to handle the increased volume of data being collected. In addition to collecting the malware, new ways of analyzing and visualizing malware have been developed. In this blog post--the first in a series on using a big-data framework for malware collection and analysis--I will review various options and tradeoffs for dealing with malware collection and storage at scale.
Much of the malware that we analyze includes some type of remote access capability. Malware analysts broadly refer to this type of malware as a remote access tool (RAT). RAT-like capabilities are possessed by many well-known malware families, such as DarkComet. As described in this series of posts, CERT researchers are exploring ways to automate common malware analysis activities. In a previous post, I discussed the Pharos Binary Analysis Framework and tools to support reverse engineering object-oriented code. In this post, I will explain how to statically characterize program behavior using application programming interface (API) calls and then discuss how we automated this reasoning with a malware analysis tool that we call ApiAnalyzer.
Department of Defense (DoD) systems are becoming increasingly software reliant, at a time when concerns about cybersecurity are at an all-time high. Consequently, the DoD, and the government more broadly, is expending significantly more time, effort, and money in creating, securing, and maintaining software-reliant systems and networks. Our first post in this series provided an overview of the SEI's five-year technical strategic plan, which aims to equip the government with the best combination of thinking, technology, and methods to address its software and cybersecurity challenges. This blog post, the second in the series, looks at ongoing and new research we are undertaking to address key cybersecurity, software engineering and related acquisition issues faced by the government and DoD.
Object-oriented programs present considerable challenges to reverse engineers. For example, C++ classes are high-level structures that lead to complex arrangements of assembly instructions when compiled. These complexities are exacerbated for malware analysts because malware rarely has source code available; thus, analysts must grapple with sophisticated data structures exclusively at the machine code level. As more and more object-oriented malware is written in C++, analysts are increasingly faced with the challenges of reverse engineering C++ data structures. This blog post is the first in a series that discusses tools developed by the Software Engineering Institute's CERT Division to support reverse engineering and malware analysis tasks on object-oriented C++ programs.
The SEI Blog continues to attract an ever-increasing number of readers interested in learning more about our work in agile metrics, high-performance computing, malware analysis, testing, and other topics. As we reach the mid-year point, this blog posting highlights our 10 most popular posts, and links to additional related resources you might find of interest (Many of our posts cover related research areas, so we grouped them together for ease of reference.)
Before we take a deeper dive into the posts, let's take a look at the top 10 posts (ordered by number of visits, with #1 being the highest number of visits):
Every day, analysts at major anti-virus companies and research organizations are inundated with new malware samples. From Flame to lesser-known strains, figures indicate that the number of malware samples released each day continues to rise. In 2011, malware authors unleashed approximately 70,000 new strains per day, according to figures reported by Eugene Kaspersky. The following year, McAfee reported that 100,000 new strains of malware were unleashed each day. An article published in the October 2013 issue of IEEE Spectrum, updated that figure to approximately 150,000 new malware strains. Not enough manpower exists to manually address the sheer volume of new malware samples that arrive daily in analysts' queues. In our work here at CERT, we felt that analysts needed an approach that would allow them to identify and focus first on the most destructive binary files. This blog post is a follow up of my earlier post entitled Prioritizing Malware Analysis. In this post, we describe the results of the research I conducted with fellow researchers at the Carnegie Mellon University (CMU) Software Engineering Institute (SEI) and CMU's Robotics Institutehighlighting our analysis that demonstrated the validity (with 98 percent accuracy) of our approach, which helps analysts distinguish between the malicious and benign nature of a binary file.
Code clones are implementation patterns transferred from program to program via copy mechanisms including cut-and-paste, copy-and-paste, and code-reuse. As a software engineering practice there has been significant debate about the value of code cloning. In its most basic form, code cloning may involve a codelet (snippets of code) that undergoes various forms of evolution, such as slight modification in response to problems. Software reuse quickens the production cycle for augmented functions and data structures. So, if a programmer copies a codelet from one file into another with slight augmentations, a new clone has been created stemming from a founder codelet. Events like these constitute the provenance or historical record of all events affecting a codelet object. This blog posting describes exploratory research that aims to understand the evolution of source and machine code and, eventually, create a model that can recover relationships between codes, files, or executable formats where the provenance is not known.
As part of our mission to advance the practice of software engineering and cybersecurity through research and technology transition, our work focuses on ensuring that software-reliant systems are developed and operated with predictable and improved quality, schedule, and cost. To achieve this mission, the SEI conducts research and development activities involving the Department of Defense (DoD), federal agencies, industry, and academia. As we look back on 2013, this blog posting highlights our many R&D accomplishments during the past year.
In 2012, Symantec blocked more than 5.5 billion malware attacks (an 81 percent increase over 2010) and reported a 41 percent increase in new variants of malware, according to January 2013 Computer World article. To prevent detection and delay analysis, malware authors often obfuscate their malicious programs with anti-analysis measures. Obfuscated binary code prevents analysts from developing timely, actionable insights by increasing code complexity and reducing the effectiveness of existing tools. This blog post describes research we are conducting at the SEI to improve manual and automated analysis of common code obfuscation techniques used in malware.
In previous blog posts, I have written about applying similarity measures to malicious code to identify related files and reduce analysis expense. Another way to observe similarity in malicious code is to leverage analyst insights by identifying files that possess some property in common with a particular file of interest. One way to do this is by using YARA, an open-source project that helps researchers identify and classify malware. YARA has gained enormous popularity in recent years as a way for malware researchers and network defenders to communicate their knowledge about malicious files, from identifiers for specific families to signatures capturing common tools, techniques, and procedures (TTPs). This blog post provides guidelines for using YARA effectively, focusing on selection of objective criteria derived from malware, the type of criteria most useful in identifying related malware (including strings, resources, and functions), and guidelines for creating YARA signatures using these criteria.
Through our work in cyber security, we have amassed millions of pieces of malicious software in a large malware database called the CERT Artifact Catalog. Analyzing this code manually for potential similarities and to identify malware provenance is a painstaking process. This blog post follows up our earlier post to explore how to create effective and efficient tools that analysis can use to identify malware.
Malware, which is short for "malicious software," is a growing problem for government and commercial organizations since it disrupts or denies important operations, gathers private information without consent, gains unauthorized access to system resources, and other inappropriate behaviors. A previous blog postdescribed the use of "fuzzy hashing" to determine whether two files suspected of being malware are similar, which helps analysts potentially save time by identifying opportunities to leverage previous analysis of malware when confronted with a new attack. This posting continues our coverage of fuzzy hashing by discussing types of malware against which similarity measures of any kind (including fuzzy hashing) may be applied.
Malware, which is short for "malicious software," consists of programming aimed at disrupting or denying operation, gathering private information without consent, gaining unauthorized access to system resources, and other inappropriate behavior. Malware infestation is of increasing concern to government and commercial organizations. For example, according to the Global Threat Report from Cisco Security Intelligence Operations, there were 287,298 "unique malware encounters" in June 2011, double the number of incidents that occurred in March. To help mitigate the threat of malware, researchers at the SEI are investigating the origin of executable software binaries that often take the form of malware. This posting augments a previous postingdescribing our research on using classification (a form of machine learning) to detect "provenance similarities" in binaries, which means that they have been compiled from similar source code (e.g., differing by only minor revisions) and with similar compilers (e.g., different versions of Microsoft Visual C++ or different levels of optimization).
Malware--generically defined as software designed to access a computer system without the owner's informed consent--is a growing problem for government and commercial organizations. In recent years, research into malware focused on similarity metrics to decide whether two suspected malicious files are similar to one another. Analysts use these metrics to determine whether a suspected malicious file bears any resemblance to already verified malicious files. Using these metrics allows analysts to potentially save time, by identifying opportunities to leverage previous analysis. This post will describe our efforts to develop a technique (known as fuzzy hashing) to help analysts determine whether two pieces of suspected malware are similar.
Malicious software (known as "malware") is increasingly pervasive with a constant influx of new, increasingly complex strains that wreak havoc by exploiting computers or personal and business information stored therein for malicious or criminal purposes. Examples include code that is designed to pilfer personal and digital credentials; plunder sensitive information from government or business enterprises; or interrupt, misdirect, or render inoperable computer hardware and computer-controlled equipment. This post describes our work to create a rapid search capability that allows analysts to quickly analyze a new piece of malware.
In response to a comment on my initial post introducing the SEI blog, I wanted to provide some additional information on new and upcoming SEI research initiatives. In this post, I describe these areas, and include a "sneak preview" of upcoming blog postings in each area.
As software becomes an ever-increasing part of our daily lives, organizations find themselves relying on software that originates from unknown and untrusted sources. The vast majority of such software is available only as executables, known as "binaries." Many binaries--such as malware or different versions and builds of a software package--are simply minor variants of old programs (or in some cases exact copies) that have been run through a different compiler.
In recent days, the VPNFilter malware has attracted attention, much of it in the wake of a May 25 public service announcement from the FBI, as well as a number of announcements from vendors and security companies. In this blog...