
Testing, Agile Metrics, Fuzzy Hashing, Android, and Big Data: The SEI Blog Mid-Year Review (Top 10 Posts)

Douglas C. Schmidt

The SEI Blog continues to attract an ever-increasing number of readers interested in learning more about our work in agile metrics, high-performance computing, malware analysis, testing, and other topics. As we reach the mid-year point, this blog post highlights our 10 most popular posts and links to additional related resources you might find of interest. (Many of our posts cover related research areas, so we grouped them together for ease of reference.)

Before we take a deeper dive into the posts, let's take a look at the top 10 posts (ordered by number of visits, with #1 being the highest number of visits):

#1. Using V Models for Testing
#2. Common Testing Problems: Pitfalls to Prevent and Mitigate
#3. Agile Metrics: Seven Categories
#4. Developing a Software Library for Graph Analytics
#5. Four Principles of Engineering Scalable, Big Data Software Systems
#6. Two Secure Coding Tools for Analyzing Android Apps
#7. Four Types of Shift-Left Testing
#8. Writing Effective YARA Signatures to Identify Malware
#9. Fuzzy Hashing Techniques in Applied Malware Analysis
#10. Addressing the Software Engineering Challenges of Big Data

Using V Models for Testing
Common Testing Problems: Pitfalls to Prevent and Mitigate
Four Types of Shift-Left Testing

Don Firesmith's blog posts on testing continue to rank among the most visited posts on the SEI Blog. The post Using V Models for Testing, which was published in November 2013, has been the most popular post on our site throughout the first half of this year. In the post, Firesmith introduces three variants on the traditional V model of system or software development that make it more useful to testers, quality engineers, and other stakeholders interested in the use of testing as a verification and validation method.

The V model builds on the traditional waterfall model of system or software development by emphasizing verification and validation. The V model takes the bottom half of the waterfall model and bends it upward into the form of a V, so that the activities on the right verify or validate the work products of the activity on the left.

More specifically, the left side of the V represents the analysis activities that decompose the users' needs into small, manageable pieces, while the right side of the V shows the corresponding synthesis activities that aggregate (and test) these pieces into a system that meets the users' needs.

In the two-part blog post series Common Testing Problems: Pitfalls to Prevent and Mitigate, Firesmith begins by citing a widely known study for the National Institute of Standards & Technology (NIST) reporting that inadequate testing methods and tools annually cost the U.S. economy between $22.2 billion and $59.5 billion, with roughly half of these costs borne by software developers in the form of extra testing and half by software users in the form of failure avoidance and mitigation efforts. The same study notes that between 25 and 90 percent of software development budgets are often spent on testing. In this two-part series (read the first post here; read the second post here), Firesmith highlights results of an analysis documenting problems that commonly occur during testing. Specifically, this series of posts identifies and describes 77 testing problems organized into 14 categories; lists potential symptoms by which each can be recognized, potential negative consequences, and potential causes; and makes recommendations for preventing them or mitigating their effects.

In the post Four Types of Shift-Left Testing, Firesmith details four basic methods to shift testing earlier in the lifecycle (that is, leftward on the V model). These can be referred to as traditional shift-left testing, incremental shift-left testing, Agile/DevOps shift-left testing, and model-based shift-left testing.

Readers interested in finding out more about Firesmith's work in this field can refer to the following resources:

Agile Metrics: Seven Categories

For agile software development, one of the most important metrics is delivered business value. This progress measure, while observation-based, does not violate the team spirit. A group of SEI researchers began research to help program managers measure progress more effectively. At the same time, we want teams to work in their own environment and use metrics specific to the team, while differentiating from metrics that are used at the program level.

The SEI blog post, Agile Metrics: Seven Categories, details three key views of agile team metrics that are typical of most implementations of agile methods: velocity, sprint burn-down chart, and release burn-up chart.
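As a rough illustration of what these three views measure, here is a minimal Python sketch. The function names and the sample story-point figures are assumptions made for this example and do not come from the SEI technical note: velocity is taken as the average points completed per sprint, the sprint burn-down as the work remaining after each day of a sprint, and the release burn-up as the cumulative points delivered after each sprint.

```python
from statistics import mean


def velocity(completed_points_per_sprint):
    """Average story points completed per sprint."""
    return mean(completed_points_per_sprint)


def sprint_burn_down(committed_points, completed_per_day):
    """Points remaining at the end of each day of a single sprint."""
    remaining = []
    left = committed_points
    for done in completed_per_day:
        left -= done
        remaining.append(left)
    return remaining


def release_burn_up(completed_points_per_sprint):
    """Cumulative points delivered after each sprint of a release."""
    total = 0
    cumulative = []
    for points in completed_points_per_sprint:
        total += points
        cumulative.append(total)
    return cumulative


# Invented sample data: three sprints, and five days of one sprint.
sprints = [20, 24, 22]
print(velocity(sprints))                      # 22
print(sprint_burn_down(20, [3, 5, 0, 4, 8]))  # [17, 12, 12, 8, 0]
print(release_burn_up(sprints))               # [20, 44, 66]
```

Teams usually plot the last two series as charts; the lists here are simply the values such charts would show.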

This research, which is presented in greater detail in the SEI technical note Agile Metrics: Progress Monitoring of Agile Contractors, involved interviewing professionals who manage agile contracts, which gave SEI researchers insight from professionals in the field who have successfully worked with agile suppliers in DoD acquisitions.

Based on interviews with personnel who manage agile contracts, the technical note (and blog post) also identifies seven successful ways to monitor progress that help programs account for the regulatory requirements common in the DoD.

Readers interested in finding out more about this research can read the following SEI technical reports and notes:

Developing a Software Library for Graph Analytics

Graph algorithms are in wide use in DoD software applications, including intelligence analysis, autonomous systems, cyberintelligence and security, and logistics optimizations. In late 2013, several luminaries from the graph analytics community released a position paper calling for an open effort, now referred to as GraphBLAS, to define a standard for graph algorithms in terms of linear algebraic operations. BLAS stands for Basic Linear Algebra Subprograms and is a common library specification used in scientific computation. The authors of the position paper propose extending the National Institute of Standards and Technology's Sparse Basic Linear Algebra Subprograms (spBLAS) library to perform graph computations. The position paper served as the latest catalyst for the ongoing research by the SEI's Emerging Technology Center in the field of graph algorithms and heterogeneous high-performance computing (HHPC). This blog post, the second in a series highlighting ETC's work in high-performance computing, is a follow-up to the 2013 post, Architecting Systems of the Future. This second post describes efforts to create a software library of graph algorithms for heterogeneous architectures that will be released via open source.
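To give a feel for what "graph algorithms in terms of linear algebraic operations" means, here is a small Python sketch of breadth-first search written as repeated matrix-vector products over a Boolean (OR, AND) semiring. This is an illustration under our own assumptions, not the GraphBLAS interface or the SEI library: the function name, the dense list-of-lists adjacency matrix, and the sample graph are all invented for the example.

```python
def bfs_levels(adjacency, source):
    """Breadth-first search expressed as repeated matrix-vector products.

    adjacency[i][j] is 1 when there is an edge from vertex i to vertex j.
    Returns the BFS level of each vertex (-1 if unreachable).
    """
    n = len(adjacency)
    levels = [-1] * n
    visited = [False] * n
    frontier = [False] * n
    frontier[source] = True
    level = 0
    while any(frontier):
        # Record the current frontier.
        for v in range(n):
            if frontier[v]:
                levels[v] = level
                visited[v] = True
        # next = (A^T x frontier) over the (OR, AND) semiring,
        # masked so that already-visited vertices are dropped.
        frontier = [
            (not visited[j])
            and any(frontier[i] and adjacency[i][j] for i in range(n))
            for j in range(n)
        ]
        level += 1
    return levels


# Sample graph with edges 0->1, 0->2, 1->3, 2->3, 3->4.
graph = [
    [0, 1, 1, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0],
]
print(bfs_levels(graph, 0))  # [0, 1, 1, 2, 3]
```

A production library would use sparse matrices and hardware-specific kernels; the point of the sketch is only that one traversal step becomes a single linear-algebra operation.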

This post details research that bridges the gap between the academic focus on fundamental graph algorithms and our focus on architecture and hardware issues. The post by the SEI's Scott McMillan also highlights a collaboration with researchers at Indiana University's Center for Research in Extreme Scale Technologies (CREST), which developed the Parallel Boost Graph Library (PBGL). In particular, the SEI is working with Dr. Andrew Lumsdaine, who serves on the Graph 500 Executive Committee and is considered a world leader in graph analytics. Researchers in this lab worked with us to implement and benchmark data structures, communication mechanisms, and algorithms on GPU hardware.

Readers interested in finding out more about our work in this area can read the following SEI technical note:

Big Data
Four Principles of Engineering Scalable, Big Data Software Systems
Addressing the Software Engineering Challenges of Big Data

New data sources, ranging from diverse business transactions to social media, high-resolution sensors, and the Internet of Things, are creating a digital tsunami of big data that must be captured, processed, integrated, analyzed, and archived. Big data systems that store and analyze petabytes of data are becoming increasingly common in many application domains. These systems represent major, long-term investments, requiring considerable financial commitments and massive scale software and system deployments. With analysts estimating data storage growth at 30 to 60 percent per year, organizations must develop a long-term strategy to address the challenge of managing projects that analyze exponentially growing data sets with predictable, linear costs.
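To make the compounding concrete, the short Python sketch below works through the arithmetic for the 30 and 60 percent annual growth estimates. The function names and the five-year horizon are our own choices for illustration. At these rates, stored data doubles roughly every 2.6 years and every 1.5 years, respectively.

```python
import math


def growth_factor(annual_rate, years):
    """How many times larger the data set is after compounding growth."""
    return (1 + annual_rate) ** years


def doubling_time(annual_rate):
    """Years needed for the data set to double at a constant annual rate."""
    return math.log(2) / math.log(1 + annual_rate)


for rate in (0.30, 0.60):
    print(
        f"{rate:.0%} per year: "
        f"x{growth_factor(rate, 5):.2f} after 5 years, "
        f"doubles every {doubling_time(rate):.2f} years"
    )
```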

In a popular series on the SEI blog, Ian Gorton continues his exploration of the software engineering challenges of big data systems. In the first post in the series, Addressing the Software Engineering Challenges of Big Data, Gorton describes a risk reduction approach called Lightweight Evaluation and Architecture Prototyping (for Big Data) that he developed with fellow researchers at the SEI. The approach is based on principles drawn from proven architecture and technology analysis and evaluation techniques to help the Department of Defense (DoD) and other enterprises develop and evolve systems to manage big data.

In another post, the tenth most popular on our site in the first six months of 2015, Four Principles of Engineering Scalable, Big Data Software Systems, Gorton describes principles that hold for any scalable, big data system.

Readers interested in finding out more about Gorton's research in big data can refer to the following additional resources:

Two Secure Coding Tools for Analyzing Android Apps

One of the most popular areas of research among SEI blog readers this year has been the series of posts highlighting our work on secure coding for the Android platform. Android is an important area of focus, given its mobile device market dominance (82 percent of worldwide market share in the third quarter of 2013), the adoption of Android by the DoD, and the emergence of popular massive open online courses on Android programming and security.

Since its publication in late April, the post Two Secure Coding Tools for Analyzing Android Apps, by Will Klieber and Lori Flynn, has been among the most popular on our site. The post highlights a tool they developed, DidFail, that addresses a problem often seen in information flow analysis: the leakage of sensitive information from a sensitive source to a restricted sink (taint flow). Previous static analyzers for Android taint flow did not combine precise analysis within components with analysis of communication between Android components (intent flows). The SEI CERT Division's new tool analyzes taint flow for sets of Android apps, not only single apps.
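The toy Python sketch below shows why analyzing a set of apps together can reveal flows that single-app analysis misses. It is not DidFail's algorithm: it simply treats flows as edges in a graph and checks reachability, and the app behavior, node names, and intent name are all hypothetical.

```python
from collections import deque


def find_taint_flows(flows, sources, sinks):
    """Return (source, sink) pairs connected by a chain of flows.

    flows is a list of (from_node, to_node) edges; an intent shared by
    two apps appears as a node that one app writes and another reads.
    """
    graph = {}
    for src, dst in flows:
        graph.setdefault(src, set()).add(dst)

    found = set()
    for source in sources:
        seen = {source}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nxt in graph.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        found.update((source, sink) for sink in sinks if sink in seen)
    return found


# Hypothetical apps: A reads the location and sends it in an intent;
# B receives that intent and writes its payload to an SMS message.
app_a = [("location", "intent:SEND_DATA")]
app_b = [("intent:SEND_DATA", "sms")]

print(find_taint_flows(app_a, {"location"}, {"sms"}))          # set()
print(find_taint_flows(app_a + app_b, {"location"}, {"sms"}))  # {('location', 'sms')}
```

Neither app leaks the location to SMS on its own; the flow only appears when the intent connecting them is part of the analysis.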

DidFail is available to the public as a free download. Also available is a small test suite of apps that demonstrates the functionality that DidFail provides.

The post by Klieber and Flynn is the latest in a series detailing the CERT Secure Coding team's work on techniques and tools for analyzing code for mobile computing platforms.

Readers interested in finding out more about the CERT Secure Coding Team's work in secure coding for the Android platform can refer to the following additional resources:

Writing Effective YARA Signatures to Identify Malware
Fuzzy Hashing Techniques in Applied Malware Analysis

Previous SEI Blog posts on identifying malware have focused on applying similarity measures to malicious code to identify related files and reduce analysis expense. Another way to observe similarity in malicious code is to leverage analyst insights by identifying files that possess some property in common with a particular file of interest. One way to do this is by using YARA, an open-source project that helps researchers identify and classify malware. YARA has gained enormous popularity in recent years as a way for malware researchers and network defenders to communicate their knowledge about malicious files, from identifiers for specific families to signatures capturing common tools, techniques, and procedures (TTPs). In the blog post Writing Effective YARA Signatures to Identify Malware, CERT Division researcher David French provides guidelines for using YARA effectively, focusing on selection of objective criteria derived from malware, the type of criteria most useful in identifying related malware (including strings, resources, and functions), and guidelines for creating YARA signatures using these criteria.

YARA provides a robust language (based on Perl Compatible Regular Expressions) for creating signatures with which to identify malware. These signatures are encoded as text files, which makes them easy to read and communicate with other malware analysts. Since YARA applies static signatures to binary files, the criteria statically derived from malicious files are the easiest and most effective criteria to convert into YARA signatures. The post highlights three different types of criteria that are most suitable for YARA signature development: strings, resources, and function bytes.

The simplest usage of YARA is to encode strings that appear in malicious files. The usefulness of matching strings, however, is highly dependent on which strings are chosen.
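The Python sketch below illustrates why string choice matters. It does not use YARA or its rule syntax; it only mimics the idea of a string-based signature, and the file contents and strings are invented for the example. Strings that also occur in ordinary software match benign files, while a string distinctive to the malware does not.

```python
def count_string_hits(data, strings):
    """Return how many of the given byte strings occur in data."""
    return sum(1 for s in strings if s in data)


def matches(data, strings, min_hits):
    """A signature matches when at least min_hits strings are present."""
    return count_string_hits(data, strings) >= min_hits


# Hypothetical file contents, invented for illustration.
malicious_sample = b"MZ...LoadLibraryA...xq_botnet_v2_mutex...kernel32.dll"
benign_program = b"MZ...LoadLibraryA...kernel32.dll...Hello, world"

generic_strings = [b"LoadLibraryA", b"kernel32.dll"]
distinctive_strings = [b"xq_botnet_v2_mutex"]

print(matches(malicious_sample, generic_strings, 2))      # True
print(matches(benign_program, generic_strings, 2))        # True (false positive)
print(matches(malicious_sample, distinctive_strings, 1))  # True
print(matches(benign_program, distinctive_strings, 1))    # False
```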

In the post Fuzzy Hashing Techniques in Applied Malware Analysis, French highlights improved ways of assessing whether two files are similar, including fuzzy hashing.

When investigating a security incident involving malware, analysts will create a report documenting their findings. To denote the identity of a malicious binary or executable, analysts often use cryptographic hashing, which computes a hash value on a block of data, such that an accidental or intentional change to the data will change the hash value. In communication science, cryptographic hashes are frequently used to determine the integrity of a message sent through a communication channel. In malware research, they are useful for positively identifying a piece of software. If a suspected file has the same cryptographic hash as a known file, an analyst is reasonably confident that the files are identical. Modifying even a single bit of a malicious file, however, will alter its cryptographic hash. The result is that inconsequential changes to malicious files will prevent analysts from rapidly observing that a suspected file is identical to a file they have already seen.
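The single-bit sensitivity described here is easy to see with Python's standard hashlib module. In this small sketch, the sample bytes and helper names are ours, and SHA-256 stands in for whichever cryptographic hash an analyst actually uses.

```python
import hashlib


def sha256_hex(data):
    """Hex-encoded SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def flip_bit(data, bit_index):
    """Return a copy of data with exactly one bit inverted."""
    changed = bytearray(data)
    changed[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(changed)


original = b"stand-in bytes for a malicious binary"
modified = flip_bit(original, 0)

print(sha256_hex(original))
print(sha256_hex(modified))
print(sha256_hex(original) == sha256_hex(modified))  # False
```

The two inputs differ by one bit, yet the digests share no useful resemblance, so an exact-match lookup treats the files as unrelated.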

To counter this behavior, analysts seek improved ways of assessing whether two files are similar. One such method is known as fuzzy hashing. Fuzzy hashes and other block/rolling hash methods provide a continuous stream of hash values for a rolling window over the binary. These methods produce hash values that allow analysts to assign a percentage score that indicates the amount of content that the two files have in common. A recent type of fuzzy hashing, known as context triggered piecewise hashing, has gained enormous popularity in malware detection and analysis in the form of an open-source tool called ssdeep.
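The Python sketch below is a heavily simplified illustration of the piecewise idea, not ssdeep's actual algorithm. The window size, trigger value, one-character piece encoding, and the use of difflib for scoring are all assumptions made for the example. A rolling sum over the last few bytes decides where pieces end, each piece contributes one character to the signature, and two signatures are compared to produce a percentage score.

```python
import hashlib
import random
from difflib import SequenceMatcher

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def piece_char(piece):
    """Map one piece of the file to a single signature character."""
    return ALPHABET[hashlib.sha256(piece).digest()[0] % 64]


def piecewise_signature(data, window=7, trigger=64):
    """Cut data wherever a rolling sum hits the trigger; hash each piece."""
    signature = []
    piece_start = 0
    rolling = 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]  # keep only the last `window` bytes
        if rolling % trigger == trigger - 1:
            signature.append(piece_char(data[piece_start:i + 1]))
            piece_start = i + 1
    if piece_start < len(data):
        signature.append(piece_char(data[piece_start:]))
    return "".join(signature)


def similarity(a, b):
    """Score from 0 to 100 based on how much the two signatures share."""
    matcher = SequenceMatcher(
        None, piecewise_signature(a), piecewise_signature(b), autojunk=False
    )
    return round(100 * matcher.ratio())


rng = random.Random(0)
original = bytes(rng.randrange(256) for _ in range(4096))
modified = bytearray(original)
modified[2000] ^= 0xFF  # change a single byte in the middle
modified = bytes(modified)

print(similarity(original, original))  # 100
print(similarity(original, modified))
```

Because piece boundaries depend only on nearby bytes, a local change disturbs just the pieces around it and the rest of the signature lines up again, which is what lets the score stay high for near-identical files.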

Looking Ahead

In the coming months, we will be continuing our series on DevOps, as well as posts on vulnerability analysis tools, predictive analysis, context-aware computing, and the SEI strategic plan. We will also continue our SPRUCE series highlighting recommended practices in the fields of Agile at Scale, Monitoring Software-Intensive System Acquisition Programs, and Managing Intellectual Property in the Acquisition of Software-Intensive Systems.

Thank you for your support. We publish a new post on the SEI Blog every Monday morning. Let us know if there is any topic you would like to see covered in the SEI Blog.

We welcome your feedback in the comments section below.

Additional Resources

For the latest SEI technical reports and notes, please visit