search menu icon-carat-right cmu-wordmark

Provenance Inference in Software

William Casey

Code clones are implementation patterns transferred from program to program via copy mechanisms including cut-and-paste, copy-and-paste, and code-reuse. As a software engineering practice there has been significant debate about the value of code cloning. In its most basic form, code cloning may involve a codelet (snippets of code) that undergoes various forms of evolution, such as slight modification in response to problems. Software reuse quickens the production cycle for augmented functions and data structures. So, if a programmer copies a codelet from one file into another with slight augmentations, a new clone has been created stemming from a founder codelet. Events like these constitute the provenance or historical record of all events affecting a codelet object. This blog post describes exploratory research that aims to understand the evolution of source and machine code and, eventually, create a model that can recover relationships between codes, files, or executable formats where the provenance is not known.

Other major events for codelets may include creation, copy, modification, and deletion of a codelet from a project. The presence of software bugs and vulnerabilities brings a new set of questions for which the provenance of codelet is particularly relevant. Consider the copy/paste of a codelet that contains a bug. By cloning the programmer has essentially introduced a new bug by copying an old bug.

The tracing of provenance in software developments has therefore become increasingly important to tracking both bugs and vulnerabilities. A generative model of provenance developed from observed software histories may inform researchers and developers how code evolves and mutates over time. Taking that notion one step further, a generative model that uses Bayesian Statistical techniques provides inference possibilities, such as identifying relations in software where the provenance is not completely known.

Examples of problems where provenance models and inference may be critical include identifying attribution for malware and analyzing potential vulnerabilities or vulnerability surface in third-party software product chains. These are two increasingly important problems in today's era of cyber warfare where we aim to both minimize vulnerability exposure and understand adversarial attack patterns. Further provenance inference may be an important step in understanding the dynamics of cyber conflict: If an organization deploys sophisticated cyber attacks can counter attacks (or more generally other attacks) occur by simply copying the malicious code and re-deploying it? Codelet mobility (the ease at which an implementation can be copied and redeployed) is an important factor to understanding the strategic options of agents in cyber conflicts.

Initial Research

This work began about 18 months ago when CERT researchers began using automated techniques to identify relationships (i.e., common authorship) among different types of malware. In the not-too-distant past, identifying the provenance or origin of malware required reverse engineering, which was labor intensive and time consuming.

I worked on that project with Aaron Shelmire, a former CERT researcher, to develop a Jaccard similarity index, which measures the amount of commonality between any two digital artifacts, source code files, or binary executables, based on the number of code clones in common. The Jaccard index identifies all common substrings in the artifacts and presents a measure of how much code is code in common. Not only is this technique a more efficient approach than reverse engineering, we have arrived at techniques that are algorithmically optimal, and shown that these methods may be used to prioritize the efforts of reverse engineers.

We used the Jaccard index to systematically examine subsets of artifacts to identify malware families, which share a high degree of code similarity in binary images or epoch similarity in traces (runtime). The basic notions of similarity are sufficiently long interesting strings or clusters of shorter strings. Because there are many generative factors giving rise to code clones (e.g. compiler artifacts, linked static libraries, copy-cut-and-past) the question arose, How can we interpret these measures in to infer shared provenance and how may we do so with statistical confidence?

From a malware attribution perspective this question reduces to How can we demonstrate that a high degree of similarity between two malware families is due to a common attacker or common source code? How could an analyst be certain that the similarities were not the result of shared, online, open-source code?
This simple question sparked our research question: How should we measure provenance in software? This exploratory research involves several phases:

  • Phase 1 involves leveraging a large-scale software history to gain a better understanding of how code evolves.
  • Phase 2 involves creating a generative model that will allow us to investigate the statistical problem of inference.
  • Phase 3 involves the application of the statistical model to problems of inference

This research provides malware analysts greater certainty in determining shared authorship and other cloned features among two or more file

A Foundation in Phylogenetics

Our interest in inference models for software provenance are preceded by effective inference models in the biological problems of phylogenetics, which allowed biologists to infer the structure of evolution events and determine common ancestry by examination of contemporary biological data. In fact, if you have received a flu vaccine then you have benefited from this science inference and prediction.

We recognized similarities to phylogenetics in our own work: We were researching malware artifacts in hopes of understanding more about their historical developments from the artifacts alone (with the provenance data withheld). To address this completely, we would need to understand and model the events that modify and edit source code. To accomplish this, we needed to look at source code with a known provenance.

Turning to Open-Source Code

It is an opportune time to explore the history of software for any researcher interested in examining the evolution of source code. Unix maintains a 43-year history detailing the evolution of its code while Linux maintains 21-year history. We decided to examine Mozilla's 14 year open-source history. The previous five years contain many significant developments indicated by the following statistic:

2,546 users modifying 25,000 files containing a total of 122 million lines of source code modified.

Of the 122 million lines of code (LOC), only 4.2 million LOC are unique. This statistic suggests that a vast majority of lines are redundant, many of which are copied or cloned. If foundational events may be understood and subsequent mutations identified within a generative provenance model, it is possible that among the 4.2 LOC, there may be still even greater reductions that arise from understanding the mutation structure and co-migration patterns of source code. While an entire list of driving factors leading to code mutation and migration may not be practical, major modification patterns can be inferred and dynamic modification patterns can be learned from data with know provenance.

Addressing Size and Scale

One of the major challenges in our research involves addressing the size and scale of the Mozilla source-code distribution, which in the last five years has logged 120,355 major commit events. One method to address this size and scale is through indexing techniques. For example, in January 2013 Mozilla distributed 17 million LOC (only 20 percent of which are unique) and distributed in 15 thousand files. Indexing these individual lines of code as well as their neighboring lines of code have has allowed us to identify portions of code that have migrated from one file to another during development events.

While there are large numbers of file objects and LOC objects, the relation between file objects and LOC objects is very sparse. Indexing also allowed us to map relations such as co-occurrence clusters of LOC and their containing files. Using these observations we may observe statistical trends of modifications made to files over time and test our assumptions of sparse structures and evolution patterns such as the following scenario: files exhibit rapid growth (initially) with lots of code added in, and then settle down eventually with bug fixes generating the only modifications there after. By creating an index of unique lines of code, we were able to map codelets as code patches with co-migratory patterns in files.

A challenging aspect of modeling provenance problems for software has been the notion of object granularity [Bunge and Ram, Liu] or determining what objects should be modeled. Because source code objects are so closely tied with their application or function, both the file (as object) or even the line of code (as object) present drawbacks. We theorize that the most useful object to consider are functional in nature, such as a well-formed function definition in the C programming language, which may also be considered a codelet (or a set of lines-of-code). Toward testing this viewpoint our indexing techniques allow us to observe co-migration of codelets or sets of lines forming well defined C language functions.

Another challenge in our research involved interpreting the summary of actions provided by developers. One way to interpret this summary is to create topic models that focus on key terms, such as bug, merge, adding code, etc. By creating topic vectors, we enhanced the inference of why certain modifications were made.

Another obstacle that we needed to address in our research involved accounting for the many mechanisms that affect the copy number of lines of code. Addressing this challenge involved understanding how to model deletion events in a distributed revision control system, these events unlike their analogous events affecting the provenance of physical objects, are harder to deal with theoretically as remote branches when merged may re-introduce lines of code that were previously deleted.


In examining Mozilla's source code, we have collaborated with Sudha Ram, a researcher at the University of Arizona who is an expert in data provenance. Her research along with Jun Liu has examined provenance in Wikipedia pages, iPlant Collaborative, and in the Materials Management domain. Their work on digital provenance in edit histories of Wikipedia pages has helped identify collaboration patterns that result in high quality pages. Ram and Liu have developed a model of semantics for data provenance known as W7, a technique that describes the who, what, when, where, why, which, and how, for data events.

We are also working with Rhiannon Weaver and Katherine Prevost, fellow CERT researchers, who are helping us design statistical measures of ownership and topic models for summaries associated to modification events.

Future Work

With our indexing approach, we have been able to establish measures of commonality between malware files. We are aiming to use these measures to infer the likelihood for several different types of code clone events:

  • subtraction - modifications that remove lines of code.
  • addition - modifications that introduce additional lines of code to enhance a function.
  • mutation - modifications of code by slight in situ modifications to fix/enhance function.
  • moving/copying - reorganization or re-introduction of codelets in files.

Our real goal, however, is to understand the mobility of code and how it mutates and flows from one source to another. For practical problems attaining this understanding will allow us to assess the likelihood of why certain codes are found in common. For malware, if we can confirm that 20 percent of a malware group is related to a previously identified malware group, can we conclude with confidence that what other factors would lead to such a result?

We plan to continue our collaboration with Sudha Ram with the intent of producing a provenance model for software code. Specifically, this model will describe the evolution mechanism for code clones. After training and evaluating the model, we will use statistical analysis to identify clone evolution patterns where inference may be applied to determine provenance among two different pieces of malware. Specific evolution patterns that admit model validation can then be applied to operational problems of malware attribution.

In this next phase of our research, we are also collaborating with two researchers in the SEI's Emerging Technology Center, Naomi Anderson, and Kate Ambrose-Sereno, who are examining data provenance models for software.

We welcome your feedback on our research. Please leave a comment in the feedback section below.

Additional Resources

To read the book Treatise on basic philosophy: ontology I. The furniture of the world by Mario Bunge, please visit

To read the article, Who does what: Collaboration patterns in the wikipedia and their impact on article quality by Jun Liu and Sudha Ram, please visit

To read the article, A Semantic Foundation for Provenance Management by Jun Liu and Sudha Ram, please visit

Get updates on our latest work.

Each week, our researchers write about the latest in software engineering, cybersecurity and artificial intelligence. Sign up to get the latest post sent to your inbox the day it's published.

Subscribe Get our RSS feed