Data Analytics for Open Source Software Assessment
In 2012, the White House released its federal digital strategy. What's noteworthy about this release is that the executive office distributed the strategy using Bootstrap, an open source software (OSS) tool developed by Twitter and made freely available to the public via the code hosting site GitHub. This is not the only evidence that we have seen of increased government interest in OSS adoption. Indeed, the 2013 report The Future of Open Source Software revealed that 34 percent of its respondents were government entities using OSS products.
The Carnegie Mellon University Software Engineering Institute (SEI) has seen increased interest and adoption of OSS products across the federal government, including the Department of Defense (DoD), the intelligence community (IC), and the Department of Homeland Security. The catalyst for this increase has been innovators in government seeking creative solutions to rapidly field urgently needed technologies. While the rise of OSS adoption signals a new approach for government acquirers, it is not without risks that must be acknowledged and addressed, particularly given current certification and accreditation (C&A) techniques. This blog post will discuss research aimed at developing adoptable, evidence-based, data-driven approaches to evaluating (open source) software.
In this research, members of the technical staff in the SEI's Emerging Technology Center (ETC) explored the availability of data associated with OSS projects and developed semi-automated mechanisms to extract the values of pre-defined attributes. The challenges of applying data analytics to address real problems and of understanding OSS assurance align with the ETC's mission, which is to promote government awareness and knowledge of emerging technologies and their application, as well as to shape and leverage academic and industrial research.
Our research leveraged the "openness" of OSS to develop an evidence-based approach for assessing and assuring OSS. This approach, which focused on producing evidence in support of assurance claims, is based on generating artifacts and creating traceability links from assurance claims to those artifacts.
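The idea of linking assurance claims to supporting artifacts can be sketched with a simple data structure. This is a minimal illustration of the concept, not the design of the ETC tool; all names and the sample claim are invented for the example:

```python
from dataclasses import dataclass, field

@dataclass
class Artifact:
    """A piece of evidence extracted from an OSS repository."""
    source: str        # e.g., "issue tracker", "commit log"
    description: str

@dataclass
class AssuranceClaim:
    """A claim about the software, with traceability to its evidence."""
    statement: str
    evidence: list = field(default_factory=list)

    def add_evidence(self, artifact: Artifact) -> None:
        # Record the traceability link from claim to artifact
        self.evidence.append(artifact)

    def is_supported(self) -> bool:
        # A claim is only as strong as the evidence behind it
        return len(self.evidence) > 0

claim = AssuranceClaim("Critical bugs are fixed before each release")
claim.add_evidence(Artifact("issue tracker",
                            "All severity-1 bugs closed before the release tag"))
```

An assessor exploring such a structure can follow each claim down to the concrete artifacts that support it, rather than relying on opinion alone.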
Beyond a Trust-Based Approach
If we think of traditional, "shrink-wrapped" software, we accept that the software is developed by and purchased from a vendor who delivers a product against specified requirements. The software comes with installation instructions, FAQs, and access to support via hotlines and websites. Generally speaking, there is a company or some kind of legal entity that stands behind the product.
With OSS development, however, multiple developers from different organizations (even independent developers) can contribute to the code base of a product, which may or may not be backed by a single legal entity. In some cases, developers include helpful information in the software repository; in other cases, users are on their own to get the software working in their environment. Specific functionality may be driven by the community of developers, or by a small core team.
Current methods to assess software (OSS or otherwise) are trust-based and rely heavily on expert opinion. For example, users may run experiments with the software in a controlled environment to determine whether or not it is safe to operate. When certifying and accrediting OSS or any software, however, the trust-based model is not valid for several reasons:
- In today's environment, many organizations and entities incorporate open-source components into their software. As a result, no single company or organization stands behind an OSS capability.
- Individual expert assessments are manual and do not scale to the level required for large-scale, mission-critical projects that apply OSS.
- Assurance claims are based on opinion rather than on a data-driven designation of assurance.
For these reasons, we wanted to develop a prototype of a tool that reaches beyond the functions of traditional static analysis tools. Our aim was to create a tool that government or industry could use to support a decision of whether to adopt an OSS package. We felt it was important to develop a tool that provides supporting evidence, rather than one that renders a verdict that a particular software package is "good" or "bad."
In this age of sequestration and other pressures on the expense of acquiring and sustaining software-reliant systems, government agencies can realize numerous benefits from a good OSS development and adoption strategy, including cost savings and increased flexibility in the acquisition and development of systems.
Foundations of Our Approach
In 1998, after Netscape published the source code for Netscape Communicator, Bruce Perens and Eric S. Raymond founded the Open Source Initiative, an organization dedicated to promoting OSS. Since that time, a large number of OSS repositories have surfaced, including GitHub, Launchpad, SourceForge, and Ohloh.
In developing an approach, our team of researchers and software developers at the SEI wanted to create a tool that leveraged features of OSS, including the openness of code, development environment, documentation, and user community. Our aim was to design and develop an integrated, semi-automated software assessment capability that would allow an assessor to explore the evidence supporting an assurance claim.
The upside of the renewed interest in OSS adoption, both in government and industry, is that a wealth of data now exists within these repositories that provides insight into the development of OSS as well as the code-review and code-committal processes. Our aim with this research was to move beyond simple bug counts and static analysis and provide richer context for those charged with assessing software systems.
While no one measure or metric could provide an accurate assessment of software, we reasoned that several characteristics could provide acquirers with a more complete view of OSS assurance. During our study, we identified measurable characteristics that could be of interest, particularly if assessed in combination. For example, we examined complexities of the coding language used, test completion, and vitality or inertia of the project. Other characteristics that we evaluated included
- milestones. Our analysis included proposed release dates versus actual release dates. Meeting clearly described milestones and schedules is often an indicator of sound project management.
- bugs. We examined issues such as severity, discovery date versus fix date, the time for a fix to be included in a release, the percentage of bugs carried over from the previous release, the distribution of bugs by severity, time-to-fix measures, the rate at which new bugs are identified, and the bugs tracked in the current release, as well as bug aging and classification. Bug counts and defect density alone are not sufficient; sluggish time-to-fix measures, however, may signal problems with the current release.
- documentation. We looked at whether there was a process to update documentation, whether the documentation was up to date, release notes, the change log, and how lines of code correlate with the length and completeness of the user manual. A lack of documentation is a risk to adopters and implementers of the software because the implementers are left to their own devices to get the software working in their environment, which can cause significant delays in rollout.
- user base growth over time. We looked at activity levels in mailing lists (users and developers). We also considered activity levels at conferences, market penetration, and third-party support. We reasoned that increasing or decreasing activity in the user community was an indicator of the strength of the product.
- developer involvement over time. Our evaluation spanned the number of commits (including changes suggested by the user community), the number of unique authors, lines of contributed code versus total lines of code, evidence of hobbyist developers, and a network diagram illustrating the connections and influence of the community of developers, code committers, and reviewers. We also reviewed the social network of the developer community supporting particular OSS projects.
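As an illustration of the bug-level measures described above, the sketch below computes time-to-fix statistics and a severity distribution from a list of bug records. The records and field layout are invented for the example; real data would come from a project's issue tracker:

```python
from datetime import date
from statistics import mean

# Hypothetical bug records: (severity, discovery date, fix date)
bugs = [
    ("critical", date(2013, 1, 5),  date(2013, 1, 8)),
    ("major",    date(2013, 2, 1),  date(2013, 3, 15)),
    ("minor",    date(2013, 2, 20), date(2013, 2, 22)),
    ("major",    date(2013, 3, 2),  date(2013, 4, 30)),
]

# Time-to-fix in days, per bug
ttf = [(fixed - found).days for _, found, fixed in bugs]

# Distribution of bugs by severity
by_severity = {}
for sev, _, _ in bugs:
    by_severity[sev] = by_severity.get(sev, 0) + 1

print("mean time-to-fix (days):", mean(ttf))
print("bugs by severity:", by_severity)
```

On its own, such a number says little; combined with the release schedule and the rate of newly reported bugs, it helps an assessor judge whether a project is keeping up with its defect load.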
Context is important. Using the data collected to help build an understanding of the development environment, developer activity, and user community commitment helps potential adopters get a better sense of the viability of the OSS project.
When we first began this research, we focused on identifying data that would allow us to make valid comparisons between identifiers of quality in different software repositories. We soon realized, however, that quality attributes really are context-dependent. For example, OSS acquirers may place various levels of importance on whether software is updated during daytime hours by full-time employees or during evening hours by hobbyists. Instead of placing a value judgment on these variables, we altered our approach to identify characteristics, such as the ones listed above, that decision makers can use to determine relevancy and weighting.
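One way to leave relevancy and weighting to the decision maker is a simple weighted aggregation over normalized characteristic scores. The characteristics, scores, and weights below are purely illustrative, not outputs of our tool:

```python
# Normalized scores (0.0-1.0) for a hypothetical OSS project
scores = {
    "milestone_adherence": 0.8,
    "time_to_fix": 0.6,
    "documentation": 0.4,
    "user_base_growth": 0.9,
}

# Each decision maker supplies weights reflecting mission priorities
weights = {
    "milestone_adherence": 0.2,
    "time_to_fix": 0.4,
    "documentation": 0.3,
    "user_base_growth": 0.1,
}

def weighted_score(scores, weights):
    """Weighted average of characteristic scores."""
    total = sum(weights.values())
    return sum(scores[k] * weights[k] for k in scores) / total

print(round(weighted_score(scores, weights), 3))
```

Two acquirers looking at the same evidence could assign different weights, and thus reach different, but equally defensible, conclusions about the same project.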
As we progressed through the research, we also realized that OSS repositories were starting to explore ways to represent data relevant to the OSS projects they host. For example, GitHub maintains a graphs section that highlights data such as code stability and trends over time, and a separate punch-card section that represents the volume of code commits over the span of a week. Another example is Ohloh, which provides side-by-side comparisons of OSS projects along different parameters.
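Punch-card data of this kind is typically a list of (weekday, hour, commit-count) triples, and a short script can aggregate it into, say, commits per weekday or the share of off-hours activity. The sample data below is made up rather than fetched from any repository:

```python
# Each entry: (day 0=Sunday..6=Saturday, hour 0-23, commits).
# The shape mirrors typical punch-card data; the values are invented.
punch_card = [
    (0, 22, 5), (1, 10, 12), (1, 14, 8),
    (3, 9, 20), (5, 23, 3), (6, 2, 7),
]

days = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]
per_day = {d: 0 for d in days}
for day, hour, commits in punch_card:
    per_day[days[day]] += commits

# The share of commits outside 9-17 "business hours" hints at
# hobbyist versus full-time development activity.
off_hours = sum(c for _, h, c in punch_card if h < 9 or h >= 17)
total = sum(c for _, _, c in punch_card)
print(per_day)
print("off-hours share:", round(off_hours / total, 2))
```

As noted above, whether off-hours development is a concern at all depends on the acquirer's context; the aggregation only makes the pattern visible.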
Another challenge surfaced after we began exploring the OSS repositories: while many typical developer tools are in use, they are used differently across different software projects. One example involves JIRA, bug-tracking software that offers users configurable fields. Another example can be found in the Apache Software Foundation project Derby, where some bugs have fields for urgency, environment, issues, fix information, or bug behavior facts while others do not.
All indicators point to increased adoption of OSS. In November 2013, Federal Computer Week published an article detailing the adoption of OSS across the DoD. An article on OSS and government in Information Week earlier that month stated that "Federal agencies, looking for new ways to lower their IT costs, are exploiting open-source software tools in a wider range of applications, not only to reduce software costs, but also to tighten network security, streamline operations, and reduce expenses in vetting applications and services."
In the coming year, we will continue our work in data analytics and OSS assurance. We are interested in collaborating with organizations to
- expand selected data to analyze and find correlations among seemingly disparate dimensions and measures in software development
- produce evidence for specific OSS projects that are critical to mission needs
- test specific assurance claims using a data-analytics approach and build in bi-directional traceability between claims and evidence
- build tools to accommodate large scale analysis and evidence production (multiple OSS projects along multiple dimensions)
- experiment with evidence production targeting tools
- develop and publish a comprehensive open source assurance classification system
If you are interested in collaborating with us, please leave a comment below or send an email to firstname.lastname@example.org.
For more information about the SEI Emerging Technology Center, please visit
To read the article Has Open Source Officially Taken Off at DoD? by Amber Corrin, please visit
To read the article Agencies Widen Open-Source Use by Henry Kenyon, please visit
To read the article Army C4ISR portal uses open-source software for faster upgrades by William Welsh, please visit