Evaluation and Validity for SEI Research Projects

Some of the principal challenges faced by developers, managers, and researchers in software engineering and cybersecurity involve measurement and evaluation. In two previous blog posts, I summarized some features of the overall SEI Technology Strategy. This post focuses on how the SEI measures and evaluates its research program to help ensure these activities address the most significant and pervasive problems for the Department of Defense (DoD). Our goal is to conduct projects that are technically challenging and whose solutions will make a significant difference in the development and operation of software-reliant systems. In this post we'll describe the process used to measure and evaluate the progress of initiated projects at the SEI to help maximize their potential for success.

The Importance of Measurement and Evaluation in Software and Cybersecurity

Certain characteristics of software and cybersecurity are easy to measure--lines of code produced, errors found, errors corrected, port scan events, and time consumed. But much of what really matters in practice poses a far greater challenge to our measurement and evaluation capabilities. For example:

How far are we from completion of the project?
Are the specifications consistent and feasibly implementable?
How much engineering risk is associated with this requirements specification?
How resilient is this architectural design with respect to flooding attacks?
What kinds of robustness are offered by this application programming interface (API) definition?
How many errors reside in the code base, and how many of them are vulnerabilities?
How many of these vulnerabilities are easy coding errors, and how many are intrinsic to the architectural design?
Is the component or system implementation safe for deployment in a secure environment?
Will the control system meet deadlines?

Over the years, much of the research activity at the SEI has been focused on enhancing our ability to measure and evaluate. Indeed, the genesis of the Capability Maturity Model (CMM) and its successors in the late 1980s is based on a need--unmet at the time--to evaluate the capabilities of software development organizations to deliver on their promises. Models such as the Resilience Management Model (RMM) and architecture evaluation methodologies serve similar purposes.

Much of the body of research work at the SEI is focused on these challenges. And so there is a natural question:

How can we evaluate the potential significance of the research, development, test, and evaluation (RDT&E) work we (and others) undertake to improve technology and practice in software engineering and cybersecurity?

This is a significant issue in all RDT&E programs. We need to take risks, but we want these to be appropriate risks that will yield major rewards with the highest possible likelihood.

An overly conservative approach to RDT&E management can yield small increments of improvement to existing practices, when what is really needed, in many cases, are game-changing advances. Indeed, in software and cybersecurity, despite the admonition of Lord Kelvin ("If you cannot measure it, you cannot improve it"), purely "numbers-based" approaches can be dangerous (viz. the famous "light under the lamppost" story). Naive application of management-by-the-numbers is especially damaging in our disciplines, where so many important measures are lacking.

On the other hand, bold "open loop" approaches can yield risks with little likelihood of benefits. We must therefore employ a combination of measures, expert judgment, and careful identification of evaluation criteria--and we must structure the process of research to obtain early indicators and develop them when necessary (i.e., build new lampposts). This multidimensional evaluation process is essential because, in research, it is not just a matter of risk versus reward--thoughtful management in the definitional phases of a project can enable us to "push the curve out," simultaneously mitigating risk, identifying or developing meaningful measures in support of evaluation, and increasing potential for success and impact. A more general discussion on research management practice is the subject of chapter 4 of the 2002 National Academy report on Information Technology Research, Innovation, and E-Government.

Heilmeier's Catechism

In this vein, more than two decades ago, George Heilmeier, former director of the Defense Advanced Research Projects Agency (DARPA), developed a set of questions that he posed to prospective leaders of research projects. If these questions could be addressed well, then the proposed projects were more likely to be both successful and significant. These questions have been so widely adopted in research management that they have become known as the Heilmeier Catechism.

Experience at DARPA and other innovative research organizations has shown that compelling answers to the following questions are likely to predict successful outcomes:

What are you trying to do? Explain objectives using no jargon.
How is it done today? What are the limits of current practice?
What's new in your approach? Why do you think it will be successful?
If you're successful, what difference will it make? To whom?
What are the risks and the payoffs?
How much will it cost? How long will it take?
What are the midterm and final "exams" to assess progress?

(A more detailed presentation of these questions appears in the National Academy Critical Code report, pages 114-115.)

Our work at the SEI spans a spectrum ranging from deep, technically focused research on the hard problems of software engineering and cybersecurity--the technology, tools, practices, and data analyses--to more operationally focused efforts supporting the application of diverse technical concepts, methods, and tools to the challenges of practice, including development and operations for complex software-reliant systems. To ensure we help the DoD and other government and industry sponsors identify and solve key technical and development challenges facing their current and future software-reliant systems, we have adapted Heilmeier's Catechism into an evaluation process that is better suited to the particular measurement challenges in software and cybersecurity.

This evaluation process--which we summarize below--involves a combination of four factors:

Mission relevance and potential significance of a proposed scope of work to the mission effectiveness of DoD, its supply chain, and other critical sectors
Field significance, or capability to evaluate mission effectiveness of emerging solutions to DoD and other stakeholders, including the development of early indicators of mission significance
Technical soundness of the research undertaken, based on accepted scientific approaches
Technical significance and innovation of the work, for example according to the standards of quality and significance evident in relevant top publication venues

Ensuring Validity

The wide spectrum of activities in the SEI body of work motivates us to contemplate four different dimensions of validity in assessing proposals for research projects from SEI technical staff:

Mission relevance. Are we working on the right problem? Means to enhance validity related to mission relevance include challenge problem workshops, along with early and ongoing collaboration with knowledgeable, mission-savvy stakeholders. This is analogous to the agile practice of involving the customer early and often; it dramatically lessens risk that we are solving the wrong problem. The SEI also benefits from having technical staff members with extensive operational experience in the development, sustainment, and modernization of systems, as well as in a wide range of cybersecurity mission activities. This experience, along with direct engagement with operational stakeholders, enhances validity related to mission relevance.
Field significance. Will our solution have the intended impact when it is matured and fielded? Early indicators of field significance include the use of models and simulation, surrogate trials, and field exercises, as well as identification of metrics and validated evaluation criteria and, perhaps most importantly, collaboration with partners who can assist in appraisals, including test and evaluation organizations. Development of such early indicators can be very challenging and can require explicit investment in trials, exercises, and development of surrogates.
Technical soundness. Does our approach embody a sound scientific process that will lead to scientifically valid results? This dimension of validity forces attention to experiment design, mathematical soundness, measures and metrics, statistical significance, and other technical factors that influence the scientific acceptability of outcomes. Technical soundness of published results is generally validated through peer review. The best early predictors of sound outcomes relate to the caliber and experience of the technical staff and their research collaborators.
Technical significance. Is our technical approach well situated in the literature or practice of the discipline, and will results be recognized as significant and scientifically impactful? The best evidence of technical significance derives from direct evaluation by recognized peer experts. For publishable results, this can come in the form of peer review and publication in the top venues, best-paper awards, keynote invitations, recognition from professional technical societies, and other types of peer recognition.

Finally, we consider the use and development of metrics, which contribute both to our ability to evaluate research and to the advancement of practice in software and cybersecurity. Much of the research we do is focused on developing and validating new metrics. In addition, when we develop metrics useful in research, we find they can also have broad value in practice. In the management of research projects, there are four dimensions we consider:

Success criteria. How can we test for overall soundness, significance, and impact when a project reaches completion? What are the observable features of final project results, and who must be involved in making an assessment? Which of the observables are indisputably objective, and which require expert judgment? It is important to note that there is nothing wrong with the use of expert judgment to make evaluations of success and impact, especially when the alternative is a set of inaccurate and incomplete objective measures.
Performance criteria. What are early indicators and tracking mechanisms to monitor progress (the so-called "mid-term exams")? What are the various dimensions of performance, and how are they assessed? As with the success criteria, these can involve a combination of objective measures and expert judgment.
User inputs and feedback. Who are the early collaborative stakeholders, and how are they selected? What are the mechanisms to get feedback from them? This is an explicit criterion, because of the importance of expert judgment and stakeholder acceptance of much of the new work in software and cybersecurity.
Metric development. Where are there dark shadows where the light of measurement is needed? How will the project advance measurement capability as part of its work? At the SEI, this is a first-class activity because the development of new metrics can have broad and deep impact on practice. For example, metric development (or, more generally, development of evaluation capability) is an explicit feature of some of the challenges mentioned in previous blog posts related to software assurance, cost re-estimation, and architectural resilience.
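
As one way to picture the first two dimensions, the sketch below records each criterion along with whether it is a mid-term or final "exam" and whether it rests on an objective measure or on expert judgment. This is only an illustration under stated assumptions: the `Criterion` class, its field names, and the sample entries are hypothetical and do not describe an actual SEI tool or project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One evaluation criterion for a research project (hypothetical structure)."""
    name: str
    phase: str     # "midterm" (performance criterion) or "final" (success criterion)
    basis: str     # "objective" or "expert_judgment"
    assessor: str  # who must be involved in making the assessment


def names_by_basis(criteria, basis):
    """Return the names of the criteria that rest on the given basis."""
    return [c.name for c in criteria if c.basis == basis]


# Sample entries invented purely for illustration.
criteria = [
    Criterion("prototype meets deadlines in a lab trial", "midterm",
              "objective", "project team"),
    Criterion("stakeholder review of early results", "midterm",
              "expert_judgment", "collaborating stakeholders"),
    Criterion("peer review of published results", "final",
              "expert_judgment", "peer reviewers"),
]

objective = names_by_basis(criteria, "objective")
judged = names_by_basis(criteria, "expert_judgment")
print("Objective measures:", objective)
print("Expert judgment:   ", judged)
```

Making the basis explicit keeps expert judgment visible as a legitimate part of the evaluation rather than hiding it behind incomplete objective measures.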

Advances in Metrics

As noted at the outset, measurement is, in some ways, the deepest challenge in both software engineering and cybersecurity. For example, how can we measure the security of a system or the flexibility of a software framework? Likewise, how can we develop useful subsidiary measures that can help us understand the flow of data and information within a large system, or the extent of coupling among a set of federated software components?

This challenge is not only technically deep but profoundly significant to the advancement of practice. Particularly when earned value must be evaluated, the Architecture-Led Incremental Iterative Development (ALIID) approach depends on advances in metrics. Described in the first post in this series, ALIID (or "agile at scale") enables iterative and incremental development of highly capable and innovative systems with acceptable levels of programmatic risk. Earned value based purely on accumulation of lines of code is unlikely to lead to project success, because much of the risk mitigation (accomplished through prototyping and modeling and simulation, for example) leads to reduced variances in estimates, though not necessarily reduced mean values. That is, the estimates become more predictive and reliable as a result of these kinds of activities, and thus greater levels of innovation and engineering-risk taking can be supported without a necessary threat to the potential for success of the overall effort. This is the essence of iteration and incrementality.
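
The point about variance versus mean can be made concrete with a small calculation. The sketch below assumes, purely for illustration, that an effort estimate is normally distributed. The budget, mean, and standard deviations are hypothetical numbers, and `overrun_probability` is a helper invented for this example. Prototyping is modeled as leaving the mean unchanged while shrinking the spread.

```python
from statistics import NormalDist


def overrun_probability(mean_effort, std_dev, budget):
    """Probability that actual effort exceeds the budget,
    assuming the estimate follows a normal distribution."""
    return 1.0 - NormalDist(mu=mean_effort, sigma=std_dev).cdf(budget)


# Hypothetical figures in staff-months, chosen only for illustration.
BUDGET = 120.0
MEAN = 100.0  # risk mitigation does not change the mean estimate here

before = overrun_probability(MEAN, std_dev=30.0, budget=BUDGET)
after = overrun_probability(MEAN, std_dev=10.0, budget=BUDGET)

print(f"Overrun risk before prototyping: {before:.1%}")
print(f"Overrun risk after prototyping:  {after:.1%}")
```

With these assumed figures, the chance of exceeding the budget falls from about 25 percent to about 2 percent even though the expected effort is the same. A lines-of-code measure of earned value would record no progress for that reduction in risk.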

Effective metrics can have a profound impact on the industry--witness the CMM and subsequent CMMI families of measures, just successfully transitioned out of the SEI into a separate activity at Carnegie Mellon. This is why we take as a principle that the development of new measures must be a first-class activity in the research portfolio. Unlike the work done decades ago, our disciplines are now data rich, and we can build on data collection, models and simulation, field tests, and other analytic approaches that, just a few years ago, were relatively less feasible. We see this reflected in the research portfolio at the SEI, which is moving more aggressively than ever to address these challenges and create rich possibilities for the development and operation of software-reliant systems for the DoD, other mission agencies, and their supply chains.

In this series on the SEI Technical Strategic Plan, we have outlined how we are meeting the challenges of designing, producing, assuring, and evolving software-reliant systems in an affordable and dependable manner. The first post in this series outlined our strategy for ALIID or agile-at-scale, combining work on process, measurement, and architecture. The second post in the series detailed our strategies to achieve game-changing approaches to designed-in security and quality (evidence-based software assurance). It also identified a set of DoD-critical component capabilities relating to cyber-physical systems (CPS), autonomous systems, and big data analytics. Finally, it outlined our approach to cybersecurity tradecraft and analytics.

I am also very excited to take this opportunity to introduce you to Kevin Fall, who joined the SEI late last month as its new chief technology officer. In this role, Dr. Fall will take over responsibility from me for the evolution and execution of the SEI Technical Strategic Plan, as well as direction of the research and development portfolio of the SEI's technical programs in cybersecurity, software architecture, process improvement, measurement and estimating, and unique technical support to sponsors. I will continue to take an active role in supporting the SEI and its essential mission--even including occasional future blog posts! Kevin will join the blogger team with a blog post in the near future summarizing his vision for research and transition activities at the SEI.

Additional Resources

We expect that the SEI Strategic Research Plan will soon be available for download.

Software Engineering Institute

SEI Blog