Using Predictive Modeling in Software Development: Results from the Field

As with any new initiative or tool requiring significant investment, the business value of statistically-based predictive models must be demonstrated before they will see widespread adoption. The SEI Software Engineering Measurement and Analysis (SEMA)initiative has been leading research to better understand how existing analytical and statistical methods can be used successfully and how to determine the value of these methods once they have been applied to the engineering of large-scale software-reliant systems.

As part of this effort, the SEI hosted a series of workshops that brought together leaders in the application of measurement and analytical methods in many areas of software and systems engineering. The workshops help identify the technical barriers organizations face when they use advanced measurement and analytical techniques, such as computer modeling and simulation. This post focuses on the technical characteristics and quantified results of models used by organizations at the workshops.

Participants were invited and asked to present at the workshops only if they had empirical evidence about the results of their modeling efforts. A key component of this work is assembling leaders within the organizations who know how to conduct measurement and analysis and can demonstrate how it is successfully integrated into the software product development and service delivery processes. Understandably, attendees don't share proprietary information, but rather talk about the methods that they used, and, most importantly, they learn from each other.

At a recent workshop, the various models discussed were statistical, probabilistic, and simulation-based. For example, organizational participants

demonstrated the use of Bayesian belief networks and process flow simulation models to define end-to-end software system lifecycle processes requiring coordination among disparate stakeholder groups to meet product quality objectives and efficiency of resource usage,
described the use of Rayleigh curve fitting to predict defect discovery (depicted as defect densities by phase) across the software system lifecycle and to predict latent or escaping defects, and
described the use of multivariable linear regression and Monte Carlo simulation to predict software system cost and schedule performance based on requirements volatility and the degree of overlap of the requirements and design phases (e.g. surrogate for risk of proceeding with development prematurely).

Quantifying the Results

The presentations covered many different approaches applied across a large variety of organizations. Some had access to large data repositories, while others used small datasets. Still others addressed issues of coping with missing and imperfect data, as well as the use of expert judgment to calibrate the models. The interim and final performance outcomes predicted by the models also differed considerably, and included defect prevention, customer satisfaction, other quality attributes, aspects of requirements management, return on investment, cost, schedule, efficiency of resource usage, and staff skills as a function of training practices.

One case study, presented by David Raffo, professor of business, engineering, and computer science at Portland State University, described an organization releasing defective products with high schedule variance. The organization's defect-removal activities were based on unit test, where they faced considerable reliability problems. They knew they needed to reduce schedule variance and improve quality, but they had a dozen ideas to consider for how to actually accomplish that. They wanted to base their decision on a quantitative evaluation of the likelihood of success of each particular effort. A state-based discrete event model of large-scale commercial development processes was built to address that and other problems. The simulation was parameterized using actual project data. Some outcomes predicted by the model included the following:

cost in staff-months of effort or full-time-equivalent staff used for development, inspections, testing, and rework,
numbers of defects by type across the life cycle,
delivered defects to the customer, and
calendar months of project cycle time.

Raffo's simulation model was used as part of a full business case analysis. The model ultimately determined likely return on investment (ROI) and related financial performance under different proposed process change scenarios.

Another example presented by Neal Mackertich and Michael Campo of Raytheon Integrated Defense Systems demonstrated the use of a Monte Carlo simulation model they developed. The model was created to support Raytheon's goal of developing increasingly complex systems with smaller performance margins. One of their most daunting challenges was schedule pressure. Schedules are often managed deterministically by the task manager, limiting the ability of the organization to assess the risk and opportunity involved, perform sensitivity analysis, and implement strategies for risk mitigation and opportunity capture. The model developed at Raytheon allowed them to

statistically predict their likelihood of meeting schedule milestones,
identify task drivers based on their contribution to overall cycle time and percentage of time spent on the critical path, and
develop strategies for mitigating the identified risk.

The primary output of the model was the prediction interval estimate of schedule performance (generated from Monte Carlo simulation) using individual task duration probability estimation and an understanding of the individual task sequence relationships. Engineering process funding was invested in the development and deployment of the model and critical chain project management, resulting in a 15 - 40% reduction in cycle time duration against baseline.

Encouraging Adoption

While these types of models are used frequently in other fields, they are not as often applied in software engineering, where the focus has often been on the challenges of the system being developed. As the field matures, more analysis should be done to determine quantitatively how products can be built most efficiently and affordably, and how we can best organize ourselves to accomplish that.

The initial cost of model development can range from a month or two of staff effort to a year depending on the scope of the modeling effort. Tools can range from $5,000 to $50,000 depending on the level of capability provided. As a result of these kinds of investments, models can and have saved organizations millions of dollars through resultant improvements. Our challenge is to help change the practice of software engineering, where the tendency is to "just go out and do it" to include this type of product and process analysis. To do so, we know we have to conclusively demonstrate that the information gained is worth the expense and bring these results to a wider audience.

Additional Resource:

To read the SEI technical report, Approaches to Process Performance Modeling: A Summary from the SEI Series of Workshops on CMMI High Maturity Measurement and Analysis, please visit
www.sei.cmu.edu/library/abstracts/reports/09tr021.cfm

Software Engineering Institute

SEI Blog