The SPRUCE Series: Recommended Practices in the Software Development of Safety-Critical Systems
Software and acquisition professionals often have questions about recommended practices related to modern software development methods, techniques, and tools, such as how to apply agile methods in government acquisition frameworks, systematic verification and validation of safety-critical systems, and operational risk management. In the Department of Defense (DoD), these techniques are just a few of the options available to face the myriad challenges in producing large, secure software-reliant systems on schedule and within budget.
In an effort to offer our assessment of recommended techniques in these areas, the SEI built upon an existing collaborative online environment known as SPRUCE (Systems and Software Producibility Collaboration Environment), hosted on the Cyber Security & Information Systems Information Analysis Center (CSIAC) website. From June 2013 to June 2014, the SEI assembled guidance on a variety of topics based on relevance, maturity of the practices described, and timeliness with respect to current events. For example, shortly after the Target security breach of late 2013, we selected Managing Operational Resilience as a topic.
Ultimately, the SEI curated recommended practices on five software topics: Agile at Scale, Safety-Critical Systems, Monitoring Software-Intensive System Acquisition Programs, Managing Intellectual Property in the Acquisition of Software-Intensive Systems, and Managing Operational Resilience. In addition to a recently published paper on SEI efforts and individual posts on the SPRUCE site, these recommended practices will be published in a series of posts on the SEI blog. This post, the first in a series by Peter Feiler, Julien Delange, and Charles Weinstock, presents the challenges in developing safety-critical systems and then introduces the first three technical best practices for the software development of such systems. The second post in the series will present the remaining five practices.
Safety-Critical (SC) Systems - SPRUCE / SEI
Our discussion of technical best practices for the software development of safety-critical (SC) systems has four parts. First, we set the context by addressing the questions "What are SC systems and why is their development challenging?" Three of the eight technical best practices for SC systems are presented below. We then briefly address how an organization can prepare for and achieve effective results from following these best practices. In addition, we have added links to various sources to help amplify a point; please note that such sources may occasionally include material that differs from some of the recommendations below.
Every organization is different; judgment is required to implement these practices in a way that provides benefit to your organization. In particular, be mindful of your mission, goals, existing processes, and culture. All practices have limitations--there is no "one size fits all." To gain the most benefit, you need to evaluate each practice for its appropriateness and decide how to adapt it, striving for an implementation in which the practices reinforce each other. Monitor your adoption and use of these practices and adjust as appropriate.
What are SC systems and why is their development challenging?
Software systems are getting bigger and more crucial to the things we do. The focus here is on SC systems--systems "whose failure or malfunction may result in death or serious injury to people, loss or severe damage to equipment, or environmental harm." Examples of SC systems include systems that fly commercial airliners, apply the brakes in a car, control the flow of trains on rails, safely manage nuclear reactor shutdowns, and infuse medications into patients. If any of these systems fail, the consequences could be devastating. We briefly expand on several examples below.
Today we take for granted "fly-by-wire" systems, in which software is placed between a pilot and the aircraft's actuators and response surfaces to provide flight control, thereby replacing wearable mechanical parts and providing rapid real-time response. Fly-by-wire achieves levels of control not humanly possible, providing "flight envelope protection" in which the aircraft's behavior around a specifiable envelope of physical circumstances (specific to that aircraft) can be accurately predicted. Pilots train on the fly-by-wire system to fly that type of aircraft safely; therefore, the loss of fly-by-wire capabilities reduces safety.
To provide a medical device example, the FDA is taking steps to improve the safety of infusion pumps, whose use in administering medication (or nourishment) has become a standard form of medical treatment. Infusion pump malfunctions or their incorrect use have been linked to deaths (see "FDA Steps Up Oversight" and "Medtronic Recalls Infusion Pump"). The experience with infusion pumps has similar implications for other medical devices, such as pacemakers and defibrillators.
SC systems are increasingly software-reliant, pervasive, and connected. These properties challenge current development practices, which must successfully develop and evolve such systems while continuing to satisfy real-time and fail-safe performance requirements.
The practices covered here are intended to address such objectives as the following:
- rigorously anticipating and addressing scenarios for how the system might fail (and not just the typical "sunny-day scenarios")
- identifying defects that can lead to failure early in the lifecycle, since defects found later in the lifecycle are generally much more expensive to correct
- maintaining an appropriate specification of the system requirements and architecture that summarizes what the system must do and how it must do it, which experts in nonfunctional quality attributes (timing, security, etc.) can subject to analysis
- ensuring that the system is evolvable and developable in increments (requirements and solutions may change)
Technical Best Practices for Safety-Critical Systems
1. Use quality attribute scenarios and mission-thread analyses to identify safety-critical requirements.
SC requirements are typically documented through some combination of quality attribute scenarios and mission-thread workshops. A quality attribute scenario is an extended use case that focuses on a quality attribute, such as performance, availability, security, safety, maintainability, extensibility, or testability. A mission thread is a sequence of end-to-end activities and events, given as a series of steps that accomplish the execution of one or more capabilities that the system supports.
Surveys and analyses of product returns and legal actions can help identify safety and related operational concerns with existing products. In the infusion pump example, faults account for a reported 710 deaths.
Like other systems, despite best efforts, SC systems may still fail, but the failure must be handled in a graceful way that protects the main asset--human lives, property, or the environment. For example, in the case of an infusion pump, the definition of a graceful failure depends on the circumstances: in some cases treatment should stop, while in other cases, such as intravenous feeding and chemotherapy, halting the treatment entirely may be more dangerous than putting out too much volume. Clearly, different failure scenarios may require different outcomes.
The Quality Attribute Workshop (QAW) is one mechanism for eliciting SC quality attribute scenarios and identifying and specifying SC requirements.
Challenging mission-critical requirements that create the need for novel solutions are a principal source for SC requirements. For example, high-performance military aircraft, such as the F-117 Nighthawk and the B-2 Spirit flying wing, are designed to be highly aerodynamic and highly maneuverable, qualities that are achieved by transferring stability requirements from the pilot to the flight-control software. It is no longer possible for humans to fly these aircraft unaided; instead the aircraft are largely flown by the flight-control software, which must be at least as reliable as a pair of pilots would be.
The effort to identify SC requirements is ongoing and tied to the other seven practices. For instance, when developing assurance cases, it is important to provide justification that the product design or development process addresses a particular failure scenario.
2. Specify safety-critical requirements, and prioritize them.
This practice highlights a few of the many important considerations in the specification of SC systems. An example of a fuller set of considerations can be found in the FAA Requirements Engineering Management Handbook. For the SC system, specify
- mission-critical requirements (function, behavior, performance) using, for example, state-machine representations of behavior: UML state charts, Simulink state flow, or scenario-driven threads through system functions to help derive a system's behavioral requirements
- safety-critical requirements (safety, reliability, security) as described in Practice 1
Inherent to the specification of a quality attribute is some kind of measure of the desired outcome, which aids in specifying the intended outcome in a scenario with greater clarity and assessing success with greater objectivity. In fact, quality attribute scenarios require some unit of measure. Measures are also important when specifying SC requirements; it is important to utilize or introduce some measure of behavior or performance as a first step to setting a threshold. Such measures can often be established by thinking through what an alternative or current approach requires: returning to our flight-control example, the probability of both the pilot and copilot suffering heart attacks over a ten-hour mission is about 10^(-9), and this establishes a reliability threshold for the software.
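As a rough illustration of turning a mission-level threshold into a figure that can be allocated to components, the sketch below converts the 10^(-9) probability over a ten-hour mission into an hourly failure rate. The constant-failure-rate (exponential) model and the function name are assumptions made for this example; they are not part of the original practice.

```python
import math


def per_hour_failure_rate(mission_failure_prob: float, mission_hours: float) -> float:
    """Hourly failure rate implied by a mission-level failure probability.

    Assumes a constant failure rate (exponential model), which is a
    simplifying assumption for illustration only:
        P(failure within T) = 1 - exp(-rate * T)
        =>  rate = -ln(1 - P) / T
    """
    if not 0.0 < mission_failure_prob < 1.0:
        raise ValueError("mission_failure_prob must be strictly between 0 and 1")
    if mission_hours <= 0.0:
        raise ValueError("mission_hours must be positive")
    # log1p keeps precision when the probability is very small.
    return -math.log1p(-mission_failure_prob) / mission_hours


# Threshold from the flight-control example: 10^(-9) over a ten-hour mission.
rate = per_hour_failure_rate(1e-9, 10.0)
print(f"{rate:.3e} failures per hour")
```

For probabilities this small, the result is very close to the simple ratio of the probability to the mission length, that is, roughly 10^(-10) failures per hour.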
As the system's architecture emerges, identify which component (or subsystem) each safety requirement applies to in the system, recognizing that in some cases multiple components may need to meet a requirement collectively (and possibly a derived requirement would then be specified for each component).
Review the requirements, identifying which ones are safety critical and which ones are not, and which are the most important. The requirements that deserve the most attention deal with incidents that are more likely to happen or that have the most catastrophic effects. For example, for a fly-by-wire aircraft, you care about the effect a coffee pot has on the electrical system, but not to the same extent that you care about the flight-control software. The latter will require many times more resources and attention than the former.
Priorities should be set with stakeholders who may be able to better assess the probability of failure (technologists and end users) and the impact of failure (end users and other stakeholders) in the context of particular missions. One key to not only specifying but also prioritizing requirements is therefore knowing who your stakeholders are and determining how, when, and why you will engage them during the project.
Typically, the result of prioritization is a set of requirements with associated criticality levels. You'll have requirements such as "the system must operate with some minimal functionality for some period of time" and "the system needs to be ready to take over so that if some component fails, it can fail safely with probability nine 9s (i.e., 1.0 - 10^(-9))."
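One simple way to arrive at criticality levels is to score each requirement by the likelihood and severity of the failure it guards against. The sketch below is only an illustration: the 1-to-5 scales, the thresholds, and the ratings given to the two example requirements are all hypothetical and would in practice be set with stakeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirement:
    name: str
    likelihood: int  # 1 (rare) .. 5 (frequent); hypothetical scale
    severity: int    # 1 (negligible) .. 5 (catastrophic); hypothetical scale


def criticality(req: Requirement) -> str:
    """Map a requirement to a criticality level (thresholds are illustrative)."""
    score = req.likelihood * req.severity
    # Anything with catastrophic consequences is treated as high regardless
    # of how unlikely it is.
    if req.severity == 5 or score >= 15:
        return "high"
    if score >= 6:
        return "medium"
    return "low"


# Hypothetical ratings for the two items in the fly-by-wire example.
requirements = [
    Requirement("coffee pot load on the electrical system", likelihood=2, severity=2),
    Requirement("flight-control software", likelihood=2, severity=5),
]

# Most severe first, then by overall score.
ranked = sorted(
    requirements,
    key=lambda r: (r.severity, r.likelihood * r.severity),
    reverse=True,
)
for req in ranked:
    print(f"{criticality(req):6s} {req.name}")
```

Sorting by severity first reflects the point that catastrophic outcomes deserve attention even when they are unlikely; a different project might reasonably weight the two factors differently.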
It is often beneficial to explore alternatives in the allocation of requirements to components because alternatives may offer superior cost/feature tradeoffs (especially when alternative architectures are also considered--see Practice 4, which will appear in Part 2 of this blog posting). Such exploration should also be considered for achieving fail-safe operation. For example, some alternatives may explore use of redundancy.
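When redundancy is one of the alternatives being explored, a back-of-the-envelope calculation can indicate how many replicas a target would require. The sketch below assumes that replicas fail independently, which is a strong assumption (common-mode failures often violate it), and the per-replica failure probability is a made-up figure.

```python
def replicas_needed(p_fail: float, target: float) -> int:
    """Smallest number of replicas whose joint failure probability meets the target.

    Assumes replicas fail independently, so the chance that all n fail
    is p_fail ** n. This is an illustrative simplification.
    """
    if not 0.0 < p_fail < 1.0:
        raise ValueError("p_fail must be strictly between 0 and 1")
    if target <= 0.0:
        raise ValueError("target must be positive")
    n = 1
    while p_fail ** n > target:
        n += 1
    return n


# Hypothetical component that fails with probability 1e-4 per mission,
# measured against the 10^(-9) threshold discussed earlier.
print(replicas_needed(1e-4, 1e-9))
```

A calculation like this only frames the tradeoff; the cost of the added hardware and of the voting logic needed to manage it still has to be weighed against other architectural alternatives.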
You are unlikely to obtain the set of requirements right the first time, so expect some iteration through the requirements and adjustment to the allocation of requirements, especially as the architecture, priorities, and tradeoffs emerge or become better understood.
3. Conduct hazard and static analyses to guide architectural and design decisions.
Apply static analyses to the specification of the system (including mission threads, quality attribute scenarios, requirements, architecture, and partial implementation), or to models derived from those specifications, to help determine what can go wrong and, if something can go wrong, how to mitigate it. The analyses result in "design points" for the components that must be safety critical.
In our infusion pump example, at first glance, the design seems pretty simple. Among other things, you need a pump, something to control the rate of the motor, and a keypad for someone to enter the dosage and frequency. But when you consider manufacturing for a large market, you need to carefully consider what can go wrong and document situations that you will need to address. Note that such considerations might not have been part of the original infusion-pump concept.
For example, embolisms can result when air bubbles beyond a certain size enter the patient. To protect the patient from air getting in the line of the infusion pump, you will need to design certain components of the system to prevent that from happening and other components to detect it if it does happen. From a hardware standpoint, you'll need some kind of sensor that detects air bubbles of a certain size. From a software standpoint, if an air bubble is detected, the pump will need to shut down and raise an alarm (while shutting down the pump may be harmful, an embolism is generally worse). Likewise, you'll need to ensure that these actions take place, which means you'll need redundancy or some other fault-tolerance technique to make sure that these actions happen.
More generally, the development of SC systems must address several operational challenges, among them how to deal with system failure. This challenge in turn means that the system must monitor its operation to detect when a fault is going to occur (or is occurring), signal that failure is imminent or in process, and then ensure that it fails in the right way (e.g., through fault-tolerant design techniques). Depending on the degree of criticality, you might need a lot of redundancy in both the hardware and software to ensure that at the very least, the fail-safe portion of the system runs. Another approach is to implement the SC system as a state machine in which once the device reaches a failed state, it automatically transitions to a safe state (albeit not necessarily an operational state).
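The state-machine approach can be sketched for the infusion pump as follows. This is a minimal illustration, not a certified design: the state names, event names, and the choice to latch the safe state until the device is serviced are assumptions made for the example.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    INFUSING = auto()
    FAILED = auto()
    SAFE = auto()  # motor off, alarm on; safe but not operational


class Event(Enum):
    START = auto()
    STOP = auto()
    FAULT = auto()  # e.g., the air-bubble sensor trips


# Allowed transitions; any (state, event) pair not listed leaves the state unchanged.
TRANSITIONS = {
    (State.IDLE, Event.START): State.INFUSING,
    (State.INFUSING, Event.STOP): State.IDLE,
    (State.IDLE, Event.FAULT): State.FAILED,
    (State.INFUSING, Event.FAULT): State.FAILED,
}


class Pump:
    def __init__(self) -> None:
        self.state = State.IDLE
        self.motor_on = False
        self.alarm_on = False

    def handle(self, event: Event) -> State:
        self.state = TRANSITIONS.get((self.state, event), self.state)
        if self.state is State.FAILED:
            # A failed state is never left standing: move straight to SAFE.
            self.alarm_on = True
            self.state = State.SAFE
        # The motor runs only while infusing, so SAFE always implies motor off.
        self.motor_on = self.state is State.INFUSING
        return self.state


pump = Pump()
pump.handle(Event.START)
pump.handle(Event.FAULT)
print(pump.state, pump.motor_on, pump.alarm_on)
```

Because no transition leaves `SAFE`, a later `START` event is ignored. For treatments where halting is itself dangerous, the safe state would need to be defined differently, for example as a reduced-rate mode.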
Returning to our fly-by-wire example, both redundancy and failing to a safe state have been utilized. Such fly-by-wire aircraft systems have been designed with
- fourfold redundancy, which requires monitoring and voting logic to resolve disagreement among duplicated subsystems
- automatic reversion to manual and mechanical backup controls, as in the Tornado airplane
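The monitoring and voting logic mentioned in the first item can be sketched as a mid-value vote across four channels. The tolerance, the quorum of three, and the exception name are hypothetical choices for this example; real flight-control voters are considerably more involved.

```python
from statistics import median


class ChannelDisagreement(Exception):
    """Raised when too few redundant channels agree; the caller should fall back."""


def vote(readings, tolerance, quorum=3):
    """Return (agreed_value, faulty_channel_indices) for redundant readings.

    Channels within `tolerance` of the median are treated as healthy and
    averaged; the rest are reported as suspected faulty.
    """
    mid = median(readings)
    healthy = [r for r in readings if abs(r - mid) <= tolerance]
    faulty = [i for i, r in enumerate(readings) if abs(r - mid) > tolerance]
    if len(healthy) < quorum:
        raise ChannelDisagreement(
            f"only {len(healthy)} of {len(readings)} channels agree"
        )
    return sum(healthy) / len(healthy), faulty


# Four channels, one of which has drifted badly (made-up values).
value, faulty = vote([10.0, 10.1, 9.9, 42.0], tolerance=0.5)
print(value, faulty)
```

When the vote raises `ChannelDisagreement`, the caller is the natural place to trigger the second strategy in the list, reverting to a backup control path.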
A rich taxonomy of architectural and design tactics has been developed over the years to help in detecting, recovering from, and preventing faults.
Some static analysis methods, such as hazard analysis and failure mode and effect analysis (FMEA), have been around for decades and provide broad and proven approaches to assessing system reliability. There are several forms of FMEA, but they all take a systematic approach to identifying failures, their root causes, mitigations for selected root causes, and the kinds of monitoring required to detect failures and track their mitigation. The result of an FMEA can reveal the need for additional design, such as adding a sensor to help identify an indication of failure or progress in its mitigation, followed by another iteration of FMEA to recalculate the risk exposure and new risk priorities, and so on. The result of a hazard analysis is a characterization of the risks and mitigations associated with high-priority hazards, including likelihood, severity of impact, how the hazard will be detected, and how it will be prevented.
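One common form of FMEA ranks failure modes by a risk priority number (RPN), the product of severity, occurrence, and detection ratings. The sketch below shows one iteration of the loop described above, in which adding a sensor improves detection and changes the ranking. The failure modes and all ratings are invented for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FailureMode:
    description: str
    severity: int    # 1 (minor) .. 10 (catastrophic)
    occurrence: int  # 1 (remote) .. 10 (very frequent)
    detection: int   # 1 (almost certainly detected) .. 10 (undetectable)

    @property
    def rpn(self) -> int:
        """Risk priority number: higher means address sooner."""
        return self.severity * self.occurrence * self.detection


def rank(modes):
    return sorted(modes, key=lambda m: m.rpn, reverse=True)


# Hypothetical infusion-pump failure modes and ratings.
air_in_line = FailureMode("air in line", severity=10, occurrence=3, detection=8)
keypad_error = FailureMode("dosage entry error", severity=7, occurrence=4, detection=5)

before = rank([air_in_line, keypad_error])

# Design change: add an air-bubble sensor, which improves the detection rating.
air_in_line_mitigated = replace(air_in_line, detection=2)
after = rank([air_in_line_mitigated, keypad_error])

for label, modes in (("before", before), ("after", after)):
    print(label, [(m.description, m.rpn) for m in modes])
```

After the mitigation, the entry-error mode moves to the top of the list, which is the cue for the next iteration of the analysis.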
Other analysis methods focus on how the system responds in situations of resource contention or communication corruption. These include timing studies (can critical task deadlines be met?) and scheduling analyses (e.g., to eliminate priority inversion, deadlock, and livelock). Such resource contention problems were largely solved years ago for simple processor and memory configurations, and the solutions have been progressively extended to deal with distributed systems, multilayer cache, and other complexities in hardware configuration.
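As one example of a scheduling analysis for a simple single-processor configuration, the sketch below applies the classic Liu and Layland utilization bound for rate-monotonic scheduling. It assumes independent periodic tasks with deadlines equal to their periods, and the task sets shown are made up. The test is sufficient but not necessary: a task set that fails it may still be schedulable and needs a more exact analysis.

```python
def rm_utilization_bound(n: int) -> float:
    """Liu and Layland bound for n tasks under rate-monotonic scheduling."""
    if n < 1:
        raise ValueError("need at least one task")
    return n * (2 ** (1 / n) - 1)


def passes_rm_test(tasks) -> bool:
    """tasks: iterable of (worst_case_execution_time, period) in the same time unit.

    Returns True if total utilization is within the bound, which guarantees
    all deadlines are met under the stated assumptions. False means the
    result is inconclusive, not that deadlines will be missed.
    """
    tasks = list(tasks)
    utilization = sum(wcet / period for wcet, period in tasks)
    return utilization <= rm_utilization_bound(len(tasks))


# Hypothetical task sets, times in milliseconds.
light_load = [(1, 10), (2, 20), (5, 50)]  # utilization 0.30
heavy_load = [(5, 10), (8, 20)]           # utilization 0.90

print(passes_rm_test(light_load), passes_rm_test(heavy_load))
```

This bound does not account for shared resources, so issues such as priority inversion still call for a separate analysis of blocking times.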
In our infusion-pump example, specifying the device in an appropriate formal language will allow timing studies of SC requirements to be conducted. For example, a timing study could investigate whether the air-bubble monitoring process will be able to execute frequently and consistently enough (perhaps as a function of motor speed) to ensure adequate time to shut down the pump.
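A very simplified version of such a timing study is sketched below. It assumes a bubble travels with the fluid, so the time available is the tubing volume between the sensor and the patient divided by the flow rate. In the worst case the bubble passes the sensor just after a sample, so one full sampling period plus the shutdown latency must fit within that time. The model and every number are assumptions for illustration, not data about any real pump.

```python
def max_sampling_period_s(
    tube_volume_ml: float,
    flow_rate_ml_per_h: float,
    shutdown_latency_s: float,
) -> float:
    """Longest sensor sampling period that still leaves time to stop the pump."""
    if flow_rate_ml_per_h <= 0.0:
        raise ValueError("flow rate must be positive")
    # Time for a bubble to travel from the sensor to the patient.
    travel_time_s = tube_volume_ml / flow_rate_ml_per_h * 3600.0
    return travel_time_s - shutdown_latency_s


def monitor_is_fast_enough(
    sampling_period_s: float,
    tube_volume_ml: float,
    flow_rate_ml_per_h: float,
    shutdown_latency_s: float,
) -> bool:
    limit = max_sampling_period_s(tube_volume_ml, flow_rate_ml_per_h, shutdown_latency_s)
    return sampling_period_s <= limit


# Hypothetical figures: 0.5 mL of tubing after the sensor, 0.3 s to stop the motor.
for flow in (1000.0, 2000.0):
    ok = monitor_is_fast_enough(1.0, 0.5, flow, 0.3)
    print(f"flow {flow:.0f} mL/h: 1.0 s sampling period adequate? {ok}")
```

With these figures a one-second sampling period is adequate at the lower flow rate but not at the higher one, which illustrates why the required monitoring frequency may need to be a function of motor speed.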
Some of these analyses may involve creating or generating proofs that certain components or configurations of components can achieve certain properties, using theorem provers. These high-confidence software and systems analysis techniques are particularly critical for very high-risk requirements and components.
In Practices 4 and 5, we will have more to say about static analyses and will note some limits to what can currently be achieved with their use. In Practice 8, we will see that the hazard and static analyses and formal proofs described in Practice 3 feed into the development of the safety case.
Technology transition is a key part of the SEI's mission and a guiding principle in our role as a federally funded research and development center. The next post in this series will present the remaining recommended practices for developing safety-critical systems.
These practices are certainly not complete--they are a work in progress. We welcome your comments and suggestions on this series in the comments section below.
To view the complete post on the CSIAC website, which includes a detailed list of resources for developing safety-critical systems, please visit