
The SPRUCE Series: 8 Recommended Practices in the Software-Development of Safety-Critical Systems

SPRUCE Project

This is the second installment of two blog posts highlighting recommended practices for developing safety-critical systems, originally published on the Cyber Security & Information Systems Information Analysis Center (CSIAC) website. The first post in the series by Peter Feiler, Julien Delange, and Charles Weinstock explored challenges to developing safety-critical systems and presented the first three practices:

  1. Use quality attribute scenarios and mission-thread analyses to identify safety-critical requirements.
  2. Specify safety-critical requirements, and prioritize them.
  3. Conduct hazard and static analyses to guide architectural and design decisions.

This post presents the remaining five technical best practices.

Safety-Critical (SC) Systems--SPRUCE/SEI
https://www.csiac.org/reference-doc/safety-critical-sc-systems/

4. Specify the architecture incrementally, in a formal notation.

As with requirements, architectures are often specified incrementally, as new insights and risks emerge. These architectures are then communicated to developers and suppliers to align them with the selected design and implementation paths. Components with SC requirements should ideally be specified in a formal language with well-defined semantics to support rigorous model checking and theorem proving. Such notations enable evaluating the specification and predicting the component's behavior when it is successfully implemented.

When the risk of making the wrong architecture decision is high, it may be necessary to consider multiple architectures and co-develop one or more of these architectures with suppliers (when there are suppliers). Appropriate stakeholders should evaluate the results to select an architecture, or multiple architectures, to pursue.

We recommend using, to the extent possible, the same specification language (Practice 6) throughout system development for both system requirements and architecture. This commonality will enable architects and developers to

(1) detect defects early (before implementation and testing) through model consistency checking and predictive analyses of operational quality attributes across requirements and solution specifications,
(2) apply formal methods such as model checkers and theorem provers, and
(3) minimize incompatible abstractions, multiple truths, and indeterminate change impact.

In our presentation of these practices, we've separated the practices for specifying requirements from specifying architecture, but these are not serial activities in which development teams do all of one and then all of the other. Rather, requirements and design interweave and influence each other. A bit more design often yields new derived requirements, which in turn might be addressed by additional design.

For example, an infusion pump needs a sensor to ensure that no air bubbles greater than a certain diameter enter the system. The presence of the sensor creates an additional point of failure, which needs to be addressed through further design. So a second sensor or other forms of redundancy are added, each of which has its own derived requirements, and the process continues.
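To make this interweaving concrete, here is a minimal Python sketch, with hypothetical names and logic (not drawn from any real pump), of a controller that cross-checks two redundant bubble sensors. The disagreement case is exactly the kind of derived requirement the example describes: the second sensor mitigates one failure mode while introducing another that must itself fail safe.

```python
def pump_command(sensor_a, sensor_b):
    """Cross-check two redundant air-bubble sensors (hypothetical logic).

    Agreement on 'no bubble'   -> continue infusion.
    Agreement on 'bubble seen' -> halt and alarm.
    Disagreement               -> one sensor has failed; this derived
    failure mode must itself fail safe, so halt and alarm as well.
    """
    a, b = sensor_a(), sensor_b()
    if a != b:  # derived requirement: handle sensor disagreement
        return "HALT_AND_ALARM"
    return "HALT_AND_ALARM" if a else "CONTINUE"
```

Each added sensor narrows one hazard while creating new ones (disagreement, stuck-at readings), which is why the requirements-design interplay rarely terminates after a single iteration.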

5. Sustain a virtual integration of the software through multiple levels of its specification.

Applying virtual integration helps uncover issues with proposed technical solutions (candidate architectures and their implementations) before an expensive commitment to those solutions is required.

Virtual integration is characterized by an "integrate then build" mindset, as opposed to the more common "build then integrate" mindset. By pursuing a virtual integration, architects can analyze and identify potential system issues so that engineers can correct their design immediately. This approach reduces development cost and avoids late delivery.

In terms of current best practice, however, virtual integration is a concept not yet fully realized, except in separate, well-studied domains. It is possible to derive various domain-specific models from the specifications and subject them to domain-specific analyses, including evaluation by domain experts. But the current underlying meta-models are not yet fully semantically relatable; that is, they do not translate without loss or inconsistency in underlying semantics. These inconsistencies introduce subtly different meanings for the requirements or design. Practice 4 referred to these differences as "incompatible abstractions, multiple truths, and indeterminate change." Such different meanings would potentially invalidate transferring conclusions drawn from the static analyses to the system being developed.

Nevertheless, these automated, domain-specific static analyses still offer value. They can help detect many defects early, but with some false positives and false negatives. Such automated analyses become particularly important as the complexity of the software system increases beyond the ability of a single human to comprehend it. We therefore advocate taking a virtual-integration approach when developing SC systems.

"Full" virtual integration is the goal of the System Architecture Virtual Integration (SAVI) Program, which is intended to advance the state of the art in virtual integration of complex systems. According to the SAVI website,

  • SAVI aims to produce a new way of doing system integration--a virtual integration process (VIP)
  • Models are the basis for the VIP
  • The primary goal is to reduce the number of defects found during physical and logical system integration [resulting in] lower cost, less time
  • Integration starts with conceptualization.
  • "Integrate then build": Move integration forward, get it right sooner, and then keep it right as changes inevitably occur.

The SAVI Program is maturing the best of the state of the art into best practice through a number of pilot projects and transition activities; according to the website, it is expected to reach best-practice maturity in 2016 or 2017. At present, we advocate pursuing virtual integration to the maximum practical extent, covering those domains posing the highest risk to the success of a project with the technology and skills available to the project.

6. Use Architecture Analysis and Design Language (AADL) to formally specify requirements and architecture.

The specification will need to cover interactions with the operational environment, the hardware on which the software will operate, the architecture for the software, and initial implementations for some components from a component or reuse library. Specification should also include agreements and derived requirements for components to be provided by suppliers.

While other architecture definition languages have various strengths, we recommend AADL for these reasons:

  • It has a formal definition with well-defined semantics for both software and hardware concerns.
  • It supports specification and analysis of several quality attributes, including performance and safety.
  • It has been proven through almost a decade of use since Version 1 in 2004.
  • It is extensible through addition of other domains and associated static analyses.
  • It has support from a broad community, including tools such as OSATE.
  • The use of AADL for discovering development issues (such as safety, performance, and integration) has been demonstrated in several research projects, notably SAVI, which uses AADL as its backbone language for specifying the architecture and its main components. SAVI addresses ongoing safety-critical software-development challenges by virtually integrating software and hardware components to discover issues at the earliest possible opportunity.

7. Monitor implementation, integration, and testing.

If we're lucky enough to work in a well-integrated set of mature domains, we may be able to generate all of the code from our detailed architectural specifications, perhaps through parameterization of prebuilt architectural patterns and associated code. Otherwise, and more likely, we will have to build some of the code. While the previous practices (especially Practices 3-5) have helped establish an architecture that can meet timing and other nonfunctional, safety-related requirements, implementation must proceed carefully to ensure that an architecturally conformant implementation results and that the integration proceeds smoothly. There may be some surprises.

As mentioned, much of the implementation may be automatically generated from the AADL specification, which is possible in some cases, particularly when using predefined architectural templates developed for this purpose. When reusing code developed for another purpose, be alert to the possibility that the assumptions made during its initial implementation--not all of them documented--may not hold in the new operational context.

It is also necessary to carefully test the fail-safe parts of SC systems. Perhaps due in part to the general optimism with which humans approach projects and tasks, the tendency is to cover only scenarios based on anticipated normal use; you and your users then risk discovering that the system doesn't behave as intended during failure or restoration of system service. "Cause of Telephone System Breakdowns Eludes Investigators" and "The AT&T Crash" provide examples in which a system didn't behave as intended during failure. An example of reusing code developed for one context and placing it in another is the Ariane 5 catastrophe, which fortunately had satellites and not humans as payload.

For high-risk SC requirements and components, you might use high-fidelity models of the component annotated with formal assertions developed in a formal specification language combined with AADL (Practice 6) to specify the required behavior for that component. Then you can employ theorem proving or model checking to verify that the component's code does in fact satisfy its specification.
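As an illustration of the kind of guarantee such verification provides, the toy Python sketch below exhaustively enumerates the reachable states of a hypothetical pump controller and asserts a safety property on every transition. This is only a stand-in for the idea: real verification would run a model checker or theorem prover against the formal specification, not hand-rolled enumeration, and all names here are invented.

```python
from itertools import product

def step(state, bubble):
    """Transition function of a hypothetical 3-state pump controller."""
    if bubble:
        return "HALTED"           # any bubble forces the safe state
    if state == "HALTED":
        return "HALTED"           # leaving HALTED needs operator reset (not modeled)
    return "INFUSING"

def check_safety(depth=5):
    """Enumerate all states reachable from IDLE within `depth` steps and
    assert the safety property on every transition: the pump is never
    infusing in the same step that a bubble is detected."""
    reachable = {"IDLE"}
    for _ in range(depth):
        nxt = set()
        for state, bubble in product(reachable, [False, True]):
            nxt_state = step(state, bubble)
            assert not (bubble and nxt_state == "INFUSING")
            nxt.add(nxt_state)
        reachable |= nxt
    return reachable
```

Exhaustive enumeration like this only scales to toy models, which is precisely why the text recommends purpose-built model checkers and theorem provers for real components.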

When architecting a system that has suppliers or draws upon code from external sources, additional care is needed to negotiate with suppliers and evaluate sources based on an understanding of the product and process capability within the appropriate product domains, relative to the project's SC needs. Many of these activities require diligent communication with stakeholders, both in the operational environment and in the supply chain, to set expectations and ensure understanding of relevant SC requirements and the architectural, implementation, and verification approaches that will be used to address them.

8. Prepare a safety case for certification concurrent with developing the system.

The question the manager responsible for developing SC software will ask is "How can I be sure that everything reasonable is being done to ensure that the developed system will behave safely in operation?" External stakeholders--in particular, regulatory agencies--will need to be sure of this too.

Products that have the potential for being unsafe must go through certification or some other sort of regulatory process before being sold. Such requirements vary according to the product being built. Flight-control software in the USA is subject to Federal Aviation Administration (FAA) regulations or the non-U.S. equivalent. For an infusion pump in the USA, the Food and Drug Administration (FDA) establishes requirements. Likewise, for software to shut down a nuclear reactor, it is the Nuclear Regulatory Commission in the USA. In all three cases, software suppliers must submit documentation of what they've done to address safety as part of the request for certification.

Apart from considering what it will take to achieve certification, your organization will not want to confront the liability arising from a catastrophe due to some oversight in how the system was designed, implemented, verified, and validated (and if applicable, manufactured). Typically, some sub-organization, perhaps Quality Assurance (QA), will take on the role of assuring management that due diligence was exercised (and is being exercised) in product development, but when it comes to systemic, critical quality attributes such as safety, due diligence is the responsibility of everyone involved in the development of the product. A compelling case must therefore be prepared for both internal and external stakeholders to show that the project has done all that is reasonable to ensure that the developed system will be safe in routine operation, when under stress, and if components fail. This case will reflect many of the early and ongoing considerations of a project seeking to mitigate risks to human safety.

Toward this end, projects should develop an assurance case for safety, also called a "safety case." A safety case is an argument supported by evidence that the system meets its claim of safety. It provides justified confidence that the system will function as intended (with respect to the safety claim) in its environment of use. The evidence in a safety case could include process arguments, formal analysis, simulation, testing, and hazard analysis--in effect, all of the techniques previously discussed. The case becomes a reviewable artifact that makes development, maintenance, and evaluation significantly more effective.

Special attention should be given to the fact that, as SC systems become more software-reliant, we rely less on failure probabilities of physical components. Software defects are design faults that will occur with probability 1 every time the defective code is executed. We therefore should consider analytic-redundancy approaches to mitigate such failures.
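A minimal sketch of analytic redundancy, using hypothetical names: the same infusion rate is computed by two deliberately different algorithms, and a divergence between them, which would indicate a design fault in one, triggers the fail-safe path rather than letting a wrong value flow downstream.

```python
def rate_primary(volume_ml, minutes):
    """Straightforward rate computation."""
    return volume_ml / minutes

def rate_secondary(volume_ml, minutes, iters=60):
    """Deliberately diverse implementation: bisection search for the rate r
    with r * minutes == volume_ml. Assumes minutes >= 1 so the answer is
    bracketed by [0, volume_ml]."""
    lo, hi = 0.0, volume_ml
    for _ in range(iters):
        mid = (lo + hi) / 2
        if mid * minutes < volume_ml:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def checked_rate(volume_ml, minutes, tol=1e-6):
    """Analytic redundancy: cross-check the two diverse computations and
    fail safe on divergence, since a software defect recurs every time
    the defective path executes."""
    p, s = rate_primary(volume_ml, minutes), rate_secondary(volume_ml, minutes)
    if abs(p - s) > tol:
        raise RuntimeError("analytic redundancy check failed: enter safe state")
    return p
```

Unlike replicating identical copies of one program, diverse implementations give the cross-check a chance of catching a design fault, because the same defect is unlikely to appear in both.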

A compelling and thorough safety case must be planned and prepared for at the outset of a project. Indeed, a safety case can be considered an essential "component" of the product that the project will produce. As such, a safety case has requirements, a design, and content that has various functions and must be structured, as well as parts that must be related in various ways to each other and traced to the safety and regulatory requirements themselves. Because a safety case is so tightly wedded to early project design and planning considerations anyway--when the project's needed processes, methods, tools, and skill sets will be determined--it is both prudent and efficient to begin developing the safety case early alongside the system being developed and using it to guide system development.

At the beginning of system development, a safety case will be more abstract, addressing components from the top level of the product hierarchy. For example, "the infusion-pump keyboard will be resistant to errors because its keys have no bounce characteristics and because its human interface has been designed properly." As development proceeds, individual components of this argument will be extended with or supplemented by increasingly detailed arguments supported by evidence. For instance, "the keyboard has no bounce characteristics because the keyboard state is polled with high frequency to disambiguate key presses; and here are results from tests of the bounce characteristics of the selected keyboard."
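The polling argument in that example can be sketched in a few lines of Python (a hypothetical illustration, not the actual device logic): a key-state change is accepted only after several consecutive identical high-frequency samples, so a bounce-induced glitch never registers as a press.

```python
from collections import deque

def debounce(samples, stable_count=3):
    """Turn a stream of raw key samples (0 = up, 1 = down) into press/release
    events. A state change is accepted only after `stable_count` consecutive
    identical samples, filtering out contact bounce."""
    events = []
    state = 0                               # last debounced (accepted) state
    history = deque(maxlen=stable_count)    # sliding window of raw samples
    for raw in samples:
        history.append(raw)
        if len(history) == stable_count and len(set(history)) == 1:
            if history[0] != state:
                state = history[0]
                events.append("press" if state == 1 else "release")
    return events
```

Evidence for the corresponding safety-case claim would then include tests showing that bouncy sample streams yield exactly one event per physical key action.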

Initially, there will also be a focus on the processes, methods, and tools to use, but these too will get more specific (e.g., in selecting testing methods for evaluating keyboard bounce) as the system and software architecture are refined. As the project progresses, both processes and product portions of the argument will become more granular and more complete and at some point will be represented by specific results from process, method, and tool application. For example, the project manager might initially know that the project is heading toward demonstrating that the keyboard has no bounce but might not know the specific way that will be demonstrated until the keyboard is selected.

As a safety case emerges, it provides context for interpreting a particular piece of evidence. For example, when provided with test results of the bounce characteristics of a particular keyboard, the manager or other stakeholders can relate that specific evidence to the broader safety case and evaluate the strengths and weaknesses of the overall argument that "this is safe because... ." As product design proceeds, you will make more decisions (e.g., to add or replicate sensors), have more claims to check, and accumulate more evidence, producing a tree of claims, which is the safety case.
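The tree of claims can be represented directly in code. The sketch below is a hypothetical structure, loosely in the spirit of goal-structuring notation, showing how attaching one piece of evidence changes the status of the overall argument.

```python
from dataclasses import dataclass, field

@dataclass
class Claim:
    text: str
    evidence: list = field(default_factory=list)   # test reports, analyses, logs
    subclaims: list = field(default_factory=list)

    def supported(self) -> bool:
        """A claim is supported by direct evidence, or by a non-empty set of
        subclaims that are all themselves supported."""
        if self.evidence:
            return True
        return bool(self.subclaims) and all(c.supported() for c in self.subclaims)

keyboard_safe = Claim(
    "Keyboard input is error-resistant",
    subclaims=[
        Claim("Keys exhibit no bounce",
              evidence=["bounce test report for selected keyboard"]),
        Claim("Human interface designed properly"),   # evidence still pending
    ])
```

Here `keyboard_safe.supported()` stays false until evidence (say, a usability evaluation) is attached to the second subclaim, mirroring how the argument fills in as development proceeds.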

Thus, by concurrently developing the safety case as the system is architected and implemented, you help ensure continual attention on high-priority technical requirements and risks. You also produce an organized tree of claims linking SC requirements and architectural decisions to claims and evidence of those claims. Here, "evidence" means the results of static analyses from Practice 3, testing from Practice 7, formal methods, and process-capability arguments. For instance, "a log of many years usage has shown that this subcomponent executing in similar circumstances has not experienced a problem."

Traditionally, when the device description, hazard analysis, results of testing, and other documents are filed with the appropriate regulatory agency (the FDA, in the case of the infusion pumps), there is a statutorily limited time period during which the agency reviews the documentation and makes a decision. This can be a daunting, even impossible challenge for reviewers. Typically, the reviewer probes specific claims in the documentation to assess how it supports safety rather than trying to assess every claim and all evidence.

Specifically in the case of infusion pumps, the FDA has introduced a requirement that vendors include a safety case as part of their submission with the intent of eventually extending it to cover all software-reliant medical devices. The benefit for reviewers--not just at the FDA but also stakeholders inside the company and in the supply chain--is that a safety case provides a description of the product, claims about it, and a body of evidence, as well as the argument linking these together. Thus, if a reviewer has a question about particular hazards, design decisions, claims about them, or evidence, he or she can more easily relate each of these to the others and the rest of the safety case and arrive more readily at an evaluation of individual claims and the quality of the overall argument. This enables the reviewer to make more strategic use of limited review time and more rapidly identify inadequately mitigated risks for appropriate follow up.

"Towards an Assurance Case Practice for Medical Devices" provides a partial example of a safety case for infusion pumps. In the case of aviation, the UK MoD Defence Standard 00-56 has been requiring that a safety case be part of a vendor's submission for a range of defense aircraft types since 2007. QinetiQ has developed an example military safety case.

Looking Ahead

Technology transition is a key part of the SEI's mission and a guiding principle in our role as a federally funded research and development center. The next post in this series will present recommended practices for enabling agility at scale.

We welcome your comments and suggestions on this series in the comments section below.

Additional Resources

To view the complete post on the CSIAC website, which includes a detailed list of resources for developing safety-critical systems, please visit
https://www.csiac.org/reference-doc/safety-critical-sc-systems/.
