System Resilience Part 7: 16 Guiding Principles for System Resilience
Adverse events and conditions can disrupt a system, causing it to fail to provide essential capabilities and services. As I outlined in previous posts in this series, resilience is an essential quality attribute of most systems because they provide critical capabilities and services that must continue despite the inevitable adversities. These adversities are often unavoidable and come in many forms. Typical examples include coding defects (robustness), hazards and accidents (safety), vulnerabilities and attacks (cybersecurity and survivability), excessive loads (capacity), long lifespans (longevity), and lost communication (interoperability).
- In the first post in this series, I defined system resilience as the degree to which a system rapidly and effectively protects its critical capabilities from harm caused by adverse events and conditions.
- The second post identified the following eight subordinate quality attributes that categorize the adversities that can disrupt a system's critical capabilities: robustness, safety, cybersecurity, anti-tamper, survivability, capacity, longevity, and interoperability.
- The third post covered the engineering of system resilience requirements and how they can be used to derive related requirements for these subordinate quality attributes.
- The fourth post presented an ontology for classifying resilience techniques and clarified the relationships between resilience requirements and resilience techniques.
- The fifth post in the series presented a relatively comprehensive list of resilience techniques, annotated with the resilience function (i.e., resistance, detection, reaction, and recovery) that they perform.
- The sixth post helped readers verify whether the architecture, design, or implementation of a system meets its resilience requirements as well as its subordinate requirements for robustness, safety, cybersecurity, anti-tamper, survivability, capacity, longevity, and interoperability.
This seventh and final post distills the information in the six previous posts into the following 16 guiding principles to help the system and software engineer develop resilient systems.
Focus on Mission-Critical Capabilities
A system typically supports many capabilities that vary radically depending on the system's mission and functionality. Some capabilities are critical to the success of the system's mission, whereas others might only play a supporting role or have work-arounds that avoid disrupting the mission. Similarly, not all capability-related requirements are equal. Some are necessary for mission success, whereas others are not.
The goal of system resilience is to ensure that mission-critical capabilities are not disrupted by adverse conditions and events. Meeting this goal is why the identification and understanding of mission-critical capabilities is the starting point for engineering a resilient system.
Identify Critical Assets
Each mission-critical capability is implemented by associated critical assets, including system components, system data, and potentially system-external data sources/sinks (e.g., external systems and external data bases) and external networks connecting the system to these data sources/sinks. To protect a mission-critical capability from disruption, engineers must protect the associated critical assets, which is why it is important to identify the critical assets that mission-critical capabilities depend upon.
Concentrate on Common Critical Assets
A single critical asset often supports multiple mission-critical capabilities. Thus, failure to protect these common critical assets can result in the disruption of multiple mission-critical capabilities. Avoiding these disruptions is why resilience engineering should concentrate on common critical assets, such as shared services/components, networks, and data repositories.
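One way to make common critical assets visible is to invert a capability-to-asset dependency map and flag any asset that more than one mission-critical capability depends on. The following is a minimal sketch; the capability and asset names are purely illustrative, not drawn from any particular system.

```python
# Sketch: finding common critical assets from a capability-to-asset
# dependency map. All capability and asset names are illustrative.
from collections import defaultdict

def common_critical_assets(capability_assets):
    """Return assets that support more than one mission-critical capability."""
    usage = defaultdict(set)
    for capability, assets in capability_assets.items():
        for asset in assets:
            usage[asset].add(capability)
    return {asset: caps for asset, caps in usage.items() if len(caps) > 1}

deps = {
    "navigation": {"gps_receiver", "message_bus", "position_db"},
    "targeting":  {"sensor_array", "message_bus", "position_db"},
    "reporting":  {"message_bus", "uplink"},
}

shared = common_critical_assets(deps)
# "message_bus" supports all three capabilities; "position_db" supports two.
```

Assets that surface in such a map (shared buses, repositories, services) are the ones whose failure disrupts the most capabilities, and thus the ones that most deserve resilience engineering attention.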
Concentrate on Disruptive Harm
There are many ways that critical assets can be harmed by adverse conditions and events. Certain conditions or events, however, may not disrupt mission-critical capabilities. Resilience engineering should therefore concentrate on harm that disrupts mission-critical capabilities.
Assume the Occurrence of Adversities
Many adversities are unavoidable, especially in today's turbulent cyberspace environments. Resilience engineering activities should therefore be based on the assumption that adverse conditions exist and adverse events will occur.
Consider All Types of Adversities
This consideration should include both adverse conditions and adverse events, as well as adversities associated with all eight subordinate quality attributes: robustness, safety, cybersecurity, anti-tamper, survivability, capacity, longevity, and interoperability. Too often, system resilience concentrates on a single quality attribute (especially robustness, safety, or security) because the system resilience requirements are primarily driven by individual reliability, safety, or security engineers.
Assume Multiple Adversities
Adversities do not always happen in isolation and might interact with each other. Sometimes, they occur simultaneously or in rapid succession. The existence of adverse conditions often leads to associated adverse events. Most accidents result from a sequence or network of adversities. For example, a cybersecurity attack might result in a fault or failure that results in an accident. Similarly, an accident may produce a vulnerability that enables a successful cybersecurity attack.
Expect Adversities to Vary Over Time
Adversities vary over time. Moreover, new adversities (e.g., new security threats and vulnerabilities) are regularly discovered. Maintaining and updating a system often change the probabilities of occurrence and the negative consequences of adversities. As a result, resilience engineering is never finished, but instead is an ongoing activity.
Identify and Prioritize Potential Adversities
To prevent the disruption of mission-critical capabilities, systems must be protected from a large number of potential adverse conditions and events. Potential adversities must be identified and understood. However, the number of potential adversities is often so large that only a subset can be addressed in practice. Risk analysis is typically used to prioritize these adversities in terms of their probability of occurrence and the level of harm they can cause.
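A common first-pass risk analysis ranks adversities by risk exposure, the product of probability of occurrence and level of harm. The sketch below illustrates this; the adversity names, probabilities, and harm scores are invented for illustration, not real assessments.

```python
# Sketch: prioritizing adversities by risk exposure (probability x harm).
# All names and figures are illustrative, not real risk data.
adversities = [
    {"name": "network partition", "probability": 0.30, "harm": 7},
    {"name": "sensor drift",      "probability": 0.60, "harm": 3},
    {"name": "ransomware attack", "probability": 0.05, "harm": 10},
    {"name": "disk exhaustion",   "probability": 0.40, "harm": 4},
]

def prioritize(items):
    """Rank adversities, highest risk exposure first."""
    return sorted(items, key=lambda a: a["probability"] * a["harm"], reverse=True)

ranked = prioritize(adversities)
# Exposures: network partition 2.1, sensor drift 1.8,
# disk exhaustion 1.6, ransomware attack 0.5.
```

In practice the scores come from hazard analyses, threat modeling, and failure data rather than point estimates, but the principle is the same: spend the resilience budget on the adversities at the top of the ranking.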
Resist Adversities
A system can often be architected, designed, and implemented to resist certain adversities (e.g., to passively prevent these adversities from disrupting mission-critical capabilities). This passive resistance can sometimes be more effective than (and even eliminate the need for) resilience techniques that actively detect, react to, and recover from the same adversities. On the other hand, defense in depth may lead to the use of both passive and active resilience techniques to protect mission-critical capabilities from the same adversities.
Detect Adversities
To react to and recover from adversities, it is first necessary to detect them. This step includes detecting not only adverse events but also adverse conditions, so that the associated reaction can prevent the corresponding adverse events.
Some might argue that it is sufficient to specify requirements for reaction and recovery, that detection is understood as being necessary (i.e., detection requirements might be derived from reaction or recovery requirements). However, calling out detection separately increases the likelihood that appropriate detection techniques are identified and incorporated into the system.
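The condition-versus-event distinction can be made concrete with a two-threshold monitor: one threshold flags an adverse condition in time to react, and a higher one marks the adverse event itself. This is a minimal sketch; the metric (queue depth) and thresholds are illustrative assumptions.

```python
# Sketch: a two-threshold monitor that detects an adverse condition
# (rising queue depth) before it becomes an adverse event (overflow).
# The metric and thresholds are illustrative.
class QueueDepthMonitor:
    def __init__(self, warn_at, fail_at):
        self.warn_at = warn_at   # adverse-condition threshold
        self.fail_at = fail_at   # adverse-event threshold

    def check(self, depth):
        if depth >= self.fail_at:
            return "ADVERSE_EVENT"      # the overflow has occurred
        if depth >= self.warn_at:
            return "ADVERSE_CONDITION"  # react now to prevent the event
        return "NOMINAL"

monitor = QueueDepthMonitor(warn_at=800, fail_at=1000)
```

Specifying detection explicitly, as argued above, is what forces thresholds like these to be chosen and reviewed rather than left implicit in reaction logic.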
React to Adversities
The distinction between reacting to adversity and recovering from adversity is important. As soon as is practical, react by stopping an adversity from harming critical assets. This reaction might occur before full or partial recovery is possible. Thus, a system should include resilience techniques that react to adversities to minimize the duration and scope of the disruptions they can cause.
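A circuit breaker is one widely used reaction technique: once a component has failed repeatedly, further calls to it are suppressed so the adversity stops spreading, before any recovery is attempted. The sketch below is deliberately minimal (no half-open state or timeout, which production breakers typically add).

```python
# Sketch: a minimal circuit breaker. Opening the breaker is a reaction
# that stops further harm; recovery of the failed component comes later.
class CircuitBreaker:
    def __init__(self, failure_threshold=3):
        self.failure_threshold = failure_threshold
        self.failures = 0
        self.open = False

    def call(self, operation):
        if self.open:
            raise RuntimeError("circuit open: calls suppressed")
        try:
            result = operation()
            self.failures = 0  # success resets the failure count
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.open = True  # react: isolate the failing component
            raise
```

Note how the breaker satisfies the principle above: it reacts (isolates the fault) well before the underlying component is repaired or replaced.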
Recover from Harm Caused by Adversities
After a system reacts to an adversity, and no further harm to critical assets occurs, it is important to recover completely (or at least partially) from the harm that disrupted the mission-critical capabilities. Depending on the harm caused and the system's location (e.g., a failed wheel motor on a Martian rover), achieving this objective might not be practical or even possible.
Assume Faulty, Failed, or Compromised Components
All system components eventually fail or become faulty if not replaced or eliminated in a timely manner during system updates. As discussed in my previous blog post on verification and validation, testing can never be exhaustive. It's therefore safe to assume that software (which typically implements the majority of a system's functionality) has a certain level of latent defects that can disrupt the system's capabilities. This lack of component reliability impacts resilience, as well as availability.
Prefer Autonomous Resilience Over Manual Resilience
It is important to limit the duration of any disruption of mission-critical capabilities. For this reason, autonomous resilience is typically preferred over manual resilience due to human reaction times being relatively long compared to the response time of automated resilience techniques. Moreover, it's not always possible to include human detection, reaction, and recovery due to location (e.g., in satellites and planetary probes and rovers).
On the other hand, humans and automated resilience techniques often have different sweet spots, so a combination of automated techniques and human oversight should be used where practical.
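One common way to combine the two sweet spots is automated recovery with escalation: fast automated retries handle the common case, and a human operator is notified only when they fail. This is a sketch under that assumption; `restart_component` and `notify_operator` are hypothetical placeholders for system-specific hooks.

```python
# Sketch: autonomous recovery with escalation to a human operator.
# restart_component and notify_operator are hypothetical hooks.
import time

def recover(restart_component, attempts=3, delay=0.0, notify_operator=print):
    """Try automated recovery; escalate to a human if it keeps failing."""
    for attempt in range(1, attempts + 1):
        if restart_component():  # returns True on successful restart
            return f"recovered automatically on attempt {attempt}"
        time.sleep(delay)  # back off before retrying
    notify_operator("automated recovery failed; manual intervention required")
    return "escalated to operator"
```

The automated path keeps disruption durations short, while the escalation path preserves human judgment for the failures automation cannot handle.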
Balance Layered Defense and Complexity
When it is vital to avoid the disruption of mission-critical capabilities, a layered defense-in-depth is commonly used. Each additional resilience technique increases system complexity, however, and excessive complexity paradoxically decreases resilience. Thus, resilience engineering must balance the number and types of resilience techniques against the complexity that they add to the system's architecture, design, and implementation.
Wrapping Up and Looking Ahead
System resilience is an essential quality attribute of most systems, especially those that are safety-, security-, and business-critical. My goal in this series of posts has been to emphasize the importance of system resilience and provide the reader with a relatively complete overview of the topic.
In this series of seven posts, I provided an overview of what system resilience is, how it relates to other quality attributes, and the impact it has on requirements, architecture, and verification (especially testing). Starting with a conceptual model, this series has identified different types of resilience-related requirements, numerous architecture and design techniques for increasing resilience, and associated testing techniques for verifying the degree to which the system adequately handles adversity.
It is my hope that all system stakeholders will find useful information that will provide practical guidance as they develop the critical systems of today and tomorrow.
On a more personal note, I retired last week after 17 years at the SEI and more than 40 years as a system and software engineer. This will therefore be my final SEI blog post. I hope that you have found my numerous posts useful and enjoyed reading them as much as I have enjoyed writing them. I have gathered these seven system resilience posts in the form of an SEI technical note that will hopefully be published shortly. I will remain active on LinkedIn, and you can contact me there if you have questions about my technical work.
Read previous posts in this series:
- System Resilience Part 1: What Exactly is it?
- System Resilience Part 2: How System Resilience Relates to Other Quality Attributes
- System Resilience Part 3: Engineering System Resilience Requirements
- System Resilience Part 4: Classifying System Resilience Techniques
- System Resilience Part 5: Commonly-Used System Resilience Techniques
- System Resilience Part 6: Verification and Validation