System Resilience Part 3: Engineering System Resilience Requirements
At its most basic level, system resilience is the degree to which a system continues to perform its mission in the face of adversity. While critical to operational continuity, the system's services (capabilities) are only some of the assets the system must protect to continue to perform its mission. The system must detect adversities, react to them, and recover from the harm to critical assets that they cause. System resilience at a deeper level is therefore the degree to which a system rapidly and effectively protects itself and its continuity-related assets from harm caused by adverse events and conditions.
As mentioned in the first post in this series, system resilience can be decomposed into two subtypes: active resilience and passive resilience. Active system resilience requires the system to detect adverse events and conditions, to react accordingly to prevent or minimize any resulting disruption, and to recover from such disruption. Passive system resilience, on the other hand, can be achieved if the system is overengineered to passively avoid disruption (e.g., by having sufficient capacity that excessive loads do not result in lost or degraded capabilities). As indicated in the initial figure in the first blog post in this series, passive resilience passively resists adversities, whereas active resilience actively detects, reacts to, and recovers from them. This distinction can be used to classify resilience requirements into those that specify resistance, detection, reaction, and recovery.
In the second post in this series, I showed how system resilience is closely related to other quality attributes, especially robustness, safety, cybersecurity, anti-tamper, survivability, capacity, longevity, and interoperability. In this third post, I will address the system resilience requirements that drive the selection of the architectural, design, and implementation features (e.g., safeguards, security controls, and resilience-related patterns and idioms) that will achieve the required types and levels of resilience.
System Resilience and Subordinate Quality Attribute Requirements
System resilience requirements specify the degree to which the system shall continue to provide system capabilities in the face of adversities by detecting, reacting to, and recovering from adverse events and conditions. As shown in the following table, the types of adversities that a resilient system must overcome are categorized based on their associated subordinate quality attributes.
Given the close relationship between system resilience and the above subordinate quality attributes, a natural question is how system resilience requirements differ from related quality attribute requirements such as robustness, safety, and cybersecurity requirements. For example, is the following a resilience requirement or a robustness (fault tolerance) requirement:
The system shall continue to provide service X, possibly in degraded mode, when a fault occurs in subsystem Y (adverse event).
Is the following a system resilience requirement or a safety requirement:
The car shall detect black ice (a hazard - adverse condition) and react by modifying traction control to maintain adequate traction to prevent loss of steering control (service) and a resulting traffic accident (harm to critical assets).
Similarly, is the following a resilience requirement or a cybersecurity requirement:
The system shall continue to provide function X when a threat actor achieves unauthorized access beyond the system's outer firewall (adverse event).
Similar requirements can be specified concerning adversities related to anti-tamper (e.g., attempts to remotely access critical program information), survivability (e.g., detection of threats such as enemy radar or missile lock), capacity (e.g., system approaching or exceeding its maximum capacity), and longevity (e.g., system component approaching or exceeding its design life).
What Kind of a Requirement is It?
There are three schools of thought regarding the overlapping of resilience requirements with requirements of other quality attributes:
Approach 1. No Overlap
The figure below illustrates the MITRE approach to system resilience requirements [MITRE 2019]. A requirement is a system resilience requirement (indicated by a gray box) if it specifies that a system shall maintain a specific level of a specific capability when faced with a specific adversity. These specific adversities include adversities related to the subordinate quality attributes (robustness, safety, cybersecurity, etc.). System resilience requirements include requirements that prevent (a.k.a., avoid) adversities and passively resist adversities, as well as requirements that actively detect the existence of adversities, react to detected adversities, and recover from harm caused by adversities. System resilience requirements do not include requirements for subordinate quality attributes unrelated to maintaining specific capabilities in the face of adversities (i.e., the bottom row of the following figure).
Approach 1 - No Overlap Between System Resilience and Subordinate Requirements
This first approach basically states that some robustness requirements are actually resilience requirements rather than robustness requirements, that some safety requirements are really resilience requirements instead of safety requirements, and so forth. In addition to being a source of confusion, this approach also makes it more difficult for specialty (e.g., reliability, safety, and security) engineers to find all requirements relevant to them collocated in the associated quality-attribute-specific sections of the requirements specifications. It also includes prevention (a.k.a., avoidance), which is logically outside of the scope of resilience.
Approach 2. Dual Requirements Types
A second approach is to have requirements that are simultaneously both resilience requirements and subordinate quality attribute requirements. The figure below illustrates this by the gray boxes representing requirements that are both resilience and subordinate quality attribute requirements. Specifically, each resistance, detection, reaction, and recovery requirement represented by a gray box is simultaneously a system resilience requirement and a subordinate quality attribute (such as a robustness, safety, or cybersecurity) requirement.
Approach 2 - Simultaneously Resilience and Subordinate-Quality-Attribute Requirements
This overlapping approach can lead to confusion. It also has the problem of redundant specification, which makes it more difficult to ensure that requirements have unique requirement IDs and to trace requirements to tests (e.g., capacity, robustness, safety, security, and interoperability tests). However, this problem can be somewhat overcome when each requirement is stored only once in a requirements database, each requirement is annotated with the relevant quality attributes, and specialty engineers can filter the requirements based on their areas of expertise and responsibility.
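The store-once, annotate, and filter idea described above can be sketched as follows. This is only a minimal illustration in Python, not a description of any particular requirements tool: the `Requirement` and `RequirementsDatabase` names, the requirement IDs, and the attribute labels are hypothetical, and the requirement texts are abbreviated from the examples earlier in this post.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class Requirement:
    """One requirement, stored once and tagged with every relevant quality attribute."""
    req_id: str                 # hypothetical ID scheme
    text: str
    attributes: FrozenSet[str]  # e.g. {"resilience", "safety"}


class RequirementsDatabase:
    """Toy in-memory store that enforces unique requirement IDs."""

    def __init__(self) -> None:
        self._by_id: Dict[str, Requirement] = {}

    def add(self, req: Requirement) -> None:
        # Reject a second requirement with the same ID instead of overwriting it.
        if req.req_id in self._by_id:
            raise ValueError(f"duplicate requirement ID: {req.req_id}")
        self._by_id[req.req_id] = req

    def filter_by_attribute(self, attribute: str) -> List[Requirement]:
        """Return the view that one specialty engineer would work from."""
        return [r for r in self._by_id.values() if attribute in r.attributes]


db = RequirementsDatabase()
db.add(Requirement("REQ-001",
                   "Continue to provide service X when a fault occurs in subsystem Y.",
                   frozenset({"resilience", "robustness"})))
db.add(Requirement("REQ-002",
                   "Detect black ice and modify traction control to maintain traction.",
                   frozenset({"resilience", "safety"})))
db.add(Requirement("REQ-003",
                   "Continue to provide function X after access beyond the outer firewall.",
                   frozenset({"resilience", "cybersecurity"})))

safety_view = [r.req_id for r in db.filter_by_attribute("safety")]
resilience_view = [r.req_id for r in db.filter_by_attribute("resilience")]
```

Because each requirement exists only once, a trace from a requirement ID to its tests stays unambiguous even though the requirement appears in several filtered views.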
Approach 3. Derived Requirements
The following figure illustrates the third approach, in which high-level system resilience requirements (gray boxes) are kept separate from lower-level, derived, subordinate-quality-attribute requirements (white boxes). This separation keeps the system resilience requirements at such a high level of abstraction that they ignore quality-attribute-specific adversities. The following is an example of such a system resilience requirement:
The system shall continue to provide mission-critical capability C with key performance parameter KPP with a probability of at least P despite all identified potential adversities.
Once requirements engineers specify the high-level resilience requirements, they derive multiple subordinate-quality-attribute requirements from these resilience requirements. They specify these derived requirements in terms of specific adverse events and conditions related to robustness, safety, cybersecurity, and so on.
Approach 3 - Derived Subordinate-Quality-Attribute Requirements
This third approach clearly and cleanly distinguishes resilience requirements from their subordinate quality attribute requirements, keeps the resilience requirements at the level at which stakeholders are concerned, and supports the derivation of lower-level quality requirements. Note that as with the allocation of performance requirements to multiple architectural components, the resilience KPP level must be decomposed and allocated to the related derived subordinate quality attribute requirements.
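One possible way to picture this allocation is sketched below in Python. It rests on simplifying assumptions that are not part of the approach itself: the capability is taken to survive only if every derived requirement is met, and the derived requirements are taken to fail independently, so the allocated probabilities must multiply back to the top-level probability P. The function name, the attribute names, and the weights are hypothetical.

```python
import math
from typing import Dict


def allocate_probability(top_level_p: float,
                         failure_share: Dict[str, float]) -> Dict[str, float]:
    """Split a top-level probability P across derived requirements.

    Assumption (illustrative only): independent failures, and the capability
    is maintained only if all derived requirements are met. A larger share
    means a larger part of the permitted failure budget, and therefore a
    lower required probability for that derived requirement.
    """
    if not 0.0 < top_level_p <= 1.0:
        raise ValueError("top_level_p must be in (0, 1]")
    if not failure_share or any(s <= 0 for s in failure_share.values()):
        raise ValueError("every failure share must be positive")
    total = sum(failure_share.values())
    # p_i = P ** (w_i / sum(w)), so the product of all p_i equals P.
    return {name: top_level_p ** (share / total)
            for name, share in failure_share.items()}


# Hypothetical example: P = 0.99, with robustness given twice the failure budget.
allocation = allocate_probability(
    0.99, {"robustness": 2.0, "safety": 1.0, "cybersecurity": 1.0})

# Recombining the allocated probabilities gives back P (up to rounding).
combined = math.prod(allocation.values())
```

If adversities share a common cause, the independence assumption does not hold and a different allocation scheme would be needed.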
Requirements Engineering Process
I recommend using the following process (based on the third approach above) to first develop high-level resilience requirements and then to derive corresponding lower-level, adversity-specific quality attribute requirements:
1. Identify Assets at Risk. What are the critical system capabilities (i.e., services) that must continue to be delivered despite adverse conditions or events? If the system architecture is known, what critical system components (such as subsystems, hardware, and software) are needed to provide these capabilities and services? Similarly, what critical system data must be protected in terms of availability, confidentiality, and integrity? Are there any system-external assets on which the system services depend and for which the system is responsible?
2. Determine Potential Harm. What kind of harm can adversities cause to these critical assets that would result in a loss or degradation of critical system capabilities or services?
3. Determine Maximum Acceptable Harm. What are the maximum acceptable amounts of harm that adversities can cause to assets needed for maintaining critical capabilities and services? For capabilities and services, consider setting the following types of limits:
- Maximum acceptable harm to service/capability-related assets:
  - maximum acceptable level of degradation of services/capabilities
  - minimum acceptable availability of service/capability during adversity and prior to recovery
  - minimum acceptable reliability of service/capability during adversity and prior to recovery
  - maximum acceptable harm to assets required for delivery of service/capability
- Maximum acceptable duration of harm to service/capability-related assets (see following diagram):
  - maximum acceptable adverse condition/event detection time (required to detect adversity)
  - maximum acceptable reaction start time (time between detection and reaction to prevent further harm)
  - maximum acceptable reaction duration time (time required to complete reaction to stop further harm)
  - maximum acceptable recovery duration time (time between completion of reaction and completion of recovery)
  - maximum acceptable duration of loss/degradation of services/capabilities
Capability and service degradation might be measured in terms of decreased performance and might depend on the mode of operation (such as operational, training, exercise, maintenance, and update).
4. Prioritize Assets and Harm. Sufficient resources are rarely available to ensure adequate resilience under all credible circumstances. Analysis must therefore be limited based on the prioritization of the assets and associated harm.
5. Develop Associated Resilience Requirements. Based on the maximum acceptable harm to critical assets, develop high-level system resilience requirements. Example templates of high-level resilience requirements (with optional clauses enclosed in brackets) include:
- Top-level resilience requirements:
  - [While in mode M], the system shall continue to provide capability C [with a maximum degradation of D] with an availability of A during the duration of adverse conditions with a probability of at least P.
  - [While in mode M], the system shall continue to provide capability C [with a maximum degradation of D] with an availability of A following the occurrence of adverse events with a probability of at least P.
  - [While in mode M], the system shall continue to provide capability C [with a maximum degradation of D] with a reliability of R during adverse conditions with a probability of at least P.
- Detection resilience requirements:
  - [While in mode M], the system shall detect P% of adverse events and conditions within S seconds/milliseconds.
  - [While in mode M], the system shall detect loss/degradation of capability C within S seconds/milliseconds.
- Reaction resilience requirements:
  - On detecting an adverse condition [while in mode M], the system shall limit further loss or degradation of capability C within S seconds/milliseconds.
- Recovery resilience requirements:
  - [While in mode M], the system shall recover capability C within S seconds/milliseconds of detecting adverse events and conditions with a probability of at least P.
6. Determine Relevant Adversities. What types of adverse conditions and events can cause an unacceptable loss or significant degradation of critical services and capabilities? For each subordinate quality attribute, consider the associated types of adverse events and conditions that can harm resiliency-related critical assets.
7. Prioritize Credible Adversities. Because one rarely has sufficient resources to ensure adequate resilience under all credible circumstances, requirements analysis must be limited based on the prioritization of the harm-causing adversities.
8. Derive Associated Quality Attribute Requirements. Use the prioritized adversities to derive adversity-specific subordinate quality attribute requirements from the resilience requirements for each appropriate quality attribute.
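The duration limits listed under step 3 interact: if detection, reaction start, reaction, and recovery happen one after another, their limits together should fit within the maximum acceptable duration of the loss or degradation. The Python sketch below checks that consistency. The `DurationLimits` class, its field names, and the numbers are illustrative assumptions, as is the premise that the phases never overlap and that the disruption begins when the adversity begins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DurationLimits:
    """Maximum acceptable durations from step 3, all in seconds (hypothetical values)."""
    detection_s: float
    reaction_start_s: float
    reaction_duration_s: float
    recovery_duration_s: float
    max_disruption_s: float

    def worst_case_total_s(self) -> float:
        # Assumes the four phases run strictly in sequence with no overlap.
        return (self.detection_s + self.reaction_start_s
                + self.reaction_duration_s + self.recovery_duration_s)

    def slack_s(self) -> float:
        # Positive slack means the phase limits fit inside the overall limit.
        return self.max_disruption_s - self.worst_case_total_s()

    def is_consistent(self) -> bool:
        return self.slack_s() >= 0


limits = DurationLimits(detection_s=5, reaction_start_s=1,
                        reaction_duration_s=14, recovery_duration_s=90,
                        max_disruption_s=120)
print(limits.worst_case_total_s(), limits.slack_s(), limits.is_consistent())
```

A negative slack would indicate that the individual limits, even if each is met, cannot guarantee the overall disruption limit, so one of them needs to be tightened before the requirements are finalized.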
Top-level system resilience requirements can be used to derive component- and data-level resilience requirements, as well as to derive subordinate quality attribute requirements. The following are examples of resilience requirements and related derived requirements:
- System resilience requirements:
  - Detection requirement: The system shall detect a major disruption (i.e., total loss or degradation in excess of 30 percent) of capability C within 5 seconds.
  - Reaction requirement: Upon detection of a degradation of more than 15 percent of capability C, the system shall take sufficient steps to ensure that the degradation of capability C does not exceed 30 percent.
  - Recovery requirement: On detecting a major disruption of capability C, the system shall autonomously restore full capability within 2 minutes with a probability of 99.5 percent.
- Component X resilience requirements:
  - Detection requirement: The system shall detect a failure of subsystem X that is required to provide capability Y within 500 milliseconds.
  - Reaction requirement: Upon detection of a failure of subsystem X that disrupts capability Y, the system shall take sufficient steps to ensure that the failure does not degrade capability Y by more than 30 percent.
  - Recovery requirement: The system shall autonomously restore subsystem X within 1 minute of detecting a failure that causes a major disruption of capability Y.
- Subordinate quality requirements:
  - Reaction capacity requirement: Upon detection of an excessive load that has decreased the throughput of transactions T below 20,000 per second, the system shall increase the number of servers to ensure that the throughput does not fall below 18,000 per second.
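To show how thresholds like those in the system resilience examples above might be made checkable, here is a small Python sketch. The constants come from the example requirements (15 percent, 30 percent, 2 minutes, 99.5 percent); the function names, the state labels, and the idea of judging compliance from a list of observed recovery times are assumptions made for illustration only.

```python
from typing import Sequence

REACTION_THRESHOLD_PCT = 15.0  # reaction requirement: "more than 15 percent"
MAJOR_DISRUPTION_PCT = 30.0    # detection requirement: "in excess of 30 percent"
RECOVERY_LIMIT_S = 120.0       # recovery requirement: "within 2 minutes"
RECOVERY_PROBABILITY = 0.995   # recovery requirement: "99.5 percent"


def classify_degradation(degradation_pct: float) -> str:
    """Map a measured degradation of capability C to a (hypothetical) state label."""
    if not 0.0 <= degradation_pct <= 100.0:
        raise ValueError("degradation must be between 0 and 100 percent")
    if degradation_pct > MAJOR_DISRUPTION_PCT:
        return "major_disruption"   # includes total loss (100 percent)
    if degradation_pct > REACTION_THRESHOLD_PCT:
        return "reaction_required"
    return "nominal"


def meets_recovery_requirement(recovery_times_s: Sequence[float]) -> bool:
    """Check whether the observed fraction of timely recoveries reaches 99.5 percent."""
    if not recovery_times_s:
        raise ValueError("at least one observed recovery time is required")
    within_limit = sum(1 for t in recovery_times_s if t <= RECOVERY_LIMIT_S)
    return within_limit / len(recovery_times_s) >= RECOVERY_PROBABILITY


print(classify_degradation(20.0))
print(meets_recovery_requirement([60.0] * 200))
```

The observed fraction is only a point estimate. Demonstrating a probability of 99.5 percent with statistical confidence would require considerably more trials and a proper analysis.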
Many different credible adverse events and conditions can disrupt the same critical system capability. Some of these adversities are independent of each other and are of such low probability that one can reasonably overlook the possibility that these adversities happen simultaneously. On the other hand, it is possible for other adversities to have a common cause or to have a sufficiently high probability that simultaneous occurrences must be considered.
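A rough screening rule for this decision can be sketched in Python. For independent adversities, the probability that both occur is the product of their individual probabilities. For adversities that may share a common cause, the sketch conservatively uses the smaller of the two probabilities, which is the largest value the joint probability can take. The function names, the example probabilities, and the negligibility threshold are all hypothetical.

```python
def joint_probability_independent(p_a: float, p_b: float) -> float:
    """P(A and B) when A and B are independent."""
    return p_a * p_b


def joint_probability_upper_bound(p_a: float, p_b: float) -> float:
    """Largest possible P(A and B); used here when a common cause is suspected."""
    return min(p_a, p_b)


def must_consider_simultaneous(p_a: float, p_b: float, independent: bool,
                               negligible_below: float) -> bool:
    """Screen a pair of adversities for simultaneous-occurrence analysis."""
    for p in (p_a, p_b):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must be in [0, 1]")
    if independent:
        joint = joint_probability_independent(p_a, p_b)
    else:
        joint = joint_probability_upper_bound(p_a, p_b)
    return joint >= negligible_below


# Hypothetical per-mission probabilities and threshold.
independent_pair = must_consider_simultaneous(1e-4, 1e-3, True, 1e-6)
common_cause_pair = must_consider_simultaneous(1e-4, 1e-3, False, 1e-6)
print(independent_pair, common_cause_pair)
```

With these illustrative numbers, the independent pair falls below the threshold and can reasonably be overlooked, while the pair with a suspected common cause cannot.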
Wrapping Up and Looking Ahead
This post has addressed the different types of system resilience requirements and how they relate to derived subordinate quality attribute requirements, and it has provided a generic process for developing resilience requirements and deriving the associated subordinate quality attribute requirements. The fourth post in this blog series will cover resiliency features (e.g., robustness patterns, safeguards, and security controls) that support the detection of, reaction to, and recovery from adverse events and conditions.
Read the first post in this series, System Resilience: What Exactly is it?
Read the second post in this series, System Resilience Part 2: How System Resilience Relates to Other Quality Attributes.