
System Resilience Part 4: Classifying System Resilience Techniques

Headshot of Donald Firesmith

A system resilience technique is any architectural, design, or implementation technique that increases a system's resilience. These techniques (e.g., mitigations such as redundancy, safeguards, and cybersecurity countermeasures) passively resist adversities, actively detect them, react to them, or recover from the harm they cause. System resilience techniques are the means by which a system implements its resilience requirements. Resilience techniques can also be viewed as architecture, design, or implementation patterns or idioms. This post begins by clarifying the relationships between resilience requirements and resilience techniques. Because system, software, and specialty engineers have many techniques available for increasing a system's resilience, this post also presents an ontology for categorizing these resilience techniques.

System Resilience: A Brief Recap

As I outlined in previous posts in this series, system resilience is important because no one wants a brittle system that cannot overcome inevitable adversity. If adverse events or conditions cause a system to fail to operate appropriately, all manner of harm to valuable assets can result.

In the first post in this series on system resilience, I provided the following more detailed and nuanced definition: A system is resilient to the degree to which it rapidly and effectively protects its critical capabilities from harm caused by adverse events and conditions.

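The second post identified the eight subordinate quality attributes (robustness, safety, security, anti-tamper, survivability, capacity, longevity, and interoperability) that categorize the adversities (i.e., adverse conditions and events) that can disrupt a critical system.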

The third post covered the engineering of system resilience requirements and how they can be used to derive related requirements for subordinate quality attributes.

This fourth post in the series provides a way to classify system resilience techniques and shows how they relate to system resilience requirements.

System Resilience Techniques

A single resilience technique can often protect a mission-critical capability from multiple adversities of multiple types. Conversely, each critical capability can typically be disrupted by a large number of adversities of multiple types. Often there are more adversities than can be properly addressed within limited project resources (e.g., staffing, schedule, and budget). The emphasis is thus first on the critical capabilities that must be protected from disruptive harm. Risk management can then be used to identify, prioritize, and analyze a sufficient number of the most important adversities to provide adequate protection of the mission-critical capabilities.
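The prioritization step can be illustrated with a minimal sketch. It assumes a simple risk score (likelihood times impact) and a greedy selection within a fixed budget. Neither comes from this series, and the adversity names, scales, and costs are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Adversity:
    name: str
    likelihood: float   # assumed scale: 0.0 (never) to 1.0 (certain)
    impact: float       # assumed scale: 0.0 (no harm) to 10.0 (capability lost)
    analysis_cost: int  # assumed effort to analyze and mitigate, in staff-days

    @property
    def risk(self) -> float:
        # Assumed scoring rule; a real project would use its own risk model.
        return self.likelihood * self.impact


def prioritize(adversities, budget):
    """Greedily pick the highest-risk adversities that fit within the budget."""
    selected = []
    remaining = budget
    for adversity in sorted(adversities, key=lambda a: a.risk, reverse=True):
        if adversity.analysis_cost <= remaining:
            selected.append(adversity)
            remaining -= adversity.analysis_cost
    return selected


# Hypothetical adversities threatening one mission-critical capability.
candidates = [
    Adversity("power loss", likelihood=0.30, impact=9.0, analysis_cost=20),
    Adversity("sensor failure", likelihood=0.50, impact=6.0, analysis_cost=15),
    Adversity("denial-of-service attack", likelihood=0.20, impact=8.0, analysis_cost=30),
    Adversity("operator error", likelihood=0.40, impact=3.0, analysis_cost=10),
]
chosen = prioritize(candidates, budget=50)
print([a.name for a in chosen])
```

With a budget of 50 staff-days, the denial-of-service attack is skipped because it no longer fits once the two highest-risk adversities have been selected. A greedy pass like this is only a starting point for discussion, not a substitute for proper risk analysis.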

As shown in the following figure, resilience requirements do not directly drive the selection of resilience techniques. Instead, critical capabilities, the critical assets that implement them, and the disruptive harm that can occur to them drive the engineering of resilience requirements. Specific adversities are then used to derive requirements for the subordinate resilience-related quality attributes (i.e., robustness, safety, security, anti-tamper, survivability, capacity, longevity, and interoperability). Architects and specialty engineers select appropriate resilience techniques to implement these adversity-specific derived requirements directly and thereby implement the resilience requirements indirectly.

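[Figure: critical capabilities, critical assets, and disruptive harm drive resilience requirements; adversity-specific derived requirements drive the selection of resilience techniques]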
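One way to picture this chain is as a small traceability model. The sketch below is illustrative only: the class names, the example capability, and the pairing of adversities with techniques are assumptions, not part of the original figure.

```python
from dataclasses import dataclass, field


@dataclass
class DerivedRequirement:
    """An adversity-specific requirement for one subordinate quality attribute."""
    quality_attribute: str   # e.g., "robustness", "safety", "security"
    adversity: str
    techniques: list = field(default_factory=list)  # directly implement this requirement


@dataclass
class ResilienceRequirement:
    """A resilience requirement protecting one critical capability."""
    critical_capability: str
    derived: list = field(default_factory=list)

    def techniques(self):
        """Techniques that indirectly implement this requirement, without duplicates."""
        found = []
        for requirement in self.derived:
            for technique in requirement.techniques:
                if technique not in found:
                    found.append(technique)
        return found


# Hypothetical example: the capability and adversities are made up.
requirement = ResilienceRequirement(
    critical_capability="report vehicle position",
    derived=[
        DerivedRequirement("robustness", "sensor failure",
                           ["redundant sensors", "voting scheme"]),
        DerivedRequirement("security", "corrupted position data",
                           ["checksums", "voting scheme"]),
    ],
)
print(requirement.techniques())
```

The resilience requirement never names a technique itself. Techniques are reachable only through the derived requirements, which mirrors the indirect relationship described above.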

Many resilience techniques increase multiple quality attributes in addition to resilience and its subordinate quality attributes. For example, redundancy can also improve availability and reliability, while modularity can also improve maintainability.

Resilience techniques are abstract and must be implemented in the system to achieve their intended effect. If a technique is poorly chosen or improperly implemented, however, the result might differ from what was intended or might even decrease the system's resilience. Resilience techniques are therefore not always "best practices," and adding more techniques is not necessarily better. Considerable expertise, analysis, and testing are required to ensure that the selected techniques, as implemented, achieve the system's resilience requirements without causing the system to fail to meet its other quality attribute requirements.

The following figure shows three different ways to classify resilience techniques. From left to right, they are by:

  • Degree of autonomy (purple). Autonomous resilience techniques automatically execute without human intervention, unlike manual resilience techniques. Hybrid resilience techniques are partially autonomous and partially manual.
  • Resilience functions performed (yellow). Resistance resilience techniques passively resist adversities. Detection resilience techniques actively detect adversities, reaction resilience techniques actively react to detected adversities, and recovery resilience techniques actively repair the harm caused by adversities. Many techniques perform two or more of these functions.
  • Composition (blue). Subsystem resilience techniques are implemented by dedicated subsystems (e.g., fire detection and suppression systems). They may be implemented with hardware (e.g., hardware interlocks and redundant sensors) or software (e.g., various voting schemes). Similarly, data resilience techniques are primarily implemented in data (e.g., checksums), although they typically require software to manipulate the data.
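
[Figure: classification of resilience techniques by degree of autonomy (purple), resilience function (yellow), and composition (blue)]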
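The three classification dimensions can be sketched as a small data model. This is a hypothetical encoding, not the author's notation: the flat four-value `Composition` enum simplifies the figure, and the classification assigned to each example technique is an assumption made for illustration.

```python
from dataclasses import dataclass
from enum import Enum, Flag, auto


class Autonomy(Enum):
    AUTONOMOUS = auto()
    MANUAL = auto()
    HYBRID = auto()


class Function(Flag):
    # A Flag lets one technique combine several functions with "|".
    RESISTANCE = auto()
    DETECTION = auto()
    REACTION = auto()
    RECOVERY = auto()


class Composition(Enum):
    SUBSYSTEM = auto()
    HARDWARE = auto()
    SOFTWARE = auto()
    DATA = auto()


@dataclass(frozen=True)
class Technique:
    name: str
    autonomy: Autonomy
    functions: Function
    composition: Composition

    def performs(self, function: Function) -> bool:
        """True if this technique performs the given resilience function."""
        return bool(self.functions & function)


# Assumed classifications, chosen only to exercise the model.
catalog = [
    Technique("redundant sensors", Autonomy.AUTONOMOUS,
              Function.RESISTANCE, Composition.HARDWARE),
    Technique("fire detection and suppression system", Autonomy.AUTONOMOUS,
              Function.DETECTION | Function.REACTION, Composition.SUBSYSTEM),
    Technique("checksum", Autonomy.AUTONOMOUS,
              Function.DETECTION, Composition.DATA),
]

detectors = [t.name for t in catalog if t.performs(Function.DETECTION)]
print(detectors)
```

Modeling the resilience function as a set of flags rather than a single value reflects the point that many techniques perform more than one function.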

The following figure shows how resilience techniques can be mapped to resilience requirements. Unlike resilience requirements, which should be cohesive and specify only a single, implementation-independent need, individual resilience techniques often perform more than one function. For example, a fire detection and suppression system (FDSS) both detects an adversity (the presence of smoke) and reacts by suppressing the associated fire to minimize additional harm.

[Figure: mapping of resilience techniques to resilience requirements]
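A mapping of this kind can also be used to spot requirements that no selected technique addresses. The following sketch assumes each requirement is reduced to one (adversity, function) pair. The requirement identifiers and the recovery requirement are hypothetical.

```python
# Each requirement specifies a single need: one adversity and one function.
requirements = {
    "REQ-1": ("fire", "detection"),
    "REQ-2": ("fire", "reaction"),
    "REQ-3": ("fire", "recovery"),   # hypothetical requirement with no technique yet
}

# A single technique may provide several (adversity, function) pairs.
techniques = {
    "fire detection and suppression system": {
        ("fire", "detection"),
        ("fire", "reaction"),
    },
}


def map_requirements(requirements, techniques):
    """Return, for each requirement, the sorted names of techniques that address it."""
    return {
        req_id: sorted(name for name, provides in techniques.items() if need in provides)
        for req_id, need in requirements.items()
    }


mapping = map_requirements(requirements, techniques)
uncovered = [req_id for req_id, names in mapping.items() if not names]
print(mapping)
print(uncovered)
```

Here the FDSS satisfies two separate requirements, while the recovery requirement remains uncovered and would need an additional technique.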

Wrapping Up and Looking Ahead

There are clearly many techniques that can be used to implement system resilience requirements. These techniques can be categorized in multiple ways, the two most important of which are by resilience function and by composition (i.e., how they are implemented). This abundance of techniques and types of techniques gives system architects and specialty engineers a great deal of flexibility in ensuring sufficient resilience, especially when a multi-layer "defense-in-depth" approach is used. On the other hand, incorporating resilience techniques increases system complexity and can therefore, paradoxically, make the system less resilient. Selecting the right number, type, and balance of resilience techniques is anything but trivial.

In the fifth post in this series, I will explore a relatively comprehensive list of resilience techniques, as well as provide a table that organizes them by the resilience function they perform and by their composition.

Additional Resources

Read the first post in this series, System Resilience: What Exactly is it?

Read the second post in this series, System Resilience Part 2: How System Resilience Relates to Other Quality Attributes.

Read the third post in this series, System Resilience Part 3: Engineering System Resilience Requirements.

