
System Resilience Part 6: Verification and Validation

Donald Firesmith
• SEI Blog

Adverse events and conditions can disrupt a system, causing it to fail to provide essential capabilities and services. As I outlined in previous posts in this series, resilience is an essential quality attribute of most systems because they provide critical capabilities and services that must continue despite the inevitable adversities.

  • In the first post in this series, I defined system resilience as the degree to which a system rapidly and effectively protects its critical capabilities from harm caused by adverse events and conditions.
  • The second post identified the following eight subordinate quality attributes that categorize the adversities that can disrupt the critical system: robustness, safety, cybersecurity, anti-tamper, survivability, capacity, longevity, and interoperability.
  • The third post covered the engineering of system resilience requirements and how they can be used to derive related requirements for these subordinate quality attributes.
  • The fourth post presented an ontology for classifying resilience techniques and clarified the relationships between resilience requirements and resilience techniques.
  • The fifth post in the series presented a relatively comprehensive list of resilience techniques, annotated with the resilience function (i.e., resistance, detection, reaction, and recovery) that they perform.

This sixth post addresses verifying whether the architecture, design, or implementation of a system meets its resilience requirements, as well as its subordinate requirements for robustness, safety, cybersecurity, anti-tamper, survivability, capacity, longevity, and interoperability. These eight quality attributes involve different types of adversities, so they often have attribute-specific verification techniques that evaluate the extent to which a system's multiple resilience techniques enable it to (1) passively resist, (2) actively detect, (3) actively react to, and (4) actively recover from adverse conditions and events, as well as the disruptions they cause.

Verification is typically performed in one of four ways:

  • Inspection involves the visual examination of the system, its component subsystems, and its engineering documentation. It may be relatively informal, based solely on the experience of the stakeholder/expert performing the inspection. It may also involve use of various devices to take measurements and the use of various technologies (e.g., ultrasonic, x-ray, and microscopic) to augment the naked eye.
  • Analysis involves the use of modeling, analytical techniques, and calculations based on data and other evidence to verify whether an architecture or design will meet requirements once implemented. Analysis can be performed manually or using analysis tools. Analysis can also be dynamic or static (i.e., with or without the execution of a model or simulation). Analysis may also be qualitative or quantitative.
  • Demonstration involves executing the implemented system to show that it meets its requirements. A demonstration may be of a prototype or various (possibly partial) versions of the system or its subsystems.
  • Testing involves executing the implemented system or its component subsystems with known inputs and under known preconditions to determine whether its behavior (i.e., outputs and postconditions) meets its required/expected behavior. A major difference between demonstration and testing is that demonstration attempts to show a system works correctly, whereas testing attempts to identify cases where it does not.

Inspection

Inspection by visual examination includes audits, desk-checking, inspection, walkthroughs, and both formal and informal technical reviews of the system and its relevant technical documentation. Specifically, inspection concentrates on the resilience techniques. The stakeholders/experts performing the inspections determine the degree to which the system's resilience techniques enable it to overcome adversities and any associated disruptions of its essential capabilities. Inspections are relevant to all eight of resilience's subordinate quality attributes. The system and its architecture and design can be inspected (audited) to evaluate whether the selected resilience techniques have actually been incorporated into the system, properly implemented, and, where appropriate, properly configured.

Analysis

Analysis is used to identify potential adversities and estimate the ability of the system to passively resist and actively detect, react to, and recover from adversities. Multiple analysis techniques can be applied to analyze a system's resilience and subordinate quality attributes:

Demonstration

Demonstration can informally show stakeholders that the implemented system resilience techniques work correctly to improve the system's resilience to specific adversities so that the system continues to provide critical capabilities/services without unacceptable disturbances. Demonstration is a relatively weak verification method because it is typically used to demonstrate correct functioning under limited, often nominal circumstances. It is seldom used in practice to demonstrate correct functioning during off-nominal circumstances (e.g., during adversities). Testing is typically a more powerful verification technique because it concentrates on stressing the system under test in an attempt to uncover defects, weaknesses, and vulnerabilities that can lead to failures such as disruption of critical services due to adversities.

Testing

System resilience testing is not about verifying whether the system operates properly under normal circumstances. Instead, this type of test is primarily performed to verify whether the system is actually resilient when faced with adversities. In resilience testing, testers cause adversities to determine how the system reacts.
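The idea of causing an adversity and observing the reaction can be sketched in a few lines of Python. This is a minimal illustration, not a real test harness: `FlakyService`, `resilient_read`, and `run_resilience_test` are hypothetical names, the injected adversity is a simulated connection failure, and the resilience technique under test is assumed to be a simple retry.

```python
class FlakyService:
    """Hypothetical dependency that fails on demand (the injected adversity)."""

    def __init__(self):
        self.fail_next = 0  # number of upcoming calls that should fail

    def read(self):
        if self.fail_next > 0:
            self.fail_next -= 1
            raise ConnectionError("injected fault")
        return 42


def resilient_read(service, retries=3):
    """System under test: retries a failed read (an assumed reaction technique)."""
    last_error = None
    for _ in range(retries):
        try:
            return service.read()
        except ConnectionError as err:
            last_error = err
    raise last_error


def run_resilience_test(injected_failures, retries=3):
    """Cause the adversity, then report whether the essential service survived."""
    service = FlakyService()
    service.fail_next = injected_failures
    try:
        return resilient_read(service, retries) == 42
    except ConnectionError:
        return False
```

Calling `run_resilience_test(2)` injects two consecutive failures, which three attempts can absorb, whereas `run_resilience_test(3)` exhausts the retries and reports a disruption. A test of this kind deliberately looks for the point at which the technique stops working rather than confirming the nominal case.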

System resilience testing primarily falls into three categories:

  • Requirements-based black-box testing. Does the system continue to provide essential services when the tester causes (or simulates) adversities during system operation? Note that this seeks to provoke externally visible failures and does not include uncovering hidden faults that might eventually (e.g., after the test) cause a visible failure. These black-box tests are intended to verify whether the system adequately resists, detects, reacts to, and recovers from subordinate-quality-attribute-specific adverse conditions and events.
  • Architecture-based white-box testing. Are the system's resilience techniques properly implemented and configured? Are these resilience techniques effective (i.e., do they enable the system to continue providing essential capabilities)?
  • Gray-box testing. The following questions can often be answered using black-box testing, white-box testing, or a combination of the two (i.e., gray-box testing). The selection of test cases can depend on understanding how the implementations of the resilience techniques interact with other components, as well as with each other.
      • Do the active resilience techniques detect adversities, react accordingly, and enable the system to recover afterwards? Note that recovery testing is a specialized form of resilience testing that verifies whether recovery has been successful.
      • When multiple techniques must work together to provide resilience (e.g., redundancy and voting), do they properly work together as intended to increase resilience?
      • When multiple layers of resilience techniques are incorporated to achieve defense-in-depth, do the subsequent techniques take over when the initial techniques fail to achieve their desired results?
      • Do the resilience techniques that should be effective against multiple types of adversities effectively prevent or mitigate disruption when challenged by each type of adversity?
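Two of the questions above, whether redundancy and voting work together and whether a later layer takes over when an earlier one fails, can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the replica functions, the corrupted values, and the choice of a fixed safe default as the second layer are hypothetical and not taken from any particular system.

```python
from collections import Counter


def majority_vote(readings):
    """Return the value reported by a strict majority of replicas, else None."""
    value, count = Counter(readings).most_common(1)[0]
    return value if count > len(readings) / 2 else None


def read_with_fallback(replicas, safe_default):
    """Layer 1: redundancy plus voting. Layer 2: fall back to a safe default."""
    voted = majority_vote([replica() for replica in replicas])
    return voted if voted is not None else safe_default


def healthy():
    return 100


def faulty():
    return -1  # injected fault: a replica returning a corrupted value


def also_faulty():
    return -2  # a second, different corruption
```

With one faulty replica out of three, `read_with_fallback([healthy, healthy, faulty], 0)` still yields the healthy value because voting masks the fault. With two faulty replicas, no majority exists, so the second layer takes over and the safe default is returned. A gray-box test case chosen this way depends on knowing that voting sits in front of the fallback.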

The following list can be used to select the types of testing to employ to verify system resilience in terms of the eight subordinate quality attributes associated with the relevant types of adversities being considered:

Challenges in Verifying System Resilience

The following are some key challenges that system and software engineers face when verifying a system's resilience:

  • Requirements. System resilience deals with adversities and disruptions that might be relatively rare. Therefore, the relevant quality attribute requirements that should drive testing might be overlooked and therefore not exist. System resilience-related requirements might be scattered across multiple requirements specifications and multiple sections within a requirements specification, making them hard to identify, especially on large complex projects with numerous requirements.
  • Architecture and design. Resilience testers require access to relevant architectural documentation that may be scattered across documents, diagrams, and models. The relevant documentation identifying resilience techniques might not exist.
  • Required expertise. Verifying system resilience requires access to specialty (e.g., robustness, safety, cybersecurity, and survivability) engineering expertise, which might be hard to obtain. It also requires both hardware and software expertise because (1) adversities can cause both hardware and software failures and (2) many resilience techniques involve a combination of hardware and software.
  • Specialized types of testing. Verifying system resilience requires specialized forms of testing, including (but not limited to) penetration testing for cyber-attacks, stress testing for excessive loads (capacity), and accelerated life testing for excessive life span (longevity).
  • Difficulty due to adversities. Adverse conditions and events are often hard to cause or simulate.
  • Inadequate emphasis. There is a natural tendency for testing to concentrate on verifying nominal operations, leaving little time for testing comparatively rare adverse conditions and events.
  • Timing of the adversities. Disruptions may result at different times: when the adverse condition begins, during the adverse condition, and when the adverse condition causes an adverse event. With hard real-time systems, testers must be aware of timing issues (e.g., scheduling of actions) and determine the degree to which the time at which an adversity occurs negatively impacts resilience.
  • Operational system. It might not be possible to test on the actual operational system because resilience tests attempt to disrupt essential capabilities, which might cause major or catastrophic negative results. Resilience testing that is practical in a cloud environment with massive amounts of reserve capacity might be totally impractical in embedded systems with far more limited resilience.
  • Many-to-many mappings. Different adversities of different types (i.e., different subordinate quality attributes) can cause the same type of disruption, and a single type of adversity can cause multiple types of disruptions.
  • Cascading adversities and disruptions. Adversities can cause a cascading network of multiple faults and failures, whereby this cascading can involve multiple subordinate quality attributes. For example, a cyberattack (security) can result in a hazard that leads to an accident (safety), which could cause faults and failures (robustness). To identify the true scope of the disruptions, testers need to allow faults and failures to play out rather than stop the test at the first sign of a problem.
  • Safety. Due to the danger associated with hazards and accidents, safety testing typically does not involve the actual system's hardware. Instead, the hardware and external environment are typically simulated. Any deviation between the behavior of these simulations and that of the actual hardware and environment can cause tests to yield false-positive and false-negative results.

Benefits of System Resilience Verification

  • Resilience verification identifies problems, which when fixed increase a system's resilience to adversities attempting to disrupt critical capabilities.
  • The defects and weaknesses uncovered by the verification of a system's resilience, especially those found by testing, increase awareness across the organization of the need to focus on resiliency.
  • Resilience testing of the production system ensures that no one becomes complacent. This, in turn, encourages building resilience into the system via requirements and architecture (resilience techniques).
  • During verification, the date and time of a potential disruption are predetermined so that proper preparations can be made and personnel are ready to immediately jump in and fix the problems when they occur, thereby minimizing the impact of any resulting disruption.
  • Because adversities are likely rare events, the software to handle them is likely to have lower quality than the software that provides essential functionality under nominal conditions. It is therefore likely to hide more defects than average. Resilience testing concentrates on where requirements, architecture, design, and implementation defects tend to congregate.

Recommendations for Improving the Verification of System Resilience

  • All development/test teams should use the same tools for injecting adversities and for monitoring/logging service disruptions.
  • For 24x7 systems, verify whether detection, reaction, and recovery are automatic.
  • Focus monitoring on the system data before, during, and after any adversity that might cause a disruption.
  • Where practical, perform system resilience testing on the production system as it will force developers to pay attention to resilience during requirements, architecture, design, implementation, and testing.
  • Plot response times during adverse conditions and events.
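The recommendations to monitor system data before, during, and after an adversity and to plot response times can be supported by a small amount of analysis code. The following Python sketch groups logged response times by phase; the function name `summarize_by_phase`, the sample log, and the adversity window are all hypothetical, and the resulting per-phase figures could be passed to whatever plotting tool a team already uses.

```python
from statistics import mean


def summarize_by_phase(samples, adversity_start, adversity_end):
    """Group (timestamp, response_time) samples into before/during/after phases.

    Returns the mean response time per phase, or None for a phase with no samples.
    """
    phases = {"before": [], "during": [], "after": []}
    for timestamp, response_time in samples:
        if timestamp < adversity_start:
            phases["before"].append(response_time)
        elif timestamp <= adversity_end:
            phases["during"].append(response_time)
        else:
            phases["after"].append(response_time)
    return {name: (mean(values) if values else None) for name, values in phases.items()}


# Hypothetical log: (seconds since test start, response time in milliseconds).
samples = [(0, 20), (1, 22), (2, 90), (3, 110), (4, 30), (5, 24)]
summary = summarize_by_phase(samples, adversity_start=2, adversity_end=3)
```

Comparing the "during" and "after" figures against the "before" baseline gives a rough indication of how badly the adversity degraded service and whether the system recovered to its earlier level.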

Wrapping Up and Looking Ahead

This post has provided various techniques for verifying the degree to which a system adequately handles adversity and thereby resists, detects, reacts to, and recovers from the disruption of its critical capabilities. The seventh and final post in this series will summarize the key takeaways of this series in the form of 15 guiding principles for engineering resilient systems.

Additional Resources and References

Read previous posts in this series:

[Armstrong 2015] Mark Armstrong, "Chaos Monkey and Resilience Testing - Insights from the professionals," IBM blog post, 10 December 2015. https://www.ibm.com/cloud/blog/resilience-testing-insights-from-the-pros

[Izrailevsky and Tseitlin 2011] Yury Izrailevsky and Ariel Tseitlin, "The Netflix Simian Army," techblog.netflix.com, 2011. http://techblog.netflix.com/2011/07/netflix-simian-army.html

[Martins 2017] Ricardo Martins, "Resilience testing: breaking software for added reliability," Talkdesk Engineering, 5 September 2017. https://engineering.talkdesk.com/resilience-testing-breaking-software-for-added-reliability-7f1e60207d06

[Mooney 2013] Gregory Mooney, "Test Monkeys and a Method to Madness," Smartbear Blog, 8 May 2013. http://blog.smartbear.com/testcomplete/test-monkeys-and-a-method-to-madness/

[Poonam 2018] Poonam, "Explaining the Term 'Resilience Test,'" TestOrigen, 6 July 2018. https://www.testorigen.com/explaining-the-term-resilience-testing/

[Will] Scott Will, "Improve application resiliency with chaotic testing," IBM. https://www.ibm.com/garage/method/practices/manage/practice_chaotic_testing

[Vogels] Rebecca Vogels, "A Guide to Software Resilience Testing," UserSnap. https://usersnap.com/blog/resilience-testing/

About the Author