Certifiable Distributed Runtime Assurance in Cyber-Physical Systems

Runtime assurance (RA) has become a promising technique for ensuring the safe behavior of autonomous systems (such as drones or self-driving vehicles) whose behavior cannot be fully determined at design time. The Department of Defense (DoD) is increasingly focusing on the use of complex, non-deterministic systems to address rising software complexity and the use of machine learning techniques. In this environment, assuring software correctness has become a major challenge, especially in uncertain and contested environments. This post highlights work by a team of SEI researchers to create tools and techniques that will ensure the safety of distributed cyber-physical systems.

The Foundation of Our work: Verification of Cyber Physical Systems

Software assurance is a major challenge to achieving trusted autonomy affordably and rapidly. RA is a promising approach for capability with confidence in autonomous systems with complex components that cannot be completely verified statically. RA relies on the use of an enforcer to monitor the target system and correct (i.e., preserve the safety of) a system's behavior at runtime. Current RA techniques, however, are restricted to single domains (i.e., types of analyses) that rely on unverified assumptions and potentially inconsistent enforcers. These enforcers can be circumvented easily and require system-wide reverification when an enforcer is changed, added, or removed.

The SEI team has deep expertise in functional and timing verification of distributed real-time systems and exposure to DoD programs developing autonomous systems. The complexities of verification and validation of cyber physical systems are exacerbated by components with unpredictable (e.g., machine learning) behaviors that are essential for mission-critical capability. In particular, when verifying cyber physical systems, including autonomous systems, two challenges must be addressed:

The first challenge applies to all cyber-physical systems. The verification of cyber physical systems involves very complex software. Applying traditional methods of verification, such as model checking and theorem proving, are not scalable.
The second challenge applies to autonomous systems in particular. Autonomous systems have components that learn via machine learning techniques. Machine learning creates a moving target for verification because the system's behavior is continually subject to and modified by the learning behavior. Consequently, the system's behavior is unknown.

With machine learning techniques in play, the system will make changes without human intervention based on what it learns. That adaptation, however, cannot be verified because the verification step already occurred prior to system deployment.

As an autonomous system is executing, learning, and changing behavior, it is impossible to know if the changing behavior will be correct or not. This condition makes verification even more complex, which makes it hard to know whether the system is scalable or not.

Our Approach: Certifiable Distributed Runtime Assurance (CDRA)

Our CDRA approach allows us to verify safety properties of a cyber-physical system (e.g., prevent a drone from crashing) without verifying the whole system. Instead, we use a two-step strategy:

In the first step we add enforcers to the system to monitor its behavior (e.g., drone movements) and if the behavior leads to an unsafe state (crash) then the monitor modifies the behavior (changes movements) to keep it safe.
In the second step we formally verify the enforcers, which are small pieces of code where formal verification can be readily applied. This verification step ensures that the enforcers will always be able to (1) detect incorrect behavior from the unverified parts of the system and that they will (2) correct such behavior to preserve the safety property. The end result is a system that is guaranteed to preserve this safety property.

From the logical point of view, the monitoring simply involves a piece of code observing system outputs and making a judgment about whether the output will transition the system into an unsafe state (e.g., a self-driving vehicle that crashes into a wall or a drone that flies outside of a virtual fence). After the code observes outputs of the system, it then replaces that output with one that is safe. In the case of a self-driving vehicle that may collide with a wall, the output could involve a braking action.

We can now think of the system as split into two pieces. The piece that has not been verified is the original system. We refer to the original system with the monitoring code as an enforcer, because it not only monitors that output but also replaces the unsafe output with a safe output.

The goal of adding this enforcement is to verify the behavior of a system. The output that is received cannot define what the output will be. Put another way, the original system can produce any output that it wants, but our efforts have the goal of verifying that the system with the enforcer will always be able to identify wrong outputs and replace them with a correct, safe output.

To accomplish this goal, our team built a verification model to verify that the enforcer will be able to identify and replace the wrong outputs. To continue with our drone example, imagine that the drone is flying within an area that we call a virtual fence. If the drone flies outside of the virtual fence, the enforcer will replace that with a system that keeps it within the virtual fence.

The Periodic Enforcer

Continuing with our drone example, the physical process of flying a drone involves inertia. If the drone is flying in one direction, and the software does not take any action to change its state, inertia will cause the drone to continue to fly in the same direction. The monitoring of the cyber-physical system must therefore happen in a timely fashion (e.g., the system needs to constantly check whether it is getting close to the virtual fence). Moreover, the system must check its proximity to the fence frequently enough to ensure that it would have time to react.

The logical enforcer that will be observing the outputs and replacing them if they are not correct is insufficient to ensure the safety of the system given the reaction time requirements of many cyber-physical systems. In particular, if the unverified original system takes too long to produce an output, it may be too late for the logical enforcer to take a corrective action. In the case of the drone, for example, the drone may already be outside the virtual fence when the enforcer decides that the next action is incorrect.

For such scenarios, we have another type of enforcement, called a temporal enforcement, that is applied before the drone flies outside of the fence. The temporal enforcement checks the time and will issue an output that is safe if an operation is taking too long. In the case of the drone, the command might be hover until we decide what to do.

When working with the temporal enforcer, it is important to verify that its reaction time is short enough to prevent inertia from taking the drone to an unsafe state.

The State of Our Research

At this point, we already have the logical model for the logical enforcers, as well as an approach to modify the logical enforcers in the system in general. To implement our approach, we used tools, such as the Satisfiability Modulo Theory (SMT). In particular, we applied the Z3 tool to verify the logical enforcers.

We then enhanced our real-time schedulers to include the temporal enforcers. This step allowed us to take advantage of earlier verification work on Verifying Zero Slack Rate Monotonic Scheduling (ZSRM), which is a type of fixed priority thread-scheduling technique for mixed-criticality systems. ZSRM schedules the threads that execute a specific function periodically, as outlined in a blog post that I co-authored with Sagar Chaki. We took advantage of code that was already verified through our previous work to ensure that the temporal enforcer is completed at the correct time. This approach ensures the following properties

The equations that we develop for ZSRM are properly implemented in C as verified by the Verifying Software with Timers and Clocks (STAC) program. With ZSRM, we not only have implementation, but we have equations to ensure that the timing was correct.
A piece of code is created (the temporal enforcer) that will run and create a timer that calls a function to execute the action. For example, the action in the drone example might be to hover.

We are transitioning parts of the logical and temporal enforcers (ZSRM scheduling) into an Air Force Research Laboratory development framework for unmanned systems known as UxAS.

Looking Ahead: A Focus on Critical Properties

Through certifiable distributed RA, we are trying to accelerate the deployment of autonomous systems that are verified for safety with the use of enforcers without requiring verification of the whole system. At this point our work has focused on small systems with simple physical models. Our future work will expand this focus to include more complex systems, e.g., in the style of fixed-wing drones, which require more sophisticated models of the physics.

Our approach going forward is to ensure that, within the hypervisor, key enforcers will ensure that the system will survive even if the rest of the system is compromised.

Additional Resources

Read the SEI Blog Post, Verifying Software with Timers and Clocks.

Read the paper Contract-Based Verification of Timing Enforcers, which was presented at ACM SIGAda's High Integrity Language Technology International Workshop on Model-Based Development and Contract-Based Programming (HILT), October 6-7, 2016, Pittsburgh, PA, USA.

For more information on ZSRM, read the 2009 paper On the Scheduling of Mixed-Criticality Real-Time Tasksets by Dionisio de Niz, Karthik Lakshmanan, and Raj Rajkumar.

Software Engineering Institute

SEI Blog