Toward Safe Optimization of Cyber-Physical Systems
Cyber-physical systems (CPS) are characterized by close interactions between software components and physical processes. These interactions can have life-threatening consequences when they involve safety-critical functions that are not performed according to their time-sensitive requirements. For example, an airbag must fully inflate within 20 milliseconds (its deadline) of an accident to prevent the driver from hitting the steering wheel, with potentially fatal consequences. Unfortunately, the competition between safety-critical requirements and demands to reduce cost, power consumption, and device size creates problems of its own, such as automotive recalls, delays in new aircraft deliveries, and plane accidents.
Our research leverages the fact that failing to meet deadlines doesn't have the same level of criticality for all functions. For instance, if a music player fails to meet its deadlines, the sound quality may be compromised, but lives are not threatened. Systems whose functions have different criticalities are known as mixed-criticality systems. This blog post updates our earlier post to describe the latest results of our research on supporting mixed-criticality operations by giving more central processing unit (CPU) time to functions with higher value while ensuring critical timing guarantees.
During our research, we observed that different functions provide different amounts of utility, or satisfaction, to the user. For instance, a GPS navigation function may provide higher utility than a music player. Moreover, if we give more resources to these functions (for example, more CPU time), the utility obtained from them increases.
In general, however, the amount of utility obtained from additional resources does not grow forever, nor does it grow at a constant rate. The additional increment in utility for each additional unit of resource instead decreases to a point where the next increment in utility is insignificant. In such cases, it is often more important to dedicate additional computational resources to another function that is currently delivering lower utility and will deliver a larger increment in utility for the same amount of CPU time.
For example, assuming that we get a faster route to our destination if more CPU time is dedicated to the GPS functionality, it seems obvious that the first route we get from the GPS will give us the biggest increment in utility. If we lack enough CPU time (due to the execution of other critical functions) to run both the GPS and the music player, we will choose the GPS. We may even prefer to give more CPU time (if we discover that more time is available) to the GPS to help avoid traffic jams before we decide to run the music player. Letting the GPS run even longer to select a less traffic-clogged route, however, may give us less utility than running the music player.
At this point, we may prefer to start running the music player if we have more CPU time available. We thus change our allocation preference because the additional utility obtained by giving the GPS more CPU time is less than the utility obtained by giving the music player this time. This progressive decrease in the utility obtained as we give more resources to a function is known as diminishing returns, which can be used to allocate resources to ensure we obtain the maximum total utility possible considering all functions in the system.
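The greedy allocation this describes can be sketched in a few lines of Python. This is a hypothetical illustration of allocating under diminishing returns, not the project's actual algorithm; the function names and utility numbers are invented for the example.

```python
def allocate(budget, marginal_utils):
    """Greedily assign one CPU quantum at a time to the function whose
    next quantum yields the highest marginal utility.

    marginal_utils maps a function name to the list of utility gains
    from its 1st, 2nd, ... quantum. Under diminishing returns, each
    list is decreasing, which makes this greedy strategy optimal.
    """
    shares = {name: 0 for name in marginal_utils}
    for _ in range(budget):
        # Pick the function with the largest next marginal gain
        # (exhausted functions contribute 0.0).
        best = max(
            marginal_utils,
            key=lambda n: (marginal_utils[n][shares[n]]
                           if shares[n] < len(marginal_utils[n]) else 0.0),
        )
        if shares[best] >= len(marginal_utils[best]):
            break  # no function gains anything from more time
        shares[best] += 1
    return shares

# Hypothetical numbers: the GPS gains a lot from its first quanta,
# the music player gains a steady but smaller amount.
demo = allocate(4, {
    "gps":   [10.0, 6.0, 1.0],
    "music": [3.0, 2.5, 2.0],
})
```

With 4 quanta available, the first two go to the GPS (gains of 10 and 6), but the GPS's third quantum is worth only 1, so the remaining quanta switch to the music player, mirroring the preference change described above.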
Our research uses both the diminishing-returns characteristics of low-criticality functions and criticality levels to implement a double-booking computation-time reservation scheme. Traditional real-time scheduling techniques consider the worst-case execution time (WCET) of the functions to ensure they always complete before their deadlines, reserving CPU time that is used only on the rare occasions when the WCET actually occurs. We take advantage of this fact and allocate that same CPU time to lower-criticality functions as well. When both functions request the double-booked CPU time at the same time, we favor the higher-criticality function and let the lower-criticality function miss its deadline.
Our double-booking scheme is analogous to the strategies airlines use to assign the same seat to more than one person. In that case, the seat is given to the person with preferred status (e.g., "gold" members). Our project uses utility, in addition to criticality, to ensure that double-booked CPU time is given to the functions providing the largest utility in case of a conflict (both functions requesting the double-booked CPU time). Our double-booking scheme provides the following two benefits:
- It protects critical functions by ensuring that their deadlines are always met, and
- It uses the time left unused by the critical functions to run the non-critical functions that produce the highest utility.
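The conflict-resolution step at the heart of the scheme can be sketched as follows. This is a simplified, hypothetical model (the names and millisecond values are invented); the real scheduler makes these decisions with the parameters produced by our design-time analysis.

```python
def resolve(reserved, critical_used, low_candidates):
    """Resolve one double-booked reservation.

    reserved       -- CPU time reserved for the critical function's WCET
    critical_used  -- time the critical function actually consumed
    low_candidates -- (name, needed_time, utility) tuples for the
                      low-criticality functions double-booked on the
                      same reservation, competing for the slack

    The critical function always runs first, so its deadline is safe.
    Leftover time (slack) goes to the highest-utility low-criticality
    candidate that fits; if none fits, the slack is wasted.
    """
    slack = reserved - critical_used
    winner = None
    for name, needed, utility in sorted(
            low_candidates, key=lambda c: c[2], reverse=True):
        if needed <= slack:
            winner = name
            break
    return slack, winner

# Critical task reserved 10 ms of WCET but only used 6 ms: 4 ms of slack.
slack, winner = resolve(10, 6, [("video", 5, 8.0), ("detect", 3, 5.0)])
```

Here "video" has the higher utility but needs 5 ms, more than the 4 ms of slack, so "detect" wins; when the critical function uses its full reservation, no low-criticality function runs and its deadline may be missed, exactly as the scheme intends.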
Our research is aimed at providing real-time system developers with analysis algorithms that accurately predict system behavior at runtime. Developers use these algorithms during the design phase (design time) to test whether critical tasks will meet their deadlines (providing assurance) and to determine how much overbooking is possible.
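To give a flavor of this kind of design-time check, here is one classic (sufficient, not necessary) schedulability test for fixed-priority scheduling, the Liu and Layland utilization bound for rate-monotonic priorities. This is a textbook test shown for illustration only; it is not the specific analysis algorithm our project developed, and the task parameters are hypothetical.

```python
def rm_schedulable(tasks):
    """Liu & Layland utilization-bound test for rate-monotonic
    scheduling: n periodic tasks are guaranteed to meet their
    deadlines if total utilization U = sum(C_i / T_i) is at most
    n * (2**(1/n) - 1), where C_i is the WCET and T_i the period.
    """
    n = len(tasks)
    utilization = sum(wcet / period for wcet, period in tasks)
    return utilization <= n * (2 ** (1 / n) - 1)

# Hypothetical (WCET, period) pairs in milliseconds:
ok = rm_schedulable([(1, 4), (2, 8)])    # U = 0.5, under the ~0.828 bound
not_ok = rm_schedulable([(3, 4), (2, 8)])  # U = 1.0, over the bound
```

A design-time analysis of this kind runs before deployment, so the runtime scheduler only has to enforce decisions that were already proven safe.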
To evaluate the effectiveness of our scheme, we developed a utility degradation resilience (UDR) metric that quantifies the capacity of a CPS to preserve the utility derived from double-booking. This metric evaluates every conflict that double-booking can cause and how much total utility is preserved after each conflict is resolved by deciding which function gets the double-booked CPU time and which functions are left without it. The utility derived from the preserved functions is then summed to compute the total utility that a specific conflict-resolution scheme can preserve.
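The shape of such a metric can be sketched as follows. This is a simplified, hypothetical version that only considers pairwise conflicts and normalizes against an ideal resolver; the actual UDR computation in our work evaluates the full set of conflicts a schedule can produce.

```python
from itertools import combinations

def udr(functions, resolver):
    """Simplified UDR-style metric: over every possible pairwise
    conflict, sum the utility of the function the resolver preserves,
    normalized by what an ideal resolver (always keeping the
    higher-utility function) would preserve.

    functions -- {name: utility}
    resolver  -- callable (a, b, functions) -> name of the survivor
    """
    preserved = ideal = 0.0
    for a, b in combinations(functions, 2):
        preserved += functions[resolver(a, b, functions)]
        ideal += max(functions[a], functions[b])
    return preserved / ideal if ideal else 1.0

funcs = {"gps": 10.0, "music": 3.0, "video": 8.0}

def by_utility(a, b, f):          # utility-aware resolution
    return a if f[a] >= f[b] else b

def by_fixed_priority(a, b, f, prio={"music": 1, "video": 2, "gps": 3}):
    return a if prio[a] < prio[b] else b   # blind to utility

utility_udr = udr(funcs, by_utility)
fixed_udr = udr(funcs, by_fixed_priority)
```

With these invented numbers, the utility-aware resolver preserves all of the ideal utility, while the fixed-priority resolver preserves only half of it, which is the kind of gap the UDR metric is designed to expose.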
In theory, a perfect conflict-resolution scheme should preserve the maximum possible utility. In reality, however, decisions must be made ahead of time assuming that critical functions will run for their worst-case execution times (even though they usually do not) to ensure that they finish before their deadlines. Unfortunately, by the time we discover that a critical function finished early, it may already be too late to run other functions.
Using the UDR metric, we compared our scheme against the rate-monotonic scheduler (RMS) and a scheme called Criticality As Priority Assignment (CAPA), which uses criticality as the priority. Our experiments showed that our scheme can recover up to 88 percent of the ideal utility, that is, the utility we could obtain if we had perfect knowledge of exactly how much time each function needed to finish executing and could fully reclaim the unused time left by the critical functions. In addition, we observed that our double-booking scheme can achieve up to three times the UDR that RMS provides.
We implemented a design-time algorithm that evaluates the UDR of a system and generates the scheduling parameters for our runtime scheduler, which performs the conflict resolution of our overbooking scheme (deciding which function gets the overbooked CPU time). We implemented this scheduler in the Linux operating system as a proof of concept to evaluate the practicality of our mechanisms. To evaluate our scheme in a real-world setting, we used our scheduler in a surveillance UAV application on the Parrot A.R. Drone quadricopter, with a safety-critical function (flight control) and two non-critical functions (video streaming and vision-based object detection).
Our results confirmed that our scheduler recovers more CPU cycles for non-critical tasks than the fixed-priority scheduler (using rate-monotonic priorities) without causing problems for the critical tasks. For example, we avoided instability in the flight controller that can lead to the quadricopter turning upside down. In addition, the overbooking between the non-critical tasks performed by our algorithm allowed us to adapt automatically to peaks in the number of objects to detect (and hence in the execution time of the object-detection function) by reducing the frames per second processed by the video-streaming function during these peaks.
In future work, we are extending our investigation to multi-core scheduling, where we plan to apply our scheme to hardware resources (such as caches) shared across cores.
This research is done in collaboration with Jeffrey Hansen of CMU; John Lehoczky of CMU's Statistics Department; and Ragunathan (Raj) Rajkumar and Anthony Rowe of CMU's Electrical and Computer Engineering Department.