Verifying Distributed Adaptive Real-Time Systems
This post was co-authored by Sagar Chaki
In 2011, the U.S. Government maintained a fleet of approximately 8,000 unmanned aerial systems (UAS), commonly referred to as "drones," a number that continues to grow. "No weapon system has had a more profound impact on the United States' ability to provide persistence on the battlefield than the UAVs," according to a report from the 2012 Defense Science Board. Making sure government and privately owned drones share international air space safely and effectively is a top priority for government officials. Distributed Adaptive Real-Time (DART) systems are key to many areas of Department of Defense (DoD) capability, including the safe execution of autonomous, multi-UAS missions having civilian benefits. DART systems promise to revolutionize several such areas of mutual civilian-DoD interest, such as robotics, transportation, energy, and health care. To fully realize the potential of DART systems, however, the software controlling them must be engineered for high-assurance and certified to operate safely and effectively. In short, these systems must satisfy guaranteed and highly-critical safety requirements (e.g., collision avoidance) while adapting smartly to achieve application requirements, such as protection coverage, while operating in dynamic and uncertain environments. This blog post describes our architecture and approach to engineering high-assurance software for DART systems.
The Challenges of Autonomy
A major concern of the Federal Aviation Administration (FAA) and other government agencies is incorporating drones safely into national air space (NAS). Government agencies want a system to provide high assurance that adding UAVs to the NAS won't cause problems.
The DoD has drone-related concerns of its own. The DoD worries that UAVs cannot be brought into the formal acquisition cycle, and therefore it must field them as prototypes. While this restriction causes acquisition problems, it also places limits on what the DoD can do with autonomous systems.
DART systems are groups of physically separated cyber-physical systems (CPS) that operate in an environment, intelligently adapt their behavior, and communicate and collaborate to achieve some overall group objective.
There are many challenges to achieving high-assurance DART software:
- Concurrency is inherently hard to reason about.
- There is often uncertainty in the physical environment in which DARTs operate.
- Autonomous capability leads to unpredictable behaviors.
- It is necessary to achieve both guaranteed and probabilistic properties.
- Verification results on models must be carried over to source code.
Our group--in addition to Sagar and myself, the team of researchers included Dionisio de Niz, Bjorn Andersson, Mark Klein, Jeffery Hansen, David Kyle, Gabriel Moreno, Scott Hissam, Charles Hammons, and Arie Gurfinkel--aims to engineer and objectively analyze high-assurance software for DART systems.
Foundations of Our Work
In recent years, we have focused on developing open-sourced middleware that is:
- useful to application developers
- enables distributed, decentralized artificial intelligence
- fast, small, and capable
- portable to as many platforms relevant to UAS as possible
- extensible to facilitate new transports, linking with external libraries, security, assurance, and consistency
- extensively documented
These priorities became the basis for several SEI research initiatives:
- Multi-Agent Distributed Adaptive Resource Allocation (MADARA) - MADARA allows developers to write both state-based and events-based programs (or combinations of both) for distributed artificial intelligence. Programs can react to receive, send, or rebroadcast events. MADARA allows application developers to focus not on message queues between robotic components or devices, but instead on the interaction between robots and knowledge, with the middleware handling the communication if necessary.
In MADARA, values are mapped to variable names in a knowledge base, while changes to knowledge are aggregated and sent over the network. Consistent checks are implemented via Lamport clocks per variable to allow for resends over unreliable transports. MADARA has been designed for environments with variable latency and loss and includes an extensible filter system for bandwidth and deadline shaping, encryption, and security protocols.
- Group Autonomy for Mobile Systems (GAMS) - The GAMS architecture is middleware for implementing self-interested agents that is defined in a formal manner. These agents collaborate with a certain type of finite state machine as executed. GAMS gives developers guidance on how to do this in a way that can be verified.
GAMS builds directly on top of MADARA, and provides localization and heading/vector primitives per agent. GAMS also provides a consistent Monitor, Analyze, Plan, Execute (MAPE) loop. What sets GAMS apart, however, is an extensible algorithm and platform layers that allow algorithms to be written for many different types of robots. GAMS also provide a Pose system for translating between reference frames (e.g., automatically transitioning from a GPS-based positioning system to a Cartesian grid). The GAMS and MADARA middleware platforms are both well-documented, with over 40 percent of the combined code bases being Doxygen-compliant documentation for methods and classes.
- Model Checking for Distributed Applications (MCDA) - In our MCDA work, we developed tools and demonstrations for producing verifiably correct distributed applications (i.e., software implementing distributed algorithms, with safety-critical requirements). Our current focus is on applications where nodes are mobile and communicate wirelessly (e.g., a quadcopter).
Our work on DART builds on the exploratory research we began with MCDA, but we added more verification, more emphasis on middleware technologies, and thread scheduling to make an approach that we can verify and transition to our stakeholders and the public. With this latest work, we also added self-adaptive behaviors (i.e., agents behave intelligently to change their behavior to adapt to changing external situations.)
In DART, we moved beyond mere code analysis to produce C++ code that will run on an actual drone platform. This code uses GAMS for communication among agents, and MADARA to interact with the (simulated) physical platform (i.e., robots).
The DART Approach
To precisely program DART systems and specify properties, we use a domain-specific language, DMPL (DART Modeling and Programming Language), and the SEI's Architecture Analysis and Design Language (AADL) notation.
Built for programming distributed systems, DMPL resembles C and C++, but it has the concepts of nodes, communication, and shared variables. DMPL is an extension of a language called DASL, which we developed for MCDA. The DMPL language is a higher level of abstraction, so we can analyze it more tractably. Once we have conducted this analysis, we then generate code in C++ that uses GAMS and MADARA, as mentioned earlier.
AADL helped us define the architecture and environment and extend integration of DASL to lend structure to the interacting elements that are going to execute. DMPL interfaces with the AADL architecture, resulting in a better fidelity model for verification. The use of AADL provided a nice synergy because while AADL is an architecture description language, it is also extensible.
For example, the designers of AADL understood that AADL by itself is not going to specify the system in complete detail. They therefore allowed an extension mechanism, called an annex mechanism, by which one could add a new domain-specific language to AADL as a "sub-language". The annex allowed us to incorporate DMPL into AADL as a sub-language (i.e., we are developing a DMPL annex for AADL.)
Our end vision is that architects can design the system using this AADL-DMPL combination. Since system components are well specified (i.e., structure, semantics, and communication mechanisms), and code is generated automatically from them, users can have high confidence that the analysis results apply to the real system.
Verifying the DART System
We are using the following techniques to verify DART systems:
- Software model checking - We are using software model checking, a formal, rigorous, exhaustive verification technique to verify what we call "guaranteed properties," such as collision avoidance, which must always be true.
- Statistical model checking - The statistical model checking technique uses Monte Carlo Simulations to evaluate the probability of mission success. Statistical model checking is used to verify a system that is supposed to achieve a goal (e.g., go from here to there without being exposed to a certain level of danger.)
We know that because a system is probabilistic in practice, on some runs it will fail to achieve the mission goal. On other runs, the system will succeed. We used statistical model checking to estimate the likelihood that the systems will succeed in the mission.
- Zero Slack Rate Monatonic Scheduling - We also used Zero Slack Rate Monotonic Scheduling (ZSRM) for thread schedulability (i.e., timing). ZSRM is a type of fixed priority thread scheduling technique for mixed criticality systems. Given a set of threads with different levels of criticality (e.g., engine control thread with high criticality and audio player thread with low criticality) ZSRM ensures that under overload conditions, threads with higher criticality are given CPU cycles preferentially to those with lower criticality. This assurance is important for DART since we want threads that are responsible for "guaranteed properties" to have higher criticality than others.
We are using all of these techniques to analyze DART systems at a higher level of abstraction.
From Ground to Space
We have not solved all the problems, or even all of the important problems, concerning DART assurance. Our work continues in the following areas, with particular focus on transition to production systems:
- Ground: We are working with robots designed by the National Robotics Engineering Center (NREC), a part of the Robotics Institute at CMU, to apply our analysis techniques to achieve high assurance in ground-based robotic systems.
- Water: We are using boats designed by Platypus LLC, a Carnegie Mellon University (CMU) spinoff led by Paul Scerri, to investigate group autonomy in unmanned surface vehicles.
- Air: We have investigated autonomy issues in unmanned aerial vehicles using a variety of commercial off-the-shelf offerings including those from Parrot and 3D Robotics in the form of quadcopters and hexacopters. We are working with engaged stakeholders in the AFRL to augment our techniques to transition them into DoD systems.
- Space: We are working on a distributed autonomous space system called the Multi-Planetary Smart Tile, a product of academic workshops at the Keck Institute for Space Studies, to examine the application of DART concepts to multi-agent autonomy in the harshness and remote of space.
Making Verification of Distributed Artificial Intelligence Available
Our aim with this work was to give the public access to, and encourage collaboration on, artificial intelligence development, including an infrastructure for building and verifying distributed systems. We know that our approach has the potential to scale to hundreds of robots, which, if successful, would be unique among similar systems currently in use.
Our approach, conversely, is more rigorous and defines a very precise behavior. We innovate, not just in terms of features but also in terms of how easy it is to put together. Our approach also seeks predictability, so when you program you know what you are supposed to get.
In our experience working with government and industry projects, we continually see complex production systems go over budget as a result of unexpected behaviors. Our driving focus has therefore been to make an approach that is well defined and predictable because it is easy to get lost in all the emergent behaviors that occur as soon as a complex system is developed and deployed.
We welcome your feedback on this work in the comments section below.
All of our tools developed through our work on verifying distributed systems are BSD-Licensed (i.e., available via open source with minimal restrictions).
GAMS is available on GitHub.
Madara, Middleware for Distributed Applications, is available on SourceForge.
DART Software: https://github.com/cps-sei/dart
DART Website: http://cps-sei.github.io/dart