Safety Assurance Does Not Provide Software Assurance
Cyber attacks on physical infrastructure, such as pipelines, electrical grids, and water-processing plants, have intensified interest in the cybersecurity of cyber-physical systems, which previously had been more focused on individual devices, such as automobiles and airplanes. History has demonstrated that these infrastructure systems are subject to many of the same attacks as IT systems—they share a lot of common software in their applications. However, cyber-physical systems have an additional attack surface through the environmental inputs used by applications to control and manage the device. Sensors and actuators can malfunction. Monitors and regulators can fail. The safety community has invested significantly to create processes for evaluating whether a device can be safely operated in the face of a device malfunction. There is a claim that these safety evaluations effectively provide software assurance, as well. The argument is simply that the detection and remediation of a physical anomaly, say a sensor input, is indifferent to whether a natural phenomenon like a cosmic ray hit the sensor or an adversary shined a laser at the sensor. Either way, the systems need to respond to handle the bad sensor input.
While appealing, the simple idea that one need only consider sensor failures without considering the cause can fail to provide cybersecurity because of one important distinction: adversaries do not obey the laws of physics. This blog post will consider some examples to illustrate the differences and other common attack surfaces of cyber-physical devices.
Redundancy does not provide security. Redundancy is one of the cornerstones of reliability for yielding safer devices. Chips may fail for any number of reasons, but if there are extra copies of circuits or computers, the system remains resilient in the face of a single failure. (For simplicity of discussion, common mode failures, like a shared power supply, are not considered here.) The assumption is that each device will fail independently, so the probability of system failure is the product of the individual failure probabilities, which is usually acceptably low.
However, attacks, and the errors and vulnerabilities they exploit, are not independent events. To the extent there is a vulnerability or other software error that can take down a subsystem, duplicate copies of the subsystem can be taken down together. For example, in the August 2003 blackout of the U.S. Northeast, in response to a primary server crash, an identical secondary server was brought online and subsequently crashed for the same reason. Thus, when the independence assumption is broken, the safety analysis does not apply.
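The arithmetic behind this point can be sketched in a few lines. The example below is only an illustration: the function names and the failure probability of 1 in 1,000 are assumptions chosen for the example, not figures from any real system.

```python
def independent_failure_probability(p_single: float, copies: int) -> float:
    """Probability that every redundant copy fails, assuming independent failures."""
    return p_single ** copies


def shared_vulnerability_failure_probability(p_exploit: float, copies: int) -> float:
    """Probability that every copy fails when all copies share one exploitable flaw.

    An exploit that works on one identical copy works on all of them, so the
    number of copies does not reduce the probability.
    """
    return p_exploit


# Hypothetical numbers: three identical units, each failing with probability 0.001.
p = 1e-3
print("independent failures:", independent_failure_probability(p, 3))
print("shared vulnerability:", shared_vulnerability_failure_probability(p, 3))
```

Under the independence assumption, three copies reduce the failure probability from about one in a thousand to about one in a billion. Against a shared vulnerability, the extra copies buy nothing.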
Fault trees do not generate an exhaustive collection of attack scenarios. Fault trees are a popular technique for exploring the state of a system, typically based around the failure of a component. The failure model is commonly based on physical properties to gauge its likelihood. Depending on one’s point of view, either software is a single component or every line in a program is a component. In either case, an adversary is not following along potential fault lines in the software but can cause both intracomponent failure (i.e., within a software module) and simultaneous component failures (i.e., in many software modules). Moreover, adversaries do not prioritize the component most likely to fail, but rather the component whose compromise provides the most value (or, in a more sophisticated analysis, the best ratio of value to cost of attack). These situations do not correspond easily to conventional fault trees for safety analysis. Attack trees provide an analogous technique for cybersecurity analysis that considers factors of adversary interest, skill, and access.
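To make the contrast with fault trees concrete, here is a minimal sketch of an attack tree evaluated by attacker cost rather than by failure likelihood. The node names, the cost values, and the `min_attack_cost` helper are all hypothetical and chosen only for illustration. Real attack-tree analyses typically weigh additional attributes such as skill and access.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    kind: str = "leaf"  # "leaf", "and" (all children needed), or "or" (any child suffices)
    cost: float = 0.0   # attacker cost; only meaningful for leaves
    children: List["Node"] = field(default_factory=list)


def min_attack_cost(node: Node) -> float:
    """Return the cheapest total cost for an attacker to achieve this node's goal."""
    if node.kind == "leaf":
        return node.cost
    child_costs = [min_attack_cost(child) for child in node.children]
    if node.kind == "and":
        return sum(child_costs)  # every step must be carried out
    if node.kind == "or":
        return min(child_costs)  # the attacker picks the cheapest option
    raise ValueError(f"unknown node kind: {node.kind}")


# Hypothetical tree: two ways to take control of an actuator.
tree = Node("control actuator", "or", children=[
    Node("use maintenance interface", "and", children=[
        Node("gain physical access", cost=5.0),
        Node("guess default password", cost=1.0),
    ]),
    Node("exploit network service", cost=10.0),
])

print(min_attack_cost(tree))  # 6.0
```

In this sketch the cheapest path runs through the maintenance interface, regardless of which component would be most likely to fail on its own.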
Steady state behavior cannot be relied on as evidence of security. Self-inspection by systems is one technique to provide evidence that a system deemed safe remains safe. In particular, given a set of operating parameters, a safe system that performs within those parameters is considered to remain safe. However, advanced persistent threats (APTs) are a growing style of attack in which adversaries insert an extremely low-level aberration into the system that can be remotely triggered into an attack when desired. The aberrations are so small that the monitored values remain within operational parameters. More sophisticated versions of malware go further and manipulate the monitoring systems. For example, the Stuxnet attack presented the operator with a status display reporting a safely controlled system while the physical centrifuges were being destroyed.
Deteriorating performance cannot be relied on as evidence of anticipated operation. Mechanical systems wear out in ways that are predicted from experience and physical models of the components. Routine maintenance schedules are examples where such models are used to confirm that the system is performing as expected. Adversaries can use the same kinds of modeling to hide activity that is causing similar degradation. As a variant on APTs, an adversary could slowly increase the depth and scope of an attack instead of triggering one that leaves traces and signals. In a sophisticated form, the adversary could make the behavior appear to be slightly accelerated wear and tear that would prompt accelerated maintenance, not cybersecurity remediation.
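Both the low-level aberration and the slow-degradation scenarios can be illustrated with a toy monitor. The sketch below assumes a hypothetical self-check that accepts a reading if it is within fixed limits and has not jumped too far from the previous reading. The limits, step size, and drift rate are invented for the example.

```python
from typing import Sequence

LOW, HIGH = 90.0, 110.0  # hypothetical operating limits
MAX_STEP = 0.5           # hypothetical largest acceptable change between readings


def monitor_ok(readings: Sequence[float]) -> bool:
    """Return True if every reading is in range and no single step is too large."""
    for previous, current in zip(readings, readings[1:]):
        if abs(current - previous) > MAX_STEP:
            return False
    return all(LOW <= reading <= HIGH for reading in readings)


# An adversary nudges the value by 0.05 per sample, far below MAX_STEP.
drifted = [100.0 + 0.05 * step for step in range(150)]

print(monitor_ok(drifted))        # the monitor raises no alarm
print(drifted[-1] - drifted[0])   # yet the value has moved by roughly 7.45
print(monitor_ok([100.0, 120.0])) # an abrupt fault, by contrast, is caught
```

A monitor tuned to catch abrupt physical faults sees nothing unusual here, even though the cumulative change is large.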
Another class of cybersecurity risks in cyber-physical systems derives from commonly used implementation techniques that are viewed as acceptable for managing safety risk but increase the cybersecurity risk. In particular, various optimization techniques to conserve power, space, or components can reduce the isolation that is a key tenet of a zero trust cybersecurity strategy.
Global state increases risk. Cyber-physical systems often share global state to reduce communication costs between software components. The alternative practice of using parameters or access functions to set or interrogate state enforces boundaries between objects and modules, preserving invariants that can be used to prove security (and other) properties, but it typically incurs overhead in space and time. Avoiding that overhead has a price: global variables have long been known as a source of errors that can lead to vulnerabilities.
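The difference between global state and an access function can be sketched as follows. The valve, its 0 to 100 percent range, and all names are hypothetical. Python itself does not enforce privacy, so the underscore-prefixed attribute is only a convention; the point is the shape of the interface, not a hard guarantee.

```python
MAX_VALVE_OPENING = 100  # percent; hypothetical physical limit

# Global state: any code anywhere in the program can assign an out-of-range value.
valve_opening = 0


class Valve:
    """Keeps the invariant 0 <= opening <= MAX_VALVE_OPENING behind one access function."""

    def __init__(self) -> None:
        self._opening = 0

    @property
    def opening(self) -> int:
        return self._opening

    def set_opening(self, percent: int) -> None:
        # The single place where the invariant is checked.
        if not 0 <= percent <= MAX_VALVE_OPENING:
            raise ValueError(f"opening {percent} is outside 0..{MAX_VALVE_OPENING}")
        self._opening = percent


valve_opening = 250  # silently accepted; nothing checks the global

valve = Valve()
valve.set_opening(40)
try:
    valve.set_opening(250)  # rejected by the access function
except ValueError as error:
    print("rejected:", error)
print(valve.opening)  # 40
```

The extra call and check are the overhead that the shared-global design avoids, and the rejected write is what that design gives up.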
Combining functions increases risk. Cyber-physical systems can combine parallel or unrelated operations to reduce the overhead of switching between operations and to reuse common resources. For example, real-time systems can be organized into independent tasks, such as a watchdog timer and an actuator controller. These might share an execution unit. There are many technologies for organizing and managing these units of execution. Using a dedicated machine offers the most isolation and protection, while other technologies trade isolation and protection for reduced resources, ranging from virtual machines to operating system processes to containers to threads.
Threads offer the least overhead in space and processing, and are therefore an execution unit of choice for cyber-physical systems. However, heavy reliance on threads can result in fragile code, leading to vulnerable software. In the recent Toyota experience, combining tasks into a single thread likely led to the system’s catastrophic failures, exemplifying the dangers of using threads badly. Adversaries typically gain control of a system through its weakest interface and then move laterally to other parts of the system that contain the function or data of interest.
Unfortunately, threads provide no protection against a compromised thread gaining access to other threads in the shared process. Moreover, a common attack is a denial of service (DoS) attack. With isolated components, it is possible for a system to continue to operate if one component is subject to DoS, whereas in a threaded system, any successful DoS against one thread denies service to all tasks in the process.
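The lack of protection between threads is easy to demonstrate. In this minimal sketch, the `setpoint` value, the task names, and the out-of-range value 9999 are hypothetical. The "compromised" task stands in for any thread an adversary has taken over.

```python
import threading


def run_demo() -> int:
    # State that conceptually belongs to the actuator-controller task.
    controller_state = {"setpoint": 50}

    def compromised_task() -> None:
        # Nothing in the process prevents one thread from writing
        # to memory that another task relies on.
        controller_state["setpoint"] = 9999

    worker = threading.Thread(target=compromised_task)
    worker.start()
    worker.join()
    return controller_state["setpoint"]


print(run_demo())  # 9999
```

Had the two tasks run in separate processes or machines, the compromised task would have needed an explicit communication channel to reach the controller's state.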
Maintenance backdoors offer an increased attack surface that affects cybersecurity more than safety. Cyber-physical systems can contain special interfaces intended to be used for administrative, configuration, upgrade, and other maintenance purposes. Such interfaces are not routinely used during operation, so they pose little increased safety hazard. Unlike users, however, adversaries will attempt to access every available interface. Maintenance backdoors typically run with exceptional or unfettered privileges to accomplish their tasks, such as reloading the software. These interfaces are intended to be executed only by knowledgeable and trustworthy agents. Therefore, these interfaces are not scrutinized and tested to the same level as operational interfaces, and hence have a higher risk of vulnerability. As a consequence, maintenance interfaces are prime targets for an adversary.
Designed air gaps offer a false sense of security. Strongly related to the maintenance backdoor is the use of air gaps to isolate parts of the system. Rigorously implemented air gaps are effective in practice for securing components. However, much as the maintenance backdoor is intended to be unavailable during operation, practice has shown that air-gapped systems wind up being connected.
One famous example of a connected air gap is the Jeep hack, where an intended air gap was breached, giving access where none was intended or expected. This access led to persistent vulnerability exploitation enabling unintended remote control of the vehicle. Air gaps are violated for the same reason that backdoors are created: to make operations, typically maintenance ones, easier to accomplish. In other instances, air gaps are closed to reduce human error that might be introduced through manual communication between the air-gapped systems. Looking into the future, the notion of an “air gap” itself may be made obsolete by a persistent adversary who can use a variety of side channel techniques to traverse what appears to be an uncrossable gap. One particularly imaginative example was illustrated by Adi Shamir of RSA fame, using a laser aimed at a multifunction printer’s scanner to send commands across an air gap.
Hardwired (i.e., preloaded) keys, passwords, and certificates are a bane for cybersecurity. Although the choice to hardwire a credential is not a safety issue, it is driven by the same desire to minimize resource use and by a desire to make cyber-physical devices easy to set up. Hence the practice of wiring security secrets, such as a password, into the device is unfortunately common. Adversaries can reverse engineer devices to learn embedded internal data. Reports on compromised nanny cams and internet-connected refrigerators illustrate the pervasiveness of this concern.
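As a contrast to a secret compiled into the firmware image, the sketch below reads a per-device secret that is supplied at provisioning time. The variable name `DEVICE_ADMIN_SECRET` and both helper functions are hypothetical, and an environment variable is used here only as a stand-in for whatever protected store a real device offers. It is not a recommendation of environment variables as secure storage.

```python
import hmac
import os
from typing import Mapping

ENV_VAR = "DEVICE_ADMIN_SECRET"  # hypothetical name, set per device during provisioning


def load_device_secret(environ: Mapping[str, str] = os.environ) -> bytes:
    """Fetch the provisioned secret instead of embedding a literal in the code."""
    secret = environ.get(ENV_VAR)
    if not secret:
        # Refuse to fall back to a built-in default credential.
        raise RuntimeError("device has not been provisioned with a secret")
    return secret.encode("utf-8")


def check_credential(presented: bytes, expected: bytes) -> bool:
    """Compare credentials in constant time to avoid leaking how many bytes matched."""
    return hmac.compare_digest(presented, expected)


# Usage with an explicit mapping standing in for a provisioned device:
expected = load_device_secret({ENV_VAR: "per-device-value"})
print(check_credential(b"per-device-value", expected))  # True
print(check_credential(b"admin", expected))             # False
```

Because nothing secret appears in the code itself, reverse engineering one device's software does not reveal a credential shared by every device of that model.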
Safety engineers and software developers for cyber-physical systems have made great strides in producing devices and systems that are reliable, safe, and functional. With the growing threat posed by cyber adversaries, additional attention is needed for securing cyber-physical devices that is not provided by safety analysis alone.
I’d like to thank Carol Woody, Chuck Weinstock, and John Goodenough for insightful conversations that led to this blog post. All of the opinions are mine.
Bev Littlewood, Lorenzo Strigini, “Redundancy and Diversity in Security,” Computer Security - ESORICS 2004, 9th European Symposium on Research Computer Security, Sophia Antipolis, France, September 13-15, 2004, Proceedings, https://www.researchgate.net/publication/221631956_Redundancy_and_Diversity_in_Security
National Aeronautics and Space Administration (NASA) System Failure Case Studies, “Powerless,” December 2007, Vol 1, Issue 10, https://sma.nasa.gov/docs/default-source/safety-messages/safetymessage-2008-03-01-northeastblackoutof2003.pdf
Étienne André, Didier Lime, Mathias Ramparison, Mariëlle Stoelinga, “Parametric analyses of attack-fault trees,” May 8, 2019, https://arxiv.org/pdf/1902.04336.pdf
Connor Simpson, “Stuxnet Used an Old Movie Trick to Fool Iran's Nuclear Program,” The Atlantic, November 20, 2013, https://www.theatlantic.com/international/archive/2013/11/stuxnet-used-old-movie-trick-fool-irans-nuclear-program/355340/
Geoff Sanders, “Zero Trust Adoption: Managing Risk with Cybersecurity Engineering and Adaptive Risk Assessment,” March 8, 2021, https://insights.sei.cmu.edu/blog/zero-trust-adoption-managing-risk-with-cybersecurity-engineering-and-adaptive-risk-assessment/
W. Wulf and M. Shaw, “Global Variable Considered Harmful,” ACM SIGPLAN Notices, Vol. 8, No. 2, 1973, pp. 28-34, https://dl.acm.org/doi/10.1145/953353.953355
Junko Yoshida, “Toyota Trial: Transcript Reveals ‘Task X’ Clues,” EE Times, Oct 29, 2013, https://www.eetimes.com/toyota-trial-transcript-reveals-task-x-clues/
Alex Drozhzhin, “Black Hat USA 2015: The full story of how that Jeep was hacked,” Kaspersky Daily, Aug 7, 2015, https://usa.kaspersky.com/blog/blackhat-jeep-cherokee-hack-explained/5749/
Lucian Constantin, “Utterly crazy hack uses long-distance lasers to send malware commands via all-in-one printers,” IDG News Service/PC World, Oct 16, 2014, https://www.pcworld.com/article/2834972/allinone-printers-can-be-used-to-control-infected-airgapped-systems-from-far-away.html
Melissa Willets, “Could Your Nanny Cams Be Hacked? See Warning from Mom Whose Camera Feed Was Posted Online,” Parents, August 15, 2017, https://www.parents.com/toddlers-preschoolers/everything-kids/could-your-nanny-cams-be-hacked-see-warning-from-mom-whose/
Zack Whittaker, “Internet-connected industrial refrigerators can be remotely defrosted, thanks to default passwords,” TechCrunch, February 8, 2019, https://techcrunch.com/2019/02/08/industrial-refrigerators-defrost-flaw/