Networking at the Tactical and Humanitarian Edge

Edge systems are computing systems that operate at the edge of the connected network, close to users and data. These types of systems are off premises, so they rely on existing networks to connect to other systems, such as cloud-based systems or other edge systems. Due to the ubiquity of commercial infrastructure, the presence of a reliable network is often assumed in industrial or commercial edge systems. Reliable network access, however, cannot be guaranteed in all edge environments, such as in tactical and humanitarian edge environments. In this blog post, we will discuss networking challenges in these environments that primarily stem from high levels of uncertainty and then present solutions that can be leveraged to address and overcome these challenges.

Networking Challenges in Tactical and Humanitarian Edge Environments

Tactical and humanitarian edge environments are characterized by limited resources, which include network access and bandwidth, making access to cloud resources unavailable or unreliable. In these environments, due to the collaborative nature of many missions and tasks—such as search and rescue or maintaining a common operational picture—access to a network is required for sharing data and maintaining communications among all team members. Keeping participants connected to each other is therefore key to mission success, regardless of the reliability of the local network. Access to cloud resources, when available, may complement mission and task accomplishment.

Uncertainty is an important characteristic of edge environments. In this context, uncertainty involves not only network (un)availability, but also operating environment (un)availability, which in turn may lead to network disruptions. Tactical edge systems operate in environments where adversaries may try to thwart or sabotage the mission. Such edge systems must continue operating under unexpected environmental and infrastructure failure conditions despite the variety and uncertainty of network disruptions.

Tactical edge systems contrast with other edge environments. For example, in the urban and the commercial edge, the unreliability of any access point is typically resolved via alternate access points afforded by the extensive infrastructure. Likewise, in the space edge, delays in communication (and the cost of deploying assets) typically result in self-contained systems that are fully capable when disconnected, with regularly scheduled communication sessions. Uncertainty in turn results in the key challenges in tactical and humanitarian edge environments described below.

Challenges in Defining Unreliability

The level of assurance that data are successfully transferred, which we refer to as reliability, is a top-priority requirement in edge systems. One commonly used measure to define reliability of modern software systems is uptime, which is the time that services in a system are available to users. When measuring the reliability of edge systems, the availability of both the systems and the network must be considered together. Edge networks are often disconnected, intermittent, and of low bandwidth (DIL), which challenges uptime of capabilities in tactical and humanitarian edge systems. Since failure in any aspect of the system or the network may result in unsuccessful data transfer, developers of edge systems must take a broad perspective when considering unreliability.
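As a rough illustration of why the systems and the network must be considered together, suppose that every element on a data path fails independently. That independence is a simplifying assumption of ours, and the availability figures below are hypothetical. Under that assumption, the end-to-end availability is the product of the individual availabilities, so it can never be better than the weakest element.

```python
from math import prod

def end_to_end_availability(availabilities):
    """Availability of a serial path, ASSUMING independent failures."""
    return prod(availabilities)

# Hypothetical figures: edge device, local radio link, backhaul link, cloud service.
path = [0.99, 0.95, 0.70, 0.999]
combined = end_to_end_availability(path)
print(f"end-to-end availability: {combined:.3f}")
```

Even with a highly available device and cloud service, the unreliable backhaul link dominates the result in this sketch.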

Challenges in Designing Systems to Operate with Disconnected Networks

Disconnected networks are often the simplest type of DIL network to manage. These networks are characterized by long periods of disconnection, with planned triggers that may briefly, or periodically, enable connection. Common situations where disconnected networks are prevalent include

disaster-recovery operations where all local infrastructure is completely inoperable
tactical edge missions where radio frequency (RF) communications are jammed throughout
planned disconnected environments, such as satellite operations, where communications are available only at scheduled intervals when relay stations point in the right direction

Edge systems in such environments must be designed to make maximum use of bandwidth when it becomes available, which primarily involves preparation and readiness for the trigger that will enable connection.

Challenges in Designing Systems to Operate with Intermittent Networks

Unlike disconnected networks, in which network availability can eventually be expected, intermittent networks have unexpected disconnections of variable length. These failures can happen at any time, so edge systems must be designed to tolerate them. Common situations where edge systems must deal with intermittent networks include

disaster-recovery operations with a limited or partially damaged local infrastructure; and unexpected physical effects, such as power surges or RF interference from broken equipment resulting from the evolving nature of a disaster
environmental effects during both humanitarian and tactical edge operations, such as passing by walls, through tunnels, and within forests that may result in changes in RF coverage for connectivity

The approaches for handling intermittent networks, which mostly concern different types of data distribution, are different from the approaches for disconnected networks, as discussed later in this post.

Challenges in Designing Systems to Operate with Low Bandwidth Networks

Finally, even when connectivity is available, applications operating at the edge often must deal with insufficient bandwidth for network communications. This challenge requires data-encoding strategies to maximize available bandwidth. Common situations where edge systems must deal with low-bandwidth networks include

environments with a high density of devices competing for available bandwidth, such as disaster-recovery teams all using a single satellite network connection
military networks that leverage highly encrypted links, reducing the available bandwidth of the connections

Challenges in Accounting for Layers of Reliability: Extended Networks

Edge networking is typically more complicated than just point-to-point connections. Multiple networks may come into play, connecting devices in a variety of physical locations, using a heterogeneous set of connectivity technologies. There are often multiple devices that are physically located at the edge. These devices may have good short-range connectivity to each other, either through common protocols, such as Bluetooth or WiFi mobile ad hoc networking (MANET), or through a short-range enabler, such as a tactical network radio. This short-range networking will likely be far more reliable than connectivity to the supporting networks, or even the full Internet, which may be provided by line-of-sight (LOS) or beyond-line-of-sight (BLOS) communications, such as satellite networks, and may even be provided by an intermediate connection point.

While network connections to cloud or data-center resources (i.e., backhaul connections) can be far less reliable, they are valuable to operations at the edge because they can provide command-and-control (C2) updates, access to experts with locally unavailable expertise, and access to large computational resources. However, this combination of short-range and long-range networks, with the potential of a variety of intermediate nodes providing resources or connectivity, creates a multifaceted connectivity picture. In such cases, some links are reliable but low bandwidth, some are reliable but available only at set times, some come in and out unexpectedly, and some are a complete mix. It is this complicated networking environment that motivates the design of network-mitigation solutions to enable advanced edge capabilities.

Architectural Tactics to Address Edge Networking Challenges

Solutions to overcome the challenges we enumerated generally address two areas of concern: the reliability of the network (e.g., can we expect that data will be transferred between systems?) and the performance of the network (e.g., what realistic bandwidth can be achieved regardless of the level of reliability that is observed?). The following common architectural tactics, which are design decisions that influence the achievement of a quality attribute response (such as mean time to failure of the network), help improve reliability and performance to mitigate edge-network uncertainty. We discuss these tactics in four groups: data-distribution shaping, connection shaping, protocol shaping, and data shaping.

Data-Distribution Shaping

An important question to answer in any edge-networking environment is how data will be distributed. A common architectural pattern is publish–subscribe (pub–sub), in which some nodes share (publish) data and other nodes actively request (subscribe to) updates. This approach is popular because it addresses low-bandwidth concerns by limiting data transfer to only those nodes that actively want the data. It also simplifies and modularizes data processing for different types of data within the set of systems running on the network. In addition, it can provide more reliable data transfer through centralization of the data-transfer process. Finally, pub–sub approaches also work well with distributed containerized microservices, an approach that is dominating current edge-system development.

Standard Pub–Sub Distribution

Publish–subscribe (pub–sub) architectures work asynchronously through elements that publish events and other elements that subscribe to those to manage message exchange and event updates. Most data-distribution middleware, such as ZeroMQ or many of the implementations of the Data Distribution Service (DDS) standard, provide topic-based subscription. This middleware enables a system to state the type of data that it is subscribing to based on a descriptor of the content, such as location data. It also provides true decoupling of the communicating systems, allowing for any publisher of content to provide data to any subscriber without the need for either of them to have explicit knowledge about the other. As a result, the system architect has far more flexibility to build different deployments of systems providing data from different sources, whether backup/redundant or entirely new ones. Pub–sub architectures also enable simpler recovery operations for when services lose connection or fail since new services can spin up and take their place without any coordination or reorganization of the pub–sub scheme.
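The decoupling described above can be sketched with a toy in-process broker. This is a minimal illustration, not the API of ZeroMQ or of any DDS implementation; the class name `TopicBus`, its methods, and the topic names are hypothetical.

```python
from typing import Any, Callable

class TopicBus:
    """Toy in-process broker: publishers and subscribers share only topic names."""

    def __init__(self) -> None:
        self._subscribers = {}  # topic name -> list of callbacks

    def subscribe(self, topic: str, callback: Callable[[Any], None]) -> None:
        self._subscribers.setdefault(topic, []).append(callback)

    def publish(self, topic: str, message: Any) -> int:
        """Deliver to every current subscriber; return how many received it."""
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(message)
        return len(callbacks)

bus = TopicBus()
received = []
bus.subscribe("location", received.append)

# The publisher knows the topic name but nothing about who is listening.
bus.publish("location", {"lat": 40.44, "lon": -79.95})
bus.publish("weather", {"temp_c": 21})  # no subscribers, so nothing is delivered
```

Because neither side holds a reference to the other, a replacement publisher can start sending on the same topic without any change to the subscribers.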

A less-supported augmentation to topic-based pub–sub is multi-topic subscription. In this scheme, systems can subscribe to a custom set of metadata tags, which allows for data streams of similar data to be appropriately filtered for each subscriber. As an example, imagine a robotics platform with multiple redundant location sources that needs a consolidation algorithm to process raw location data and metadata (such as accuracy and precision, timeliness, or deltas) to produce a best-available location representing the location that should be used for all the location-sensitive consumers of the location data. Implementing such an algorithm would yield a service that might be subscribed to all data tagged with location and raw, a set of services subscribed to data tagged with location and best available, and perhaps specific services that are interested only in specific sources, such as Global Navigation Satellite System (GLONASS) or relative reckoning using an initial position and position/motion sensors. A logging service would also likely be used to subscribe to all location data (regardless of source) for later review.

Situations such as this, where there are multiple sources of similar data but with different contextual elements, benefit greatly from data-distribution middleware that supports multi-topic subscription capabilities. This approach is becoming increasingly popular with the deployment of more Internet of Things (IoT) devices. Given the amount of data that would result from scaled-up use of IoT devices, the bandwidth-filtering value of multi-topic subscriptions can also be significant. While multi-topic subscription capabilities are much less common among middleware providers, we have found that they enable greater flexibility for complex deployments.
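The location example can be sketched as a broker that matches on sets of tags rather than on a single topic. This is an illustrative toy under our own assumptions (a subscriber receives a message only when the message carries all of the subscriber's tags); the class `TagBus` and the tag names are hypothetical and do not correspond to any particular middleware product.

```python
class TagBus:
    """Toy broker: a subscriber receives messages that carry ALL of its tags."""

    def __init__(self):
        self._subscriptions = []  # list of (required_tags, callback)

    def subscribe(self, tags, callback):
        self._subscriptions.append((frozenset(tags), callback))

    def publish(self, tags, payload):
        message_tags = frozenset(tags)
        for required, callback in self._subscriptions:
            if required <= message_tags:  # subset test on the tag sets
                callback(payload)

bus = TagBus()
consolidator, consumers, logger = [], [], []
bus.subscribe({"location", "raw"}, consolidator.append)        # fusion service
bus.subscribe({"location", "best-available"}, consumers.append)
bus.subscribe({"location"}, logger.append)                     # logs everything

bus.publish({"location", "raw", "gnss"}, "fix-1")
bus.publish({"location", "raw", "dead-reckoning"}, "fix-2")
bus.publish({"location", "best-available"}, "fused-1")
```

Here the consolidation service sees only the raw fixes, location-sensitive consumers see only the fused result, and the logger sees all three messages.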

Centralized Distribution

Similar to how some distributed middleware services centralize connection management, a common approach to data transfer involves centralizing that function to a single entity. This approach is typically enabled through a proxy that performs all data transfer for a distributed network. Each application sends its data to the proxy (all pub–sub and other data) and the proxy forwards it to the necessary recipients. MQTT is a common middleware software solution that implements this approach.

This centralized approach can have significant value for edge networking. First, it consolidates all connectivity decisions in the proxy such that each system can share data without having any knowledge of where, when, and how data is being delivered. Second, it allows implementing DIL-network mitigations in a single location so that protocol and data-shaping mitigations can be limited to only network links where they are needed.

However, there is a bandwidth cost to consolidating data transfer into proxies. Moreover, there is also the risk of the proxy becoming disconnected or otherwise unavailable. Developers of each distributed network should carefully consider the likely risks of proxy loss and make an appropriate cost/benefit tradeoff.

Connection Shaping

Network unreliability makes it hard to (a) discover systems within an edge network and (b) create stable connections between them once they are discovered. Actively managing this process to minimize uncertainty will improve overall reliability of any group of devices collaborating on the edge network. The two primary approaches for making connections in the presence of network instability are individual and consolidated, as discussed next.

Individual Connection Management

In an individual approach, each member of the distributed system is responsible for discovering and connecting to other systems that it communicates with. The DDS Simple Discovery protocol is the standard example of this approach. A version of this protocol is supported by most software solutions for data-distribution middleware. However, the inherent challenge of operating in a DIL network environment makes this approach hard to execute, and especially to scale, when the network is disconnected or intermittent.

Consolidated Connection Management

A preferred approach for edge networking is assigning the discovery of network nodes to a single agent or enabling service. Many modern distributed architectures provide this feature via a common registration service for preferred connection types. Individual systems let the common service know where they are, what types of connections they have available, and what types of connections they are interested in, so that routing of data-distribution connections, such as pub–sub topics, heartbeats, and other common data streams, are handled in a consolidated manner by the common service.

The FAST-DDS Discovery Server, used by ROS2, is an example of an implementation of an agent-based service to coordinate data distribution. This service is often applied most effectively for operations in DIL-network environments because it enables services and devices with highly reliable local connections to find each other on the local network and coordinate effectively. It also consolidates the challenge of coordination with remote devices and systems and implements mitigations for the unique challenges of the local DIL environment without requiring each individual node to implement those mitigations.
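The registration idea can be sketched as follows. This toy registry is our own illustration of consolidated connection management, not the interface of the discovery server named above; the class `DiscoveryRegistry`, its methods, and the addresses are hypothetical.

```python
class DiscoveryRegistry:
    """Toy registration service: nodes announce what they publish and want."""

    def __init__(self):
        self._offers = {}     # topic -> set of node addresses publishing it
        self._interests = {}  # topic -> set of node addresses subscribing to it

    def register(self, address, publishes=(), subscribes=()):
        for topic in publishes:
            self._offers.setdefault(topic, set()).add(address)
        for topic in subscribes:
            self._interests.setdefault(topic, set()).add(address)

    def unregister(self, address):
        """Forget a node, for example after its heartbeat stops."""
        for table in (self._offers, self._interests):
            for nodes in table.values():
                nodes.discard(address)

    def routes(self):
        """Return sorted (publisher, subscriber, topic) tuples for matched topics."""
        return sorted(
            (pub, sub, topic)
            for topic, pubs in self._offers.items()
            for pub in pubs
            for sub in self._interests.get(topic, ())
            if pub != sub
        )

registry = DiscoveryRegistry()
registry.register("10.0.0.2", publishes=["location"])
registry.register("10.0.0.3", subscribes=["location", "heartbeat"])
registry.register("10.0.0.4", publishes=["heartbeat"], subscribes=["location"])
routes = registry.routes()
```

Each node talks only to the registry, so mitigations for an unreliable link (retries, caching of the last known routes) could be implemented once in the service rather than in every node.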

Protocol Shaping

Edge-system developers also must carefully consider different protocol options for data distribution. Most modern data-distribution middleware supports multiple protocols, including TCP for reliability, UDP for fire-and-forget transfers, and often multicast for general pub–sub. Many middleware solutions support custom protocols as well, such as reliable UDP supported by RTI DDS. Edge-system developers should carefully consider the required data-transfer reliability and in some cases utilize multiple protocols to support different types of data that have different reliability requirements.
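One way to make the per-data-type decision explicit is a small selection policy. The sketch below is deliberately coarse and reflects our own assumptions: it considers only whether data must arrive and whether it has many receivers, and the stream names are hypothetical. A real policy would also weigh latency, link characteristics, and the protocols the chosen middleware actually supports.

```python
from enum import Enum

class Transport(Enum):
    TCP = "tcp"              # reliable, ordered delivery
    UDP = "udp"              # fire-and-forget to one receiver
    MULTICAST = "multicast"  # fire-and-forget to a group

def choose_transport(must_arrive, many_receivers):
    """Pick a transport from two coarse requirements of a data stream."""
    if must_arrive:
        return Transport.TCP
    return Transport.MULTICAST if many_receivers else Transport.UDP

# Hypothetical streams mapped to (must_arrive, many_receivers).
STREAMS = {
    "c2_update": (True, False),
    "video_frame": (False, False),
    "position_beacon": (False, True),
}
plan = {name: choose_transport(*needs) for name, needs in STREAMS.items()}
```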

Multicasting

Multicast is a common consideration when looking at protocols, especially when a pub–sub architecture is selected. While basic multicast can be a viable solution for certain data-distribution scenarios, the system designer must consider several issues. First, multicast is a UDP-based protocol, so all data sent is fire-and-forget and cannot be considered reliable unless a reliability mechanism is built on top of the basic protocol. Second, multicast is not well supported in either (a) commercial networks due to the potential of multicast flooding or (b) tactical networks because it is a feature that may conflict with proprietary protocols implemented by the vendors. Finally, there is a built-in limit for multicast by the nature of the IP-address scheme, which may prevent large or complex topic schemes. These schemes can also be brittle if they undergo constant change, as different multicast addresses cannot be directly associated with datatypes. Therefore, while multicasting may be an option in some cases, careful consideration is required to ensure that the limitations of multicast are not problematic.
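As an example of a reliability mechanism built on top of a fire-and-forget protocol, a receiver can track per-sender sequence numbers and request retransmission of the gaps. The sketch below shows only the receiver-side bookkeeping, with no sockets involved; the class `GapDetector` and the assumption that the sender numbers its datagrams from zero are ours.

```python
class GapDetector:
    """Tracks sequence numbers from one sender and reports what is missing."""

    def __init__(self):
        self._next_expected = 0
        self._missing = set()

    def on_datagram(self, seq):
        """Record an arriving sequence number; return True if it is new data."""
        if seq >= self._next_expected:
            # Every number skipped between expected and received is now missing.
            self._missing.update(range(self._next_expected, seq))
            self._next_expected = seq + 1
            return True
        if seq in self._missing:  # a late or retransmitted datagram fills a gap
            self._missing.discard(seq)
            return True
        return False  # duplicate of something already received

    def nack_list(self):
        """Sequence numbers to request again from the sender."""
        return sorted(self._missing)

detector = GapDetector()
results = [detector.on_datagram(seq) for seq in (0, 1, 4, 3, 4)]
```

After this arrival order, datagram 2 is still outstanding, so the receiver would ask the sender to retransmit only that one.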

Use of Specifications

It is important to note that delay-tolerant networking (DTN) is an existing RFC specification that provides a great deal of structure for approaching the DIL-network challenge. Several implementations of the specification exist and have been tested, including by teams here at the SEI, and one is in use by NASA for satellite communications. The store-carry-forward philosophy of the DTN specification is best suited to scheduled communication environments, such as satellite communications. However, the DTN specification and underlying implementations can also be instructive for developing mitigations for unreliably disconnected and intermittent networks.
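The store-carry-forward idea can be sketched in a few lines. This is a heavily simplified illustration and not an implementation of the DTN specification: it assumes every contact moves bundles closer to their destination, and it omits routing, custody, fragmentation, and security. The names `Bundle` and `DtnNode` are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Bundle:
    destination: str
    payload: bytes
    created_at: float
    lifetime_s: float

    def expired(self, now):
        return now >= self.created_at + self.lifetime_s

class DtnNode:
    def __init__(self, name):
        self.name = name
        self.stored = []     # bundles carried for other nodes
        self.delivered = []  # bundles addressed to this node

    def accept(self, bundle):
        if bundle.destination == self.name:
            self.delivered.append(bundle)
        else:
            self.stored.append(bundle)  # store and carry until the next contact

    def on_contact(self, neighbor, now):
        """Forward every unexpired bundle to the neighbor; return the count."""
        live = [b for b in self.stored if not b.expired(now)]
        self.stored = []  # everything was either forwarded or had expired
        for bundle in live:
            neighbor.accept(bundle)
        return len(live)

sensor, mule, base = DtnNode("sensor"), DtnNode("mule"), DtnNode("base")
sensor.accept(Bundle("base", b"reading-1", created_at=0, lifetime_s=600))
sensor.accept(Bundle("base", b"reading-2", created_at=0, lifetime_s=60))
sensor.on_contact(mule, now=30)   # a vehicle passes the sensor
mule.on_contact(base, now=120)    # later, the vehicle reaches the base
```

In this run, the short-lived bundle expires while being carried, and only the long-lived one is delivered to the base.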

Data Shaping

Decisions about what data to transmit, how and when to transmit it, and how to format it are critical for addressing the low-bandwidth aspect of DIL-network environments. Standard approaches, such as caching, prioritization, filtering, and encoding, are some key strategies to consider. Taken together, these strategies can improve performance by reducing the overall amount of data to send. They can also improve reliability by ensuring that only the most important data are sent.

Caching, Prioritization, and Filtering

Given an intermittent or disconnected environment, caching is the first strategy to consider. Making sure that data for transport is ready to go when connectivity is available enables applications to ensure that data is not lost when the network is not available. However, there are additional aspects to consider as part of a caching strategy. Prioritization of data enables edge systems to ensure that the most important data are sent first, thus getting maximum value from the available bandwidth. In addition, filtering of cached data should also be considered, based on, for example, timeouts for stale data, detection of duplicate or unchanged data, and relevance to the current mission (which may change over time).
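These three strategies can be combined in a single outbound cache. The sketch below is one possible arrangement under our own assumptions: lower priority numbers are sent first, unchanged data for the same key is dropped, stale entries are discarded when the link comes up, and the link is modeled as a simple byte budget. The class `OutboundCache` and the sample messages are hypothetical.

```python
import heapq
import itertools

class OutboundCache:
    def __init__(self):
        self._heap = []                  # (priority, order, expires_at, key, payload)
        self._order = itertools.count()  # tie-breaker keeps insertion order
        self._last_payload = {}          # key -> most recently queued payload

    def put(self, key, payload, priority, now, ttl_s):
        """Queue data; return False if it is unchanged for this key."""
        if self._last_payload.get(key) == payload:
            return False
        self._last_payload[key] = payload
        entry = (priority, next(self._order), now + ttl_s, key, payload)
        heapq.heappush(self._heap, entry)
        return True

    def drain(self, now, budget_bytes):
        """Called when a link comes up: return payloads that fit the budget."""
        sent = []
        while self._heap:
            _, _, expires_at, key, payload = self._heap[0]
            if expires_at <= now:        # stale: drop it and forget the key
                heapq.heappop(self._heap)
                if self._last_payload.get(key) == payload:
                    del self._last_payload[key]
                continue
            if len(payload) > budget_bytes:
                break                    # keep the rest for the next contact
            heapq.heappop(self._heap)
            budget_bytes -= len(payload)
            sent.append(payload)
        return sent

cache = OutboundCache()
cache.put("casualty-report", b"2 injured at grid 4", priority=0, now=0, ttl_s=3600)
cache.put("position", b"40.44,-79.95", priority=1, now=0, ttl_s=30)
duplicate_queued = cache.put("position", b"40.44,-79.95", priority=1, now=5, ttl_s=30)
cache.put("photo", b"x" * 5000, priority=2, now=0, ttl_s=3600)
first_window = cache.drain(now=60, budget_bytes=1000)
second_window = cache.drain(now=61, budget_bytes=10000)
```

In the first connection window, the report is sent, the stale position is discarded, and the photo waits because it exceeds the remaining budget; it goes out in the second, larger window.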

Pre-processing

One approach to reducing the size of data is pre-computation at the edge, where raw sensor data is processed by algorithms designed to run on mobile devices, resulting in composite data items that summarize or detail the important aspects of the raw data. For example, simple facial-recognition algorithms running on a local video feed may send facial-recognition matches for known people of interest. These matches may include metadata, such as time, date, location, and a snapshot of the best match, and can be orders of magnitude smaller in size than the raw video stream.
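The size reduction from pre-processing can be illustrated with a much simpler stand-in than facial recognition: summarizing a batch of numeric sensor readings before transmission. The function `summarize`, the choice of summary fields, and the synthetic readings are all assumptions made for this sketch.

```python
import json
import statistics

def summarize(readings):
    """Reduce a batch of raw readings to a small composite data item."""
    return {
        "count": len(readings),
        "min": min(readings),
        "max": max(readings),
        "mean": round(statistics.fmean(readings), 2),
    }

# Synthetic stand-in for raw sensor data: 1,000 temperature samples.
raw = [20.0 + (i % 10) * 0.1 for i in range(1000)]

raw_bytes = len(json.dumps(raw).encode("utf-8"))
summary_bytes = len(json.dumps(summarize(raw)).encode("utf-8"))
print(f"raw: {raw_bytes} bytes, summary: {summary_bytes} bytes")
```

The tradeoff is that the receiver gets only what the summary preserves, so the choice of fields should follow from what the consumers of the data need.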

Encoding

The choice of data encoding can make a substantial difference for sending data effectively across a limited-bandwidth network. Encoding approaches have changed drastically over the past several decades. Fixed-format binary (FFB) or bit/byte encoding of messages is a key part of tactical systems in the defense world. While FFB can promote near-optimal bandwidth efficiency, it also is brittle to change, hard to implement, and hard to use for enabling heterogeneous systems to communicate because of the different technical standards affecting the encoding.

Over the years, text-based encoding formats, such as XML and more recently JSON, have been adopted to enable interoperability between disparate systems. The bandwidth cost of text-based messages is high, however, and thus more modern approaches have been developed including variable-format binary (VFB) encodings, such as Google Protocol Buffers and EXI. These approaches leverage the size advantages of fixed-format binary encoding but allow for variable message payloads based on a common specification. While these encoding approaches are not as universal as text-based encodings, such as XML and JSON, support is growing across the commercial and tactical application space.
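The size difference between text-based and fixed-format binary encoding is easy to see with Python's standard library. The message fields and the byte layout below are hypothetical and chosen only for illustration; the sketch does not use Protocol Buffers or EXI, which require additional tooling.

```python
import json
import struct

# Hypothetical position report: node id, Unix time, latitude, longitude.
report = {"node_id": 17, "timestamp": 1700000000, "lat": 40.4443, "lon": -79.9436}

# Text-based encoding: self-describing, but field names travel with every message.
text_encoded = json.dumps(report).encode("utf-8")

# Fixed-format binary: network byte order, uint16, uint32, two float64 values.
# Both sides must agree on this layout in advance, which is what makes it brittle.
LAYOUT = struct.Struct("!HIdd")
binary_encoded = LAYOUT.pack(
    report["node_id"], report["timestamp"], report["lat"], report["lon"]
)

node_id, timestamp, lat, lon = LAYOUT.unpack(binary_encoded)
print(f"JSON: {len(text_encoded)} bytes, fixed binary: {len(binary_encoded)} bytes")
```

The binary form is always 22 bytes for this layout, while the JSON form is several times larger; adding or reordering a field, however, breaks every receiver that still expects the old layout.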

The Future of Edge Networking

One of the perpetual questions about edge networking is, When will it no longer be an issue? Many technologists point to the rise of mobile devices, 4G/5G/6G networks and beyond, satellite-based networks such as Starlink, and the cloud as evidence that if we just wait long enough, every environment will become connected, reliable, and bandwidth rich. The counterargument is that as we improve technology, we also continue to find new frontiers for that technology. The humanitarian edge environments of today may be found on the Moon or Mars in 20 years; the tactical environments may be contested by the U.S. Space Force. Moreover, as communication technologies improve, counter-communication technologies necessarily will do so as well. The prevalence of anti-GPS technologies and associated incidents demonstrates this clearly, and the future can be expected to hold new challenges.

Areas of particular interest we are exploring soon include

electronic countermeasure and electronic counter-countermeasure technologies and techniques to address a current and future environment of peer–competitor conflict
optimized protocols for different network profiles to enable a more heterogeneous network environment, where devices have different platform capabilities and come from different agencies and organizations
lightweight orchestration tools for data distribution to reduce the computational and bandwidth burden of data distribution in DIL-network environments, increasing the bandwidth available for operations

If you are facing some of the challenges discussed in this blog post or are interested in working on some of the future challenges, please contact us at info@sei.cmu.edu.

Software Engineering Institute

SEI Blog