
The SPRUCE Series: Challenges to Managing Operational Resilience

SPRUCE Project

Software and acquisition professionals often have questions about recommended practices related to modern software development methods, techniques, and tools, such as how to apply agile methods in government acquisition frameworks, systematic verification and validation of safety-critical systems, and operational risk management. In the Department of Defense (DoD), these techniques are just a few of the options available to face the myriad challenges in producing large, secure software-reliant systems on schedule and within budget.

In an effort to offer our assessment of recommended techniques in these areas, SEI built upon an existing collaborative online environment known as SPRUCE (Systems and Software Producibility Collaboration Environment), hosted on the Cyber Security & Information Systems Information Analysis Center (CSIAC) website. From June 2013 to June 2014, the SEI assembled guidance on a variety of topics based on relevance, maturity of the practices described, and the timeliness with respect to current events. For example, shortly after the Target security breach of late 2013, we selected Managing Operational Resilience as a topic.

Ultimately, SEI curated recommended practices on five software topics: Agile at Scale, Safety-Critical Systems, Monitoring Software-Intensive System Acquisition Programs, Managing Intellectual Property in the Acquisition of Software-Intensive Systems, and Managing Operational Resilience. In addition to a recently published paper on SEI efforts and individual posts on the SPRUCE site, these recommended practices will be published in a series of posts on the SEI blog. The following post, Managing Operational Resilience by Julia H. Allen, Pamela Curtis, and Nader Mehravari, presents challenges for managing operational resilience. The second post in this series will present recommended practices for helping organizations manage operational resilience.

Managing Operational Resilience - SPRUCE/SEI
https://www.csiac.org/reference-doc/managing-operational-resilience/

A search at your favorite news aggregator for keywords such as "malware," "computer virus," or "data breach" will return tens of thousands of results. For most organizations it's not a question of if a cyber attack will occur, but when. When an attack happens, the tempo of response must be fast, so an organization must already have practices in place covering how to respond. These practices should reflect a strategic approach that balances actions that protect assets--such as customer data and intellectual property--with actions that sustain services and operations.

A recommended approach to address both protection and sustainment is the application of resilience management practices. Operational resilience is the ability of an entity to prevent disruptions to its mission from occurring, continue to meet its mission if a disruption or incident does occur, and return to normalcy when the disruption is eliminated. The concept of operational resilience applies to entities such as organizations, systems, networks, supply chains, critical infrastructure, cyberspace, Armed Forces, and even nations.

Operational resilience management includes all the practices of planning, integrating, executing, and governing activities to ensure that an entity can

  • identify and mitigate operational risks that could lead to service disruptions before they occur
  • prepare for and respond to disruptive events (realized risks) in a manner that demonstrates command and control of incident response and service continuity
  • recover and restore mission-critical services and operations following an incident within acceptable time frames

Operational resilience management draws from several complex and evolving disciplines, including risk management, business continuity, disaster recovery, information security, incident and emergency management, information technology (IT), service delivery, workforce management, and supply-chain management, each with its own terminology, principles, and solutions. The practices described here reflect the convergence of these distinct, often siloed disciplines. As resilience management becomes an increasingly relevant and critical attribute of their missions, organizations should strive for a deeper coordination and integration of its constituent activities.

Our discussion of operational resilience management has three parts. First, in this post, we set the context by answering the question "Why is operational resilience management challenging?" Second, the next post in this series will present a set of recommended practices for operational resilience management. Third, our original SPRUCE post concludes with an extensive list of selected resources to help you learn more about operational resilience management; we have also added links to various sources to help amplify some points.

Every organization is different; judgment is required to implement these practices in a way that benefits your organization. In particular, be mindful of your mission, goals, existing processes, and culture. All practices have limitations. Some of these practices will be more relevant to your situation than others, and their applicability will depend on the context in which you apply them. To gain the most benefit, you need to evaluate each practice for its appropriateness and decide how to adapt it, striving for an implementation in which the practices meet your business objectives. Also, consider additional collections of recommended practices, including those among the various sources at the bottom of the webpage. Monitor your adoption and use of these practices, and adjust as appropriate.

These practices are certainly not complete--they are a work in progress.

Why is Managing Operational Resilience Challenging?

Over the past 10 years, organizations have invested a tremendous amount of resources in cybersecurity. Nevertheless, regardless of how much has been spent on protection, cyber attackers continue to penetrate systems. We have reached a point in the battle for information and cybersecurity where security investment should shift from a narrow focus on avoiding cyber attacks to a more balanced focus on both avoiding attacks and recovering from them.

Operational resilience management has two sides--protect and sustain--and both are equally important. An organization must learn about the threat environment, maintain situational awareness of the context in which it operates, and create a risk-management plan that is as thorough and reliable as possible. But when an attack occurs, can the organization sustain its critical services and operations? Can it adequately recover its systems and get them back online as quickly as possible? Can it restore and recover service within a prescribed recovery time and according to its recovery-point objectives? An organization must ask, where can we not afford to have something bad happen, and where can we afford to have something bad happen and bounce back as quickly as we can? The need for organizations to achieve a balance between protect and sustain is why operational resilience management is so important.
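As one way to make the recovery questions above concrete, the following minimal Python sketch checks a single incident against a recovery time objective (RTO) and a recovery point objective (RPO). It is an illustration only, not part of the SPRUCE guidance: the class names, the function name, the timestamps, and the 4-hour and 1-hour thresholds are all hypothetical, and the sketch assumes that downtime is measured from disruption to restoration and that the data-loss window is measured from the last recoverable backup to the disruption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryObjectives:
    """Hypothetical targets an organization might set for one service."""
    rto: timedelta  # recovery time objective: maximum tolerable downtime
    rpo: timedelta  # recovery point objective: maximum tolerable data-loss window


@dataclass
class Incident:
    """Hypothetical record of one disruption to that service."""
    disrupted_at: datetime      # when the service went down
    restored_at: datetime       # when the service was back online
    last_good_backup: datetime  # latest recoverable data point before the disruption


def meets_objectives(incident: Incident, objectives: RecoveryObjectives) -> dict:
    """Compare actual downtime and data-loss window with the stated objectives."""
    downtime = incident.restored_at - incident.disrupted_at
    data_loss_window = incident.disrupted_at - incident.last_good_backup
    return {
        "rto_met": downtime <= objectives.rto,
        "rpo_met": data_loss_window <= objectives.rpo,
    }


# Illustrative values only: a 4-hour RTO and a 1-hour RPO.
objectives = RecoveryObjectives(rto=timedelta(hours=4), rpo=timedelta(hours=1))
incident = Incident(
    disrupted_at=datetime(2014, 1, 10, 9, 0),
    restored_at=datetime(2014, 1, 10, 12, 30),
    last_good_backup=datetime(2014, 1, 10, 7, 30),
)
result = meets_objectives(incident, objectives)
print(result)
```

In this made-up case the service is restored after 3.5 hours, within the 4-hour RTO, but the last backup is 1.5 hours old, so the 1-hour RPO is missed. A check like this addresses only the sustain side; it says nothing about how well the organization protects its assets.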

Operational resilience management is challenging for several reasons:

1. Making a long-term commitment: Operational resilience is an emergent property. An emergent property is not something an organization can buy and put in place or assemble by buying its parts. For a property to emerge within an organization, the organization must execute a certain set of activities in a coordinated manner and do so with consistent discipline. Achieving operational resilience requires an organization to make a long-term commitment to perform certain activities with consistency. The activities involved in operational resilience management must become part of the organization's daily habits across the enterprise.

2. Understanding the big picture: To be operationally resilient, organizations must address operational risk on many dimensions simultaneously, including people, technology, information, facilities, supply-chain, management, cyber, and physical dimensions. This requires careful planning, coordination, and training across many interdependent domains, as well as understanding how the organization's capabilities along these dimensions contribute to mission success.

3. Overcoming organizational hurdles: An organization may encounter a number of barriers to operational resilience management, including

  • the vague and abstract nature of operational risk management
  • compartmentalization of operational risk-management activities, such as segmenting responsibilities for information security and business continuity/disaster recovery
  • focusing on technology instead of on all the dimensions listed in Challenge 2
  • the proliferation of practices for operational resilience management
  • insufficient funding and staff
  • insufficient success stories and measurements
  • (over)reliance on people
  • regulatory climate
  • existing policies
  • the tendency to ignore current information to avoid a painful reality and the need to act
  • competitive pressures or short-term goals


Looking Ahead

Technology transition is a key part of the SEI's mission and a guiding principle in our role as a federally funded research and development center. The next post in this series will explore recommended practices for managing operational resilience in organizations as well as strategies for deriving more benefits from those recommended practices.

We welcome your comments and suggestions on this series.

Additional Resources

For comprehensive information about CERT's research on operational resilience management, please see www.cert.org/resilience.

For more information about frameworks and maturity models, please see Buyer Beware: How to be a Better Consumer of Security Maturity Models presented by Julia Allen and Nader Mehravari at the February 2014 RSA Conference.

Richard A. Caralli, Julia H. Allen, and David W. White also wrote the book CERT Resilience Management Model (CERT-RMM): A Maturity Model for Managing Operational Resilience, published by Addison-Wesley Professional in 2011.
