
10 Lessons in Security Operations and Incident Management

Robin M. Ruefle

Incident response is a critical need throughout government and industry as cyber threat actors look to compromise critical assets within organizations with cascading, often catastrophic, effects. In 2021, for example, a hacker allegedly accessed a Florida water treatment plant’s computer systems and attempted to poison the water supply. Within the U.S. critical national infrastructure, 77 percent of organizations have seen a rise in insider-driven cyber threats over the last three years. The 2023 IBM Cost of a Data Breach report highlights the crucial role of having a well-tested incident response plan. Companies without a tested plan in place will face 82 percent higher costs in the event of a cyber attack, compared to those that have implemented and tested such a plan.

Researchers in the SEI CERT Division compiled 10 lessons learned from our more than 35 years of developing and working with incident response and security teams around the globe. These lessons are relevant to incident response teams contending with an ever-evolving cyber threat landscape. In honor of the CERT Division (also referred to as the CERT Coordination Center in our work with the Forum of Incident Response and Security Teams) celebrating 35 years of operation, this blog post looks back at some of the lessons learned from our Computer Security Incident Response Team (CSIRT) capacity-building experiences that also apply to other areas of security operations.

Foundations of Our Work

The CERT Division has helped develop incident management and security operations capability in other organizations almost since its inception in 1988. In fact, the original CERT Coordination Center (CERT/CC) emerged from a postmortem review of the response to the Morris Worm in 1988. During the postmortem, conducted by the Defense Advanced Research Projects Agency (DARPA), analysts determined that organizations needed better coordination and communications related to computer incident analysis and response. As stated in the SEI publication State of the Practice of Computer Security Incident Response Teams (CSIRTs):

In recognition of this problem, DARPA announced its intention to fund the development of a coordination center for Internet security incidents. DARPA chose the Software Engineering Institute as the new center’s home and charged the SEI with establishing a capability to quickly and effectively coordinate communication among experts during security emergencies in order to prevent future incidents. The new center was also charged with building awareness of security issues across the Internet community.

This new center, the CERT/CC, recognized that one organization could not provide this function; each organization instead needed its own team that understood its mission, assets, threats, and operations. From its beginnings, the CERT/CC worked to help other teams stand up and coordinate efforts for joint information sharing, such as the Forum of Incident Response and Security Teams (FIRST). The SEI formalized this work in 1996 with the establishment of the CSIRT Development Team (later the CSIRT Development and Training Team and the Security Operations Team) within the CERT/CC. This team developed the first training courses for CSIRT managers and analysts and the first publications for CSIRTs (including the CSIRT handbook). Once many CSIRTs were reaching full operational capability, they wanted to know how they were doing. CERT developed methods for evaluating whether they were meeting their missions or implementing the right components.

For many years, the CERT Division has helped organizations build capability through training, published guidance, and on-site support. During that time, we learned many lessons about CSIRT development and sustainment that are also applicable to security operation centers (SOCs). The following sections discuss the lessons we learned over the past three-plus decades.

1. Organizations Must Be Flexible

Every organization is different, and although many of our trainees wanted us to tell them the “one right way” to build a CSIRT, we emphasize that many variables affect structure, services, and daily operations. Flexibility is therefore required, along with an understanding of the parent organization’s mission and processes. Organizations must also identify the location of critical assets, what data they contain, what risks and threats target them, the impact on the organization of compromise or damage to these assets, and any constraints on mitigation that might be in place. Likewise, knowledge of industry, legal, and privacy compliance requirements is a must.

2. No One Organizational Structure Fits All CSIRTs

Some CSIRTs perform multiple activities, such as incident handling, vulnerability analysis, malware analysis, and media analysis (forensics), within their parent organization or constituency. In other situations, these tasks are performed by separate organizational units that have to work together. Those units need to determine how to share data and identify who performs what role. We see the same thing in SOC organizational structures: Different organizations have different SOC missions and makeup. Some focus on just monitoring and detection activities, while others also perform incident response and information-sharing functions.

3. CSIRTs or Incident Response Teams Do Not Operate Alone or in a Vacuum

Teams must be integrated into the organization and identify other components of the organization that play a part in incident management, such as IT, firewall teams, vulnerability management, patch management, risk management, insider risk teams, breach response teams, privacy, legal, human resources, and even training and media relations components. These teams must identify all the components they need to interact with; define the interactions, including inputs, outputs, mechanisms, triggers, time frames, and POCs; and institutionalize these into standard operating procedures.
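
One way to institutionalize such an interaction is to record it as structured data rather than free text, so that missing elements are easy to spot. The following is a minimal sketch in Python, assuming a hypothetical `Interface` record whose fields mirror the elements listed above. The partner name, trigger wording, and contact address are illustrative only and are not taken from any real procedure.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Interface:
    """One documented interaction between the incident response team and a partner."""
    partner: str      # organizational component the team works with
    trigger: str      # event that starts the interaction
    inputs: str       # what the team receives from the partner
    outputs: str      # what the team hands over to the partner
    mechanism: str    # how the exchange happens
    time_frame: str   # expected turnaround
    poc: str          # point of contact


def missing_elements(interface: Interface) -> list[str]:
    """Return the names of fields left blank, so gaps surface before an incident."""
    return [f.name for f in fields(interface) if not getattr(interface, f.name).strip()]


# Hypothetical example entry; every value here is illustrative.
patch_mgmt = Interface(
    partner="Patch management",
    trigger="Confirmed exploitation of an unpatched vulnerability",
    inputs="Patch status for affected hosts",
    outputs="List of affected hosts and recommended priority",
    mechanism="Ticket in the shared tracking system",
    time_frame="",  # not yet agreed with the partner team
    poc="patching-lead@example.org",
)

print(missing_elements(patch_mgmt))  # ['time_frame']
```

A check like this could run whenever the standard operating procedures are updated, flagging interactions that are only partly defined.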

4. Some Practices Must Be Considered Universally

One such practice is the documentation and institutionalization of processes and procedures to ensure operational resilience when staff members move on to other roles. All organizations must also have a knowledge management process and mechanisms to capture and retrieve information learned from handling incidents or gathered through situational awareness activities. Other universal practices include defining staff roles and responsibilities; clearly aligning competencies, knowledge, skills, and abilities (KSAs); and establishing career path progressions.

5. Identifying Critical Assets Is the Starting Point to Building Processes and Services

CSIRTs must understand what they are protecting and what is critical. We observed that if priorities aren’t identified, then team members treat everything as a priority. This mindset overwhelms the team with work and prevents it from successfully fulfilling its mission.
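
To illustrate why identified priorities matter, here is a minimal sketch in Python that orders an incident queue by the criticality of the affected asset. The asset names, the 1-to-5 criticality scale, and the default rating for unknown assets are all assumptions made for this example.

```python
# Hypothetical criticality ratings (5 = most critical); a real team would
# derive these from its own asset inventory and risk assessments.
ASSET_CRITICALITY = {
    "payroll-db": 5,
    "public-website": 3,
    "test-server": 1,
}

DEFAULT_CRITICALITY = 2  # assumption: unknown assets rank low but above test systems


def prioritize(incidents: list[dict]) -> list[dict]:
    """Sort incidents so those affecting the most critical assets come first.

    sorted() is stable, so incidents with equal criticality keep their
    original (arrival) order.
    """
    return sorted(
        incidents,
        key=lambda i: ASSET_CRITICALITY.get(i["asset"], DEFAULT_CRITICALITY),
        reverse=True,
    )


queue = [
    {"id": 1, "asset": "test-server"},
    {"id": 2, "asset": "payroll-db"},
    {"id": 3, "asset": "unknown-laptop"},
    {"id": 4, "asset": "public-website"},
]

print([i["id"] for i in prioritize(queue)])  # [2, 4, 3, 1]
```

Without the ratings, every entry in the queue would carry the same weight, which is the situation this lesson warns against.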

6. Functions and Services Are More Important than Names and Labels

We observed that some organizations didn’t call their entity a CSIRT and, as security needs grew, structures such as SOCs and network operations centers (NOCs) evolved, all of which played a role in incident management. Your entity’s name is not important. If you perform any of the following activities (monitoring, detection, triage, analysis, or response), then you are a target audience for our work. Over time, we began to refer to these structures as an incident management capability rather than a CSIRT. The FIRST CSIRT Development Framework Special Interest Group (SIG) created a document, the CSIRT Services Framework, to outline potential services that could be offered by CSIRTs or SOCs. Note that teams should select the key services to provide, not provide them all. We also recognized that some entities were specific types of teams that required the CSIRT title, such as National CSIRTs or Product Security Incident Response Teams (PSIRTs). National CSIRTs coordinate and facilitate the handling of incidents for a particular country or economy. They usually have a broader scope and a more diverse constituency. PSIRTs handle analysis of vulnerabilities within the products that their parent organizations produce and provide. The FIRST CSIRT Development Framework Special Interest Group (SIG) has a draft document out for review that defines four types of incident management capabilities.

7. A Successful CSIRT Needs More than Good Technology and Tools

CSIRTs or incident management capabilities are customer-service oriented and must continue to communicate with stakeholders and collaborators and develop trusted relationships. A CSIRT needs staff with critical analysis and problem-solving skills who can think outside of the box and adapt to new and unexpected situations in a calm and thoughtful manner. Along with their technical skills, staff also need effective communication skills. Skill development should be supported by a high-level training program, with appropriate governance, that provides ample opportunity for the continuous learning and professional development needed to keep up with the dynamic nature of the domain.

8. CSIRTs Must Have a Set of Clearly Defined Services

The level of service provided by the CSIRT will impact the corresponding infrastructure and organizational support needed to perform that service. For example, will incident responders go on site to help investigate or resolve the incident, or will they only provide remote assistance via phone or email? The level of service will also inform the types of engagement with constituents and stakeholders and the types of skills needed to provide the services. Those receiving services from a CSIRT or SOC need to know what services can be provided and also what is not provided. Codifying this information helps set expectations and establish the needed communication interfaces and information dissemination tasks.
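
As a rough illustration of codifying both what is and what is not offered, the sketch below keeps a small service catalog in Python. The service names, delivery modes, and exclusions are hypothetical and stand in for whatever a given team actually decides to provide.

```python
# Hypothetical service catalog; service names and delivery modes are illustrative.
SERVICE_CATALOG = {
    "incident analysis": "remote (phone or email)",
    "incident response support": "remote (phone or email)",
    "vulnerability analysis": "remote (phone or email)",
}

# Stating exclusions explicitly is as useful to constituents as listing services.
NOT_PROVIDED = {"on-site response", "media analysis (forensics)"}


def describe(service: str) -> str:
    """Tell a constituent whether a service is offered and how it is delivered."""
    key = service.lower()
    if key in SERVICE_CATALOG:
        return f"{service}: provided, {SERVICE_CATALOG[key]}"
    if key in NOT_PROVIDED:
        return f"{service}: not provided"
    return f"{service}: not listed; contact the team to clarify"


print(describe("Incident analysis"))
print(describe("On-site response"))
```

Keeping the "not provided" list alongside the catalog means a constituent gets a definite answer instead of discovering the gap during an incident.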

9. CSIRTs Must Be Proactive

In the beginning, we observed many CSIRTs focused on being reactive, but over the years they became more proactive. They manifested this growth by taking on tasks such as vulnerability scanning, security assessments, and active research aimed at uncovering malicious or anomalous activity and new threats. Today, proactive approaches have evolved to include activities like threat hunting, situational awareness, security awareness training, and integration with cyber intelligence.

10. Incident Management Capabilities Can Provide Situational Awareness to the Rest of the Organization

CSIRTs or SOCs within an organization should be part of any change management board, configuration management activities, or technical review boards to alert the organization to possible security threats as infrastructure changes or process changes are planned and implemented. They can also provide information about threats and risks to risk management groups. In return, they can use the information they receive about risk impacts for critical assets to prioritize analysis and response tasks. This information can also be used to keep teams up to date with infrastructure changes in the organization that may have security implications.

Applying CSIRT Lessons Learned to Security Operations

Our work in CSIRT capacity building has expanded to support security operations in general. The lessons we learned over the past three-plus decades provided the foundation to expand support and guidance to the broader organizational context of security operations. Incident management is a key element of security operations, and security operations are foundational to operational risk management. All these components must be aligned and work together for effective cyber defense.

Our work in incident management capability development aligns with security operations, so we did not have to develop our capacity-building work from scratch. The security operations work can use all the basic processes, methods, and lessons learned from incident management and CSIRT development and add more focused security operations processes and methods where needed.

The lessons we learned through our CSIRT development, and later through incident management capability development, are applicable to security operations. Our incident management evaluation instruments can easily assess various types of incident management and security operations capabilities. We have used the same instruments to evaluate a variety of organizational entities, including incident response teams, SOCs, and network security operation centers (NSOCs), across government, industry, and academic institutions.

Common Problems and Trends

As we used our incident management capability evaluations to assess operational teams, we saw common problem areas and trends. Surprisingly, the top problems and gaps are not technical in nature but, rather, normal organizational problems. The biggest problem is a lack of communication from management to staff, from the incident management capability to the rest of the organization, and among groups who play a role in incident management activities. Other problems include

  • lack of policies and procedures
  • lack of staff training
  • lack of management support and governance
  • duplicate or redundant functions
  • lack of a defined mission and corresponding roles and responsibilities

As you can see, these problems overlap with a lot of the same concepts covered in our lessons learned. As the broader area of security operations grows, organizations within this domain will be vulnerable to these same issues and can use our lessons to help plan their strategy for development and avoid many such problems.
