It’s an ordinary day, just like any other work day.
Until it’s not.
Something happens. Something unexpected: your website crashes, your online store’s shopping basket stops responding, or your server gives up the ghost. Whatever it is, it’s a major event that needs immediate attention, and incident responders need the right tools and information at the right time to fix the issue as quickly as possible, writes Steve Barrett, Head of EMEA, PagerDuty.
Anything less has the potential to compromise your organisation’s brand reputation and negatively impact the bottom line.
Peacetime to Wartime
When a major incident happens (and in today’s complex digital world, incidents will happen), responders need to significantly pivot from their normal everyday work: Think of it as moving from Peacetime operations to Wartime operations.
Peacetime is when things are moving along as expected. In other words, it’s business as usual.
During Wartime, however, business is anything but usual. Communications change. Timeframes compress. Roles and hierarchies shift.
Companies that have proactively planned for this shift are well positioned to move through Wartime pretty much unscathed. Those that haven’t leave the business’s welfare to chance.
Taking the best-practice elements from emergency response teams around the world – including the UK’s Gold-Silver-Bronze command structure – can help ensure organisations are in the best position to respond to and rectify any incident.
Peacetime is the time to determine who will comprise the incident response team and the roles they will play. For an effective Wartime response, it’s critical that every person on the team has an in-depth understanding of these roles, which include the following:
Incident Commander: The most important role on the team is Incident Commander (IC)—the person who gives direction to the team to resolve incidents. Any trained IC on the on-call schedule may be tapped to lead the process during a major incident. ICs-in-training are typically on a “shadow” schedule. (It’s important to note the IC is not the CEO. The CEO must focus on running the business, regardless of what’s going on.)
The IC does not perform any remediation during a response, but instead acts as lead and makes decisions. The IC’s responsibilities include:
- Preparing for major incidents by setting up communications channels, funneling people to those channels when there is a major incident, and training team members and other Incident Commanders.
- Acting as the single authority in incident response by driving major incidents to resolution. The IC ensures everyone is on the same communications channel, gathers incident status information, collects incident resolution proposals, and delegates resolution actions.
- Managing the postmortem process by scheduling a meeting immediately after an incident so people can share their thoughts on the process, identify the cause(s) that led to the issue, and take steps to prevent the problem from recurring.
Deputy: During a major event, the Deputy provides direct support to the IC. The Deputy role has specific tasks; the Deputy is not an IC-in-training.
The Deputy’s responsibilities include:
- Bringing up issues to the IC that might not otherwise be addressed.
- Acting as the Incident Commander should the IC have to step away from the role.
- Managing incident communications and being prepared to remove people from an incident response call if instructed by the IC.
Scribe: The Scribe documents the incident and captures all important decisions and data for later review. The Deputy may also act as the Scribe.
Responsibilities of the Scribe include:
- Ensuring that the incident response call is recorded.
- Taking note of important data, events, and actions as they happen (within a communications channel such as Slack).
Subject Matter Expert: A Subject Matter Expert is an authority on a particular service, product or process.
The Subject Matter Expert is expected to:
- Diagnose common problems.
- Rapidly resolve issues within a relevant affected area.
- Provide concise communications on an affected area’s condition and actions that need to be taken to resolve an issue, as well as provide required support.
Organisations may also have a Customer Liaison (someone responsible for interacting with customers) and/or an internal liaison (a person responsible for interacting with internal stakeholders).
The Peacetime to Wartime shift also requires a change in communications. During Peacetime, there is a relatively high tolerance for back-and-forth discussion and constructive argument. During Wartime, however, every second counts, and communication must be concise and widely comprehensible.
Team members must speak the same language—literally and figuratively. This is especially true when it comes to the incident response call. It’s critical to communicate calmly, clearly, and explicitly to ensure effective response.
For example, it’s important to limit (if not altogether avoid) the use of acronyms: Too many can confuse newcomers and will add cognitive overhead in general. So while it may save you a few seconds to use an acronym instead of the full term, clarity is more important than speed in this case.
Incident Response Team in Action
Every situation is different and will call for specific actions, but in general, the team—led by the IC—must assess the problem by gathering information to determine the scope and impact of the incident, then collect proposed repair actions and their associated risks.
When the time comes, it’s ultimately up to the IC to make a decision—and it must be made quickly even without buy-in from everyone on the team. With that said, the IC can put forth a few options and give team members an opportunity to register strong opposition. (Using the adjective “strong” will dissuade team members from wasting time with random thoughts, and is an example of the importance of precise and careful language throughout the incident response process.)
Postmortem: Learning from Mistakes
Once the incident has been resolved, a postmortem helps the team figure out what happened, what went wrong and what went right.
However, it’s important to proceed carefully as the nature of a postmortem makes it ripe for multiple rounds of the blame game, which negatively impacts everyone. Instead, the postmortem process should be future-oriented, with a goal of instilling a culture of learning and identifying opportunities for improvement that otherwise would be lost.
To make this possible, postmortems must take place in an environment in which teams can be completely honest without fear of negative repercussions. With this kind of focus and an eye toward making the process itself as streamlined as possible, team members (and the company) will get the most out of the time invested in the postmortem.
The postmortem report should include:
- A high-level summary of what happened
- Root-cause analysis
- Steps taken to diagnose, assess and resolve
- A timeline of significant activity
- Learnings and next steps
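For teams that keep their postmortems in a tool rather than a document, the five sections above could be captured in a simple record. What follows is only a minimal sketch in Python, not a prescribed format: the class names, field names and the `missing_sections` helper are all hypothetical, and the sample incident and timestamps are invented purely for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class TimelineEntry:
    """One significant event, as the Scribe might have recorded it."""
    at: datetime
    note: str


@dataclass
class PostmortemReport:
    """Hypothetical record with one field per report section listed above."""
    summary: str = ""
    root_cause_analysis: str = ""
    response_steps: List[str] = field(default_factory=list)
    timeline: List[TimelineEntry] = field(default_factory=list)
    learnings_and_next_steps: List[str] = field(default_factory=list)

    def missing_sections(self) -> List[str]:
        """Return the names of sections that are still empty."""
        sections = {
            "summary": self.summary,
            "root_cause_analysis": self.root_cause_analysis,
            "response_steps": self.response_steps,
            "timeline": self.timeline,
            "learnings_and_next_steps": self.learnings_and_next_steps,
        }
        return [name for name, value in sections.items() if not value]

    def sorted_timeline(self) -> List[TimelineEntry]:
        """Return timeline entries in chronological order."""
        return sorted(self.timeline, key=lambda entry: entry.at)


# Illustrative data only; the incident and times are made up.
report = PostmortemReport(summary="Shopping basket stopped responding.")
report.timeline.append(
    TimelineEntry(datetime(2024, 1, 1, 9, 5), "Incident Commander opened the response call")
)
report.timeline.append(
    TimelineEntry(datetime(2024, 1, 1, 9, 0), "First alert received")
)

print(report.missing_sections())
# ['root_cause_analysis', 'response_steps', 'learnings_and_next_steps']
print([entry.note for entry in report.sorted_timeline()])
```

A check like `missing_sections` is one way to keep the process streamlined: the report is not circulated until every section has been filled in, and the timeline is always presented in the order events occurred, whatever order they were noted in.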
Throughout the incident response process, a shift away from blame and toward working together to limit damage from inevitable events will help establish a positive culture of constant and iterative learning and growth.