
In our previous post, we covered the difference between alerts and incidents and how each is created. Now, let’s look at how the processes for resolving alerts and incidents compare.
Both alerts and incidents begin with diagnosis and remediation. Incidents add broader communication and a more thorough review after the incident is resolved. Beyond the work done resolving alerts, incidents bring three significant process additions: incident roles, stakeholder communications, and post mortems.
Alert & Incident Diagnosis & Remediation
After acknowledging your on-call notification and opening a new alert or incident, the first step is to understand what is going on in the environment, which means establishing the total blast radius of the issue (the users and services impacted). From there, or to help identify impact, dig into the technical causes, generally by reviewing the logs, metrics, traces, and recent pipeline changes that may have affected your app or infrastructure. A good place to start is the resource and threshold whose breach triggered the first alert. If the alert was sent directly into ServiceNow, it usually includes an alert URL back to the source monitoring or observability system so you can continue these diagnostics there.
You may want to do this alongside your team or on-call members of other teams that your service depends on, such as a networking, web, payments, or database group. Defining a service dependency map to automatically identify potential downstream services will aid this process.
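As a rough illustration of what such a map enables, the sketch below walks a small dependency graph breadth-first to list every service that might be worth checking. The service names and the `DEPENDS_ON` structure are hypothetical and are not taken from any ServiceNow API; a real map would come from your own service catalog.

```python
from collections import deque

# Hypothetical dependency map: each service lists the services it depends on.
DEPENDS_ON = {
    "checkout": ["payments", "web"],
    "payments": ["database", "network"],
    "web": ["network"],
    "database": ["network"],
    "network": [],
}


def related_services(service, depends_on):
    """Return every service reachable from `service`, nearest first, excluding itself."""
    seen = {service}
    order = []
    queue = deque([service])
    while queue:
        current = queue.popleft()
        for dep in depends_on.get(current, []):
            if dep not in seen:
                seen.add(dep)
                order.append(dep)
                queue.append(dep)
    return order


print(related_services("checkout", DEPENDS_ON))
```

Listing the nearest dependencies first gives responders a reasonable order in which to check each team's service, though the right order in practice depends on what the alert data already suggests.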
During both diagnosis and remediation, document the key insights you and the team find in your timeline, using technical work notes that summarize the findings. Once you have identified what the problem could be, it’s time to start applying immediate solutions to restore service, as well as thinking through longer-term fixes.
Remediation generally requires a good amount of collaboration as team members discuss short-term solutions and the trade-offs between different directions. While alerts generally don’t require much cross-functional effort, incidents can be widespread, impacting multiple teams and services, and choosing the best path may be a business decision. For example, you could immediately roll out a restoration from a week-old backup and lose a week of data, or alternatively spend 1-2 weeks manually fixing individual customers with no data loss.
Incident-Specific Process Considerations
The first difference is in the scope of the collaboration and response. Whereas alerts are less severe and generally worked by a single responder, incidents generally involve multiple responders and teams coordinating to restore service and deploy a fix. To collaborate effectively, real-time channels such as virtual meetings in Zoom or live chats in Slack are common, and incident response systems routinely help create these bridges automatically. Add only the additional responders or supporting teams that you believe are critical to resolution, so that you aren’t over-notifying other technical users, potentially during their off hours. Because many individuals may be involved in resolving an incident, you generally want to assign specific response roles: an incident commander in charge of driving the incident, a communication lead sending customer updates, a scribe working on the post mortem, and the various technical SME contributors.
The two other major differences that you will run into with incidents, but not alerts, are the stakeholder communications and post mortems mentioned above.
Stakeholder communications come into play with incidents because incidents are validated degradations impacting external users. You want to keep the affected users informed that your team is aware of the issue and actively resolving it. It is important to send regular updates to the stakeholders, even if nothing has changed, informing them of key activities related to the restoration of their service. Communications are commonly sent when the issue is opened, when service is restored (even if partially), and when the incident is resolved. These communications are routinely sent via email directly to the affected users, but a more modern and increasingly common approach is to use an externally facing status page to provide the stakeholder updates. The last communication is generally related to a post mortem explaining what happened.
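The update cadence described above can be sketched as a small rule: always send on the three milestones, and otherwise send once a regular interval has elapsed. The event names, the `should_send_update` function, and the 30-minute default are all assumptions made for illustration; they are not part of any product API, and your own cadence may differ.

```python
from datetime import datetime, timedelta

# Hypothetical milestone names matching the three moments mentioned in the text.
MILESTONES = {"opened", "service_restored", "resolved"}


def should_send_update(event, last_sent, now, interval=timedelta(minutes=30)):
    """Decide whether a stakeholder update is due.

    Milestone events always trigger an update. For anything else, an update
    is due if none has been sent yet or the interval (an assumed default)
    has elapsed, so stakeholders hear from you even when nothing has changed.
    """
    if event in MILESTONES:
        return True
    return last_sent is None or now - last_sent >= interval


opened_at = datetime(2024, 1, 1, 12, 0)
print(should_send_update("investigating", opened_at, opened_at + timedelta(minutes=45)))
```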
Post mortems are your best investment in future resilience. They will help you structure what happened, why it happened, and the action items you’re going to take to prevent the issue from happening again. It’s common to spin up a meeting after the incident is resolved to gather this information in a post mortem, writing up an executive summary, timeline, detailed summary, and action items on how the issue will be prevented.
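To make the four sections concrete, here is a minimal sketch of a post mortem record that renders to plain text. The `Postmortem` class and its field names are hypothetical, chosen only to mirror the executive summary, timeline, detailed summary, and action items listed above.

```python
from dataclasses import dataclass, field


@dataclass
class Postmortem:
    """Hypothetical container mirroring the four sections named in the text."""
    title: str
    executive_summary: str
    timeline: list = field(default_factory=list)      # (timestamp, note) pairs
    detailed_summary: str = ""
    action_items: list = field(default_factory=list)  # preventive follow-ups

    def render(self):
        """Return the post mortem as a plain-text document."""
        lines = [self.title, "", "Executive summary", self.executive_summary]
        lines += ["", "Timeline"]
        lines += [f"{ts}  {note}" for ts, note in self.timeline]
        lines += ["", "Detailed summary", self.detailed_summary]
        lines += ["", "Action items"]
        lines += [f"- {item}" for item in self.action_items]
        return "\n".join(lines)


# Example content is invented purely to show the shape of the output.
pm = Postmortem(
    title="Checkout degradation",
    executive_summary="Checkout requests failed for some users.",
    timeline=[("12:00", "First alert fired"), ("12:40", "Service restored")],
    detailed_summary="A recent pipeline change exhausted database connections.",
    action_items=["Add a connection pool alert"],
)
print(pm.render())
```

Filling in the timeline from the work notes captured during diagnosis and remediation keeps the scribe from having to reconstruct events from memory.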
With Site Reliability Operations, we can combine capabilities from ITOM & ITSM to provide you with all the workflows and data needed to collaborate, diagnose, remediate, and communicate from the start of an alert to the end of an incident post mortem.