Incident Alert Management

AlisonQuattrocc · ‎01-09-2014

At times, things stray far from what they should be, or even from what is normal. Hair is on fire, you are up the creek without a paddle, and so on: a service is down or degraded to such an extent that resources are directed into crisis mode. Most organizations have a predetermined process for handling this as a crisis, major incident, or sev-1 outage.

ITIL defines these types of issues as the highest priority and outlines a specific process for handling them and for how resources should respond. The goal is usually to restore service in the least amount of time possible, at times bypassing previously established procedures in order to bring customers and services back online.

Typically, organizations have a predefined role of incident manager or crisis manager, rotating through a set of trusted individuals who are trained, or whose position in the hierarchy gives them the responsibility, to make decisions and bring resources to bear. Having clearly defined roles and responsibilities is key to a well-run and repeatable process. Example roles include:

Incident Manager: the person running the technical bridge and managing outbound communications to bring services back online
Duty Manager: a senior role that may communicate with the business or executive leadership, and can decide to implement certain changes or activities
Incident Coordinator: a junior role that makes calls, finds resources and handles the more tactical activities
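As a rough illustration of rotating the incident manager role through a set of trusted individuals, here is a minimal sketch. The roster names, the weekly shift length and the `on_call` function are all assumptions made for the example; they are not part of any ServiceNow feature.

```python
from datetime import date
from typing import List


def on_call(roster: List[str], rotation_start: date, day: date,
            shift_days: int = 7) -> str:
    """Return who holds the role on `day` in a simple round-robin rotation.

    Assumes fixed-length shifts (weekly by default) starting at
    `rotation_start`; both values are illustrative.
    """
    if not roster:
        raise ValueError("roster must contain at least one person")
    if day < rotation_start:
        raise ValueError("day is before the rotation started")
    shifts_elapsed = (day - rotation_start).days // shift_days
    return roster[shifts_elapsed % len(roster)]


# Hypothetical roster of trained incident managers.
incident_managers = ["alice", "bob", "carol"]
start = date(2014, 1, 6)
current = on_call(incident_managers, start, date(2014, 1, 9))
```

A real rotation would also need to handle holidays, swaps and escalation when the person on call does not respond.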

Once you have the right people in place, ensuring operational efficiencies during a major incident or crisis ultimately comes down to process, communication, delivery and practice.

Process: Consistent and Repeatable

Having a well-defined and repeatable process is key to ensuring that major incidents are handled properly. Just as important is knowing when not to cry wolf or pull the trigger too soon, potentially sparking widespread panic when an incident isn't an incident! A priority matrix that encompasses dimensions such as impact, urgency, cost, and number of users allows incident managers to determine the scope and impact of an incident quickly and consistently, regardless of their skill set, tribal knowledge or experience.
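A priority matrix can be as simple as a lookup table. The sketch below uses only two of the dimensions mentioned above (impact and urgency) and derives impact from the share of users affected. The three-level scale, the thresholds and the priority numbers are assumptions chosen for illustration; each organization would define its own.

```python
from enum import IntEnum


class Level(IntEnum):
    HIGH = 1
    MEDIUM = 2
    LOW = 3


# Hypothetical matrix: (impact, urgency) -> priority, where 1 is most severe.
PRIORITY_MATRIX = {
    (Level.HIGH, Level.HIGH): 1,
    (Level.HIGH, Level.MEDIUM): 2,
    (Level.HIGH, Level.LOW): 3,
    (Level.MEDIUM, Level.HIGH): 2,
    (Level.MEDIUM, Level.MEDIUM): 3,
    (Level.MEDIUM, Level.LOW): 4,
    (Level.LOW, Level.HIGH): 3,
    (Level.LOW, Level.MEDIUM): 4,
    (Level.LOW, Level.LOW): 5,
}


def impact_from_users(affected_users: int, total_users: int) -> Level:
    """Map the share of affected users to an impact level (assumed thresholds)."""
    share = affected_users / total_users
    if share >= 0.5:
        return Level.HIGH
    if share >= 0.1:
        return Level.MEDIUM
    return Level.LOW


def classify(impact: Level, urgency: Level) -> int:
    """Look up the priority for a given impact and urgency."""
    return PRIORITY_MATRIX[(impact, urgency)]


def is_major_incident(priority: int) -> bool:
    """Only the top priority triggers the major incident process in this sketch."""
    return priority == 1


impact = impact_from_users(affected_users=600, total_users=1000)
priority = classify(impact, Level.HIGH)
```

Because the decision comes from a table rather than from individual judgment, two incident managers looking at the same facts should arrive at the same priority.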

Tell Me What's Going On!

Few things are more annoying than constantly being asked for status. It is like having a 2-year-old child saying 'why' every 2 minutes. Having a communications plan, with pre-timed points for when to communicate and what to communicate, ensures that resources stay focused on service restoration rather than on continually explaining the current status and where they are in the process. A subscription or opt-in process is a great way to get messages out: it lets users define what they want to be notified of, rather than being force-fed by IT. The plan should include a regular schedule of when users can expect to be notified, for example every 15 minutes, to help manage expectations.
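The two ideas above, opt-in subscriptions and a fixed update cadence, can be sketched in a few lines. The `SubscriptionRegistry` class, the `update_schedule` function, and the user and service names are hypothetical and exist only for this example; the 15-minute interval is the one suggested in the text.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Dict, List, Set


class SubscriptionRegistry:
    """Tracks which users opted in to updates for which services."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, Set[str]] = defaultdict(set)

    def subscribe(self, user: str, service: str) -> None:
        self._subscribers[service].add(user)

    def unsubscribe(self, user: str, service: str) -> None:
        self._subscribers[service].discard(user)

    def recipients(self, service: str) -> List[str]:
        """Users to notify about `service`, sorted for stable output."""
        return sorted(self._subscribers.get(service, set()))


def update_schedule(declared_at: datetime, interval_minutes: int = 15,
                    count: int = 4) -> List[datetime]:
    """Times at which the next `count` status updates are due."""
    return [declared_at + timedelta(minutes=interval_minutes * i)
            for i in range(1, count + 1)]


registry = SubscriptionRegistry()
registry.subscribe("dana", "email")
registry.subscribe("eli", "email")
registry.subscribe("eli", "payroll")

declared = datetime(2014, 1, 9, 10, 0)
due_times = update_schedule(declared)
notify = registry.recipients("email")
```

Publishing the schedule up front lets subscribers know when the next update will arrive, which reduces ad hoc status requests to the people working on the fix.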

It's Not What Happened — It's How You Respond

As important as it is to know what happened, how you respond is key to establishing credibility and trust. Three things drive how end users and the business perceive IT: whether they know what is going on, what is going to happen next, and that service is going to be restored. Acknowledging the incident and establishing the facts in a non-emotional, non-political way helps to keep the focus on the task at hand and reduces procrastination.

Practice, Practice, Practice

Nothing proves that the process works like being in the middle of a firefight, but stress and time pressure introduce mistakes and "hacks". Unauthorized changes, and confusion over who to communicate with or who to call, are all things that will occur. The better the process is known by the people involved, the less likely it is to break down and the less frequent these mistakes become. Planned exercises and tests are very good as rehearsals, but they cannot always be representative of real issues because too much is pre-scripted. "Pull the plug" testing, where only a few individuals know in advance, is a great live exercise for seeing who responds and how they respond.

ServiceNow helps to manage this process with new supporting functionality in the Dublin release. ServiceNow Incident Alert Management (IAM) helps to drive the outbound communication process to gather the right resources for restoration, whilst notifying subscribers of issues with the services they are interested in. To learn more, check out the ServiceNow Wiki.
