A Customer-centric model for handling Incidents

SimonMorris · 02-29-2012

A few weeks ago we launched a new set of enhancements to our internal ITSM processes at ServiceNow.

As customers you won't see this work filter through to your own instances with the Berlin release this year, but you are already interacting with the changes we made as consumers of our service.

As a SaaS provider we realise that our customers are buying a complete experience from us - as cheesy as it sounds to say it.

Our software is a big part of our offering but fundamentally we are a service provider, selling the availability of our platform and product. Good IT Service Management is critical to our success - as it is to the success of our customers.

We also realise that having the best possible Service Management involves continuous review and taking steps to improve the way we handle disruptions or degradation to our services.

The release we deployed earlier this month concentrated on Incident Management. We designed it over the Christmas holiday period and then built and deployed it to our own instance of ServiceNow.

I wanted to share the design that we came up with - it is generic enough for anyone to take and use, but the design considerations were quite specific to us as a SaaS Service Provider. If anyone finds the model we used helpful, please feel free to implement it yourself or ask us questions.

We want to be Customer centric.

Our previous Incident model had evolved over time and in retrospect it was clear to see that we'd made changes for our own needs without looking at the process from the view of the customer.

I think the overall aim of ServiceNow Customer Support is to be the best in the world at what we do. To have the best reference example of Incident Management in a ServiceNow environment. To wow our customers with the support they receive.

We look at Rackspace - one of our customers - and envy the reputation they have for "Fanatical" support. We won't be stealing their marketing message, but we want that reaction and recognition for ourselves.

When designing our model for handling Incidents we placed ourselves in our customers' position and asked:

What do our customers want out of this process?

We used the five "R"s to describe the outcomes that a customer wants from the moment they tell us about an issue to the moment that issue goes away.

Response

When a customer contacts us with an issue they want to know that we have received it and that their issue isn't "lost in the system". Getting to the stage of "Response" was one critical stage we wanted to get right in our state model.

Recognition

A customer is often forced to describe the issue as they see it from their point-of-view. They may not have the access or knowledge to provide the exact root cause of their issue and we felt that recognition of their issue was an important stage. Is the customer happy that we know what their exact issue is?

Relief

We defined Relief as the stage where we have been able to diagnose the issue that they are having and we have a solution that we can propose to them. That isn't necessarily the same as the customer having the issue removed from their environment; it means we have identified and provided a solution. We saw this as a critical stage to get to in the process.

Resolution

Resolution refers to the state where the issue is removed from their environment. In our industry, where an issue could be caused by a defective piece of code, we can't always fix the issue conclusively without rolling the fix up into a Hotfix or Patch release.

Here we find systems such as the Known Error Database and Workarounds very helpful to provide alternative, temporary solutions to the customer.

Because we are often unable to apply Workarounds directly onto our customers' environments - they have Change control and their own compliance processes to deal with - we felt that achieving the stage of Resolution is often going to be in the hands of the customer.

Removal

And finally the customer expects never to have the same issue affect them again. The final "R" in the process is removal which typically fits into our Problem and Release processes. The customer may have disengaged with the Support team at this point but we aim to make sure that we remove the cause of that Incident from our system.

Our new Incident State Model

We wanted to have as simple a state model as possible whilst still capturing the performance data we needed. We also wanted to eliminate parts of the state model where Incidents go to die - no states of inactivity or waiting for things to happen.

The states in our model are:

New
Work in progress
Solution proposed
Closed

You might be surprised to see a lack of Resolved or an overall Pending state in our model. We really wanted to put a focus on getting customers to the stage of Relief and we didn't want to be tempted with states that get in the way of that. Having a Pending state might imply that we have the ability to stop the clock for various activities when that isn't the case from the customer's point of view. They don't care if we are waiting for various background activities to happen - they want a Solution.

We didn't need a Resolved state as our focus is on the Relief stage and we can't control when customers actually implement the workaround that we give them. In our process Resolution is a step that the customer would typically perform by applying a Workaround or hotfix.

So what happens in these states?

New

In the New state we are working towards the first "R" - Response. We didn't want to rely on an automated "Thank you for logging your ticket" type of email, we want to make sure that the first communication is valuable and meaningful.

During this state we are performing triage, making sure the Incident category is applied correctly, the priority is correctly set and we assign it to the correct engineer.

Achieving the state of Response means that the customer knows exactly who is owning their Incident and they have some sort of decent update. At this point only the Incident Owner can transition the Incident to Work in progress.

Work in progress

During this state we are performing all the tasks required to get a solution out to the customer.

This would involve researching the Incident, matching it against Known Errors in our Problem database and finding suitable Workarounds.

The Incident owner might bring in help from other teams in the company, from Operations and Development but we don't change the Incident Owner at any time. We wrote a system of delegating certain tasks required to get a solution to other individuals.

By researching the issue, communicating with the customer and locating the cause of their Incident we can work towards proposing a solution to the customer.

Solution proposed

The next "R" we are focused on is Relief. We define this as providing the customer with a suitable solution that either conclusively fixes their issue or provides a Workaround.

As this is a key point in the Incident lifecycle we mark this stage with a transition in state to "Solution Proposed".

Whilst we always want to fix the issue conclusively, that isn't always possible, especially in the case of Software Defects. There is normally a Workaround for customers, however, and it's up to the customer to decide whether they feel that it's suitable for them.

When we transition to Solution Proposed we ask customers to do one of three things:

Accept Solution: Indicate that they are happy with the Solution or Workaround and the Incident can be closed
Reject Solution: Indicate that the solution doesn't work for them or they are unhappy with the Workaround
Do nothing: Indicate that the Workaround works for them but they want to keep the Incident open until a conclusive fix is found. Or they want a Root Cause Analysis

This puts the control of the solution into the hands of the customer. They decide if the solution is suitable or not.

Clicking Reject Solution moves the Incident straight back to Work in progress (and the Relief SLA resumes). Clicking Accept Solution means that the customer is happy and we can close that Incident.
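
For anyone building something similar, the three customer choices could be sketched roughly like this. It is a minimal Python illustration, not the code running in our instance: the names Incident, CustomerResponse and apply_customer_response are made up for the example, and it assumes the Relief SLA is paused while a solution is on offer (which is what "resumes" above suggests).

```python
from dataclasses import dataclass
from enum import Enum


class CustomerResponse(Enum):
    ACCEPT = "Accept Solution"
    REJECT = "Reject Solution"
    DO_NOTHING = "Do nothing"


@dataclass
class Incident:
    # Hypothetical record: starts at the point where a solution is on offer.
    state: str = "Solution proposed"
    # Assumption: the Relief SLA clock is paused while a solution is proposed.
    relief_sla_running: bool = False


def apply_customer_response(incident: Incident, response: CustomerResponse) -> Incident:
    """Apply the customer's choice to an Incident in Solution proposed."""
    if incident.state != "Solution proposed":
        raise ValueError("the customer can only respond to a proposed solution")

    if response is CustomerResponse.ACCEPT:
        # The customer is happy, so the Incident can be closed.
        incident.state = "Closed"
    elif response is CustomerResponse.REJECT:
        # Straight back to Work in progress, and the Relief SLA resumes.
        incident.state = "Work in progress"
        incident.relief_sla_running = True
    # DO_NOTHING: the Incident stays open in Solution proposed.
    return incident
```

The point of the sketch is that only the customer's choice moves an Incident out of Solution proposed, so control over whether the solution is good enough stays with them.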

Closed

The Closed state is used to indicate that no more work is required on this Incident and we make the entire record read-only for everyone. We could manually move Incidents out of this state if an Incident is closed in error but our intention is that Closed incidents stay closed.
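
Putting the four states together, the transitions described above can be written down as a small table. This is only a sketch in Python under my reading of the model: the State enum, the ALLOWED table and can_transition are illustrative names, and it deliberately leaves out the manual route for reopening an Incident that was closed in error.

```python
from enum import Enum


class State(Enum):
    NEW = "New"
    WORK_IN_PROGRESS = "Work in progress"
    SOLUTION_PROPOSED = "Solution proposed"
    CLOSED = "Closed"


# Transitions as described in the text (an interpretation, not a spec).
ALLOWED = {
    State.NEW: {State.WORK_IN_PROGRESS},
    State.WORK_IN_PROGRESS: {State.SOLUTION_PROPOSED},
    # Reject Solution goes back to Work in progress, Accept Solution closes.
    State.SOLUTION_PROPOSED: {State.WORK_IN_PROGRESS, State.CLOSED},
    # Closed Incidents are intended to stay closed.
    State.CLOSED: set(),
}


def can_transition(current: State, target: State, actor: str, owner: str) -> bool:
    """Return True if `actor` may move an Incident from `current` to `target`."""
    if target not in ALLOWED[current]:
        return False
    # Only the Incident Owner can move an Incident out of New.
    if current is State.NEW and actor != owner:
        return False
    return True
```

For example, can_transition(State.NEW, State.WORK_IN_PROGRESS, actor="bob", owner="alice") is refused because Bob isn't the Incident Owner.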

Sub-states

We didn't want any states in our model that imply that we've stopped working on the Incident for customers. For reporting reasons we do want to be able to indicate that this Incident has a relationship to other processes - Problem and Change.

For this reason we have a sub-state that is available within the Work in progress and Solution proposed states.

Our sub-states are:

Pending Problem
Pending Change
Pending RCA
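
The rule that a sub-state is only available in two of the four states can be expressed as a simple check. This is a hedged Python sketch with made-up names (SUB_STATES, STATES_WITH_SUB_STATES, is_valid_sub_state), not our actual configuration:

```python
from typing import Optional

SUB_STATES = {"Pending Problem", "Pending Change", "Pending RCA"}

# Sub-states are only offered while the Incident is being worked on
# or while a solution is on offer.
STATES_WITH_SUB_STATES = {"Work in progress", "Solution proposed"}


def is_valid_sub_state(state: str, sub_state: Optional[str]) -> bool:
    """Check that a sub-state (or no sub-state) is valid for a state."""
    if sub_state is None:
        # Having no sub-state is fine in every state.
        return True
    return state in STATES_WITH_SUB_STATES and sub_state in SUB_STATES
```

Because the sub-state sits alongside the state rather than replacing it, an Incident marked Pending Change is still reported as Work in progress.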

Incident flags

Lastly - we wanted a method to communicate with the Incident owner that didn't extend the state model. The state model is focused on driving the Incident towards Response and Relief.

We have 3 flags on the Incident that allow reporting and filtering for the Incident Owner and the Technical Support Managers.

Action needed
Customer Action needed
Assistance Required

Action needed is a flag set by interactions with the customer. For example, after the Incident is updated by the customer we set the Action needed flag.

If the customer rejects the solution the flag is set. If other members of ServiceNow update the Incident the flag is set.

Action needed is a visual reminder of which work needs attention from the Incident owner.

Customer Action needed is a flag that we manually set to communicate with the customer that input is needed from them. In other state models we might have used "Pending User" but we felt that having this in the state model detracts from the outcomes that the customer wants.

Assistance Required is a flag the Incident Owner can set to get the attention of his managers and peers within the company. For example, as a developer I might keep an eye on Incidents that are categorised in my product area and set to Assistance Required to see who needs my help.

The 3 Incident flags are a communication mechanism for the Incident Owner and customer outside of the state model.
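
As a rough illustration of how flags like these can sit beside the state model, here is a small Python sketch. The FlaggedIncident class and the on_update and needing_assistance helpers are hypothetical names for the example, and the rule "any update by someone other than the owner sets Action needed" is my summary of the cases listed above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FlaggedIncident:
    # Hypothetical record, only the fields needed for the flags.
    number: str
    owner: str
    category: str
    action_needed: bool = False
    customer_action_needed: bool = False  # set manually by the owner
    assistance_required: bool = False     # set manually by the owner


def on_update(incident: FlaggedIncident, updated_by: str) -> None:
    """Set Action needed when someone other than the owner updates the Incident.

    This covers customer updates, rejected solutions and updates from
    other members of staff. The state itself is left untouched.
    """
    if updated_by != incident.owner:
        incident.action_needed = True


def needing_assistance(incidents: List[FlaggedIncident], category: str) -> List[FlaggedIncident]:
    """Filter a developer might use to find Incidents in their product area."""
    return [i for i in incidents if i.assistance_required and i.category == category]
```

Neither helper changes the Incident state, which is the whole idea: the flags carry the "who needs to do something" information so the state model doesn't have to.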

How is it working?

A few weeks after implementation we've settled down into our state model really well.

We are much more focussed on delivering the two outcomes that matter to customers: Response and Relief.
We have consistent ownership of Incidents from the initial submission to closure.
The quality of the initial response is higher with our controls for transitioning to Work in progress.
Customers have the ability to reject solutions and restart the Relief SLA, so we are more likely to make sure the solution is acceptable before proposing it.
We aren't seeing tickets reopened from the Closed state, which gives us better reporting on Time to Relief and SLAs.

There you go - I hope our customers have a better insight into the states that they see when dealing with Customer Support and hopefully we've given a few ideas for new customers considering their own Incident model.