Everyone knows the existential question: If a tree falls in the forest, and nobody is there to hear it, does it make a sound? Leading-edge customer service management today has produced a corollary: If a network problem is fixed before the customer even suspects there’s an issue, did it even happen?
For Dominic Walton, senior director of site reliability engineering at ServiceNow, the answer is a firm no. Walton leads a global team of three dozen engineers who have spent the last five years honing the craft of pre-emptive monitoring: scanning real-time analytics for signs that customers might be affected by a future IT event. Trained to spot issues early, the team heads off the majority of them before customers are aware of any impact.
Walton doesn’t linger long on the philosophical implications. “Proving the negative,” Walton says with a chuckle. “It’s actually one of the things we struggle with when we report on what we do. Say we resolve an issue. If left alone, the customer might have suffered. But we intervened, so it didn’t. How do we say that it would have?”
Signal vs. noise
Proactive monitoring starts with the ability of the Now Platform® to integrate a variety of cloud monitoring tools. “The beauty is that because of our platform, all of these elements are connected,” says Brooke Hendricks, director of business process management for the ServiceNow customer support portal. “Your alerts are in the same system as your events; your events are in the same system as your customer incidents or cases; your cases are all in the same system as your change tickets. We can see how they all affect each other and mitigate any risks.”
Two applications on the platform have been key to that progress: event management for monitoring, and analytics dashboards that bring all that disparate data into a single view. As the team gained experience, it fine-tuned the dashboards, experimented with how thresholds should be set for creating various kinds of alerts, and looked for patterns that indicated the likelihood of common issues. Once captured, many of these lessons were automated into workflows, significantly improving the signal-to-noise ratio. A better signal-to-noise ratio translates into higher-quality alerts.
Walton has found that about half of what used to trigger service alerts qualified as noise that required no remediation. More than half of what remained could be addressed immediately and solved without involving customers. Only about one in six potential issues resulted in his team opening a service ticket and informing a customer that it was working on an issue the customer hadn’t known about. Those are the only cases in which customers are made aware of an issue, and in which customer satisfaction becomes a primary factor.
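To see how those proportions fit together, here is a minimal sketch in Python. It assumes a hypothetical batch of 600 alerts and treats the "about half" and "about one in six" figures as exact fractions. The function name and the numbers are illustrative only, not ServiceNow data or code.

```python
from fractions import Fraction


def alert_funnel(total_alerts, noise_share=Fraction(1, 2), ticket_share=Fraction(1, 6)):
    """Split a batch of alerts into noise, quietly resolved issues, and tickets.

    The default shares are assumptions taken from the rough figures quoted
    in the article ("about half" and "about one in six").
    """
    noise = total_alerts * noise_share          # alerts needing no remediation
    tickets = total_alerts * ticket_share       # issues the customer is told about
    resolved_quietly = total_alerts - noise - tickets  # fixed without involving customers
    actionable = total_alerts - noise           # everything that was not noise
    return {
        "noise": noise,
        "resolved_quietly": resolved_quietly,
        "tickets": tickets,
        "quiet_share_of_actionable": resolved_quietly / actionable,
    }


# Hypothetical batch size, chosen only because it divides evenly.
funnel = alert_funnel(600)
for name, value in funnel.items():
    print(f"{name}: {value}")
```

Under these assumptions, the issues resolved without customer involvement make up two thirds of the alerts that were not noise, which is consistent with the "more than half of what remained" figure.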
Natural disasters are one example of how his team may get ahead of issues. In the case of facilities potentially being struck by hurricanes, he says, “Where necessary, we may proactively decide to temporarily move the services away from those data centers. This failover would be seamless from the customer perspective.”
Leaner, smarter team
By moving the point of resolution to the point of detection, Walton’s team has made the traditional network operations center redundant. In the past, a company the size of ServiceNow might have operated with engineers organized into tiers who used scripts to triage, escalate, and respond to events within a ticketing system.
Walton has instead created a smaller but highly skilled global team, without tiers. Engineers can take responsibility for issues as they arise and see them through to resolution. Team members are located around the world and provide follow-the-sun coverage without the need for graveyard shifts. His team is staffed by “highly experienced technical engineers who've been there, seen it, done it.” He hires the best multidisciplinary engineers he can find so the team can resolve the varied scenarios that arise.
Their methods are designed not only to address the immediate need of restoring service but also to drive toward platform-level solutions that increase reliability by solving underlying causes. He invokes the 80/20 rule to describe the division of focus.
“Eighty percent of the events we've seen before so we can plan for them and automate. Twenty percent of things that happen we have never seen before, and that’s where we need to be well prepared and rely on our experience and calm,” he adds. “Altogether, we should be that team that no one thinks about. We want our customers to go about their day unaware that we are helping to protect their ServiceNow instances from the issues that might arise.”
For more information on how ServiceNow uses its own technology to run its operations, visit Now on Now, our website that is chock-full of webinars, case studies, and other information on the Now Platform.
© 2020 ServiceNow, Inc. All rights reserved. ServiceNow, the ServiceNow logo, Now, and other ServiceNow marks are trademarks and/or registered trademarks of ServiceNow, Inc. in the United States and/or other countries.