Outage Prevention with Operational Intelligence

aleck_lin · ‎03-01-2017

Imagine a scenario where you do not have to wait for a service to go down before taking action, but instead, you're proactively preventing an outage from occurring (think: Minority Report). A pipe dream? Maybe not!

A few months ago, my good friend ben.yukich wrote a blog post, Service Analytics: Rise of Machines, which touches on how ServiceNow leverages machine learning in the Event Management space. It's undeniable that we're quickly heading into an era where it's impossible to manage the vast amount of data without some type of machine learning capability. Using machine learning algorithms in our service analytics, we can filter through the noise of alerts and present relevant groups of alerts for you to look at, based on patterns the algorithms have learned from past data.

With our Istanbul release, we've made additional leaps by introducing Operational Metrics. Curious? Let me get right into it...

Rather than taking events from the monitoring tools, we're going to take the raw metric data instead. This allows us to apply machine learning algorithms to automatically discern the threshold for a given metric at a given time. For example, suppose we have metric data on DB transactions per second; with enough data points, we can begin to learn the data pattern and set what the upper and lower bounds (i.e., the threshold) should be. The model also takes seasonality into account, so the threshold adjusts to recurring patterns in the data.
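To make the idea of a learned, seasonal threshold a bit more concrete, here is a minimal sketch in Python. It is not the algorithm Operational Metrics uses; it simply assumes that "seasonality" means an hour-of-day pattern and that the bounds are the mean plus or minus k standard deviations of past values for that hour. The function name learn_bounds, the parameter k, and the sample history are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def learn_bounds(samples, k=3.0):
    """Learn per-hour (lower, upper) bounds from historical samples.

    samples: iterable of (hour_of_day, value) pairs.
    k: how many standard deviations wide the band is (an assumed default).
    Returns a dict mapping hour_of_day -> (lower, upper).
    """
    by_hour = defaultdict(list)
    for hour, value in samples:
        by_hour[hour].append(value)

    bounds = {}
    for hour, values in by_hour.items():
        mu = mean(values)
        sigma = pstdev(values)
        # A rate such as transactions per second cannot go below zero.
        bounds[hour] = (max(0.0, mu - k * sigma), mu + k * sigma)
    return bounds


# Hypothetical history: quiet at 03:00, busy at 14:00.
history = [(3, v) for v in (180, 200, 220, 190, 210)]
history += [(14, v) for v in (950, 1000, 1050, 980, 1020)]

bounds = learn_bounds(history)
for hour in sorted(bounds):
    lower, upper = bounds[hour]
    print(f"{hour:02d}:00  lower={lower:.0f}  upper={upper:.0f}")
```

The point of the sketch is that the same value can be normal at one time of day and anomalous at another: the upper bound learned for 03:00 sits well below the lower bound learned for 14:00.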

With the automated threshold in place, we can apply the concept of an anomaly score, which is a function of how close the value gets to the threshold (or how far it exceeds it). Simply put, suppose my DB normally handles around 1000 transactions per second, and the machine learning model suggests that the upper bound should be no more than 3000 and the lower bound no less than 100. As the value creeps toward 3000 or 100, an anomalous event with a score (between 0 and 10) is generated, depending on how far the value deviates from the model. This in turn creates an anomaly alert, which ties everything back to Event Management. It means that instead of waiting for monitoring tools to tell you something is wrong based on a preset rule, you can start seeing trends and anomalies in real time before things get worse!
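Here is one way such a score could be computed, using the numbers from the example above. This is only an illustrative sketch: the actual scoring function is not described in this post, so the linear scale (0 at the typical value, 5 at the bound, capped at 10 beyond it) and the name anomaly_score are assumptions.

```python
def anomaly_score(value, baseline, lower, upper):
    """Score a metric value from 0 (typical) to 10 (far outside the bounds).

    Assumed scale: 0 at the baseline, 5 when the value reaches a bound,
    and 10 once it overshoots the bound by the baseline-to-bound distance.
    """
    if value >= baseline:
        span = upper - baseline
        deviation = value - baseline
    else:
        span = baseline - lower
        deviation = baseline - value

    if span <= 0:
        # Degenerate band: any deviation at all counts as fully anomalous.
        return 10.0 if deviation > 0 else 0.0
    return min(10.0, 5.0 * deviation / span)


# DB transactions per second: typical 1000, lower bound 100, upper bound 3000.
for tps in (1000, 2000, 3000, 6000, 550, 100):
    print(tps, anomaly_score(tps, baseline=1000, lower=100, upper=3000))
```

With this scale, a value halfway to either bound scores 2.5, a value sitting on a bound scores 5, and anything far beyond the bound saturates at 10, so the score starts climbing well before the threshold is actually crossed.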

Am I proclaiming that you will no longer suffer outages? Not quite. We all want to prevent outages rather than recover from them, but the only reasonable way to make that a reality is to leverage machine learning to get us closer to the goal. I'm not saying that we have solved this issue completely, but we're on a journey to make it more of a possibility. It also speaks to our continual investment in A.I., because we see it as an absolute foundation for building better products in the operations world today.

To give you a taste of what this looks like, we have introduced dashboards like the Anomaly Map, which shows the anomalies that CIs (servers, applications, etc.) have experienced over time. In the screenshot below, it looks like there was something anomalous going on with the response time of my SAP-SD-03 server between 2/21 and 2/23. I can also see that it's currently experiencing something with the Disk Read Time.

To get a better idea of what's going on, I can drill into the specific metric and see the data trend. You can see the raw metric value (blue line), the threshold (red dotted line), and the anomaly score (the green, blue, yellow lines on the bottom) across time.

You may even want to superimpose metric data on top of one another to see if there are any correlations among them.
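Beyond eyeballing the overlaid charts, you could put a number on how closely two metrics move together. The sketch below computes a plain Pearson correlation coefficient over two equally sampled series. It is a generic statistic, not a feature of the product described here, and the metric names and sample values are made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series, between -1 and 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with 2+ points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Hypothetical samples taken at the same timestamps.
response_time_ms = [120, 130, 180, 260, 400, 390]
disk_read_time_ms = [8, 9, 14, 22, 35, 33]

print(f"correlation: {pearson(response_time_ms, disk_read_time_ms):.2f}")
```

A value close to 1 means the two metrics rise and fall together, a value close to -1 means one rises as the other falls, and a value near 0 suggests no linear relationship. Correlation alone does not tell you which metric is the cause.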

Finally, to tie this all together, here's an example of an alert created from the anomalous response time on the Tomcat web server, which in turn affects the overall health of the business service!

With all these new capabilities, we're beginning to see the possibilities opening up when it comes to managing the health of your business services. This is where we're investing very actively in operational intelligence: by ingesting different types of data from different sources and overlaying them on top of the CMDB, we can provide not only descriptive analytics but also predictive analytics, in order to prevent outages before they happen!
