ben_yukich
ServiceNow Employee

Event Management has been a capability in ServiceNow for quite a few releases now. It's straightforward in its approach, and it allows massive volume reduction as events roll up into alerts and alerts roll up into tasks (e.g. Incidents). Seriously, I'm talking order-of-magnitude compression ratios! But a while back we decided that we could do better. In the Geneva release we introduced a new feature called "Service Analytics", but made it available only to select design partners. The goal? Do more than simple mapping and manual rules for compressing events and alerts: analyze them on a time-series and topology basis to create logical groupings and identify probable root cause. Beginning with the Helsinki release, it is available to all Event Management customers. Before we get into the details, let's talk a little bit about machine learning more broadly.

Machine Learning is an incredibly hot topic right now across many industries. According to Gartner's 2016 Hype Cycle, Machine Learning sits squarely at the "peak of inflated expectations" (for the second year in a row), on target to fall into the trough of disillusionment. It may soon follow in the footsteps of augmented reality, software-defined anything, and autonomous vehicles: slowly but surely climbing back up into the real world, hopefully becoming a transformative technology on the plateau of productivity at some point in the future.

So, why so much hype? What's so hard? Can't Watson just answer everything for me?

terminator.jpg

Well, first and foremost, we must ensure we don't create a sentient super-intelligence that harvests our species for parts. Secondly, while there have been phenomenal advancements in machine learning, it's an even more phenomenally complex space. Even "simple" questions are deceptively complex... for example, what is learning? There are many ways to interpret this question and apply solutions. For a given problem, perhaps something as simple as linear regression is sufficient. If you're making a chat bot, a Markov chain might be a reasonable place to start. But when should you use a supervised learning model like an artificial neural network, or perhaps a probabilistic graphical model? Would a simple logic-based deduction suffice, or is inductive reasoning needed? For me, there are usually far more questions than answers when trying to make practical use of machine learning techniques. I am no expert; I won't even claim to be an amateur. I'm a casual observer at best.

Thankfully, there are folks who live and breathe this stuff. Some of them have been working away at our Service Analytics capabilities. There are two primary areas of Service Analytics you ought to know about: correlated alert groups, which automatically group alerts that tend to occur together in time; and root-cause CI analysis, which automatically identifies the probable root cause CI within Service Mapping topologies.

CAG.png

Correlated Alert Groups

The first step of machine learning for correlated alert groups in event management is, not surprisingly, the learning part. In ServiceNow, the past 90 days of events are re-assessed on a daily basis using a proprietary probability model to group time-correlated events into learned patterns. You can think of this as analyzing which events frequently co-occur over a reasonable window of time.
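To make the "which events frequently co-occur" idea a bit more concrete, here is a minimal sketch in Python. To be clear, this is not the proprietary probability model that Service Analytics uses; it is a plain co-occurrence count over fixed time buckets, and every name in it (`learn_cooccurrence`, `window_seconds`, `min_support`, the event types) is made up for illustration.

```python
from collections import Counter
from itertools import combinations


def learn_cooccurrence(events, window_seconds=300, min_support=2):
    """Count how often pairs of event types land in the same time bucket.

    events: iterable of (timestamp_in_seconds, event_type) tuples.
    Returns {(type_a, type_b): count} for pairs seen at least min_support times.
    Illustrative only: a real learner would use a proper probability model.
    """
    # Group event types into fixed-size time buckets.
    buckets = {}
    for timestamp, event_type in events:
        bucket_id = int(timestamp // window_seconds)
        buckets.setdefault(bucket_id, set()).add(event_type)

    # Count every pair of distinct types that shares a bucket.
    pair_counts = Counter()
    for types_in_bucket in buckets.values():
        for pair in combinations(sorted(types_in_bucket), 2):
            pair_counts[pair] += 1

    # Keep only the pairs that co-occur often enough to count as a pattern.
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}


# Hypothetical event history (timestamps in seconds).
history = [
    (0, "db_latency"), (30, "app_timeout"),
    (600, "db_latency"), (640, "app_timeout"),
    (1200, "disk_full"),
    (1800, "db_latency"), (1810, "app_timeout"), (1820, "disk_full"),
]
patterns = learn_cooccurrence(history)
print(patterns)
```

In this toy history, only the database latency and application timeout events co-occur often enough to survive the `min_support` cutoff, which is the kind of pairing a learned pattern is meant to capture.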

The second step is querying these learned patterns as new events arrive. Over a configurable time window, incoming events are measured against the learned patterns for best fit. It's this fit against a learned pattern that generates a "Correlated Alert Group". This group behaves in the same way as an alert group created by manual alert correlation rules, except with no human intervention required.
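The query step can be sketched along the same lines. The Python below is an assumption-heavy illustration, not the product's matching logic: it treats a learned pattern as a simple set of alert types, and "best fit" as the largest pattern whose types all show up within the window. The function name `group_alerts` and the sample data are hypothetical.

```python
def group_alerts(alerts, patterns, window_seconds=300):
    """Group incoming alerts that fit a learned pattern within a time window.

    alerts: list of (timestamp_in_seconds, alert_type) tuples.
    patterns: iterable of tuples of alert types that were learned to co-occur.
    Returns a list of groups; each group is a list of alerts.
    Illustrative only: "best fit" here just means the largest matching pattern.
    """
    alerts = sorted(alerts)  # order by timestamp
    used = set()             # indexes of alerts already placed in a group
    groups = []
    for i, (timestamp, alert_type) in enumerate(alerts):
        if i in used:
            continue
        # Ungrouped alerts inside the window that opens at this alert.
        window = [
            j for j in range(i, len(alerts))
            if j not in used and alerts[j][0] - timestamp <= window_seconds
        ]
        window_types = {alerts[j][1] for j in window}
        # Patterns that include this alert and are fully present in the window.
        matches = [
            set(p) for p in patterns
            if alert_type in p and set(p) <= window_types
        ]
        if not matches:
            continue
        best = max(matches, key=len)
        members = [j for j in window if alerts[j][1] in best]
        used.update(members)
        groups.append([alerts[j] for j in members])
    return groups


# Hypothetical learned pattern and incoming alerts.
learned = [("app_timeout", "db_latency")]
incoming = [
    (5000, "db_latency"),
    (5020, "cpu_high"),
    (5060, "app_timeout"),
    (9000, "db_latency"),
]
groups = group_alerts(incoming, learned)
print(groups)
```

The unrelated `cpu_high` alert and the much later `db_latency` alert stay ungrouped, while the two alerts that fit the learned pattern inside the window end up in one group.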

RCA.png

Root Cause CI

First off, root cause analysis is only available if you have Service Mapping enabled. Once again, the first step is learning patterns. The approach here is slightly different: it involves Bayesian networks and uses an "offline" learner that runs on a designated MID server that has been given the "Analytics" capability in your environment. The learner algorithm overlays historical alerts on discovered service topologies and outputs causal models. In addition to leveraging a directed acyclic graph representation of impact relationships in the service topology, the precise approach is configurable with "RCA Configs" to allow for rule-based or multi-state analysis models.
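As a rough mental model of "overlaying historical alerts on a topology", the sketch below estimates, for each dependency edge, how often the dependent CI was alerting when the CI it depends on was alerting. Conditional probabilities like these are one building block of a Bayesian network, but this is only my simplified illustration; the actual learner, its inputs, and its output format are not described here, and all names (`learn_edge_probabilities`, `depends_on`, the CI names) are hypothetical.

```python
def learn_edge_probabilities(depends_on, history):
    """Estimate P(dependent alerting | dependency alerting) for each edge.

    depends_on: {ci: [cis it depends on]}, assumed to form a DAG.
    history: list of sets, each holding the CIs alerting in one time window.
    Returns {(dependency, dependent): probability}.
    Illustrative only: a simple frequency estimate, not the product's learner.
    """
    probabilities = {}
    for dependent, dependencies in depends_on.items():
        for dependency in dependencies:
            # Windows in which the upstream dependency was alerting.
            windows = [w for w in history if dependency in w]
            if not windows:
                continue  # no evidence for this edge
            both = sum(1 for w in windows if dependent in w)
            probabilities[(dependency, dependent)] = both / len(windows)
    return probabilities


# Hypothetical topology and alert history.
depends_on = {
    "app_server": ["database"],
    "web_service": ["app_server"],
}
history = [
    {"database", "app_server", "web_service"},
    {"database", "app_server"},
    {"app_server"},
    {"database"},
]
edge_probabilities = learn_edge_probabilities(depends_on, history)
print(edge_probabilities)
```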

The second step of root cause analysis is similar to correlated alert group queries. The causal models are applied to the real-time stream of alerts to identify candidate root cause CIs and assign each one a probabilistic score.
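Here is one way to picture scoring candidates over a topology, again in Python and again with loud caveats. This sketch swaps the Bayesian causal model for a much simpler heuristic: a CI scores higher the more of the currently alerting CIs sit downstream of it in the dependency graph. The scores are normalized to sum to 1 so they read like probabilities, but they are not real posteriors. The names `downstream_of`, `score_root_causes`, and the sample topology are invented for the example.

```python
def downstream_of(ci, depends_on):
    """Return every CI that directly or transitively depends on `ci`."""
    impacted = set()
    stack = [ci]
    while stack:
        current = stack.pop()
        for node, dependencies in depends_on.items():
            if current in dependencies and node not in impacted:
                impacted.add(node)
                stack.append(node)
    return impacted


def score_root_causes(alerting, depends_on):
    """Score each alerting CI by how many alerting CIs it could explain.

    alerting: set of CIs with active alerts.
    depends_on: {ci: [cis it depends on]}, assumed to form a DAG.
    Returns {ci: score} with scores normalized to sum to 1.
    Illustrative heuristic only, not a Bayesian causal model.
    """
    raw = {}
    for ci in alerting:
        # A CI "explains" itself plus any alerting CI that depends on it.
        explained = (downstream_of(ci, depends_on) | {ci}) & alerting
        raw[ci] = len(explained)
    total = sum(raw.values())
    if total == 0:
        return {}
    return {ci: count / total for ci, count in raw.items()}


# Hypothetical service topology and active alerts.
depends_on = {
    "web_service": ["app_server"],
    "app_server": ["database", "cache"],
    "database": ["storage"],
    "cache": [],
    "storage": [],
}
alerting = {"web_service", "app_server", "database"}
scores = score_root_causes(alerting, depends_on)
print(max(scores, key=scores.get), scores)
```

In this toy topology the database comes out on top because both the app server and the web service depend on it, so a single database fault would account for all three alerts.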

Going Forward

The two techniques outlined above help provide another level of event compression (with no human intervention), and rapid identification of which service impacts should be addressed first. This is just the beginning. Service Analytics is an area of the product that's getting a huge amount of interest and very active development. Future releases will build on this foundation in ways I probably can't imagine yet. But how do you get started today? It's actually really easy, but you'll have to tune in next time to find out.