AIOps provides more value when combined with better experience

Hakan Isik · ‎03-07-2019

AIOps is not the new kid on the block anymore

It has indeed come a long way since Gartner first announced it in 2017, and it has evolved based on real-life use cases. Even its name has changed from “Algorithmic IT Operations” (which I personally like better) to “Artificial Intelligence for IT Operations”.
Today I am not going to talk about AIOps itself, though; instead I will touch on an interesting point we’ve realized over the course of its evolution: the importance of user experience in AIOps.

But before that, if you want to familiarize yourself with AIOps, you might want to:

Take a look at the attached market guide for general information,
Check this link to see how it can help you with your business issues,
And of course Google it for the latest trends and examples.

It is not easy to be “IT”

AIOps has been pointing out an obvious fact about IT and proposing a very reasonable solution: a framework for the challenges associated with that fact. The fact is that IT is the engine of our business; in some cases it is even the business itself. So it is essential for our success. But IT has its own challenges to overcome. It needs to deal with ever-increasing complexity, both in terms of the components to manage and the data to process and interpret. Both keep getting bigger and more complex, and they show no intention of getting smaller or simpler in the near future. So it is not really feasible to handle that complexity and that huge amount of data manually. Otherwise we can find ourselves in situations like this:

And then, as a solution, AIOps says that we need to automate and streamline the handling of that complexity using big data and machine learning techniques. That is the only feasible way to manage all those components and analyze that huge amount of data so that we can answer our questions with the right level of visibility and automation. That made perfect sense to everybody and quickly became a very popular topic among many enterprises. They finally had the answer they had been seeking for decades. And the celebrations began:

When theory and practice go their own separate ways

Without losing any time, they sent out their best hunters and gatherers to capture the finest examples of big data and machine learning. And you know what, those hunters and gatherers didn’t return empty-handed. They actually brought in many shiny, promising tools. Everybody was happy, and they started right away to work on an architecture to put all those tools together. That didn’t take too long either, and it finally came down to implementing and actually using what they had in order to achieve the desired outcomes, as AIOps suggested.

They were expecting to have something like this:

Or at least something like that:

But this was what they got instead:

They were getting some sort of value out of their new AIOps environment, but:

That value was quite small compared to their investment,
Only technical users were able to use the environment while there were many other stakeholders they needed to satisfy,
The environment was hard and costly to maintain,
It kept having performance and availability issues due to multiple custom integrations,
And it was disconnected from the business.

User experience matters, it matters more than you think

There was obviously something wrong there. And, being a bunch of very smart people, they quickly figured out what was wrong with their approach. The problems they were having weren’t about the functionality of the tools they had brought in. Those tools were actually equipped with some quite powerful capabilities. But at the same time, each of them introduced different types of complications into the architecture, which eventually created a Frankenstein with some serious usability issues. The end architecture was simply too technical, too complex and disconnected from business reality.

They also realized that AIOps was very clear about “WHAT” to do, but not so much about the “HOW” part. And that was what they were missing as well. So they decided to write a brief manifesto about the “HOW” part based on their experiences, and this is how it goes:

AIOps provides more value when combined with better experience. We need to isolate technical capabilities and provide users with user-friendly, intuitive interfaces so that they can focus on outcomes rather than technicalities.
The rest of this manifesto is here to ensure principle #1.
Data is the fuel of AIOps. So we should ensure data quality and never compromise on it, as it has a direct impact on the quality of the answers we’re getting from an AIOps platform. That’s why we should avoid integrations at the data source level and have a single data source which is able to consume and normalize data from other data sources/providers.
Business outcomes over features and functions. Of course we need to have certain capabilities to fulfill the requirements of an AIOps platform, but we should keep our technical and business goals in mind while choosing those capabilities. There may be many different ways/capabilities which can satisfy a particular use case. So we should shape our architecture according to our business requirements, not the other way around.
Integrations should be avoided as much as possible. Each integration introduces another layer of complexity and causes hard-to-maintain, troublesome bottlenecks in the architecture. That should be an important consideration when choosing components, and both data-level and process-level integrations should be avoided.
Customizations should be avoided as much as possible. For a future-proof architecture, we should stick to the out-of-the-box versions of the components we’re using. Each customization brings additional workload in the form of regression and integration tests, extra development, etc., and additional risk in terms of performance, compatibility and security.
Data should be accessible by anyone who needs it to do their job. An AIOps platform serves many different stakeholders with different questions to answer, and they might need different parts of the data. The platform should allow users not only to access whatever data they need to do their job, but also to use and visualize that data, in a secure way, as they need.
No AI for the sake of AI. AI is valuable as long as it delivers the answers we need. If it doesn’t, then it is nothing but noise and complexity. We should pick our AI capabilities based on the problems we need to solve, not the algorithms we think will look cool on paper.
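To make the "single data source that consumes and normalizes data" principle a bit more concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the two monitoring tools (`monitor_a`, `monitor_b`), their field names and the 1-to-5 severity scale are hypothetical and do not describe any real product or schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class NormalizedEvent:
    """One common shape for events, whatever tool produced them (hypothetical)."""
    source: str
    ci_name: str        # configuration item the event refers to
    severity: int       # assumed scale: 1 = critical ... 5 = info
    message: str
    timestamp: datetime


# Assumed mapping from monitor_a's textual levels to the common scale.
SEVERITY_MAP = {"critical": 1, "major": 2, "minor": 3, "warning": 4, "info": 5}


def normalize_monitor_a(raw: dict) -> NormalizedEvent:
    """monitor_a (hypothetical) sends host/level/text and an epoch timestamp."""
    return NormalizedEvent(
        source="monitor_a",
        ci_name=raw["host"].strip().lower(),
        severity=SEVERITY_MAP[raw["level"].lower()],
        message=raw["text"],
        timestamp=datetime.fromtimestamp(raw["epoch"], tz=timezone.utc),
    )


def normalize_monitor_b(raw: dict) -> NormalizedEvent:
    """monitor_b (hypothetical) sends node/sev/description and an ISO timestamp."""
    return NormalizedEvent(
        source="monitor_b",
        ci_name=raw["node"].strip().lower(),
        severity=int(raw["sev"]),
        message=raw["description"],
        timestamp=datetime.fromisoformat(raw["time"]),
    )
```

Once every provider is mapped into the same record shape at the point of ingestion, anything built on top (grouping, anomaly detection, dashboards) only has to understand one format, which is the point of avoiding integrations at the data source level.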

And they achieved some tangible positive outcomes as soon as they redesigned their AIOps architecture based on the principles above, outcomes such as:

Increased company-wide adoption
Reduced MTTR and increased MTBF
Increased service quality
Increased employee and customer satisfaction

Interestingly, they noticed another positive side effect of those principles. They were applicable not only to AIOps but also to many other IT initiatives they had in their organization; IT initiatives that could help them throughout their “Digital Transformation Journey”.

And yes, ServiceNow did read that manifesto too

Well, to be honest we didn’t actually read it (mainly because it is a fictional manifesto) but we’ve been listening to our customers closely and kept hearing similar things from hundreds of them. So we put our heads together and worked on those requirements in order to provide our customers with something close enough to an “easy button”, a consumer-like user experience, an AIOps environment where they can get the answers they need without getting bothered with technicalities.
And we achieved some pretty good results by keeping ease of use in mind and reducing AIOps to 3 simple, real-life steps. Let me walk you through those 3 steps:

#1: Shared Data Model & Shared Intelligence:

The first and most important capability ServiceNow provides is “The Now Platform”, which comes with a shared data model and shared intelligence supported by cross-enterprise workflows and business-aware visibility:

The Now Platform not only acts as a single source of truth which we can automatically populate and keep up to date in real time, but also acts as an automation engine that allows us to add new capabilities without requiring any process-level integrations. This way, whenever we add a new machine learning capability, it automatically starts working on that always-normalized, up-to-date data as a natural cell of the platform. And that gives us the visibility and automation required to achieve the agility we need for our business. This kind of strong foundation is essential for AIOps as well as for any other processes and workflows we have in IT.

Ok, now we have a strong base; so what’s next?

#2: Put Everything in Business Context:

Alright, let’s remember our goal.
We want to give a big red “make it easy” button, or something similar, to our users so that they can answer their questions at the push of a button. Questions such as “What is the root cause of that slow transaction in my most critical application?”
Well, the bad news is that we haven’t developed that magic button yet, but the good news is that we believe we have the next best thing for you, which actually looks quite similar to a big red button.

Here is what we’ve done:

We received all your events, normalized them and grouped them using machine learning.
We did the same for metrics; learned their behaviors and generated anomaly alerts, again using machine learning.
And finally, we bound those alert groups to the right CIs which were already associated with business services supporting our business.
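As a rough illustration of the second and third steps above (learning metric behavior, raising anomaly alerts, and binding alerts to business services through CIs), here is a small Python sketch. It is only a simplified stand-in, not the platform's actual machine learning: it uses a trailing-window z-score as the "learned behavior", and the window size, threshold, CI names and service names are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean, stdev


def find_anomalies(values, window=5, threshold=3.0):
    """Return indices whose value deviates from the trailing window's mean
    by more than `threshold` standard deviations (simple z-score check)."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: any change at all counts as a deviation.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


def group_alerts_by_service(alerts, ci_to_service):
    """Bind (ci_name, message) alerts to the business service their CI supports.
    CIs without a known service land under the hypothetical key 'unmapped'."""
    grouped = defaultdict(list)
    for ci_name, message in alerts:
        service = ci_to_service.get(ci_name, "unmapped")
        grouped[service].append((ci_name, message))
    return dict(grouped)
```

For instance, with response times `[10, 11, 10, 12, 11, 10, 50, 11]` only the spike at index 6 is flagged, and an alert raised on a CI that supports a given business service ends up grouped under that service, which is what lets a single "red box" stand for everything affecting it.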

I know, that sounds like a lot of work, but you don’t really need to worry about it, since most of it is taken care of automatically in the background without you noticing, thanks to the shared data model and shared intelligence.
So now all you need to do is click the big red box representing your high-priority impacted business service or application;

In order to drill into its details and detect the problematic components with associated alerts:

#3: Answer Your Questions:

But some questions are harder than others, and we might require more information to answer them. For example, in our case, even though we managed to isolate the impacted CIs and the impacting alerts in step 2 (which, BTW, reduces MTTR significantly), we still need to dig deeper to find the real root cause and fix it for good before it turns into a “problem”:

And for that, the Now Platform gives us a specialized, unified interface called “Agent Workspace”, where we not only find every bit of information required to solve our issues as fast as possible (alert, event, CI and business service details; associated incidents, changes, problems, knowledge articles, tasks, remediation actions, etc.), but also:

Detect similar cases (alerts, incidents, problems, changes) via machine learning
Attach accurate knowledge articles automatically, again via machine learning:

Collaborate with our colleagues in real-time:

And trigger remediation actions:

And that’s how it looks when we put them all together:

You most probably noticed that I didn’t put a screenshot under one of the capabilities above: “Detect similar cases (alerts, incidents, problems, changes) via machine learning”, a.k.a. the Similarity Framework. That is because I want to touch on one last, very important point.
The Agent Workspace significantly facilitates the root cause analysis process and helps us reduce MTTR by up to 90%. It does a great job of solving issues in real time. But what about the mid-term and long-term issues we’re having, the things we can consider “problems”? How are we going to know whether the alert or the associated incident we’re working on has happened before? How are we going to know whether we keep getting “similar” types of issues impacting our business services and our applications?

Well, remember the “shared data model”; we already have the data required to answer those questions. The only excuse for us not to use that data may be that it is simply too big and it is not feasible to scan through it manually. But then we have the “shared intelligence” to help us automate these kinds of scenarios. And the “Similarity Framework” is the capability under our shared intelligence that can answer those questions. It uses natural language processing techniques and CI-based analysis to identify similar records (alerts, incidents, problems, changes) automatically and puts them together under alerts. All you need to do to access it is click the “Insight” button:

And there you go:

Now we can automatically detect our recurring issues and go ahead and deepen our analysis including data from other processes to fix them for good.
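To give a feel for how text-based and CI-based signals can be combined to find recurring issues, here is a deliberately simple Python sketch. It is not how the Similarity Framework itself is implemented: it swaps in plain Jaccard overlap of word tokens for real natural language processing, and the record fields (`id`, `ci`, `description`), the 0.2 same-CI boost and the 0.5 cut-off are all hypothetical choices for the example.

```python
import re


def tokens(text):
    """Lowercase word tokens of a description."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Share of tokens two descriptions have in common (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def similar_records(current, history, min_score=0.5):
    """Return (id, score) pairs for past records that resemble `current`,
    best match first. Records on the same CI get an assumed 0.2 boost."""
    matches = []
    for record in history:
        score = jaccard(current["description"], record["description"])
        if record["ci"] == current["ci"]:
            score = min(1.0, score + 0.2)
        if score >= min_score:
            matches.append((record["id"], round(score, 2)))
    return sorted(matches, key=lambda m: m[1], reverse=True)
```

Given a new alert on a database CI described as "Database connection pool exhausted", an older incident on the same CI with nearly the same wording would be returned, while an unrelated "Disk space low" incident on another CI would not.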

Keep it simple and carry on…

In summary, that’s indeed what we tried to do; we took all the technical complexity and boxed it into 3 logical steps so that our users can answer their questions with only a few clicks:

#1: The Platform > to provide a shared data model and a surrounding shared intelligence to support digital transformation initiatives including AIOps.
#2: The Easy Button > to detect issues and diagnose the root cause automatically.
#3: The Solution > to dive deeper and fix the issues permanently.

And that kind of simplified automation has helped organizations reduce Mean Time To Repair and increase Mean Time Between Failures, leading them to better-quality services and happier users.
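For readers who want to track those two metrics themselves, here is one common way to compute them, sketched in Python. The representation of an incident as a `(start_hour, end_hour)` pair and the fixed observation period are assumptions made for the example, and organizations define both metrics in slightly different ways.

```python
def mttr(incidents):
    """Mean Time To Repair: average outage duration.
    `incidents` is a list of (start_hour, end_hour) pairs (assumed format)."""
    if not incidents:
        raise ValueError("at least one incident is required")
    return sum(end - start for start, end in incidents) / len(incidents)


def mtbf(incidents, period_hours):
    """Mean Time Between Failures: operating (non-outage) time in the
    observation period divided by the number of failures."""
    if not incidents:
        raise ValueError("at least one incident is required")
    downtime = sum(end - start for start, end in incidents)
    return (period_hours - downtime) / len(incidents)
```

With three outages lasting 2, 1 and 3 hours in a 720-hour month, this gives an MTTR of 2 hours and an MTBF of 238 hours; shorter outages lower the first number and fewer outages raise the second.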
