darius_koohmare
ServiceNow Employee

Today, I'm excited to highlight four of the AIOps capabilities enhancing our Site Reliability Operations workflows. With these capabilities we can help you gain historical insight and proactively reduce noise using similarity and clustering machine learning algorithms trained on your data, all within the confines of our own datacenters. We aimed to seamlessly add both automated and recommendation-based insights into your existing incident response workflows, driving down MTTR further with improved context and focus.

 

From an academic perspective, we wanted to align the AI's behavior with use cases that play to the technology's strengths. When there is a high volume of correctly labeled data and the predictions have high precision, it's more likely that an AI can take autonomous action, especially if the impact and consequence of the action are low. Automated grouping is a good example of a decision the system can make on the user's behalf, since a user can always manually group or ungroup alerts if the AI was incorrect. For use cases where there is less labeled data and the action is more consequential, as in critical major incident workflows, the AI should instead provide passive recommendations for a human to judge and act on. In these human-AI teams, humans benefit from judgment and decision making that is informed by the AI.

 

darius_koohmare_0-1666638326938.jpeg

 

Source: MIT quoted Ted Talk 

 

While developing our AIOps capabilities, we also considered a historical challenge we’ve seen when deploying and realizing value from AI: the configuration, maintenance, and ease of use required to begin benefiting from the added intelligence. That’s why we’ve hidden all the data science complexity behind a simple non-technical UI, with just a few input fields where you can select the data and fields to train on. We’ve also provided out-of-the-box solution definitions that you can simply enable and train.

 

The second challenge to delivering valuable AI/ML capabilities is identifying the right use cases for the technology, based on the existing workflows and the amount of data available. AI is great for some applications, and unnecessary or improper for others. We believe in applying AI to areas where it outperforms the alternatives, and we stay away from implementing AI for the sake of marketing or merely because it is feasible. Just because we ‘have the technology’ doesn’t mean we will force it into an experience without identifying the right problem and the outcomes that AI can help deliver. So when prioritizing opportunities for AI/ML in our workflows, that’s exactly where we started: conducting research with AppDev and SRE incident response practitioners to identify challenges that you still face in your current processes. The challenges we heard primarily centered around the following questions:

  • How can we better reduce noise? 
  • How can we improve diagnosis of root cause? 
  • How can we help you learn from past data to improve resilience?
  • How can we reduce MTTR? 

As a result, we've deployed and productized prebuilt AI-powered use cases aligned to these business problems.

 

To help you improve diagnosis of root cause and reduce noise: 

Title: Similar Alerts Related List 

Short description: Uses AI to present, on the similar alerts tab, alerts that are similar to the current alert based on fields such as description, short description, metric name, and service.

Description: When diagnosing and working on an active alert, it's valuable to have the context of similar past alerts that have since closed, so you can read their details for resolution. It can also be useful to find similar open alerts, both to spot grouping opportunities that consolidate focus and work and to gather additional context for diagnosis. By enabling Similar Alerts, we add a sidepanel of AI-identified similar alerts at or above the defined confidence threshold, on the related alerts tab of your alerts. Increase the confidence threshold to display fewer, but more similar, results; decrease it to display more, but less similar, results.
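To make the effect of the confidence threshold concrete, here is a minimal sketch in Python. It is not the product's model: it assumes a simple word-overlap (Jaccard) score as a stand-in for the trained similarity score, and the names `similarity` and `similar_alerts` as well as the sample alert texts are hypothetical.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric word tokens of a text field."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def similarity(a: str, b: str) -> float:
    """Jaccard word overlap in [0, 1]; an assumed stand-in for the model's score."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def similar_alerts(current: str, candidates: list[str], threshold: float):
    """Return (candidate, score) pairs at or above the threshold, best first."""
    scored = [(c, similarity(current, c)) for c in candidates]
    kept = [(c, s) for c, s in scored if s >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical alert descriptions, for illustration only.
current = "High CPU utilization on payments database"
history = [
    "High CPU utilization on orders database",
    "CPU saturation on payments service",
    "Disk space low on build server",
]

loose = similar_alerts(current, history, threshold=0.3)   # more, less similar results
strict = similar_alerts(current, history, threshold=0.6)  # fewer, more similar results
```

With these sample texts, the lower threshold keeps two candidates and the higher threshold keeps only the closest one, which is the trade-off the setting controls.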

Where you’ll see it in the app: When opening the related alerts tab on an alert record, you’ll find a new sidepanel with the similar alerts for manual grouping and context. 

darius_koohmare_1-1666638326941.png

 

To help you learn from past events to more quickly resolve repeat issues: 

Title: Similar Incidents Sidepanel 

Short description: Uses AI to present, via a sidepanel flyout, incidents that are similar to the current incident based on fields such as description, short description, and service.

Description: When diagnosing and working on an incident, it's valuable to have the context of similar past incidents that have since closed, so you can read their timelines and postmortems for resolution. By enabling Similar Incidents, we add a sidepanel flyout of AI-identified similar incidents at or above the defined confidence threshold. Increase the confidence threshold to display fewer, but more similar, results; decrease it to display more, but less similar, results. The system will also recommend proposing a major incident if a defined threshold of similar incidents within a recent period has been breached, which is a common indicator of a mass outage.
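The major incident recommendation described above amounts to counting similar incidents inside a recent time window. The following Python sketch illustrates that idea only; the function name, the one-hour window, the count of three, and the choice to treat "at or above" the count as a breach are all assumptions, not the product's actual configuration.

```python
from datetime import datetime, timedelta


def should_propose_major_incident(
    similar_opened_at: list[datetime],
    now: datetime,
    window: timedelta,
    min_count: int,
) -> bool:
    """True when enough similar incidents were opened within the recent window.

    Assumption: reaching min_count counts as breaching the threshold.
    """
    recent = [t for t in similar_opened_at if now - window <= t <= now]
    return len(recent) >= min_count


# Hypothetical open times of incidents already judged similar to the current one.
now = datetime(2022, 10, 24, 12, 0)
opened = [
    datetime(2022, 10, 24, 11, 50),
    datetime(2022, 10, 24, 11, 40),
    datetime(2022, 10, 24, 11, 15),
    datetime(2022, 10, 24, 9, 0),   # outside a one-hour window
]

propose = should_propose_major_incident(opened, now, timedelta(hours=1), min_count=3)
```

Here three of the four similar incidents fall inside the last hour, so the sketch would recommend proposing a major incident; the decision to actually promote it stays with a human.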

Where you’ll see it in the app: You’ll find a new search result source and insight in the agent assist sidepanel of incidents to identify incidents similar to your current incident. 

darius_koohmare_2-1666638326944.png

 

To help you prioritize proactive resilience opportunities based on past data: 

Title: Alert Cluster Visualizations 

Short description: Uses AI to identify clusters of frequent similar alerts. 

Description: Uses similarity AI to identify clusters of similar alerts based on the text description and other defined columns (resource, metric, node, CI). It visualizes clusters of alerts based on defined minimum cluster sizes and offers drilldown into the underlying records for analysis. The largest clusters identified become automation and resilience opportunities: teams can focus on proactively preventing those alerts to reduce noise and improve system resilience. You can filter on cluster size and quality to prioritize larger, higher-quality clusters first.
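As a rough illustration of filtering on cluster size and quality, the Python sketch below ranks a handful of clusters. The `Cluster` type, the 0-to-1 `quality` score, and the sample values are hypothetical; the product computes and displays its own cluster metrics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    label: str      # hypothetical summary of the alerts in the cluster
    size: int       # number of alerts in the cluster
    quality: float  # assumed cohesion score in [0, 1]; higher is tighter


def prioritize(clusters: list[Cluster], min_size: int, min_quality: float) -> list[Cluster]:
    """Keep clusters meeting both minimums, largest (then highest quality) first."""
    kept = [c for c in clusters if c.size >= min_size and c.quality >= min_quality]
    return sorted(kept, key=lambda c: (c.size, c.quality), reverse=True)


# Illustrative data only.
clusters = [
    Cluster("disk full on node", size=120, quality=0.90),
    Cluster("certificate expiry", size=45, quality=0.95),
    Cluster("miscellaneous timeouts", size=300, quality=0.40),
    Cluster("one-off restart", size=3, quality=0.99),
]

top = prioritize(clusters, min_size=10, min_quality=0.8)
```

The large but loose "miscellaneous timeouts" cluster and the tiny "one-off restart" cluster are both filtered out, leaving the two clusters most likely to repay automation or prevention work.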

Where you’ll see it in the app: You will find the data surfaced in a cluster visualization on the solution model after running. 

darius_koohmare_3-1666638326949.png

 

To help you reduce alert noise and improve diagnosis context: 

Title: Automated Text Based Grouping Rule 

Short description: Uses AI to determine whether the text within two or more open alerts is similar and, if so, automatically groups them.

Description: Uses clustering AI to identify alerts similar to the current alert based on text similarity in fields such as the description, service, and metric name. If similarity is identified and the alerts were opened within a short timeframe of each other, they are automatically grouped under a virtual parent alert. This reduces the notifications that would be sent for standalone alerts, while also providing the diagnostic context of multiple alerts so you can see the forest for the trees. A generalized policy also reduces the time needed to implement grouping rules. Additional automated grouping logic based on the CMDB is also available.
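The grouping behavior described above can be sketched as a greedy pass over alerts in the order they were opened. This Python example is an assumption-laden illustration, not the product's algorithm: it uses word-overlap (Jaccard) similarity instead of a trained model, compares each alert only against the first alert of each group, and uses hypothetical names, thresholds, and alert data. Each `Group` stands in for the virtual parent alert.

```python
import re
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Alert:
    number: str
    description: str
    opened_at: datetime


@dataclass
class Group:
    """Stands in for a virtual parent alert and the alerts grouped under it."""
    members: list[Alert] = field(default_factory=list)


def _tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def _similarity(a: str, b: str) -> float:
    """Jaccard word overlap; an assumed stand-in for the trained model's score."""
    ta, tb = _tokens(a), _tokens(b)
    return len(ta & tb) / len(ta | tb) if ta and tb else 0.0


def group_alerts(alerts: list[Alert], threshold: float, window: timedelta) -> list[Group]:
    """Group alerts that are textually similar and opened close together in time."""
    groups: list[Group] = []
    for alert in sorted(alerts, key=lambda a: a.opened_at):
        for group in groups:
            first = group.members[0]
            close_in_time = alert.opened_at - first.opened_at <= window
            if close_in_time and _similarity(alert.description, first.description) >= threshold:
                group.members.append(alert)
                break
        else:
            # No existing group matched, so this alert starts a new one.
            groups.append(Group(members=[alert]))
    return groups


day = datetime(2022, 10, 24)
alerts = [
    Alert("A1", "Payment API latency high on node-1", day.replace(hour=12, minute=0)),
    Alert("A2", "Payment API latency high on node-2", day.replace(hour=12, minute=2)),
    Alert("A3", "Disk space low on build server", day.replace(hour=12, minute=3)),
    Alert("A4", "Payment API latency high on node-3", day.replace(hour=14, minute=0)),
]

groups = group_alerts(alerts, threshold=0.6, window=timedelta(minutes=10))
grouped_numbers = [[a.number for a in g.members] for g in groups]
```

In this example A1 and A2 end up together, while A4 has nearly identical text but was opened two hours later and so stays separate, showing why both the similarity and the timeframe conditions matter.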

Where you’ll see it in the app: You’ll find the text-based and CI-based automated grouping rules under the alert correlation properties and rules. You can enable them to have the system initiate auto grouping, which will add the secondary or child alerts to the alerts related list.

darius_koohmare_4-1666638326955.png

 

We’ve made sure to use AI/ML algorithms with low data requirements to ensure fast value realization early in your site reliability journey. While this post summarizes just four of the AIOps use cases delivered with ITOM & ITSM’s Site Reliability Operations using predictive intelligence, numerous other AI platform capabilities are also available to be configured and consumed. From a global AI-powered text search, to automated alert and incident enrichment via classification, to natural language querying of data, to raw event anomaly detection, we believe AIOps capabilities are core to augmenting your workflows and maximizing your incident response potential.