benny_makovsky
ServiceNow Employee

In my previous blogs I shared an overview of our AIOps solution and the critical role it plays in today's modern world. In this blog we get under the hood and cover the different ML techniques we have implemented in our AIOps solution.
The world of Artificial Intelligence for IT Operations (AIOps) is vast and complex, with countless discussions dedicated to understanding its critical role in modern IT environments. In this article, we will delve deeper into AIOps and explore concrete examples of how it works its magic, particularly focusing on ServiceNow's predictive AIOps capabilities.

To begin, let's examine the main areas in which ServiceNow predictive AIOps applies machine learning (ML) logic:

 

1.1 Alert Grouping and Correlation

At the core of ServiceNow IT Operations Management (ITOM) AIOps is its capacity to intelligently group and correlate alerts through advanced machine learning algorithms. This sets ServiceNow apart from competitors in several ways:

 

  • Historical and Real-Time Context: The ML algorithms employed by ServiceNow analyze both historical and real-time data to recognize patterns and relationships between alerts. This enhances accuracy and ensures that the system can adapt to the evolving dynamics of the IT environment.
  • Topology-aware Grouping: When correlating alerts, ServiceNow ITOM AIOps takes into account the underlying infrastructure and application topology. This approach minimizes false positives and guarantees that correlated alerts are both meaningful and actionable.
  • Noise Reduction and Root Cause Analysis: ServiceNow's solution is designed to eradicate alert noise by identifying duplicate or irrelevant alerts, allowing IT teams to concentrate on critical issues. Additionally, it facilitates root cause analysis, making it simpler to identify and address the source of the problem.
  • Continuous Learning: As ServiceNow ITOM AIOps processes data and observes actions taken by IT teams, it continuously learns, refining its algorithms and improving alert grouping and correlation over time.

1.2 Metric Intelligence Anomaly Detection

ServiceNow ITOM AIOps distinguishes itself in metric intelligence anomaly detection through several key features:

  • Adaptive Thresholds: ServiceNow's solution uses dynamic, adaptive thresholds that automatically adjust to changes in the IT environment. This reduces false positives and ensures more accurate and relevant anomaly detection.
  • Multivariate Analysis: By employing multivariate analysis, ServiceNow ITOM AIOps considers multiple metrics and their relationships simultaneously. This helps identify complex, interrelated issues that may go undetected when analyzing individual metrics.
  • Seasonality and Trend Analysis: ServiceNow's algorithms can detect and account for seasonality and trends in the data, allowing them to identify anomalies more accurately and reducing false alarms.
  • Auto-Remediation and Actionable Insights: Not only does ServiceNow ITOM AIOps detect anomalies, but it also provides actionable insights and auto-remediation capabilities. This enables IT teams to promptly address issues and minimize downtime.

1.3 Predictive Log Analytics

ServiceNow ITOM AIOps offers several key differentiators and talking points related to log analytics:

  • Comprehensive Log Collection: By collecting log data from a wide variety of sources, including servers, applications, network devices, and cloud infrastructure, ServiceNow ITOM AIOps ensures that IT teams have a complete view of their environment and can make informed decisions based on all available data.
  • Intelligent Log Parsing: Advanced machine learning algorithms automatically parse and structure log data, regardless of the format. This intelligent log parsing capability allows IT teams to easily search, analyze, and visualize log data without having to manually define parsing rules or field extractions. The system also extracts negative sentiment keywords from log messages and clusters logs with similar pattern text, which are used for log-based anomaly detection, in addition to metrics extracted according to the automatic parsing.
  • Unsupervised Pattern Recognition and Log-based Anomaly Detection: Through the analysis of historical and real-time logs, it can identify abnormal activity, potential security risks, and performance problems before they become major incidents. The solution leverages unsupervised anomaly detection algorithms to detect anomalies in logs, enabling operator teams to uncover unfamiliar and emerging errors while proactively resolving known issues. 

 

Chapter 1: A Deep Dive into Alert Grouping and Correlation Techniques

As we explore the world of alert grouping and correlation, it's essential to understand the various techniques employed in this area. In this section, we will delve deeper into the different methods used for alert correlation and examine their advantages.

 

 

Automated Pattern-based Grouping using Conditional Probability (CP) and Mutual Information Graph Clustering

This method involves the use of conditional probability (CP) and mutual information for alert correlation. CP measures the probability of an event occurring, given that another event has already occurred. In the context of alert correlation, CP is used to determine the likelihood of two alerts being related. Mutual information, on the other hand, is a measure of the mutual dependence between two variables. In this case, it quantifies the amount of information gained about one alert from observing another alert. By combining CP and mutual information, we can group alerts based on the strength of their relationships.

Advantages of this method include:

  • Reduced false positives: By considering the relationships between alerts, this method can more accurately identify related alerts and reduce false positives.
  • Scalability: This approach works well with large datasets, efficiently handling the relationships between numerous alerts.
  • Provides insight into relationships: By analyzing the relationships between alerts, it helps identify root causes and dependencies among them.
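To make the statistics concrete, here is a minimal sketch of how conditional probability and pointwise mutual information can score the relatedness of two alert types from historical co-occurrence data. The window contents and alert names are hypothetical, and this illustrates only the underlying measures, not ServiceNow's actual graph-clustering implementation.

```python
from math import log2

# Historical alert occurrences per time window (hypothetical data):
# each set holds the alert types observed together in one window.
windows = [
    {"db_latency", "app_errors"},
    {"db_latency", "app_errors", "disk_full"},
    {"db_latency", "app_errors"},
    {"cpu_high"},
    {"cpu_high", "disk_full"},
    {"db_latency"},
]

def conditional_probability(a, b, windows):
    """P(a | b): fraction of windows containing b that also contain a."""
    with_b = [w for w in windows if b in w]
    if not with_b:
        return 0.0
    return sum(1 for w in with_b if a in w) / len(with_b)

def pointwise_mutual_information(a, b, windows):
    """PMI between the presence of a and the presence of b (in bits)."""
    n = len(windows)
    p_a = sum(1 for w in windows if a in w) / n
    p_b = sum(1 for w in windows if b in w) / n
    p_ab = sum(1 for w in windows if a in w and b in w) / n
    if p_ab == 0:
        return float("-inf")  # the two alerts never co-occur
    return log2(p_ab / (p_a * p_b))

# Pairs with a high conditional probability and a positive PMI are
# candidates for the same alert group.
print(conditional_probability("app_errors", "db_latency", windows))      # 0.75
print(pointwise_mutual_information("app_errors", "db_latency", windows))
```

A graph-clustering step would then treat these pairwise scores as edge weights and cut weak edges to form the final alert groups.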

Tag-based Clustering using Fuzzy Match Based on Levenshtein Distance Algorithm

When the Configuration Management Database (CMDB) is not fully matured, tag-based clustering can be used to group alerts based on their tags. The Levenshtein distance algorithm measures the similarity between two strings by counting the minimum number of single-character edits (insertions, deletions, or substitutions) required to transform one string into the other. By using a fuzzy match, we can group alerts with similar tags even if they are not exactly the same.

Advantages of this method include:

  • Tolerates imperfections: This approach is more forgiving when it comes to variations in tag names or spelling errors, making it suitable for environments with incomplete or inconsistent CMDB data.
  • Flexibility: Fuzzy matching can be adjusted to different levels of similarity, allowing for customizable clustering based on the specific needs of an organization.
  • Quick implementation: Since it doesn't rely on a fully matured CMDB, this method can be implemented and deliver value more quickly.
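The fuzzy match itself is straightforward to sketch. The snippet below implements the classic Levenshtein distance and a normalized similarity test; the tag names and the 0.8 threshold are illustrative assumptions.

```python
def levenshtein(s, t):
    """Minimum number of single-character edits to turn s into t."""
    # Classic dynamic-programming formulation, one row at a time.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (cs != ct)))    # substitution
        prev = cur
    return prev[-1]

def similar(tag_a, tag_b, threshold=0.8):
    """Fuzzy match: normalized similarity above a configurable threshold."""
    dist = levenshtein(tag_a.lower(), tag_b.lower())
    longest = max(len(tag_a), len(tag_b))
    return 1 - dist / longest >= threshold

# Tags with minor naming differences still land in the same cluster.
print(levenshtein("payment-svc", "payment-svc1"))  # 1
print(similar("web-prod-01", "web-prod-02"))       # True
```

Raising or lowering the threshold is what gives this approach its adjustable strictness.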

Text-based Clustering using K-Means Algorithm

Text-based clustering leverages the K-Means algorithm to group alerts based on the textual content of the alert messages. This method first converts text data into a numerical representation (usually through a vector space model) and then applies the K-Means algorithm to cluster alerts based on the similarity of their text representations.

Advantages of this method include:

  • Unsupervised learning: The K-Means algorithm does not require labeled data for training, making it a suitable option for environments where labeled data is scarce or unavailable.
  • Handles unstructured data: This method can handle and make sense of unstructured text data, which is common in alert messages.
  • Discovers hidden patterns: Text-based clustering can reveal patterns in the alert messages that might not be immediately apparent, leading to a better understanding of the underlying issues causing the alerts.
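A minimal, self-contained sketch of that pipeline: alert messages are turned into bag-of-words vectors and clustered with a plain implementation of K-Means (Lloyd's algorithm). A production system would typically use TF-IDF weighting and a tuned k; the messages below are hypothetical.

```python
import random
from collections import Counter

def vectorize(messages):
    """Bag-of-words vectors over the shared vocabulary of all messages."""
    vocab = sorted({w for m in messages for w in m.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for m in messages:
        v = [0.0] * len(vocab)
        for w, c in Counter(m.lower().split()).items():
            v[index[w]] = float(c)
        vectors.append(v)
    return vectors

def kmeans(vectors, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns a cluster id per vector."""
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        for i, v in enumerate(vectors):
            labels[i] = min(range(k), key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(v, centers[c])))
        # Update step: recompute each center as the mean of its members.
        for c in range(k):
            members = [vectors[i] for i, l in enumerate(labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

alerts = [
    "disk usage critical on host a",
    "disk usage critical on host b",
    "connection timeout to database",
    "connection timeout to cache",
]
labels = kmeans(vectorize(alerts), k=2)
# The two disk alerts end up sharing a label, as do the two timeouts.
print(labels)
```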

 

Chapter 2: A Deep Dive into Metric Intelligence Anomaly Detection

Metric Intelligence Anomaly Detection aims to identify and analyze anomalies in metric data collected from various monitoring sources. In this section, we discuss the working of the Metric Intelligence (MI) engine and different Machine Learning (ML) models and techniques used for anomaly detection.

How the MI Engine Works

The MI engine is a system that collects, processes, and analyzes metric data from various monitoring sources such as SCOM, SolarWinds, and Nagios XI server. It works through the following steps:

  1. Data Collection: Metric data is collected regularly from the source environment by monitoring systems such as SCOM, SolarWinds, and Nagios XI. Some of these systems are partially configured for metric collection by default.
  2. Data Processing: Metric Intelligence captures the raw data from these monitoring systems and uses event rules and the CMDB identification engine to map the data to existing Configuration Items (CIs) and their resources.
  3. Data Analysis: Once the data is mapped to CIs, the engine analyzes the data to detect anomalies and provide other statistical scores.
  4. Building Statistical Models: Metric Intelligence uses historical metric data to create statistical models. These models help project expected metric values along with their upper and lower bounds.
  5. Anomaly Detection: The engine then uses the projected metric values to detect statistical outliers and calculate anomaly scores. Anomalies are scored on a range of 0-10.
  6. Risk Assessment: High anomaly scores for CI metrics can indicate that a CI is at risk of causing a service outage. By monitoring these scores, IT teams can take proactive measures to prevent outages and maintain the stability of their systems.
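Steps 4 and 5 above can be sketched with a simple mean-and-bounds model: project the expected value from history, place upper and lower bounds around it, and map the distance outside the bounds onto the 0-10 score range. The 3-sigma band and the linear scaling are illustrative assumptions, not ServiceNow's actual formulas.

```python
from statistics import mean, stdev

def anomaly_score(history, value, band=3.0, cap=10.0):
    """Map a new observation to a 0-10 anomaly score.

    The projected value is the historical mean; the upper/lower bounds
    sit `band` standard deviations away. Observations inside the bounds
    score 0; beyond them the score grows with the normalized distance,
    capped at 10.
    """
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return 0.0 if value == mu else cap
    z = abs(value - mu) / sigma
    if z <= band:
        return 0.0
    return min(cap, (z - band) * 2.0)  # illustrative scaling only

cpu = [41, 43, 40, 42, 44, 41, 43, 42]  # hypothetical CPU% samples
print(anomaly_score(cpu, 42))   # inside the bounds -> 0.0
print(anomaly_score(cpu, 95))   # far outside      -> 10.0
```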

 

Different Types of ML Models Used

Due to the diverse and varied nature of the data, multiple techniques have been implemented to expand the coverage. Here are some of the key ML models and techniques used for anomaly detection:

  1. Time Series statistical model: Analyzes and forecasts time-dependent data. It identifies trends and seasonal patterns but is not adaptive to data pattern changes. The Time Series statistical models used include:
    • Weekly
    • Daily
    • Trendy
    • Noisy
    • Positive clipped noisy
    • Centered noisy
    • Skewed noisy
    • Skewed noisy - GEV Distribution
    • Accumulator
    • Near Constant
    • Multinomial
  2. Kalman Filter statistical model: Estimates the state of a linear dynamic system from a series of noisy measurements. It adapts to new data in real-time and is computationally efficient, but may not perform well with too much noise or unclear patterns.
  3. Local Level: Detects and adapts to permanent changes in the data. It identifies clusters of data points around a new value and updates the model to accommodate the change.
  4. Non-Parametric statistical model: Models data with an unknown or non-symmetrical noise distribution. It creates control bounds that better fit the actual data but does not adjust to changes in the data.
  5. Stationary Non-Parametric: Used for data that is not time-dependent. It models the relationships between data points without considering the time dimension.
  6. Median Absolute Deviation (MAD) statistical model: Deals with skewed or heavy-tailed noise distributions. It improves the detection of anomalies by approximately 30% but does not adjust to changes in the data.

These models and techniques are employed in different scenarios to analyze and detect anomalies in metric data, allowing organizations to proactively identify and address potential issues before they lead to service disruptions.
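As one concrete illustration, the MAD model (item 6) can be sketched with the standard modified z-score. The latency values and the 3.5 cutoff are illustrative, not taken from the product.

```python
from statistics import median

def mad_outliers(values, threshold=3.5):
    """Flag points whose modified z-score (based on MAD) exceeds threshold.

    The 0.6745 factor scales MAD so the score is comparable to a normal
    z-score under Gaussian noise; MAD itself is robust to the skewed,
    heavy-tailed distributions this model targets.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return [v for v in values if v != med]
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]

latencies = [12, 13, 12, 14, 13, 12, 250, 13]  # one heavy-tailed spike
print(mad_outliers(latencies))  # -> [250]
```

A mean-and-standard-deviation test on the same series would have its threshold inflated by the 250 ms spike itself, which is exactly the weakness MAD avoids.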

 

Chapter 3: A Deep Dive into Predictive Log Analytics

Predictive Log Analytics combines unsupervised machine learning techniques to analyze log data, identify patterns and anomalies, and predict potential issues. The process consists of four main steps:

  1. Log Parsing: Automatically parse log messages to identify metadata, important labels, and the human-readable log message.
  2. Log Message Clustering: Cluster similar log messages using an online graph-based dynamic learning algorithm.
  3. Anomaly Detection: Seven different types of anomaly detection are applied to each metric from every log source independently, using an online unsupervised learning approach. Anomalies are then passed through a decision tree to determine if they should be reported as alerts or considered irrelevant.
  4. Log-based Correlations: Log alerts are correlated based on extracted entities and temporal relationships, calculating a correlation score for each alert with every other open alert.
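Step 2 can be illustrated with a heavily simplified, offline stand-in for the online graph-based clustering: mask the variable tokens in each message so that structurally similar log lines collapse onto the same template. The regex and log lines below are illustrative assumptions.

```python
import re

VARIABLE = re.compile(r"\d+(\.\d+)*")  # numbers, IPs, durations, ids

def template(message):
    """Mask variable tokens so structurally similar lines share a key."""
    return VARIABLE.sub("<*>", message)

def cluster_logs(lines):
    """Group raw log lines under their masked template (a simplified,
    offline stand-in for an online clustering algorithm)."""
    clusters = {}
    for line in lines:
        clusters.setdefault(template(line), []).append(line)
    return clusters

logs = [
    "Request to 10.0.0.5 timed out after 3000 ms",
    "Request to 10.0.0.9 timed out after 1500 ms",
    "Disk /dev/sda1 is 97% full",
]
clusters = cluster_logs(logs)
print(len(clusters))  # the two timeout lines collapse into one template
```

Each cluster's occurrence count over time then becomes a metric that the anomaly detection in step 3 can track.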

This section delves into a comprehensive layer of anomaly detection and automatic root-cause analysis in data environments. The layer deals with tracking metrics, identifying abnormalities, predicting trends, and determining correlations to pinpoint emerging or breaking issues. By employing various algorithms and techniques, this method helps minimize the number of false alarms and provides valuable insights for better decision-making.

This layer consists of two main steps:

  • Anomaly Detection
  • Automatic Root-Cause Analysis

In the following sections, we will explore the intricacies of this sophisticated process and its various techniques.

 

Metric Classification

Before metrics can be analyzed for potential anomalies, they are classified into three general-behavior groups:

  • Lively metrics: Metrics with an average much larger than the variance, allowing for statistical testing.
  • Sparse metrics: Metrics that rarely appear, such as log lines written infrequently.
  • Stopped metrics: Metrics that toggle between active and flat states.

The primary goal is to identify when a metric should be active but isn't, which could indicate a fault in the source generating the metric.
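A rough sketch of such a classification, with purely illustrative heuristics for the three groups (the text above defines lively metrics by a mean much larger than the variance; here "lively" is simply the default once the other two checks fail):

```python
def classify_metric(samples, sparse_ratio=0.2, tail=5):
    """Classify a metric series into one of the three behaviour groups.

    Heuristics are illustrative only:
      - sparse:  non-zero samples are rare
      - stopped: the series was active but its recent tail is flat
      - lively:  everything else, i.e. regularly active data that is
                 amenable to statistical testing
    """
    nonzero = sum(1 for s in samples if s != 0)
    if nonzero / len(samples) < sparse_ratio:
        return "sparse"
    if any(samples[:-tail]) and not any(samples[-tail:]):
        return "stopped"
    return "lively"

print(classify_metric([10, 11, 10, 12, 11, 10, 11, 10, 11, 12]))  # lively
print(classify_metric([0, 0, 0, 1, 0, 0, 0, 0, 0, 0]))            # sparse
print(classify_metric([5, 6, 5, 7, 6, 0, 0, 0, 0, 0]))            # stopped
```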

 

Anomaly Detection Techniques

Lively metrics are tested for two types of abnormalities:

  • High-resolution anomalies: These include spikes, drops, and breakouts, typically observed at a resolution of seconds. A variety of algorithms are applied, resulting in soft-scores as output.
  • Trend prediction: This involves identifying changes in the baseline, such as a linear metric becoming asymptotic, adding or removing a frequency, or experiencing a phase shift.

Sparse metrics have a probability distribution function (PDF) calculated and tracked, with anomalies identified as highly improbable occurrences of the metric.

Stopped or flickering metrics are passed to the signal-flow analysis module. Here, low-pass filtering is applied, followed by an analysis of not only the timing of the drop but also its longevity. An anomaly is flagged if the signal drop extends beyond the normal pause time.

The various anomaly detection modules output soft-scores, which estimate the probability of the metric being anomalous at that moment.
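For sparse metrics in particular, the PDF-based test reduces to flagging occurrences the metric has rarely or never produced before. A minimal empirical sketch follows; the error counts and the 0.05 probability cutoff are hypothetical.

```python
from collections import Counter

def improbable(history, value, p_min=0.05):
    """Flag an occurrence whose empirical probability is below p_min.

    The Counter stands in for the tracked probability distribution
    function of a sparse metric: values the metric has rarely (or
    never) taken are treated as anomalous.
    """
    counts = Counter(history)
    return counts[value] / len(history) < p_min

daily_errors = [0] * 50 + [1] * 8 + [2] * 2  # 60 hypothetical observations
print(improbable(daily_errors, 0))  # common value -> False
print(improbable(daily_errors, 9))  # never seen   -> True
```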

 

SVM-Based Boosting Phase

The soft-scores serve as features for an SVM-based boosting phase. The SVM weights are continuously adjusted using self-tuning, user-direct feedback, and user indirect feedback. This process helps to minimize the number of false alarms and improves overall accuracy.
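A hedged sketch of the idea: the soft-scores become the feature vector of a linear SVM, and retraining on freshly labeled feedback stands in for the continuous weight adjustment. The SGD hinge-loss trainer below is a generic textbook implementation, not ServiceNow's; the scores and feedback labels are hypothetical.

```python
import random

def train_linear_svm(features, labels, epochs=200, lr=0.05, lam=0.01, seed=0):
    """Tiny SGD trainer for a linear SVM with hinge loss.

    `features` are soft-scores from the detection modules; `labels` are
    +1 (confirmed anomaly) or -1 (false alarm) from user feedback.
    """
    rng = random.Random(seed)
    w = [0.0] * len(features[0])
    b = 0.0
    data = list(zip(features, labels))
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:  # point violates the margin: push w toward y*x
                w = [wi - lr * (lam * wi - y * xi) for wi, xi in zip(w, x)]
                b += lr * y
            else:           # point is safe: only apply regularization
                w = [wi - lr * lam * wi for wi in w]
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Soft-scores from three hypothetical detectors; labels from feedback.
X = [[0.9, 0.8, 0.7], [0.8, 0.9, 0.6], [0.1, 0.2, 0.1], [0.2, 0.1, 0.3]]
y = [1, 1, -1, -1]
w, b = train_linear_svm(X, y)
print(predict(w, b, [0.85, 0.8, 0.9]))  # scores near the confirmed anomalies
```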

 

Automatic Root-Cause Analysis

The anomaly detection step typically produces multiple detections when something breaks. The purpose of the root-cause analysis step is to reduce these detections into a small number of hypotheses (up to 4) regarding the actual root cause. Algorithms are applied to establish relationships between anomalies, either through correlation or causality. The "causing" anomalies are then separated from the "symptoms," prioritized, and presented to the user.

 

Correlation Algorithms

Five different algorithms are applied to determine correlation:

  • Time-based correlation: Scores relationships based on the proximity of anomalies.
  • Anomaly-shape: Scores relationships based on the correlation between the amplitudes of the anomalous metrics.
  • Business-context: Scores higher if anomalies originate from the same application, service, or host.
  • Textual-attributes: Boosts scores if logs related to the anomalies share similar textual attributes, such as keywords or sentences.
  • Entity analysis: Scores relationships based on the presence of common entities in the logs of the anomalies, such as URLs, IP addresses, or filenames.

The algorithms output their respective scores, which are combined into an overall correlation score, and anomalies with sufficiently strong relationships are grouped together.
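The time-based component, for example, can be sketched as a linear decay over a correlation window; the 10-minute window and the decay shape are illustrative assumptions.

```python
from datetime import datetime, timedelta

def time_proximity_score(t_a, t_b, window=timedelta(minutes=10)):
    """Score 1.0 for simultaneous anomalies, decaying linearly to 0
    at the edge of the correlation window."""
    gap = abs(t_a - t_b)
    if gap >= window:
        return 0.0
    return 1 - gap / window

a = datetime(2024, 1, 1, 12, 0, 0)
b = datetime(2024, 1, 1, 12, 2, 0)
print(time_proximity_score(a, b))  # 2-minute gap -> 0.8
```

The other four algorithms would contribute their own scores on the same 0-1 scale, and a weighted combination would decide the final grouping.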

 

Summary

 

In this blog, we explored the different machine learning techniques implemented in AIOps solutions, focusing on ServiceNow's predictive AIOps capabilities. We delved into three main areas: Alert Grouping and Correlation, Metric Intelligence Anomaly Detection, and Predictive Log Analytics. In each area, we examined various techniques and algorithms used to enhance accuracy, minimize false positives, and provide actionable insights. These advanced methods, combined with continuous learning and adaptation, enable AIOps solutions to effectively manage complex IT environments and address critical issues proactively.

#AIOps #MachineLearning #ServiceNow #ITOM #AlertGrouping #AnomalyDetection #LogAnalytics #ITOperations #ITInfrastructure #DevOps