Industry Trend
Over the last two decades, the introduction and evolution of virtualization, cloud, mobile, big data and IoT technologies have driven modernization and innovation in enterprises. Innovation has shifted from spending millions of dollars on R&D for new products to transforming the business to accommodate volatile, complex and uncertain conditions, increase competitiveness and improve market leadership. Companies like Netflix and Salesforce used technology to disrupt established industries and reinvent business models. Mobility has also placed pressure on enterprises to deliver services more reliably than before and to allow services to be consumed with far greater flexibility than was technically and economically possible with legacy technologies.
"New Era" of IT
As enterprises have modernized to become more competitive, IT has moved away from managing services based on static, monolithic infrastructure with infrequent releases toward continuous delivery models involving dynamic environments that mix physical and virtual resources. Enterprises are increasingly looking to IT for thought leadership in applying technology to solve business problems innovatively; this dramatic departure from being viewed as a cost center to being a strategic partner is heralding a new era in IT. A key focus for many enterprises today is analyzing the vast quantities of data collected from business services to gain insights into customer behavior and market trends. While analytical tools are valuable for processing this data, data scientists and business analysts are still needed to identify key metrics, cleanse and normalize the data, and then build the mathematical and statistical models that maximize the value of these tools.
In addition to the data captured in business services, IT also captures vast amounts of machine data generated by the IT and application infrastructure underlying those services. This machine data is typically captured by tools monitoring IT infrastructure such as servers, networks, storage, middleware, databases and applications. Performance metrics, log file data, infrastructure and security events, and increasingly application transaction performance are continuously collected. Often, processing of this data is limited to filtering, normalization, threshold crossing and basic correlation. Analyzing these data sets helps the enterprise bottom line by improving operational efficiency and reducing the cost of service downtime (measured by MTTR and MTBF) caused by infrastructure failures or cyber threats.
IT is typically challenged to relate the resulting events and metrics to business services, and is reliant on the native analytics in each monitoring tool, which vary in capability. While IT may seek out and adopt analytical tools, hiring data scientists and business analysts is typically a luxury IT cannot afford.
Need for Operational Intelligence
First-generation event management tools simply took events and log entries and presented them as alerts to hapless operations personnel. The result: alert storms with far too much information to be usable, where important information gets lost in a sea of noise. The problem for ITOps/DevOps has compounded as the technology stack has evolved over the last decade: the new norm for enterprise applications involves cloud infrastructure, containers, microservices and mobility, leading to explosive growth in machine data. Clearly, IT needs a better way to gain insights that optimize the performance and availability of business services, minimize or avoid impacts and risks, and do so in a cost-effective manner.
Every day, IT Operations personnel are drowning in hundreds of alerts, leading to alert fatigue. Personnel spend most of their time sifting through false alerts; identifying, prioritizing and triaging real alerts and assessing their impact; and finally applying manual corrective actions. Enterprises embracing DevOps with continuous delivery models typically see an increase in the number of IT resources that need to be managed. As that number grows, IT Operations is inundated with a deluge of operational events and metrics and struggles to derive the insights needed to make decisions.
Service Analytics and Operational Metrics
ServiceNow helps IT Operations deal with these challenges by first consolidating events from separate monitoring tools in a single platform. Events are collected or sent to Event Management, where they are filtered, normalized, de-duplicated and correlated to produce actionable alerts. Alerts are also correlated with configuration items (CIs), allowing Event Management to evaluate the impact on any business services a CI may be related to. Alerts can then be processed by rules to generate tasks (such as an Incident for a critical-severity alert) or to run an automated remediation task (such as restarting a stopped service or process). While Event Management has been successful in helping organizations improve the availability of their business services and reduce MTTR, ServiceNow has identified the need to further automate the processing of alerts, allowing IT Operations to avoid defining complex event rules and relying on specialists.
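As a rough illustration, the filter / normalize / de-duplicate stages described above can be sketched in a few lines of Python. The field names (`node`, `type`, `severity`) and the severity threshold are hypothetical stand-ins, not the actual Event Management schema:

```python
def process_events(raw_events, min_severity=3):
    """Toy event pipeline: filter, normalize, and de-duplicate into alerts.

    Illustrative only; field names and thresholds are assumptions, not
    the real Event Management data model.
    """
    alerts = {}
    for event in raw_events:
        # Filter: drop informational noise below the severity threshold
        if event["severity"] < min_severity:
            continue
        # Normalize: map tool-specific fields onto a common shape
        normalized = {
            "node": event["node"].lower(),
            "type": event["type"],
            "severity": event["severity"],
        }
        # De-duplicate: repeated events update one alert instead of creating new ones
        key = (normalized["node"], normalized["type"])
        if key in alerts:
            alerts[key]["count"] += 1
            alerts[key]["severity"] = max(alerts[key]["severity"],
                                          normalized["severity"])
        else:
            alerts[key] = {**normalized, "count": 1}
    return list(alerts.values())
```

In practice each stage is configurable (event rules, field mappings), but the shape of the pipeline is the same: many raw events in, few actionable alerts out.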
At ServiceNow, the Service Analytics team has been working for the past year to help solve these problems for our customers with an automated solution that uses statistical and machine-learning techniques. Three new capabilities have been introduced to the Event Management application with the following goals:
- Improve signal-to-noise ratio generated by alerts
- Reduce troubleshooting time and SME involvement
- Find issues before they impact a service and avoid generating incidents
The first new capability correlates alerts into groups based on time-based correlation and structural features. The main objective of Correlated Alert Groups is to reveal significant relationships between alerts so that:
- Alerts can be combined and treated as a single phenomenon.
- Key relationships between alerts can be discovered and stored for analysis.
- Noise can be eliminated by only displaying related alerts.
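The time-based side of this correlation can be illustrated with a toy sketch that groups alerts whose timestamps fall close together; the structural features the actual algorithm also uses (such as CI relationships) are omitted here, and the window size is an assumption:

```python
def group_alerts_by_time(alerts, window_seconds=300):
    """Group alerts separated by less than `window_seconds` into one bucket.

    A minimal sketch of time-based correlation only; the real capability
    combines this with structural features of the alerts.
    """
    groups = []
    current = []
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        # Start a new group when the gap to the previous alert is too large
        if current and alert["ts"] - current[-1]["ts"] > window_seconds:
            groups.append(current)
            current = []
        current.append(alert)
    if current:
        groups.append(current)
    return groups
```

Each resulting group can then be treated as a single phenomenon, as described above.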
The next capability is aimed at identifying the root causes of faults or problems within a business service. The Root Cause Analysis ("RCA") capability assigns probability values to the possible root causes of a service impact, enabling IT Operations personnel to focus investigations on the most probable root causes first, helping reduce MTTR. How it works: the RCA algorithm leverages service map relationships (discovered or manually defined) together with the alerts to determine which CI(s) are the most probable root cause of the current business service impact. RCA can also discover causal relationships between alerts that were unknown to the designer of the business service.
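As a simplified illustration of the idea (not ServiceNow's actual RCA algorithm), the sketch below scores each alerted CI by how many other alerted CIs depend on it in a toy service map, then normalizes the scores into probabilities; the map structure and CI names are hypothetical:

```python
def rank_root_causes(dependencies, alerted_cis):
    """Naive root-cause ranking over a toy service map.

    `dependencies` maps a CI to the CIs that depend on it. A CI whose
    failure can explain many other alerted CIs gets a higher probability.
    Illustrative heuristic only, not the real RCA algorithm.
    """
    def reachable(ci, seen=None):
        # Collect every CI transitively downstream of `ci`
        if seen is None:
            seen = set()
        for dep in dependencies.get(ci, []):
            if dep not in seen:
                seen.add(dep)
                reachable(dep, seen)
        return seen

    scores = {}
    for ci in alerted_cis:
        # Score = number of other alerted CIs this CI could have impacted
        impacted = reachable(ci) & set(alerted_cis)
        scores[ci] = len(impacted)
    total = sum(scores.values()) or 1
    # Normalize scores into probabilities over the candidate root causes
    return {ci: score / total for ci, score in scores.items()}
```

With a chain like database → application → web server all alerting, the database ends up with the highest probability, which matches the intuition of investigating the most probable cause first.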
The final capability enables IT Operations personnel to use Event Management's Operational Metrics to identify potential service outages and prevent them from occurring. Operational metrics, baselined against historical threshold data, indicate anomalous behavior of CIs that may not be captured by events. High anomaly scores for a CI's metrics can indicate that the CI is at risk of causing a service outage. Anomalies can be promoted to alerts on the alert console and service health dashboard for preventive action. Operational metrics are collected in a similar way to events and forwarded to a ServiceNow instance for processing. In the Istanbul release, ServiceNow provides an out-of-the-box (OOTB) connector to collect metrics from Microsoft SCOM.
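A minimal stand-in for this kind of anomaly scoring is a z-score of the current metric value against its history. The real Operational Metrics baselines are more sophisticated (accounting for seasonality, for example), and the promotion threshold below is an assumption:

```python
from statistics import mean, stdev

def anomaly_score(history, current):
    """Z-score of the current value against its historical baseline.

    A toy sketch; the actual Operational Metrics models are more advanced.
    """
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return 0.0
    return abs(current - mu) / sigma

def maybe_promote_to_alert(history, current, threshold=3.0):
    # Promote an anomaly to an alert once it clears the score threshold,
    # mirroring how high-scoring anomalies can surface on the alert console
    score = anomaly_score(history, current)
    return {"score": score, "promote": score >= threshold}
```

A CPU metric hovering around its baseline produces a low score and stays invisible, while a sudden spike clears the threshold and is promoted for preventive action.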
Next Steps:
- Please read the awesome blogs by Aleck Lin & Ben Yukich on Service Analytics & Operational Metrics.
- There are multiple sessions and hands-on labs at Knowledge 17 to learn more about Operational Intelligence.
- Stay tuned for the groundbreaking features we are releasing in the Jakarta release.