Developing a Self-Healing IT Environment

May 7, 2020

ServiceNow is developing a self-healing IT environment

Tomer Mekhty

VP, Global Technology Operations, ServiceNow

ServiceNow is facing one of its biggest opportunities to date: developing a self-healing IT environment that makes proactive IT support a reality.

Although the concept of self-healing has been around for at least a decade, efforts to achieve it have fallen short. A lack of system intelligence prevented us from predicting and preventing many issues without human intervention at some point in the process.

AI technology is changing that paradigm. Thanks to intelligent operations, we can now provide proactive support with limited or no human interaction. Data-driven workflows can be used to automatically detect, analyze, and remediate issues before or just after they occur.

Moving from reactive to proactive

To guide the journey toward self-healing, we needed a framework: a structured, data-driven approach that would help us shift as many issues as possible from a reactive, human response to a proactive, automated one. The result is a practical framework for AIOps that classifies IT issues into three categories:

1. Respond only

In this category, issues are reported by people. They usually get routed to the IT Service Desk, which assesses the extent of the impact and calculates the priority. Even though this scenario is reactive by nature, I believe we can be intelligent about the actual impact and priority, and assign the issue to the most qualified operational team to accelerate resolution.

The information and data on the Now Platform® enable us to be intelligent about estimating the impact. For example, if the finance team reports that an enterprise resource planning (ERP) system is down during month-end close, the issue automatically becomes a P1.

Another recent example is customer support. After we moved our customer support staff to work from home, any voice issue reported by a support engineer automatically becomes a P1. You can correlate many different data points, such as persona, time, location, service, and application, to better understand the impact. This approach is better than asking an employee about the impact, which usually yields a subjective answer.
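
To make data-driven prioritization concrete, here is a minimal Python sketch of context-based priority rules covering the two examples above. The field names (`persona`, `service`, `remote`), the three-day month-end window, and the P3 fallback are hypothetical choices for illustration, not how the Now Platform actually implements impact assessment.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Report:
    persona: str          # hypothetical labels, e.g. "finance", "support_engineer"
    service: str          # e.g. "erp", "voice"
    reported_on: date
    remote: bool = False  # reporter is working from home


def is_month_end_close(d: date, window_days: int = 3) -> bool:
    """Treat the last `window_days` days of a month as month-end close (an assumption)."""
    # First day of the following month.
    if d.month == 12:
        next_month = date(d.year + 1, 1, 1)
    else:
        next_month = date(d.year, d.month + 1, 1)
    days_left = (next_month - d).days  # 1 on the last day of the month
    return days_left <= window_days


def priority(r: Report) -> str:
    # Rule 1: ERP outage reported by finance during month-end close.
    if r.persona == "finance" and r.service == "erp" and is_month_end_close(r.reported_on):
        return "P1"
    # Rule 2: voice issue reported by a support engineer working remotely.
    if r.persona == "support_engineer" and r.service == "voice" and r.remote:
        return "P1"
    # Fallback: the service desk assesses the issue as usual.
    return "P3"
```

The point of the design is that priority comes from context the platform already holds, so the reporter never has to judge impact.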

After the issue is resolved, we look at the root cause, again in a data-driven way. If the issue is systemic, we trigger a process or technology improvement to capture the missing signal, bring that data into ServiceNow® Event Management, and move the issue into the next category in the framework: prepare and respond.

2. Prepare and respond

In this category, we use ServiceNow IT Operations Management to first reduce monitoring noise by almost 99%. Then we generate real, actionable incidents by using event correlation, pattern recognition, and anomaly detection.
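
A rough picture of what noise reduction involves: the sketch below first suppresses repeats of the same (node, metric) event inside a time window, then groups the surviving events for the same service into one candidate incident when they arrive close together. This simple time-window grouping is a stand-in chosen for illustration; the `Event` fields, the sample data, and the 300-second window are assumptions, and ServiceNow's own correlation and pattern-recognition logic is not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    ts: int       # seconds since epoch
    node: str
    metric: str
    service: str


def deduplicate(events, window=300):
    """Drop repeats of the same (node, metric) that arrive within `window` seconds."""
    last_seen = {}
    kept = []
    for e in sorted(events, key=lambda e: e.ts):
        key = (e.node, e.metric)
        if key in last_seen and e.ts - last_seen[key] <= window:
            last_seen[key] = e.ts  # refresh, so a continuous flood stays suppressed
            continue
        last_seen[key] = e.ts
        kept.append(e)
    return kept


def correlate(events, window=300):
    """Group events per service; an event joins the current group if it is within
    `window` seconds of the group's latest event, otherwise it starts a new group."""
    by_service = {}
    for e in sorted(events, key=lambda e: (e.service, e.ts)):
        groups = by_service.setdefault(e.service, [])
        if groups and e.ts - groups[-1][-1].ts <= window:
            groups[-1].append(e)
        else:
            groups.append([e])
    return [g for groups in by_service.values() for g in groups]


# Hypothetical sample: a CPU alert firing every 10 s, a related disk alert, an unrelated memory alert.
SAMPLE_EVENTS = (
    [Event(t, "a", "cpu", "web") for t in range(0, 100, 10)]
    + [Event(50, "b", "disk", "web"), Event(5000, "c", "mem", "db")]
)
```

Here 12 raw events collapse to 3 after deduplication and to 2 candidate incidents after correlation.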

In my opinion, the ultimate outcome of AIOps is the ability to understand the exact impact of an infrastructure-related issue on a critical service, application, or end user. Compared with the previous category, IT is better prepared to respond: our teams can react quickly and minimize the impact on end users.
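
Impact analysis of this kind is often modeled as a walk over a dependency map. The following sketch assumes a hypothetical map from each configuration item (CI) to the items that depend on it directly, and uses a breadth-first search to collect everything that depends on a failing CI, directly or transitively. The CI names are invented for the example.

```python
from collections import deque

# Hypothetical dependency map: key = CI, value = CIs that depend on it directly.
DEPENDENTS = {
    "db-host-01": ["erp-db"],
    "erp-db": ["erp-app"],
    "erp-app": ["ERP service"],
    "switch-07": ["vpn-gw", "wifi-ctrl"],
    "vpn-gw": ["VPN service"],
    "wifi-ctrl": ["Wi-Fi service"],
}


def impacted(ci: str, dependents=DEPENDENTS) -> set:
    """Return every CI that depends on `ci`, directly or transitively (breadth-first)."""
    seen = set()
    queue = deque([ci])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen
```

For example, a failure on `switch-07` would surface both the VPN and Wi-Fi services as impacted, which is what lets a single infrastructure alert be prioritized by the services it touches.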

We perform the same impact analysis and dynamic prioritization as described above, but the resolution is still manual.

Many of our use cases fall into this category. Take, for example, critical third-party software-as-a-service (SaaS) applications. We can't prevent apps like video conferencing from going down, but we can be smart about triggering workflows, such as failover processes, or even proactively ordering new hardware if the problem is at the edge. This helps us mobilize operational teams quickly and focus on the right things.

3. Predict and prevent (self-healing)

In this category, a full-cycle AIOps process comes into play. IT can both predict and prevent issues, using machine learning to identify anomalies and then taking fully automated action. There is zero impact on end users and zero touch by the ops teams. Our operations are much more efficient because we've removed the human factor.
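
As a hedged illustration of the detect-then-act loop, the sketch below uses a plain z-score test as a stand-in for the machine learning models mentioned above, and runs a caller-supplied remediation function only when a reading is anomalous. The three-standard-deviation threshold and the `remediate` callback are assumptions for the example.

```python
import statistics
from typing import Callable, Optional, Sequence


def is_anomalous(history: Sequence[float], latest: float, threshold: float = 3.0) -> bool:
    """Flag `latest` if it is more than `threshold` standard deviations from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough data to estimate the spread
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)  # sample standard deviation
    if stdev == 0:
        return latest != mean
    return abs(latest - mean) / stdev > threshold


def self_heal(history: Sequence[float], latest: float,
              remediate: Callable[[], str]) -> Optional[str]:
    """Run the remediation automatically on an anomaly; return its result, or None if nothing ran."""
    if is_anomalous(history, latest):
        return remediate()
    return None
```

In a real deployment the remediation would be a tested, idempotent action (restarting a service, reallocating a resource), since no human reviews it before it runs.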

One of our most complex use cases in this category was also one of the first we could resolve proactively: our virtual private network (VPN) service. By identifying abnormalities and correlating them with endpoint device data, we were able to automate the restoration of VPN services.
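
One way to picture correlating abnormalities with endpoint data: look for an attribute value that nearly all affected devices share but the rest of the fleet mostly does not. The sketch assumes hypothetical device records keyed by host name and an 80% share cutoff; it surfaces a candidate cause for an automated fix rather than proving one.

```python
from collections import Counter


def suspect_attributes(affected, endpoints, min_share=0.8):
    """Return (attribute, value) pairs held by at least `min_share` of affected devices
    but by less than `min_share` of the unaffected devices."""
    affected = set(affected)
    if not affected:
        return []
    others = [d for d in endpoints if d not in affected]
    hits, base = Counter(), Counter()
    for device, attrs in endpoints.items():
        for pair in attrs.items():
            if device in affected:
                hits[pair] += 1
            else:
                base[pair] += 1
    suspects = []
    for pair, n in hits.items():
        affected_share = n / len(affected)
        other_share = base[pair] / len(others) if others else 0.0
        if affected_share >= min_share and other_share < min_share:
            suspects.append(pair)
    return sorted(suspects)


# Hypothetical fleet: three laptops with VPN anomalies, two without.
SAMPLE_ENDPOINTS = {
    "lt-1": {"vpn_client": "5.1", "os": "macOS"},
    "lt-2": {"vpn_client": "5.1", "os": "Windows"},
    "lt-3": {"vpn_client": "5.1", "os": "macOS"},
    "lt-4": {"vpn_client": "5.0", "os": "Windows"},
    "lt-5": {"vpn_client": "5.0", "os": "macOS"},
}
SAMPLE_AFFECTED = ["lt-1", "lt-2", "lt-3"]
```

With this sample, the only suspect is client version 5.1, which an automated workflow could then act on (for instance, by rolling the client back).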

Another use case was wireless network connectivity. We reduced the number of Wi-Fi-related issues by almost 70% in one year while our company size increased by 30%. By proactively remediating these issues, we bring operational costs down and employee productivity up.
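
Because the company grew over the same period, the per-employee improvement is larger than the headline figure. Treating "almost 70%" and "30%" as exact, and writing I_0 for the baseline number of Wi-Fi issues and N_0 for the baseline company size (symbols introduced here only for this calculation):

```latex
\frac{\text{issues per employee, after}}{\text{issues per employee, before}}
  = \frac{(1 - 0.70)\, I_0 \,/\, \bigl((1 + 0.30)\, N_0\bigr)}{I_0 / N_0}
  = \frac{0.30}{1.30} \approx 0.23
```

Under these assumptions, Wi-Fi issues per employee fell by roughly 77%.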

Implementing the framework at ServiceNow

We try to map every IT operational issue into one of these three buckets. The objective is to move as many as possible into the predict and prevent category for self-healing, especially those that directly affect critical services or applications. These issues usually require qualified L2 or L3 engineers to resolve.

So far, we can predict and prevent more than 20% of issues, focusing primarily on network connectivity, infrastructure resource allocation, and critical SaaS applications.
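
The bucketing step can be sketched as a simple decision rule. In this hypothetical model, an issue type qualifies for predict and prevent only when a monitoring signal exists and a fully automated fix is available; with a signal but no automated fix, it lands in prepare and respond; otherwise it stays respond only. The `IssueType` fields and sample issues are invented, and `share` shows how a coverage figure like the one above could be tracked.

```python
from dataclasses import dataclass
from enum import Enum


class Bucket(Enum):
    RESPOND_ONLY = "respond only"
    PREPARE_AND_RESPOND = "prepare and respond"
    PREDICT_AND_PREVENT = "predict and prevent"


@dataclass
class IssueType:
    name: str
    has_signal: bool        # monitoring data reaches event management before users notice
    auto_remediation: bool  # a tested, fully automated fix exists


def classify(issue: IssueType) -> Bucket:
    if issue.has_signal and issue.auto_remediation:
        return Bucket.PREDICT_AND_PREVENT
    if issue.has_signal:
        return Bucket.PREPARE_AND_RESPOND
    # Without a signal, even an automated fix cannot be triggered proactively.
    return Bucket.RESPOND_ONLY


def share(issues, bucket: Bucket) -> float:
    """Fraction of issue types that fall into `bucket`."""
    if not issues:
        return 0.0
    return sum(classify(i) is bucket for i in issues) / len(issues)


# Hypothetical inventory of issue types.
SAMPLE_ISSUES = [
    IssueType("vpn outage", True, True),
    IssueType("wi-fi drop", True, True),
    IssueType("saas outage", True, False),
    IssueType("broken laptop screen", False, False),
    IssueType("access request", False, False),
]
```

Moving an issue up a bucket then means closing a specific gap: adding the missing signal, or automating the fix.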

Our ultimate goal is to reduce issues reported by employees to as close to zero as possible. Achieving this stretch goal requires a significant shift in approach, technology, and sometimes people.

IT needs to embrace a data-driven culture and evolve from analyzing post-failure metrics to real-time data analytics for accurate prediction of future failures. Only then can self-healing take center stage.
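
Moving from post-failure metrics to prediction can start with something as simple as trend extrapolation. The sketch below fits a least-squares line to recent (hour, value) samples, such as disk or memory utilization, and estimates how many hours remain before the trend crosses a threshold. The linear model and the sample format are assumptions for illustration; production predictive models are usually richer.

```python
from typing import Optional, Sequence, Tuple


def hours_until_threshold(samples: Sequence[Tuple[float, float]],
                          threshold: float) -> Optional[float]:
    """Fit a least-squares line to (hour, value) samples and return the hours from the
    latest sample until the line reaches `threshold`; None if there is no upward trend."""
    n = len(samples)
    if n < 2:
        return None
    xs = [x for x, _ in samples]
    ys = [y for _, y in samples]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return None  # all samples at the same time: slope undefined
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / sxx
    if slope <= 0:
        return None  # flat or improving: no predicted crossing
    intercept = mean_y - slope * mean_x
    crossing = (threshold - intercept) / slope
    return max(0.0, crossing - max(xs))  # 0.0 if the trend is already past the threshold
```

For example, utilization readings of 50, 52, 54, and 56% over four hours project a 90% crossing about 17 hours after the last reading, early enough to trigger an automated resource reallocation.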

Find out how ServiceNow can help self-heal your IT.