An overview of alerts for Event Management operators


Release version: Xanadu

Updated August 1, 2024

As an Event Management operator, you need to understand how an alert is generated from an event, what to look for in an alert, and how alerts can be grouped together.

This is the first lesson in the Event Management tutorial.


Lesson 1		An overview of events and alerts
Lesson 2		An overview of application services
Lesson 3		Event Management operator workspaces
Lesson 4		What operators do

Your organization already has an event monitoring tool in place, such as Microsoft System Center Operations Manager (SCOM), Nagios, or SolarWinds. When an issue occurs on your network, such as a computer going down or a database failure, the event monitoring tool sends events to your ServiceNow instance. The Event Management application processes the events according to the settings that your administrator configured, and then generates alerts. An alert is an indicator that the issue requires some type of action.

Figure 1. Alert generation (an operator view of Event Management)

As an Event Management operator, your role is to view alerts and, depending on how Event Management is implemented in your organization, take an action to help resolve the underlying issue or notify someone who can. Later in this tutorial, you will see the phases of a typical alert management process.

Alert priority and severity

The two most common characteristics of an alert are the priority and the severity.

The priority of an alert is a score that helps you determine how much the underlying issue impacts application services. Multiple factors determine the alert priority score. Your Event Management administrator can configure the algorithm that the Event Management application uses to calculate priority.

The severity of an alert is an indicator of how serious the underlying issue is. The event monitoring tool in your organization usually sends a severity value with the event, and that value is carried over to the alert. These are the default severity types that you will see in this tutorial:


Severity	Description
Critical	The resource is either not functional or critical problems are imminent.
Major	Major functionality is severely impaired or performance has degraded.
Minor	Partial, non-critical loss of functionality or performance degradation occurred.
Warning	Attention is required, even though the resource is still functional.
OK	No severity. An alert is created. The resource is still functional.
Clear	The alert no longer needs action.
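
To make the table easier to reason about, the following minimal Python sketch models the default severities as an ordered list. It is an illustration only, not part of Event Management: the `Severity` class, its rank numbers, and the `needs_attention` and `sort_by_severity` helpers are all assumptions introduced here.

```python
from enum import Enum


class Severity(Enum):
    # Member order follows the table above, most serious first.
    # The rank numbers are illustrative, not values used by Event Management.
    CRITICAL = 1
    MAJOR = 2
    MINOR = 3
    WARNING = 4
    OK = 5
    CLEAR = 6


def needs_attention(severity):
    """Return True for severities that the table describes as needing action or attention."""
    return severity in {
        Severity.CRITICAL,
        Severity.MAJOR,
        Severity.MINOR,
        Severity.WARNING,
    }


def sort_by_severity(alerts):
    """Order alerts so that the most serious ones come first."""
    return sorted(alerts, key=lambda alert: alert["severity"].value)


# Hypothetical alerts, used only to exercise the helpers.
alerts = [
    {"id": "A1", "severity": Severity.WARNING},
    {"id": "A2", "severity": Severity.CRITICAL},
    {"id": "A3", "severity": Severity.CLEAR},
]
ordered_ids = [alert["id"] for alert in sort_by_severity(alerts)]
```

In this sketch, the Critical alert sorts ahead of the Warning alert, and the Clear alert sorts last because it no longer needs action.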

Correlated alerts

Some alerts are related to each other. For example, if a router goes down, several separate alerts could be generated, one for each server connected to the router. All of these alerts are related, or correlated. To help you manage correlated alerts, Event Management can automatically group them and establish a two-level hierarchy with one root alert, called the primary alert, at the top, and other related alerts, called secondary alerts, under the primary alert. When you view alerts, primary alerts stand out by default so you know which alert to focus on without being distracted by the secondary alerts.

In our example, if a router goes down on your network, network communication is also affected for the connected servers, assuming they cannot reach any other routers. The router outage becomes the primary alert, and the alerts generated on the servers are secondary alerts that are correlated under the router alert.

Figure 2. Secondary alert generation (correlated alerts)
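
The router example can be sketched in a few lines of Python. This is a simplified illustration of a two-level grouping, not the way Event Management correlates alerts: the `Alert` and `AlertGroup` classes, the `upstream_of` topology map, and the `correlate` function are hypothetical names, and the sketch assumes that the topology contains no circular dependencies.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    alert_id: str
    ci: str  # name of the resource that the alert was raised on


@dataclass
class AlertGroup:
    primary: Alert
    secondaries: list = field(default_factory=list)


def find_root_ci(ci, upstream_of, alerting_cis):
    """Walk up the topology while the upstream device also has an open alert."""
    visited = {ci}
    while True:
        parent = upstream_of.get(ci)
        # Stop at the top of the chain; the visited check only guards against loops.
        if parent is None or parent not in alerting_cis or parent in visited:
            return ci
        visited.add(parent)
        ci = parent


def correlate(alerts, upstream_of):
    """Build two-level groups: one primary alert with its secondary alerts."""
    by_ci = {alert.ci: alert for alert in alerts}
    groups = {}
    for alert in alerts:
        root_ci = find_root_ci(alert.ci, upstream_of, by_ci.keys())
        group = groups.setdefault(root_ci, AlertGroup(primary=by_ci[root_ci]))
        if alert is not group.primary:
            group.secondaries.append(alert)
    return list(groups.values())


# Hypothetical topology: two servers that can reach the network only through one router.
upstream_of = {"server-1": "router-1", "server-2": "router-1"}
alerts = [
    Alert("ALERT-2", "server-1"),
    Alert("ALERT-1", "router-1"),
    Alert("ALERT-3", "server-2"),
    Alert("ALERT-4", "db-9"),
]
groups = correlate(alerts, upstream_of)
```

Here the router alert becomes the primary alert with the two server alerts as its secondary alerts, while the unrelated database alert stays in a group of its own.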

Depending on your organization's Event Management implementation, alerts might be grouped automatically based on correlation rules that your administrator sets up. Your instance can also learn to improve the way it correlates alerts based on input that you provide. As an operator, you should still verify the accuracy of the correlation and, if necessary, manually correlate additional alerts with the primary alert. Later in this tutorial, you will learn how to manually correlate alerts.

Alert flapping

An alert can flap, meaning that it receives multiple open and close events in rapid succession. Flapping indicates that Event Management cannot tell whether the underlying events are genuine. The events could indicate small issues with the way CIs are configured, or larger issues, such as network outages.

For example, if a server that hosts a web service has too many active processes, it might trigger an event about excessive CPU usage. Since CPU usage can fluctuate rapidly depending on web service requests, several events might be triggered, leading to the alert being put in the flapping state. As an operator, you might need to create an incident to have the server restarted, or someone might have to reconfigure the CPU, or possibly make a hardware change on the device.

As another example, consider a loose network cable that causes momentary, repeated network outages. The thresholds that your administrator configured might not be optimal for this kind of alert, so Event Management considers it a flapping alert.
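
One way to picture flapping is as a count of open and close changes inside a sliding time window. The sketch below illustrates that idea only. It does not show how Event Management detects flapping, and the `FlapDetector` class and its `window_seconds` and `min_changes` parameters are hypothetical stand-ins for the thresholds that your administrator configures.

```python
from collections import deque


class FlapDetector:
    """Flag an alert as flapping when its state changes too often in a short time."""

    def __init__(self, window_seconds=120.0, min_changes=4):
        self.window_seconds = window_seconds  # hypothetical threshold
        self.min_changes = min_changes        # hypothetical threshold
        self._last_state = None
        self._change_times = deque()

    def observe(self, timestamp, state):
        """Record an "open" or "closed" event and return True if the alert is flapping."""
        if self._last_state is not None and state != self._last_state:
            self._change_times.append(timestamp)
        self._last_state = state
        # Forget state changes that have fallen out of the sliding window.
        while self._change_times and timestamp - self._change_times[0] > self.window_seconds:
            self._change_times.popleft()
        return len(self._change_times) >= self.min_changes


# Rapid open and close events, as with fluctuating CPU usage or a loose cable.
rapid = FlapDetector()
rapid_results = [
    rapid.observe(t, state)
    for t, state in [(0, "open"), (10, "closed"), (20, "open"), (30, "closed"), (40, "open")]
]

# The same kind of events spread far apart do not count as flapping.
slow = FlapDetector()
slow_results = [
    slow.observe(t, state)
    for t, state in [(0, "open"), (300, "closed"), (600, "open")]
]
```

With these example thresholds, only the fifth rapid event crosses the limit of four state changes within 120 seconds. A wider window or a lower limit would flag the alert sooner, which is why poorly tuned thresholds can produce flapping alerts.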

Continue the tutorial

Proceed to the next lesson: Application services for Event Management operators.