An overview of alerts for Event Management operators
Summary of An overview of alerts for Event Management operators
This guide introduces Event Management operators to the fundamentals of alerts within the ServiceNow Event Management application. It explains how alerts are generated from events sent by external monitoring tools such as Microsoft SCOM, Nagios, or SolarWinds, and describes the operator’s role in viewing and acting on these alerts to resolve network or service issues.
Alert Generation and Operator Role
- Events are generated by monitoring tools when network or service issues occur.
- These events are processed by ServiceNow Event Management, which creates alerts indicating that action is needed.
- Operators monitor these alerts, take remediation actions, or escalate issues based on their organization’s implementation.
Alert Priority and Severity
- Priority: A calculated score indicating how much the alert affects application services. Administrators can configure how the score is calculated.
- Severity: Indicates the seriousness of the issue, typically passed from the monitoring tool. Default severity levels include Critical, Major, Minor, Warning, OK, and Clear.
Correlated Alerts and Alert Grouping
- Related alerts arising from a common root cause (e.g., a router failure affecting multiple servers) are grouped automatically by Event Management or manually by operators.
- A two-level hierarchy is established with a primary alert at the top and secondary alerts underneath, helping operators focus on the main issue.
- Operators can verify and manually adjust alert correlations to ensure accuracy.
Alert Flapping
- Flapping occurs when alerts open and close repeatedly in a short time, indicating uncertain underlying conditions.
- Causes might include fluctuating resource usage or intermittent hardware/network issues.
- Operators may need to investigate further, potentially creating incidents or recommending configuration/hardware changes to resolve flapping alerts.
Next Steps
After understanding alert basics, operators are encouraged to proceed to lessons on application services to deepen their knowledge of Event Management workflows.
As an Event Management operator, you need to understand how an alert is generated from an event, what to look for in an alert, and how alerts can be grouped together.
This is the first lesson in the Event Management tutorial.
| Lesson 1 | An overview of events and alerts |
| Lesson 2 | |
| Lesson 3 | |
| Lesson 4 | |
Your organization already has an event monitoring tool in place, such as Microsoft System Center Operations Manager (SCOM), Nagios, or SolarWinds. When an issue occurs on your network, such as a computer going down or a database failure, the monitoring tool sends events to your ServiceNow instance. The Event Management application processes the events according to the settings that your administrator configured, and then generates alerts. An alert is an indicator that the issue requires some type of action.
As an Event Management operator, your role is to view alerts and, depending on how Event Management is implemented in your organization, take an action to help resolve the underlying issue or notify someone who can. Later in this tutorial, you will see the phases of a typical alert management process.
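The event-to-alert flow described above can be pictured with a small sketch. This is a simplified illustration, not the logic that Event Management actually runs: the `Event` and `Alert` classes, the `(source, node, resource)` grouping key, and the rule that a `Clear` event closes the alert are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Event:
    source: str       # monitoring tool that sent the event, e.g. "Nagios"
    node: str         # affected host or device
    resource: str     # component on the node, e.g. "CPU"
    severity: str     # severity label supplied by the monitoring tool
    description: str


@dataclass
class Alert:
    key: tuple        # hypothetical grouping key: (source, node, resource)
    severity: str
    description: str
    event_count: int = 0
    state: str = "Open"


def process_events(events):
    """Fold a stream of events into alerts, one alert per (source, node, resource)."""
    alerts = {}
    for ev in events:
        key = (ev.source, ev.node, ev.resource)
        alert = alerts.get(key)
        if alert is None:
            alert = Alert(key=key, severity=ev.severity, description=ev.description)
            alerts[key] = alert
        # The most recent event determines what the alert currently shows.
        alert.event_count += 1
        alert.severity = ev.severity
        alert.description = ev.description
        # Assumption for this sketch: a Clear event means no action is needed.
        alert.state = "Closed" if ev.severity == "Clear" else "Open"
    return alerts


events = [
    Event("Nagios", "db-01", "Database", "Critical", "Database is not responding"),
    Event("Nagios", "web-01", "CPU", "Warning", "CPU usage is high"),
    Event("Nagios", "web-01", "CPU", "Clear", "CPU usage is back to normal"),
]
alerts = process_events(events)
for alert in alerts.values():
    print(alert.key, alert.severity, alert.state, alert.event_count)
```

The point of the sketch is that several events about the same resource feed one alert, so an operator works with a short list of alerts rather than the raw event stream.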
Alert priority and severity
- The priority of an alert is a score that helps you determine how significant the alert's impact on application services is. Multiple factors determine the alert priority score. Your Event Management administrator can configure the algorithm that the Event Management application uses to calculate priority.
- The severity of an alert is an indicator of how serious the underlying issue is. The event monitoring tool in your organization usually sends a severity value with the event, which is then carried over to the alert. These are the default severity levels that you will see in this tutorial:
| Severity | Description |
| Critical | The resource is either not functional or critical problems are imminent. |
| Major | Major functionality is severely impaired or performance has degraded. |
| Minor | Partial, non-critical loss of functionality or performance degradation occurred. |
| Warning | Attention is required, even though the resource is still functional. |
| OK | No severity. An alert is created. The resource is still functional. |
| Clear | The alert no longer needs action. |
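As a rough illustration of how the default severity levels can drive triage, the sketch below ranks alerts from most to least severe. The numeric ranks in `Severity` are made up for this example and are not claimed to match the values that your instance stores; the dictionary layout of an alert is also hypothetical.

```python
from enum import IntEnum


class Severity(IntEnum):
    # Illustrative ranks only: a lower number means a more urgent alert.
    CRITICAL = 1
    MAJOR = 2
    MINOR = 3
    WARNING = 4
    OK = 5
    CLEAR = 6


def parse_severity(label):
    """Map a severity label such as "Major" to its Severity member."""
    return Severity[label.strip().upper()]


def triage_order(alerts):
    """Return alerts sorted from most to least severe (the sort is stable)."""
    return sorted(alerts, key=lambda a: parse_severity(a["severity"]))


def needs_action(alert):
    """Per the table above, a Clear alert no longer needs action."""
    return parse_severity(alert["severity"]) is not Severity.CLEAR


queue = [
    {"id": "A1", "severity": "Warning"},
    {"id": "A2", "severity": "Critical"},
    {"id": "A3", "severity": "Clear"},
    {"id": "A4", "severity": "Major"},
]
for alert in triage_order(queue):
    print(alert["id"], alert["severity"], needs_action(alert))
```

Severity is only one input. The priority score described earlier also reflects the impact on application services, so a real queue is unlikely to be ordered by severity alone.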
Correlated alerts
Some alerts are related to each other. For example, if a router goes down, several separate alerts could be generated, one for each server connected to the router. All of these alerts are related, or correlated. To help you manage correlated alerts, Event Management can automatically group them and establish a two-level hierarchy with one root alert, called the primary alert, at the top, and other related alerts, called secondary alerts, under the primary alert. When you view alerts, primary alerts stand out by default so you know which alert to focus on without being distracted by the secondary alerts.
In our example, if a router goes down on your network, network communication is also affected for connected servers, assuming they cannot reach any other routers. The router outage becomes the primary alert and the alerts generated on the server are secondary alerts that are correlated under the router alert.
Depending on your organization's Event Management implementation, alerts might be grouped automatically based on correlation rules that your administrator sets up. Your instance can also learn how to improve the way it correlates alerts based on these rules. As an operator, you should still verify the accuracy of the correlation and, if necessary, manually correlate additional alerts with the primary alert. Later in this tutorial, you will learn how to manually correlate alerts.
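The two-level hierarchy from the router example can be sketched as follows. This is a minimal model under stated assumptions, not the correlation logic of Event Management: the `depends_on` mapping is hypothetical topology data, and the rule used here (an alert becomes secondary to the topmost alerting item it depends on) is only one possible correlation rule.

```python
def find_primary(ci, depends_on, alerting):
    """Walk up the dependency chain while the parent also has an open alert."""
    seen = {ci}
    current = ci
    while True:
        parent = depends_on.get(current)
        if parent is None or parent not in alerting or parent in seen:
            return current
        seen.add(parent)
        current = parent


def group_alerts(alerting_cis, depends_on):
    """Return {primary: [secondaries]} for the items that have open alerts."""
    alerting = set(alerting_cis)
    groups = {}
    for ci in alerting_cis:
        primary = find_primary(ci, depends_on, alerting)
        if primary == ci:
            groups.setdefault(ci, [])            # primary or standalone alert
        else:
            groups.setdefault(primary, []).append(ci)  # secondary alert
    return groups


# Hypothetical topology: two servers sit behind router-1, db-9 behind router-2.
depends_on = {"server-a": "router-1", "server-b": "router-1", "db-9": "router-2"}
open_alerts = ["router-1", "server-a", "server-b", "db-9"]
groups = group_alerts(open_alerts, depends_on)
for primary, secondaries in groups.items():
    print(primary, "->", secondaries)
```

Because `find_primary` walks to the top of the chain, a longer dependency path still collapses into two levels. In the example, `db-9` stays a standalone alert because `router-2` has no open alert, which is the kind of grouping an operator would still want to verify by hand.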
Alert flapping
An alert can flap, meaning that it gets multiple open-close events in rapid succession. Flapping indicates that Event Management does not know whether the underlying events are genuine or not. The events could indicate small issues with the way CIs are configured, or larger issues, like network outages.
For example, if a server that hosts a web service has too many active processes, it might trigger an event about excessive CPU usage. Since CPU usage can fluctuate rapidly depending on web service requests, several events might be triggered, leading to the alert being put in the flapping state. As an operator, you might need to create an incident to have the server restarted, or someone might have to reconfigure the CPU, or possibly make a hardware change on the device.
As another example, consider a loose network cable that causes momentary, repeated network outages. The thresholds that your administrator configures might not be optimal for this kind of alert, so Event Management considers it a flapping alert.
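One way to picture flapping is as a count of open and close transitions inside a time window. The sketch below assumes a five-minute window and a minimum of four state changes. Both thresholds are hypothetical values chosen for the example, not product defaults, and the function is not the detection logic that Event Management uses.

```python
def is_flapping(transitions, window_seconds=300, min_changes=4):
    """Report whether an alert changed state too often in a short time.

    transitions: list of (timestamp_seconds, state) tuples sorted by time,
                 where state is "open" or "closed".
    Returns True if at least min_changes state changes fall within any
    span of window_seconds.
    """
    # Timestamps at which the state differs from the previous entry.
    change_times = [
        t
        for (t, state), (_, prev_state) in zip(transitions[1:], transitions)
        if state != prev_state
    ]
    start = 0
    for end in range(len(change_times)):
        # Shrink the window from the left until it spans window_seconds or less.
        while change_times[end] - change_times[start] > window_seconds:
            start += 1
        if end - start + 1 >= min_changes:
            return True
    return False


# A loose cable: the link drops and recovers every 30 seconds.
loose_cable = [(0, "open"), (30, "closed"), (60, "open"), (90, "closed"), (120, "open")]
# The same number of changes spread over more than an hour.
slow_changes = [(0, "open"), (1000, "closed"), (2000, "open"), (3000, "closed"), (4000, "open")]

print(is_flapping(loose_cable))
print(is_flapping(slow_changes))
```

Widening the window or lowering `min_changes` makes the check more sensitive, which mirrors the trade-off an administrator faces when tuning thresholds for alerts like the CPU and loose-cable examples.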
Continue the tutorial
Proceed to the next lesson: Application services for Event Management operators.