An overview of alerts for Event Management operators
As an Event Management operator in ServiceNow, you handle alerts generated from events sent by external event monitoring tools like Microsoft SCOM, Nagios, or SolarWinds. These events are processed within ServiceNow’s Event Management application based on configured settings, producing alerts that indicate issues requiring attention or action.
Your role involves reviewing these alerts and either taking steps to resolve the underlying issues or escalating them to the appropriate team. Understanding how alerts are generated, prioritized, and correlated helps you manage incidents effectively.
Alert Priority and Severity
Alerts have two key attributes to help you assess their importance:
- Priority: A calculated score reflecting the impact of the issue on application services. This score is determined by an administrator-configured algorithm.
- Severity: Indicates the seriousness of the problem, usually passed from the event monitoring tool. Common severity levels include Critical, Major, Minor, Warning, OK, and Clear, each describing the impact from total failure to no action needed.
Correlated Alerts and Grouping
Multiple alerts related to a single root cause can be automatically grouped to simplify management. For example, if a router failure affects connected servers, the router alert appears as the primary alert, with server alerts grouped as secondary alerts underneath.
This grouping helps you focus on resolving the main issue first without distraction. Although correlation rules are generally automated and can improve over time with learning, you should verify correlations and manually adjust them if necessary.
Alert Flapping
Alerts may "flap," meaning they rapidly open and close multiple times, which suggests uncertainty about the underlying problem's validity. Flapping can result from transient issues like fluctuating CPU usage or intermittent network outages caused by hardware issues such as loose cables.
When alerts flap, you may need to initiate further investigation or incidents to address potential configuration or hardware problems.
Next Steps
To deepen your understanding of Event Management, proceed to the next tutorial lesson on application services, which expands on how services relate to alerts and events.
As an Event Management operator, you need to understand how an alert is generated from an event, what to look for in an alert, and how alerts can be grouped together.
This is the first lesson in the Event Management tutorial.
| Lesson 1 | An overview of events and alerts |
| Lesson 2 | |
| Lesson 3 | |
| Lesson 4 | |
Your organization already has an event monitoring tool in place, such as Microsoft System Center Operations Manager (SCOM), Nagios, or SolarWinds. When an issue occurs on your network, such as a computer going down or a database failure, the event monitoring tool sends events to your ServiceNow instance. The Event Management application processes the events according to the settings that your administrator configured, and then generates alerts. An alert is an indicator that the issue requires some type of action.
As an Event Management operator, your role is to view alerts and, depending on how Event Management is implemented in your organization, take an action to help resolve the underlying issue or notify someone who can. Later in this tutorial, you will see the phases of a typical alert management process.
Alert priority and severity
- The priority of an alert is a score that helps you determine how important the impact is to application services. Multiple factors determine the alert priority score. Your Event Management administrator can configure the algorithm that the Event Management application uses to calculate priority.
- The severity of an alert is an indicator of how serious the underlying issue is. The event monitoring tool in your organization usually sends severity values with the event, which then gets carried over in the alert. These are the default severity types that you will see in this tutorial:

| Severity | Description |
| --- | --- |
| Critical | The resource is either not functional or critical problems are imminent. |
| Major | Major functionality is severely impaired or performance has degraded. |
| Minor | Partial, non-critical loss of functionality or performance degradation occurred. |
| Warning | Attention is required, even though the resource is still functional. |
| OK | No severity. An alert is created. The resource is still functional. |
| Clear | The alert no longer needs action. |
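
As a rough illustration of how the default severity types can drive triage, the following Python sketch orders a handful of alerts from most to least severe and skips those that are Clear. The numeric ranks, the `triage_order` helper, and the alert dictionaries are all assumptions made for this example; they are not the values or APIs that Event Management uses.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Default severity types from this lesson.

    The numeric ranks are an assumption used only for sorting in this
    sketch; they are not the values stored on a ServiceNow alert record.
    """
    CRITICAL = 1
    MAJOR = 2
    MINOR = 3
    WARNING = 4
    OK = 5
    CLEAR = 6


def needs_action(severity: Severity) -> bool:
    """Clear means the alert no longer needs action."""
    return severity is not Severity.CLEAR


def triage_order(alerts: list[dict]) -> list[dict]:
    """Return alerts that still need action, most severe first."""
    actionable = [a for a in alerts if needs_action(a["severity"])]
    return sorted(actionable, key=lambda a: a["severity"])


# Hypothetical alerts for illustration only.
alerts = [
    {"id": "A1", "severity": Severity.WARNING},
    {"id": "A2", "severity": Severity.CRITICAL},
    {"id": "A3", "severity": Severity.CLEAR},
    {"id": "A4", "severity": Severity.MAJOR},
]
ordered_ids = [a["id"] for a in triage_order(alerts)]
print(ordered_ids)  # ['A2', 'A4', 'A1']
```

In practice, priority (the administrator-configured score) would also feed into this ordering; the sketch uses severity alone to keep the idea visible.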
Correlated alerts
Some alerts are related to each other. For example, if a router goes down, several separate alerts could be generated, one for each server connected to the router. All of these alerts are related, or correlated. To help you manage correlated alerts, Event Management can automatically group them and establish a two-level hierarchy with one root alert, called the primary alert, at the top, and other related alerts, called secondary alerts, under the primary alert. When you view alerts, primary alerts stand out by default so you know which alert to focus on without being distracted by the secondary alerts.
In our example, if a router goes down on your network, network communication is also affected for connected servers, assuming they cannot reach any other routers. The router outage becomes the primary alert, and the alerts generated on the servers are secondary alerts that are correlated under the router alert.
Depending on your organization's Event Management implementation, alerts might be grouped automatically based on correlation rules that your administrator sets up. Your instance can also learn to improve the way it correlates alerts, based on these rules and on input that you give. As an operator, you should still verify the accuracy of the correlation and, if necessary, manually correlate additional alerts with the primary alert. Later in this tutorial, you will learn how to manually correlate alerts.
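
The two-level hierarchy described in this section can be pictured with a small Python sketch. It models the router example: one primary alert with the server alerts attached as secondaries, followed by one alert added by hand. The `Alert` class, its fields, the alert numbers, and the `group_under_primary` helper are hypothetical names for illustration and do not reflect how Event Management stores or correlates alerts.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    number: str   # hypothetical alert identifier
    ci: str       # configuration item the alert is about
    secondaries: list["Alert"] = field(default_factory=list)


def group_under_primary(primary: Alert, related: list[Alert]) -> Alert:
    """Attach related alerts as secondaries of the primary alert.

    The hierarchy has two levels only, so any secondaries that a related
    alert already carries are moved directly under the primary.
    """
    for alert in related:
        if alert is primary:
            continue
        primary.secondaries.append(alert)
        primary.secondaries.extend(alert.secondaries)
        alert.secondaries = []
    return primary


# Automatic grouping: the router outage is the primary alert.
router = Alert("Alert0001", "router-01")
servers = [Alert(f"Alert000{i}", f"server-0{i}") for i in (2, 3, 4)]
group = group_under_primary(router, servers)

# Manual correlation: an operator adds one more related alert.
group_under_primary(router, [Alert("Alert0005", "server-05")])

secondary_numbers = [a.number for a in group.secondaries]
print(group.number, secondary_numbers)
```

Flattening nested secondaries is a design choice of this sketch that keeps the structure at exactly two levels, matching the primary and secondary roles described above.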
Alert flapping
An alert can flap, meaning that it gets multiple open-close events in rapid succession. Flapping indicates that Event Management does not know whether the underlying events are genuine or not. The events could indicate small issues with the way CIs are configured, or larger issues, like network outages.
For example, if a server that hosts a web service has too many active processes, it might trigger an event about excessive CPU usage. Since CPU usage can fluctuate rapidly depending on web service requests, several events might be triggered, leading to the alert being put in the flapping state. As an operator, you might need to create an incident to have the server restarted, or someone might have to reconfigure the CPU, or possibly make a hardware change on the device.
As another example, consider a loose network cable that causes momentary, repeated network outages. The thresholds that your administrator configures might not be optimal for this kind of alert, and Event Management considers it a flapping alert.
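
To make the idea of "multiple open-close events in rapid succession" concrete, here is a minimal Python sketch of one way flapping could be detected: count state changes inside a sliding time window and flag the alert once the count reaches a threshold. The `FlapDetector` class and both thresholds (four changes within five minutes) are assumptions for this example, not the settings or logic that Event Management actually applies.

```python
from collections import deque
from datetime import datetime, timedelta
from typing import Optional


class FlapDetector:
    """Flags an alert as flapping when it changes state too often.

    Both thresholds are hypothetical defaults for this sketch, not
    ServiceNow settings.
    """

    def __init__(self, max_changes: int = 4,
                 window: timedelta = timedelta(minutes=5)):
        self.max_changes = max_changes
        self.window = window
        self._changes = deque()  # timestamps of recent state changes
        self._last_state: Optional[str] = None

    def record(self, state: str, at: datetime) -> bool:
        """Record an 'open' or 'close' event; return True if flapping."""
        if state != self._last_state:
            self._changes.append(at)
            self._last_state = state
        # Drop state changes that have fallen out of the time window.
        while self._changes and at - self._changes[0] > self.window:
            self._changes.popleft()
        return len(self._changes) >= self.max_changes


# Simulate a loose cable: open/close events 30 seconds apart.
detector = FlapDetector()
start = datetime(2024, 1, 1, 12, 0, 0)
results = [
    detector.record(state, start + timedelta(seconds=30 * i))
    for i, state in enumerate(["open", "close", "open", "close"])
]
print(results)  # [False, False, False, True]

# A single event ten minutes later is no longer treated as flapping.
later = detector.record("open", start + timedelta(seconds=690))
print(later)  # False
```

Thresholds that are too tight flag ordinary fluctuations as flapping, while thresholds that are too loose miss a genuinely unstable resource, which is why the administrator-configured values matter.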
Continue the tutorial
Proceed to the next lesson: Application services for Event Management operators.