Alert Correlation behavior and possible solutions (Rule Based)

Piyush26
Tera Contributor

With rule-based correlation, closing the Primary alert also closes its Secondary alerts. This works in most cases, but in some cases it causes serious concerns.

We receive alerts from Application Monitoring sources for the Systems which are customer facing.

 

A system has various components, such as multiple instances and a central service, so when a system goes through a degradation or outage we may receive multiple alerts. Today we group them via rule-based correlation into Primary and Secondary alerts. The problem arises when one instance out of many comes back up and its alert happens to be the Primary: this closes the Primary alert and its Secondary alerts, even though the system as a whole might still be down, and the team may lose insight into the outage or degradation. Therefore, the customer has the following ask.

 

  • If the Primary alert is being closed, do not close the Secondary alerts. Instead, promote the next Secondary (based on the alert's time of arrival) to Primary and keep the association with the INC as it is. (We cannot implement this with tag-based clustering, because it only allows configuring tags and does not allow scripting.)
  • Moreover, the tag-based clustering solution has a timing issue on the platform: the clustering itself takes time, and incident creation happens before it completes.
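
The promotion asked for in the first point can be sketched as follows. This is a plain-Python model of the intended behavior only, not ServiceNow code: the `Alert` class, its field names, the function `close_primary_and_promote`, and the incident number `INC001` are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alert:
    # Hypothetical stand-in for an alert record; not the em_alert schema.
    number: str
    arrived_at: int                   # arrival time, e.g. epoch seconds
    state: str = "Open"
    parent: Optional[str] = None      # number of the primary alert; None for a primary
    incident: Optional[str] = None    # incident linked to the group, if any


def close_primary_and_promote(alerts: dict[str, Alert], primary_number: str) -> Optional[str]:
    """Close the primary and promote the earliest-arrived open secondary.

    Returns the number of the new primary, or None if no open secondary is left.
    """
    primary = alerts[primary_number]
    primary.state = "Closed"

    open_secondaries = [
        a for a in alerts.values()
        if a.parent == primary_number and a.state == "Open"
    ]
    if not open_secondaries:
        return None

    # "Next secondary" is taken to mean the one that arrived first.
    new_primary = min(open_secondaries, key=lambda a: a.arrived_at)
    new_primary.parent = None
    new_primary.incident = primary.incident   # keep the existing INC association

    # Remaining secondaries now point at the promoted alert.
    for a in open_secondaries:
        if a is not new_primary:
            a.parent = new_primary.number
    return new_primary.number


alerts = {
    "A1": Alert("A1", arrived_at=100, incident="INC001"),
    "A2": Alert("A2", arrived_at=130, parent="A1"),
    "A3": Alert("A3", arrived_at=110, parent="A1"),
}
promoted = close_primary_and_promote(alerts, "A1")
print(promoted, alerts["A2"].parent, alerts["A3"].incident)
```

In this model, closing A1 leaves A2 and A3 open, makes A3 (the earlier arrival) the new Primary, and carries the incident reference over to it.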

 

Considering the scale at which ServiceNow ITOM Event Management is implemented, the out-of-the-box central settings are proving to be a hindrance.

 

Can someone please advise us on the above?

2 REPLIES

Fabian Kunzke
Mega Sage

Hello,

 

This is such an interesting requirement! Also, you can't just solve this with an alert management rule for when the primary alert closes, because alert management rules don't run on closed alerts.

 

However, I would also not use a business rule for this. Ideally, we don't add business rules on either the em_event or the em_alert table.

 

From my perspective, your best approach is to address the grouping issue itself. From what I can gather, the issue is that the grouping promotes one of the alerts to be the primary alert. In an ideal scenario, none of these alerts would be a primary; instead, a new alert would be generated with all the existing ones grouped under it.

That way, if any of the alerts is closed, none of the others gets closed. One can still close the newly created primary alert once an actual solution is found. However, for this specific use case the new alert must be created BEFORE any outage/incident is generated.

 

And this is the exact alert management rule I would create to solve this issue. It triggers whenever an alert of this specific type is promoted to a primary alert. Before all other alert management rules run (e.g. for incident generation), this new rule creates a copy of the primary alert and demotes the original one to secondary. Then all secondary alerts are re-referenced to the newly created alert.
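
As a rough illustration of that re-grouping, here is a plain-Python model. It is only a sketch under assumptions, not ServiceNow code and not an actual alert management rule: the `Alert` class, `insert_group_alert`, `close_alert`, and the alert numbers are hypothetical, and the cascade in `close_alert` simply mirrors the "closing a primary closes its secondaries" behavior described in the question.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class Alert:
    # Hypothetical stand-in for an alert record; not the em_alert schema.
    number: str
    description: str
    state: str = "Open"
    parent: Optional[str] = None   # None marks a primary alert


def insert_group_alert(alerts: dict[str, Alert], primary_number: str, group_number: str) -> Alert:
    """Copy the promoted primary into a new group alert and re-parent the group under it."""
    original = alerts[primary_number]
    group = replace(original, number=group_number, state="Open", parent=None)
    alerts[group_number] = group

    for alert in alerts.values():
        # The original primary is demoted, and its secondaries follow it to the copy.
        if alert.number == primary_number or alert.parent == primary_number:
            alert.parent = group_number
    return group


def close_alert(alerts: dict[str, Alert], number: str) -> None:
    """Close one alert; cascade to children only when the closed alert is a primary."""
    alerts[number].state = "Closed"
    if alerts[number].parent is None:
        for alert in alerts.values():
            if alert.parent == number:
                alert.state = "Closed"


alerts = {
    "A1": Alert("A1", "Instance 1 down"),
    "A2": Alert("A2", "Instance 2 down", parent="A1"),
    "A3": Alert("A3", "Central service down", parent="A1"),
}
insert_group_alert(alerts, "A1", "G1")

close_alert(alerts, "A1")   # instance 1 recovers; A1 is now a secondary, so no cascade
states_after_recovery = {n: a.state for n, a in alerts.items()}

close_alert(alerts, "G1")   # the group alert is closed once the system is really back
states_after_group_close = {n: a.state for n, a in alerts.items()}
```

In this model, the recovery of a single instance closes only its own alert, while closing the group alert G1 still closes everything under it.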

 

Usually I am not a big fan of these "dummy alert" solutions, but in this case I'd make an exception. I feel it is probably the best solution for this exact case moving forward. However, it may require some changes to how outages are generated at the moment, so some more detailed concept work will be needed.

 

Regarding your second point, I'd reach out to the ServiceNow support team. These delays are likely a scalability issue of the platform resources.

 

Hope this helps

Regards

Fabian

Thanks, Fabian, for your insightful response. Let me explore this option on our sub-production instance.