Event Management - Questions about de-duplication and correlation.

dan_tembe · ‎03-05-2018

I am using ServiceNow to collect events from Data & UC devices.

I am trying to figure out if there is a way to create one incident for each scenario below. I searched but couldn't find much that pointed me in the right direction.

Scenario 1 -

The carrier has an issue, and every trunk group that fails sends a separate event. I want to collapse all trunk group outages into a single Incident for a period of 5 minutes, but if a new trunk group alert arrives after the 5-minute interval, a new Incident should be created.

I can write an event rule to match a trunk group wildcard for a particular device and create a single alert, which will in turn create a single Incident, but I cannot figure out how to put a time limit on this before treating later alerts as a separate Incident.

Scenario 2 -

Tool A monitors a device via polling, and there is an event reporting that a device interface is down. That event is sent for every poll, so I get a "device down" alert for each and every polling cycle. Right now I am able to create a single Alert and then a single Incident, but every subsequent event is de-duplicated under the original alert and then the incident.

I am trying to create an alert on the 1st down event but then ignore the remaining down events until the clear alert comes through or the alert is manually acknowledged when the incident is closed. Just to keep EEM more efficient.

Scenario 3 -

Tool A and Tool B are monitoring the same device but different parts of it. For example, Tool A is monitoring the HW and Tool B is monitoring at the application level. Has anyone used the alert correlation rules to write rules that are not extremely specific, in order to lower Incident counts?

What I mean is, I am trying to see if there is a way to match an alert from Tool A to an event from Tool B when their "node" values are the same. I want to avoid writing these correlation rules for each node. Instead I want to define one rule: if the Tool A alert's node field value is equal to the Tool B alert's node field value, create only 1 incident, or de-duplicate the Tool B alerts into the Tool A alert.

As always, appreciate any help provided. Thanks in advance for looking into my questions.

Best Regards,
Dan

P.S. - Just want to put a note in that I am looking for a way to do this with native ServiceNow ITOM features, not a paid add-on.

Thanks!

Dan

robertgeen · ‎03-05-2018

Dan,

I'm going to take a shot at this, but what you have asked here are some pretty complex examples that usually require quite a bit of planning time. I'll do my best to give you some of my ideas off the top of my head.

Scenario 1 -

The carrier has an issue, and every trunk group that fails sends a separate event. I want to collapse all trunk group outages into a single Incident for a period of 5 minutes, but if a new trunk group alert arrives after the 5-minute interval, a new Incident should be created.

I can write an event rule to match a trunk group wildcard for a particular device and create a single alert, which will in turn create a single Incident, but I cannot figure out how to put a time limit on this before treating later alerts as a separate Incident.

-> This is a very interesting requirement. You will most likely have to modify the out-of-the-box scheduled job that actually does the incident creation. By doing this you can add code that checks for like alerts and assigns them all to the same incident, and that also checks whether more than 5 minutes have passed, in which case it creates a new incident. This actually wouldn't be too bad to write, as you are just doing a simple check before you open the incident if the alert matches a certain source/type. Check out the scheduled job "Event Management - create/resolved incident", as this will most likely be where you put the code (or in one of the helper script includes).
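The check described above can be sketched in plain JavaScript. This is only an illustration of the time-window decision, not the content of the scheduled job: `assignIncident`, `openGroups`, `createIncident`, and the `createdAt` field are hypothetical names, and no ServiceNow API is called. It assumes the 5-minute window starts at the first alert of a group and does not slide.

```javascript
// Hypothetical helper: decide whether an alert joins an existing incident
// or starts a new one. Plain JavaScript, no ServiceNow APIs involved.
const WINDOW_MS = 5 * 60 * 1000; // the 5-minute window from the question

// openGroups: Map keyed by "source|type" -> { incidentId, openedAt }
// alert:      { source, type, createdAt } with createdAt in milliseconds
// createIncident: callback that opens an incident and returns its id
function assignIncident(openGroups, alert, createIncident) {
  const key = alert.source + '|' + alert.type;
  const group = openGroups.get(key);

  // Reuse the incident while the alert falls inside the window
  // that started with the first alert of the group.
  if (group && alert.createdAt - group.openedAt <= WINDOW_MS) {
    return group.incidentId;
  }

  // No group yet, or the window has expired: open a new incident
  // and restart the window from this alert.
  const incidentId = createIncident(alert);
  openGroups.set(key, { incidentId: incidentId, openedAt: alert.createdAt });
  return incidentId;
}
```

In the real job the map would be replaced by a lookup of recent alerts with the same source/type, and the callback by whatever the job already does to open an incident.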

Scenario 2 -

Tool A monitors a device via polling, and there is an event reporting that a device interface is down. That event is sent for every poll, so I get a "device down" alert for each and every polling cycle. Right now I am able to create a single Alert and then a single Incident, but every subsequent event is de-duplicated under the original alert and then the incident.

I am trying to create an alert on the 1st down event but then ignore the remaining down events until the clear alert comes through or the alert is manually acknowledged when the incident is closed. Just to keep EEM more efficient.

-> You're making work for yourself here. I would just let the de-duplication do its thing, as it will match to the same alert each time and will update the severity if it changes. Once the clear comes through, everything will close out nicely.
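To make that behavior concrete, here is a toy model in plain JavaScript. It is not how the platform implements de-duplication: `processEvent`, `messageKey`, `eventCount`, and the state strings are illustrative names, and severity 0 is assumed to mean "clear".

```javascript
// Toy model of de-duplication: events sharing a key fold into one alert.
// alerts: Map keyed by messageKey -> alert object
// event:  { messageKey, severity } where severity 0 is assumed to mean "clear"
function processEvent(alerts, event) {
  let alert = alerts.get(event.messageKey);

  if (!alert || alert.state === 'Closed') {
    // A clear with nothing open to close is ignored.
    if (event.severity === 0) return null;
    // First "down" event for this key: open a new alert.
    alert = { messageKey: event.messageKey, severity: event.severity, state: 'Open', eventCount: 0 };
    alerts.set(event.messageKey, alert);
  }

  // Every repeat updates the same alert instead of creating another one.
  alert.eventCount += 1;
  alert.severity = event.severity;
  if (event.severity === 0) alert.state = 'Closed';
  return alert;
}
```

Each polling cycle only touches the one existing alert, which is why the repeated "down" events cost little once the first alert exists.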

Scenario 3 -

Tool A and Tool B are monitoring the same device but different parts of it. For example, Tool A is monitoring the HW and Tool B is monitoring at the application level. Has anyone used the alert correlation rules to write rules that are not extremely specific, in order to lower Incident counts?

What I mean is, I am trying to see if there is a way to match an alert from Tool A to an event from Tool B when their "node" values are the same. I want to avoid writing these correlation rules for each node. Instead I want to define one rule: if the Tool A alert's node field value is equal to the Tool B alert's node field value, create only 1 incident, or de-duplicate the Tool B alerts into the Tool A alert.

-> You could do this with a correlation rule that matches on the source and type fields where the CI is the same. This will fold one alert into the other and open only 1 incident for them. It requires good naming for the types within your event sources, so that you can filter on source = X and type = Y for the parent, and source = X and type = Y for the child, where both have the same CI. If you need something more advanced than this, create a script in the correlation rule and return the IDs you want tied together.

I hope this helps. It should meet your criteria of using only the out-of-the-box mechanisms.

dan_tembe · ‎03-05-2018

Hello Robert,

Thanks for the input you provided. Taking your advice, I am going to proceed as below -

Scenario 1 -

I am going to read through the Incident creation rules and figure out where it makes the most sense to update my rules. I have made some changes to the incident handler and the custom populator in the past to pass alert data and event data into various custom fields in the Incident, so I need to read through the scripts to understand where I can add this logic. I appreciate this pointer, as I now know where to focus to address this issue.

Scenario 2 -

The auto resolve / Ack at Alert & Incident level is working, so I am going to leave this alone. I just edited the properties in Event management to display 10 work notes instead of the default 20 or so.

After reading your response, I think it is best for me to leave this to work as designed instead of trying to "optimize" unnecessarily and break the current down/clear logic. Thanks!

Scenario 3 -

This is the one that I am having the hardest time getting my head around. I think I understand your response, but I need to refocus on this tomorrow with a clear head, mainly because we don't use the CMDB yet, or tie CIs to alerts or incidents. We just have a lookup table to match customer names based on values in an event field from various tools. I need to think this through some more to get a handle on your recommendation.

Thanks again for your response and guidance. I think I have some reading and testing in Dev to do.

Best Regards

Dan

robertgeen · ‎03-06-2018

No problem, Dan. For Scenario 3, since you don't have CIs, you will definitely have to use a script in the correlation rule. Then you can do something like the pseudocode below:

If the current alert's source is <values to correlate> and its type is <values to correlate>:

Run a GlideRecord query for any other records with that source and type combination that have the same node value as currentAlert (I'm assuming node is set even though CIs don't exist).

Add the sys_id of each found alert to the result object under Secondary (note: if you are doing it the opposite way, where you are looking forward to be safe, then add the result to Primary).

Either way, play with this and you should be able to get it working the way you want it to 🙂
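A minimal sketch of that pseudocode in plain JavaScript follows. It only models the matching logic: `RULE`, `matches`, `correlateByNode`, and the `primary`/`secondary` result keys are hypothetical, the source and type values are placeholders, and `candidateAlerts` stands in for the rows a GlideRecord query on the alert table would return. The exact object a correlation rule script must return is not shown in this thread, so check it against your instance.

```javascript
// Hypothetical rule: which source/type pair is the parent and which is the child.
const RULE = {
  primary: { source: 'ToolA', type: 'hardware' },
  secondary: { source: 'ToolB', type: 'application' }
};

function matches(alert, filter) {
  return alert.source === filter.source && alert.type === filter.type;
}

// currentAlert and candidateAlerts are plain objects: { sys_id, source, type, node }.
function correlateByNode(currentAlert, candidateAlerts, rule) {
  const result = { primary: [], secondary: [] };
  if (!currentAlert.node) return result; // nothing to match on

  // Other alerts that report the same node value.
  const sameNode = candidateAlerts.filter(
    (a) => a.sys_id !== currentAlert.sys_id && a.node === currentAlert.node
  );

  if (matches(currentAlert, rule.primary)) {
    result.primary.push(currentAlert.sys_id);
    sameNode.filter((a) => matches(a, rule.secondary)).forEach((a) => result.secondary.push(a.sys_id));
  } else if (matches(currentAlert, rule.secondary)) {
    result.secondary.push(currentAlert.sys_id);
    sameNode.filter((a) => matches(a, rule.primary)).forEach((a) => result.primary.push(a.sys_id));
  }

  // Only report a correlation when both sides were found.
  if (result.primary.length === 0 || result.secondary.length === 0) {
    return { primary: [], secondary: [] };
  }
  return result;
}
```

Because the comparison is on the node value itself, one rule covers every node and nothing has to be written per device.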