Bulk alert check before triggering the auto-incident workflow

SNExploreGuru
Tera Expert

Hi team,

I have the following setup:

Alert data from a couple of on-premises and SaaS-based monitoring tools is integrated with ServiceNow ITOM. The alerts are received in ITOM and converted to incidents via the Create Incident workflow, which is triggered from an alert management rule.

 

The scenario I have is this: sometimes, due to a network glitch or other issues, we get bulk alerts that are false alerts, and with the existing setup all of them get converted to incidents automatically.

I would like to have a check on bulk alerts before they are converted into incidents.

(Example: if 20-30 alerts are received for the same metric within 30-60 seconds, mark it as a bulk alert scenario, create only one incident stating "Bulk alerts received - please check", and tag all those alerts to this incident.)
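To make the example concrete, here is a minimal sketch of that check in plain JavaScript. It is only an illustration of the logic, not ServiceNow code: the alert shape ({ id, metric, ci, timeMs }) and the function names planIncidents and hasBurst are made up for this post, and the threshold and window are whatever values fit your environment.

```javascript
// Hypothetical alert shape: { id, metric, ci, timeMs }. This is NOT the alert table schema.
function planIncidents(alerts, threshold, windowMs) {
  // Group the alerts by metric name.
  const byMetric = new Map();
  for (const a of alerts) {
    if (!byMetric.has(a.metric)) byMetric.set(a.metric, []);
    byMetric.get(a.metric).push(a);
  }

  const incidents = [];
  for (const [metric, group] of byMetric) {
    group.sort((x, y) => x.timeMs - y.timeMs);
    if (hasBurst(group, threshold, windowMs)) {
      // Bulk scenario: one incident, with every alert for this metric tagged to it.
      incidents.push({
        shortDescription: 'Bulk alerts received - please check (' + metric + ')',
        alertIds: group.map(a => a.id)
      });
    } else {
      // Normal path: one incident per alert.
      for (const a of group) {
        incidents.push({
          shortDescription: metric + ' alert on ' + a.ci,
          alertIds: [a.id]
        });
      }
    }
  }
  return incidents;
}

// True if any time span of windowMs contains at least `threshold` alerts.
// `sorted` must be ordered by timeMs ascending.
function hasBurst(sorted, threshold, windowMs) {
  let start = 0;
  for (let end = 0; end < sorted.length; end++) {
    while (sorted[end].timeMs - sorted[start].timeMs > windowMs) start++;
    if (end - start + 1 >= threshold) return true;
  }
  return false;
}
```

For instance, planIncidents(alerts, 20, 60000) would collapse 25 availability alerts that arrive one second apart into a single incident, while two CPU alerts ten minutes apart would still get one incident each.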

 

Please let me know if this is feasible, and if so, I would really appreciate the code or steps.

 

Thanks,

Guru

5 REPLIES

pratiksha5
Mega Sage

In the alert management rule, filter by message key. That should minimize incident creation. We have also updated the flow to check whether there is already an open incident on the same CI. If one exists, we update the existing incident; if not, we create a new one.
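Roughly, the open-incident check works like the sketch below. This is plain JavaScript to show the decision only, not the actual flow: the record shapes, the function name createOrUpdateIncident, the generated incident numbers, and the rule that anything not 'Closed' counts as open are all assumptions for illustration.

```javascript
// Hypothetical shapes: alert = { id, ci }, incident = { number, ci, state, alertIds }.
function createOrUpdateIncident(alert, incidents) {
  // Assumption: any incident that is not 'Closed' counts as open.
  const open = incidents.find(i => i.ci === alert.ci && i.state !== 'Closed');
  if (open) {
    // An open incident exists on this CI: attach the alert instead of creating a new one.
    open.alertIds.push(alert.id);
    return { action: 'updated', incident: open };
  }
  // No open incident on this CI: create one (the number format is made up).
  const created = {
    number: 'INC' + String(incidents.length + 1).padStart(7, '0'),
    ci: alert.ci,
    state: 'New',
    alertIds: [alert.id]
  };
  incidents.push(created);
  return { action: 'created', incident: created };
}
```

Note that this check is keyed on the CI, so alerts on different CIs still produce one incident each.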

 

Hi Pratiksha, thanks for your response.

 

The message key changes for different CIs, which in turn marks each alert as unique rather than as a duplicate.

 

The suggested check in the flow does not solve my problem; it still creates incidents for all affected CIs (at least one incident per CI; I have 8000 servers under monitoring, so a network glitch means ~8000 incidents).

pratiksha5
Mega Sage

In that case, you can try preventing the flow from triggering when the state is re-open. But then you need to find a way to update the incident with a different flow...

For CIs with existing alerts, the incident-already-present check is applied and no additional incident is created; this is already in place.

But in the bulk alert scenario, most of the alerts are new, which causes the rise in incident volume.

 

So we need some mechanism that checks the alerts table for a high volume of alerts for the same metric (basically the availability alert type) within 60 seconds and then triggers some other flow or action, or notifies the responsible team.
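Something like the gate below is what I have in mind, written as plain JavaScript just to show the logic. The row shape ({ metric, createdMs }), the function names, and the action labels 'notify_team' and 'create_incident' are placeholders I made up; in an instance the rows would come from a query against the alerts table rather than an in-memory array.

```javascript
// Hypothetical row shape: { metric, createdMs }. Placeholder for rows read from the alerts table.
function countRecent(rows, metric, nowMs, windowMs) {
  // Count alerts for this metric created in the trailing window (nowMs - windowMs, nowMs].
  return rows.filter(r =>
    r.metric === metric &&
    r.createdMs > nowMs - windowMs &&
    r.createdMs <= nowMs
  ).length;
}

// Decide which path a newly arrived alert should take.
// options = { threshold, windowMs }; both values are tuning assumptions.
function chooseAction(rows, metric, nowMs, options) {
  const count = countRecent(rows, metric, nowMs, options.windowMs);
  return count >= options.threshold ? 'notify_team' : 'create_incident';
}
```

With a threshold of 20 and a 60000 ms window, 30 availability alerts in the last 30 seconds would route to 'notify_team', while a lone CPU alert would still go down the normal 'create_incident' path.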