Posted on 09-10-2018 04:09 AM
Why are we discussing this?
I had previously posted this on Thwack and then on LinkedIn, and it was suggested that I also post it here. I'm not sure how many of you are using the event management suite in ServiceNow or something similar, but I wanted to put this out there to help you avoid some of the pitfalls we ran into when we rolled it out. There are a lot of nice features available to help reduce outages and increase efficiency for your team and (if you have one) your EOC/NOC. Our journey started as a search for a way to stop multiple incidents from being created for the same outage because alerts came from multiple tools. By running those alerts through an event management process, we are able to correlate and consolidate them, display them on a service map, and generate only a single incident.
Definitions you need to know
- Event Management – The process responsible for managing events throughout their lifecycle. Event management is one of the main activities of IT operations. It is a way to consolidate all events/alerts from disparate monitoring systems in one place, giving you more information while reducing noise for your teams. Not all events should become alerts, and not all alerts should become incidents.
- Event – A change of state that has some significance for the management of an IT service or a configuration item. These records vary greatly in importance, from noting that a device was added to monitoring to telling you a data center is offline.
- Alert – A notification that a threshold was breached, something changed, or a failure occurred. Monitoring tools create and manage alerts, and the event management process manages the lifecycle of an alert. An alert must first have been an event.
- Incident – An unplanned interruption to an IT service or reduction in the quality of an IT service. Failure of a configuration item that has not yet affected service is also an incident.
- Noise – Alerts that are unneeded, duplicated, or correlated to a larger issue.
- Signal – Unique alerts that are usable, actionable, and result in either the creation of an incident or automated remediation.
Why ServiceNow?
There are other tools that can do many of the same things I talk about here. I am focusing on ServiceNow because that is where I have experience building out the integration from monitoring tools. Regardless of the tool you use for event management, the discussion should still help your journey.
Why would I want to use an event management tool?
While each tool has its own way of handling events, alerts, and incident creation, the tools do not talk to each other. Unless a single tool handles all of your monitoring, you will likely run into cases where two or more tools generate an incident for the same thing. This is avoidable by using something like the Event Management module inside ServiceNow to reduce noise. Other benefits include the ability to track alerts per device, mute alerts during a change window via change records, look for trends, create reports, and of course build automation to eliminate repetitive tasks caused by alerts (e.g., restarting a hung service).
Where do I start?
When you have multiple tools handling your monitoring for the enterprise, which tool should you start with? That depends on a lot of factors. Is there a tool that's super easy to integrate, one with the most reliable alerts, or one that most of IT is using? To throw some buzzworthy phrases at you: "Don't try to boil the ocean," "Get the low-hanging fruit," and "What gets you the most bang for your buck." Put simply: start small (you can always expand), knock out the easy stuff if it makes sense, but ultimately go after whatever provides the greatest benefit for the least effort. I will leave that decision to you, but I will tell you what we did to get where we are today.
We started with a single tool that hosted a lot of our monitoring. There was no out-of-the-box connector for it in ServiceNow, and at the time the tool was opening incidents by sending emails. We found that while we could not pull data from that tool through an API call, it could push data out via API. We built the integration so that when the tool generates an alert, it sends the alert to ServiceNow's event table through the API. That allowed me to build rules on how to handle these events.
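For anyone building a similar push, here is a rough sketch of what that API call can look like. It assumes the standard Event Management inbound web service endpoint and a Node-style sender; the host names, credentials, and field values are placeholders, so verify the endpoint and field names against your own instance and version.

```javascript
// Sketch only: pushing a monitoring alert into the ServiceNow event table.
// Endpoint and field names reflect the common Event Management inbound API,
// but check them against your instance; all values below are placeholders.
const instance = 'https://yourinstance.service-now.com';

const payload = {
  records: [{
    source: 'Vistara',                       // which tool raised the event
    node: 'appserver01.example.com',         // hostname, used to match the CI
    type: 'High CPU',                        // what kind of condition
    resource: 'CPU',                         // what on the node is affected
    severity: '2',                           // numeric 1-5 (see hints later)
    message_key: 'Vistara:appserver01:CPU',  // groups recurrences and clears
    time_of_event: '2018-05-19 16:43:00',
    description: 'CPU above threshold for 10 minutes',
    additional_info: JSON.stringify({ contact: 'Server Team' })
  }]
};

// Node 18+ example; in practice the monitoring tool itself sends this call.
await fetch(instance + '/api/global/em/jsonv2', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    Authorization: 'Basic ' + Buffer.from('evt.integration:password').toString('base64')
  },
  body: JSON.stringify(payload)
});
```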
The next decision was to integrate with SCOM. There was an out of the box integration for this one and it was in place fast. The same process followed: cleaning up noisy events and building rules to provide better alerts.
Various other tools used an email integration into event management. The emails were sent as plain text in JSON format. These were fairly easy to set up, but not a preferred method because they add points of failure: our (or the tool vendor's) email servers and ServiceNow's email servers.
Next in line was SolarWinds. SolarWinds was interesting for two reasons: there was an out-of-the-box connector, and SolarWinds also had its own plugin for ServiceNow integration. What I came to find was that the plugin was for incident creation, not event creation, and the out-of-the-box connector worked but needed tweaking.
I will explain more later in the lessons learned section. We found a few issues along the way, but it came together nicely. I am now able to build reports for teams that show their health as reported from all tools, associate devices in a change window to their alert and mark it as in maintenance, build a customer experience dashboard, and (thanks to the work of our CMDB guy) we can feed these alerts to service maps.
What is the plan moving forward?
There are other features we haven't started playing with yet. Operational Intelligence is the one I am most interested in pursuing. It is the portion of the suite that collects metric data from your tools, looks for anomalies, and raises proactive alerts based on machine learning.
What lessons did you learn?
Integrating alerts was not as simple as we originally thought it would be. I want to go over the lessons learned for each of the tools and end with what we learned about the ServiceNow platform itself. Hopefully sharing this information will save you from the same issues when you do your integration.
Having Vistara send the events to ServiceNow via an API call worked well. We had a few instances of their API service dying, but over two years that's not bad. When we started receiving the events from Vistara in ServiceNow, we found many of them weren't actionable and built rules to silence them. The remaining events then became alerts. For the alerts, I made a different set of rules that provided additional information for our EOC (Enterprise Operations Center) about the alert. That information could be things like a knowledge base article that tells them how to fix the issue, who to contact, whether this should become an incident, what the severity of the incident should be, and much more.
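To give a flavor of that enrichment, here is a minimal sketch of a scripted rule on the alert table. The u_alert_runbook lookup table and its u_* fields are hypothetical stand-ins for wherever you keep KB links, contacts, and incident guidance; the point is simply to stamp that guidance onto the alert so the EOC does not have to hunt for it.

```javascript
// Sketch only: a server-side rule on em_alert that enriches new alerts with
// handling guidance for the EOC. The u_alert_runbook table and its fields are
// hypothetical; substitute whatever you use to store KB/contact/incident data.
(function enrichAlert(alert /* GlideRecord on em_alert */) {
  var runbook = new GlideRecord('u_alert_runbook');        // hypothetical mapping table
  runbook.addQuery('u_source', alert.getValue('source'));
  runbook.addQuery('u_metric', alert.getValue('metric_name'));
  runbook.query();
  if (!runbook.next())
    return;

  // Carry the guidance on the alert so the EOC sees it without digging.
  var info = {};
  try { info = JSON.parse(alert.getValue('additional_info') || '{}'); } catch (e) {}
  info.kb_article = runbook.getValue('u_kb_article');            // how to fix it
  info.contact = runbook.getValue('u_contact');                  // who to call
  info.create_incident = runbook.getValue('u_create_incident');  // should it become an incident?
  info.incident_severity = runbook.getValue('u_incident_severity');
  alert.setValue('additional_info', JSON.stringify(info));
  alert.update();
})(current);
```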
SCOM was a bit of a pain. We found with our instance there were two different places we had to build out the integration. To pull the alerts from SCOM we had the connector hit one of our web servers. The metric data was not accessible from the same server and that connector had to point directly at the DB server. This worked well until security locked down the ports and we couldn’t connect to it anymore. The alert portion has since been moved to email integration to work around the security “features” blocking our API connection. The metric collection was deactivated.
The email integrations are a stopgap until the monitoring these tools provide is moved to SolarWinds. The plus side is that they are easy to customize and quick to set up; the flip side is that they add points of failure. Another issue we encountered was getting the clear messages to work, which comes down to the message key. The message key is what event management uses to separate different occurrences of the same issue and to associate clear messages with the alert they should close. If you run into this, work with the team sending the email to define a unique message key for their alerts.
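If you end up scripting the inbound email action yourself, the sketch below shows the idea, assuming the email body arrives as parseable JSON (the body field names are placeholders). The important part is that the message key is built only from values the trigger and its clear share.

```javascript
// Sketch only: an inbound email action script that turns a plain-text JSON
// email body into a record on the event table. Build the message_key from
// fields that are identical on the trigger and on its clear (source, node,
// type, resource) and never from anything unique to a single email, or the
// clear will not close the alert it belongs to.
(function createEventFromEmail(email) {
  var body = {};
  try { body = JSON.parse(email.body_text); } catch (e) { return; }

  var event = new GlideRecord('em_event');
  event.initialize();
  event.source = body.source || 'Email Integration';
  event.node = body.node;
  event.type = body.type;
  event.resource = body.resource;
  event.severity = body.severity;          // '0' on a clear, 1-5 otherwise
  event.description = body.description;
  // Same key on trigger and clear; unique per node + metric.
  event.message_key = [event.source, body.node, body.type, body.resource].join(':');
  event.insert();
})(email);
```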
The SolarWinds connector pulls data from the event table in SolarWinds rather than from the alerts. Events in SolarWinds can be triggered either by thresholds assigned to the machine directly or by something forcing the system to write to the event log. This means you will need to add a filter to hide the noise you had already filtered out by setting up alert actions in SolarWinds. One of the ways we combatted that was to block any event without an eventType of 5000 or 5001. Event type in SolarWinds is a number that identifies what triggered the event: a 5000 event says an alert rule caused an entry to be written to the event log, and a 5001 event says the issue cleared. That simple change in ServiceNow stopped over 9,000 noisy alerts per day. The biggest thing we found is that the "swEventId" does not make a good message key, which forced us to create our own message key using the initial event time field. For example, {"initial_event_time":"5/19/2018 16:43:00", "netObjectId":"10053"} becomes a message key of 2018.170.16.43.00.10053. I have brought this up with the ServiceNow folks, and they are working on a better connector to address these issues.
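Here is a rough sketch of that filtering and key-building logic. It is illustrative rather than a copy of our connector: the eventType values come straight from the lesson above, but the exact way the timestamp segments are encoded into the key is an assumption.

```javascript
// Sketch only: drop everything that is not an alert-action event (5000) or
// its clear (5001), then build a message_key from initial_event_time plus
// netObjectId because swEventId is not a usable key. The timestamp encoding
// (year.dayOfYear.HH.mm.ss.netObjectId) is illustrative.
function processSolarWindsEvent(event) {
  var info = {};
  try { info = JSON.parse(event.additional_info || '{}'); } catch (e) {}

  var eventType = parseInt(info.eventType, 10);
  if (eventType !== 5000 && eventType !== 5001)
    return false;                              // ignore event-log noise

  // e.g. initial_event_time "5/19/2018 16:43:00", netObjectId "10053"
  var t = new Date(info.initial_event_time);
  var dayOfYear = Math.floor((t - new Date(t.getFullYear(), 0, 0)) / 86400000);
  function pad(n) { return (n < 10 ? '0' : '') + n; }
  event.message_key = [t.getFullYear(), dayOfYear, pad(t.getHours()),
                       pad(t.getMinutes()), pad(t.getSeconds()),
                       info.netObjectId].join('.');
  return true;
}
```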
ServiceNow has several connectors out of the box; however, I would still recommend having someone who can program in JavaScript go through the default connector or build your own. Do not change the default connector itself; instead, make a copy of it if you want to make changes, because code patches can overwrite changes made to a default connector definition. Here are a few other quick hints:
- Map out which fields you want to use on the alert form before you start
  - Default for the node field is the hostname
  - Default for resource is what on the device/application has the issue (e.g., CPU for a high CPU alert)
  - Type needs to be something in the CMDB (e.g., server/application/network)
  - Severities need to be in number format (1-5, Exception to Informational); a mapping sketch follows this list
  - Custom fields can be added to the alert form
- Event and Alert Rules let you change the entire alert message and field data
- Technical services are for things like Exchange, not a custom monitor
- Discovered Services are for service maps
- Manual Services can be used for custom monitored services
- If you plan to use the dashboard to display the health of your services, ensure that all services use the default numbered Business Criticality values; non-standard criticalities will break the dashboard display
- Link KB articles to alerts and provide instructions for handling the alert or fixing the issue
- You can build out automation around what to do with alerts
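On the severity hint above, here is a minimal sketch of the kind of lookup this implies, assuming your source tool sends text severities; the labels and the fallback value are placeholders to adapt to whatever your tools actually send.

```javascript
// Sketch only: mapping a tool's text severities onto the numeric 1-5 scale
// the alert form expects. The left-hand labels are examples; adjust them to
// match what your source tool really emits.
var SEVERITY_MAP = {
  'exception':     '1',
  'critical':      '1',
  'major':         '2',
  'minor':         '3',
  'warning':       '4',
  'informational': '5',
  'clear':         '0'   // a clear closes the alert
};

function toNumericSeverity(rawSeverity) {
  var key = String(rawSeverity || '').toLowerCase();
  return SEVERITY_MAP[key] || '5';   // default to informational if unknown
}
```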
Where does this leave us?
The ITOM suite has provided us a wealth of information to improve our services, increase our first-to-know rate, and identify trends so we can avoid issues in the future. While we encountered some issues when we began this journey, where we are now is a far better place, and the final destination will be well worth any trouble. Planning how you want to use this information is the key to making it successful.
A good write-up - well done! I am slightly curious as to why you decided to map Resource and Type the way you did. When EM was initially rolled out, Type was supposed to indicate the kind of alert that was raised, or the condition it was raised for. In your example, that would be "High CPU" or "Low disk space", whereas Resource was meant to indicate what/where on the node, i.e. CPU 4, C:, etc.
I have recently been looking into the out-of-the-box Azure integration and found it interesting that they mapped the Type field to the ResourceType in Azure, which is in line with your mapping. ServiceNow apparently does not like consistency in their solution (don't even get me started on their out-of-the-box connectors that don't bring in closing events, or bring closing events in as new alerts *cough* vRealize *cough* AWS CloudWatch).
The reason I dislike this approach is that the incident short description becomes meaningless. All of the incidents come across as "CPU" in your case, or "microsoft.compute/virtualmachine" from Azure, neither of which is remotely descriptive of the reason the alert was raised.
Just some food for thought......
-Dom
We have some NOC event monitoring tools. Can these monitoring tools be integrated with ServiceNow Event Management? We are trying to move from a reactive environment to a predictive one.
My question is: can ServiceNow Event Management solve this predicament?
Your response to my query is highly appreciated. Thanks in advance.
I have found that if there is an API available for the tool, then building the integration is possible. For example, we have PagerDuty built into ServiceNow to allow automated notification of created incidents. The ITOM suite lets you use the events generated by other tools to build out things like anomaly detection, escalation from events to alerts, automated ticket creation, and automated remediation (using orchestration).
The biggest thing you can gain from the ITOM suite is the ability to look for leading indicators. By identifying these leading indicators, you can make that move to being more predictive.
For example, say you have an application that starts crashing every day. By looking at what else is going on within the application and the surrounding network before the 404 errors start, you can find other signs that a problem is on its way and use them as a trigger to remediate before the crash happens. You may also find that the leading indicator is a different issue that consistently occurs beforehand.
This is one of the ways it can help you become more predictive. A good Root Cause Analysis (RCA) should hopefully lead you to finding those leading indicators and ITOM can provide more information for those RCAs.
Hopefully that helps.
I agree. We mostly used what the default mappings ended up being between the different tools. Over time we have started working on a better methodology for how we use those fields and on building a standard. The issue is that not all tools are built equal, and the level of detail is sometimes lacking.
That being said, with the ability to build maps for these fields, you can make them more meaningful to the team responsible for fixing the issue.
Thank you for your feedback. That's the direction our team is going. You've been very helpful, much appreciated. Thanks.