What is the current best practice for ACC Agent or Server Down alerting?

Paul Bloem · 08-31-2023

Prior to our latest upgrade of the ACC-F and ACC-M plugins, there was a "Self-Healing Events" policy that would start pinging servers if the agent was disconnected or went down, and throw an event if it was unreachable.

After upgrading, that policy's name has been updated to "Deprecated policy: Self-Healing Events", and the only events we see when an agent is down are generic "There are disconnected agents from the MID <mid_name>" events, which are tied to the MID server.

Is anyone aware of the current best practice for alerting when a server goes down? Should we be creating our own policy that constantly performs the ping checks? Or is there a configuration setting I'm missing?

tschneider · 01-31-2024

Hi Paul,

We are looking for a solution to this problem as well. What did you do to solve it?

Thanks

Thorsten

Paul Bloem · 01-31-2024

Hi Thorsten,

Because we have limited resources at the moment, we ended up re-enabling the deprecated ping checks and agent status jobs. Finding an alternative solution is on our list of things to do, but this works well enough for us for now.

If you want to re-enable these, be aware that you'll need to either remove the "- Deprecated" from the name of the self-health monitor or update the monitoring script to include it. The name of the monitor is hard-coded in the script include.
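
To illustrate why the rename matters, here is a minimal Python sketch of an exact-name lookup. It is not the actual script include (which I can't paste here); `find_monitor`, the record dictionaries, and the "Monitoring Agent Status - Deprecated" name are all hypothetical stand-ins for illustration.

```python
# Illustration only: mimics a lookup that matches a monitor by a hard-coded name.
# The constant's value and the record shapes below are assumptions, not the real script.
MONITOR_NAME = "Monitoring Agent Status"


def find_monitor(records, name=MONITOR_NAME):
    """Return the first active record whose name equals `name` exactly, else None."""
    for record in records:
        if record["active"] and record["name"] == name:
            return record
    return None


# Hypothetical records: one as renamed by the upgrade, one with the name restored.
renamed = [{"name": "Monitoring Agent Status - Deprecated", "active": True}]
restored = [{"name": "Monitoring Agent Status", "active": True}]

if __name__ == "__main__":
    print(find_monitor(renamed))   # lookup misses the renamed record
    print(find_monitor(restored))  # lookup finds the restored record
```

With an exact comparison like this, any suffix or stray whitespace in the record name is enough for the lookup to come back empty, which is why either the record or the constant has to change.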

Best,
Paul

tschneider · 02-01-2024

Hi Paul,

Thanks for your quick response. I've now renamed and activated that monitor. As a test, I stopped the agent on one machine and disconnected a second machine from the network, but I have not seen any events other than "There are disconnected agents from the MID <mid_name>". Is there a way to check that the monitor is active?

The "Last Run" value on the "Monitoring Agent Status" record still shows 2018, so it looks like the monitor has not run since I activated it.

Thanks

Thorsten

Paul Bloem · 02-01-2024

Hi Thorsten,

I'm not sure what would cause it not to run... Some things I would check:

- Are any of the other Monitoring Configuration (em_monitor_conf) records showing a recent Last Run value? It sounds like there should be, as I believe the "There are disconnected agents" check works in a similar way.
- On the "Monitoring Agent Status" record, if you open the record referenced in the "Script" field, is that record also active?
- Does the value of the "MONITOR_NAME" variable on line 14 of that script exactly match the name "Monitoring Agent Status"?
- If you look at the Monitoring States table (em_monitor_state), do you see any records where the "Monitoring Configuration" field references your Monitoring Agent Status record?
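
If it is easier to check these from outside the UI, the lookups above can be roughly sketched as REST Table API queries. This Python snippet only builds the URLs; it does not call the instance. The instance URL is a placeholder, and the column names (`name`, `active`, `script`, `last_run`, `monitor_conf`) are my assumptions about the field names behind the labels, so verify them against the table dictionary before relying on this.

```python
from urllib.parse import urlencode

INSTANCE = "https://example.service-now.com"  # placeholder instance URL
MONITOR_NAME = "Monitoring Agent Status"


def table_url(table, query, fields, limit=10):
    """Build a Table API URL for `table` with an encoded query and a field list."""
    params = {
        "sysparm_query": query,
        "sysparm_fields": ",".join(fields),
        "sysparm_limit": str(limit),
        "sysparm_display_value": "true",
    }
    return f"{INSTANCE}/api/now/table/{table}?{urlencode(params)}"


def debug_urls(monitor_name=MONITOR_NAME):
    """Return one URL per check; all column names here are assumed, not confirmed."""
    return {
        # Which monitoring configurations have run most recently?
        "recent_runs": table_url(
            "em_monitor_conf", "ORDERBYDESClast_run", ["name", "active", "last_run"]
        ),
        # Is the monitor itself active, and which script does it reference?
        "monitor_record": table_url(
            "em_monitor_conf",
            f"name={monitor_name}",
            ["name", "active", "script", "last_run"],
            limit=1,
        ),
        # Do any monitoring states reference the monitor?
        "states": table_url(
            "em_monitor_state",
            f"monitor_conf.name={monitor_name}",
            ["sys_id", "monitor_conf", "sys_updated_on"],
        ),
    }


if __name__ == "__main__":
    for label, url in debug_urls().items():
        print(label, url)
```

You would still need to open the referenced script record to compare the "MONITOR_NAME" value by eye; the queries only cover the record-level checks.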

Unfortunately there seem to be quite a few levels of configuration necessary for these to work, so it's tricky to debug 🙂

Happy to take a look at some other configuration in our instance if one of those doesn't reveal the issue.

Best,
Paul