How can we monitor MID servers with a scheduled job?

Casey23 · ‎05-02-2022

We had set up notifications in our system to let us know when the MID server status is down or up. The problem is that in almost every case we see the MID server go down and come back up within the same second. For example, at 10:42:13 the server goes down and then comes back up at 10:42:13 as well. We talked to HI about this and sent them our logs, and they confirmed that they weren't seeing any issues in the logs.

So we got to thinking that a better approach would be to check whether the MID server has been down for at least 15 minutes, and only then send a notification so that we can check on it. I found the article below that briefly talks about this:

https://community.servicenow.com/community?id=community_question&sys_id=fb27c729db1cdbc01dcaf3231f96...

The issue I'm running into is that I'm not exactly sure how to set up a scheduled job to call the function in that script. I'm wondering if anyone out there has done this before and would be willing to share their setup? Or, if there is a better way to do this than a scheduled job, I'm open to options to try to get legitimate downtime notifications on the MID server.

TrevorK · ‎05-04-2022

I love a good challenge!

So to start, my Script Action had one flaw. I messed up my math. 8000 seconds is about 133 minutes. Needless to say, that notification will not be executing any time soon.

Here is what I have now for the Script Action:

var mid_name = "MIDServer1Sandbox";

mid_server_down(mid_name);

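// Queue the notification event for later if the named MID Server is currently Down.
function mid_server_down(name) {
	// Time at which the event should fire: now plus the delay.
	var fifteen_minutes = new GlideDateTime();
	fifteen_minutes.addSeconds(100); // test value; use 900 for 15 minutes

	// Look up the MID Server record by name, but only if its status is Down.
	var gr = new GlideRecord("ecc_agent");
	gr.addQuery("name", name);
	gr.addQuery("status", "Down");
	gr.query();
	
	if (gr.next()) {
		// The event name must match the one registered in the Event Registry.
		gs.eventQueueScheduled('MIDServerDownNotification', gr, '', '', fifteen_minutes);
	}
}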

I used 100 seconds for my test, but change it to 900 for 15 minutes. I double-checked that number this time too 🙂
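
If it helps avoid that kind of slip, the delay can be derived from minutes rather than typed in as a raw number of seconds. A minimal sketch in plain JavaScript; `minutesToSeconds` is a helper name made up for this example, not a platform function:

```javascript
// Hypothetical helper, not part of the platform API.
function minutesToSeconds(minutes) {
    return minutes * 60;
}

// Delay for the real notification and for the quick test run.
var delaySeconds = minutesToSeconds(15);
var testDelaySeconds = 100;
```

In the Script Action, the result would go where the literal is now, e.g. `fifteen_minutes.addSeconds(minutesToSeconds(15));`.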

I then modified the Notification to use a self-invoking function. Now it does work:

(function () {
	var gr_mid = new GlideRecord("ecc_agent");
	
	if (gr_mid.get(current.sys_id)) {
		if (gr_mid.status == "Down") {
			return true;
		}
	}
	return false;	
}) ();

By using the Advanced Conditions, I think we are able to do a real-time lookup (when the notification is processed). With the regular Conditions, I think the check is based on whatever status the passed-in object has.
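
To make that distinction concrete, here is a rough sketch in plain JavaScript. It is an illustration only, not platform code: `midServers` and `lookupStatus` are hypothetical stand-ins for the `ecc_agent` table and the `GlideRecord` lookup, and whether the object handed to the notification really carries the old status is still an unverified assumption.

```javascript
// Hypothetical stand-in for the ecc_agent table: sys_id -> current status.
var midServers = { abc123: "Up" };

// Stand-in for gr_mid.get(sys_id) followed by reading gr_mid.status.
function lookupStatus(sysId) {
    return Object.prototype.hasOwnProperty.call(midServers, sysId) ? midServers[sysId] : null;
}

// Decision based only on the status carried by the object that was passed in.
function conditionFromSnapshot(snapshot) {
    return snapshot.status === "Down";
}

// Decision based on a fresh lookup at the time the notification is processed.
function conditionFromLookup(snapshot) {
    return lookupStatus(snapshot.sys_id) === "Down";
}

// The server was Down when the event was queued, but is Up again by now.
var queuedRecord = { sys_id: "abc123", status: "Down" };
var snapshotSays = conditionFromSnapshot(queuedRecord);
var lookupSays = conditionFromLookup(queuedRecord);
```

With this sample data the snapshot-based check would still send the email, while the lookup-based check would not.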

I am not sure if the notification advanced conditions can invoke a function. Perhaps that's why the self-invoking function works.

Can you try the above and see if it now works for you?

Edit: In case it's not clear, this is how the self-invoking function looks:


TrevorK · ‎05-02-2022

First, let me preface my reply. There are so many features within the platform that there may be better ways to achieve this than what I recommend. This is just how I would go about doing it off the top of my head.

Idea #1:

A simple way to achieve this would be to have the MID Server going down trigger a notification that is sent out in 15 minutes. When that notification is to go out, it runs a quick check to see if the MID Server is up and if it is, the notification does not send.

You will turn off the out-of-box notification(s). You will then create a Script Action to run off the event mid_server.down (I think that's out of the box). The Script Action will queue an event 15 minutes in the future (gs.eventQueueScheduled), and that event triggers an email notification you create. Use the Advanced Conditions within the email notification to determine whether the MID Server has been down long enough to send.

Script Actions: https://docs.servicenow.com/en-US/bundle/sandiego-platform-administration/page/administer/platform-events/reference/r_ScriptActions.html

gs.eventQueueScheduled: https://therockethq.gitbooks.io/servicenow1/content/index/index/scripting/scripting-concepts/script-an-event/generate-the-event-at-a-fixed-time.html

The advantage to this is that the idea is incredibly simple. You log an email to go out in 15 minutes, and then when it's time to go out, it decides whether to process based on your logic. The downside is you need a little bit of a check in there to see if the MID Server went up OR if the MID Server went up, then down again (not exceeding your 15 minutes threshold).

We do something similar with our notifications that a Change Task has been assigned to someone, because people were assigning tasks from a list view (so a task was assigned to a group first, which triggered a notification, then a second later assigned to a person, which triggered another notification).

Idea #2:

You can have a Scheduled Job run. The Scheduled Job could simply look to the event queue for a mid_server.down event in the last 15 minutes. If found, it looks for a mid_server.up event after it. If the mid_server.up event is not found, trigger the email message.

This is very similar to what you have linked. What I don't like is that it seems like there is some room for error with the timing of it all.
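
The check described in Idea #2 can be sketched as a plain JavaScript function over a list of events. This is only a rough illustration: the event objects (`name`, `server`, `time`) and the function name `findServersStillDown` are assumptions made for the example, and a real Scheduled Job would have to read the equivalent rows from the instance first. A `mid_server.up` logged at the same timestamp as the `mid_server.down` is treated as a recovery here, which is also an assumption.

```javascript
var MINUTE_MS = 60 * 1000;

// Returns the names of servers with a down event inside the window
// and no up event at or after their latest down event.
function findServersStillDown(events, now, windowMs) {
    var lastDown = {}; // server -> time of latest down event inside the window
    var lastUp = {};   // server -> time of latest up event
    var i, e, server;

    for (i = 0; i < events.length; i++) {
        e = events[i];
        if (e.time > now) { continue; } // ignore anything dated in the future
        if (e.name === "mid_server.down" && e.time >= now - windowMs) {
            if (!lastDown.hasOwnProperty(e.server) || e.time > lastDown[e.server]) {
                lastDown[e.server] = e.time;
            }
        } else if (e.name === "mid_server.up") {
            if (!lastUp.hasOwnProperty(e.server) || e.time > lastUp[e.server]) {
                lastUp[e.server] = e.time;
            }
        }
    }

    var stillDown = [];
    for (server in lastDown) {
        if (lastDown.hasOwnProperty(server)) {
            var recovered = lastUp.hasOwnProperty(server) && lastUp[server] >= lastDown[server];
            if (!recovered) { stillDown.push(server); }
        }
    }
    return stillDown.sort();
}

// Sample data: the job runs at minute 100 and looks back 15 minutes.
var jobRunTime = 100 * MINUTE_MS;
var sampleEvents = [
    { name: "mid_server.down", server: "mid1", time: 90 * MINUTE_MS },
    { name: "mid_server.up",   server: "mid1", time: 90 * MINUTE_MS }, // same-second blip
    { name: "mid_server.down", server: "mid2", time: 92 * MINUTE_MS }, // never came back
    { name: "mid_server.down", server: "mid3", time: 88 * MINUTE_MS },
    { name: "mid_server.up",   server: "mid3", time: 89 * MINUTE_MS },
    { name: "mid_server.down", server: "mid3", time: 95 * MINUTE_MS }  // down again
];
var flagged = findServersStillDown(sampleEvents, jobRunTime, 15 * MINUTE_MS);
```

With the sample data, `flagged` comes out as `["mid2", "mid3"]`. Note that `mid2` is reported even though it has only been down for 8 minutes when the job runs, which is the kind of timing problem mentioned above.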

Overall:

Idea #2 is from your link. I like that it's simple, but I don't like that it runs every 15 minutes (inefficient) and that we need to carefully consider the timing of everything. It seems too ... unrefined.

Idea #1 takes it a step further. We schedule an email to happen in 15 minutes and then when the email is ready to process, the email decides whether to send. It seems a bit more refined because we are reacting to an event (which should happen rarely) and we simply check in 15 minutes if an "up" event was logged since. I think it's a little more efficient.

There are various additions you could make to Idea #1, such as having the logic run in a Script Action that is itself scheduled in the future (rather than in the notification).

I can help you map out either of them with the code, I just need an idea of which you are looking for. Or, if you want to take a stab at one of them I can help with ideas along the way. Just let me know!

Casey23 · ‎05-03-2022

I like what you mentioned with idea #1. I agree that being event driven is much better than a job that runs on a schedule when it's unnecessary in 99.9% of cases. If you're able to provide more info on that method, I'm all ears! I'll also take a look at those links you provided. Thank you!

TrevorK · ‎05-03-2022

Step 1:

Confirm that you have events created for mid_server.down and mid_server.up. I think these are OOB, but we have been a client for so long that we have some things in our instance from many years ago that no longer exist out of the box.

Step 2:

Create an Event Registry entry for your MID Server notification. Let's call it mid_server.future_email. This is just to trigger our email.

Step 3:

Create a Script Action. This is what runs every time the mid_server.down event is detected (which is OOB on my instance, so we are just piggybacking on that).

https://docs.servicenow.com/en-US/bundle/sandiego-platform-administration/page/administer/platform-events/reference/r_ScriptActions.html

Step 4:

Create a notification. You want to check whether the MID Server is currently down in the Advanced condition. I assume the rest of the notification stuff you know.

This is the step where I see room for improvement (sorry, just wanted to get you something to work with for now). The areas of improvement I see:

a) I would want to verify whether I need to do that GlideRecord lookup in the Advanced Conditions. I assume that if I pass the object, it carries the "down" status with it (that is, the object is not updated). The lookup doesn't hurt regardless; I am just working off the assumption that the passed object reflects the record at the time it was queued and is not refreshed when the script actually runs. If this doesn't make sense, ignore it. Otherwise, feel free to test it out (or not care that it might run an extra GlideRecord lookup, that's fine too).

b) I think it would be nice to take the time in the Advanced Condition to see if there was a series of up/down/up/down events. This script lacks the logic to say "has it been down for 15 minutes". It merely says "15 minutes ago you detected a down server. Is the server still down?". Slight difference.

If you can get the above working and need help with the optimization I suggested in Step 4 just let me know. However, focus on getting it working first because the optimization (specifically 4b) will not hold you back from it. It's merely to avoid a scenario where the MID Server goes up and down often within the 15 minute window.

Hope that all makes sense. Try out those basics, and we can refine the Advanced Condition to look into the event queue and see if the server has come up (and gone back down) within the 15 minutes. In that case, we would not want to send an email (ultimately if it stays down for 15 minutes another subsequent notification would handle it).
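
The refinement from 4b could look roughly like the sketch below, written as plain JavaScript so the logic can be checked outside the instance. The event objects (`name`, `server`, `time` in minutes) and the function name `stayedDownSince` are made up for this example; in the Advanced condition the events would have to come from a lookup on the instance, and this check would sit alongside the "is it still down?" lookup rather than replace it.

```javascript
// Returns true if no mid_server.up event was logged for the server
// between the down event that scheduled the check and the check itself.
function stayedDownSince(events, server, downTime, checkTime) {
    for (var i = 0; i < events.length; i++) {
        var e = events[i];
        if (e.server === server && e.name === "mid_server.up" &&
            e.time >= downTime && e.time <= checkTime) {
            return false; // it came back up at some point in the window
        }
    }
    return true;
}

// Sample timeline in minutes: down, back up, then down again.
var flappingEvents = [
    { name: "mid_server.down", server: "mid1", time: 0 },
    { name: "mid_server.up",   server: "mid1", time: 5 },
    { name: "mid_server.down", server: "mid1", time: 6 }
];

// Check scheduled by the first down event (minute 0), run at minute 15.
var firstCheck = stayedDownSince(flappingEvents, "mid1", 0, 15);
// Check scheduled by the second down event (minute 6), run at minute 21.
var secondCheck = stayedDownSince(flappingEvents, "mid1", 6, 21);
```

Here the first check is skipped because of the recovery at minute 5, and the second one, scheduled by the later down event, is the one that sends.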

Casey23 · ‎05-04-2022

Thank you for the follow up Trevor! I actually had started something similar yesterday, but I like that you're checking for the name of the MID server in your script. We only have one MID server for each of our environments, but I'd like to use that functionality in the event that we add more.

One problem I'm running into, that I didn't have yesterday, is that if I have any conditions (advanced or otherwise) on the email notification it doesn't send. As soon as I remove the conditions, it's completely fine. I'm still doing some troubleshooting on that, but wanted to reply to let you know I'm still looking into it.

Again, appreciate the reply!