Discovery Optimization

Mike_R · ‎09-03-2019

Hi,

I'm looking for advice on optimizing our Discovery schedules.

Right now, my company is set up with two Discovery schedules per day, at 12am and 12pm, each with a two-hour max run time.

Both schedules contain the same 65 range sets.

I'm pretty new to the company, but this is how our vendor set it up.

The issue is that Discovery doesn't seem to finish within the two-hour windows. Would it make more sense to have schedules based on location and CI class (servers, VDIs, computers, switches, etc.), rather than having everything thrown into one schedule?

Also, if I break up the Discovery schedules, can two schedules run at the same time, or is that not recommended?

Just looking for some best practices advice.

Thanks,

Mike

DaveHertel · ‎09-03-2019

Hi Mike -- Welcome to Disco land... sounds like you inherited some curious setup issues. A few pointers that stand out to me from your questions.

1. No, don't even try to divide schedules by the type of thing... that's a waste, pointless, and frankly impossible to sustain long term. With Disco, all you really know up front is the IPs; before a scan is ever done, you don't know what type of device sits at each address. It'll be hopelessly difficult to architect and maintain a schedule strategy based on the types of things to be scanned.

2. Why run schedules so often? Unless you have an EXTREMELY dynamic, fast-changing environment - which is hard to believe considering it's only 65 range sets - this sounds excessive. Large IT shops run scans once a day or, in many cases, once a week. Maybe there is a good reason... but I'd challenge the business case for scanning 2x/day.

3. You could have dueling, overlapping schedules... but why? As in point #2, it's suspicious (read: unlikely) that the environment really, truly needs to be scanned that often.

4. You didn't mention the # of MID Servers, cluster usage(?), topology, # of devices to be scanned, etc. These are key factors in deciding how to architect (or redo) the Disco setup. Without knowing more, it's tough to recommend specifics... but my back-of-the-napkin suggestions:
A) Consider a few like-configured MIDs in a load-balanced cluster. Have your schedules leverage the cluster so jobs can run faster.
B) Look at the MIDs themselves. How much memory? CPU? Is each one restricted to a certain set of devices, or can several MIDs participate equally in scanning targets?
C) How many devices need to be scanned on a routine basis? (This can inform the # of MIDs.) In general, you should expect to be able to scan many, many thousands of devices daily... but it'll heavily depend on your topology, MID capacity, WAN/LAN speed, etc.

There are many ways to optimize schedules and MIDs based on the ecosystem...

Hope this enlightens a bit and maybe helps too 🙂

christianmalone · ‎09-03-2019

Mike, I agree with Dave and wanted to add that I would first try to determine why Discovery is running past two hours. Is Shazzam cluster support enabled? Are there lots of timeouts due to missing creds or errors? How many IPs and devices are you trying to scan in two hours, and with how many MID Servers of what spec? I would consider turning on Shazzam cluster support, increasing MID Server threads, and staggering the schedules a bit more (maybe several groups every hour after noon and midnight). And of course read the Discovery best practices docs: https://community.servicenow.com/community?id=community_blog&sys_id=4efc26a5dbd0dbc01dcaf3231f96191a...
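
As a rough illustration of the staggering idea, here is a minimal Python sketch that only computes start times. The function name, the hourly spacing, and the midnight/noon anchors are assumptions for illustration; nothing here touches ServiceNow, and the times would still be entered on the schedules by hand.

```python
def staggered_starts(num_groups, anchors=(0, 12), spacing_hours=1):
    """Return {group_index: ["HH:00", ...]} start times.

    Group 0 starts at each anchor hour, group 1 starts spacing_hours
    later, and so on. Hours wrap around at 24.
    """
    plan = {}
    for k in range(num_groups):
        plan[k] = [f"{(a + k * spacing_hours) % 24:02d}:00" for a in anchors]
    return plan


if __name__ == "__main__":
    for group, starts in staggered_starts(4).items():
        print(f"group {group}: {', '.join(starts)}")
```

With four groups, the last one starts at 03:00 and 15:00, so each group gets its own hour instead of all 65 range sets competing at once.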

tim_broberg · ‎09-04-2019

A few comments to mull over:

No, it's not a problem to run overlapping discoveries.
Watch out for huge subnets. It takes a very long time to Shazzam /16s, /19s, etc. If you can break the ranges down to /24s, do it.
Go to one of your Discovery statuses and look at the ecc_queue. Sort it so that the last probe to finish is first. Are there a few probes that take a long time to run? Figure out why those are taking so long.
Consider putting MIDs in a load-balancing cluster. This also gives you some redundancy in case problems arise.
65 ranges? You may want to break those down into smaller schedules.
You can speed up the rate at which probes are processed by increasing the max thread counts on the MID (or the number of MIDs per cluster), but this in turn puts more load on the instance nodes that process the sensors. It's a balancing act and depends on how many nodes you have and how much other scheduled job traffic they carry.
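
For the subnet point above, a minimal sketch of breaking a large IPv4 range into /24s, using Python's standard `ipaddress` module. The helper name and the example CIDR are made up for illustration; the resulting ranges would still need to be entered into the Discovery range sets separately.

```python
import ipaddress


def split_into_24s(cidr):
    """Split an IPv4 CIDR block into /24 subnets.

    Blocks that are already /24 or smaller are returned unchanged.
    """
    net = ipaddress.ip_network(cidr, strict=False)
    if net.prefixlen >= 24:
        return [str(net)]
    return [str(subnet) for subnet in net.subnets(new_prefix=24)]


if __name__ == "__main__":
    # Example block only; substitute your own ranges.
    for subnet in split_into_24s("10.1.0.0/22"):
        print(subnet)
```

A /19 yields 32 subnets and a /16 yields 256, which makes it easier to drop the ones you know are empty before scheduling them.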

I would suggest:

Breaking down ranges to /24s (or whatever size gives you full subnets, so that you don't waste time scanning empty space).
Breaking down schedules so that they complete in 60 to 90 minutes max.
2 to 4 MIDs per cluster.
Keeping MIDs on the same local network as their targets to minimize dropped SNMP (UDP) packets.
Debugging any probes that are running long / timing out.
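
One way to approach the 60 to 90 minute target is to group ranges by how long each took in past runs. The sketch below uses a simple first-fit-decreasing packing. The range names and per-range minutes are hypothetical, and it assumes durations within a schedule simply add up, which is a simplification since a MID works on several ranges in parallel.

```python
def pack_ranges(est_minutes, budget=90):
    """Group ranges into schedules whose estimated total fits the budget.

    est_minutes: {range_name: estimated_minutes} (hypothetical figures,
    e.g. read off earlier Discovery statuses).
    A range that exceeds the budget by itself gets its own schedule.
    """
    schedules = []  # each entry is [total_minutes, [range_names]]
    # Place the longest ranges first, each into the first schedule with room.
    for name, minutes in sorted(
        est_minutes.items(), key=lambda kv: kv[1], reverse=True
    ):
        for sched in schedules:
            if sched[0] + minutes <= budget:
                sched[0] += minutes
                sched[1].append(name)
                break
        else:
            schedules.append([minutes, [name]])
    return [names for _, names in schedules]


if __name__ == "__main__":
    estimates = {"site-a": 60, "site-b": 50, "site-c": 40, "site-d": 30}
    for i, names in enumerate(pack_ranges(estimates), start=1):
        print(f"schedule {i}: {names}")
```

Treat the output as a starting point only; the real run times will shift once long-running probes are fixed or MIDs are added.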

Hope this helps,
- Tim.