Discovery stops in between; doesn't complete

Abhijeet Yadav · 01-27-2020

Hi All,

We are running a few discovery schedules. However, some of them are still in the 'Active' state after 48 hours! I don't see any change in the started/completed probe counts either. They just hang indefinitely. These are individual schedules, not dependent on each other.

I have proposed MID server config changes as,

Any other suggestions?

Thanks,

Abhijeet

DaveHertel · 01-27-2020

If they've hung that long, kill (stop) the jobs. No need to restart MID usually... don't expect the stalled jobs to ever finish at this point.

Ankush13 · 01-28-2020

That's good to know. I used to restart them when all that good stuff wasn't available in the UI.

tim_broberg · 01-27-2020

A few diagnostic steps you can take:

Go to your discovery status and, in the ecc_queue related list, filter for status != processed. You will probably see some outputs in processing state. Those are the guys that are hanging you up.
Alternatively, go to your mid servers and look at the stack to see what is stuck where. Any one of these can help with this:
1. On the mid server record (ecc_agent table), look at the threads related list for threads containing "worker" and not containing "idle." This gets you the thread name, which includes the type of probe and the sys_id of its ecc_queue record.
2. Dump the stack to the wrapper log (in the agent/logs directory on the mid server) by either:
  1. Hacking the stack dump UI action to allow access to admins (or mid server admins?) instead of maint, and then poking that button. (No idea why that should only work for maint.)
  2. Manually creating a SystemCommand with source = threaddump and agent = mid.server.<your mid server name>
3. Manually dump the stack with jstack. You'll have to make sure you have an appropriate version of the JDK installed, and you'll have to track down the PID of the mid JVM.

If you get the stack dump, you'll be looking for worker threads which look like this:

Worker-<queue>:(<Topic>-<sys_id>|"Idle")

queue = {Standard, Expedited, Interactive}
Topic = probe topic like SSHCommand, Shazzam, HorizontalDiscoveryProbe
sys_id is the id of the ecc_queue output record for the probe
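
If you end up with a long list of thread names, a small script can split them apart for you. This is only a sketch of the naming pattern described above; the function and class names are made up for illustration, and it assumes the sys_id is the usual 32 lowercase hex characters.

```python
import re
from typing import NamedTuple, Optional

# Pattern from the post: Worker-<queue>:(<Topic>-<sys_id>|"Idle")
# Assumption: sys_id is 32 lowercase hex characters.
WORKER_NAME = re.compile(
    r"^Worker-(?P<queue>Standard|Expedited|Interactive):"
    r"(?:Idle|(?P<topic>\w+)-(?P<sys_id>[0-9a-f]{32}))$"
)


class WorkerThread(NamedTuple):
    queue: str
    topic: Optional[str]   # None when the worker is idle
    sys_id: Optional[str]  # None when the worker is idle


def parse_worker_name(name: str) -> Optional[WorkerThread]:
    """Split a worker thread name into queue, topic and ecc_queue sys_id.

    Returns None if the name does not look like a worker thread at all.
    """
    match = WORKER_NAME.match(name)
    if match is None:
        return None
    return WorkerThread(match["queue"], match["topic"], match["sys_id"])
```

A busy worker comes back with its topic and sys_id filled in, so you can go straight to the matching ecc_queue output record; an idle worker comes back with both set to None.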

For example, here's a recent problem I had with discoveries stalling on probes:

MyLaptop:Downloads tim.broberg$ grep -v Idle stack.txt | grep -A 10 Worker-
"Worker-Standard:SSHCommand-3654b541dbc2cc58219f904bdb961929" #185 daemon prio=5 os_prio=0 tid=0x00007fccb4063000 nid=0x6dfc waiting for monitor entry [0x00007fcd659a8000]
   java.lang.Thread.State: BLOCKED (on object monitor)
   at com.service_now.mid.creds.provider.standard.StandardCredentialsProvider.iterator(StandardCredentialsProvider.java:154)
   - waiting to lock <0x00000000c0467e48> (a com.service_now.mid.creds.provider.standard.StandardCredentialsProvider)
   at com.snc.core_automation_common.util.AKeyedConnectionFactory.getCredentialsIterator(AKeyedConnectionFactory.java:252)
   at com.snc.core_automation_common.util.AKeyedConnectionFactory.getConnectionIterateOverCreds(AKeyedConnectionFactory.java:204)
   at com.snc.core_automation_common.util.AKeyedConnectionFactory.getConnection(AKeyedConnectionFactory.java:158)
   at com.snc.core_automation_common.util.AKeyedConnectionFactory.getConnection(AKeyedConnectionFactory.java:145)
   at com.service_now.mid.services.SSHSessionPoolFactory.createConnection(SSHSessionPoolFactory.java:43)
   at com.service_now.mid.services.SSHSessionPoolFactory.createConnection(SSHSessionPoolFactory.java:21)
   at com.snc.core_automation_common.util.AConnectionPoolFactory.create(AConnectionPoolFactory.java:27)

...

From this, I can see that:
A) An SSHCommand is stuck.
B) His sys_id is 3654b541dbc2cc58219f904bdb961929
C) What he's stuck on is waiting for the credential iterator.

Now I know what to try to debug to untangle the thing: what's wrong with the credential iterator?
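
If grep gets unwieldy on a big dump, the same triage can be scripted. The sketch below assumes the dump is in the usual jstack text layout shown above (a quoted thread name on the header line, then a java.lang.Thread.State line, then "at ..." frames); busy_workers is a hypothetical helper name, not anything shipped with the mid server.

```python
import re

# Header line of a worker thread, e.g. "Worker-Standard:SSHCommand-<sys_id>" #185 ...
HEADER = re.compile(r'^"(Worker-[^"]+)"')
# State line, e.g. java.lang.Thread.State: BLOCKED (on object monitor)
STATE = re.compile(r"java\.lang\.Thread\.State: (\w+)")


def busy_workers(dump: str):
    """Return name, thread state and top stack frame for each non-idle worker."""
    results = []
    current = None  # the busy worker whose lines we are currently reading
    for line in dump.splitlines():
        header = HEADER.match(line)
        if header:
            name = header.group(1)
            if name.endswith(":Idle"):
                current = None  # idle workers are not interesting
            else:
                current = {"name": name, "state": None, "top_frame": None}
                results.append(current)
            continue
        if line.startswith('"'):
            current = None  # header of some other, non-worker thread
            continue
        if current is None:
            continue
        stripped = line.strip()
        state = STATE.search(stripped)
        if state and current["state"] is None:
            current["state"] = state.group(1)
        elif stripped.startswith("at ") and current["top_frame"] is None:
            # The first "at" line is the frame the thread is sitting in.
            current["top_frame"] = stripped[3:]
    return results
```

Run against the dump above, this would report the SSHCommand worker as BLOCKED with StandardCredentialsProvider.iterator as its top frame, which is the same conclusion as reading the trace by hand.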

Hope this helps,
- Tim.

Abhijeet Yadav · 01-28-2020

Hi Tim,

I was able to run the probes manually. However, I don't see anything under MID Server Thread matching the 'contains worker but not idle' criteria. Is that normal? Does it mean something else?

Thanks,

Abhijeet

tim_broberg · 01-28-2020

On the mid server record, those threads update every 10 minutes, assuming the mid is alive, so allow some time for them to update.

If the probes are not getting stuck, you'll want to look at the ecc_queue records for the discovery status and see if any are not in processed state.
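
If you'd rather pull that list outside the UI, something like the following builds a Table API URL for the same filter. Treat it as a rough sketch under several assumptions: that the ecc_queue records carry the discovery status sys_id in agent_correlator, that the fields are named state, queue and topic, and that your instance lives at <name>.service-now.com. The function name is made up; check the field names on your own instance before relying on it.

```python
from urllib.parse import urlencode


def pending_ecc_queue_url(instance: str, status_sys_id: str, limit: int = 100) -> str:
    """Build a Table API URL listing ecc_queue outputs that are not yet processed.

    Assumptions (verify on your instance): agent_correlator holds the
    discovery status sys_id, and state/queue/topic are the field names.
    """
    query = f"agent_correlator={status_sys_id}^queue=output^state!=processed"
    params = {
        "sysparm_query": query,
        "sysparm_fields": "sys_id,topic,name,state,sys_created_on",
        "sysparm_limit": str(limit),
    }
    return (
        f"https://{instance}.service-now.com/api/now/table/ecc_queue?"
        + urlencode(params)
    )
```

Fetch the resulting URL with whatever authenticated HTTP client you normally use; anything that comes back in processing state for a long time is a candidate for the probe that is holding the discovery open.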