
11-30-2017 10:40 AM
Hi,
I have a workflow that has extremely inconsistent runtimes. The expected runtime is around 40 seconds, but it can take upwards of 20 minutes, so I put some timestamps everywhere.
I isolated the problem to the MID server returning the results of the script after it has finished running. For example: in a run that took 22 minutes, everything leading up to the PowerShell activity took 1 second and the script itself took 42 seconds (there's a Get-Date as the very last line of the script), but the MID server didn't return a response until over 21 minutes later. The rest of the workflow took less than a second. Out of a 21:55 runtime, 21:12 was spent waiting for a response and only 43 seconds was spent actually running anything.
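(For reference, here's a rough sketch of the timing approach - not my exact code - using Run Script activities placed immediately before and after the PowerShell activity to bracket it:)

```javascript
// Run Script activity placed just before the PowerShell activity
workflow.scratchpad.ps_start = new GlideDateTime().getNumericValue(); // ms since epoch
gs.log('Before PowerShell activity: ' + new GlideDateTime().getDisplayValue(), 'WF timing');

// Run Script activity placed just after the PowerShell activity
var elapsedMs = new GlideDateTime().getNumericValue() - workflow.scratchpad.ps_start;
gs.log('PowerShell activity round trip (s): ' + (elapsedMs / 1000), 'WF timing');
```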
Labels: Orchestration
11-30-2017 01:42 PM
Here's the inside story on what happens with probes.
(When you look at the ecc_queue, make sure you display columns for created, processed, updated, and state.)
Here's the flow:
- Somebody creates an output record in the ecc_queue (output created time = now, state = ready)
- Load balancing of mids within a cluster occurs (currently in a business rule)
- An AMB message is sent to the assigned mid server notifying it that it has work to do
- The mid queries the last 30 minutes of ecc_queue output results, retrieving any new work onto an internal queue
- The mid sends a queue.processing message back indicating which ecc_queue outputs got picked up for processing
- A business rule in the instance marks all the outputs in processing (output processed time = now, state = processing)
- The mid has some number of worker threads grinding away emptying the internal queue
- The time reported in the input message is the time spent in the actual worker thread
- On completion, the result is inserted in an ecc_queue input record (input created time = now, state = ready)
- When the instance sees the input, a business rule marks the output processed (output updated time = now, state = processed)
- An asynchronous sensor job is scheduled (sys_trigger).
- The sensor starts, (input processed time = now, state = processing)
- The sensor completes, (input updated time = now, state = processed)
So, we can derive how much time a probe spends where from the created, processed, and updated fields:
- Output queue time = output processed - output created
- Mid server queue time = output updated - output processed - mid server exec time (from input payload)
- Mid processing time = mid server exec time (from input payload)
- Input queue time = input processed - input created
- Input processing time = input updated - input processed
(You can find this logic in code form in the Discovery Timeline implementation.)
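If you want to do the arithmetic yourself, here's a rough background-script sketch of the same calculations (not the actual Discovery Timeline code). It assumes the standard ecc_queue columns (sys_created_on, processed, sys_updated_on, queue, response_to) and skips parsing the exec time out of the input payload:

```javascript
var outputSysId = 'REPLACE_WITH_OUTPUT_SYS_ID'; // sys_id of the probe's output record

function seconds(start, end) {
    // GlideDuration.getNumericValue() returns milliseconds
    return GlideDateTime.subtract(start, end).getNumericValue() / 1000;
}

var out = new GlideRecord('ecc_queue');
if (out.get(outputSysId)) {
    var oCreated   = new GlideDateTime(out.getValue('sys_created_on'));
    var oProcessed = new GlideDateTime(out.getValue('processed'));
    var oUpdated   = new GlideDateTime(out.getValue('sys_updated_on'));

    // Output queue time = output processed - output created
    gs.print('Output queue time (s): ' + seconds(oCreated, oProcessed));
    // This span is mid server queue time + exec time (exec time is reported in the input payload)
    gs.print('Mid queue + exec time (s): ' + seconds(oProcessed, oUpdated));

    // The probe result is the input whose response_to points back at the output
    var inp = new GlideRecord('ecc_queue');
    inp.addQuery('queue', 'input');
    inp.addQuery('response_to', out.getUniqueValue());
    inp.query();
    if (inp.next()) {
        var iCreated   = new GlideDateTime(inp.getValue('sys_created_on'));
        var iProcessed = new GlideDateTime(inp.getValue('processed'));
        var iUpdated   = new GlideDateTime(inp.getValue('sys_updated_on'));
        // Input queue time = input processed - input created
        gs.print('Input queue time (s): ' + seconds(iCreated, iProcessed));
        // Input processing (sensor) time = input updated - input processed
        gs.print('Sensor processing time (s): ' + seconds(iProcessed, iUpdated));
    }
}
```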
Diagnosing problems:
- If the output queue time is huge, you may be seeing congestion in the API_INT semaphores due to excessive SOAP / REST traffic.
- Excessive Mid server processing or queueing time tends to indicate that your mids are overloaded or have stuck worker threads.
- Excessive input queue / processing times indicate too much traffic for the number of nodes you have, or perhaps load balancing issues.
Once you've narrowed down which area is getting bogged down, we can focus on debugging that area.
- Tim.
11-30-2017 01:59 PM
Now, it sounds like you're seeing slowness on the mid server side.
If you go to the agent log and grep for "LogStatusMonitor" around the times things are busy, you can look at standard.queued and standard.processing to see how busy it is. If .queued is non-zero for much of the time, you have a sizable backlog of work, which will result in the kinds of latencies you're seeing. This may be because you're running a large number of probes or a number of slow probes.
Also, try looking at an idle time. Do the various .processing counts all go to zero? If not, you have threads stuck and may want to restart the mid server.
12-11-2017 12:42 PM
Check the memory usage of the MID server application to see if it's high.
I recently ran into an issue where an Orchestration PowerShell script can take up to 2 minutes to run. Looking at the MID server, I see that Max Allocated Bytes is over 80% (about 820K) on an idle server. Restarting the MID server brings this back down to about 200K, and then the script runs in under 10 seconds.
CPU utilization on my MID server is fine, nothing out of whack, which is why I think it's a memory issue and possibly a memory leak, as this is a heavily used server for Discovery.
I'm running Istanbul patch 9 and I suspect a memory leak.
12-11-2017 12:59 PM
Usually, allocated memory isn't a concern; it's used memory I worry about. Garbage collection brings the allocated memory back down.
It is possible it's a leak, or it may be some stuck worker thread.
If you find this is happening consistently:
- you can try switching from j2ssh to sncssh if you're not on it already.
- you can collect a stack trace, either directly from the mid or by running a SystemCommand with MIDServerManage.thread_dump(midName) (a sketch follows this list), and look at the Worker-Standard threads to see where they might be stuck.
- you can create an incident before restarting and work with CS to collect a heap dump of your mid JVM to enable development to track down any leak.
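For the thread dump option, here's a minimal background-script sketch; the MIDServerManage / thread_dump names are as written above and the exact invocation may vary by release, so treat it as an approximation:

```javascript
// Sketch only: request a thread dump from a MID server via the SystemCommand
// route mentioned above. The dump ends up in that MID server's agent log,
// where you can inspect the Worker-Standard threads.
var midName = 'my_mid_server'; // hypothetical MID server name
MIDServerManage.thread_dump(midName);
```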
12-11-2017 01:31 PM
Timothy -
Thanks for the feedback. We are using sncssh. This is definitely consistent on our MID servers; max used is way up there too - over 800MB.
It's also repeatable - we reset a server and within a day it's back to the same levels.
I'll check the stack trace. It's on my to-do list to file an HI ticket.
Paul
12-11-2017 03:30 PM
It's expected behavior that the allocated memory would float up. It's just the way Java works.
There is a known issue, fixed in Kingston, where sncssh can leak resources on error. See KB0598937 for a workaround.