MID Server takes extremely long time to return results

aaronliprime · ‎11-30-2017

Hi,

I have a workflow that has extremely inconsistent runtimes. The expected runtime is around 40 seconds, but it can take upwards of 20 minutes, so I put some timestamps everywhere.

I isolated the problem to having the MID server return the results of the script after it's finished running. For example: in a run that took 22 minutes everything leading up to the Powershell activity took 1 second, the script itself took 42 seconds (there's a get-date as the very last line of the script), but the MID server didn't return a response until over 21 minutes later. The rest of the workflow took less than a second. Out of a 21:55 runtime 21:12 was spent waiting for a response and only 43 seconds was spent actually running stuff.

tim_broberg · ‎11-30-2017

Here's the inside story on what happens with probes.

(When you look at the ecc_queue, make sure you display columns for created, processed, updated, and state.)

Here's the flow:

Somebody creates an output record in the ecc_queue (output created time = now, state = ready)
Load balancing of mids within a cluster occurs (currently in a business rule)
An AMB message is sent to the assigned mid server notifying him that he has work to do
The mid queries the last 30 minutes of ecc_queue output results, retrieving any new work onto an internal queue
The mid sends a queue.processing message back indicating which ecc_queue outputs got picked up for processing
A business rule in the instance marks all the outputs in processing (output processed time = now, state = processing)
The mid has some number of worker threads grinding away emptying the internal queue
The time reported in the input message is the time spent in the actual worker thread
On completion, the result is inserted in an ecc_queue input record (input created time = now, state = ready)
When the instance sees the input, a business rule marks the output processed (output updated time = now, state = processed)
An asynchronous sensor job is scheduled (sys_trigger).
The sensor starts, (input processed time = now, state = processing)
The sensor completes, (input updated time = now, state = processed)

So, we can derive how much time a pobe spends where from the created, processed, and updated fields:

Output queue time = output processed - output created
Mid server queue time = output updated - output processed - mid server exec time (from input payload)
Mid processing time = mid server exec time (from input payload)
Input queue time = input processed - input created
Input processing time = input updated - input processed

(You can find this logic in code form in the Discovery Timeline implementation.)

Diagnosing problems:

If the output queue time is huge, you may be seeing congestion in the API_INT semaphores due to excessive SOAP / REST traffic.
Excessive Mid server processing or queueing time tends to indicate that your mids are overloaded or have stuck worker threads.
Excessive input queue / processing times indicate too much traffic for the number of nodes you have, or perhaps load balancing issues.

Once you've narrowed in on which area is getting bogged down, we can focus in on debugging that area.

- Tim.

View solution in original post

Dave Smith1 · ‎11-30-2017

It is worth looking at the logs on the MID to see what's taking the time?

I'm curious to know if it's the PowerShell script itself that takes ages, or if the MID is struggling to get the response up.

If you disable that particular PowerShell script, what is the duration like? Can you isolate which one it is?

aaronliprime · ‎11-30-2017

I tried looking at the MID server logs but it only seems to create logs for Discovery events and not Orchestration.

The last line in the Powershell script is a timestamp that points to the MID server being the problem. The delay is after the Powershell script is done and before Post-processing starts, so I don't really have any visibility into what's going on during that time.

The entire workflow is 2 small Javascripts and 1 big Powershell activity so it's definitely the Powershell.

Dave Smith1 · ‎11-30-2017

Aaron Li wrote: