
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
‎11-30-2017 10:40 AM
Hi,
I have a workflow that has extremely inconsistent runtimes. The expected runtime is around 40 seconds, but it can take upwards of 20 minutes, so I put some timestamps everywhere.
I isolated the problem to having the MID server return the results of the script after it's finished running. For example: in a run that took 22 minutes everything leading up to the Powershell activity took 1 second, the script itself took 42 seconds (there's a get-date as the very last line of the script), but the MID server didn't return a response until over 21 minutes later. The rest of the workflow took less than a second. Out of a 21:55 runtime 21:12 was spent waiting for a response and only 43 seconds was spent actually running stuff.
Solved! Go to Solution.
- Labels:
-
Orchestration
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
‎11-30-2017 01:42 PM
Here's the inside story on what happens with probes.
(When you look at the ecc_queue, make sure you display columns for created, processed, updated, and state.)
Here's the flow:
- Somebody creates an output record in the ecc_queue (output created time = now, state = ready)
- Load balancing of mids within a cluster occurs (currently in a business rule)
- An AMB message is sent to the assigned mid server notifying him that he has work to do
- The mid queries the last 30 minutes of ecc_queue output results, retrieving any new work onto an internal queue
- The mid sends a queue.processing message back indicating which ecc_queue outputs got picked up for processing
- A business rule in the instance marks all the outputs in processing (output processed time = now, state = processing)
- The mid has some number of worker threads grinding away emptying the internal queue
- The time reported in the input message is the time spent in the actual worker thread
- On completion, the result is inserted in an ecc_queue input record (input created time = now, state = ready)
- When the instance sees the input, a business rule marks the output processed (output updated time = now, state = processed)
- An asynchronous sensor job is scheduled (sys_trigger).
- The sensor starts, (input processed time = now, state = processing)
- The sensor completes, (input updated time = now, state = processed)
So, we can derive how much time a pobe spends where from the created, processed, and updated fields:
- Output queue time = output processed - output created
- Mid server queue time = output updated - output processed - mid server exec time (from input payload)
- Mid processing time = mid server exec time (from input payload)
- Input queue time = input processed - input created
- Input processing time = input updated - input processed
(You can find this logic in code form in the Discovery Timeline implementation.)
Diagnosing problems:
- If the output queue time is huge, you may be seeing congestion in the API_INT semaphores due to excessive SOAP / REST traffic.
- Excessive Mid server processing or queueing time tends to indicate that your mids are overloaded or have stuck worker threads.
- Excessive input queue / processing times indicate too much traffic for the number of nodes you have, or perhaps load balancing issues.
Once you've narrowed in on which area is getting bogged down, we can focus in on debugging that area.
- Tim.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
‎11-30-2017 10:59 AM
It is worth looking at the logs on the MID to see what's taking the time?
I'm curious to know if it's the PowerShell script itself that takes ages, or if the MID is struggling to get the response up.
If you disable that particular PowerShell script, what is the duration like? Can you isolate which one it is?

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
‎11-30-2017 12:00 PM
I tried looking at the MID server logs but it only seems to create logs for Discovery events and not Orchestration.
The last line in the Powershell script is a timestamp that points to the MID server being the problem. The delay is after the Powershell script is done and before Post-processing starts, so I don't really have any visibility into what's going on during that time.
The entire workflow is 2 small Javascripts and 1 big Powershell activity so it's definitely the Powershell.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
‎11-30-2017 12:06 PM
Aaron Li wrote:
I tried looking at the MID server logs but it only seems to create logs for Discovery events and not Orchestration.
Where are you looking for these? I was thinking the text files found in the "logs" directory on the MID itself, rather than stuff found in the platform - agent0.log, etc.
The last line in the Powershell script is a timestamp that points to the MID server being the problem. The delay is after the Powershell script is done and before Post-processing starts, so I don't really have any visibility into what's going on during that time.
Yeah, that feels odd. It's not a 3P script, or there's no post-processing going on somewhere that's interrupting the response at all?
Could you try building another MID - either same box or another Winboxen - and try the same experiment? Just curious to compare and contrast different processing.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
‎11-30-2017 01:42 PM
Here's the inside story on what happens with probes.
(When you look at the ecc_queue, make sure you display columns for created, processed, updated, and state.)
Here's the flow:
- Somebody creates an output record in the ecc_queue (output created time = now, state = ready)
- Load balancing of mids within a cluster occurs (currently in a business rule)
- An AMB message is sent to the assigned mid server notifying him that he has work to do
- The mid queries the last 30 minutes of ecc_queue output results, retrieving any new work onto an internal queue
- The mid sends a queue.processing message back indicating which ecc_queue outputs got picked up for processing
- A business rule in the instance marks all the outputs in processing (output processed time = now, state = processing)
- The mid has some number of worker threads grinding away emptying the internal queue
- The time reported in the input message is the time spent in the actual worker thread
- On completion, the result is inserted in an ecc_queue input record (input created time = now, state = ready)
- When the instance sees the input, a business rule marks the output processed (output updated time = now, state = processed)
- An asynchronous sensor job is scheduled (sys_trigger).
- The sensor starts, (input processed time = now, state = processing)
- The sensor completes, (input updated time = now, state = processed)
So, we can derive how much time a pobe spends where from the created, processed, and updated fields:
- Output queue time = output processed - output created
- Mid server queue time = output updated - output processed - mid server exec time (from input payload)
- Mid processing time = mid server exec time (from input payload)
- Input queue time = input processed - input created
- Input processing time = input updated - input processed
(You can find this logic in code form in the Discovery Timeline implementation.)
Diagnosing problems:
- If the output queue time is huge, you may be seeing congestion in the API_INT semaphores due to excessive SOAP / REST traffic.
- Excessive Mid server processing or queueing time tends to indicate that your mids are overloaded or have stuck worker threads.
- Excessive input queue / processing times indicate too much traffic for the number of nodes you have, or perhaps load balancing issues.
Once you've narrowed in on which area is getting bogged down, we can focus in on debugging that area.
- Tim.