Probe Life-cycle and Timing on the ecc_queue

tim_broberg · ‎02-26-2018

This is often useful to me, but I'm forever forgetting the details, so here's a handy reference for myself and the few others who care about tracking down where mid server performance is leaking away.

(MID log messages for a particular probe will all contain the sysid of the ecc output probe. Search logs/agent0.log.* for ecc output sysid to find all traffic for a probe.)

Here's the flow of a probe:

Somebody creates an output record in the ecc_queue (output created time = now, state = ready)
Load balancing of mids within a cluster occurs (legacy in a business rule where mid server is assigned randomly, but various applications do load balancing in various ways)
An AMB message is sent to the assigned mid server notifying him that he has work to do
The mid queries up to the last 30 minutes of ecc_queue output results, retrieving any new work onto an internal queue
The mid sends a queue.processing message back indicating which ecc_queue outputs got picked up for processing. (MID log message "Worker starting: <topic> source: <source>")
A business rule in the instance marks all the outputs in processing (output processed time = now, state = processing)
The mid has some number of worker threads grinding away emptying the internal queue. Currently, by default, this is 25 for priority 2, and 10 each for priorities 0 and 1.
On completion, an xml file is created on the mid server file system in work/monitors/ECCSender/output_<priority>/ecc_queue.<ecc output sysid>.xml. (The time reported in the input message is the time spent in the actual worker thread.) (MID log message "Enqueuing: <mid path>/work/monitors/ECCSender/output_2/ecc_queue.<ecc output sysid>xml"
MID execution thread completes the command (MID log message "Worker completed: SSHCommand source: 10.152.175.24 time: 0:00:29.192")
The ECCSender monitor sends ecc xml files in priority order, inserting the contents of the file in an ecc_queue input record (input created time = now, state = ready) (MID log message "Sending ecc_queue.<ecc output sysid>.xml")
When the instance sees the input, a business rule marks the output processed (output updated time = now, state = processed)
An asynchronous sensor job is scheduled (sys_trigger).
The sensor starts, (input processed time = now, state = processing) (App node localhost log message "SensorProcessor - Processing sensor: <sensor> (<source>) <ecc input sysid>")
The sensor completes, (input updated time = now, state = processed) (App node localhost log message "SensorProcessor - Processed sensors: <sensor> [<sensor sysid>] (<source>) in <time_s>")

So, if you look at the ecc_queue and watch the columns created, processed, updated, and state, you can see how long it takes the probe to move through each stage in the pipeline:

Output queue time = output processed - output created
Mid server queue time = output updated - output processed - mid server exec time (from input payload)
Mid processing time = mid server exec time (from input payload)
Input queue time = input processed - input created
Input processing time = input updated - input processed

(You can find this logic in code form in the Discovery Timeline implementation, and you can see this graphically, for small discoveries.)

If you look at these times and see one or more of them shoot up to 45 minutes at a certain time every day, you have a problem.

Diagnosing problems:

If the output queue time is huge, you may be seeing congestion in the API_INT semaphores due to excessive SOAP / REST traffic. Generally, API loading issues are rare until you push beyond about 100 mid servers at which point, you may want to start tuning more carefully.
- consider decreasing the number of mid servers to reduce API load on the instance
- upgrading to Kingston to get improved mid server load balancing
Excessive Mid server processing or queueing time and / or outputs stuck in ready state tends to indicate that your mids are overloaded or have stuck worker threads.
- consider pausing that mid server, waiting a few minutes for active probes to complete, and taking a stack trace. If there are worker probes that are not idle, you may need to restart the mid and consider upgrading to get the latest bug fixes.
- otherwise, you may want to try increasing the number of mid servers.
Excessive input queue / processing times indicate too much traffic for the number of nodes you have, or perhaps load balancing issues. Options to consider:
- rebalancing discovery schedules to spread out activity more evenly throughout the day
- increasing node count
- upgrading to Kingston to get improved mid server load balancing
Excessive latency on the MID sending xml files where the time between output processing and input created is long but the mid processing time is short, and the time for a probe between enqueueing and sending is long indicates that a large number of probes have piled up in the mid. The can often happen during discovery for mids with significant latency back to the instance because the ECCSender sends results back serially whereas it fetches and processes them in parallel.

- Tim.

DaveHertel · ‎05-12-2018

Thanks Tim - a nice, handy reference for the lifecycle flow as well as useful tuning tips.

Thanks for sharing - Dave

yuan1 · ‎07-01-2019

Very helpful. It explained for output message the 'Processed' time is actually the time message state change to 'processing'. And 'Updated' time is actually the message got response back and being processed.

'

A business rule in the instance marks all the outputs in processing (output processed time = now, state = processing)
The mid has some number of worker threads grinding away emptying the internal queue
On completion, the result is inserted in an ecc_queue input record (input created time = now, state = ready) (The time reported in the input message is the time spent in the actual worker thread.)
When the instance sees the input, a business rule marks the output processed (output updated time = now, state = processed)
An asynchronous sensor job is scheduled (sys_trigger).
The sensor starts, (input processed time = now, state = processing)
The sensor completes, (input updated time = now, state = processed)
'

Vijay Kumar7 · ‎05-16-2022

Thanks Tim.