MID Server resource threshold alerts
The instance displays warnings when a MID Server breaches its resource thresholds for CPU and JVM memory usage, enabling users to create email notifications or custom scripts when a breach occurs.
The MID Server Issue [ecc_agent_issue] table warns users when a MID Server exceeds the configured thresholds for its allocated CPU and memory resources. These warnings are published before the MID Server experiences performance degradation or an out-of-memory error, enabling the administrator to increase resources and avoid downtime. Administrators can use a registered event to send email notifications to selected recipients, advising them of any threshold breaches, or to run a custom script that takes some other action. The instance continues to update the MID Server Issue [ecc_agent_issue] table to keep unresolved issues current.
- mid.threshold.resource.breach.enable.cpu.alerts
- mid.threshold.resource.breach.enable.memory.alerts
Evaluation process
- Every 10 minutes, each MID Server transmits its CPU and memory consumption metrics to the instance. The instance inserts CPU metrics into the Mean CPU used % field of the ECC Agent Scalar Metric [ecc_agent_scalar_metric] table and memory metrics into the Max memory used % field of the ECC Agent Memory Metric [ecc_agent_memory_metric] table.
- After a successful insert, the following business rules run on each table, invoking a script include that calls an appropriate function:
- Update cpu mean on MID Server Status: Calls the MIDResourceThresholdBreach.checkCpuUsage() function.
- Update max memory on MID Server Status: Calls the MIDResourceThresholdBreach.checkMemoryUsage() function.
Each function takes an average of the metric sets inserted into the tables, based on the configured thresholds and sampling intervals. The instance first looks at each MID Server for configuration parameters that set custom threshold values or sampling intervals for that MID Server. If no configuration parameters for these attributes are found, the instance looks in the System Properties [sys_properties] table for custom values to use. If no properties are found, the instance uses the default threshold and interval values from the code.
Note: Both the threshold percentages and the sampling intervals are configurable. See Configuring thresholds and sampling intervals for details.
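The lookup order described above (MID Server configuration parameter, then system property, then built-in default) can be sketched as follows. This is only an illustration in Python, not the script include itself: the function name `resolve_setting` and the two dictionaries standing in for the MID Server parameters and the System Properties table are hypothetical. The default values are the ones listed under Configuring thresholds and sampling intervals.

```python
from typing import Mapping

# Default values as documented in "Configuring thresholds and sampling intervals".
DEFAULTS = {
    "mid.threshold.mean_cpu.percent": 95,
    "mid.threshold.mean_cpu.aggregate_interval_span": 3,
    "mid.threshold.mean_max_memory.percent": 95,
    "mid.threshold.mean_max_memory.aggregate_interval_span": 3,
}


def resolve_setting(
    name: str,
    mid_config: Mapping[str, str],
    system_properties: Mapping[str, str],
) -> int:
    """Return the effective value for one threshold or interval setting.

    Hypothetical helper: checks the MID Server's own configuration
    parameters first, then instance-wide system properties, and finally
    falls back to the documented default.
    """
    for source in (mid_config, system_properties):
        if name in source:
            return int(source[name])
    return DEFAULTS[name]
```

For example, a MID Server with its own `mid.threshold.mean_cpu.percent` parameter keeps that value even when a system property with the same name exists, while a MID Server without the parameter picks up the system property.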
Alerting process
- If the aggregated average metric value equals or exceeds the configured percent threshold, the instance triggers the mid.threshold.resource.breach event. Administrators can use this event to create email notifications for threshold breach alerts or to create a custom script.
- The instance inserts a record of the breach into the MID Server Issue [ecc_agent_issue] table with a State value of New and a Count of 1, and then publishes a message containing all the pertinent details of the breach. An example of this message is:
Mean CPU used % has exceeded threshold (96>=91) for a 40 minute interval span, occurring after start date 2017-01-11 14:25:19.
This message appears in the Short description field of the MID Server Issue form and in the event. You can copy any part of the message into your email notifications.
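As a rough illustration of the "equals or exceeds" comparison and the shape of the resulting message, the following Python sketch rebuilds the example text from its parts. The function `breach_message` and its parameters are hypothetical and only mirror the wording of the sample message; the instance's actual implementation is not shown in this topic.

```python
from datetime import datetime
from typing import Optional


def breach_message(
    metric_label: str,
    average_pct: int,
    threshold_pct: int,
    interval_span: int,
    start: datetime,
) -> Optional[str]:
    """Return alert text when the average meets or exceeds the threshold.

    Hypothetical helper. `interval_span` is the number of 10 minute units
    in the sampling interval, so a span of 4 covers 40 minutes.
    """
    if average_pct < threshold_pct:
        return None  # below the threshold: no breach, no message
    minutes = interval_span * 10
    return (
        f"{metric_label} has exceeded threshold "
        f"({average_pct}>={threshold_pct}) for a {minutes} minute interval span, "
        f"occurring after start date {start:%Y-%m-%d %H:%M:%S}"
    )
```

Calling `breach_message("Mean CPU used %", 96, 91, 4, datetime(2017, 1, 11, 14, 25, 19))` reproduces the example text above, without the trailing period. An average equal to the threshold also counts as a breach.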
MID Server issue states
Recommendations for resolving resource issues
- JVM memory:
- Allocate more max memory to the MID Server. For more information, see Set the MID Server JVM memory size.
- Add additional MID Servers to share the workload. For more information, see MID Server clusters.
- Reduce the amount of concurrent processing for the MID Server. This includes segmenting IP Address ranges into smaller segments for a Discovery schedule or loading smaller segments of data within an import job.
- CPU: Reduce the activity on the host or migrate the MID Server to a new host with more available resources.
Note: A MID Server can create a resource usage spike during Discovery, especially when discovering a large number of targets or executing multiple PowerShell sessions concurrently. The MID Server host's resource utilization automatically returns to normal after the Discovery execution stops. To decrease CPU utilization, host the MID Server on a dedicated machine. If you encounter resource usage issues, make sure that only one MID Server runs on each dedicated host machine. If the MID Server is hosted on a public cloud, add more CPU resources and avoid the noisy neighbor issue. For more information, see High CPU Usage on Host with MID Server(s) [KB0597639].
Tables used for resource threshold evaluation
| Table | Description |
|---|---|
| MID Server Issue [ecc_agent_issue] | Stores data on various types of MID Server issues, including breaches of configured CPU and memory thresholds. Fields used for resource threshold breaches are: |
| MID Server Status [ecc_agent_status] | Stores the percentages used for the CPU and memory resources, averaged over configurable intervals for each resource. The fields used are: |
| ECC Agent Scalar Metric [ecc_agent_scalar_metric] | Stores the CPU usage data inserted by each MID Server every 10 minutes. The table field used by resource threshold alerting is mean. |
| ECC Agent Memory Metric [ecc_agent_memory_metric] | Stores the memory usage data inserted by each MID Server every 10 minutes. The table field used by resource threshold alerting is max_used_pct. |
Business rules that check for threshold breaches
| Business rule | Description |
|---|---|
| Update cpu mean on MID Server Status | Runs after the MID Server inserts a record into the ECC Agent Scalar Metric [ecc_agent_scalar_metric] table. This business rule triggers the MIDResourceThresholdBreach script include function that evaluates threshold settings to determine if the MID Server has breached its configured CPU resource thresholds. |
| Update max memory on MID Server Status | Runs after the MID Server inserts a record into the ECC Agent Memory Metric [ecc_agent_memory_metric] table. This business rule triggers the MIDResourceThresholdBreach script include function that evaluates threshold settings to determine if the MID Server has breached its configured memory resource thresholds. |
Configuring thresholds and sampling intervals
- Add system properties to the instance and change the default values for all MID Servers.
- Add configuration parameters to change the default resource values for individual MID Servers.
| Property/configuration parameter | Description |
|---|---|
| mid.threshold.mean_cpu.aggregate_interval_span | Number of 10 minute units in the interval for sampling CPU usage data. The default interval is 30 minutes (3 x 10 min.). Default: 3 |
| mid.threshold.mean_cpu.percent | Usage percentage of the total CPU resources that initiates a threshold breach alert. Default: 95 |
| mid.threshold.mean_max_memory.aggregate_interval_span | Number of 10 minute units in the interval for sampling memory usage data. The default interval is 30 minutes (3 x 10 min.). Default: 3 |
| mid.threshold.mean_max_memory.percent | Usage percentage of the total memory resources that initiates a threshold breach alert. Default: 95 |
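To make the interval span arithmetic concrete, here is a small Python sketch that averages the most recent samples for a given span, with one sample per 10 minute report. The function `windowed_average` is hypothetical. Returning `None` when fewer samples exist than the span requires is an assumption made for this sketch; this topic does not say how the instance treats an incomplete window.

```python
from typing import Optional, Sequence


def windowed_average(samples: Sequence[float], interval_span: int) -> Optional[float]:
    """Average the most recent `interval_span` samples.

    Hypothetical helper. Each sample represents one 10 minute report, so
    the default span of 3 covers 30 minutes of data.
    """
    if interval_span < 1:
        raise ValueError("interval_span must be at least 1")
    window = samples[-interval_span:]
    if len(window) < interval_span:
        # Assumption: wait until a full window of samples is available.
        return None
    return sum(window) / len(window)
```

With the default span of 3, the samples `[50, 90, 96, 99]` give an average of 95.0 over the last 30 minutes, which meets the default threshold of 95. Raising the span to 4 lowers the average to 83.75, which shows how a longer interval smooths out short spikes.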
MID Server resource reporting
- Avg Percentage of CPU Used: Trending the daily average on CPU usage helps illustrate the amount of CPU processing that the MID Server host consumes. MID Servers deployed on the same host will report the same CPU usage.
- Avg Percentage of Max Memory Used: The maximum used percentage (max_used_pct) is a useful metric for determining if the MID Server has enough memory resources. This metric is a percentage of the max used memory over the total available memory. Trending this over time provides a visualization of how much memory is needed by the MID Server.
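The two reporting metrics can be illustrated with a short Python sketch: one function computes a used-memory percentage from raw values, and the other groups timestamped samples into daily averages for trending. Both function names and the input shapes are hypothetical and are not taken from the instance's report definitions.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, List, Tuple


def max_used_pct(max_used: float, total_available: float) -> float:
    """Percentage of the max used memory over the total available memory."""
    return 100.0 * max_used / total_available


def daily_average(samples: Iterable[Tuple[datetime, float]]) -> Dict[str, float]:
    """Group (timestamp, percent) samples by calendar day and average each day."""
    buckets: Dict[str, List[float]] = defaultdict(list)
    for timestamp, pct in samples:
        buckets[timestamp.date().isoformat()].append(pct)
    return {day: sum(values) / len(values) for day, values in sorted(buckets.items())}
```

Plotting the values returned by `daily_average` over several weeks gives the kind of trend line described above for judging whether a MID Server has enough CPU or memory.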