Claire_Conant
ServiceNow Employee
If you’re seeing “failure threshold reached” or a “Max Failures” status on your CMDB Health Dashboard, you’re not alone. This is one of the more common CMDB Health messages, and it tends to catch people off guard because scoring simply stops without much explanation.
The good news: This is almost always fixable by addressing the underlying data, not by raising the threshold. Here’s how to figure out what’s going on and where to go from there.
What does “failure threshold reached” actually mean?
The CMDB Health Dashboard evaluates your CIs against health metrics: Completeness (required and recommended attributes), Correctness (duplicates, orphan CIs, and stale CIs), and Compliance. Each metric has a configured failure threshold (the default is 50,000). When the number of CIs failing a metric hits that threshold, the system stops processing and sets the status to Max Failures.
The score for that metric may be incomplete or missing until the underlying failures are resolved. If left unresolved, the same thing happens again on the next cycle.
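If you want to see what thresholds are actually configured before anything fails, you can read them straight from the CMDB Health Metric Preferences table covered later in this post. Below is a minimal sketch using the ServiceNow Table API from Python; the instance URL and credential environment variables are placeholders for your environment, and since column names can vary by release, it prints whole records rather than assuming a specific threshold field name.

```python
# A minimal sketch: read the CMDB Health metric preference records, which
# hold the configured failure thresholds, via the ServiceNow Table API.
import os

import requests

INSTANCE = "https://your-instance.service-now.com"  # placeholder instance URL
AUTH = (os.environ["SN_USER"], os.environ["SN_PASS"])  # basic-auth credentials

resp = requests.get(
    f"{INSTANCE}/api/now/table/cmdb_health_metric_pref",
    params={"sysparm_limit": 50},
    auth=AUTH,
    headers={"Accept": "application/json"},
)
resp.raise_for_status()

for record in resp.json()["result"]:
    # Print whole records so you can see which column holds each
    # metric's threshold; exact column names may vary by release.
    print(record)
```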
Why you should fix the data and not raise the threshold
The common first instinct is to increase the threshold to make the error go away. That approach creates more problems than it solves: processing large volumes of failures is expensive and slows down health jobs, and the failure data often goes unused. The stronger approach is to reduce the number of CIs actually failing, which also improves your CMDB data quality.
Which metric is affected?
The resolution path depends entirely on which metric is hitting the threshold. To find it, open CMDB Workspace → Health Dashboard and look for the metric card showing a pink Max Failures status, then follow the row in the table below that matches your scenario.
| Metric showing Max Failures | Most common cause | Where to start |
| --- | --- | --- |
| Completeness | Missing or inapplicable attributes | Review health config in CI Class Manager. See Remove unnecessary attributes from health configuration |
| Correctness: Duplicates | Dedupe backlog building up | Check top 10 classes in failure scorecard. See Use dedupe template remediation |
| Correctness: Stale | Threshold too tight or source not running | Check the Staleness Rule in CI Class Manager. See Adjust the staleness threshold |
| Correctness: Orphans | Rule conditions too broad | Review orphan conditions in CI Class Manager. See Adjust orphan conditions |
| Compliance | Audit conditions don’t match environment | Review audit template conditions. See Adjust audit conditions |
Completeness: missing attributes
Completeness failures mean required or recommended attributes are missing across a large number of CIs. The CMDB Health Dashboard evaluates Completeness as two separate sub-metrics: Required and Recommended. Check the info icon on the Completeness card to see which sub-metric hit the threshold. The most common root causes:
- Attributes that don’t apply to your environment. Remove them from the health configuration in CI Class Manager so they’re no longer evaluated. See Remove unnecessary attributes from health configuration.
- Data source mapping gaps. If the attribute should be populated but isn’t, check your discovery source or connector configuration; the mapping for that attribute may be missing or misconfigured.
- Classes that shouldn’t be evaluated. Some classes are expensive to process or don’t contain meaningful data for health assessment. Use inclusion rules to exclude them from Completeness evaluation.
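Before trimming the health configuration, it helps to know which classes carry the most missing-attribute failures. Here is a minimal sketch using the ServiceNow Aggregate API to count CIs with an empty attribute, grouped by class; serial_number is only an illustrative attribute, and the instance URL and credentials are placeholders.

```python
# A minimal sketch: size a Completeness problem before touching the health
# config, by counting CIs with an empty attribute, grouped by class.
import os

import requests

INSTANCE = "https://your-instance.service-now.com"  # placeholder instance URL
AUTH = (os.environ["SN_USER"], os.environ["SN_PASS"])  # basic-auth credentials
ATTRIBUTE = "serial_number"  # example attribute; swap in the one failing for you

resp = requests.get(
    f"{INSTANCE}/api/now/stats/cmdb_ci",  # Aggregate API endpoint
    params={
        "sysparm_query": f"{ATTRIBUTE}ISEMPTY",
        "sysparm_count": "true",
        "sysparm_group_by": "sys_class_name",
    },
    auth=AUTH,
    headers={"Accept": "application/json"},
)
resp.raise_for_status()

groups = sorted(
    resp.json()["result"], key=lambda g: int(g["stats"]["count"]), reverse=True
)
for g in groups[:10]:  # top 10 classes by missing-attribute count
    cls = g["groupby_fields"][0]["value"]
    print(f"{cls}: {g['stats']['count']} CIs missing {ATTRIBUTE}")
```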
Correctness: duplicate CIs
Duplicate failures mean the system has identified more CIs as potential duplicates than the threshold allows. Start by examining the top 10 classes in the failure scorecards, then:
- Resolve legitimate duplicates using deduplication template remediation. Look for dedupe tasks already attached to the affected CIs.
- Exclude classes that can’t be easily resolved by applying inclusion rules at the global level for classes that consistently produce duplicate noise.
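To gauge how large the dedupe backlog is, you can count open de-duplication tasks. The sketch below assumes those tasks live in the reconcile_duplicate_task table, which is where duplicate CI remediation tasks are typically stored; verify the table name on your release before relying on it.

```python
# A minimal sketch: count open dedupe tasks by state, assuming they live in
# the reconcile_duplicate_task table (verify on your release).
import os

import requests

INSTANCE = "https://your-instance.service-now.com"  # placeholder instance URL
AUTH = (os.environ["SN_USER"], os.environ["SN_PASS"])  # basic-auth credentials

resp = requests.get(
    f"{INSTANCE}/api/now/stats/reconcile_duplicate_task",  # assumed table name
    params={
        "sysparm_query": "active=true",
        "sysparm_count": "true",
        "sysparm_group_by": "state",
    },
    auth=AUTH,
    headers={"Accept": "application/json"},
)
resp.raise_for_status()

for g in resp.json()["result"]:
    state = g["groupby_fields"][0]["value"]
    print(f"state {state}: {g['stats']['count']} open dedupe tasks")
```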
Correctness: stale CIs
Staleness failures appear when CIs haven’t been updated within the configured staleness threshold. Common causes:
- Staleness threshold is too restrictive. Adjust the effective duration for the affected class (CI Class Manager > Health > Correctness > Staleness Rule).
- Data sources aren’t running frequently enough. Check when the source last ran and whether the CI is still being discovered. Rerunning the connector often resolves the immediate failures.
- Stale records that are no longer valid. Archive or delete CIs that no longer exist in your environment to reduce the failure count.
- Classes that are intentionally static. Some classes don’t change often by design. Use inclusion rules to exclude them from staleness monitoring.
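Before loosening a staleness rule, it’s worth confirming where the stale CIs actually are. This sketch counts CIs whose sys_updated_on is older than a cutoff, grouped by class, via the Aggregate API; the 90-day window is an arbitrary example and should be matched to your configured staleness threshold.

```python
# A minimal sketch: find which classes hold the most stale CIs by counting
# records not updated since a cutoff date, grouped by class.
import os
from datetime import datetime, timedelta, timezone

import requests

INSTANCE = "https://your-instance.service-now.com"  # placeholder instance URL
AUTH = (os.environ["SN_USER"], os.environ["SN_PASS"])  # basic-auth credentials

cutoff = datetime.now(timezone.utc) - timedelta(days=90)  # example window
query = f"sys_updated_on<{cutoff:%Y-%m-%d %H:%M:%S}"

resp = requests.get(
    f"{INSTANCE}/api/now/stats/cmdb_ci",
    params={
        "sysparm_query": query,
        "sysparm_count": "true",
        "sysparm_group_by": "sys_class_name",
    },
    auth=AUTH,
    headers={"Accept": "application/json"},
)
resp.raise_for_status()

groups = sorted(
    resp.json()["result"], key=lambda g: int(g["stats"]["count"]), reverse=True
)
for g in groups[:10]:  # top 10 classes by stale-CI count
    print(g["groupby_fields"][0]["value"], g["stats"]["count"])
```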
Correctness: orphan CIs
Orphan CI failures appear when CIs fail the orphan check. This typically happens because relationship records reference CIs that no longer exist, or because the orphan rule conditions are producing false positives. Check the failure description for each CI to understand why it’s failing, then:
- Refine the Orphan Rule conditions. If the conditions are too broad or don’t match your environment, adjust them for the affected class (CI Class Manager > Health > Correctness > Orphan Rule).
- Exclude classes where orphan checking isn’t meaningful. Apply inclusion rules at the global level for classes that can’t be easily resolved.
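A quick way to sanity-check orphan results is to confirm that flagged CIs really have no relationships. The sketch below samples CIs from one class and looks each one up in cmdb_rel_ci; cmdb_ci_server is only an example class, and the instance URL and credentials are placeholders.

```python
# A minimal sketch: spot-check whether CIs in a class have any relationship
# records in cmdb_rel_ci (as parent or child).
import os

import requests

INSTANCE = "https://your-instance.service-now.com"  # placeholder instance URL
AUTH = (os.environ["SN_USER"], os.environ["SN_PASS"])  # basic-auth credentials
CLASS_NAME = "cmdb_ci_server"  # example class; use the one failing the check

session = requests.Session()
session.auth = AUTH
session.headers["Accept"] = "application/json"

# Pull a small sample of CIs from the class under investigation.
cis = session.get(
    f"{INSTANCE}/api/now/table/{CLASS_NAME}",
    params={"sysparm_fields": "sys_id,name", "sysparm_limit": 25},
).json()["result"]

for ci in cis:
    # A CI with no cmdb_rel_ci record as parent or child has no relationships.
    rels = session.get(
        f"{INSTANCE}/api/now/table/cmdb_rel_ci",
        params={
            "sysparm_query": f"parent={ci['sys_id']}^ORchild={ci['sys_id']}",
            "sysparm_limit": 1,
        },
    ).json()["result"]
    if not rels:
        print(f"{ci['name']} ({ci['sys_id']}): no relationships found")
```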
Compliance: audit failures
Compliance failures come through audits, including desired state and scripted audits, so the resolution approach differs from the other metrics. The failure description on each CI identifies which specific audit is causing the issue.
- Adjust audit conditions. This is the primary fix. Go to the audit template causing failures and review the certification conditions. Adjusting conditions to better match your environment reduces failures while maintaining Compliance requirements.
- Narrow the audit scope. If too many CIs are being evaluated, add more specific filter conditions to the audit template to reduce the evaluation scope.
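If you’re not sure which template is involved, you can list the audit templates on the instance. The sketch below assumes certification/audit templates live in the cert_template table; that table name is an assumption and may differ on your release, so confirm it in your instance first.

```python
# A minimal sketch: list audit templates so you can locate the one producing
# the failures. The cert_template table name is an assumption; confirm it.
import os

import requests

INSTANCE = "https://your-instance.service-now.com"  # placeholder instance URL
AUTH = (os.environ["SN_USER"], os.environ["SN_PASS"])  # basic-auth credentials

resp = requests.get(
    f"{INSTANCE}/api/now/table/cert_template",  # assumed table name
    params={"sysparm_limit": 50},
    auth=AUTH,
    headers={"Accept": "application/json"},
)
resp.raise_for_status()

for rec in resp.json()["result"]:
    print(rec.get("name"), rec.get("sys_id"))
```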
If the data fixes aren’t enough
If you’ve worked through the root causes and still can’t get below the threshold, increasing it is available as a last resort. Keep in mind that thresholds are configured per metric in the CMDB Health Metric Preferences [cmdb_health_metric_pref] table, so you're adjusting each one individually. Higher thresholds mean more failures to process, which directly affects health job duration. There’s no hard maximum, but the system impact scales with the value you set.
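If you do decide to raise a threshold, it’s a record update on the table above. Here’s a minimal sketch via the Table API; the sys_id placeholder and the failure_threshold field name are assumptions, so inspect the record in your instance to confirm both before patching.

```python
# A minimal sketch of the last-resort option: raising one metric's failure
# threshold by patching its cmdb_health_metric_pref record.
import os

import requests

INSTANCE = "https://your-instance.service-now.com"  # placeholder instance URL
AUTH = (os.environ["SN_USER"], os.environ["SN_PASS"])  # basic-auth credentials
PREF_SYS_ID = "<sys_id of the metric preference record>"  # fill in from your instance

resp = requests.patch(
    f"{INSTANCE}/api/now/table/cmdb_health_metric_pref/{PREF_SYS_ID}",
    json={"failure_threshold": "75000"},  # assumed field name; verify first
    auth=AUTH,
    headers={"Accept": "application/json"},
)
resp.raise_for_status()
print(resp.json()["result"])
```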
Running Xanadu before Patch 4? Read this first
A change introduced in Xanadu GA capped failure thresholds at 100,000 (via the glide.cmdb.health.max_failure_threshold system property), overriding any higher value you had configured. This was resolved in Xanadu Patch 4. If you’re running Xanadu before Patch 4, upgrade to restore expected threshold behavior. See Xanadu-specific behavior for details.
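To check whether the capping property is present on your instance, you can read it from sys_properties. A minimal sketch follows; the instance URL and credentials are placeholders.

```python
# A minimal sketch: look up the Xanadu cap property in sys_properties.
import os

import requests

INSTANCE = "https://your-instance.service-now.com"  # placeholder instance URL
AUTH = (os.environ["SN_USER"], os.environ["SN_PASS"])  # basic-auth credentials

resp = requests.get(
    f"{INSTANCE}/api/now/table/sys_properties",
    params={
        "sysparm_query": "name=glide.cmdb.health.max_failure_threshold",
        "sysparm_fields": "name,value",
    },
    auth=AUTH,
    headers={"Accept": "application/json"},
)
resp.raise_for_status()

rows = resp.json()["result"]
print(rows[0] if rows else "property not set on this instance")
```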
Where to go from here
Most threshold errors are resolvable by addressing the underlying data quality issues described above. If you’ve worked through the resolution steps and the issue persists, the ServiceNow Community CMDB forum is a good next step for environment-specific questions.
- How to resolve max failure threshold errors in CMDB Health Dashboard
- Understand CMDB Health Dashboard score calculation changes
- CMDB Health Dashboard has moved to the CMDB Workspace
- CMDB Health process status: failure threshold reached
- Set up and configure CMDB Health
- View CMDB Health Dashboard
