Every Dashboard Is Green. The Service Is Down. Explain That.
Your monitoring stack is world-class. Your observability platform ingests millions of events per minute. Your on-call team is experienced and diligent. And somehow, citizens are calling to report that the benefits portal isn't working — and nobody on the operations team saw it coming. This happens more than anyone wants to admit. Here's why, and how to fix it.
The Monitoring System Knew. It Just Didn't Know What It Knew.
Picture a Thursday afternoon at a state agency. The benefits eligibility portal — used by thousands of residents to check case status, submit documentation, and initiate appeals — starts returning errors. Not dramatically. Quietly. Response times climbing. A few transactions failing. Then more.
The operations center gets its first indication not from a monitoring alert, but from a call to the service desk. Then another. Then a supervisor's email. By the time the incident is officially declared, two hours have passed and the portal has been degraded the entire time.
Post-incident review reveals something uncomfortable: the monitoring platform had flagged four separate alerts during that window. A database query timeout. A message queue backup. Two application server memory warnings. Every signal was there. None of them were connected to each other, none were associated with the benefits portal, and none triggered a priority response — because individually, on their face, none of them looked like a service-affecting incident.
The monitoring system was working perfectly. It was just watching infrastructure. Nobody had told it to watch services.
✦ ✦ ✦
The Problem
Infrastructure Metrics Are a Weather Report for the Wrong City
The fundamental problem with infrastructure-centric monitoring is not that it measures the wrong things — CPU, memory, latency, error rates are all genuinely important signals. The problem is that it measures those things in isolation, without any model of what those things collectively support.
A server at 87% memory utilization is a data point. Whether that server hosts a component of the SNAP eligibility API or a batch reporting job that runs at 3 a.m. is context. The data point tells you something is elevated. The context tells you whether to wake anyone up.
The Three Failure Modes of Infrastructure-Only Monitoring
Alert fatigue: When every system fires independently, a single service disruption can generate dozens of alerts across monitoring tools, ticketing systems, and dashboards. Operations teams learn — rationally — to wait for the flood to subside before investigating. The problem is that the most important alerts arrive in the middle of the flood.
False all-clears: Dashboards showing green infrastructure metrics while a service is degraded are not hypothetical. They happen whenever a service failure stems from a dependency issue — a shared component that looks fine on its own, but is creating downstream problems for the applications that depend on it. Infrastructure health and service health are not the same thing.
Reactive discovery: Without a service model, the relationship between an infrastructure alert and its service impact is something operations teams have to figure out manually during an active incident — consulting architecture diagrams, asking colleagues, tracing dependencies in real time. That's investigation time that could be resolution time.
The solution isn't more monitoring. It's a different kind of monitoring — one that measures infrastructure signals against a map of the services those signals affect. That map is what the Common Service Data Model provides.
The Explanation
From Signal to Meaning: How CSDM Closes the Gap
CSDM organizes the CMDB into a structured hierarchy: infrastructure configuration items at the base, technical services above them, application services above those, business applications above those, and business capabilities at the top. Each layer maps to the one above it, creating an unbroken chain from a physical server or container all the way up to the business function it ultimately enables.
This hierarchy does one thing that raw telemetry cannot: it gives every infrastructure signal a service address. When a database query slows down, CSDM tells the operations platform which technical service that database belongs to, which application services depend on that technical service, which business application delivers user-facing functionality through those application services, and which business capability — benefits eligibility, case management, document submission — is at risk.
The signal didn't change. The meaning changed entirely.
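The upward mapping can be sketched in a few lines. This is an illustrative model only, not the ServiceNow data schema — every record name here is hypothetical — but it shows the core mechanic: each layer points to the one above it, so any infrastructure signal can be resolved to the business capability it ultimately supports.

```python
# Hypothetical CSDM-style hierarchy: each item maps upward to its parent
# layer. All names are illustrative, not real CMDB records.
PARENT = {
    # configuration item -> technical service
    "db-node-07": "eligibility-db-cluster",
    # technical service -> application service
    "eligibility-db-cluster": "benefits-portal-app-svc",
    # application service -> business application
    "benefits-portal-app-svc": "benefits-portal",
    # business application -> business capability
    "benefits-portal": "Benefits Eligibility",
}

def service_address(ci: str) -> list[str]:
    """Walk the hierarchy from a CI up to its business capability."""
    chain = [ci]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

print(service_address("db-node-07"))
# ['db-node-07', 'eligibility-db-cluster', 'benefits-portal-app-svc',
#  'benefits-portal', 'Benefits Eligibility']
```

A slow query on `db-node-07` is no longer a bare data point; the walk up the chain ends at "Benefits Eligibility," which is the name leadership actually recognizes.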
What This Looks Like in the Monitoring Layer
When CSDM relationships are integrated with event management and observability platforms, three specific capabilities emerge that infrastructure-only monitoring can't provide.
Correlated service events instead of isolated alerts. That Thursday afternoon scenario — four separate alerts, none connected — looks different with service architecture in place. The database timeout, the message queue backup, and the memory warnings all trace to configuration items that support the same benefits portal application service. Event management groups them into a single service event, classified at the appropriate priority, routed to the right team, with the affected service clearly identified. Four alerts become one incident.
The portal degradation is visible before the first citizen calls.
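The correlation step itself is conceptually simple once the service model exists. The sketch below assumes a precomputed CI-to-application-service lookup (which a CSDM-populated CMDB would supply); the CI names and alert messages are invented for illustration.

```python
from collections import defaultdict

# Assumed lookup derived from the service model: which application
# service does each CI support? Names are illustrative only.
CI_TO_APP_SERVICE = {
    "db-node-07":    "benefits-portal-app-svc",
    "mq-broker-02":  "benefits-portal-app-svc",
    "app-server-11": "benefits-portal-app-svc",
    "app-server-12": "benefits-portal-app-svc",
    "report-batch-01": "nightly-reporting-app-svc",
}

# The four alerts from the Thursday-afternoon scenario.
alerts = [
    {"ci": "db-node-07",    "msg": "query timeout"},
    {"ci": "mq-broker-02",  "msg": "queue depth high"},
    {"ci": "app-server-11", "msg": "memory warning"},
    {"ci": "app-server-12", "msg": "memory warning"},
]

def correlate(alerts: list[dict]) -> dict[str, list[dict]]:
    """Group raw alerts into service events by affected application service."""
    grouped = defaultdict(list)
    for alert in alerts:
        service = CI_TO_APP_SERVICE.get(alert["ci"], "unmapped")
        grouped[service].append(alert)
    return dict(grouped)

events = correlate(alerts)
print(len(events))                               # 1 service event
print(len(events["benefits-portal-app-svc"]))    # covering 4 alerts
```

Without the lookup table, the same pass produces four "unmapped" facts; with it, four alerts collapse into one event with a named service attached.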
Service health indicators that aggregate meaningfully. An application service depends on multiple components — servers, databases, queues, downstream APIs. Service health monitoring aggregates the signals across all of those components into a single health status for the service itself. A component experiencing a minor fluctuation doesn't cause a false alarm. A pattern of degradation across several components tells a coherent story. Operations teams stop watching individual metrics and start watching the things that matter: whether services are healthy.
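One way to picture the roll-up: a worst-status policy with tolerance for a single minor fluctuation. The thresholds here are assumptions for illustration — real service health rules are configurable — but they capture the idea that one wobbling component is noise while a pattern is signal.

```python
# Illustrative health roll-up. Policy (an assumption for this sketch):
# any critical component makes the service critical; two or more
# degraded components make it degraded; one is tolerated as noise.
def service_health(component_statuses: list[str]) -> str:
    if any(s == "critical" for s in component_statuses):
        return "critical"
    degraded = sum(1 for s in component_statuses if s == "degraded")
    return "degraded" if degraded >= 2 else "healthy"

print(service_health(["healthy", "degraded", "healthy"]))   # healthy
print(service_health(["degraded", "healthy", "degraded"]))  # degraded
print(service_health(["healthy", "critical", "healthy"]))   # critical
```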
Impact analysis in business terms. When an incident is declared, the first question from leadership is never "which server is affected?" It's "what can't people do right now?" CSDM lets the operations team answer that question accurately and immediately — not by guessing, but by reading the service dependency map that was already built. The incident communication becomes "the document submission service is degraded, affecting an estimated 3,200 active sessions" rather than "we have a database performance issue under investigation."
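Answering "what can't people do right now?" is, mechanically, a blast-radius query over the dependency map: given a degraded shared component, which application services sit downstream? The structure and names below are hypothetical, but the query shape is the point.

```python
# Hypothetical dependency map: application service -> the technical
# services it depends on. All names are illustrative.
DEPENDS_ON = {
    "benefits-portal-app-svc":   {"eligibility-db-cluster", "shared-auth"},
    "doc-submission-app-svc":    {"shared-auth", "doc-store"},
    "nightly-reporting-app-svc": {"eligibility-db-cluster"},
}

def blast_radius(technical_service: str) -> set[str]:
    """Every application service that depends on the given technical service."""
    return {app for app, deps in DEPENDS_ON.items()
            if technical_service in deps}

print(sorted(blast_radius("shared-auth")))
# ['benefits-portal-app-svc', 'doc-submission-app-svc']
```

This is the query that turns "we have a database performance issue" into "the document submission service is degraded" — the map was built in advance, so the answer is a read, not an investigation.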
"Infrastructure monitoring tells you what's wrong with the machines. Service-centric monitoring tells you what's wrong for the people using them. Both matter. Only one gets you out of bed for the right reasons."
The Solution
Building the Model That Makes Monitoring Meaningful
Getting from infrastructure-centric monitoring to service-centric operational visibility is less a technology project than an architecture discipline. The tooling — ServiceNow event management, observability platform integrations, service health dashboards — is ready when you are. What has to come first is the service model those tools will use.
That means ensuring three things are true about your CMDB.
First, configuration items are associated with technical services. Every server, container, database, and network device that supports a service should be linked to the technical service it belongs to. This is the foundational relationship from which everything else flows. Without it, infrastructure signals remain orphaned — facts without addresses.
Second, technical services are associated with application services. The shared platforms and infrastructure layers that support application delivery — messaging queues, database clusters, authentication platforms — need to be modeled as technical services with explicit relationships to the application services that depend on them. This is the relationship that enables cross-component alert correlation and blast-radius assessment.
Third, application services are associated with business applications and capabilities. This is the connection that translates operational events into business language — the one that allows a monitoring alert to be expressed as a service disruption rather than a technical anomaly. It's also the relationship that tells leadership which capabilities are at risk, without anyone having to assemble that picture manually during an active incident.
The Governance Prerequisite
None of this works if the service model isn't maintained. Service relationships drift as environments change — new deployments add dependencies, decommissioned systems leave orphaned relationships, team reorgs shuffle ownership. Regular data certification, clear service ownership accountability, and automated CMDB health monitoring are what keep the model trustworthy over time. A service model that was accurate last year but that nobody has updated since is a liability, not an asset. The operations team will trust it — and it will mislead them.
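One piece of that governance can be automated cheaply: flagging services whose relationship data hasn't been certified recently. The 90-day window and the record names below are assumptions for the sketch, not a recommended policy.

```python
from datetime import date, timedelta

# Assumed certification window for this sketch: 90 days.
CERTIFICATION_WINDOW = timedelta(days=90)

# Illustrative records: when each service's relationships were last certified.
last_certified = {
    "benefits-portal-app-svc": date(2024, 1, 10),
    "doc-submission-app-svc":  date(2023, 3, 2),
}

def stale_services(records: dict[str, date], today: date) -> list[str]:
    """Services whose certification is older than the allowed window."""
    return [svc for svc, when in records.items()
            if today - when > CERTIFICATION_WINDOW]

print(stale_services(last_certified, date(2024, 2, 1)))
# ['doc-submission-app-svc']
```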
Summary
Back to That Thursday Afternoon
The agency in our opening scenario had excellent monitoring. What it didn't have was a service model that connected the monitoring to the services being monitored. Four alerts fired, none were connected, and the impact accumulated invisibly until citizens reported it.
With CSDM in place, that afternoon looks different. The four alerts correlate into a single service event linked to the benefits portal. The event is classified at Priority 1 before a single citizen calls. The right team is notified with the service context already in the ticket. The portal is restored in twenty minutes. The incident report to leadership describes which service was affected, for how long, and how many sessions were impacted — all pulled from the service model automatically.
Same monitoring tools. Same operations team. Completely different outcome — because someone, at some point, did the work of building the map that turned infrastructure signals into service intelligence.
The signals were always there. Build the model that tells you what they mean.
