Visualizations in the Service reliability dashboard
The Service reliability dashboard in Service Reliability Management (SRM) provides visualizations that help you monitor and analyze the health and performance of your services against their service level objectives (SLOs). These visualizations enable you to quickly identify which services are stable, at risk, or critical based on their error budget consumption, offering insights into service reliability trends over time.
Key Visualizations and Their Uses
- Service State Charts: Show the number of services in critical (0% error budget remaining), at risk (≤ 25% error budget remaining), and stable (> 25% error budget remaining) states. These charts allow you to pinpoint services needing immediate attention, monitor those approaching critical thresholds, and assess overall service health. Trend lines display changes over the past 12 months, with comparative figures indicating weekly changes.
- Risk Trends Over Time: Line charts track SLOs with high burn rates (≥1) and low error budget remaining (≤ 25%) over the last 12 months. A high burn rate signals that a service is consuming its error budget faster than allowed, potentially leading to an SLO breach. These charts help identify emerging or recurring reliability risks early, enabling proactive management.
- SLOs Table: Lists all defined SLOs with key details such as SLO name, current reliability state, measured reliability percentage, target objective, burn rate, percentage of error budget remaining, associated service, and assigned team. This table supports monitoring reliability status, identifying at-risk services, and determining responsible teams. You can customize which columns are displayed and sort the table by SLO name.
Dashboard Features and Customization
The dashboard is built on Platform Analytics and includes standard features such as filtering, time range adjustment, and detailed chart options. Changes made to this dashboard affect all SRM users in your instance. To create a personalized view, you can duplicate the existing dashboard or build a new one using the in-line editor. This flexibility allows you to tailor dashboards to your team’s specific monitoring and reporting needs.
List of visualizations and options on the Service reliability dashboard in Service Reliability Management (SRM).
Service state charts
Top-level charts show the number of services in critical, at-risk, and stable states. A service's state is based on the error budget remaining on its service level objectives (SLOs). You can select the charts to view service names, adjust the time range, and access additional chart options.
| Chart | What it is | How to use it |
|---|---|---|
| Critical | Displays the number of services in a critical state. Critical services have 0% error budget remaining on their SLOs. | View how many services have consumed their error budgets and identify the services needing immediate attention. |
| At risk | Displays the number of services at risk. At-risk services have <= 25% error budget remaining on their SLOs. | Monitor how many services are approaching critical thresholds and find issues early. |
| Stable | Displays the number of stable services. Stable services have more than 25% error budget remaining on their SLOs. | Get insights into overall service health and identify if services are staying reliable over time. |
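The thresholds in the table above can be expressed as a small classification function. This is a minimal Python sketch for illustration, not SRM's implementation: the function name `classify_service_state` is hypothetical, and it assumes that a budget at or below 0% counts as critical. It takes a single error-budget value and does not cover how a service with several SLOs is aggregated.

```python
def classify_service_state(error_budget_remaining_pct: float) -> str:
    """Map a percentage of error budget remaining to a dashboard state.

    Assumption: a budget at or below 0% is treated as fully consumed (critical).
    """
    if error_budget_remaining_pct <= 0:
        return "critical"
    if error_budget_remaining_pct <= 25:
        return "at risk"
    return "stable"


# Example values on and around the documented thresholds.
for pct in (0, 10, 25, 25.1, 80):
    print(pct, classify_service_state(pct))
```

Note that exactly 25% falls in the at-risk band, because the stable state requires more than 25% remaining.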
Risk trends over time
| Chart | What it is | How to use it |
|---|---|---|
| High burn rate (>=1) | Shows the number of SLOs with a burn rate >= 1 over time. A high burn rate indicates that the service linked to the SLO is likely to exhaust its error budget before the compliance period ends. For example, if the compliance period is 30 days but the service is consuming its error budget at a pace that would exhaust it in 15 days, the burn rate is 2. | |
| Low budget remaining (<=25%) | Shows the number of SLOs with low or no error budget remaining over time. | |
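The burn rate example (a 30-day compliance period with the error budget used up in 15 days) can be reproduced with simple arithmetic. The sketch below assumes burn rate is the compliance period divided by the time it would take to exhaust the budget at the current pace. The function name `burn_rate` and its parameters are hypothetical and are not part of SRM.

```python
def burn_rate(compliance_period_days: float, days_to_exhaust_budget: float) -> float:
    """Return how many times faster than allowed the error budget is being used.

    Assumption: burn rate = compliance period / projected time to exhaust the budget.
    A value of 1 means the budget runs out exactly at the end of the period.
    """
    if days_to_exhaust_budget <= 0:
        raise ValueError("days_to_exhaust_budget must be positive")
    return compliance_period_days / days_to_exhaust_budget


rate = burn_rate(compliance_period_days=30, days_to_exhaust_budget=15)
print(rate)  # 2.0, matching the example in the table

# An SLO appears in the "High burn rate" chart when its burn rate is >= 1.
print(rate >= 1)  # True
```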
Service level objectives (SLOs) table
The SLOs table lists the SLOs defined in Service Reliability Management (SRM), and it’s sorted by SLO name by default. Use the table to monitor overall reliability, identify services at risk, and find the assigned teams.
- Name - Name of the SLO. You can select the arrow to sort the table by SLO name, and you can select the name to view the SLO record.
- Reliability - Current state of the SLO. For example, stable, at risk, or critical.
- Measured reliability - Percentage showing the actual performance of the service. For example, if your SLO objective is 99.9% success and the actual performance for the month is 99.7%, the measured reliability is 99.7%.
- Objective (percentage) - Target SLO value.
- Burn rate - Numeric value showing how quickly the service is consuming its error budget.
- % Error budget remaining - Percentage of the error budget still available in the current compliance period.
- Service - Name of the service associated with the SLO. You can select the service name to view the service record.
- Assigned - Team responsible for the service.
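To see how the Measured reliability, Objective, and % Error budget remaining columns can relate to each other, the following Python sketch uses a common SRE convention: the error budget is the gap between 100% and the objective, and the remaining share is whatever part of that gap has not been used. This is an assumption for illustration only. The source does not state the formula SRM uses, and the function name `error_budget_remaining_pct` is hypothetical.

```python
def error_budget_remaining_pct(measured_pct: float, objective_pct: float) -> float:
    """Estimate the percentage of error budget still available.

    Assumptions (not confirmed for SRM):
      - error budget  = 100 - objective
      - budget used   = 100 - measured reliability
      - the result is floored at 0 once the budget is fully consumed
    """
    if not 0 <= objective_pct < 100:
        raise ValueError("objective_pct must be in the range [0, 100)")
    budget = 100.0 - objective_pct
    used = 100.0 - measured_pct
    remaining = (budget - used) / budget * 100.0
    return max(0.0, remaining)


# Objective 99.9%, measured 99.7%: more than the whole budget is used,
# so under these assumptions nothing remains.
print(error_budget_remaining_pct(measured_pct=99.7, objective_pct=99.9))

# Objective 99.0%, measured 99.5%: half of the 1% budget is used.
print(error_budget_remaining_pct(measured_pct=99.5, objective_pct=99.0))
```

Under these assumptions, the first case would appear as critical in the Reliability column, and the second would appear as stable.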