Working with Reliability metrics
Summary of Working with Reliability metrics
ServiceNow SRM (Service Reliability Management) uses reliability metrics to help you define Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budget policies. These components enable you to monitor the health of your services, identify breaches, and trigger automated actions such as incident creation or notifications to maintain service reliability.
SRM aggregates signals from integrated sources and updates reliability indicators when qualified alerts occur. Error budgets are managed by category and help quantify allowable service errors before actions are taken.
Key Features
- SLI Signal Aggregation: Consolidates performance data to quantify service reliability.
- Service Level Objectives (SLOs): Defines duration-based and count-based targets that reflect your service agreements.
- Error Budgets: Calculates allowable error margins based on SLOs and compliance periods.
- Error Budget Policies: Automates remediation actions like incident creation or email alerts when error budgets are breached.
- Error Budget Visualization: Provides insights and dashboards for monitoring SLO compliance and error budget consumption.
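To make "quantifying service reliability" concrete, here is a minimal sketch of how an availability SLI could be derived from outage intervals over a window. It is not SRM's implementation: the function name `availability_percent` and the outage-interval input are assumptions made for illustration only.

```python
from datetime import datetime


def availability_percent(window_start, window_end, outages):
    """Return the percentage of the window not covered by outages.

    `outages` is a list of (start, end) datetime pairs. Overlapping
    outages are merged so shared downtime is counted only once.
    Hypothetical helper, not part of any ServiceNow API.
    """
    total = (window_end - window_start).total_seconds()
    if total <= 0:
        raise ValueError("window_end must be after window_start")

    # Keep only outages that intersect the window, clipped to its edges.
    clipped = sorted(
        (max(start, window_start), min(end, window_end))
        for start, end in outages
        if end > window_start and start < window_end
    )

    downtime = 0.0
    current_start = current_end = None
    for start, end in clipped:
        if current_end is None or start > current_end:
            # Close the previous merged outage before starting a new one.
            if current_end is not None:
                downtime += (current_end - current_start).total_seconds()
            current_start, current_end = start, end
        else:
            # Overlapping outage: extend the merged interval.
            current_end = max(current_end, end)
    if current_end is not None:
        downtime += (current_end - current_start).total_seconds()

    return 100.0 * (total - downtime) / total
```

Comparing the returned percentage with an SLO's Objective (%) indicates whether the objective is currently being met.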
Using the Reliability Metrics Interface
Within the SRM application, the Services > Overview tab displays critical reliability and error budget data once SLIs, SLOs, and error budgets are configured and active. The Services > Reliability metrics tab lists all SLOs for a service, showing:
- SLO Name and Objective (%): The target performance your service must meet.
- SLI Type: Metrics including Availability (uptime), Errors (frequency), Latency (response time), and Saturation (system resource usage).
- Compliance Period: Timeframes like monthly or rolling windows (7, 30, 90 days) used to evaluate SLO adherence.
- State: Indicates if an SLO is in Draft, Running, or Retired status. Editing a running SLO retires the old record and creates a new one for accurate tracking.
- Limit Occurrences and Remaining Breaches: Tracks how many breaches have occurred and how many are left before reaching the error budget limit.
- Error Budget and Remaining Budget: Shows the total error budget you can spend and how much of it remains.
Note that historical SLO and SLI metric data is archived after one year and deleted after five years to optimize system performance.
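The compliance periods listed above can be sketched as date ranges. This is an illustration under stated assumptions, not SRM's actual logic: the function name `compliance_window` and the period keys are hypothetical, and the rolling windows are assumed to count back whole calendar days from the current date.

```python
import calendar
from datetime import date, timedelta

# Hypothetical period keys mapped to rolling window lengths in days.
ROLLING_DAYS = {"rolling_7": 7, "rolling_30": 30, "rolling_90": 90}


def compliance_window(period, today):
    """Return (start, end) dates for a compliance period.

    "month" covers the calendar month containing `today`; the rolling
    periods count back N calendar days from `today` (an assumption).
    """
    if period == "month":
        last_day = calendar.monthrange(today.year, today.month)[1]
        return today.replace(day=1), today.replace(day=last_day)
    if period not in ROLLING_DAYS:
        raise ValueError(f"unknown compliance period: {period!r}")
    return today - timedelta(days=ROLLING_DAYS[period]), today
```

For a monthly period evaluated on 26th January, this returns 1st January through 31st January. How SRM treats time zones and whether it counts the current day inside a rolling window are not specified here.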
Practical Application for ServiceNow Customers
By creating and managing SLIs, SLOs, and error budget policies, you can proactively monitor service reliability and automate responses to service degradations. This supports maintaining SLAs and improving customer satisfaction by addressing issues before they impact users significantly.
Use the reliability metrics interface to regularly review service health, update reliability indicators as your services evolve, and ensure error budget policies align with your operational needs.
Use the SRM reliability metrics to define service level indicators (SLIs), service level objectives (SLOs), and error budget policies so that you can track your service health and take action when needed.
High-level workflow
- SRM leverages integrations for signal aggregation.
- Reliability indicators containing SLIs and SLOs are created for the service in SRM.
- When a qualified alert is generated for a service, the cumulative breach and the error budget values are updated for the reliability indicators in SRM.
- An error budget policy is created for the service to trigger actions such as creating an incident or sending an email to remediate service issues. Error budgets are constrained by Category.
- SLI signal aggregation
- Create duration-based and count-based service level objectives
- Calculate error budgets (EB)
- Error budget policies
- Error budget visualization
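The workflow above (a qualified alert updates the cumulative breach and error budget, and a policy then triggers an action) can be sketched as follows. Every name here is hypothetical and none of it is an SRM API. The percentage threshold on consumed budget is also an assumption about how a policy might decide when to act.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReliabilityIndicator:
    """Hypothetical stand-in for a duration-based reliability indicator."""
    name: str
    error_budget_minutes: float
    cumulative_breach_minutes: float = 0.0

    def record_alert(self, breach_minutes: float) -> None:
        # A qualified alert adds its breach duration to the running total.
        self.cumulative_breach_minutes += breach_minutes

    @property
    def remaining_budget_minutes(self) -> float:
        return self.error_budget_minutes - self.cumulative_breach_minutes


@dataclass
class ErrorBudgetPolicy:
    """Hypothetical policy: act once enough of the budget is consumed."""
    consumed_threshold_percent: float  # assumed trigger condition
    action: str  # for example "create_incident" or "send_email"

    def evaluate(self, indicator: ReliabilityIndicator) -> Optional[str]:
        if indicator.error_budget_minutes <= 0:
            raise ValueError("error budget must be positive")
        consumed = (
            100.0
            * indicator.cumulative_breach_minutes
            / indicator.error_budget_minutes
        )
        # Return the action name when the threshold is reached, else None.
        return self.action if consumed >= self.consumed_threshold_percent else None
```

In this sketch the policy only names the action. Creating the incident or sending the email would be handled by whatever system consumes that result.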
Reliability metrics tab
Navigate to the Services > Reliability metrics tab to view the service level objectives (SLOs) for a service.
Service Level Objectives show the following details:
- Service level objective: Name of the SLO. The SLO is a target value or the objective that your team must hit to meet your service level agreement (SLA).
- SLI type: The kind of measurement used to describe your service's actual performance. The SLI types are:
  - Availability: Percentage of time your service is available, also called uptime. Availability is the basic metric for reliability and is the default type.
  - Errors: Frequency of your service errors.
  - Latency: Time it takes to service a request, measured as actual elapsed time.
  - Saturation: The “fullness” of your system, emphasizing the resources that are most constrained.
- Compliance period: The time window over which SLO compliance is evaluated.
  - Month: The current calendar month. For example, if the current date is 26th January, the period runs from 1st January until 31st January.
  - Rolling 7 days: The 7 days leading up to the current date.
  - Rolling 30 days: The 30 days leading up to the current date. For example, if the current date is 26th January, counting back 30 days gives a period that starts on 27th December.
  - Rolling 90 days: The 90 days leading up to the current date. For example, if the current date is 26th January, counting back 90 days gives a period that starts on 28th October.
- State: State of the SLO. Choices are:
  - Draft: The SLO isn't running in your instance yet. You can add new SLIs or update existing SLIs and you can delete the SLO.
  - Running: The SLO is active in your instance. You can edit, retire, or delete the SLO.
    Note: Editing an SLO in the running state retires it and a new copy is created.
  - Retired: The SLO is no longer running in your instance. You can reactivate it.
- Objective (%): Target SLI performance, expressed as a percentage.
- Limit occurrences: Number of limit breaches that have occurred. (Used by Count SLO types.)
- Service level indicator: Measured values that describe your service's actual performance and indicate whether you're meeting your customers' expectations.
- Error budget: How much error budget you can spend. When you create an SLO, the error budget is calculated from the Compliance period and Objective (%) that you provide.
- Remaining error budget: How much error budget is left.
- Remaining breach occurrences: Number of breaches left before the limit is reached.
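As a worked illustration of the error budget fields above, the sketch below uses the common SRE convention that a duration-based error budget is the share of the compliance period not covered by the objective. Whether SRM computes it exactly this way is an assumption, and the function names are hypothetical.

```python
def duration_error_budget_minutes(objective_percent, period_days):
    """Allowed downtime in minutes for a duration-based SLO.

    Assumes budget = period length * (100 - objective) / 100.
    """
    if not 0 < objective_percent <= 100:
        raise ValueError("objective_percent must be in (0, 100]")
    if period_days <= 0:
        raise ValueError("period_days must be positive")
    period_minutes = period_days * 24 * 60
    return period_minutes * (100 - objective_percent) / 100


def remaining_error_budget_minutes(budget_minutes, consumed_minutes):
    """Budget left to spend; a negative value means it was overspent."""
    return budget_minutes - consumed_minutes


def remaining_breach_occurrences(limit, occurred):
    """Breaches left before a count-based limit is reached (never negative)."""
    return max(limit - occurred, 0)
```

Under these assumptions, a 99.9% objective over a rolling 30-day period allows roughly 43.2 minutes of downtime. A count-based SLO with a limit of 5 breaches and 2 recorded occurrences has 3 remaining.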