Navigating Service Reliability: Insights into SLOs, SLIs, and Error Budgets

Ankit K · 06-24-2024

Maintaining service reliability is crucial for the success of any organization, especially when launching new services. By leveraging Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets, teams can effectively monitor and manage the performance and reliability of their services. This article delves into the setup and benefits of these essential reliability concepts.

Service Class SLO in SRM

Within Service Reliability Management (SRM), the core elements revolve around teams and services, forming the foundation of activities within the SRM workspace. SRM supports various types of service classes within the CMDB:

Application Services
Technical Services
Mapped Application Service [cmdb_ci_service_discovered]
Calculated Application Service [cmdb_ci_service_calculated]

Additionally, SRM supports:

Application Service [cmdb_ci_service_auto]: No CI Associations are created.
Dynamic CI Group [cmdb_ci_query_based_service]: CI Associations are automatically created based on the specified CMDB Group.
Tag-Based Application Service [cmdb_ci_service_by_tags]: CI Associations are automatically created based on Tags.

All these service types are supported within SRM-SLO.

Setting Up SLOs

When setting up Service Level Objectives (SLOs), your SRE team aims to monitor the reliability of newly introduced services. SLOs are measurable targets that define the expected level of reliability for a service. For example, the desired uptime for the User Authentication service might be set at 99.99%.

The compliance period is the timeframe over which you measure and assess whether your service is meeting its defined objectives and staying within its error budget. This period helps in evaluating the service’s performance consistently. The supported compliance periods are monthly, rolling 7 days, rolling 30 days, or rolling 90 days.
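To make the link between the target and the compliance period concrete, the sketch below converts an uptime target into the downtime it allows. It is an illustration only, not SRM's implementation: the function name `error_budget` is hypothetical, and the calculation assumes a duration-style SLO measured over a fixed-length window.

```python
from datetime import timedelta


def error_budget(target_percent: float, period: timedelta) -> timedelta:
    """Return the downtime allowed by an uptime target over one compliance period."""
    if not 0 < target_percent <= 100:
        raise ValueError("target_percent must be in (0, 100]")
    # The budget is the share of the period that is allowed to be "bad".
    return period * (1 - target_percent / 100)


# Example from the text: 99.99% uptime over a rolling 30-day period.
budget = error_budget(99.99, timedelta(days=30))
print(f"{budget.total_seconds() / 60:.2f} minutes of allowed downtime")
```

For a 99.99% target over 30 days, the budget is 0.01% of 43,200 minutes, or about 4.32 minutes.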

Types of SLOs

The type of SLO determines how you calculate your objectives. SLO management supports three types:

Duration
Count by duration
Count by occurrences

Duration SLOs: For duration SLOs, overlapping alerts are merged rather than summed. If multiple alerts for the same Service Level Indicator (SLI) are open at the same time, they are merged into a single event that starts when the first alert arrives and ends when the last alert closes. This avoids adding up the durations of all overlapping alerts, so the error budget is not reduced multiple times for the same downtime period.

Example: If three alerts overlap, instead of counting the downtime for each alert individually, we consider the total duration from the start of the first alert to the end of the last alert. This method ensures that the error budget accurately reflects the actual downtime.
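The merging behavior described above can be sketched as a standard interval merge. This is a minimal illustration, not the product's code: the function names and the `(start, end)` tuple representation are assumptions, and the sketch treats an alert that starts exactly when another ends as part of the same downtime period.

```python
from datetime import datetime, timedelta


def merge_overlapping(alerts):
    """Merge overlapping (start, end) alert windows into downtime periods."""
    merged = []
    for start, end in sorted(alerts):
        if merged and start <= merged[-1][1]:
            # Overlaps the current period: extend it to the later end time.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def total_downtime(alerts):
    """Sum the length of the merged periods, so shared time is counted once."""
    periods = merge_overlapping(alerts)
    return sum((end - start for start, end in periods), timedelta())


t0 = datetime(2024, 6, 24, 10, 0)
alerts = [
    (t0, t0 + timedelta(minutes=20)),
    (t0 + timedelta(minutes=10), t0 + timedelta(minutes=30)),
    (t0 + timedelta(minutes=25), t0 + timedelta(minutes=40)),
]
print(total_downtime(alerts))
```

Here the three alerts last 20, 20, and 15 minutes, but they form one continuous 40-minute outage, so only 40 minutes is deducted from the error budget rather than 55.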

Count-Based SLOs: Count-based SLOs come in two types to cater to different reliability tracking needs:

Count by Downtime Periods (listed above as "Count by duration"): This method treats overlapping alerts the same way Duration SLOs do. Instead of counting each individual alert, it groups overlapping alerts into single downtime periods and counts those periods. This is useful when the customer needs to limit the number of downtime occurrences within a given period.

Example: A customer might state, "Service A can only experience 'HTTPResponse > 0.3s' downtime 1000 times this month." Here, overlapping alerts during the same downtime period are considered as one occurrence.

Count by Occurrences: This method counts each alert individually, regardless of any overlaps. This approach is crucial for scenarios where the impact on individual user requests needs to be tracked precisely.

Example: A customer might specify, "Service A should affect no more than 1000 user requests with 'HTTPResponse > 0.3s' this month." In this case, every single alert is counted to ensure a precise measure of user impact.
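The difference between the two count-based types can be shown on the same set of alerts. The sketch below is illustrative only: the function names are hypothetical, alerts are represented as `(start, end)` pairs in minutes, and the limit of 1000 is taken from the examples above.

```python
def count_by_occurrences(alerts):
    """Every alert consumes one unit of budget, overlapping or not."""
    return len(alerts)


def count_by_downtime_periods(alerts):
    """Overlapping alerts are grouped, and each group consumes one unit."""
    periods = 0
    current_end = None
    for start, end in sorted(alerts):
        if current_end is None or start > current_end:
            # No overlap with the running period: a new downtime period begins.
            periods += 1
            current_end = end
        else:
            current_end = max(current_end, end)
    return periods


# Three overlapping alerts followed by one isolated alert (minutes from start).
alerts = [(0, 20), (10, 30), (25, 40), (90, 95)]
limit = 1000

print("remaining (occurrences):", limit - count_by_occurrences(alerts))
print("remaining (downtime periods):", limit - count_by_downtime_periods(alerts))
```

The same four alerts consume four units under count by occurrences but only two under count by downtime periods, which is why the choice of type should follow what the limit is meant to protect: individual requests or distinct outages.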

By implementing these SLO types, we offer flexible and accurate ways to monitor and maintain service reliability, ensuring that the application meets its performance and reliability targets.

Understanding SLIs

Service Level Indicators (SLIs) are specific, quantifiable metrics that measure the performance of the service. These metrics are crucial for evaluating whether a service meets its defined SLOs and for identifying areas needing improvement. For an uptime SLO, the natural choice is an availability SLI, which measures the service's uptime directly. End users can also apply filters to control which alerts are assigned to an SLO calculation. For instance, with the 'service.availability.99.percent' SLI, only alerts tagged with 'app_version .1.0' are designated for the SLO's error budget (EB) calculation.
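Conceptually, such a filter selects the subset of alerts that count against the SLO. The sketch below shows the idea in plain Python; it does not use any SRM API. The `Alert` class, the `alerts_for_slo` function, and the split of the tag into an `app_version` key with the value `1.0` are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    sli: str
    tags: dict = field(default_factory=dict)


def alerts_for_slo(alerts, sli, required_tags):
    """Keep only alerts for the given SLI that carry every required tag."""
    return [
        alert
        for alert in alerts
        if alert.sli == sli
        and all(alert.tags.get(key) == value for key, value in required_tags.items())
    ]


incoming = [
    Alert("service.availability.99.percent", {"app_version": "1.0"}),
    Alert("service.availability.99.percent", {"app_version": "2.0"}),
    Alert("service.latency", {"app_version": "1.0"}),
]

# Only the first alert matches both the SLI and the tag filter.
selected = alerts_for_slo(
    incoming, "service.availability.99.percent", {"app_version": "1.0"}
)
print(len(selected))
```

Alerts that do not match the filter are simply ignored for this SLO, so they never reduce its error budget.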

**Note** SLO management includes lifecycle management, with states such as draft, running, and retired, to ensure systematic tracking and governance throughout the service's operational journey.

Error Budgets Policy Action:

One crucial aspect of SLO management is staying informed about Error Budget breaches. Notifications alert stakeholders to any deviations in service status, helping maintain service health and reliability. Customers have the option to set up Error Budget policy actions to monitor their error budget status, such as receiving notifications via email or creating incidents for threshold breaches. In the upcoming release, they will also be able to alert on-call teams or send messages via Slack or Microsoft Teams.

**Note** Even if you select neither method, or prefer not to use incidents as a system of record, every trigger of an Error Budget policy action is also logged in the Error Budget policy logs for auditing purposes. These logs are kept for two years.
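As a rough mental model, a policy action pairs a threshold with something to do when the threshold is breached, and every trigger is written to a log. The sketch below is a hypothetical illustration of that model, not SRM's configuration or API: `PolicyAction`, `triggered_actions`, the threshold values, and the log format are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class PolicyAction:
    threshold_percent: float  # fire when remaining budget is at or below this
    action: str               # for example "email" or "incident"


def triggered_actions(remaining_percent, policies, policy_log):
    """Return the actions whose thresholds are breached and log each trigger."""
    fired = [p for p in policies if remaining_percent <= p.threshold_percent]
    for policy in fired:
        # Every trigger is recorded, whether or not a notification is sent.
        policy_log.append(
            f"remaining={remaining_percent}% "
            f"threshold={policy.threshold_percent}% action={policy.action}"
        )
    return [p.action for p in fired]


policies = [PolicyAction(50, "email"), PolicyAction(10, "incident")]
policy_log = []

print(triggered_actions(40, policies, policy_log))  # only the 50% threshold is breached
print(triggered_actions(5, policies, policy_log))   # both thresholds are breached
```

Keeping the log separate from the notification channel mirrors the note above: the audit trail exists regardless of which actions are configured.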

Understanding Burn Rates

There are two types of thresholds we monitor: burn rate and remaining error budget (EB). The burn rate indicates how quickly a service consumes its allotted error budget relative to the length of the compliance period. An ideal burn rate is 1, meaning the error budget is consumed at a pace that uses it up exactly at the end of the compliance period. A burn rate of 2 means the budget is being consumed twice as fast, so it would be depleted halfway through the period.
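One common way to compute this is to divide the fraction of the error budget already consumed by the fraction of the compliance period that has elapsed. The sketch below uses that definition; the article does not state SRM's exact formula, so treat the formula and the function names `burn_rate` and `days_until_exhausted` as assumptions.

```python
def burn_rate(budget_consumed_fraction: float, period_elapsed_fraction: float) -> float:
    """Ratio of budget consumption to elapsed time; 1.0 means exactly on pace."""
    if period_elapsed_fraction <= 0:
        raise ValueError("period_elapsed_fraction must be positive")
    return budget_consumed_fraction / period_elapsed_fraction


def days_until_exhausted(rate: float, period_days: float) -> float:
    """Days from the start of the period until the budget runs out at a constant rate."""
    if rate <= 0:
        return float("inf")  # nothing is being consumed
    return period_days / rate


# Half of the budget is gone after a quarter of a 30-day compliance period.
rate = burn_rate(0.5, 0.25)
print(rate, days_until_exhausted(rate, 30))
```

In this example the burn rate is 2, so at a constant pace the budget would run out on day 15 of the 30-day period.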

Setting clear goals with SLOs, SLIs, and error budgets is vital for keeping services reliable in fast-paced environments. By using specific metrics and managing resources well, organizations can lower risks, boost performance, and deliver high-quality services consistently. This method helps maintain top-notch operations and encourages ongoing improvements and innovations in service.
