Create an SLO, an SLI, and Error budget policies for SRM
Create an SLO, an SLI, and Error budget policies to help you and your team track your services and take necessary actions when required.
Before you begin
Role required: Responder, Manager, or Administrator
About this task
Procedure
-
Navigate to Workspaces > Service Operations Workspace.
You are taken to your SRM homepage.
Note: If you have other SOW applications, then depending on your assigned roles, you may land on the SOW homepage instead of the SRM homepage, with SRM alerts and incidents included in your metrics. In that case, to view SRM-specific areas, select SRM modules from the left navigation pane.
-
Select Services from the left navigation pane.
- On the Overview page, open the service for which you want to create an SLI.
-
Select the Reliability metrics tab.
-
Select Add SLO & SLI.
An SLO is a target value or range of values for a service level that is measured by associated SLI records.
-
In the Set up reliability metric > Set up your service level objectives (SLO) panel, fill in the appropriate fields.
You are defining a time period for compliance and a goal for reliable service.
Table 1. Service Level Objective form
Field: Description
Set up your service level objectives (SLO)
Name: Name of the SLO.
SLI type: Type of SLI on which the metrics are calculated. The available SLI types are as follows:
- Availability: Percentage of time your service is available. (Default)
- Errors: Measurement of how frequently service errors occur.
- Latency: Time taken to service a request. The actual amount of time that elapsed.
- Saturation: Measurement of how much of your system's capacity is in use, emphasizing the resources that are most constrained.
SLO type: Type of SLO, based on the metric that you choose. The available SLO types are as follows:
- Duration: The amount of time the service spends without breaching. It is the only value available.
- Count: The number of occurrences in a given compliance period.
Duration settings
Objective (percentage): Depends on the Duration and SLI types. Percentage or count of the desired SLI performance.
- Availability: How much of the specified time the service should be available. (Percentage)
- Errors: How frequently service errors can occur before breaching. (Count)
- Latency: How long a service request can take. (Count in )
- Saturation: How constrained your system can be before breaching. (Percentage)
Compliance period: Period for which the metrics are calculated. The available options are:
- Month: The duration is considered to be the current month. For example, if the current date is 26th January, the duration is considered to run from 1st January to 31st January.
- Rolling 7 days: The duration is considered to be 7 days from the current date.
- Rolling 30 days: The duration is considered to be 30 days from the current date. For example, if the current date is 26th January, the duration will be considered from 25th December.
- Rolling 90 days: The duration is considered to be 90 days from the current date. For example, if the current date is 26th January, the duration will be considered from 25th October.
Error budget: Auto-populated. Displays, in days and time, how much error budget there is.
Note: Until an error budget has been created, these metrics remain at zero.
Error budget is calculated based on the provided Compliance period and Objective (percentage) when creating an SLO.
For example, for an SLI type of Availability, if you select Month and set Objective (percentage) to 95%, the error budget starts at 5% of the month, which is about 1.5 days for a 30-day month. In other words, service downtime should not exceed that amount in the month.
Count settings
Compliance period: Period for which the metrics are calculated. The available options are:
- Month: The duration is considered to be the current month. For example, if the current date is 26th January, the duration is considered to run from 1st January to 31st January.
- Rolling 7 days: The duration is considered to be 7 days from the current date.
- Rolling 30 days: The duration is considered to be 30 days from the current date. For example, if the current date is 26th January, the duration will be considered from 25th December.
- Rolling 90 days: The duration is considered to be 90 days from the current date. For example, if the current date is 26th January, the duration will be considered from 25th October.
Count by periods settings
Limit (occurrences): The number of occurrences after which a breach occurs. The occurrence limit acts as an error budget.
Compliance period: Period for which the metrics are calculated. The available options are:
- Month: The duration is considered to be the current month. For example, if the current date is 26th January, the duration is considered to run from 1st January to 31st January.
- Rolling 7 days: The duration is considered to be 7 days from the current date.
- Rolling 30 days: The duration is considered to be 30 days from the current date. For example, if the current date is 26th January, the duration will be considered from 25th December.
- Rolling 90 days: The duration is considered to be 90 days from the current date. For example, if the current date is 26th January, the duration will be considered from 25th October.
Count by occurrences settings
Limit (occurrences): The number of occurrences after which a breach occurs. The occurrence limit acts as an error budget.
Compliance period: Period for which the metrics are calculated. The available options are:
- Month: The duration is considered to be the current month. For example, if the current date is 26th January, the duration is considered to run from 1st January to 31st January.
- Rolling 7 days: The duration is considered to be 7 days from the current date.
- Rolling 30 days: The duration is considered to be 30 days from the current date. For example, if the current date is 26th January, the duration will be considered from 25th December.
- Rolling 90 days: The duration is considered to be 90 days from the current date. For example, if the current date is 26th January, the duration will be considered from 25th October.
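The error budget arithmetic described in the table can be sketched as follows. This is a minimal illustration in Python, not SRM's implementation: the function names and period labels are hypothetical, "Month" is assumed to mean the full calendar month that contains the current date, and the product may round or display the value differently.

```python
import calendar
from datetime import date, timedelta

# Hypothetical labels mirroring the rolling Compliance period options.
ROLLING_DAYS = {"rolling_7_days": 7, "rolling_30_days": 30, "rolling_90_days": 90}


def period_length(period: str, today: date) -> timedelta:
    """Return the length of the compliance period."""
    if period == "month":
        # Assumption: "Month" spans the whole calendar month containing today.
        days_in_month = calendar.monthrange(today.year, today.month)[1]
        return timedelta(days=days_in_month)
    return timedelta(days=ROLLING_DAYS[period])


def duration_error_budget(period: str, objective_pct: float, today: date) -> timedelta:
    """Duration SLO: the share of the period not covered by the objective."""
    if not 0 <= objective_pct <= 100:
        raise ValueError("objective_pct must be between 0 and 100")
    return period_length(period, today) * ((100 - objective_pct) / 100)


def count_budget_remaining(limit: int, occurrences: int) -> int:
    """Count SLO: the limit acts as the budget; a negative result means a breach."""
    return limit - occurrences


today = date(2024, 1, 26)
print(duration_error_budget("month", 95, today))            # 5% of a 31-day month
print(duration_error_budget("rolling_30_days", 95, today))  # 5% of 30 days
print(count_budget_remaining(limit=10, occurrences=4))
```

With a 95% objective, a 30-day period yields a budget of 1 day 12 hours, and a 31-day month yields slightly more, which is why the length of the compliance period matters as much as the objective itself.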
-
Select Save.
For the Duration type, the screen refreshes to show the updated error budget, and you are automatically taken to the next step.
-
From Set up your service level indicator (SLI), select Add SLI.
An SLI is a quantitative measure of some aspect of the level of service that is provided. These metrics are used to define SLO targets. At least one SLI is required.
- Enter a name for the SLI.
-
Set conditions to select the relevant SLI from the available alert sources.
When you set the conditions, any alerts matching those conditions are shown automatically in the Results from conditions set used to select the SLI list. Use this list to verify that you are setting up the SLI you want.
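To illustrate how conditions narrow the list of matching alerts, the sketch below filters a few made-up alerts with simple field-equals-value conditions. The alert fields (`source`, `metric_name`, `service`) and the helper functions are hypothetical; they do not reflect the SRM data model or its condition builder, which may support richer operators.

```python
from typing import Dict, List

Alert = Dict[str, str]


def matches(alert: Alert, conditions: Dict[str, str]) -> bool:
    """True when the alert satisfies every field == value condition."""
    return all(alert.get(field) == value for field, value in conditions.items())


def preview_results(alerts: List[Alert], conditions: Dict[str, str]) -> List[Alert]:
    """Return the alerts that would be listed for the given conditions."""
    return [alert for alert in alerts if matches(alert, conditions)]


# Made-up sample alerts; the field names are illustrative only.
alerts = [
    {"source": "monitor-a", "metric_name": "availability", "service": "checkout"},
    {"source": "monitor-a", "metric_name": "latency", "service": "checkout"},
    {"source": "monitor-b", "metric_name": "availability", "service": "search"},
]

selected = preview_results(alerts, {"metric_name": "availability", "service": "checkout"})
print(selected)
```

Checking the preview before saving serves the same purpose as the results list in the panel: it confirms that the conditions select only the alerts you intend to measure.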
-
Select Save.
The new SLI is listed in the Setup your Availability service level indicators (SLI) panel.
Note: The header in your Service level indicator pane updates depending on whether you chose Duration or one of the Count settings. The following shows Duration for your SLO.
Figure 1. SLI example
-
Add another SLI or select Next.
Next takes you to Set error budget policies (optional).
An error budget is the amount of unreliability that your SLO allows over a specified time. SRM aims to minimize error budget consumption to maximize reliability.
Note: If you choose not to add an error budget policy now, Objective (percentage), which would otherwise trigger actions, remains informational only. Creating a policy gives you some remediation options. You can add an error budget policy later by opening the SLO and selecting the Error budget policies tab.
Note: The header in your Error budget policies pane updates depending on whether you chose Duration or one of the Count settings for your SLO.
- From Set error budget policies (optional), select Add a threshold.
-
In the pop-up window, select one of the following Threshold types:
- Burn rate: Percentage rate at which an error budget is consumed.
- Error budget remaining: Percentage of error budget left to spend for the specified period.
Note: Burn rate and Error budget remaining values support up to 4 decimal places; additional digits are rounded up.
You can add multiple thresholds for a policy, but you can't add duplicate thresholds. If you try to add a duplicate threshold, you see an error message.
-
Enter a value for Threshold.
For Burn rate: The ideal burn rate is 1 or below, because it indicates that the error budget is being consumed at an expected pace. The threshold is breached when the burn rate is equal to or greater than the specified value.
For Error budget remaining: The threshold is breached if the percentage remaining is equal to or less than what is specified.
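The two breach rules above can be expressed as simple comparisons. The sketch below is illustrative only: it assumes the common SRE definition of burn rate (share of budget consumed divided by share of the compliance period elapsed), which this procedure does not spell out, and all function names are hypothetical.

```python
def burn_rate(budget_consumed_fraction: float, period_elapsed_fraction: float) -> float:
    """Assumed definition: budget consumed relative to time elapsed (1.0 = on pace)."""
    if period_elapsed_fraction <= 0:
        raise ValueError("period_elapsed_fraction must be greater than 0")
    return budget_consumed_fraction / period_elapsed_fraction


def burn_rate_breached(rate: float, threshold: float) -> bool:
    """Breached when the burn rate is equal to or greater than the threshold."""
    return rate >= threshold


def budget_remaining_breached(remaining_pct: float, threshold_pct: float) -> bool:
    """Breached when the remaining percentage is equal to or less than the threshold."""
    return remaining_pct <= threshold_pct


# Half of the budget spent a quarter of the way through the period.
rate = burn_rate(budget_consumed_fraction=0.5, period_elapsed_fraction=0.25)
print(rate)                                   # 2.0
print(burn_rate_breached(rate, 2.0))          # True
print(budget_remaining_breached(50.0, 25.0))  # False
```

Note that the two comparisons point in opposite directions: a burn rate threshold is a ceiling, while an error budget remaining threshold is a floor.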
-
Select the action that you want to take when a breach occurs:
You can pick one or both.
- Create incident
- Send an email
-
Select Save.
You are returned to the Set error budget policies (optional) panel, where you can edit or delete your policies.
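As a rough mental model of what a saved policy holds, the sketch below represents a policy as a list of thresholds, each with a type, a value, and one or both actions. Everything here is hypothetical and is not SRM's data model: the class names, the choice of `ROUND_HALF_UP` as the rounding mode, the rule that a duplicate means the same type and the same rounded value, and the requirement of at least one action are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from decimal import Decimal, ROUND_HALF_UP
from enum import Enum
from typing import FrozenSet, List


class ThresholdType(Enum):
    BURN_RATE = "burn_rate"
    ERROR_BUDGET_REMAINING = "error_budget_remaining"


class Action(Enum):
    CREATE_INCIDENT = "create_incident"
    SEND_EMAIL = "send_email"


def normalize(value: str) -> Decimal:
    """Keep at most 4 decimal places (ROUND_HALF_UP is an assumed rounding mode)."""
    return Decimal(value).quantize(Decimal("0.0001"), rounding=ROUND_HALF_UP)


@dataclass(frozen=True)
class Threshold:
    type: ThresholdType
    value: Decimal
    actions: FrozenSet[Action]


@dataclass
class ErrorBudgetPolicy:
    thresholds: List[Threshold] = field(default_factory=list)

    def add_threshold(self, type_: ThresholdType, value: str,
                      actions: FrozenSet[Action]) -> Threshold:
        if not actions:
            raise ValueError("select at least one action")
        rounded = normalize(value)
        # Assumption: a duplicate has the same type and the same rounded value.
        for existing in self.thresholds:
            if existing.type == type_ and existing.value == rounded:
                raise ValueError("duplicate threshold")
        threshold = Threshold(type_, rounded, actions)
        self.thresholds.append(threshold)
        return threshold


policy = ErrorBudgetPolicy()
policy.add_threshold(ThresholdType.BURN_RATE, "1.50004",
                     frozenset({Action.CREATE_INCIDENT}))
policy.add_threshold(ThresholdType.ERROR_BUDGET_REMAINING, "25",
                     frozenset({Action.CREATE_INCIDENT, Action.SEND_EMAIL}))
print(len(policy.thresholds))
```

Rounding before the duplicate check means that two values differing only beyond the fourth decimal place are treated as the same threshold in this sketch.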
- Select Next.
-
Review your SLO.
-
Select Activate.
See View an SRM SLO for more detailed information on the final SLO page.