Create an SLO, an SLI, and Error budget policies for SRM

  • Release version: Washington DC
  • Updated February 1, 2024
  • 7 minutes to read
  • Create an SLO, an SLI, and Error budget policies to help you and your team track your services and take necessary actions when required.

    Before you begin

    Role required: Responder, Manager, or Administrator

    About this task

    Note:
    You can set up only one SLO for an SLI.

    Procedure

    1. Navigate to Workspaces > Service Operations Workspace.
      You are taken to your SRM homepage.
      Note:
      If you have other SOW applications, then depending on your assigned roles, that homepage may not be the SRM homepage. It is the SOW homepage instead, with SRM alerts and incidents included in your metrics. In that case, to view SRM-specific areas, select SRM modules from the left navigation pane.
    2. Select Services from the left navigation pane.
    3. On the Overview page, open the service for which you want to create an SLI.
    4. Select the Reliability metrics tab.

      Box highlights the Reliability metrics tab for an application service.

    5. Select Add SLO & SLI.

      An SLO is a target value or range of values for a service level that is measured by associated SLI records.

      The Set up reliability metric tab shows the forms to fill out, including forms for the SLO and SLI.

    6. In the Set up reliability metric > Set up your service level objectives (SLO) panel, fill in the appropriate fields.
      You are defining a time period for compliance and a goal for reliable service.
      Table 1. Service Level Objective form
      Field Description
      Set up your service level objectives (SLO)
      Name Name of the SLO.
      SLI type
      Type of the SLI based on which the metrics are calculated. The available types of SLI are as follows:
      • Availability: Percentage of time your service is available. (Default)
      • Errors: Measurement of how frequently service errors occur.
      • Latency: Time taken to service a request, measured as the actual time elapsed.
      • Saturation: Measurement of how full your system is, emphasizing the resources that are most constrained.
      SLO type Type of SLO based on which metric you choose. The available types of SLO are as follows:
      • Duration: The amount of time the service spends without breaching. It is the only value available.
      • Count: The number of occurrences in a given compliance period.
      Duration settings
      Objective percentage Percentage or count of the desired SLI performance, depending on the SLI type:
      • Availability: How much of the specified time the service should be available. (Percentage)
      • Errors: How frequently service errors can occur before breaching. (Count)
      • Latency: How long a service request can take. (Count in )
      • Saturation: How constrained your system can be before breaching. (Percentage)
      Compliance period Period for which the metrics are calculated. The available options are:
      • Month: The duration is considered to be the current month. For example, if the current date is 26th January, the duration will be considered from 1st January till 31st January.
      • Rolling 7 days: The duration is considered to be 7 days from the current date.
      • Rolling 30 days: The duration is considered to be 30 days from the current date. For example, if the current date is 26th January, the duration will be considered from 25th December.
      • Rolling 90 days: The duration is considered to be 90 days from the current date. For example, if the current date is 26th January, the duration will be considered from 25th October.
      Error budget Auto-populated.
      Note:
      Until an error budget has been created, these metrics remain at zero.

      Displays the available error budget in days, hours, and minutes.

      Error budget is calculated based on the provided Compliance period and Objective (percentage) when creating an SLO.

      For example, for an SLI type of Availability, if you select Month and set Objective (percentage) to 95%, the error budget starts at about 1.5 days, which is 5% of the month. In other words, service downtime should not exceed about 1.5 days a month.

      Count settings
      Compliance period Period for which the metrics are calculated. The available options are:
      • Month: The duration is considered to be the current month. For example, if the current date is 26th January, the duration will be considered from 1st January till 31st January.
      • Rolling 7 days: The duration is considered to be 7 days from the current date.
      • Rolling 30 days: The duration is considered to be 30 days from the current date. For example, if the current date is 26th January, the duration will be considered from 25th December.
      • Rolling 90 days: The duration is considered to be 90 days from the current date. For example, if the current date is 26th January, the duration will be considered from 25th October.
      Count by periods settings
      Limit (occurrences) The number of occurrences after which a breach occurs.

      The occurrence limit acts as the error budget.

      Compliance period Period for which the metrics are calculated. The available options are:
      • Month: The duration is considered to be the current month. For example, if the current date is 26th January, the duration will be considered from 1st January till 31st January.
      • Rolling 7 days: The duration is considered to be 7 days from the current date.
      • Rolling 30 days: The duration is considered to be 30 days from the current date. For example, if the current date is 26th January, the duration will be considered from 25th December.
      • Rolling 90 days: The duration is considered to be 90 days from the current date. For example, if the current date is 26th January, the duration will be considered from 25th October.
      Count by occurrences settings
      Limit (occurrences) The number of occurrences after which a breach occurs.

      The occurrence limit acts as the error budget.

      Compliance period Period for which the metrics are calculated. The available options are:
      • Month: The duration is considered to be the current month. For example, if the current date is 26th January, the duration will be considered from 1st January till 31st January.
      • Rolling 7 days: The duration is considered to be 7 days from the current date.
      • Rolling 30 days: The duration is considered to be 30 days from the current date. For example, if the current date is 26th January, the duration will be considered from 25th December.
      • Rolling 90 days: The duration is considered to be 90 days from the current date. For example, if the current date is 26th January, the duration will be considered from 25th October.
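
For the Count settings, the idea that the occurrence limit acts as an error budget can be sketched as follows. This is only an illustration: the function names are hypothetical and are not part of any SRM API, and it assumes that a breach happens once the number of occurrences exceeds the limit.

```python
def occurrences_remaining(limit: int, occurrences: int) -> int:
    """Occurrences still available in the compliance period (never negative)."""
    return max(limit - occurrences, 0)


def count_slo_breached(limit: int, occurrences: int) -> bool:
    """Assumption: the SLO breaches once occurrences go past the limit."""
    return occurrences > limit
```

With a limit of 10, four occurrences leave six in the budget, and the eleventh occurrence would be the one that breaches under this assumption.
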
    7. Select Save.

      Saved SLO form shows an updated error budget of 1 day, 13 hours, and 12 minutes.

      For the Duration type, the screen refreshes to show the updated error budget, and you are automatically taken to the next step.
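
The error budget shown on the saved form can be reproduced with simple arithmetic: it is the share of the compliance period that the objective leaves uncovered. The sketch below is a minimal illustration, not the product's implementation. The function names are hypothetical, and it assumes that Month means the number of days in the current calendar month.

```python
import calendar
from datetime import date, timedelta

# Assumed mapping from the rolling options on the form to a length in days.
ROLLING_DAYS = {"Rolling 7 days": 7, "Rolling 30 days": 30, "Rolling 90 days": 90}


def period_days(compliance_period: str, today: date) -> int:
    """Length of the compliance period in days."""
    if compliance_period == "Month":
        # monthrange returns (weekday of first day, number of days in month).
        return calendar.monthrange(today.year, today.month)[1]
    return ROLLING_DAYS[compliance_period]


def error_budget(objective_percent: float, compliance_period: str, today: date) -> timedelta:
    """Time the service may spend in breach while still meeting the objective."""
    allowed_fraction = (100.0 - objective_percent) / 100.0
    return timedelta(days=period_days(compliance_period, today)) * allowed_fraction
```

For a 95% objective over a 31-day month such as January, this gives 5% of 31 days, which is 1 day, 13 hours, and 12 minutes.
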

    8. From Set up your service level indicator (SLI), select Add SLI.

      An SLI is a quantitative measure of some aspect of the level of service that is provided. These metrics are used to define SLO targets. At least one SLI is required.

      1. Enter a name for the SLI.
      2. Set conditions to select the relevant SLI from the available alert sources.

        When you set the conditions, any alerts matching those conditions are shown automatically in the Results from conditions set used to select the SLI list. Use this list to verify that you are setting up the SLI you want.

        Sample SLI named Availability has several results from conditions set used to select the SLI.

      3. Select Save.
        The new SLI is listed in the Setup your Availability service level indicators (SLI) panel.
        Note:
        The header in your Service level indicator pane will update depending on whether you chose Duration or one of the Count settings. The following shows Duration for your SLO.
        Figure 1. SLI example
        Sample SLI for the HTTP response time SLO.
      4. Add another SLI or select Next.

        Next takes you to Set error budget policies (optional).

        An error budget is the amount of SLO that you can spend over a specified time. SRM aims to minimize error budget consumption to maximize reliability.
        Note:
        If you choose not to add an error budget policy now, Objective (percentage) remains informational only; it is the policy that triggers actions. Creating one gives you remediation options.

        You can add an error budget policy later by opening the SLO and selecting the Error budget policies tab.

        Note:
        The header in your Error budget policies pane will update depending on whether you chose Duration or one of the Count settings for your SLO.
    9. From Set error budget policies (optional), select Add a threshold.
    10. In the pop-up window, select one of the following Threshold types:
      • Burn rate: Percentage rate at which an error budget is consumed.
      • Error budget remaining: Percentage of error budget left to spend for the specified period.
      Note:
      Burn rate and Error budget remaining values support up to 4 decimal places; additional digits are rounded up.

      You can add multiple thresholds to a policy, but you can't add duplicates. If you try to add a duplicate threshold, an error message appears.

    11. Enter a value for Threshold.

      For Burn rate: The ideal burn rate is 1 or below because it indicates that the error budget is being consumed at an expected pace. The threshold is breached when the burn rate is equal to or greater than the specified value.

      For Error budget remaining: The threshold is breached if the percentage remaining is equal to or less than what is specified.
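
The two breach rules above can be expressed as a short sketch. The comparison operators follow the text. The burn rate formula itself is an assumption based on the common SRE definition (share of budget consumed divided by share of the compliance period elapsed), and all function names are hypothetical.

```python
def burn_rate(budget_consumed_fraction: float, period_elapsed_fraction: float) -> float:
    """Assumed definition: a value of 1.0 means the budget is being spent
    exactly at the pace that would use it up at the end of the period."""
    if period_elapsed_fraction <= 0:
        raise ValueError("period_elapsed_fraction must be greater than 0")
    return budget_consumed_fraction / period_elapsed_fraction


def burn_rate_breached(rate: float, threshold: float) -> bool:
    """Breached when the burn rate is equal to or greater than the threshold."""
    return rate >= threshold


def budget_remaining_breached(remaining_percent: float, threshold_percent: float) -> bool:
    """Breached when the remaining percentage is equal to or less than the threshold."""
    return remaining_percent <= threshold_percent
```

For instance, a service that has spent half of its budget a quarter of the way through the period has a burn rate of 2 under this definition, which would breach a Burn rate threshold of 2.
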

    12. Select the action you want to take when a breach occurs:
      You can pick one or both.
      • Create incident
      • Send an email
    13. Select Save.
      You are returned to the Set error budget policies (optional) panel where you can edit or delete your policies.

      Error budget policies panel shows multiple existing policies and an option to add more.
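
As a mental model, a policy built in steps 9 to 13 is a list of thresholds, each with a type, a value, and one or both actions. The sketch below illustrates that shape and the duplicate check. The class and field names are hypothetical, and it assumes that a duplicate means the same threshold type with the same value, which the source does not spell out.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

THRESHOLD_TYPES = {"burn_rate", "error_budget_remaining"}
ACTIONS = {"create_incident", "send_email"}


@dataclass(frozen=True)
class Threshold:
    kind: str                 # one of THRESHOLD_TYPES
    value: float              # burn rate, or percentage of budget remaining
    actions: Tuple[str, ...]  # one or both of ACTIONS


@dataclass
class ErrorBudgetPolicy:
    thresholds: List[Threshold] = field(default_factory=list)

    def add_threshold(self, threshold: Threshold) -> None:
        if threshold.kind not in THRESHOLD_TYPES:
            raise ValueError(f"unknown threshold type: {threshold.kind}")
        if not threshold.actions or not set(threshold.actions) <= ACTIONS:
            raise ValueError("pick one or both of: create_incident, send_email")
        # Assumption: same type and same value counts as a duplicate.
        for existing in self.thresholds:
            if existing.kind == threshold.kind and existing.value == threshold.value:
                raise ValueError("duplicate threshold")
        self.thresholds.append(threshold)
```

Adding a second Burn rate threshold with the same value raises an error here, mirroring the error message the form shows for duplicates.
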

    14. Select Next.
    15. Review your SLO.

      Review panel shows the configured SLO, SLIs, and error budget policies.

    16. Select Activate.
      See View an SRM SLO for more detailed information on the final SLO page.