Exploring Service Reliability Management

  • Release version: Australia
  • Updated March 12, 2026
  • 3 minutes to read

    Service Reliability Management (SRM) provides a self-serve, guided experience for teams to manage service health. The experience is built using the Service Operations Workspace application and combines ITOM and ITSM capabilities into a single workflow.

    SRM overview

    Optimize service health with site reliability engineering (SRE) practices. SRM is a single operations workspace that empowers teams to improve the reliability of digital services with SRE.
    • Use on-call escalations to respond to issues in a timely manner.
    • Reduce setup friction with guided self-service to onboard distributed teams with separated data, empowered access, and minimal governance from central IT.

    When SRM is installed, several plugins and applications are also activated. For more information, see Plugins or applications installed with ITOM AIOps.

    SRM users

    Table 1. Users

    User: admin
    Description: ServiceNow administrators manage, configure, and maintain the ServiceNow platform. In SRM, they can access and work in the Service Operations Workspace Admin Center. Only administrators can do the following:
    • Install SRM.
    • Add and manage SRM administrators.
    • Create and manage integration users.
    Contains roles: All

    User: SRM administrator [srm_admin]
    Note: This role differs from the ServiceNow admin role.
    Description: SRM administrators can manage account settings, configurations, and users. Administrators can perform the following actions:
    • Access, create, edit, or delete all SRM configurations.
    • Add or manage integrations.
    • Create integrations with Application Performance Monitoring (APM) tools.
    • Set up and maintain reliability metrics.
    • Set up and maintain error budget policies.
    Contains roles:
    • Manager
    • Responder

    User: SRM manager [srm_manager]
    Description: Managers oversee a team of SREs. Managers assign SREs to the team on-call schedule, monitor their performance, and create procedures to handle incidents and develop solutions. Managers promote resilience across all the systems and the DevOps workflows. Managers can perform the following actions within the context of their teams:
    • Define and set up teams, on-call schedules, and services.
    • Add and delete users such as responders and managers for the teams they're a part of.
    • Add or manage integrations.
    • Create integrations with Application Performance Monitoring (APM) tools.
    • Set up and maintain reliability metrics.
    • Set up and maintain error budget policies.
    Contains roles:
    • Responder

    User: SRM responder [srm_responder]
    Description: A Service Reliability Engineer (SRE) who uses SRM to perform everyday tasks. Responders are the individuals who are on call and diagnose and remediate incidents. Responders can only access configurations that they’re a part of. They can only access the alerts or incidents for which they have permissions. SREs can perform the following actions within the context of their teams:
    • Set up services, teams, and integrations.
    • Confirm their on-call schedules.
    • Manage incident and alert records.
    • Update teams that they’ve created.
    • Add other responders.
    • Create integrations with Application Performance Monitoring (APM) tools.
    • Set up and maintain reliability metrics.
    • Set up and maintain error budget actions.
    Contains roles: Inherits 17 roles including the following:
    • cmdb_read
    • sn_sow.sow_user
    • sn_sow_srm.srm_responder
    • workspace_user
    • slo_operator

    For more information, see SRM roles and responsibilities.
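
    The role containment in Table 1 (the SRM administrator role contains Manager and Responder, and the SRM manager role contains Responder) can be read as a small graph. The following is a minimal sketch of that idea only, not ServiceNow code: the CONTAINS mapping and the effective_roles function are hypothetical names introduced for illustration, and the responder's additional inherited platform roles are deliberately left out because the table lists only some of them.

```python
# Illustrative model of the "Contains roles" column in Table 1.
# CONTAINS and effective_roles are hypothetical names, not a ServiceNow API.
CONTAINS = {
    "srm_admin": {"srm_manager", "srm_responder"},
    "srm_manager": {"srm_responder"},
    "srm_responder": set(),  # inherited platform roles omitted (partial list in source)
}


def effective_roles(role: str) -> set[str]:
    """Return the role plus every role it contains, directly or indirectly."""
    seen: set[str] = set()
    stack = [role]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        # Unknown roles simply contribute no further contained roles.
        stack.extend(CONTAINS.get(current, ()))
    return seen


print(sorted(effective_roles("srm_admin")))
# ['srm_admin', 'srm_manager', 'srm_responder']
```

    In this sketch, a user granted srm_manager would also be treated as a responder, which matches the way the table describes each role as containing the ones below it.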

    SRM workflow

    Infographic showing how responders, managers, and administrators manage teams, register services, define SLOs, monitor integrations, respond to notifications, and remediate incidents. For details, refer to the following description.
    1. Product teams in IT or Lines of Business continuously deliver new service instances and technology management services. Example: New customer billing portal.
    2. With SLO Management, teams can register services and define service level objectives (SLOs) that help them reach business outcomes. Example: 95% monthly availability for the billing portal.
    3. Monitoring integrations are set up by the teams to collect the real-time health of these services. Example: Cloud Observability.
    4. Monitoring produces service level indicators (SLIs) and raises alerts when services are underperforming. Automation groups and enriches these alerts. Example: Billing portal latency is exceeding 7 s.
    5. When the alerts indicate an outage or customer-impacting degradation, incidents are created and on-call notifications alert the appropriate team members. Example: A Billing SRE team is notified via phone of a latency issue on the billing portal.
    6. After teams collaboratively diagnose and remediate incidents, they identify action items for improving the system's resilience. Example: The Billing team decides to add web server capacity.
    7. Management continually reviews SLO performance, helps to prevent changes when the error budget is exhausted, and prioritizes improvement initiatives for underperforming services.
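
    To make the error budget in steps 2 and 7 concrete, the arithmetic for the billing portal example can be sketched as follows. This is an illustration under stated assumptions, not a description of how SRM or SLO Management computes budgets: it assumes a time-based availability SLO measured over a 30-day month, and the AvailabilitySlo class and its method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvailabilitySlo:
    """Hypothetical time-based availability SLO (not a ServiceNow type)."""

    target_percent: float  # e.g. 95 for "95% monthly availability"
    window_minutes: int    # length of the SLO window in minutes

    @property
    def error_budget_minutes(self) -> float:
        # The budget is the share of the window the service may be unavailable.
        return self.window_minutes * (100 - self.target_percent) / 100

    def budget_remaining(self, downtime_minutes: float) -> float:
        return self.error_budget_minutes - downtime_minutes

    def changes_allowed(self, downtime_minutes: float) -> bool:
        # Assumed policy: block changes once the budget is fully spent.
        return self.budget_remaining(downtime_minutes) > 0


# Assumption: a 30-day month, i.e. 30 * 24 * 60 = 43,200 minutes.
slo = AvailabilitySlo(target_percent=95, window_minutes=30 * 24 * 60)

print(slo.error_budget_minutes)   # 2160.0 minutes, which is 36 hours
print(slo.changes_allowed(2000))  # True: 160 minutes of budget remain
print(slo.changes_allowed(2160))  # False: the budget is exhausted
```

    Under these assumptions, a 95% monthly target leaves 36 hours of allowable downtime, and the last call shows the point at which management would hold back further changes.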

    SRM benefits

    Benefit                                        Feature                                Users
    Team-based experience                          Working with SRM teams                 SRM administrators, managers, and responders
    Service registration                           Working with SRM services              SRM administrators, managers, and responders
    Prebuilt integrations                          Working with integrations in SRM       SRM administrators, managers, and responders
    Measure service health                         Reliability metrics in SLO Management  SRM administrators, managers, and responders
    On-call coverage                               Create an SRM on-call schedule         SRM administrators, managers, and responders
    Remediate high severity alerts and incidents   Working with SRM reliability tasks     SRM administrators, managers, and responders