Exploring Service Reliability Management
Summary of Exploring Service Reliability Management
Service Reliability Management (SRM) in ServiceNow provides a unified, self-serve, guided experience to help teams manage and optimize the health and reliability of digital services using site reliability engineering (SRE) practices. Built into the Service Operations Workspace, SRM integrates IT Operations Management (ITOM) and IT Service Management (ITSM) capabilities into a single workflow. It enables teams to respond quickly to issues via on-call escalations and simplifies setup with guided onboarding for distributed teams while maintaining data separation and minimal central IT governance.
Key Features
- Role-based Access and Responsibilities: SRM defines three main user roles:
  - SRM Administrators: Manage platform configurations, integrations, and user roles. They install SRM and maintain reliability metrics and error budgets.
  - SRM Managers: Oversee SRE teams, assign on-call schedules, monitor performance, and ensure resilience through team and service management.
  - SRM Responders: Perform day-to-day incident diagnosis and remediation within their teams, manage alerts, and maintain reliability metrics.
- Integrated Workflow: SRM supports continuous delivery of services and technologies, allowing teams to register services, define service level objectives (SLOs), and integrate monitoring tools to track service health through service level indicators (SLIs).
- Automated Incident Management: Alerts trigger incidents and on-call notifications, enabling prompt response to outages or service degradation.
- Collaborative Remediation and Improvement: Teams diagnose incidents, implement fixes, and identify system resilience improvements while management reviews SLO performance and prioritizes initiatives based on error budget consumption.
- Prebuilt Integrations: Easily connect with Application Performance Monitoring (APM) tools and other monitoring platforms to enhance service health visibility.
Benefits
- Team-based Experience: Facilitates collaboration among administrators, managers, and responders.
- Service Registration and Monitoring: Enables comprehensive tracking and management of services within SRM.
- Reliability Metrics and Error Budgets: Helps measure and maintain service health and performance.
- On-call Scheduling: Streamlines alert escalation and response coordination.
- Incident Remediation: Supports efficient resolution of high-severity alerts and incidents.
Practical Application for ServiceNow Customers
By implementing SRM, ServiceNow customers can gain visibility into service health more quickly and tie it to business objectives through SLOs and incident management. IT Operations and DevOps teams gain a single platform to manage service reliability, improve uptime, and maintain service quality with less operational friction. Customers can expect streamlined incident response workflows, role-specific access controls, and integrated monitoring to proactively manage and improve their digital services.
Service Reliability Management (SRM) provides a self-serve, guided experience for teams to manage service health. The experience is built using the Service Operations Workspace application and combines ITOM and ITSM capabilities into a single workflow.
SRM overview
- Use on-call escalations to respond to issues in a timely manner.
- Reduce setup friction with guided self-service to onboard distributed teams with separated data, empowered access, and minimal governance from central IT.
When SRM is installed, several plugins and applications are also activated. For more information, see Plugins or applications installed with ITOM AIOps.
SRM users
| Users | Description | Contains Roles |
|---|---|---|
| admin | ServiceNow administrators manage, configure, and maintain the ServiceNow platform. In SRM, they can access and work in the Service Operations Workspace Admin Center. Only administrators can do the following: | All |
| SRM administrator [srm_admin] (Note: This role differs from the ServiceNow admin role.) | SRM administrators can manage account settings, configurations, and users. Administrators can perform the following actions: | |
| SRM manager [srm_manager] | Managers oversee a team of SREs. Managers assign SREs to the team on-call schedule, monitor their performance, and create procedures to handle incidents and develop solutions. Managers promote resilience across all the systems and the DevOps workflows. Managers can perform the following actions within the context of their teams: | Responder |
| SRM responder [srm_responder] | A Service Reliability Engineer (SRE) that uses SRM to perform everyday tasks. Responders are the individuals who are on call and diagnose and remediate incidents. Responders can only access configurations that they’re a part of. They can only access the alerts or incidents for which they have permissions. SREs can perform the following actions, within the context of their teams: | Inherits 17 roles including the following: |
For more information, see SRM roles and responsibilities.
SRM workflow
- Product teams in IT or Lines of Business continuously deliver new service instances and technology management services. Example: New customer billing portal.
- Along with SLO Management, teams can register services and define service level objectives (SLOs), helping them reach business outcomes. Example: 95% monthly availability for the billing portal.
- Monitoring integrations are set up by the teams to collect the real-time health of these services. Example: Cloud Observability.
- Monitoring produces service level indicators (SLIs) and raises alerts when services are underperforming. Automation groups and enriches these alerts. Example: Billing portal latency is exceeding 7 s.
- When the alerts indicate an outage or customer-impacting degradation, incidents are created and on-call notifications notify appropriate team resources. Example: A Billing SRE team is notified via phone of a latency issue on the billing portal.
- After teams collaboratively diagnose and remediate incidents, they identify action items for improving the system's resilience. Example: The Billing team decides to add additional web server capacity.
- Management continually reviews SLO performance, helps to prevent changes when the error budget is exhausted, and prioritizes improvement initiatives for underperforming services.
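The SLO and error budget steps in this workflow come down to simple arithmetic. The following is a minimal Python sketch of that arithmetic, not SRM's implementation: the `Slo` class, its method names, and the 30-day window are assumptions introduced for illustration. Only the 95% monthly availability target comes from the billing portal example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """Availability objective over a fixed window (hypothetical type, not an SRM API)."""

    target: float        # fraction of the window the service must be available, e.g. 0.95
    window_minutes: int  # length of the evaluation window, e.g. 30 days = 43_200 minutes

    def __post_init__(self) -> None:
        # A target of 1.0 would leave no error budget and break the ratio below.
        if not 0.0 < self.target < 1.0:
            raise ValueError("target must be strictly between 0 and 1")
        if self.window_minutes <= 0:
            raise ValueError("window_minutes must be positive")

    def error_budget_minutes(self) -> float:
        """Total downtime the objective tolerates within the window."""
        return (1.0 - self.target) * self.window_minutes

    def budget_remaining(self, downtime_minutes: float) -> float:
        """Fraction of the error budget still unspent (negative once overspent)."""
        budget = self.error_budget_minutes()
        return (budget - downtime_minutes) / budget

    def changes_allowed(self, downtime_minutes: float) -> bool:
        """Assumed policy: block changes once the error budget is exhausted."""
        return self.budget_remaining(downtime_minutes) > 0.0


# Billing portal example: 95% availability over an assumed 30-day month.
billing_slo = Slo(target=0.95, window_minutes=30 * 24 * 60)

if __name__ == "__main__":
    downtime = 600.0  # minutes of recorded downtime so far (made-up figure)
    print(f"Error budget: {billing_slo.error_budget_minutes():.0f} minutes")
    print(f"Budget remaining: {billing_slo.budget_remaining(downtime):.0%}")
    print(f"Changes allowed: {billing_slo.changes_allowed(downtime)}")
```

Under these assumptions, a 95% target over 30 days tolerates about 2,160 minutes (36 hours) of downtime. Once recorded downtime exceeds that amount, `changes_allowed` returns `False`, which mirrors the review step where management helps prevent changes after the error budget is exhausted.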
SRM benefits
| Benefit | Feature | Users |
|---|---|---|
| Team-based experience | Working with SRM teams | SRM administrators, managers, and responders |
| Service registration | Working with SRM services | SRM administrators, managers, and responders |
| Prebuilt integrations | Working with integrations in SRM | SRM administrators, managers, and responders |
| Measure service health | Working with reliability metrics | SRM administrators, managers, and responders |
| On-call coverage | Create an SRM on-call schedule | SRM administrators, managers, and responders |
| Remediate high severity alerts and incidents | Working with SRM reliability tasks | SRM administrators, managers, and responders |