What is Site Reliability Engineering (SRE)?

Site reliability engineering (SRE) is the practice of handing operations work to software engineers, who apply engineering and automation to it.

Table of Contents

What is SRE?

History of site reliability engineering

What does a site reliability engineer do?

Where does SRE fit on your team?

How can SRE benefit your company?

Pros and cons of being a site reliability engineer

DevOps vs SRE

Technology to support SRE

The tools you need for SRE

Is ServiceNow right for SREs?

What is SRE?

Many IT teams are looking to adopt SRE methodologies. Site reliability engineering takes operations practices and turns them over to software engineers, who automate manual tasks, solve problems, and manage systems through software. An SRE team is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of its services, and it usually writes software to automate those processes.

SRE is a great asset for software reliability and scalability because systems can be managed through code. It strikes a balance between keeping existing products and features reliable and releasing new ones.

History of site reliability engineering

Credit for the term “SRE” goes to Google’s Ben Treynor Sloss

Ben Treynor Sloss of Google is the mastermind behind SRE, and aptly describes it as “what happens when a software engineer is tasked with what used to be called operations”. The concept arose after an examination of the conflicts between operations, who want to ensure that features don’t break anything or inconvenience end users, and dev teams, who have developed and want to release new features as soon as they are ready for a rollout. SRE is a reconciliation between the two.

A team of Google engineers literally wrote the book on SRE

Google published a book on SRE that is available for free online. It offers a deep dive into the role of SRE and recommended best practices for execution. Parts II and III, which cover principles and practices respectively, are of particular note.

SRE Principles: The core principles of SRE, according to Google, are:

Embracing risk: Use error budgets to take a neutral, data-driven approach to service management.
Service level objectives: Disentangle indicators from objectives and agreements, and examine how SRE uses each of these terms.
Eliminating toil: Move away from mundane, repetitive tasks that add no lasting value.
Monitoring distributed systems: Maintain visibility into what is happening across the organization’s systems, because a service you cannot observe cannot be kept reliable.
Release engineering: Manage releases carefully so that they are consistent and do not contribute to outages.
Simplicity: A system that is too complex is less reliable, and complexity is difficult to remove once it has been added.
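
To make the first two principles concrete, here is a minimal sketch of how an indicator, an objective, and an error budget relate to one another. The function names, the request counts, and the 30-day window are assumptions chosen for illustration; they do not come from Google's book or from any particular monitoring tool.

```python
from datetime import timedelta


def availability_sli(good_events: int, total_events: int) -> float:
    """Service level indicator: the measured fraction of events that were good."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return good_events / total_events


def meets_slo(sli: float, slo_target: float) -> bool:
    """Service level objective: the target the indicator is compared against."""
    return sli >= slo_target


def allowed_downtime(slo_target: float, window: timedelta) -> timedelta:
    """Error budget expressed as time: the part of the window that may be unavailable."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    return window * (1.0 - slo_target)


if __name__ == "__main__":
    # Hypothetical numbers: 999,500 successful requests out of 1,000,000.
    sli = availability_sli(good_events=999_500, total_events=1_000_000)
    print(f"SLI: {sli:.4%}, meets 99.9% SLO: {meets_slo(sli, 0.999)}")
    print(f"Downtime budget over 30 days: {allowed_downtime(0.999, timedelta(days=30))}")
```

Under these assumptions, a 99.9% availability target over a 30-day window leaves roughly 43 minutes of downtime to spend. A service level agreement would then be the external commitment, with consequences, built on top of such an objective.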

SRE excellence requires experience

The role of a site reliability engineer is best performed by someone with software experience under their belt, and it is not a recommended entry-level position. Proper SRE execution requires fluency in software engineering and an understanding of systems of great scale and complexity.

SRE is a philosophy

A site reliability engineer needs the right mindset for this position. Technical skills are necessary, but a conceptual understanding of operations is key. It is important for SREs to be grounded in traditional software development processes, but there is also a great deal of importance in a holistic understanding of company processes and moving a reliable system forward.

SRE should be a catalyst for change

Reliability should be the job of everyone in the organization, not only the SRE team, so that the principles of SRE are applied throughout. Apply a reliability model to each team, and take the time to discuss how reliability fits into each team’s work and affects everyone.

What does a site reliability engineer do?

Site reliability engineer (SRE) roles and responsibilities

New launches are green-lighted based on current product performance: Applications are generally not up 100% of the time. The SRE team is meant to craft a service-level agreement that defines how reliable the system needs to be for end users. A common part of a service-level agreement is an error budget, the maximum allowable amount of outages and errors. While budget remains, new launches can go ahead; once it is used up, launches wait until reliability recovers.
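
One way to picture this green-light decision is as a simple check against the remaining error budget. The sketch below is illustrative only: the `ErrorBudget` class, the `can_launch` function, and the rule "launch while any budget remains" are assumptions, and real teams usually encode more nuanced policies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Hypothetical request-based error budget for one measurement window."""

    slo_target: float      # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int    # requests served in the current window
    failed_requests: int   # requests that failed in the same window

    @property
    def allowed_failures(self) -> float:
        """Number of failures the objective tolerates in this window."""
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        """Share of the budget still unspent; negative once it is overspent."""
        if self.allowed_failures == 0:
            return 0.0
        return 1.0 - self.failed_requests / self.allowed_failures


def can_launch(budget: ErrorBudget) -> bool:
    """Assumed policy: green-light launches only while some budget remains."""
    return budget.remaining_fraction > 0.0


if __name__ == "__main__":
    healthy = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
    overspent = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=1_500)
    for budget in (healthy, overspent):
        print(f"remaining: {budget.remaining_fraction:.0%}, launch: {can_launch(budget)}")
```

The appeal of this design is that the decision rests on a number both teams agreed to in advance, so neither developers nor operations has to argue for or against a launch case by case.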

SREs can code

Development teams and SREs draw from the same staffing pool, so each additional SRE means one fewer developer, and vice versa. This makes the system self-regulating and avoids battles between developers and SREs over staffing. SREs can also code and develop software, which helps them work well alongside the development team.

SREs are also allowed to move between projects, which keeps team members motivated and lets them pursue personal goals and objectives.

Common roles and responsibilities for a site reliability engineer

Building software to help operations and teams
Fixing escalation issues
Optimizing on-call processes
Documenting team knowledge
Conducting post-incident reviews

Where does SRE fit on your team?

SREs sit at the intersection of IT operations, software engineering, and support. From there they provide a strong foundation and relationship among the teams, which improves feedback loops, collaboration, and reliability.

How can SRE benefit your company?

Site reliability experts can make SRE work for you

SREs look out for big-picture needs and guide different teams toward a single goal.

Automation is fundamental to SRE

A large part of the SRE role is weeding out inefficiencies and identifying work that is easy to automate. Time-consuming manual tasks can be eliminated, which increases efficiency.

SRE isn’t just for tech companies

SRE practices are not limited to the tech industry. A site reliability engineering culture can extend into ecommerce, customer service, and manufacturing.

Pros and cons of being a site reliability engineer

DevOps vs SRE

DevOps is an approach to building and delivering good software that combines software development and operations, with the intent of bringing the two roles together. SRE tends to be driven more from the development side than from the operations side of DevOps.

Technology to support SRE

Linux containers provide a foundation for cloud-native development: they unify the environment used for development, integration, automation, and delivery. Kubernetes can automate the deployment and management of those containers.

The tools you need for SRE

There isn’t a single, uniform toolset for SRE. It is crucial, however, to build out SRE functions within a company alongside automation for scalability and repeatability.

Is ServiceNow right for SREs?

ServiceNow provides increased value by bridging work across multiple teams, registering their microservices, correlating observability data, putting reliability metrics at your fingertips, automating changes, and predicting failures, all while keeping your existing tools intact.
