What is site reliability engineering (SRE)?

Site reliability engineering (SRE) is the practice of handing operations work to software engineers, who apply software engineering and automation to it.

Many IT teams are looking to adopt SRE methodologies. SRE takes operations practices and turns them over to software engineers, who automate manual tasks, solve problems, and manage systems. An SRE team is responsible for the change management, emergency response, monitoring, availability, performance, latency, efficiency, and capacity planning of its services, and usually writes software to automate these processes.

SRE is a great asset for software reliability and scalability because systems can be managed through code. It strikes a balance between keeping existing products and features reliable and releasing new ones.

Credit for the term “SRE” goes to Google’s Ben Treynor Sloss

Ben Treynor Sloss of Google is the mastermind behind SRE, and aptly describes it as “what happens when a software engineer is tasked with what used to be called operations”. The concept arose after an examination of the conflicts between operations, who want to ensure that features don’t break anything or inconvenience end users, and dev teams, who have developed and want to release new features as soon as they are ready for a rollout. SRE is a reconciliation between the two.

A team of Google engineers literally wrote the book on SRE

Google published a book on SRE that is available for free online. It offers a deep dive into the role of SRE and recommended best practices for execution. Parts II and III, principles and practices (respectively), are of particular note.

SRE Principles: The core principles of SRE, according to Google, are:

  • Embracing risk: Use error budgets to take a neutral approach to service management.
  • Service level objectives: Disentangle indicators from objectives and agreements, and examine how SRE uses each of these terms.
  • Eliminating toil: Step away from mundane and repetitive tasks that are devoid of value.
  • Monitoring distributed systems: For the sake of reliability, never be blind to what is going on in the organisation's systems.
  • Release engineering: Carefully account for releases to ensure that they are consistent and do not contribute to outages.
  • Simplicity: A system that is too complex can lower reliability and is difficult to simplify later.
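As a rough illustration of how an indicator differs from an objective, the sketch below computes an availability indicator from request counts and compares it with a target. The ServiceWindow structure, the function names, and the 99.9% target are assumptions made for this example; they do not come from Google's book or from any particular tool.

```python
from dataclasses import dataclass


@dataclass
class ServiceWindow:
    """Request counts for one measurement window (hypothetical structure)."""
    total_requests: int
    failed_requests: int


def availability_sli(window: ServiceWindow) -> float:
    """Indicator: the fraction of requests that succeeded in the window."""
    if window.total_requests == 0:
        # Assumption for this sketch: no traffic counts as fully available.
        return 1.0
    good_requests = window.total_requests - window.failed_requests
    return good_requests / window.total_requests


def meets_slo(window: ServiceWindow, slo_target: float) -> bool:
    """Objective: the indicator must be at or above the agreed target."""
    return availability_sli(window) >= slo_target


# Example values are invented for illustration.
window = ServiceWindow(total_requests=1_000_000, failed_requests=700)
print(f"Availability indicator: {availability_sli(window):.4%}")
print("Objective of 99.9% met:", meets_slo(window, slo_target=0.999))
```

The indicator is a measurement, the objective is the target that measurement is judged against, and an agreement would add consequences for missing the target.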

SRE Practices: SREs run related systems for external or internal users, and are responsible for those services. Successful operation of the services includes capacity planning, addressing root causes of outages, and developing monitoring systems. The levels of Google’s hierarchy of a reliable service, from top to bottom, are:

  • Product: The top of the reliability hierarchy, which indicates that a product is workable and reliable.
  • Development: Software engineering and system design work within the company.
  • Capacity Planning: Load balancing ensures that the capacity that was built is being used properly.
  • Testing + Release Procedures: Once there is an understanding of what went wrong, actively prevent it from recurring by carefully testing products before they are released.
  • Postmortem/Root Cause Analysis: Build a blameless culture and fix the underlying issue in order to avoid a repeat incident.
  • Incident Response: Being on-call, staying in touch with systems, effective troubleshooting, and careful planning before the fact.
  • Monitoring: Being aware of problems before the end user notices.
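The monitoring level at the base of the hierarchy can be sketched as a sliding-window error-rate check that raises an alert before failures become widespread. This is a minimal sketch, not a description of any real monitoring product: the ErrorRateMonitor class, the window size, and the threshold are all hypothetical.

```python
from collections import deque


class ErrorRateMonitor:
    """Tracks the most recent request outcomes and flags a high error rate."""

    def __init__(self, window_size: int, threshold: float) -> None:
        # A deque with maxlen drops the oldest outcome automatically.
        self.outcomes = deque(maxlen=window_size)
        self.threshold = threshold

    def record(self, success: bool) -> None:
        """Store the outcome of one request."""
        self.outcomes.append(success)

    def error_rate(self) -> float:
        """Fraction of failed requests among the outcomes in the window."""
        if not self.outcomes:
            return 0.0
        failures = sum(1 for ok in self.outcomes if not ok)
        return failures / len(self.outcomes)

    def should_alert(self) -> bool:
        """Alert when the error rate is strictly above the threshold."""
        return self.error_rate() > self.threshold


# Invented traffic: seven successes followed by three failures.
monitor = ErrorRateMonitor(window_size=10, threshold=0.2)
for success in [True] * 7 + [False] * 3:
    monitor.record(success)
print("Error rate:", monitor.error_rate())
print("Alert:", monitor.should_alert())
```

In practice the outcomes would come from real request logs or metrics, and the threshold would be chosen from the service's own reliability targets.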

SRE excellence requires experience

The role of a site reliability engineer is best performed by someone with software experience under their belt, and it is certainly not a recommended entry-level position. Proper SRE execution requires fluency in software engineering and an understanding of systems of great scale and complexity.

SRE is a philosophy

A site reliability engineer needs the right mindset for this position. Technical skills are necessary, but a conceptual understanding of operations is key. It is important for SREs to be grounded in traditional software development processes, but a holistic understanding of company processes, and of how to move a reliable system forward, is just as important.

SRE should be a catalyst for change

Reliability should be the job of everyone in the organisation, with every team applying the important principles of SRE. Apply a reliability model to each team, and take the time to discuss how reliability fits into that team's work and affects everyone.

Site reliability engineer (SRE) roles and responsibilities

New launches are green-lighted based on current product performance: Applications are generally not up 100% of the time. The SRE team crafts a service-level agreement that defines how reliable the system needs to be for end users. A common part of a service-level agreement is an error budget: the maximum allowable threshold for outages and errors. New launches go ahead only while the service remains within that budget.
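The error budget arithmetic can be sketched in a few lines. The example below assumes a 99.9% availability target measured over a 30-day window; both values, and the function names, are illustrative assumptions rather than figures from any specific agreement.

```python
def error_budget_minutes(availability_target: float, window_days: int = 30) -> float:
    """Return the downtime allowance, in minutes, for the given window."""
    if not 0.0 < availability_target <= 1.0:
        raise ValueError("availability_target must be in (0, 1]")
    total_minutes = window_days * 24 * 60
    return (1.0 - availability_target) * total_minutes


def launch_allowed(availability_target: float, downtime_minutes: float,
                   window_days: int = 30) -> bool:
    """Allow a launch only while some of the error budget remains."""
    budget = error_budget_minutes(availability_target, window_days)
    return budget - downtime_minutes > 0


# Assumed target of 99.9% over 30 days (43,200 minutes in total).
budget = error_budget_minutes(0.999)
print(f"Error budget: {budget:.1f} minutes per 30 days")  # 43.2 minutes
print("Launch allowed after 20 min of downtime:", launch_allowed(0.999, 20))
print("Launch allowed after 50 min of downtime:", launch_allowed(0.999, 50))
```

With these assumed numbers the service may be unavailable for about 43 minutes per month. A team that has already used 50 minutes would hold new launches until the window resets or reliability improves.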

SREs can code

Development teams and SREs share a staffing pool, meaning that an additional SRE means one fewer developer, and vice versa. The system is self-regulating, which avoids battles between developers and SREs over staffing needs. SREs are capable of coding and development as well, which helps them work well alongside the development team.

SREs are allowed to move between projects. Giving team members the freedom to pursue personal goals and objectives creates a strong sense of motivation and dedication.

Common roles and responsibilities for a site reliability engineer

  • Building software to help operations and teams
  • Fixing escalation issues
  • Optimising on-call processes
  • Documenting team knowledge
  • Conducting post-incident reviews

SREs sit at the intersection of IT operations, software engineering, and support. From there they can provide a strong foundation and relationship among the teams, which helps with feedback loops, collaboration, and reliability.

Site reliability experts can make SRE work for you

SREs are on the lookout for big picture needs to guide different teams towards a singular goal.

Automation is fundamental to SRE

A great deal of the SRE role is rooted in weeding out inefficiencies and identifying tasks that are easy to automate away. Time-consuming manual tasks can be eliminated, which increases efficiency.

SRE isn’t just for tech companies

SRE practices need not apply only to the tech industry. A site reliability engineering culture can be extended into ecommerce, customer service, and manufacturing.

DevOps is a method for building and delivering good software that combines software development and operations, with the intention of fusing the operations and development roles. SRE tends to be driven more from the development side of DevOps than from the operations side.

Deliver modern operations for DevOps and SRE teams

Linux containers can provide the technology needed for cloud-native development. Containers give development, integration, automation, and delivery a unified environment, and Kubernetes can automate the operation of those containers.

There isn’t a single, uniform toolset for SRE. But it is crucial to build out SRE functions within a company in conjunction with automation for scalability and repeatability.

ServiceNow provides increased value by bridging work across multiple teams, registering their microservices, correlating observable data, providing reliability metrics at your fingertips, automating changes, and predicting failures—all while keeping your existing tools intact.

Capabilities that scale with your business

Create your next SRE transformation plan with ServiceNow.
