SRE vs. DevOps

Lou Bichard, Mon, 12/02/2019 - 08:44

You’ve heard the terms “SRE” and “DevOps” thrown around, but the difference between them can be confusing.

But not to fear, as today we’re going to discuss the essence of both concepts. To help us along the way, we’ll discuss different practices, such as implementing observability, improving the software pipeline, improving the deploy process, and preparing for and responding to incidents, including how they relate to both SRE and DevOps.

By the end of the article, you’ll be able to clearly understand both DevOps and SRE, distinguish between the two, and make an informed decision about what’s right for your company. Sound good? Let’s get to it!

What Is DevOps?

DevOps is a software culture based on the belief that stability and throughput of engineering work are not mutually exclusive or opposing goals.

To expand on the definition: a DevOps culture holds that software can be delivered to customers quickly without compromising on stability. This belief runs counter to a commonly held and somewhat more intuitive one: that the faster we deploy software, the more likely we are to introduce instability.

In terms of practices, DevOps culture can manifest itself in different ways. That said, DevOps culture is often characterized by practices such as automating routine tasks; optimizing the build, test, and deploy pipeline; and implementing effective monitoring.

DevOps works through a cyclical process. The cycle runs from the moment a developer commits code until that same commit is in production and monitored. The delivery cycle is then optimized to provide business value to customers in the shortest time possible.

For DevOps, four metrics are central to the culture. The metrics are split between two goals: frequency of delivery and stability. The assumption is that a company focusing on all four will reach the promised land of fast delivery and stable service. So what are the metrics? Let’s take a look now:

  • Mean time to recovery (MTTR): how long, on average, an application takes to recover from an outage.
  • Change failure rate: how often application deployments fail (or have to be rolled back).
  • Deploy frequency: how often a team deploys (sometimes measured per engineer per day).
  • Lead time for changes: how long it takes a change to go from code committed to running in production.
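
To make the four metrics a little more concrete, here’s a minimal sketch of how they could be computed from a list of deployment records. The `Deployment` record, its field names, and the `four_metrics` function are hypothetical, invented for this illustration; real teams usually pull this data from their CI/CD and incident tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime                   # when the change was committed
    deployed_at: datetime                    # when it reached production
    failed: bool = False                     # did it cause an outage or rollback?
    restored_at: Optional[datetime] = None   # when service recovered, if it failed


def four_metrics(deploys: List[Deployment], period_days: int) -> dict:
    """Compute the four metrics over a reporting period of `period_days` days."""
    failures = [d for d in deploys if d.failed]
    recoveries = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    lead_times = [d.deployed_at - d.committed_at for d in deploys]
    return {
        "deploys_per_day": len(deploys) / period_days,
        "failure_rate": len(failures) / len(deploys) if deploys else 0.0,
        "mean_lead_time": sum(lead_times, timedelta()) / len(lead_times) if deploys else None,
        "mean_time_to_recovery": sum(recoveries, timedelta()) / len(recoveries) if recoveries else None,
    }
```

Two of the numbers (deploy frequency and lead time) describe throughput, and the other two (failure rate and time to recovery) describe stability, which is why they are tracked together.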

These metrics are the heart of a DevOps culture. Hopefully, you’re more clear on what DevOps is—because that means we can introduce what SRE is and start to analyze the differences.

What Is SRE?

SRE started as a role within Google. Although there’s no strict textbook definition of SRE, Google literally wrote the book on SRE, so we can reverse engineer what it means to be an SRE out of the book itself. That said, a commonly accepted definition is from Ben Treynor, inventor of the term “SRE”:

“[SRE is] what happens when you ask a software engineer to design an operations function.” — Ben Treynor

As you can see, SRE, unlike DevOps, is explicitly a role. In an SRE role, an individual (or several individuals) on a single team spends roughly 50% of their time on engineering and 50% on operations work. Having this joint role is how SRE bridges the gap between operations and development. Left unbridged, that gap between the two departments can hinder progress on the aforementioned metrics of DevOps.

Like DevOps culture, a key aspect of the role of the SRE is setting metrics. These metrics are service-level objectives (SLOs), service-level indicators (SLIs), and error budgets. We won’t cover in detail what these are, but they seek to measure how reliable an application is, often expressed as the amount of uptime it has within a given time frame.

An SRE will work with the business to define what level of uptime is acceptable for the application they’re working on. The SRE then works to make these metrics visible and uses them to make decisions, such as whether a team can deploy further changes. The answer is yes, as long as the team has enough error budget remaining.
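
As a rough illustration of that decision, here’s a small sketch of an error budget check. It assumes a request-based SLO (for example, 99.9% of requests succeed); the function names and the simple “any budget left means deploy” rule are assumptions made for this example, not a standard policy.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Return the fraction of the error budget still unspent.

    slo_target is the agreed success ratio, e.g. 0.999 for "three nines".
    The result is 1.0 when nothing has failed and goes negative once the
    budget is overspent.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures == 0:
        # No traffic yet, so nothing has been spent.
        return 1.0
    return 1.0 - failed_requests / allowed_failures


def can_deploy(slo_target: float, total_requests: int, failed_requests: int) -> bool:
    """Assumed policy for this sketch: deploy only while some budget remains."""
    return error_budget_remaining(slo_target, total_requests, failed_requests) > 0
```

With a 99.9% target and one million requests, the budget is about 1,000 failed requests. A team that has seen 250 failures still has roughly three quarters of its budget, while a team that has seen 1,500 has overspent and would pause further changes under this policy.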

As you can see, SRE takes principles of DevOps and adds some specific details around the idea. But I know what you’re thinking: These sound very familiar; what separates the two? So let’s discuss that now.

The Practices of SRE and DevOps

I appreciate that both of these definitions are abstract. To give you more context on what these terms mean, let’s take a look at some common practices that are attributed to either DevOps culture or to SRE. We can discuss what the practice is and why an organization might want to adopt the practice, but importantly, we’ll also relate the concept back to either DevOps, SRE, or both.

Sound good? Let’s jump in!

Improving an Application’s Observability

Observability can be defined as the ability to understand how a system is behaving from the outside by looking at the markers that the software emits. A big part of the role of SRE is understanding what’s going on with a running application in order to diagnose the root causes of issues and ensure uptime. In a practical sense, an SRE implements observability by instrumenting applications with logs and metric events to gather data about the application.

Having an application that can be understood from the outside helps to reduce risk related to software deployments and can reduce downtime in production through faster diagnosis of outages. Therefore, investments in monitoring and observability of an application can contribute to the four metrics of DevOps outlined at the start. But they also help an SRE to understand their error budget and current impacts on their SLOs.
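
Here’s a minimal sketch of what instrumenting an application might look like, using only the Python standard library. The `instrumented` decorator, the in-memory `metrics` store, and the `checkout` function are all hypothetical; a real system would typically ship these numbers to a dedicated metrics backend rather than keep them in a dictionary.

```python
import logging
import time
from collections import defaultdict
from functools import wraps

logger = logging.getLogger("app")
metrics = defaultdict(float)  # hypothetical in-memory metrics store


def instrumented(name):
    """Record call counts, error counts, and total time spent for a function."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            metrics[f"{name}.calls"] += 1
            try:
                return func(*args, **kwargs)
            except Exception:
                metrics[f"{name}.errors"] += 1
                logger.exception("%s failed", name)  # log with traceback
                raise
            finally:
                metrics[f"{name}.seconds_total"] += time.perf_counter() - start
        return wrapper
    return decorator


@instrumented("checkout")
def checkout(cart_total):
    """Hypothetical business function used to demonstrate the decorator."""
    if cart_total < 0:
        raise ValueError("cart total cannot be negative")
    return cart_total
```

The counts and timings collected this way are the kind of raw markers from which an SLI, such as the ratio of successful calls, could later be derived.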

Improving Incident Response

Having metrics and data doesn’t help much if your application is down for long periods of time because engineers weren’t aware of the issues early enough. Outages eat through an SRE’s error budget and don’t really help us prove the DevOps idea that frequency of deployment doesn’t impact site stability.

An SRE would therefore also help to implement good incident-response practices. Incident response usually starts with an alert that notifies engineers that a problem has occurred with the application that needs reviewing. Following an alert, an engineer will usually decide the next course of action. Typically, this includes bringing in other engineers for support and also communicating with the business and customers on the outage and the expected time to recovery.
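
To show where such an alert might come from, here’s a small sketch of an error-rate check over a sliding window of recent requests. The `ErrorRateAlert` class, the window size, and the threshold are assumptions chosen for illustration; production alerting is normally configured in a monitoring system rather than hand-written.

```python
from collections import deque


class ErrorRateAlert:
    """Fire when the error rate over the last `window` requests exceeds `threshold`."""

    def __init__(self, window: int, threshold: float):
        self.results = deque(maxlen=window)  # True = success, False = failure
        self.threshold = threshold

    def record(self, ok: bool) -> bool:
        """Record one request outcome; return True if an alert should fire."""
        self.results.append(ok)
        window_full = len(self.results) == self.results.maxlen
        error_rate = self.results.count(False) / len(self.results)
        # Wait for a full window so a single early failure doesn't page anyone.
        return window_full and error_rate > self.threshold
```

Waiting for a full window is one simple way to trade a little detection speed for fewer false alarms; the right balance depends on the application.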

It’s common for an incident to be followed by a “postmortem” in which engineers replay the events that occurred leading up to, and during, the incident. The postmortem lets engineers act on what they learned so the same mistakes aren’t repeated. The process of reflecting on outages is part of the DevOps cycle.

Improving the Deployment Process

As we said, DevOps culture believes that increased throughput does not affect the application’s reliability. This is true only if the engineers put in the effort to ensure the deployment process is effective.

One way a deployment process can be improved is through breaking down the application into small, independently deployed services. You can also improve deployment processes by changing the way in which deployments are rolled out, using techniques like canary deployments or feature flags. Both of these techniques restrict the number of users exposed to a new change, thereby reducing risk.
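
As a sketch of how a rollout can be restricted to a slice of users, here’s a simple percentage-based feature flag. The `in_rollout` function and its bucketing scheme are assumptions for this example: each user is hashed into one of 100 buckets, so the same user always gets the same answer for a given feature.

```python
import hashlib


def in_rollout(user_id: str, feature: str, percent: int) -> bool:
    """Return True if this user falls inside the first `percent` buckets (0-100)."""
    # Hash the feature name together with the user ID so that different
    # features expose different subsets of users.
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket, 0..99
    return bucket < percent
```

Because the bucket is stable, raising the percentage from 10 to 50 only adds users; nobody who already had the new change loses it, and setting the percentage back to 0 acts as a quick off switch.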

As with the response to outages, deployment process improvements are usually ongoing and iterative. An SRE will then resolve any issues found with the deployment during the 50% of their time that is focused on operations. Improvements to the deployment process are also likely to have a positive impact on the four metrics.

Improving the Software Pipeline

Another common task of the SRE is to improve the software pipeline. But what is a software pipeline? A software pipeline is the series of ever-more-complex tests, often run across many environments, that a team’s code passes through on its way to production.
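
The idea of stages that build on one another can be sketched in a few lines. The `run_pipeline` function and the stage names in the usage note are hypothetical; real pipelines are usually defined in a CI/CD tool’s own configuration format.

```python
from typing import Callable, List, Tuple

# A stage is a name plus a check that returns True when the stage passes.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> Tuple[bool, List[str]]:
    """Run stages in order and stop at the first failure.

    Returns (succeeded, names_of_stages_that_passed).
    """
    passed = []
    for name, check in stages:
        if not check():
            return False, passed
        passed.append(name)
    return True, passed
```

A pipeline such as `[("unit tests", run_unit), ("integration tests", run_integration), ("staging smoke test", run_smoke)]` would run the cheapest checks first, so a failing change is rejected before the slower, more expensive stages are ever started.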

Part of the SRE’s role is to improve the speed and quality of the software pipeline. The SRE asks questions such as, How effectively do the tests cover the application use cases? How well does the staging environment replicate production? What additional tests do we need for more confidence in our deployment?

As with both the incident response and deployment improvements, software pipeline improvements are iterative. As a team works with their pipeline, they get more feedback. The team uses this feedback to incorporate future changes that make their software pipeline faster and more reliable. The more the engineers trust their pipeline, the faster they can push code, without compromising site reliability.

SRE or DevOps: What Do You Need?

And that concludes our run-through of DevOps and SRE. There’s plenty more we could talk about, because the practices of improving software delivery are virtually bottomless. But we’ve covered the main areas, such as incident response, deployment process improvements, pipeline improvements, and observability.

As you can see, SRE and DevOps have a lot of overlap. Both concepts center on the business value of getting working software into production while minimizing the impact to users. The outcome of this is immensely desirable for many companies, yours included, I imagine.

I hope that helps clear the fog of confusion around the terms DevOps and SRE. And now you should be able to decide what’s right to implement for your company. No matter what you go forward with, good luck implementing your own DevOps culture—your customers will thank you for it!