WTF is SRE? The job nobody understands.

Oct 11, 2020 9 min sre devops rant

Trigger warning: If you have strong opinions about Ops, DevOps, SRE and adjacent subjects, read this at your own risk.

A bit over 4 years ago I accepted a job offer with the “Site Reliability Engineer” title, and at the time I had no clue what it was really about. The recruiter said it was kinda like a normal Software Engineer, but more about infrastructure, and that was all. Since then people have written countless posts, given dozens of talks, literally published whole books about this job, and yet the industry in general has no clue what the fuck SRE means. Okay, maybe SRE is still new ¹, but DevOps has been popular for a decade and people still get it wrong 🤷

So let’s set the record straight, shall we?

First of all, every modern organization needs IT Operations (Ops for short) these days. Computer systems are complicated enough to require specialized professionals to run and maintain them. This role’s purpose is to take existing components (software and hardware) and make them work together to solve the business’s problems. Even though Ops usually isn’t about creating new components, this is really a jack-of-all-trades role and they can do anything IT.

Depending on the scale of the organization, Ops can branch out into more specialized roles such as Hardware Ops, Network Engineers and System Administrators, be that in-house or outsourced. Contrary to popular opinion, coding skills often play a big role in these jobs, for example in integrating different systems, task automation, etc. And yeah, all those folks are “real engineers”, oftentimes more real than “software engineers”, but I digress.

Some companies reach a stage when integrating off-the-shelf solutions is no longer cost-efficient for them, and they create an in-house Software Development team to build the missing blocks from scratch. Since that team usually comes after the Ops team, it’s only natural for the organization to let the Software Development team focus on building new stuff, and let Ops run it as they already do with the third-party components. This arrangement works, but only to a point:

The delivery cycle is slow because it requires two different teams (with different priorities and organizational incentives) to do something before a new version gets deployed.
The Development team lacks first-hand experience with the problems their software encounters in production, which makes designing for reliability and ease of operation difficult, if not impossible.

A seemingly obvious solution for these problems is to put the same people in charge of developing and operating the software in production.² This idea was named (guess what) DevOps.

Here comes my first pet peeve: “devops” is not a role, it’s a methodology. You don’t need Terraform, Ansible, AWS, Chef, Docker, Kubernetes and CI-of-the-day to be devops. You only need to make your software engineers responsible for supporting their code in production. If you are a freelancer who builds internet shops for customers and you also have a dozen shell scripts to deploy code, upgrade databases and manage backups, and you run them yourself, congratulations, you are using the DevOps methodology!

So why did all these dockers and chefs get associated with DevOps so firmly? Well, if you have a hammer in your hand everything looks like a nail. Operations is a pretty large annoying nail in a software developer’s view (rightfully so), and writing more software looks like a handy sledgehammer for the occasion. Which is DevOps working as intended: developers adapt their software and tooling to be easier to operate in prod! For sure, sometimes folks get carried away and build an elaborate geo-distributed multi-tenant hybrid cloud environment when a couple of bash scripts would’ve sufficed, but hey, I don’t judge 😇

This works well, but only until all the production-related tooling becomes complicated enough that the software developers no longer have the mental capacity to manage the tooling and develop the product at the same time. Again, the standard solution is to specialize: a new team is formed to take care of the production stack and provide a high-level abstraction for the product team. And because they already deal with technologies that are deeply linked to “DevOps the buzzword”, the whole unit gets named “The DevOps Team”. This is why there are ~150k DevOps job openings on LinkedIn alone even though DevOps is not a role! 🤦

To confuse things even further, the meaning of “DevOps” varies massively between organizations. On one end of the spectrum we have specialized SysAdmin teams, who focus on setting up and running Kubernetes, CI and control processes around it. Funnily enough, this kind of “DevOps the team” may not be doing “DevOps the methodology”, but their fellow product development teams very well may be following “DevOps the methodology”, assuming they themselves run their product in production using tools provided by “DevOps the team”.

<snark>The far end of this side of the spectrum is essentially a re-branded “Development team + Ops team” combo, except that the Development team outputs Docker containers instead of JARs. Full circle closed.</snark>

Confused yet? Hang on, it gets better, because on the other side of the “devops the team” spectrum are specialized software engineering teams who build production infrastructure as an internal product. And this team may or may not follow “devops the process” themselves, but typically their purpose is to enable other teams to follow it.

To summarize:

When people say “DevOps methodology” they mean that the same people are responsible for writing the code and supporting it in production. Or they might merely imply that Docker is involved at some stage, so make sure to clarify.
When people say “DevOps team” they typically mean “infrastructure team”, which can include roles ranging from SysAdmins, to System Engineers, to regular Software Engineers. Again, make sure you clarify what they mean exactly.
When people say just “DevOps”… you guessed it, they can mean anything and you should ask clarifying questions. 😒

Oops, we are a thousand words in and haven’t even gotten to the subject of the post 😩 Well, let’s get into it.

Paraphrasing Dan Ariely, SRE is kind of like a unicorn: everybody talks about it, nobody has really seen one, but everyone wants to have one.

Turns out a naive DevOps implementation has a couple of problems with it. First, product-feature and operations-feature development require rather different mindsets and proficiencies. A similar difference exists, for example, between Software Engineers and Security Engineers: both know how to code and both know about security vulnerabilities, but they need different mindsets to be good at their jobs, and switching hats frequently is difficult.

Another problem is that product teams are often under high pressure to deliver product features to the market faster, which pushes them to compromise on non-functional aspects of the product, such as quality control or ease of operations/reliability.

So wouldn’t it be nice to have a specialized group of software engineers within the product team, whose objective is specifically to ensure that the product meets the operations and reliability requirements the business needs? This is what “SRE the methodology” offers: the product team gains a specialized sub-team of “SRE the role” people who build the functionality necessary for operating the product at the reliability level the business needs, and who share their expertise with the rest of the team. It lets the product team have experts on both production aspects and business logic, with both groups deeply hands-on with the code. Sometimes the SWE and SRE teams may belong to different parts of the organizational hierarchy, but they are still building the product together. This is the key difference between “SRE the methodology” and the classic “SWE + Ops” approach. Side note: a successful SRE is both a good engineer and a good educator; siloing all production-related work within SRE is a path to failure.

This brings me to my second pet peeve: being oncall is not the purpose of “SRE the role”. In an ideal world SREs wouldn’t be oncall because they’ve built the system to self-recover from any possible failure without human intervention. And even though in the real world this is not 100% achievable, being oncall gives SREs the knowledge and experience required to make the product more resilient. Being oncall for SREs is a “know your enemy” exercise, not a “somebody has to” chore. Case in point, most companies with mature SRE teams have developer oncall rotations that serve as a second line of escalation for exactly that reason.

While we are talking about misconceptions, let’s bring up another one: “SRE is an infrastructure/platforms team”. It is true that many infrastructure elements are spawned and owned by SRE teams, but as soon as the infrastructure becomes a product (external or internal) in itself, it starts needing distinct SRE and SWE roles. Unfortunately, this often goes unrecognized and SREs have to wear both hats when it comes to infrastructure products, which is problematic for all the reasons mentioned above.

A corollary is that “anything related to production/reliability should be owned by an SRE team” is not true. To give a concrete example: building monitoring for a system is generally considered an SRE kind of task. That is, SREs are proficient users of monitoring software and know how to make it work. However, building a monitoring platform as a product is a completely different task, much more similar to building a database, and saying that the team of software engineers who build a monitoring system are by definition SREs makes no sense.

So there you have it: DevOps is a practice where the same people write the code and support it in production; SRE is a particular way of managing the cognitive overload DevOps may create, at least in theory. Reality, of course, is messy and different because people often misunderstand or willfully misuse the terms for various reasons. I’m not sure if we should admit defeat and let people use those words to represent anything vaguely related to production, or if we should be pedantic and try to educate people about the actual meaning of these words… 😣 But I hope this post cleared things up at least for someone.

¹ It has existed at Google since 2003, but nobody really cared until The SRE Book came out in 2016.
² Pure software companies often arrive at this by starting with only a Software Engineering department and then expanding its responsibilities to production management once there is production to run, thereby jumping straight into DevOps. But in the grand scheme of things the outcome is the same.
