Ne v kontakte Asocial programmer's blog

Wild gremlin engineering

… or how to sell Chaos Engineering to your team when everything is already on fire.

This post is once again inspired by a discussion in the DevZen podcast, episode #309 (for Russian speakers, I highly recommend listening to this one, lots of 🔥 discussions) and the Chaos Engineering book.

In the podcast, @sum3rman brought up an excellent point that most teams don’t reach a stage where their product is “too stable” and they need to introduce faults deliberately. Many more teams are actually in a semi-permanent dumpster fire state, and claiming that breaking it even further will somehow help is gonna be a tough sell.

on_fire.jpg
Your teammates when you wanna let chaos monkeys in.

So is Chaos Engineering any good when everything is already in chaos? Absolutely! Chaos Engineering is not about causing chaos, it’s about making sure your systems can deal with it. Fault injection is very useful for finding new and rare failure modes (which can be catastrophic nevertheless), but if your production is full of wild gremlins already, you can skip straight to step two: root causing and fixing problems.

Frequent outages and operational overload are exhausting for the engineering team, and one of the nasty traps of exhaustion is tunnel vision. The more stuff is on fire, the harder it is to step back and look for patterns, and the more tempting it is to run around putting every fire out individually. It is easier in the short term, but usually it just shifts the problem from one place to another.

firefighter.gif
I can see myself in this picture and I don’t like it.

So if you want to help your team get out of perpetual firefighting, Chaos Engineering is a great tool, as long as you focus on understanding the systems' behavior, failure modes and root causes:

  1. Assume any component can fail. Not only software: hardware, human operators, infrastructure and logistics to name a few.
  2. Do risk analysis, understand business implications for different failures.
  3. Explicitly decide which failure modes the system must be able to survive, based on business impact.
  4. Make it so.
  5. Do this continuously and don’t be afraid to change plans if the production situation changes.
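
Steps 2 and 3 can be as simple as a scored list of failure modes. Here is a minimal sketch of that idea, not a prescribed method: the 1 to 5 scales, the likelihood-times-impact score, the threshold and the example failure modes are all made up for illustration, and your own risk analysis may weigh things very differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str
    likelihood: int  # hypothetical scale: 1 (rare) .. 5 (frequent)
    impact: int      # hypothetical scale: 1 (minor) .. 5 (business-critical)

    @property
    def risk(self) -> int:
        # Simple likelihood x impact score; an assumption, not a standard.
        return self.likelihood * self.impact


def must_survive(modes, threshold):
    """Return failure modes scoring at or above the threshold, riskiest first."""
    picked = [m for m in modes if m.risk >= threshold]
    return sorted(picked, key=lambda m: m.risk, reverse=True)


# Illustrative entries only; note they are not all software failures.
modes = [
    FailureMode("primary database node lost", likelihood=2, impact=5),
    FailureMode("on-call engineer unreachable", likelihood=3, impact=3),
    FailureMode("log shipping delayed", likelihood=4, impact=1),
]

selected = must_survive(modes, threshold=9)
for mode in selected:
    print(f"{mode.name}: risk {mode.risk}")
```

The point of writing it down like this is that the decision becomes explicit: whatever falls below the threshold is a failure you have consciously chosen not to handle yet, and the list is easy to re-score when production changes.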

Eventually, the gremlins will be tamed and you will be able to have some fun with chaos monkeys 🔨🐵🔧 But until then, there’s no shame in taking advantage of chaos you didn’t create intentionally. If anything, it’s more representative of your system’s real problems 🙃