
Chaos Engineering Without the Cosplay

I have been doing chaos engineering, in some form, for about nine years. I have given two conference talks about it. I have, at one point, owned the GitHub repository for an internal framework that was — briefly, in 2022 — name-checked in an AWS re:Invent session. I am telling you all this so that what follows is not mistaken for a junior engineer's bored skepticism. It is a tired senior engineer's specific complaint.

The complaint is this: chaos engineering, as it is currently practised in most companies I know, has become a piece of theatre. We have Gremlin licenses. We have chaos days. We have a Confluence page titled "Our Chaos Engineering Strategy" that has been viewed forty-eight times since it was published in November. None of this has, in any quarter I have measured, made our systems meaningfully more reliable. What has made our systems more reliable is something much smaller, much quieter, and much harder to put on a slide.

This essay is about what that smaller thing actually looks like.

The Simian Army problem

I want to talk for a minute about the Simian Army, because it is the foundation myth of our entire discipline, and I think we have been telling the myth wrong.

In 2011, Netflix published a blog post, "The Netflix Simian Army", describing a set of tools that intentionally caused failures in their production environment. Chaos Monkey killed random instances. Latency Monkey introduced delays. Conformity Monkey killed instances that did not match the desired configuration. The post is genuinely good. You should read it. You should also, if you are an SRE leader, notice three things about it that almost nobody in our industry talks about.

The first is that the Netflix engineers running these tools were, at the time, operating one of the largest, most homogeneous, most heavily-instrumented systems in the industry. Their monkeys made sense because their world was uniform enough that random failures produced learnable signal. In a system where every service uses the same deploy pipeline, the same observability stack, the same RPC framework — sure, kill a random instance, you will learn something. In a system where seventeen teams have made forty-three different architectural decisions in three years, killing a random instance teaches you that one specific team has not been testing failover, which is a thing you could have found out by asking.

The second is that the Netflix monkeys ran continuously, in production, as a background process. They were not chaos events. They were a chaos climate. Every Netflix engineer wrote code knowing that an instance of their service might disappear in the next eleven minutes. That is a culture, not a tool. You cannot buy that culture by installing Chaos Monkey.

The third is the one that gets me: the original Simian Army was built by a team of four, over about two years, at a company that had already invested years in service isolation, automated recovery, and graceful degradation. The monkeys were the icing. We, the industry, have spent a decade trying to bake the icing without the cake.

What I have actually seen work

In nine years of doing this, the chaos exercises that have produced the most useful findings are not the ones with the framework. They are the ones I now think of as Wednesday afternoon chaos. The format is the same every time.

It is an actual Wednesday afternoon, around two o'clock. The on-call engineer for the week is at their desk. I walk over (or, since we are mostly remote, I send a Slack DM) and say: "I am about to kill the leader of the Kafka cluster in us-east-1. Are you free for ninety minutes?" They say yes, or they say "give me an hour, I'm in a thing." We do it. We sit on a call for the duration. We watch the dashboards. We talk while it happens: "What did you expect to see? What are you actually seeing? Why is the consumer lag different on broker 3?" And we write down, in a shared doc as we go, every surprise.

That is the entire methodology. There is no framework. There is no scheduling tool. There is no pre-flight checklist. The cost is one engineer's afternoon, about every six weeks, and the value is enormous.

What we have found, in nine months of running these sessions, makes a list:

  • An autoscaling policy on our checkout service that was, due to a misconfigured cooldown, taking forty-one minutes to add capacity after a node failure. The metric had been there. Nobody had noticed because no real incident had stressed it hard enough to make it visible.
  • A circuit breaker in the auth-gateway sidecar that was opening correctly but never closing again, requiring a manual pod restart. We had been getting away with this because the breaker had not been tripped in production since the previous October.
  • A Datadog dashboard that, when one specific service was down, would itself fail to load, because it was making a call to the dead service to enrich its title. The on-call engineer's first response to most incidents is to open the dashboard. If the dashboard requires the service to be up to load, you have a bad day.

None of these would have been found by Chaos Monkey, because Chaos Monkey would not have spent ninety minutes watching the recovery with two engineers asking each other "is that the right number?" The value was in the watching, not the killing. The killing is the cheap part.
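
To make the second finding above concrete, here is a toy breaker in Python. It is a minimal sketch, not the auth-gateway sidecar's actual code: the class name, the thresholds, and the injectable clock are all assumptions made for illustration. The part worth noticing is the timeout check in `allow_request`, which is the transition an "opens correctly but never closes" breaker is effectively missing.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"        # calls pass through
    OPEN = "open"            # calls are rejected immediately
    HALF_OPEN = "half_open"  # trial calls are allowed through


class CircuitBreaker:
    """Hypothetical breaker; names and thresholds are illustrative only."""

    def __init__(self, failure_threshold=3, reset_timeout_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock  # injectable so recovery can be tested without sleeping
        self.state = State.CLOSED
        self.failures = 0
        self.opened_at = None

    def allow_request(self):
        if self.state is State.OPEN:
            # Without this check the breaker stays open until something
            # external (for example, a restart) resets it.
            if self.clock() - self.opened_at >= self.reset_timeout_s:
                self.state = State.HALF_OPEN
                return True
            return False
        return True

    def record_success(self):
        # Any success, including a trial call in HALF_OPEN, closes the breaker.
        self.failures = 0
        self.state = State.CLOSED

    def record_failure(self):
        if self.state is State.HALF_OPEN:
            self._trip()  # the trial call failed: reopen and restart the timer
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self._trip()

    def _trip(self):
        self.state = State.OPEN
        self.opened_at = self.clock()
        self.failures = 0
```

In a ninety-minute session the interesting question is not whether the breaker opens, but whether anyone ever sees it return to `CLOSED` without a human stepping in.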

Against the chaos pipeline

The thing I am most skeptical of, in the modern chaos discipline, is the continuous chaos pipeline. The idea, popular in conference talks since about 2022, is that you bake chaos experiments into your CI/CD, so that every deploy is automatically exercising your failure modes. This is, I will grant, sometimes a good idea. For mature, homogeneous, well-instrumented services, it can catch real regressions.

The problem is that for most companies (including, I am almost certain, yours) the chaos pipeline becomes a box to tick. "Yes, we run chaos experiments in CI." What that means, in practice, is that a junior engineer six months ago wrote a test that kills one pod and checks for a successful response, and the test has been passing ever since, because the system trivially handles one dead pod. The test does not get harder. The test does not learn. The test certainly does not exercise the kind of multi-service, partial-degradation, ten-minutes-into-the-incident edge cases that actually cause real outages.
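
One way to see why "checks for a successful response" is such a weak assertion: compare it with a check on how long recovery took. The Python below is a rough sketch under assumptions of my own. The probe format (timestamp, success flag), the function names, and the "three consecutive successes" rule are all hypothetical, and it does not describe any particular CI tool.

```python
from typing import List, Optional, Tuple

# (seconds since the start of the exercise, did the probe succeed?)
Sample = Tuple[float, bool]


def time_to_recovery(samples: List[Sample], fault_at: float,
                     stable_run: int = 3) -> Optional[float]:
    """Seconds from fault injection to the start of the first run of
    `stable_run` consecutive successful probes; None if that never happens."""
    run_start = None
    run_len = 0
    for ts, ok in sorted(samples):
        if ts < fault_at:
            continue  # ignore probes taken before the fault
        if ok:
            if run_len == 0:
                run_start = ts
            run_len += 1
            if run_len >= stable_run:
                return run_start - fault_at
        else:
            run_len = 0  # a failure breaks the run; recovery has not held
            run_start = None
    return None


def within_budget(samples: List[Sample], fault_at: float,
                  budget_s: float, stable_run: int = 3) -> bool:
    """True only if recovery happened and took no longer than the budget."""
    ttr = time_to_recovery(samples, fault_at, stable_run)
    return ttr is not None and ttr <= budget_s


# Hypothetical probe log: one early success at t=20, then a relapse.
probes = [(0, True), (10, False), (20, True), (30, False),
          (40, True), (50, True), (60, True)]
print(time_to_recovery(probes, fault_at=5))         # 35
print(within_budget(probes, fault_at=5, budget_s=30))  # False
```

A test that stops at the first success would have passed at t=20 in this log. A budget check at least forces someone to write down what "recovered" is supposed to mean, though it still does not replace two people watching the recovery.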

The chaos pipeline becomes a piece of compliance theatre — which is, I will note, exactly what we accuse change-management committees of being. We have reinvented the change advisory board, but with monkeys.

What I am doing instead

I am running, on every team I touch now, what I call the slow chaos rotation. It is exactly what I described above. Every six weeks, one engineer (rotating through the team) picks one assumption about the system, and we spend ninety minutes deliberately breaking it. Sometimes the assumption is "the autoscaler can handle a 30% sudden capacity loss." Sometimes the assumption is "if the primary Postgres goes down, the replica promotion takes less than four minutes." Sometimes the assumption is "the dashboard renders when the auth-gateway is degraded." Sometimes, and this is my favourite, the assumption is "I, the engineer running this, actually know what I am looking at."
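
Before breaking an assumption like the first one, it can help to write the expected number down. The arithmetic below is a back-of-the-envelope sketch in Python, not a capacity model: it assumes load stays constant and spreads evenly across surviving instances, and the 60% starting utilization is a made-up figure.

```python
def utilization_after_loss(utilization: float, loss_fraction: float) -> float:
    """Utilization of the surviving capacity, assuming constant load that is
    spread evenly (both are simplifying assumptions)."""
    if not 0.0 <= loss_fraction < 1.0:
        raise ValueError("loss_fraction must be in [0, 1)")
    return utilization / (1.0 - loss_fraction)


# Hypothetical numbers: a fleet running at 60% loses 30% of its capacity.
expected = utilization_after_loss(0.60, 0.30)
print(f"expected utilization after loss: {expected:.0%}")  # 86%
```

Under those assumptions, anything running above 70% before the loss is over capacity after it, before the autoscaler has done anything. If the dashboards show something very different from the number you wrote down, that difference is the finding.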

This last one matters more than I expected. A chaos exercise that proves you do not yet know what a normal failure recovery should look like is more valuable than one that proves a system handles a failure cleanly, because it tells you to go fix the gap in your own understanding before the gap fixes itself, badly, at 3 a.m. on a Saturday.

You do not need a framework for any of this. You need a calendar invite. You need ninety minutes. You need two engineers willing to look at the same dashboard at the same time and admit, out loud, when something surprises them. You need, most importantly, a culture where admitting surprise is not treated as a sign that you do not know your job.

A modest proposal

If your company is currently sold on the idea of buying a chaos platform — and if you are an SRE lead in a series-C or later company, your company almost certainly is — I want to make a small counterproposal.

Take whatever the licensing cost is. Take the engineering time you would spend integrating it. Take it all, add it up, and instead spend it on a single full-time engineer whose job, for one quarter, is to run Wednesday afternoon chaos. No framework. No platform. No dashboard of past experiments. Just an engineer, a calendar, and the willingness to break one thing every six weeks and sit with the result for a couple of hours.

At the end of the quarter, count the findings. Compare them to whatever the platform vendor's case study claims the platform would have found. I will, in advance, bet you a coffee that the human-with-calendar wins by a factor of four. I have run this experiment, informally, twice. I have not lost.

We do not need a Simian Army. We need someone willing to kill one node on a Wednesday afternoon and watch what happens. The cosplay is optional.

Next Tuesday: a piece I have been sitting on for a while about why the word "blameless" is doing a lot of dishonest work in our industry. Until then, go schedule something that breaks on Wednesday. Don't tell anyone. See what you learn.

Anya Petrova

About the author

Site Reliability Engineer in Vancouver. Writes about chaos, on-call, and the slow craft of keeping production alive. New essays every Tuesday.
