Against the Runbook: Why I Stopped Trusting My Own Documentation

On April 8th, 2026, at 9:42 a.m. Pacific, Auth0 experienced a partial degradation in us-west-2 that lasted forty-six minutes. You may have read about it. It made the front page of Hacker News for a few hours and then drifted off, the way these things do, into the slow background noise of an internet that breaks all the time.

For my team, that morning was a small disaster in a particular way. We had a runbook for it. The runbook had been written, by me, sixteen months earlier. The runbook was wrong.

This essay is about that runbook, and about why I have spent the months since trying to dismantle the thing I used to evangelise.

What the runbook said

I have the document in front of me as I write this. It is titled Auth0 Degradation — On-Call Response, and it lives in our internal Backstage. The first revision is mine, dated December 2024. It has been read six hundred and thirty-one times. It has been updated nine times since I wrote it. Three of those updates are mine. The other six are by colleagues, all in good faith, all making it incrementally worse.

The runbook had thirteen steps. I will quote you the first four, exactly as they appeared the morning of the incident:

1. Confirm the Auth0 status page shows an incident.
2. Check our internal auth-gateway dashboard for 5xx error rate above 0.4%.
3. If both, enable the auth_fallback_local flag in LaunchDarkly.
4. Page the auth team lead and post in #incident-active.

This looks fine. It is what a runbook is supposed to look like. It is what every SRE consultancy will tell you a runbook should look like. It is also, when I read it now, a small piece of fiction that I wrote about a system I no longer fully understood by April.

What actually happened

At 9:42, the Auth0 status page was still showing all-green. Their incident did not get posted until 9:51. Our auth-gateway was showing 5xx at 0.3% — under the runbook threshold I had specified, because between December 2024 and April 2026 we had moved a chunk of the auth load off the gateway and onto a sidecar that the dashboard did not measure. The runbook said no incident. The system was, factually, on fire.
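Read as code, the first three steps of the runbook collapse into a single AND gate. The sketch below is only an illustration of that shape: the function and parameter names are hypothetical, and the 9:51 reading assumes the dashboard still showed 0.3% once the status page caught up, which the incident timeline does not state outright.

```python
# Sketch of the runbook's first three steps as a single gate.
# All names are illustrative; none come from real tooling.

RUNBOOK_5XX_THRESHOLD = 0.004  # the 0.4% threshold written into the runbook


def runbook_says_enable_fallback(status_page_shows_incident: bool,
                                 gateway_5xx_rate: float) -> bool:
    """Steps 1-3: enable the fallback only if BOTH conditions hold."""
    return status_page_shows_incident and gateway_5xx_rate > RUNBOOK_5XX_THRESHOLD


# 9:42 a.m.: status page still green, gateway dashboard at 0.3%.
at_0942 = runbook_says_enable_fallback(False, 0.003)

# 9:51 a.m.: status page now shows the incident. ASSUMPTION: the
# dashboard still read 0.3%, so the gate stays closed.
at_0951 = runbook_says_enable_fallback(True, 0.003)

print(at_0942, at_0951)  # False False
```

Both inputs can be stale in different ways, and because they are joined by "and", either one being stale is enough to keep the gate shut.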

The on-call engineer that morning was Mei, who has been with us for eighteen months and is, by some distance, one of the most competent SREs I have worked with. She did exactly what the runbook said. She watched the dashboard for nine minutes. She did not enable the fallback flag, because the threshold had not been crossed. She did not page the auth lead, because step 4 was gated on step 3. She wrote in #sre-on-call that she was seeing some elevated latency but below thresholds, monitoring. By the time the customer-support escalation finally reached her at 9:54, we had been silently dropping about 14% of new sign-ins for twelve minutes.

The post-mortem was, to use the term of art, educational. Mei did everything right. The runbook did everything wrong. And I — sitting in the post-mortem two days later, with my name on the document's edit history — had to sit with the fact that I had built the trap she walked into.

The deeper problem

The thing that bothered me most was not the specific error in the threshold. Thresholds drift. That is just the cost of writing anything down about a system that changes. The thing that bothered me was the shape of the document.

A runbook says: here is what to do. It does not say: here is how to think about it. When you read a runbook at 3:47 in the morning, with half a brain online, it is comforting precisely because it does not ask you to think. You scan, you match, you execute. That is its entire promise. And that is what makes it dangerous, because the moment the runbook is wrong about a specific detail — and given enough time, every runbook is wrong about something — the on-call engineer trusts the shape of the instructions over the substance of the symptoms in front of them.

We had built, through a thousand small acts of well-intentioned documentation, a culture that trusted documents over senses. The runbook had become a piece of furniture in the room. Nobody questioned it because nobody questions furniture.

What I tried instead

I want to be clear: I am not arguing for getting rid of runbooks. I am arguing for getting rid of commanding runbooks — the imperative-mood, step-by-step kind that we have been trained, by the SRE canon, to write.

What we are moving to instead, slowly and with some pushback, is what one of my colleagues has started calling diagnostic memos. A diagnostic memo for the Auth0 case looks like this:

Auth0 Degradation — Diagnostic Notes
When Auth0 partially fails, the symptom we usually see first is elevated p99 on auth-gateway and a rise in the auth_sidecar_circuit_open counter. Either may appear before the Auth0 status page updates; the status page has lagged real incidents by 6–14 minutes in the past.
The decision you are making is whether to enable auth_fallback_local. The cost of enabling it unnecessarily is small and short-lived (local fallback handles ~70% of flows), so when in doubt, enable. The cost of leaving it disabled during a real outage is dropped sign-ins, which we measure in lifetime value, not in 5xx.
Page the auth lead any time you've been investigating for more than five minutes, even if you have not yet decided. Their context is cheap; your isolation is expensive.

This is longer than the runbook. It is harder to follow at 3:47. It asks more of the reader. It contains opinions, in prose, that someone might disagree with. All of these are features.

What it does, that the runbook did not, is teach the reader how to think about the system. If the dashboard threshold has drifted, the diagnostic memo still points at the right symptoms. If the Auth0 status page is slow, the memo names the lag. If the engineer is uncertain, the memo says: page anyway. The memo does not collapse the moment a single detail changes, because it was never built on the assumption that a single set of instructions could survive sixteen months of system drift.
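The memo is meant to be read as prose, not executed, but writing its guidance out as code makes the difference in shape visible. This is a rough sketch under one interpretive assumption: that "when in doubt, enable" means any single early symptom is enough to lean toward the fallback. All function and parameter names are hypothetical.

```python
# Illustrative only: the memo is prose for a human, not a script.
# All names here are hypothetical.

PAGE_AFTER_MINUTES = 5  # "more than five minutes", from the memo


def memo_leans_toward_fallback(gateway_p99_elevated: bool,
                               sidecar_circuit_open_rising: bool,
                               status_page_shows_incident: bool) -> bool:
    """ASSUMPTION: any one early symptom is enough to lean toward enabling."""
    return (gateway_p99_elevated
            or sidecar_circuit_open_rising
            or status_page_shows_incident)


def memo_says_page_lead(minutes_investigating: float) -> bool:
    """Page the auth lead after five minutes, whether or not a decision is made."""
    return minutes_investigating > PAGE_AFTER_MINUTES


# Elevated latency alone, status page still green, nine minutes of watching:
print(memo_leans_toward_fallback(True, False, False))  # True
print(memo_says_page_lead(9))                          # True
```

The signals are joined by "or" and the page is decoupled from the flag decision, so one stale input no longer blocks everything downstream.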

What I gave up

I gave up the comfortable fiction that documentation is a one-time deliverable. The diagnostic memos are not "done." We re-read them after every incident in the relevant area. We rewrite the prose, not the steps, because the prose is the part that has to track the engineer's evolving understanding. They get worse, then better, then worse again. They are alive in a way that the old runbooks pretended not to be.

I gave up the metric of runbook coverage. We used to report, in our quarterly reliability review, the percentage of our top-30 services that had a published runbook. It was at 96% in Q1. It was, as April showed us, a lie. Coverage is not what you want. Reading is what you want. We now report a different number: the median number of times each diagnostic memo was opened during incidents in the last quarter. It is a smaller number, and a more honest one.
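The two numbers measure different things, which a few lines make concrete. This is a minimal sketch, not the team's actual reporting pipeline: the function names, memo identifiers, and open events below are made up for illustration, and it assumes a memo that was never opened during an incident counts as zero.

```python
from collections import Counter
from statistics import median


def coverage_percent(services_with_runbook: int, total_services: int) -> float:
    """Old metric: share of services that have a published runbook."""
    return 100.0 * services_with_runbook / total_services


def median_opens_per_memo(memo_ids, incident_opens):
    """New metric: median number of incident-time opens per memo.

    memo_ids: every published memo (unopened memos count as zero).
    incident_opens: one memo id per open event during an incident.
    """
    counts = Counter(incident_opens)
    return median(counts[memo_id] for memo_id in memo_ids)


# Hypothetical data: three memos, three opens during incidents.
memos = ["auth0", "kafka-lag", "dns"]
opens = ["auth0", "auth0", "kafka-lag"]

print(coverage_percent(3, 3))               # 100.0
print(median_opens_per_memo(memos, opens))  # 1
```

In this toy data every service is "covered", yet one memo was never opened when it mattered; only the second number shows that.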

And I gave up the idea that I, as the person who wrote the document, was helping the on-call engineer by being specific. I was helping them less than I thought, because I was substituting my sixteen-month-stale model of the system for their live, real-time observation of it. The most helpful thing I can do for the engineer at 3:47 is not to give them a script. It is to give them a clear, opinionated framework for thinking, and trust them to do the thinking.

I am still uncomfortable with this. The runbook habit is deep. When I open Backstage and start writing a new document for a new service, my fingers want to write Step 1. Step 2. Step 3. I delete it. I write prose instead. I am, slowly, getting better at it.

If you want to read what one of our newer diagnostic memos looks like, I have published an annotated version of our Kafka-consumer-lag memo over on my GitHub. The annotations are the interesting part — they show what the on-call engineer cared about during three real incidents in 2026, and where the memo got rewritten as a result.

Next Tuesday: what it actually felt like to delete a Kubernetes operator I had spent six months building, after we found out it was the cause of the cascading failure it was supposed to prevent. Until then, throw out your runbooks. Or at least be a little suspicious of them.

Anya Petrova

Comments are open — by email reply.
