On Blame, and Why "Blameless" Post-Mortems Are Often Neither

We adopted the blameless post-mortem template from the Google SRE book in March 2019. I was the one who proposed it. I sat in a Wednesday all-hands and gave a fifteen-minute presentation on why focusing on systems rather than individuals would, in time, produce a healthier engineering culture. People nodded. The slide deck still exists in my Dropbox. I read it again recently, while preparing this essay, and I noticed something I had not noticed at the time: the word blameless appears nineteen times in seventeen slides. The word human appears once, in a footer.

This is a piece about what we got wrong, what we are slowly getting right, and why I think the word blameless has done more damage to our post-mortem culture than the explicit blame it was supposed to replace.

The case for the template

Let me start by being fair to the original idea. In 2019, our post-mortems were a small disaster. The format was unstructured. They were often run by the manager of the team where the incident happened, which meant the incident write-up read, with depressing reliability, as a defence of that team's decisions. There were no action items. There was no follow-through. After a particularly bad week in February 2019, when a junior engineer was loudly dressed down in a Slack channel for pushing a config change that took down our staging environment, I went home, ordered the Google SRE book, and underlined the chapter on post-mortems with a pen.

The template I introduced six weeks later was good. It was modelled on Google's. It had structured sections: what happened, timeline, root cause, contributing factors, action items, what went well. It explicitly prohibited the phrases "should have" and "failed to" in the body. It required the post-mortem to be facilitated by someone not on the affected team. It mandated that the document be published, in full, to a company-wide Confluence space within five working days. It worked. The Slack-channel pile-on stopped. The defensive write-ups stopped. We started producing real, structured documents about real incidents.

For about a year, I thought I had fixed the problem.

What was actually happening

In November 2022, three and a half years in, we had to let an engineer go. I will not name the engineer. The reasons given to HR were consistent under-performance and failure to address feedback. The reasons given in the executive review I sat in on, two days before the conversation, were more specific. The engineer had been at the centre of three SEV2 incidents in the previous eight months. Their name appeared in the timelines of seven post-mortems, more than any other engineer on the platform team. Their peer reviews mentioned, in carefully neutral language, that they were sometimes a source of risk.

Nothing in any of those documents used the word blame. The post-mortems were textbook blameless. The peer reviews were textbook constructive. The performance-improvement plan that preceded the termination was textbook fair. And yet, when I added up the surface area of the documents that contributed to the decision, the picture was unmistakable: we had constructed a paper trail that pointed at exactly one person, and we had used it to fire them.

I sat with this for a long time afterwards. I am still sitting with it. The engineer was not, by any honest reading, the cause of three SEV2s. They were the engineer holding the pager when three SEV2s happened, in a system where the pager was carried by a small rotation of people. Two of the incidents involved a service that had been built by an architect who had left the company, and that nobody on the team had ever fully understood. The third came from configuration drift that had been quietly accumulating for over a year before it finally caused damage. The contributing factors section of each post-mortem mentioned these things: the abandoned service, the configuration drift, the institutional knowledge gap. But the name of the human at the keyboard appeared, in each document, in the timeline.

A name in a timeline, repeated across seven documents, is a paper trail. The template did not say blameless meant nameless. We never thought to ask whether the two were the same thing. They are not.

What blameless is not

The thing I have come to believe, slowly and at some cost, is that the blameless post-mortem as we have institutionalised it makes three quiet promises that the format itself cannot keep.

The first promise is that if we write carefully enough about systems, the human in the loop will disappear from the analysis. This is technically true and practically false. The human will not disappear from the document, because the document records what happened, and what happened included a human typing a command. The human's name will be in the timeline. The human's name will be in the Slack threads linked from the timeline. The human's name will be in the commit history referenced from the contributing factors. Naming the system without naming the human is a literary exercise, not an organisational one.

The second promise is that blameless analysis produces blameless consequences. This is the one that hurt the most to notice. Even when the document is genuinely careful — even when the systemic analysis is genuinely sharp — the document is read by humans who carry, in their own heads, an entirely separate model of who screws up and who doesn't. A senior leader who reads seven post-mortems that all mention the same engineer's name in the timeline is going to form a pattern in their head, and the careful prose in the contributing factors section is going to do approximately nothing to dislodge it. The blameless format does not make the reader blameless. We forgot to read our own format from the reader's perspective.

The third promise is the deepest one: that the absence of explicit blame produces psychological safety. I no longer believe this is true, and I think believing it for as long as I did was a mistake. Psychological safety is not produced by language. It is produced by the lived experience of telling the truth and watching nothing bad happen. If an engineer says, in a post-mortem, "I made this mistake because I was tired and the runbook was wrong and I should have escalated", and three months later that same engineer is on a PIP, the format will not save us. The format is what made the firing look fair.

What I am trying instead

I have not got this fully right yet. I want to be honest about that. But here is what we have started doing.

We have separated the narrative document from the operational one. The narrative document is what the team writes — for itself — in the days after the incident. It is honest about who did what, because that is how engineers actually learn. It is not published. It is not stored in a system that can be searched by a manager looking for patterns. It is shared on a small distribution list, kept for thirty days, then destroyed. The destruction is the point. The narrative cannot be a paper trail if the paper does not survive.

The operational document is what gets published. It contains the systemic findings, the action items, the timeline with role-not-name attribution (the on-call engineer, the deployer, the reviewer, not Anya, Mei, Samir). It is the public record. It is what the org learns from. It is also, deliberately, drained of the kind of detail that lets a senior leader build a pattern about a person.
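Role-not-name attribution is mechanical enough to sketch. What follows is a minimal illustration, assuming timeline entries are plain strings and the facilitator supplies a name-to-role mapping for the incident. The function name `attribute_by_role`, the mapping, and the sample entries are hypothetical, made up for this example rather than taken from any real tool.

```python
import re


def attribute_by_role(timeline, roles):
    """Replace each known name in the timeline entries with its role.

    `roles` maps a person's name to the role they held during the
    incident. Both argument shapes are assumptions made for this sketch.
    """
    if not roles:
        return list(timeline)
    # Longest names first, so a full name is matched before a first name.
    names = sorted(roles, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    return [pattern.sub(lambda m: roles[m.group(1)], entry) for entry in timeline]


# Hypothetical sample data.
timeline = [
    "14:02 Anya acknowledged the page",
    "14:10 Mei deployed the rollback, reviewed by Samir",
]
roles = {
    "Anya": "the on-call engineer",
    "Mei": "the deployer",
    "Samir": "the reviewer",
}
redacted = attribute_by_role(timeline, roles)
for entry in redacted:
    print(entry)
```

A substitution like this only covers the body of the document. The Slack threads and commit history that a timeline links to still carry names, so the links need the same scrutiny as the prose.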

This is awkward. It is more work. It produces two documents instead of one. It requires the team to actively trust that the destruction will happen, and we have set up a small automation that does the destruction, so that no human has to remember to do it. The first time I proposed this, I got pushback from our compliance team. The compliance team's concern was, almost word-for-word, "but how will we know who to hold accountable?" And I sat with that question for a long time, because I think the honest answer is: you won't, and that is the entire point.
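For readers wondering what a destruction automation might look like, here is a minimal sketch. It assumes the narrative documents live as files in a single directory and that a file's modification time is a fair proxy for when it was written. The function name `purge_expired`, the directory layout, and the file names are hypothetical. A real version would also have to deal with backups and forwarded copies, which this does not attempt.

```python
import os
import tempfile
import time
from pathlib import Path

RETENTION_DAYS = 30  # "kept for thirty days, then destroyed"


def purge_expired(directory, now=None, retention_days=RETENTION_DAYS):
    """Delete files older than the retention window; return their names."""
    now = time.time() if now is None else now
    cutoff = now - retention_days * 24 * 60 * 60
    deleted = []
    for path in Path(directory).iterdir():
        # Modification time stands in for creation time (an assumption).
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            deleted.append(path.name)
    return sorted(deleted)


# Demonstration in a throwaway directory with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    old = Path(tmp) / "incident-a.md"
    new = Path(tmp) / "incident-b.md"
    old.write_text("narrative")
    new.write_text("narrative")
    now = time.time()
    stale = now - 31 * 24 * 60 * 60
    os.utime(old, (stale, stale))  # backdate one file by 31 days
    deleted = purge_expired(tmp, now=now)
    remaining = sorted(p.name for p in Path(tmp).iterdir())
```

Running something like this on a schedule, rather than by hand, is what takes the remembering out of it.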

The second thing we are doing is what I have started calling the consequence review. Every six months, we look at every person who has been performance-managed, demoted, or let go in the previous half-year, and we look at the post-mortems they appeared in. If a person's name appears in three or more post-mortem timelines in the relevant window, the consequence review treats this as evidence of systemic risk-concentration, not of individual under-performance. The default action becomes "redesign the rotation", not "manage out the engineer". We have done two of these reviews now. The first one quietly changed how we assigned on-call. The second one quietly changed who we put on a particular service.
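The threshold rule in the consequence review can be written down in a few lines. This is a sketch under stated assumptions: timeline appearances are available as (post-mortem id, person) pairs for the review window, and the list of people who were performance-managed, demoted, or let go comes from somewhere else. The function name, identifiers, and recommendation strings are all hypothetical.

```python
from collections import Counter

THRESHOLD = 3  # "three or more post-mortem timelines"


def consequence_review(managed_people, timeline_appearances, threshold=THRESHOLD):
    """Map each managed person to a default recommendation.

    `timeline_appearances` is an iterable of (postmortem_id, person) pairs.
    Repeated mentions within one post-mortem count once.
    """
    # De-duplicate pairs so we count distinct post-mortems per person.
    counts = Counter(person for _, person in set(timeline_appearances))
    return {
        person: (
            "redesign the rotation (systemic risk-concentration)"
            if counts[person] >= threshold
            else "no post-mortem pattern flagged"
        )
        for person in managed_people
    }


# Hypothetical sample data for one review window.
appearances = [
    ("PM-1", "engineer-a"),
    ("PM-2", "engineer-a"),
    ("PM-3", "engineer-a"),
    ("PM-3", "engineer-a"),  # second mention in the same post-mortem
    ("PM-2", "engineer-b"),
]
managed = ["engineer-a", "engineer-b"]
result = consequence_review(managed, appearances)
```

The code only produces a default. The review itself is still a conversation between people, and the output is a prompt for that conversation, not a verdict.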

These are small interventions. They are not as clean or as satisfying as adopting a template was, in 2019. They cannot be summarised in a fifteen-minute slide deck. They have produced, in the two years since we started, something the original template never did: engineers in our org are more willing to write honestly about what they did and why, because they can see — in our actions, not in our documents — that we mean what we say.

A confession

I want to end with a confession. I am the person who, in 2019, championed the blameless template. I am also the person who, in 2022, sat in the executive review where we let an engineer go and did not, in that meeting, raise the objection I have spent this whole essay raising. I noticed the pattern. I did not name it. I told myself the documents were fair, because the documents looked fair, and I let a colleague lose their job partly because of a structure I had helped build.

I do not have a clean redemption for that. I have only what I have done since: tried to rebuild the structure so that the next time it happens, the documents will not be the thing that makes it look fair. This is the work. It is slower than adopting a template. It is harder to put in a conference talk. It is, I am increasingly sure, the only honest version of what the word blameless actually means.

The template was a beginning. We mistook it for an ending. I am writing this so that, perhaps, some other org will not have to take three and a half years and one termination to notice.

Next Tuesday: a happier piece, I promise — on a small joy I have found in writing Prometheus alerts that actually express what I mean. Until then, go re-read your last post-mortem and ask yourself who, in three years, might be fired by what it says.

Anya Petrova

Comments are open — by email reply.
