Skip to content

The Graph We Stopped Looking At

There is a Grafana dashboard at my company called payments-backpressure-overview. It was created on February 14th, 2023, by an engineer named Imani, who I have never met because she left the company nine months before I joined. The dashboard has, according to the audit log I pulled last week, been opened by a human being exactly eleven times since it was created. Six of those times were Imani, in the first month. Three were our previous platform lead, between June 2023 and his departure in early 2024. The remaining two were me, last Tuesday, after the incident.

I want to write about this dashboard because it is, in a small way, a story about how organisations forget things, and about why the move-fast-and-break-things culture has, for the people downstream of it, a different name: every six months we re-learn the same lesson, and every time we do, it costs more.

The incident

On May 19th, 2026, at 11:14 a.m. Pacific, our payments service started rejecting a small but growing percentage of card authorisations. The rejection rate climbed from a baseline of about 0.2% to a peak of 3.8% over the course of seventy minutes. The on-call engineer — a fairly recent hire named Hugo who has had a rough first quarter — declared a SEV2 at 11:31 and a SEV1 at 11:42. We had a war room going by 11:48. We did not find the cause until 12:24, at which point a senior engineer named Mark, who was not on call but had joined the bridge out of curiosity, asked the question that nobody had asked yet: has anyone looked at the backpressure dashboard?

The dashboard answered the question in about four seconds. A downstream Kafka consumer group had been silently lagging for sixteen minutes before the first user-facing failure. The lag was, by 11:14, severe enough that the upstream service had begun applying backpressure, which manifested at the API edge as authorisation rejections. The whole picture was there, in a graph that had been quietly drawing itself for two and a half years, in front of nobody.

We resolved the incident by 12:41. Total customer impact: about $340,000 in declined-but-recoverable transactions, plus an unknown amount of brand harm, plus six engineers' afternoons. Hugo, who handled the bridge well by any reasonable standard, told me in our 1:1 the following day that he had not opened the backpressure dashboard because he did not know it existed.

This is the part of the story I want to sit with.

What "institutional knowledge" actually means

There is a fashion, in our industry, for treating institutional knowledge as a kind of soft, fuzzy, hard-to-measure thing — the sort of phrase you put in the culture section of an architecture doc and then move on. After the May 19th incident I have started to think about it differently. Institutional knowledge is not soft. It is atomic. It lives in specific artifacts — a dashboard, a comment in a YAML file, a Confluence page, a habit on a Tuesday standup — and each of those artifacts has, attached to it invisibly, a person and a date. When the person leaves and the date drifts past two years, the artifact silently transitions from known to theoretically discoverable, and then, eventually, to might as well not exist.

The backpressure dashboard was, in February 2023, known. Imani had built it as part of her on-call onboarding. She had linked it in two team channels and presented it in a Wednesday platform sync. Six months later, it was theoretically discoverable — still indexed, still bookmarked in the on-call bookmark folder, but no longer present in anyone's working memory. By the time Hugo joined in January 2026, it was effectively invisible. It existed exactly as much as a long-dead star exists when you look up at the night sky: the light from it was still arriving, but the thing itself was, for practical purposes, gone.

I do not have a clean fix for this. If I did, I would not still be writing about it. But the post-mortem we did last week has made me think about a few small countermeasures that I want to try.

The countermeasures I am now arguing for

The first is what I am calling artifact half-life. Every dashboard, every runbook, every architecture diagram has, in some metadata field, a last meaningful interaction timestamp. When that timestamp passes six months, the artifact gets a banner: this has not been opened by a human in six months. Is it still alive? The banner is small. The banner does not block anything. But the banner exists, and the act of dismissing it forces somebody to make a conscious decision about whether the artifact deserves their continued maintenance.

The second is a quiet practice I have started calling the inheritance walk. When a senior engineer leaves the team — or, more often, when a senior engineer rotates onto a new responsibility — we spend a fifty-minute session with them where they screen-share their bookmark bar, their Grafana folder, their starred Slack messages. We do not take notes. The session exists only to convert one human being's working memory of what they look at when something goes wrong into other human beings' working memory of the same. The first time we did this, it took my colleague Diane forty minutes to walk us through her bookmarks. By the end of it, three of us had quietly opened tabs to dashboards we had never seen before. One of them was the backpressure dashboard. (This was March 2025. We then forgot about it again, because none of us was on call regularly enough to make the dashboard a habit. This is part of the problem; I am not pretending otherwise.)

The third — and this is the one I am least sure about — is a slow campaign to delete dashboards that have not been opened in twelve months. The reasoning is the same as the alerting cleanup I wrote about last month: the cost of keeping a dead artifact is small per artifact but large in the aggregate, because each one degrades the searchability of the things that are alive. The risk, of course, is that we delete the backpressure dashboard the week before the next incident. I do not have a good answer to that risk. The compromise we are trying, very tentatively, is to archive (not delete) anything with twelve months of no human interaction, and to require an explicit unarchive PR with a sentence of justification before it can be opened again. We will see.

The thing I keep coming back to

What I keep coming back to is that the dashboard worked. The instrumentation was correct. The graph drew the right line. The alert that should have fired on the backpressure metric — which I checked, while writing this — was correctly defined, with reasonable thresholds, by Imani in March 2023. It just had never been linked to a notification channel. The whole infrastructure of detection was there, latent, waiting for the moment that would have justified Imani's eight hours of work in February 2023.

The moment came. The infrastructure was silent.

I think a lot of our reliability work is like this. We build the right thing once, and then we forget that we built it, and then years later something fails in the exact way the thing was designed to catch, and we mistake the failure for we didn't know. We knew. We knew on February 14th, 2023, and we forgot, slowly, between then and May 19th, 2026. The whole cost of the incident was the cost of forgetting, which is a category of cost our industry does not yet have a budget line for.

I am, two weeks into thinking about this, fairly convinced that the most undervalued role on a mature SRE team is not the engineer who builds new instrumentation but the librarian who maintains the catalog of what already exists. I am also fairly convinced that no team I have ever been on has that role staffed, and that several teams I have been on have actively penalised people for trying to do it informally — that's not really engineering work, is it, Anya? — and that this is, in a small way, an indictment of how we hire and promote.

I do not know how to fix this from the position I am in. I am writing it down because, in two weeks or two years, when somebody on my team opens a dashboard that has been quiet for thirty months and finds the answer they were looking for, I want there to be a record that says: we knew this could happen. And maybe, the next time, we will open the dashboard at minute four, instead of minute seventy.

Next Tuesday: a piece I have been postponing for about two months, on the difference between resilience and robustness and why the conflation in our industry vocabulary makes us solve the wrong problems. Until then, go open a dashboard you have not looked at in a year. See what it has been quietly telling you.

Anya Petrova

About the author

Anya Petrova

Site Reliability Engineer in Vancouver. Writes about chaos, on-call, and the slow craft of keeping production alive. New essays every Tuesday.

Comments are open — by email reply.

I read every reply personally. Disagreements welcome. The best letters sometimes become their own essay (with permission).

Write a letter to the editor

If this essay was worth your time, you can leave a tip — no subscription, no obligation. It pays for the coffee that pays for the next one.