The Alert That Finally Meant Something

In the spring of 2025, I deleted one hundred and fifty-six Prometheus alerts in a single afternoon. The on-call quality of life on my team went up immediately, and I want to write down what I learned before I forget it.

This is a much shorter essay than the ones I have been publishing lately. I have been told, gently, by a long-time reader named Carla that the recent pieces have been "a lot of grief, Anya", and I take the point. So today: a small craft note, with numbers.

The starting state

When I rotated onto the platform team in early 2025, our alerting setup was the worst I have ever inherited, and I have inherited some bad ones. The team owned 203 active Prometheus alerts spanning seventeen services. The on-call engineer was getting paged, on average, between sixteen and twenty-two times per week. Of those, the post-rotation survey marked roughly four as useful, meaning the engineer was glad to have been woken, or glad to have been pulled out of focused work, by the page. The other fourteen or so were marked variously as "false positive", "noise", "should never page", or, my personal favourite, "this alert is older than I am at this company".

The team had been talking about alert cleanup for, I was told in my first 1:1, about two years now. Everyone agreed it was important. Nobody had done it. The reason nobody had done it was the reason nobody ever does it: each individual alert had been written by someone, for some reason, and deleting it felt like deleting their work. Two years of accumulated guilt had produced 203 alerts.

The hour I gave up

I spent my first six weeks on the team trying to be respectful. I made a spreadsheet. I tagged each alert with an owner. I asked each owner whether the alert was still needed. The responses were almost uniformly "I don't know", "the person who wrote it left", or "I'd rather not say no in case it bites us". The spreadsheet did not lead to a single deletion in six weeks of work.

On a Thursday afternoon in mid-March, I gave up. I opened the alerts repo, I opened a terminal, and I did something I am still slightly embarrassed about: I wrote a one-line script that, for each alert, counted how many times it had fired in the last ninety days, and how many of those firings had been acknowledged as actionable in PagerDuty.
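For readers who want to run the same kind of count, here is a minimal sketch. It is not the original one-liner. It assumes the ninety days of firings have already been exported as one record per firing, and the field names (`alertname`, `actionable`) and the function name are hypothetical, chosen only for illustration.

```python
from collections import Counter
from typing import Iterable, Mapping


def count_firings(
    all_alerts: Iterable[str],
    firings: Iterable[Mapping[str, object]],
) -> dict[str, dict[str, int]]:
    """Count total and acknowledged-actionable firings per alert.

    `firings` holds one record per firing in the lookback window, e.g.
    {"alertname": "DiskFull", "actionable": True}. Both field names are
    assumptions; adapt them to whatever your export actually produces.
    """
    fired = Counter()
    actionable = Counter()
    for record in firings:
        name = str(record["alertname"])
        fired[name] += 1
        if record.get("actionable"):
            actionable[name] += 1
    # Start from the full rule list so alerts that never fired appear as zeros.
    return {
        name: {"fired": fired[name], "actionable": actionable[name]}
        for name in all_alerts
    }
```

Starting from the full list of alert names, rather than from the firing records, is the important design choice: an alert that never fired leaves no records at all, and it still needs a row.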

The numbers were not subtle. Of 203 alerts:

94 alerts had not fired at all in ninety days. (We could not tell whether they were protecting something or were dead. We had no way, from the firing data alone, to distinguish the two.)
62 alerts had fired at least once, but had zero acknowledged-actionable incidents. They fired, the on-call looked, they resolved. Pure noise.
31 alerts had fired and produced at least one acknowledged-actionable response. These were the ones doing real work.
16 alerts were so loud (>50 firings in ninety days) that the on-call had stopped reading them and dismissed them by reflex.
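The four buckets above can be expressed as a small classifier. This is a sketch under stated assumptions, not the script from that afternoon: the bucket names are made up, and checking "too loud" before "pure noise" is a guess at how the categories were kept from overlapping, since the four counts sum to 203.

```python
from collections import Counter


def classify(fired: int, actionable: int, loud_threshold: int = 50) -> str:
    """Place one alert in a bucket from its ninety-day counts.

    The order of the checks is an assumption: a very loud alert is
    reported as "too_loud" even if it was occasionally actionable.
    """
    if fired == 0:
        return "never_fired"
    if fired > loud_threshold:
        return "too_loud"
    if actionable == 0:
        return "pure_noise"
    return "useful"


def bucket_counts(stats: dict[str, tuple[int, int]]) -> dict[str, int]:
    """Summarise {alert name: (fired, actionable)} as bucket sizes."""
    return dict(Counter(classify(f, a) for f, a in stats.values()))
```

Note that `never_fired` is deliberately its own bucket: as the first item in the list says, firing data alone cannot tell a dead alert from one quietly guarding something.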

I sat with these numbers for about thirty minutes. Then I did the thing I had been resisting. I wrote a PR that deleted the 62 pure-noise alerts and the 94 never-fired ones. The PR description was four sentences long. I tagged the leads of the three teams whose services were most affected. I posted in #platform-changes. I went and made tea.

The PR got approved in eleven minutes. The reviewers all said variations of "thank god, somebody finally did this". I merged it before anyone could change their mind.

That was 156 alerts gone in a single afternoon. The on-call quality of life on my team went up immediately. The week after the deletion, on-call pages dropped from a typical 18 per week to 4. The on-call engineer that week, a recent hire named Soren who had not yet had a full rotation, told me in our 1:1 the following Monday that they had been wondering "is on-call supposed to be this bad?" until that week, and now they were not sure.

The 156 deleted alerts had been silently consuming the team's attention for years. Not in dramatic, visible ways. In quiet, ambient ways. Every page that arrived landed in an inbox that already held thirty pages from yesterday that did not need attention, and the human brain does not distinguish well between "page that matters" and "page that arrived". We had been training our team, alert by alert, to ignore the pager.

The 12 that replaced them

I am not going to pretend that the deletion was the whole story. Of the 31 alerts that had been doing real work, about a third were technically working but were firing on the wrong thing — they would alert on a symptom of a problem that was already past the point of action by the time the page arrived. So I spent the next two weeks rewriting them, with help from a few teammates who were now suddenly interested in alerting because the alerting cleanup had not, in fact, gone terribly.

The principle I tried to follow for each rewrite was simple, and I am stealing it from a tweet by Charity Majors that I cannot find anymore, so I will paraphrase: an alert should describe the thing the user is experiencing, not the thing the system is doing. A "5xx rate" alert is about the system. A "new-user-signups-per-minute is below the seasonally-expected band" alert is about the user. The first is a fact. The second is a problem.
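To make "below the seasonally-expected band" concrete, here is one rough way such a band could be computed. It is an illustrative sketch, not how our alerts are defined: it assumes the history is a handful of values taken at the same hour of the week in earlier weeks, and the mean-plus-or-minus-k-standard-deviations band, the default `k`, and the function names are all assumptions.

```python
from statistics import mean, stdev
from typing import Sequence


def seasonal_band(history: Sequence[float], k: float = 3.0) -> tuple[float, float]:
    """Return (low, high) around the mean of same-hour-of-week samples."""
    if len(history) < 2:
        raise ValueError("need at least two historical samples")
    centre = mean(history)
    spread = stdev(history)  # sample standard deviation
    # Signups cannot be negative, so clamp the lower edge at zero.
    return (max(0.0, centre - k * spread), centre + k * spread)


def below_band(current: float, history: Sequence[float], k: float = 3.0) -> bool:
    """True when the current value sits under the expected band."""
    low, _high = seasonal_band(history, k)
    return current < low
```

The comparison is against what this hour usually looks like, not against a fixed number, which is what lets the alert stay quiet at 4 a.m. on a Sunday and still notice a weekday-lunchtime collapse.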

We ended up with 12 alerts that I am genuinely proud of. They are about user-facing outcomes. They are gated on multiple-window anomaly detection rather than single-point thresholds. They include a runbook link, but the runbook is a one-page diagnostic memo (see last week's essay on why) rather than a step-by-step. They include, in the alert annotation itself, the named human cost of the alert firing falsely. For example: "if this alert fires and no incident is occurring, the on-call has been woken at home for nothing; please tune carefully before adding."

That last line is doing a surprising amount of work. It is not enforceable. It is not technical. It just sits there in the alert definition, and every time a developer goes to copy-paste the alert as a template for a new service, they read it. The line has, in the year since I added it, prevented at least three alerts from being added that I suspect would have been pure noise. It is, in a way, a piece of culture-engineering embedded in a YAML file. I will take that.
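The multiple-window gating mentioned above can be sketched roughly as follows. This is a simplified illustration rather than our actual rule definitions: the window lengths, the `expected_low` parameter, and the function name are assumptions, and the input is assumed to be per-minute samples ordered oldest first.

```python
from statistics import mean
from typing import Sequence


def should_page(
    samples: Sequence[float],
    expected_low: float,
    short_window: int = 5,
    long_window: int = 30,
) -> bool:
    """Page only if both the recent and the longer-run averages are low.

    A single bad sample moves neither average far enough, and a dip
    that has already recovered fails the short-window check.
    """
    if len(samples) < long_window:
        return False  # not enough history to judge; stay quiet
    short_avg = mean(samples[-short_window:])
    long_avg = mean(samples[-long_window:])
    return short_avg < expected_low and long_avg < expected_low
```

Requiring both windows to agree is what separates this from a single-point threshold: the long window says the problem is real, and the short window says it is still happening.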

What I want you to take from this

If you are inheriting an alerting setup that is older than you are at the company, and the team has been talking about cleanup for years and not doing it, I want to offer you the permission I had to give myself.

You do not need to ask each alert's owner. The owner has left. The owner does not remember. The owner is the spreadsheet. The data (did the alert fire, did anyone act on it) will tell you, in about thirty minutes of querying, more than six weeks of polite emails. The cost of deleting a useful alert is small (you write it again, better, when the underlying condition recurs). The cost of not deleting a useless alert is large and ambient and invisible until you stop paying it.

Be brave with the deletion. Be slow and thoughtful with the replacement. The 12 alerts I wrote last spring are still, fourteen months later, the alerts I want at 3 a.m. As for the 156 I deleted, nobody has noticed they are gone.

Sometimes the right answer to "what should this alert say?" is "nothing, because this alert should not exist". It took me eleven years to get comfortable saying that out loud. I am writing it down so that you can get there faster.

Next Tuesday: I am back to a longer one — a piece on the difference between resilience and robustness and why our industry uses them interchangeably and shouldn't. Until then, go look at your noisiest alert. Ask whether it deserves you.

Anya Petrova

Comments are open — by email reply.
