The Quiet Hour: What Nobody Tells You About the Pager
The first time my pager went off in the middle of the night, I was twenty-six and had been an SRE for exactly eleven days. The alert was for a service I had not yet been onboarded to. The runbook linked from PagerDuty pointed at a Confluence page that had been deleted in a cleanup the previous spring. The Slack channel referenced in the runbook had been archived. I sat on the edge of my bed at 3:14 a.m. in Vancouver, staring at a phone that was telling me, in the calmest possible voice, that the world was on fire.
The thing nobody told me in any of the SRE books — not the orange Google one, not Charity Majors' essays, not the increasingly desperate threads on Hacker News — is that the technical part of being on call is the easy part. You will, eventually, learn the systems. You will memorise the dashboards, the runbooks that actually work, the four log queries that resolve eighty percent of incidents. That is the visible craft, and we talk about it endlessly because it is what we know how to talk about.
The invisible craft is everything else.
The 3:47 problem
Page Mercer, an SRE I met at SREcon EMEA last October, has a phrase for it: the quiet hour problem. It is the gap, usually somewhere between three and five in the morning, when your phone goes off and you have to make a decision before your brain has fully booted. The pager does not wait for your prefrontal cortex. The pager asks you, right now, whether to wake your secondary, whether to declare a SEV2, whether the symptom you are seeing is a real cascade or just a flap. And you have, generously, about ninety seconds before the on-call expectation curve starts to bend.
I have made some of the worst engineering decisions of my career in that window. I have escalated a logger config change to a SEV1 because I misread a graph. I have failed to escalate an actual database split-brain because the dashboard, by coincidence, recovered for forty seconds while I was typing. The post-mortems on both of those were charitable, because we are an SRE org that takes blamelessness seriously, but reading them back I can hear the gap between what the document says and what actually happened. The document says the on-call engineer initially classified the incident as a SEV3 based on the available telemetry. What actually happened is: I was, at 3:47, a stupid mammal in pyjamas being asked to think.
What on-call actually costs
We have an excellent SRE book. We have lots of them. None of them tell you that the partner you live with will, after the fourth or fifth quiet hour in a month, develop a small tic where they wake whenever your phone vibrates, even when it is just a Slack mention from a colleague in Sydney. They will not complain. They will keep getting up at six to go to work, and they will keep pretending that they did not also wake at 3:47.
None of the books mention that after a particularly brutal week of incidents — the kind where you have spent three nights in a row on a bridge — you will catch yourself, at lunch on the Friday, becoming irrationally angry at someone for chewing too loudly. This is what neurologists call sleep debt aggression, and it is what HR calls a culture problem, and it is what your team lead calls Anya, are you OK? And the honest answer is no, but the polite answer is yes, and the engineering culture we have built makes it very, very hard to give the honest answer without seeming like you cannot do the job.
The pager has costs that do not show up on any SLO dashboard, and we have a whole industry that has decided not to measure them.
The boring fixes that actually work
I am not going to pretend I have a clean answer for this. If I did, I would not still be writing about it eleven years in. But there are a handful of things I have started insisting on, and the team I am on now is gentler for it.
The first is what I call the dignity buffer. When someone comes off a brutal on-call rotation — one with two or more page-bearing nights — they get the next two business days off the work-stream. Not "off the rotation," which they already are. Off the work-stream. No PRs to review, no sprint commitments, nothing that requires their attention to be reliably online. The first time I argued for this in a planning meeting, my director asked me what business outcome it produced. I told him: Anya, three months from now, when she has not quit. He approved it.
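The dignity buffer rule can be written down as a small calculation. This is a minimal sketch, not a description of any real tooling: the function name `dignity_buffer_days`, its parameters, and the choice to ignore public holidays are all assumptions made for illustration.

```python
from datetime import date, timedelta


def dignity_buffer_days(rotation_end: date, page_nights: int,
                        threshold: int = 2, days_off: int = 2) -> list[date]:
    """Return the business days someone spends off the work-stream.

    Hypothetical helper: a rotation counts as brutal when it had at least
    `threshold` page-bearing nights. Public holidays are NOT handled.
    """
    if page_nights < threshold:
        return []
    days: list[date] = []
    current = rotation_end
    while len(days) < days_off:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 .. Friday=4; skip weekends
            days.append(current)
    return days
```

For a rotation that ends on a Friday with two page-bearing nights, the sketch returns the following Monday and Tuesday. For a rotation with fewer than two, it returns an empty list.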
The second is a quiet rule: we never deploy on Friday after 11 a.m. We have all the modern things (feature flags, canaries, automated rollback), so the reason is not technical. The reason is that the on-call person on Friday afternoon is usually the most cognitively fatigued person on the team, and giving them a fresh deploy to debug at 4 p.m. is a form of cruelty we have agreed to stop performing on each other.
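A rule like this is easier to keep when a machine enforces it. The sketch below shows one way a deploy pipeline could check the Friday cutoff. The function name `deploy_allowed` and the constants are hypothetical, the timestamp is assumed to already be in the team's local time, and treating exactly 11:00 as frozen is my own reading of "after 11 a.m."

```python
from datetime import datetime, time

FREEZE_WEEKDAY = 4          # Friday (Monday=0), per the rule described above
FREEZE_START = time(11, 0)  # assumption: 11:00 sharp already counts as frozen


def deploy_allowed(now: datetime) -> bool:
    """Return False from Friday 11:00 local time until the end of Friday.

    Hypothetical gate: `now` must be in the team's local time. Weekends
    are not covered here because the rule as stated only names Friday.
    """
    if now.weekday() == FREEZE_WEEKDAY and now.time() >= FREEZE_START:
        return False
    return True
```

A pipeline step could call `deploy_allowed(datetime.now())` and refuse to proceed when it returns `False`. Any override for genuine emergencies is a policy decision the sketch deliberately leaves out.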
The third, and this one took me years to swallow, is that runbooks are written for the version of you that is barely conscious. If the first step of a runbook starts with a verb like Investigate or Determine, throw it out. The first step should be a single, concrete, finger-pointable action. Open this dashboard. Run this query. Page this person. If the runbook requires its reader to think, I have already failed the on-call who is reading it at 3:47.
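The first-step test is mechanical enough to lint. Here is a minimal sketch, assuming a runbook has already been parsed into a list of step strings. The function name `first_step_is_concrete` and the `VAGUE_VERBS` set are hypothetical, and the set contains only the two verbs named above; a real list would need to grow with experience.

```python
VAGUE_VERBS = {"investigate", "determine"}  # only the two verbs named above


def first_step_is_concrete(steps: list[str]) -> bool:
    """Return True when the first runbook step does not open with a vague verb.

    Hypothetical check: an empty runbook, or a blank first step, fails.
    """
    if not steps:
        return False
    words = steps[0].strip().split()
    if not words:
        return False
    first_word = words[0].strip(".,:;").lower()
    return first_word not in VAGUE_VERBS
```

This only catches the opening verb. It cannot tell whether "Open this dashboard" links to a dashboard that still exists, which is the failure that bit me on night eleven.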
On not romanticising the bridge
There is a culture in our industry — I am not blameless in spreading it — that romanticises the incident bridge. The war stories. The night your team saved the company. The triumphant blameless post-mortem that became a conference talk. I have given some of those talks. I am not proud of all of them.
The truth is that the best incidents are the ones nobody hears about, because the system healed itself, because the alert fired into a runbook that worked, because the on-call engineer slept through the page and the secondary slept through the escalation and the tertiary alert fired and the autohealer did its job and by morning there was a Jira ticket and that was the whole story. Boring infrastructure is a moral good. Boring on-call is a moral good. We should celebrate the Sunday nights where nothing happened with at least as much energy as we celebrate the war stories.
What I want you to take from this
If you are an SRE reading this from a hotel room on a Tuesday, with your phone face-up on the bedside table because you have been trained to keep it face-up, I want you to know two things.
The first: you are not a bad engineer for finding this hard. The literature has been dishonest with you. The pager is not a normal device. Carrying one changes you, and it changes the people who live with you, and pretending otherwise is the cause of half the burnout in this industry.
The second: the parts of the job that nobody can measure are exactly the parts that matter. The runbook that you rewrote so that your future self at 3:47 has one less thing to think about. The teammate you let off the rotation when their parent was in hospital. The deploy you didn't push on Friday. These are the work. The dashboards are the wrapper.
I learned this slowly, and at some cost. I am writing it down because the books I read when I was twenty-six did not say it, and I think they should have.
Tuesday next week: an essay on why I burned the runbook for our auth service and what we replaced it with. Until then, sleep when you can.
Comments are open — by email reply.
I read every reply personally. Disagreements welcome. The best letters sometimes become their own essay (with permission).