Anya Petrova · Site Reliability Engineer · Vancouver, BC

A note from the editor

I have spent eleven years carrying a pager. This publication is what I learned in the quiet hours after the incident bridge cleared — about why systems fail, why our diagrams lie to us, and what it actually costs to be the person on call.

New essays land on Tuesdays. Long pieces, no listicles, no checklists you could have found anywhere else. If a post here didn't change how you think about your on-call rotation, I would rather you unsubscribe and tell me why.

The Featured Essay

Photo: densely packed cabling inside a server cabinet. A patch panel in our Tukwila datacenter, two hours into the rebuild.

Essay · Reliability

The Quiet Hour: What Nobody Tells You About the Pager

On-call is not a technical problem. It is a relationship — with your team, your sleep, your partner, and a small black device that has decided, against all evidence, that 3:47 a.m. is the right time to be honest with you.

By Anya Petrova · 14 min read · May 26, 2026

Recently Published

Reliability · May 26, 2026 · 14 min

The Quiet Hour: What Nobody Tells You About the Pager

A meditation on being woken at 3:47 a.m. by a service that, at the time of paging, was already fine.

Process · May 19, 2026 · 11 min

Against the Runbook: Why I Stopped Trusting My Own Documentation

After an Auth0 outage taught me that step-by-step instructions were lying to me, I had to rewrite how our team teaches itself.

Architecture · May 12, 2026 · 17 min

A Eulogy for the Monolith I Spent Two Years Killing

The migration finished in March. The thing I miss is not the code. It is something quieter, and I want to name it before I forget.

Chaos · May 05, 2026 · 12 min

Chaos Engineering Without the Cosplay

We do not need a Netflix-branded Simian Army. We need someone willing to kill one node on a Wednesday afternoon and watch what happens.

Culture · April 28, 2026 · 9 min

On Blame, and Why "Blameless" Post-Mortems Are Often Neither

We adopted the template from the Google book in 2019. It took us four years to notice we were still firing people, just more politely.

Browse by Series

The Pager Life · 7 essays
Essays on being on call, in love, and out of sleep.
Post-Mortem Diaries · 9 essays
Real incidents, real timelines, sometimes redacted.
The Engineering Craft · 12 essays
On the slow art of writing code that other people sleep through.

About the Author

Anya Petrova has been a site reliability engineer for eleven years — most recently at a fintech you have used. She lives in Vancouver, runs ultramarathons badly, and keeps a pager that has gone off during a wedding, a funeral, and a llama trek.

Support the work

This publication is free and ad-free. If something here helped you ship a saner Sunday, a tip jar lives below.

Buy me a coffee · Write to me

The Graph We Stopped Looking At

A small story about a dashboard that nobody opened for two and a half years, the migration it would have prevented, and what I learned about institutional attention.

2026-05-27 · observability, culture, postmortem

The Quiet Hour: What Nobody Tells You About the Pager

2026-05-26 · on-call, culture, reliability

Against the Runbook: Why I Stopped Trusting My Own Documentation

After an Auth0 outage taught me that step-by-step instructions were lying to me, I had to rewrite how our team teaches itself.

2026-05-19 · runbooks, on-call, documentation, postmortem

The Tuesday I Deleted My CV

A note on staying: what eleven years in one industry have taught me about job-hopping, the engineer-shaped hole in our talent market, and the slow rewards of being the person who is still there.

2026-05-13 · career, culture, reflection

A Eulogy for the Monolith I Spent Two Years Killing

The migration finished in March. The thing I miss is not the code. It is something quieter, and I want to name it before I forget.

2026-05-12 · architecture, microservices, migration, reflection

Chaos Engineering Without the Cosplay

We do not need a Netflix-branded Simian Army. We need someone willing to kill one node on a Wednesday afternoon and watch what happens.

2026-05-05 · chaos, reliability, culture, testing

Chaos Engineering for Fintech: Building Resilience in High-Stakes Trading Systems

Fintech platforms operate in an environment where every millisecond and every transaction matters. This guide explores how chaos engineering principles can fortify trading and brokerage systems against market volatility, infrastructure faults, and cascading failures. Learn how to design fault-injection strategies, conduct game-day exercises, and build confidence in system resilience when customer capital is on the line. Discover real-world chaos patterns for fintech workloads, observability-driven validation, and how to balance aggressive testing with regulatory compliance and risk management.

2026-04-29 · chaos-engineering, fintech, resilience-testing

On Blame, and Why "Blameless" Post-Mortems Are Often Neither

We adopted the template from the Google book in 2019. It took us four years to notice we were still firing people, just more politely.

2026-04-28 · postmortem, culture, blameless, incidents

Orchestrating Chaos: The Modern SRE Playbook for Resilience Engineering

Learn how modern SREs leverage advanced chaos engineering principles to build antifragile systems. Discover systematic approaches to failure injection, game day orchestration, and observability patterns that transform organizations from reactive firefighters to proactive resilience engineers. Explore practical frameworks, real-world implementation strategies, and tooling approaches used by leading cloud-native teams to ensure production systems don't just survive failures—they thrive.

2026-04-23 · chaos-engineering, sre, resilience-engineering, observability, devops

Automated Incident Response: Building Self-Healing Systems in DevOps and SRE

Modern infrastructure demands more than reactive incident response. This comprehensive guide explores how to build autonomous incident response systems that detect, diagnose, and remediate issues automatically. Learn the principles of self-healing infrastructure, automation best practices, and how to leverage intelligent orchestration and observability to reduce mean time to recovery (MTTR). Discover real-world strategies for implementing auto-remediation workflows, handling false positives, and maintaining human oversight while building systems that heal themselves.

2026-04-21 12:00 · incident-response, automation, self-healing
