Vol. III · Issue 22
2,418 subscribers

Anya Petrova

Field notes on reliability, chaos, and the systems we keep alive.

Anya Petrova
Site Reliability Engineer · Vancouver, BC

A note from the editor

I have spent eleven years carrying a pager. This publication is what I learned in the quiet hours after the incident bridge cleared — about why systems fail, why our diagrams lie to us, and what it actually costs to be the person on call.

New essays land on Tuesdays. Long pieces, no listicles, no checklists you could have found anywhere else. If a post here doesn't change how you think about your on-call rotation, I would rather you unsubscribe and tell me why.

Recently Published

Full archive →

Chaos · May 05, 2026 · 12 min

Chaos Engineering Without the Cosplay

We do not need a Netflix-branded Simian Army. We need someone willing to kill one node on a Wednesday afternoon and watch what happens.

The Graph We Stopped Looking At

A small story about a dashboard that nobody opened for two and a half years, the migration it would have prevented, and what I learned about institutional attention.

The Quiet Hour: What Nobody Tells You About the Pager

On-call is not a technical problem. It is a relationship — with your team, your sleep, your partner, and a small black device that has decided, against all evidence, that 3:47 a.m. is the right time to be honest with you.

Against the Runbook: Why I Stopped Trusting My Own Documentation

After an Auth0 outage taught me that step-by-step instructions were lying to me, I had to rewrite how our team teaches itself.

The Tuesday I Deleted My CV

A note on staying: what eleven years in one industry have taught me about job-hopping, the engineer-shaped hole in our talent market, and the slow rewards of being the person who is still there.

A Eulogy for the Monolith I Spent Two Years Killing

The migration finished in March. The thing I miss is not the code. It is something quieter, and I want to name it before I forget.

Chaos Engineering for Fintech: Building Resilience in High-Stakes Trading Systems

Fintech platforms operate in an environment where every millisecond and every transaction matters. This guide explores how chaos engineering principles can fortify trading and brokerage systems against market volatility, infrastructure faults, and cascading failures. Learn how to design fault-injection strategies, conduct game-day exercises, and build confidence in system resilience when customer capital is on the line. Discover real-world chaos patterns for fintech workloads, observability-driven validation, and how to balance aggressive testing with regulatory compliance and risk management.

On Blame, and Why "Blameless" Post-Mortems Are Often Neither

We adopted the template from the Google book in 2019. It took us four years to notice we were still firing people, just more politely.

Orchestrating Chaos: The Modern SRE Playbook for Resilience Engineering

Learn how modern SREs leverage advanced chaos engineering principles to build antifragile systems. Discover systematic approaches to failure injection, game day orchestration, and observability patterns that transform organizations from reactive firefighters to proactive resilience engineers. Explore practical frameworks, real-world implementation strategies, and tooling approaches used by leading cloud-native teams to ensure production systems don't just survive failures—they thrive.

Automated Incident Response: Building Self-Healing Systems in DevOps and SRE

Modern infrastructure demands more than reactive incident response. This comprehensive guide explores how to build autonomous incident response systems that detect, diagnose, and remediate issues automatically. Learn the principles of self-healing infrastructure, automation best practices, and how to leverage intelligent orchestration and observability to reduce mean time to recovery (MTTR). Discover real-world strategies for implementing auto-remediation workflows, handling false positives, and maintaining human oversight while building systems that heal themselves.