The Documentation That Didn't Exist

Category: The Hard Lesson · Domains: runbooks, incident-command · Read time: ~5 min


Setting the Scene

We ran a bespoke event-streaming platform built on Kafka, Flink, and a custom ingestion layer that Marcus had written. Marcus was the only person who understood how the ingestion layer worked. Marcus was a genius. Marcus was also on a hiking trip in Patagonia with no cell service.

The system handled about 2 million events per hour for our analytics pipeline. It had run without incident for 14 months. We had no runbook, no architecture diagram, and the deploy process lived in Marcus's bash history.

What Happened

At 2:17 AM on a Tuesday, the ingestion layer stopped accepting connections. Kafka was fine. Flink was fine. The custom Go binary sitting between our load balancer and Kafka had segfaulted, and the systemd unit wasn't configured to restart on failure.
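Had the unit been configured to restart on failure, the page might never have fired. A minimal sketch of what that could look like — the unit name is hypothetical, and the paths and flags are the ones I reconstructed later, not the contents of any real file:

```ini
# /etc/systemd/system/eventpipe-ingester.service  (hypothetical unit)
[Unit]
Description=Eventpipe ingestion layer
After=network-online.target
# Give up only after repeated rapid failures
StartLimitIntervalSec=300
StartLimitBurst=5

[Service]
ExecStart=/opt/eventpipe/ingester --config /etc/eventpipe/prod.toml --workers 32 --buffer-size 4096
# The missing line: bring the process back after a segfault
Restart=on-failure
RestartSec=5s

[Install]
WantedBy=multi-user.target
```

One line of config would have turned a 4-hour outage into a blip on a dashboard.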

The on-call engineer — me — SSH'd into the box and saw the process was dead. Simple enough: restart it. Except I didn't know the binary name, the config file location, or the startup flags. I ran ps aux | grep ingest on the other node in the cluster and found /opt/eventpipe/ingester --config /etc/eventpipe/prod.toml --workers 32 --buffer-size 4096. I copied that, started the process, and it crashed immediately with FATAL: cannot connect to schema registry at 10.4.12.8:8081.
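The command line I needed existed only in the process table of the surviving node. On Linux you can recover a running process's exact invocation from /proc, which is worth capturing into a runbook before you need it (the pgrep pattern below is illustrative, not our actual service name):

```shell
# Recover the exact argv of a running process from /proc (Linux).
# /proc/<pid>/cmdline is NUL-separated, so turn separators into spaces.
get_cmdline() {
  tr '\0' ' ' < "/proc/$1/cmdline"
}

# Example (pattern is illustrative):
#   get_cmdline "$(pgrep -f ingester)"
```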

That IP didn't exist anymore. It had been migrated to a new subnet three months ago. Marcus knew this. Marcus had updated the config on the running process but never updated the TOML file on disk. The in-memory config and the on-disk config had diverged.

I spent 90 minutes grepping through Slack history for "schema registry" and "migration." Found the new IP in a thread from October. Updated the TOML, restarted. The process came up but started throwing WARN: checkpoint directory /data/eventpipe/checkpoints not found. The EBS volume had been left unmounted after an earlier kernel panic, and nobody had noticed because the long-running process had everything it needed in memory.

I remounted the volume, restarted again. Four hours and twelve minutes after the initial page, we were back. A 15-minute fix turned into a 4-hour archaeology expedition.

The Moment of Truth

Sitting in the postmortem the next day, my manager asked, "Where's the runbook for this service?" I pulled up Confluence and searched. The only page was titled "Eventpipe Design Doc (DRAFT)" from 18 months ago. It described a system that no longer existed.

The Aftermath

We instituted a rule: no service goes to production without a runbook in /docs/runbooks/ in the repo. Every runbook has four sections: what it does, how to restart it, how to verify it's healthy, and who to escalate to. We also started architecture decision records and mandatory README.md updates with every deploy.
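A minimal skeleton for that four-section rule might look like this — the section names match our rule, but the contents shown are placeholders, not the real eventpipe runbook:

```markdown
# Runbook: <service>

## What it does
One paragraph: inputs, outputs, upstream and downstream dependencies.

## How to restart it
Exact commands: binary path, config file location, startup flags.

## How to verify it's healthy
Health-check endpoints, log lines that indicate success, key metrics.

## Who to escalate to
Primary owner, secondary, and the escalation channel.
```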

Marcus came back from Patagonia and was genuinely surprised we'd had trouble. "It's pretty straightforward," he said. We made him write the runbook that afternoon.

The Lessons

  1. Documentation is insurance, not overhead: You don't write a runbook because you need it today. You write it because at 2 AM someone else will need it and you won't be there.
  2. Bus factor of 1 is unacceptable: If one person's absence turns a 15-minute restart into a 4-hour outage, your team has a single point of failure more dangerous than any single server.
  3. Write the runbook before you need it: The best time was when the system was built. The second best time is right now, before the next 2 AM page.

What I'd Do Differently

I'd require a runbook as a merge requirement for any new service, enforced by a CI check that validates /docs/runbooks/<service>.md exists and contains the four required sections. I'd also run quarterly "runbook fire drills" where someone who didn't build the service follows the runbook to perform a restart, and we fix every gap they find.
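A sketch of that CI check in POSIX shell — the runbook path and the required section headings follow the rule described above, but the exact strings and the service-name argument are assumptions, not a real pipeline:

```shell
# Hypothetical CI gate: fail the build if a service's runbook is missing
# or lacks any of the four required sections.
check_runbook() {
  runbook="docs/runbooks/$1.md"
  [ -f "$runbook" ] || { echo "FAIL: $runbook does not exist"; return 1; }
  for section in "What it does" "How to restart" "How to verify" "Who to escalate"; do
    grep -qi "$section" "$runbook" || {
      echo "FAIL: $runbook missing section: $section"; return 1;
    }
  done
  echo "OK: $runbook has all four sections"
}

# In CI, something like:
#   check_runbook "$SERVICE_NAME" || exit 1
```

The drill half matters just as much as the check: a runbook that passes CI but has never been followed by a stranger is still untested documentation.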

The Quote

"Marcus didn't have a bus factor of 1. He had a bus factor of 1 minus the Patagonia hiking season."

Cross-References