The Database That Wasn't Backing Up
Category: The Incident · Domains: database-ops, backup-restore · Read time: ~5 min
Setting the Scene
Enterprise logistics company, about 2,000 employees. I'd been on the platform team for six months. We ran a critical PostgreSQL 13 cluster on bare metal — three nodes, streaming replication, the works. Our backup system was a cron job running pg_dump nightly to an NFS share, then rsync'd to S3. It had been set up two years before I joined by an engineer who'd since left. Everyone assumed it worked because the cron job was there and nobody had ever needed a restore.
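The script itself was never checked into version control, but from the cron entry and the artifacts on the NFS share it looked roughly like this. This is a reconstruction — the paths, database name, and sync target are guesses — and the point is what's missing: no `set -e`, no exit-status check, no sanity check on the output before shipping it off-box.

```shell
#!/bin/sh
# Reconstructed nightly backup script (paths and names are hypothetical).
# MISSING: `set -e`, a check on pg_dump's exit status, and any size check
# on the dump file before it gets synced.
pg_dump -h 10.0.4.12 -U backup_svc proddb > /mnt/nfs/backups/proddb.dump 2>&1
rsync -a /mnt/nfs/backups/ /mnt/nfs/s3-staging/   # later shipped to S3
```

Note the `2>&1`: if `pg_dump` fails, its error message lands *in the dump file* — which is exactly what we'd later find there.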
What Happened
Monday 10:15 AM — A developer runs a migration script on the wrong database. Not staging. Production. The script drops and recreates three tables that hold shipment tracking data. Fifteen million rows, gone. The developer's face goes white. My face goes white.
10:18 AM — "No problem," I say, "we have nightly backups." I SSH into the backup server and navigate to the NFS share. The most recent backup file is from last night — and it's 14 bytes. I open it. It says: pg_dump: error: connection to server at "10.0.4.12", port 5432 failed: FATAL: password authentication failed for user "backup_svc".
10:22 AM — I check the cron log. The job has been running every night at 2 AM. It's been failing every night at 2 AM. For three months. Someone changed the backup_svc password during a security rotation in October and never updated the backup script's .pgpass file.
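For reference, libpq reads non-interactive credentials from a password file, by default ~/.pgpass, and both its contents and its permissions matter: libpq silently skips the file unless it is mode 0600. The fix after a rotation is two lines (the password below is a placeholder):

```shell
# Rewrite the backup user's ~/.pgpass entry after a password rotation.
# Format: hostname:port:database:username:password
printf '%s\n' '10.0.4.12:5432:*:backup_svc:<new-rotated-password>' > "$HOME/.pgpass"
chmod 600 "$HOME/.pgpass"   # libpq ignores the file if it is group/world readable
```

The trap in our case: whoever rotated the password had no checklist item saying "this credential is also used by the backup host."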
10:30 AM — I check S3. The rsync job runs after pg_dump, and since pg_dump produced a tiny error file, rsync dutifully uploaded 14 bytes to S3 every night. Our S3 bucket has 97 copies of an error message.
10:45 AM — We discover the last valid backup is from 97 days ago. Restoring it would lose three months of shipment data. That's not an option. We pivot to the streaming replica — it replicated the DROP faithfully. WAL archives? We had them, but only retained 7 days.
11:30 AM — A DBA contractor we emergency-called asks whether the replica might be lagging — maybe it hasn't replayed the DROP yet. No luck: pg_stat_replication on the primary shows it fully caught up, DROP and all. But he asks about delayed replicas. We don't have one. He asks about filesystem snapshots. Our SAN takes daily snapshots with 30-day retention. We find a snapshot from 6 AM that morning, before the migration ran.
12:15 PM — We mount the SAN snapshot, start a temporary PostgreSQL instance against it, and extract the three dropped tables. Data recovered. Total data loss: about 4 hours of shipment updates between the 6 AM snapshot and the 10:15 AM incident.
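The recovery itself was straightforward once the snapshot was mounted — roughly these steps, with illustrative paths, port, and table names (the real ones are lost to memory):

```shell
# Mount the 6 AM SAN snapshot on a spare host (device/path are illustrative)
mount /dev/mapper/san-snap-0600 /mnt/snap

# Start a throwaway postmaster against the snapshot on a non-default port;
# PostgreSQL runs crash recovery on the copied data directory at startup
pg_ctl -D /mnt/snap/pgdata -o "-p 5433" start

# Dump only the three dropped tables and replay them into production
pg_dump -p 5433 -t shipments -t shipment_events -t tracking_updates proddb \
    | psql -h 10.0.4.12 -U admin proddb
```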
The Moment of Truth
Fourteen bytes. Our entire backup strategy produced fourteen bytes every night for three months, and nobody knew, because we monitored whether the cron job ran but not whether it succeeded. Cron saw exit code 0 because the script didn't use set -e: without it, a shell script exits with the status of its last command, and the final rsync of the 14-byte file always succeeded.
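The failure mode takes two lines to reproduce, with `false` standing in for the failing pg_dump:

```shell
# Without `set -e`: the script's exit status is that of its LAST command.
sh -c 'false; echo "backup finished"'
echo "exit=$?"     # exit=0 — the echo succeeded, so cron saw success

# With `set -e`: the first failure aborts the script and propagates.
sh -c 'set -e; false; echo "backup finished"'
echo "exit=$?"     # exit=1 — now cron MAILTO or a monitoring hook can catch it
```

In bash, `set -euo pipefail` goes one step further and also catches failures on the left side of a pipeline, which matters for patterns like `pg_dump | gzip`.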
The Aftermath
We replaced the cron script with a proper backup tool (pgBackRest) that has built-in verification and alerting. We added monitoring that checks backup file size (must be > 1GB for our dataset) and backup age (must be < 26 hours). Most importantly, we instituted monthly restore drills where we spin up a fresh PostgreSQL instance from the latest backup and run a validation query suite against it. The SAN snapshots saved us, but that was luck, not planning.
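The size-and-age check is only a few lines of shell. This sketch uses the thresholds we actually chose, but the path and the alert wiring are placeholders:

```shell
# Hypothetical post-incident sanity check; a non-zero return feeds alerting.
check_backup() {
    backup="$1"
    min_bytes=$((1024 * 1024 * 1024))   # our dataset dumps to well over 1 GiB
    max_age_min=$((26 * 60))            # nightly schedule plus slack

    [ -f "$backup" ] || { echo "ALERT: backup missing: $backup"; return 1; }

    size=$(wc -c < "$backup")
    [ "$size" -gt "$min_bytes" ] || { echo "ALERT: backup too small: $size bytes"; return 1; }

    # `find -mmin +N` prints the file only if modified more than N minutes ago
    [ -z "$(find "$backup" -mmin +"$max_age_min")" ] || { echo "ALERT: backup stale"; return 1; }

    echo "OK: $size bytes"
}
```

The point of the check is that the alert fires on a bad *artifact* — wrong size, wrong age, not there at all — rather than on a missing cron entry.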
The Lessons
- A backup is not a backup until you've restored from it: If you haven't tested a restore, you have a hope, not a backup. Schedule regular restore drills.
- Alert on backup job failures, not just execution: Monitoring that the cron ran is useless. Monitor that it produced a valid artifact of expected size within expected time.
- Defense in depth for data: WAL archiving, delayed replicas, SAN snapshots, logical backups — you want multiple independent layers because any single one can silently fail.
What I'd Do Differently
I'd implement backup verification as a CI pipeline stage that runs nightly: take the backup, restore it to an ephemeral instance, run checksums against production, tear it down. Fully automated, with PagerDuty alerts on failure. I'd also set up a delayed replica with at least 4 hours of lag — it's the cheapest insurance against "oops" moments.
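The delayed replica is also the cheapest of these to configure. On PostgreSQL 12 and later (recovery.conf having been folded into postgresql.conf), a 4-hour replay delay is a single standby setting:

```
# postgresql.conf on the delayed standby
recovery_min_apply_delay = '4h'   # hold WAL replay back by four hours
```

Had we had this in October, the DROP would have sat unapplied on the delayed standby for four hours — plenty of time to stop replay and dump the tables from it.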
The Quote
"We had 97 backups. Every single one was an error message."
Cross-References
- Topic Packs: Backup and Restore, Database Ops, PostgreSQL
- Case Studies: Linux Ops