
The Backup We Never Tested

Category: The Hard Lesson | Domains: backup-restore, disaster-recovery | Read time: ~5 min


Setting the Scene

Our PostgreSQL 14 database backed a B2B SaaS platform with 1,400 paying customers. Backups ran nightly via pg_dump piped to gzip and uploaded to S3. The backup script had been running successfully for 14 months. CloudWatch showed a green check every morning at 3:15 AM. Backup size: ~12 GB compressed. The S3 bucket had versioning enabled and a 90-day lifecycle policy.

We had never performed a test restore. Not once. "The backups are there, we've spot-checked the files, they look fine," was the line I used in our quarterly security review. The auditor accepted it. I wish she hadn't.

What Happened

In month 8 of our backup history, we upgraded from PostgreSQL 14.5 to 14.8. Routine patch upgrade, no schema changes. What we didn't realize was that our pg_dump wrapper script used the system pg_dump binary, which was installed via apt and pinned to 14.5. After the PostgreSQL upgrade, the server was running 14.8 but pg_dump was still 14.5.
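The drift would have been trivial to catch in the wrapper script itself. Here's a hedged sketch of the guard we could have had; the connection details in the comments are invented, and the versions are hardcoded to illustrate the exact mismatch that bit us:

```shell
#!/usr/bin/env bash
# Hypothetical version-parity guard for the backup wrapper.
# Refuse to run when the pg_dump client and the server have drifted apart.

check_parity() {
  # $1 = pg_dump client version, $2 = server version (e.g. "14.5" vs "14.8")
  if [ "$1" != "$2" ]; then
    echo "FATAL: pg_dump $1 != server $2 -- refusing to run backup" >&2
    return 1
  fi
}

# In the real job these would come from the live tools, roughly:
#   client=$(pg_dump --version | awk '{print $3}')
#   server=$(psql -Atc "SHOW server_version")
check_parity "14.5" "14.8" || echo "backup aborted"
```

Five lines of comparison, and the 3:15 AM green check becomes a red one the night the versions diverge instead of six months later.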

pg_dump 14.5 mostly works against a 14.8 server. Mostly. But it silently skipped certain statistics and, crucially, it produced a custom-format dump that pg_restore 14.8 choked on about 60% of the way through, due to a catalog version mismatch in the dump's table of contents (TOC).

For 6 months, our "successful" backups were partially unrestorable. The backup script exited 0 because pg_dump didn't error — it completed, just with silently incomplete output. The gzip file looked right. The upload succeeded. CloudWatch saw a clean exit code and flashed green.
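There's a related trap in the `pg_dump | gzip` pattern itself that's worth knowing, even though it wasn't our failure mode (pg_dump genuinely exited 0; only a real restore would have caught us): by default, a shell pipeline reports the exit status of its *last* command, so a failing producer is masked by a succeeding gzip unless pipefail is set. A minimal demonstration, using `false` as a stand-in for a failing dump:

```shell
# A pipeline's exit status is the LAST command's by default: even when the
# producer fails, gzip compresses the (empty) stream and exits 0.
false | gzip > /dev/null
echo "default pipeline status: $?"   # prints 0 -- the failure is invisible

# With pipefail, the producer's failure surfaces as the pipeline's status.
set -o pipefail
false | gzip > /dev/null || status=$?
echo "with pipefail: ${status}"      # prints 1
```

Worth adding to any backup script regardless — it just wouldn't have been sufficient here.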

Then, on a Wednesday afternoon, a junior developer ran a data migration script against production instead of staging. DELETE FROM invoices WHERE created_at < '2024-01-01' without the AND status = 'draft' filter. 247,000 invoice records — including 89,000 finalized, legally-required invoices — gone. The transaction committed before anyone noticed.

We reached for the backups with confidence. Downloaded the latest nightly dump. Ran pg_restore --dbname=restore_test --verbose. It chewed through tables for about 20 minutes, then: pg_restore: error: unsupported version (1.14) in file header. I tried the previous night's backup. Same error. The night before that. Same.

I went backwards through S3, trying every backup. The last restorable backup was from 6 months ago — before the PostgreSQL patch upgrade.

The Moment of Truth

I called our CTO and said the words no engineer ever wants to say: "Our most recent restorable backup is 184 days old. We've lost 6 months of invoice data." The silence on the other end lasted about 10 seconds. It felt like 10 minutes.

The Aftermath

We recovered using a combination of WAL archiving (which, thankfully, we had enabled for replication and which used the correct binary version), Stripe payment records for invoice amounts, and a painful manual reconciliation process that took two weeks and involved calling customers to verify records.
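For reference, continuous WAL archiving of the sort that saved us is enabled with a few postgresql.conf settings. This is a hypothetical fragment — the archive destination is invented, not our actual bucket:

```
# Hypothetical postgresql.conf fragment (archive bucket is invented)
wal_level = replica
archive_mode = on
archive_command = 'aws s3 cp %p s3://example-wal-archive/%f'
```

Because `archive_command` runs whatever binary you give it, it sidesteps the pg_dump client-version problem entirely — the server archives its own WAL segments.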

Afterward, we implemented three things. First, pg_dump now runs from the same container as the database, guaranteeing version parity. Second, every backup is followed by an automated restore test to a disposable RDS instance, with row-count validation against production. Third, we run a full disaster recovery drill quarterly — drop a table in staging, restore from backup, validate data integrity.

The restore test adds $4.50 per night in RDS costs. That's $1,642 per year. The incident cost approximately $180,000 in engineering time, legal review, and customer remediation.

The Lessons

  1. Test restores regularly: A backup you haven't restored is a hypothesis, not a safety net. Prove it works on a schedule, not during an emergency.
  2. Backup success does not equal restore success: Exit code 0 from the backup script means the script finished. It doesn't mean the output is usable. Validate the artifact, not just the process.
  3. Disaster recovery needs rehearsal: You don't want the first time you run a restore to be the time everything depends on it. Practice when the stakes are low.

What I'd Do Differently

I'd add a make test-restore target that spins up a temporary database, restores the latest backup, runs SELECT count(*) FROM <table> against every table, compares to production counts, and tears down the instance. I'd run it nightly and alert if counts diverge by more than 1%. I'd also version-pin the backup tooling to the database version using a shared Docker image, so they can never drift apart.
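The divergence check at the heart of that target can be a few lines of shell. This is a hedged sketch — the table name, counts, and connection-string variables are invented, and the restore/teardown steps are shown only as comments:

```shell
#!/usr/bin/env bash
# Hypothetical core of `make test-restore`: alert when a restored table's
# row count differs from production by more than 1%.
set -euo pipefail

diverges_over_1pct() {
  # $1 = production count, $2 = restored count
  local diff=$(( $1 - $2 ))
  [ "$diff" -lt 0 ] && diff=$(( -diff ))
  # diff/prod > 1/100  <=>  diff*100 > prod  (integer arithmetic, no bc)
  [ $(( diff * 100 )) -gt "$1" ]
}

# The real job would restore the latest dump into a throwaway DB, then
# fetch both counts per table, roughly:
#   prod=$(psql -Atc "SELECT count(*) FROM invoices" "$PROD_URL")
#   restored=$(psql -Atc "SELECT count(*) FROM invoices" "$RESTORE_URL")
if diverges_over_1pct 247000 240000; then
  echo "ALERT: invoice counts diverged by more than 1%"
fi
```

The 1% tolerance absorbs rows written between the dump and the comparison; anything beyond that means the backup is not capturing what production holds.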

The Quote

"We had 14 months of backups. We had 8 months of restores. We didn't find out the difference until we needed all 14."
