# Pattern: Untested Backup
**ID:** FP-025 · **Family:** Silent Corruption · **Frequency:** Very Common · **Blast Radius:** Single Service to Multi-Service · **Detection Difficulty:** Actively Misleading
## The Shape
Automated backups complete successfully for months or years. The success metric is "backup job exited 0," not "backup was successfully restored." When a restore is actually needed, the backup is discovered to be corrupt, incomplete, or using a format that the current version of the software cannot read. The backup infrastructure provides false confidence — the appearance of disaster recovery without the substance.
## How You'll See It
### In Linux/Infrastructure
A nightly `pg_dump` job exits 0. The dump file is actually 0 bytes because the job ran before the database was ready. The restore test that was never performed, `psql < backup.sql`, exits with "unexpected end of file." Alternatively: the backup was taken from PostgreSQL 13, the restore is attempted on PostgreSQL 15, and `pg_restore` fails with an incompatible format version.
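A cheap guard against the 0-byte and truncated-dump failures is to validate the dump file itself, not just the job's exit code. This is a minimal sketch with a hypothetical `dump_looks_complete` helper; it assumes a plain-format `pg_dump` output, which ends with a "PostgreSQL database dump complete" comment (it is not a substitute for a real restore test):

```python
import os

# Plain-format pg_dump writes this comment at the end of a finished dump.
COMPLETION_MARKER = b"PostgreSQL database dump complete"

def dump_looks_complete(path: str) -> bool:
    """Cheap sanity check for a plain-format pg_dump file.

    Catches the two silent failures described above: a 0-byte file,
    and a dump truncated before pg_dump wrote its closing comment.
    """
    size = os.path.getsize(path)
    if size == 0:
        return False
    with open(path, "rb") as f:
        # Only read the tail; dump files can be many gigabytes.
        f.seek(max(0, size - 4096))
        return COMPLETION_MARKER in f.read()
```

Run this right after the dump job and fail the pipeline (not just log) when it returns `False`; a check that only warns reproduces the original problem.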
### In Kubernetes
A Velero backup job completes successfully and the backup is stored in S3. During a DR drill, `velero restore create --from-backup daily-backup-2024` succeeds (exit 0), but the restored PVC is empty. The backup captured the PVC object definition but not the data: the storage class didn't support CSI snapshots, and Velero silently fell back to an empty restore.
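Because the restore itself exits 0, the drill needs an explicit "is there actually data here?" check against the restored volume's mount point. A minimal sketch, with a hypothetical `volume_has_data` helper run from a pod that mounts the restored PVC:

```python
import os

def volume_has_data(mount_path: str, min_bytes: int = 1) -> bool:
    """Walk a restored volume's mount point and total up file sizes.

    An exit-0 `velero restore create` says nothing about the data; an
    empty tree here means the snapshot fallback restored nothing.
    """
    total = 0
    for root, _dirs, files in os.walk(mount_path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
            if total >= min_bytes:
                return True  # early exit; we only need proof of data
    return total >= min_bytes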
### In Datacenter
Tape backup system reports "backup complete" for 6 months. Tape drive has a bad head that writes corrupted data. First restore attempt during an actual disaster: all tapes unreadable. No restore tests were ever run.
## The Tell
The backup job has never been tested with a full restore. Backup success is measured by job exit code, not by successful data retrieval. The backup format or location is not periodically validated.
## Common Misdiagnosis
| Looks Like | But Actually | How to Tell the Difference |
|---|---|---|
| Backup system failure | Backup was always broken; never tested | Run a restore in a test environment; confirm data integrity |
| Storage failure | Bad backup content | Storage works fine; the backup content is the problem |
| Version incompatibility (known) | Untested compatibility assumption | The assumption was never validated against a real restore |
## The Fix (Generic)
- Immediate: Attempt a restore from the most recent backup immediately to validate it.
- Short-term: Implement automated restore testing: weekly, spin up a test environment, restore from backup, run a data integrity check (row count, hash, query result).
- Long-term: Define RTO/RPO targets; validate that the backup system can meet them; treat "backup tested" as a required metric alongside "backup succeeded."
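The short-term fix above can be sketched end to end. This is a toy using Python's `sqlite3` as a stand-in for the real database and backup tooling (the function name and integrity check are illustrative, not any particular tool's API); the shape is the point: back up, restore into a throwaway target, and compare an integrity metric:

```python
import os
import sqlite3
import tempfile

def backup_and_verify(prod_path: str, table: str) -> bool:
    """Restore the latest backup into a throwaway database and compare
    row counts against production: the 'backup tested' metric."""
    # 1. Take the backup (stand-in for pg_dump / mysqldump / velero).
    backup_path = os.path.join(tempfile.mkdtemp(), "backup.db")
    src = sqlite3.connect(prod_path)
    dst = sqlite3.connect(backup_path)
    src.backup(dst)  # sqlite3's online-backup API
    dst.close()

    # 2. Restore into a fresh connection (the throwaway environment)
    #    and run the integrity check.
    prod_count = src.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    src.close()
    restored = sqlite3.connect(backup_path)
    restored_count = restored.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    restored.close()
    return restored_count == prod_count
```

Row count is the weakest useful check; a per-table hash or a known query result catches more subtle corruption at modest extra cost.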
## Real-World Examples
- Example 1: MySQL backed up via `mysqldump` for 2 years. A restore test during a migration found the dump file was invalid SQL (table locks during the dump caused partial output). 2 years of "successful" backups, none restorable.
- Example 2: Kubernetes Velero PVC backup. A restore test during a disaster recovery drill restored the PVC as empty (the storage class didn't support snapshotting; Velero silently fell back to an empty volume). Discovered during the drill, not an actual disaster.
## War Story
Database hardware failed on a Tuesday. We had 2 years of automated nightly backups. IT ran the restore. Three hours later: "restore failed — backup file appears corrupted." They tried the previous night: same. A week before: same. We eventually found a good backup from 3 months ago. We lost 3 months of data. The backup job had been silently failing to capture all tables after a schema change broke the dump script. Exit code was always 0. We now run automated restore tests every Sunday in a throwaway database and alert if the restored row count is less than 90% of production.
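The Sunday alert condition from the story reduces to a simple threshold comparison. A minimal sketch (the function name and 90% default are taken from the story, not from any tool):

```python
def restore_is_healthy(prod_rows: int, restored_rows: int,
                       threshold: float = 0.9) -> bool:
    """Alert condition: the restored row count must be at least 90% of
    production's. Partial dumps after schema drift trip this even when
    the backup job itself still exits 0."""
    if prod_rows == 0:
        # An empty production table is healthy only if the restore
        # is also empty.
        return restored_rows == 0
    return restored_rows / prod_rows >= threshold
```

The threshold is deliberately loose (rows written between the backup and the check shouldn't page anyone); tighten it per-table if your data volume is stable.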
## Cross-References
- Topic Packs: database-ops, backup-restore
- Footguns: database-ops/footguns.md — "Backup never tested"
- Related Patterns: FP-026 (replication lag at failover — another "believed consistent, wasn't"), FP-027 (missing PITR — related DR gap)