
Pattern: Untested Backup

ID: FP-025 | Family: Silent Corruption | Frequency: Very Common | Blast Radius: Single Service to Multi-Service | Detection Difficulty: Actively Misleading

The Shape

Automated backups complete successfully for months or years. The success metric is "backup job exited 0," not "backup was successfully restored." When a restore is actually needed, the backup is discovered to be corrupt, incomplete, or using a format that the current version of the software cannot read. The backup infrastructure provides false confidence — the appearance of disaster recovery without the substance.

How You'll See It

In Linux/Infrastructure

pg_dump job exits 0 nightly. The dump file is actually 0 bytes because the job ran before the database was ready. Restore test (never performed): psql < backup.sql exits with "unexpected end of file."
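A minimal guard here is to validate the artifact itself rather than trusting the exit code. A sketch of such a check (the path is hypothetical, and the 0-byte dump is simulated; `pg_dump`'s plain-text format does end with a `-- PostgreSQL database dump complete` marker, whose absence indicates truncation):

```shell
# Sketch: validate a pg_dump artifact instead of trusting "job exited 0".
# The path is hypothetical; here we simulate the 0-byte dump the broken job produced.
backup=/tmp/demo_backup.sql
: > "$backup"                       # simulate: job ran before the database was ready

validate_dump() {
  f=$1
  if [ ! -s "$f" ]; then            # -s: file exists and is non-empty
    echo "FAIL: dump is empty"
    return 1
  fi
  # pg_dump's plain-text output ends with this marker; absence means truncation
  if ! tail -n 5 "$f" | grep -q 'PostgreSQL database dump complete'; then
    echo "FAIL: dump is truncated"
    return 1
  fi
  echo "OK: dump looks structurally complete"
}

validate_dump "$backup"             # prints "FAIL: dump is empty"
```

A structural check like this is cheap enough to run after every backup; it does not replace a real restore test, it only catches the most embarrassing failures immediately.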

Alternatively: backup was from PostgreSQL 13; restore is attempted on PostgreSQL 15; pg_restore fails with incompatible format version.
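The version mismatch can at least be surfaced before an incident. A sketch (version numbers are hardcoded stand-ins; in practice the dump's version is readable from the archive header via `pg_restore -l`, and the server's via `psql -V`, and only a real test restore proves compatibility):

```shell
# Sketch: surface a dump/server major-version mismatch before relying on the backup.
# Values are hardcoded for illustration; read them from the dump header and server
# in a real check.
backup_ver=13
target_ver=15

if [ "$backup_ver" -ne "$target_ver" ]; then
  echo "WARN: dump is from PostgreSQL $backup_ver, target is $target_ver; run a test restore first"
fi
```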

In Kubernetes

Velero backup job completes successfully. Backup stored in S3. During DR drill: velero restore create --from-backup daily-backup-2024 succeeds (exit 0) but the restored PVC is empty. The backup captured the PVC object definition but not the data (storage class didn't support CSI snapshots; Velero fell back to empty restore silently).
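A DR drill should fail loudly when the restored volume is empty, since the restore command itself exits 0. A sketch of a post-restore data check (the mount path and the use of file count as the signal are assumptions; an empty directory stands in for the restored PVC):

```shell
# Sketch: after a restore, verify the volume contains data, not just the PVC object.
restored=/tmp/restored-pvc          # stand-in for the restored PVC's mount point
mkdir -p "$restored"                # simulate the silent empty-volume fallback

count=$(find "$restored" -type f | wc -l)
if [ "$count" -eq 0 ]; then
  echo "RESTORE EMPTY: PVC object exists but contains no files"
fi
```

A stronger variant compares a checksum or file manifest captured at backup time against the restored contents.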

In Datacenter

Tape backup system reports "backup complete" for 6 months. Tape drive has a bad head that writes corrupted data. First restore attempt during an actual disaster: all tapes unreadable. No restore tests were ever run.
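The same principle applies to tape: periodically read the media back and compare checksums instead of trusting the drive's write status. A sketch with a local file standing in for the tape device:

```shell
# Sketch: write-then-read-back verification; a local file stands in for the tape.
src=/tmp/payload.bin
head -c 1024 /dev/urandom > "$src"
want=$(sha256sum "$src" | awk '{print $1}')

cp "$src" /tmp/tape-copy.bin        # stand-in for writing to, then reading from, tape
got=$(sha256sum /tmp/tape-copy.bin | awk '{print $1}')

if [ "$want" = "$got" ]; then
  echo "read-back checksum OK"
else
  echo "read-back checksum MISMATCH: media is not trustworthy"
fi
```

A drive with a bad write head fails this check on the first scheduled verification pass, not six months later during a disaster.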

The Tell

The backup job has never been tested with a full restore. Backup success is measured by job exit code, not by successful data retrieval. The backup format or location is not periodically validated.

Common Misdiagnosis

| Looks Like | But Actually | How to Tell the Difference |
| --- | --- | --- |
| Backup system failure | Backup was always broken; never tested | Run a restore in a test environment; confirm data integrity |
| Storage failure | Bad backup content | Storage works fine; the backup content is the problem |
| Version incompatibility (known) | Untested compatibility assumption | The assumption was never validated against a real restore |

The Fix (Generic)

  1. Immediate: Attempt a restore from the most recent backup immediately to validate it.
  2. Short-term: Implement automated restore testing: weekly, spin up a test environment, restore from backup, run a data integrity check (row count, hash, query result).
  3. Long-term: Define RTO/RPO targets; validate that the backup system can meet them; treat "backup tested" as a required metric alongside "backup succeeded."
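The restore-test step above can be sketched as a simple integrity gate: restore into a throwaway database, then compare row counts against production and alert below a threshold. The counts here are hardcoded stand-ins for what would, in practice, come from queries (e.g. `psql -tAc 'SELECT count(*) FROM ...'`) against each database:

```shell
# Sketch: weekly restore-test gate. Counts are hardcoded stand-ins; in a real
# check each would come from a query against production and the restored copy.
prod_rows=1000000
restored_rows=850000
threshold=$(( prod_rows * 90 / 100 ))   # alert below 90% of production

if [ "$restored_rows" -lt "$threshold" ]; then
  echo "ALERT: restored row count $restored_rows below threshold $threshold"
fi
```

Row count is a coarse signal; per-table counts, checksums over key columns, or a known-answer query catch partial corruption that a total count misses.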

Real-World Examples

  • Example 1: MySQL backup via mysqldump for 2 years. Restore test during migration: dump file was invalid SQL (table locks during dump caused partial output). 2 years of "successful" backups, none restorable.
  • Example 2: Kubernetes Velero PVC backup. Restore test during disaster recovery drill: PVC restored as empty (storage class didn't support snapshotting; Velero silently fell back to empty volume). Discovered during drill, not actual disaster.

War Story

Database hardware failed on a Tuesday. We had 2 years of automated nightly backups. IT ran the restore. Three hours later: "restore failed — backup file appears corrupted." They tried the previous night: same. A week before: same. We eventually found a good backup from 3 months ago. We lost 3 months of data. The backup job had been silently failing to capture all tables after a schema change broke the dump script. Exit code was always 0. We now run automated restore tests every Sunday in a throwaway database and alert if the restored row count is less than 90% of production.

Cross-References