Disaster Recovery Footguns¶
Mistakes that turn a recoverable incident into a catastrophe.
1. Testing backups by checking if the file exists¶
You verify your backup by checking that /backup/daily.tar.gz exists and has a non-zero size. It exists. It is 200 GB. It is also corrupted because the disk had bad sectors. You discover this during an actual restore at 2 AM.
Fix: Verify backups by actually restoring them. Schedule automated restore tests weekly. Check file integrity with borg check or restic check, not just file size.
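The difference between an existence check and a real read fits in a few lines of shell. This is a minimal sketch using a plain tarball; `verify_backup` is a hypothetical helper name, and for borg or restic repos the equivalent full read is `borg check` / `restic check`:

```shell
# Sketch: verify a tarball by reading it end to end, not by stat-ing it.
# verify_backup is a hypothetical helper; adapt the path to your setup.
verify_backup() {
  f=$1
  [ -s "$f" ] || { echo "FAIL: $f missing or empty"; return 1; }
  # tar -t decompresses and walks the whole archive, so truncation,
  # bad sectors, and bit rot surface here instead of at restore time.
  if tar -tzf "$f" > /dev/null 2>&1; then
    echo "OK: $f is readable end to end"
  else
    echo "FAIL: $f exists but cannot be read"
    return 1
  fi
}
```

A 200 GB file that passes the `-s` size check can still fail the full read; only the read catches it.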
War story: GitLab's 2017 data loss incident revealed that all five of their backup and replication methods had silently failed or were never verified. pg_dump was producing near-empty files. LVM snapshots were not configured. Backups existed on paper but not in practice. They lost 6 hours of production data.
2. Backup and production on the same disk array¶
Your "backup" is a borg repo on the same RAID array as production. The RAID controller fails. You lose production AND backups simultaneously. The 3-2-1 rule exists for this exact scenario.
Fix: Backups must be on physically separate storage. Local backup for speed, offsite backup for survival. At minimum: separate disk, separate server, separate site.
Remember: The 3-2-1 rule: 3 copies of data, on 2 different media types, with 1 offsite. The 3-2-1-1-0 extension adds: 1 copy air-gapped or immutable, 0 errors verified by restore testing. Ransomware has made immutable/air-gapped copies non-optional.
3. Offsite backups with no tested restore path¶
You dutifully send restic snapshots to S3 every night. You have never tried downloading and restoring them. The S3 bucket is in a region that requires VPN access you do not have from your DR site. Or the restore takes 18 hours and your RTO is 4 hours.
Fix: Test the full restore path from offsite at least quarterly. Measure actual restore time. Factor in network bandwidth, download costs, and access requirements.
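The RTO arithmetic is worth writing down before the drill. A back-of-envelope sketch with assumed numbers (200 GB backup, 500 Mbps sustained download, 4-hour RTO; replace all three with your measured values):

```shell
# Back-of-envelope restore-time estimate. Every number here is an
# assumption; substitute your real backup size and measured throughput.
backup_gb=200
bandwidth_mbps=500
rto_hours=4

restore_s=$(( backup_gb * 8000 / bandwidth_mbps ))  # 1 GB = 8000 Mb (decimal)
rto_s=$(( rto_hours * 3600 ))
echo "estimated restore: $(( restore_s / 60 )) min, RTO budget: $(( rto_s / 60 )) min"
```

At these numbers the transfer alone takes about 53 minutes; decompression, database replay, and DNS cutover come on top, which is why only a measured end-to-end drill produces a number you can trust.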
4. Running borg prune with the wrong retention¶
You meant --keep-daily=7 but typed --keep-daily=1. You now have one daily backup. A week later you discover data corruption that happened 3 days ago. Your only backup already contains the corruption.
Fix: Test prune commands with --dry-run first. Set retention in a config file, not on the command line. Review retention policy with the team, not solo at 11 PM.
Debug clue:
borg list <repo> shows all archives with timestamps. borg info <repo>::<archive> shows the size and deduplicated size of a specific archive. After pruning, borg compact <repo> actually reclaims disk space; borg prune only marks archives for deletion and frees nothing until compaction runs.
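What --keep-daily=N keeps can be modeled in a few lines: the newest archive from each of the N most recent distinct days. This is a toy sketch, not borg's exact pruning algorithm (which also interleaves weekly/monthly rules); archive names are ISO timestamps purely for illustration:

```shell
# Toy model of --keep-daily=N: keep the newest archive per calendar day,
# for the N most recent days that have archives. Not borg's real code.
keep_daily() {
  n=$1; shift
  printf '%s\n' "$@" | sort -r | awk -F'T' '!seen[$1]++' | head -n "$n"
}
```

Run `keep_daily 1` over a week of archives and exactly one survivor remains, which is why a mistyped --keep-daily=1 is so destructive.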
5. Forgetting to back up the backup encryption keys¶
Your borg repo is encrypted with a passphrase. The passphrase is stored on the server that just died. Your offsite backup is intact, encrypted, and permanently inaccessible.
Fix: Export borg keys with borg key export. Store the passphrase and key in a separate secrets manager or physical safe. Test decryption from a clean machine annually.
Remember: In repokey mode (the default), Borg stores the encryption key in the repo itself (in <repo>/config), encrypted with your passphrase. If you lose the passphrase, the key is useless. If the repo's config file is corrupted and you didn't export the key, the repo is unrecoverable. borg key export <repo> /safe/location/borg-key-backup saves the key in a portable format. Store it separately from both the repo and the passphrase.
6. DR runbook that references decommissioned infrastructure¶
Your runbook says "restore to dr-server-02." That server was decommissioned 8 months ago. The DNS entry, the SSH keys, and the backup mount point are all gone. You find this out during an actual disaster.
Fix: Review DR runbooks quarterly. Runbook review is part of the DR drill, not separate from it. Version runbooks in git with a last_tested date at the top.
7. Backing up a database by copying data files¶
You back up PostgreSQL by copying /var/lib/postgresql/data/ while the database is running. The files are inconsistent because writes were happening during the copy. Your restore produces a corrupted database.
Fix: Use database-native backup tools: pg_dump, pg_basebackup, or filesystem snapshots with the database in backup mode. Never copy data files from a running database.
Under the hood: PostgreSQL uses Write-Ahead Logging (WAL). At any moment, committed data may be in WAL files that haven't been flushed to data files. Copying data files without pg_start_backup()/pg_stop_backup() (renamed pg_backup_start()/pg_backup_stop() in PostgreSQL 15) misses in-flight WAL, producing a backup that PostgreSQL will refuse to start from, or worse, one that silently loses recent transactions.
8. No monitoring on backup job failures¶
Your backup cron job has been failing for 3 weeks. The disk filled up, cron has been mailing the failures to root, and nobody reads root's mail. You discover this when you need a restore and your newest backup is 3 weeks old.
Fix: Monitor backup recency as a metric. Alert if the newest backup is older than 1.5x the backup interval. Use a dead man's switch (Healthchecks.io, Cronitor) that alerts when backups STOP succeeding.
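The recency check itself is small enough to sketch in shell. `check_backup_recency` is a hypothetical helper (GNU find/stat flags assumed; BSD/macOS use stat -f %m); its failure should still feed a dead man's switch so that the monitor dying also pages you:

```shell
# Sketch: alert when the newest file in a backup dir exceeds a max age.
# check_backup_recency is a hypothetical name; route its failures into a
# dead man's switch (Healthchecks.io, Cronitor), not email to root.
check_backup_recency() {
  dir=$1; max_age_min=$2
  # GNU find prints "epoch.frac path"; newest file sorts first.
  newest=$(find "$dir" -type f -printf '%T@ %p\n' 2>/dev/null \
           | sort -rn | head -n1 | cut -d' ' -f2-)
  if [ -z "$newest" ]; then
    echo "ALERT: no backups found in $dir"; return 1
  fi
  age_min=$(( ( $(date +%s) - $(stat -c %Y "$newest") ) / 60 ))
  if [ "$age_min" -gt "$max_age_min" ]; then
    echo "ALERT: newest backup is ${age_min} min old (limit ${max_age_min})"; return 1
  fi
  echo "OK: newest backup is ${age_min} min old"
}
```

For a daily job, a limit of 2160 minutes (36 hours) implements the 1.5x-interval rule above.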
9. Restoring over production instead of to a staging path¶
You run borg extract /backup/borg-repo::latest from the wrong directory. borg extract writes into the current working directory, so it extracts over your live production data, overwriting files that changed since the backup. You now have a mix of live data and restored data.
Fix: Always restore to a temporary directory first. Verify the restored data, then move it into place. Make every restore command in your runbook begin with a cd into a scratch path (e.g. one created by mktemp -d); borg extract has no destination flag and always extracts into the current directory.
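The staging-first habit can be baked into a tiny wrapper. `restore_to_staging` is a hypothetical helper shown with tar so the sketch is self-contained; for borg, the cd is the whole point, because borg extract always writes into the current working directory:

```shell
# Sketch: never extract in place. Restore into a scratch directory,
# inspect it, then promote into production deliberately.
# restore_to_staging is a hypothetical helper; swap the tar line for
# `borg extract <repo>::<archive>` in a real borg runbook.
restore_to_staging() {
  archive=$1
  stage=$(mktemp -d /tmp/restore-test.XXXXXX)
  ( cd "$stage" && tar -xzf "$archive" ) || return 1
  echo "$stage"   # inspect this tree before touching production
}
```

The subshell around cd means the caller's working directory is never changed, so a typo can't land the extraction somewhere live.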
10. Assuming cloud provider handles DR for you¶
You run on AWS. You assume S3 is durable (it is) so you do not need backups. Then someone deletes the S3 bucket, or an IAM policy change locks you out, or a Terraform apply destroys the resource. S3 durability protects against hardware failure, not human error or misconfiguration.
Fix: Cloud durability is not a backup strategy. Enable S3 versioning and Object Lock for critical buckets. Maintain independent offsite copies in a different cloud account or provider.
Gotcha: S3's "11 nines" durability (99.999999999%) protects against hardware failure, not against aws s3 rm --recursive or a Terraform force_destroy = true. S3 Object Lock in Compliance mode prevents even the root account from deleting objects until the retention period expires, making it the strongest S3-native protection against credential compromise.