Backup & Restore Footguns¶
Mistakes that leave you without recovery options when disaster strikes.
1. Never testing restores¶
You run backups every night for two years. A disk fails. You try to restore and discover the backup format is incompatible with the current database version. Or the backup file is zero bytes because the cron job errored silently. Your backup was a comfort blanket, not a recovery plan.
Fix: Automate monthly restore tests to a temporary database. Verify row counts and data integrity. Alert on restore failures. A backup that has never been restored is not a backup.
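A restore drill can be automated as a short script. This is a minimal sketch: the dump path, the restore_test database, and the orders table used for verification are all hypothetical stand-ins for your own schema.

```shell
#!/usr/bin/env sh
# Monthly restore drill (sketch): restore the newest dump into a scratch
# database and verify it actually contains data. Paths/names are illustrative.
set -eu

LATEST=$(ls -t /backups/*.dump | head -n 1)       # newest backup file
createdb restore_test                             # throwaway target database
pg_restore --no-owner -d restore_test "$LATEST"   # exits non-zero on a corrupt dump

# A restore that produces only empty tables should also count as a failure.
ROWS=$(psql -At -d restore_test -c 'SELECT count(*) FROM orders;')
dropdb restore_test
[ "$ROWS" -gt 0 ] || { echo "RESTORE TEST FAILED: $LATEST" >&2; exit 1; }
echo "restore test OK: $LATEST ($ROWS rows)"
```

Wire the script into cron or CI and route a non-zero exit code into your alerting, so a failed drill pages someone instead of landing in an unread mailbox.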
War story: On January 31, 2017, GitLab lost 300GB of production data when an engineer accidentally ran rm -rf on the wrong database directory. When they tried to restore, they discovered: pg_dump backups had silently never run (misconfigured), backup failure alerts were rejected by DMARC email filtering, and the only usable backup was 6 hours old. The incident became a textbook case for why untested backups are not backups.
2. Backups on the same failure domain¶
Your backup script copies the database dump to another directory on the same server. The disk dies. Both the data and the backup are gone. The same failure mode applies to snapshots stored in the same cloud region as the data.
Fix: Follow 3-2-1: 3 copies, 2 different media, 1 offsite. Use cross-region S3 replication. Copy snapshots to another region. Test that the offsite copy is independently restorable.
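The offsite leg of 3-2-1 can be as simple as copying each dump to a bucket in a second region. A sketch, with hypothetical bucket names and regions:

```shell
# Offsite copy: push the local dump to a bucket in a different region.
# Bucket name, region, and file name are placeholders.
aws s3 cp /backups/appdb-latest.dump \
  s3://backups-offsite-eu/appdb/ --region eu-west-1

# Alternatively, configure S3 cross-region replication once on the primary
# bucket (aws s3api put-bucket-replication) and let S3 copy automatically.
```

Either way, the offsite copy only counts once you have restored from it at least once.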
3. No backup monitoring or alerting¶
The backup cron job has been failing for 3 weeks. Nobody checks. The nightly email goes to a mailbox nobody reads. When you need to restore, your newest backup is 3 weeks old — far beyond your RPO.
Fix: Monitor backup job exit codes, file sizes, and timestamps. Alert if a backup is missing, undersized, or late. Track backup success rate as an SLA metric.
Debug clue: Compare today's backup size to yesterday's. A backup that is 50% smaller than usual likely failed partway through. A backup that is exactly 0 bytes means the job errored silently. A backup whose size is identical every day for a week on a write-heavy database means something is wrong with the snapshot mechanism.
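Those checks can be wrapped in a small portable function; the thresholds and paths are yours to choose:

```shell
# check_backup FILE MIN_BYTES MAX_AGE_MINUTES
# Prints OK / MISSING / UNDERSIZED / STALE and returns non-zero on any
# problem, so cron can pipe a failure straight into alerting.
check_backup() {
  file=$1; min_bytes=$2; max_age_min=$3
  [ -f "$file" ] || { echo "MISSING $file"; return 1; }
  size=$(wc -c < "$file" | tr -d ' ')
  [ "$size" -ge "$min_bytes" ] || { echo "UNDERSIZED $file ($size bytes)"; return 1; }
  # find -mmin +N matches files modified more than N minutes ago
  if [ -n "$(find "$file" -mmin +"$max_age_min")" ]; then
    echo "STALE $file"; return 1
  fi
  echo "OK $file ($size bytes)"
}
```

Usage from a nightly cron job (the alert command is a placeholder): `check_backup /backups/appdb.dump 1048576 1500 || your-alert-command`.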
4. Backing up the container, not the data¶
You snapshot the Docker host or back up container images. The actual data lives in a volume mount that is not included. You restore the container and find an empty database.
Fix: Identify where persistent data actually lives (volumes, PVCs, external databases). Back up the data, not the compute. For Kubernetes, use Velero with volume snapshot support.
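For plain Docker, backing up the data means archiving the volume, not the image. A sketch, where the volume name pgdata is hypothetical:

```shell
# Archive the named volume "pgdata" via a throwaway container.
# :ro mounts it read-only; the tarball lands in the current directory.
docker run --rm \
  -v pgdata:/data:ro \
  -v "$(pwd)":/backup \
  alpine tar czf "/backup/pgdata-$(date +%F).tar.gz" -C /data .
```

For a running database, pair this with a consistent dump (see section 5) rather than archiving live data files.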
5. Inconsistent database backups¶
You copy the PostgreSQL data directory while the database is running. The files are in an inconsistent state because writes happened mid-copy. The backup is corrupt and unrestorable.
Fix: Use pg_dump or pg_basebackup for consistent logical or physical backups. For MySQL, use --single-transaction. For filesystem-level backups, quiesce the database or use LVM snapshots.
Under the hood: A database's on-disk files are not self-consistent during writes. PostgreSQL uses WAL (write-ahead logging) to ensure crash recovery; a file copy during writes captures a half-written state. pg_basebackup coordinates with the WAL to produce a consistent snapshot. A raw cp or rsync of the data directory does not.
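Minimal invocations of the consistent options mentioned above (database names and paths are illustrative):

```shell
# PostgreSQL: logical dump in custom format; each dump is a consistent snapshot
pg_dump -Fc -d appdb -f appdb.dump

# PostgreSQL: physical base backup, coordinated with the WAL (tar, gzipped)
pg_basebackup -D /backups/base -Ft -z

# MySQL/InnoDB: consistent snapshot without locking the whole server
mysqldump --single-transaction appdb > appdb.sql
```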
6. No encryption on backup media¶
Your database dump containing customer PII sits unencrypted in an S3 bucket. A misconfigured bucket policy exposes it. You now have a data breach on top of whatever problem prompted the backup.
Fix: Encrypt backups at rest (restic and borg encrypt by default). Enable S3 server-side encryption. Encrypt in transit (TLS). Treat backup storage with the same security controls as production data.
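A restic setup along these lines encrypts client-side before anything reaches S3; the bucket and paths are hypothetical:

```shell
# restic encrypts client-side by default; the repo password is the key.
export RESTIC_REPOSITORY=s3:s3.amazonaws.com/backups-offsite-eu
export RESTIC_PASSWORD_FILE=/etc/restic/password   # key lives outside the script

restic init                      # once, to create the encrypted repository
restic backup /var/lib/app       # encrypted before upload; TLS in transit
```

Losing the repository password means losing the backups, so store it in a secrets manager with its own recovery path.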
7. RPO/RTO mismatch with backup frequency¶
Your RPO is 1 hour but you run daily backups. When you need to restore, you lose 23 hours of data. Management thought they had 1-hour recovery capability because nobody translated RPO into backup frequency.
Fix: RPO drives backup frequency: 1-hour RPO requires at least hourly backups (or continuous WAL archiving/replication). Document RPO/RTO requirements and verify your backup schedule meets them.
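The translation from RPO to schedule can even be enforced as a guardrail check in CI; the numbers below reproduce the mismatch in this section:

```shell
# Fail fast when the backup schedule cannot meet the documented RPO.
RPO_MINUTES=60                 # stated requirement: 1-hour RPO
BACKUP_INTERVAL_MINUTES=1440   # actual cron schedule: daily

if [ "$BACKUP_INTERVAL_MINUTES" -gt "$RPO_MINUTES" ]; then
  STATUS=MISMATCH
  echo "backups every ${BACKUP_INTERVAL_MINUTES}m cannot meet a ${RPO_MINUTES}m RPO"
else
  STATUS=OK
fi
```

Keeping the two numbers in one checked file forces the conversation the moment someone tightens the RPO without touching the cron schedule.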
8. Retention policy deletes too aggressively¶
Your retention policy keeps 7 daily backups. On day 1, a data corruption bug is introduced. On day 8, you discover the corruption. All 7 backups contain the corrupted data. The last clean backup was deleted yesterday.
Fix: Use tiered retention: 7 daily, 4 weekly, 6 monthly. Keep at least one backup older than your longest detection window for data corruption. For critical data, add immutable backups that cannot be deleted programmatically.
Remember: Grandfather-Father-Son (GFS) rotation: keep 7 dailies, 4 weeklies, 12 monthlies, 1 yearly. This gives you 12+ months of coverage with only ~24 backup copies. Both restic and Borg support retention policies natively. The key insight: your retention window must exceed your longest corruption detection time.
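With restic, the GFS policy above is a one-liner; Borg's prune command takes analogous flags:

```shell
# Keep 7 dailies, 4 weeklies, 12 monthlies, 1 yearly; prune everything else.
restic forget \
  --keep-daily 7 --keep-weekly 4 --keep-monthly 12 --keep-yearly 1 \
  --prune
```

Run it with `--dry-run` first to see which snapshots would be removed before any data is deleted.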
9. Storing backup credentials alongside the backup¶
Your backup script has the S3 access key and database password hardcoded in it. An attacker who compromises the server gets the credentials to delete all backups before deploying ransomware.
Fix: Use IAM roles (not keys) for cloud access. Use separate credentials for backup write vs. delete operations. Enable S3 Object Lock for immutable backups. Store backup credentials in a secrets manager, not in scripts.
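Immutability can be enforced at the bucket level with S3 Object Lock, which must be enabled when the bucket is created. A sketch with an illustrative bucket name and retention period:

```shell
# 30-day compliance-mode retention: object versions cannot be deleted or
# overwritten before the period expires, even by the root account.
aws s3api put-object-lock-configuration \
  --bucket backups-immutable \
  --object-lock-configuration \
  '{"ObjectLockEnabled": "Enabled", "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Days": 30}}}'
```

Compliance mode is deliberately irreversible for the retention period, so size the period to your corruption detection window, not longer.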
10. Snapshot-only strategy with no offsite copy¶
You rely entirely on EBS snapshots in us-east-1. AWS has a regional disruption. Your snapshots are inaccessible. Your data and your backups are in the same blast radius.
Fix: Snapshots are not backups by themselves. Copy snapshots to another region. Maintain at least one backup in a different provider or medium. Test restoration from the offsite copy.
Gotcha: EBS snapshots are stored in S3 within the same region. A full regional disruption (rare but documented: us-east-1, 2017) makes all snapshots in that region inaccessible. Cross-region snapshot copies cost only the incremental storage and are your cheapest insurance against regional failure.
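A cross-region snapshot copy is a single API call; the snapshot ID and regions below are placeholders:

```shell
# Copy an EBS snapshot out of the primary region's blast radius.
aws ec2 copy-snapshot \
  --source-region us-east-1 \
  --source-snapshot-id snap-0123456789abcdef0 \
  --region us-west-2 \
  --description "offsite copy of nightly db snapshot"
```

Automate this after every snapshot creation so the offsite copy is never more than one backup cycle behind.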