The Backup Nobody Tested
- lesson
- backup-strategies
- pitr
- restore-testing
- rpo/rto
- velero
- disaster-recovery
- l2

Topics: backup strategies, PITR, restore testing, RPO/RTO, Velero, disaster recovery
Level: L2 (Operations)
Time: 60–75 minutes
Prerequisites: None
The Mission¶
It's 4 AM. The database is corrupted. A bad migration truncated the wrong table. You need to restore from backup. You open your backup system and discover:
- The last successful backup was 72 hours ago
- Nobody has ever tested a restore
- The backup is in a format you don't have the tool to read
- The restore documentation says "see John" (John left 6 months ago)
Backups that have never been tested are not backups. They are hopes. This lesson teaches backup strategy, restore testing, and disaster recovery planning — because the time to discover your backup doesn't work is NOT during the disaster.
The Two Numbers That Matter¶
RPO — Recovery Point Objective¶
How much data can you afford to lose?
RPO = 0: No data loss acceptable (synchronous replication)
RPO = 1 hour: Can lose up to 1 hour of data (hourly WAL archiving)
RPO = 24 hours: Can lose up to 1 day (daily full backups)
RTO — Recovery Time Objective¶
How long can you be down?
RTO = 0: No downtime acceptable (active-active, auto-failover)
RTO = 15 min: Service must be back within 15 minutes (warm standby)
RTO = 4 hours: Service must be back within 4 hours (cold restore)
RTO = 24 hours: Can afford a day of downtime (restore from offline backup)
Every backup strategy maps to an RPO/RTO pair. The more aggressive the target, the more expensive and complex the solution.
| Strategy | RPO | RTO | Cost |
|---|---|---|---|
| No backups | ∞ | ∞ | $0 (until the disaster) |
| Daily pg_dump to S3 | 24h | 2-4h | Low |
| WAL archiving + PITR | Minutes | 30-60min | Medium |
| Synchronous replication | 0 | Seconds | High |
| Active-active multi-region | 0 | 0 | Very high |
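An RPO target is only real if something checks backup age against it. A minimal sketch in shell; the epoch timestamps are illustrative, and how you obtain the newest backup's mtime depends on your storage:

```shell
#!/usr/bin/env bash
# Check whether the newest backup still satisfies the RPO.
set -euo pipefail

rpo_ok() {
  local newest_epoch=$1   # mtime of newest backup (seconds since epoch)
  local rpo_seconds=$2    # e.g. 86400 for a 24h RPO
  local now_epoch=$3
  local age=$(( now_epoch - newest_epoch ))
  if (( age > rpo_seconds )); then
    echo "RPO VIOLATED: newest backup is ${age}s old (limit ${rpo_seconds}s)"
    return 1
  fi
  echo "OK: newest backup is ${age}s old"
}

# Illustrative: a backup taken 2 hours ago, checked against a 24h RPO
rpo_ok 1699992800 86400 1700000000
```

Run from cron and alert on a non-zero exit; the check is deliberately dumb so it cannot fail the same way the backup pipeline does.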
PostgreSQL Backup Strategies¶
Strategy 1: Logical backups (pg_dump)¶
# Full database dump
pg_dump -Fc mydb > mydb-$(date +%Y%m%d).dump
# Restore
pg_restore -d mydb mydb-20260322.dump
- RPO: However often you dump (typically daily = 24h RPO)
- RTO: Proportional to database size (100GB ≈ 30-60 minutes to restore)
- Pros: Portable across PostgreSQL versions. Human-inspectable when taken in plain format (-Fp).
- Cons: Slow for large databases. Holds only light ACCESS SHARE locks, but a long-running dump blocks DDL until it finishes. For parallel dumps, use the directory format (pg_dump -Fd -j 4); --jobs does not work with -Fc.
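The dump itself is one line; the part teams skip is verifying the result. A hedged sketch of a post-dump sanity check (paths and the size floor are assumptions), because a zero-byte "backup" is exactly how silent failures hide for months:

```shell
#!/usr/bin/env bash
# Refuse to call a dump successful if the file is missing, empty, or tiny.
set -euo pipefail

verify_backup_file() {
  local f=$1 min_bytes=${2:-1024}
  if [ ! -s "$f" ]; then
    echo "FATAL: backup $f is missing or empty" >&2
    return 1
  fi
  local size
  size=$(wc -c < "$f")
  if (( size < min_bytes )); then
    echo "FATAL: backup $f is only ${size} bytes" >&2
    return 1
  fi
  echo "OK: $f (${size} bytes)"
}

# Usage after a dump (illustrative path):
#   pg_dump -Fc mydb > /backups/mydb.dump && verify_backup_file /backups/mydb.dump
demo=$(mktemp)
printf '%2000s' ' ' > "$demo"   # stand-in for a real dump file
verify_backup_file "$demo" 1024
rm -f "$demo"
```

The size floor is crude; a stronger check is pg_restore --list against the file, which fails on truncated or corrupt dumps.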
Strategy 2: WAL archiving + PITR (Point-in-Time Recovery)¶
The Write-Ahead Log (WAL) records every change to the database. Archive WAL files continuously, and you can restore to any point in time:
Full base backup (weekly) + WAL archive (continuous)
↓ ↓
"The database as of Sunday" + "Every change since Sunday"
↓
Replay WAL up to "Tuesday 3:47 PM" = exact state at that moment
# Base backup (using pg_basebackup)
pg_basebackup -D /backups/base -Ft -z -P
# Continuous WAL archiving (postgresql.conf)
archive_mode = on
archive_command = 'aws s3 cp %p s3://backups/wal/%f'
# Restore to a specific point in time
# (PostgreSQL 12+: these settings go in postgresql.conf, plus an empty
#  recovery.signal file in the data directory to trigger recovery)
restore_command = 'aws s3 cp s3://backups/wal/%f %p'
recovery_target_time = '2026-03-22 15:47:00'
- RPO: Minutes (depends on WAL archive lag)
- RTO: 30-60 minutes (base restore + WAL replay)
- Pros: Can restore to any point in time. Minimal data loss.
- Cons: More complex to set up and monitor.
Gotcha: WAL archiving only works if archive_command succeeds. If S3 is unreachable, unarchived WAL files accumulate on disk. If the disk fills, PostgreSQL shuts down entirely; it panics when it can no longer write WAL rather than degrading to read-only. Monitor pg_stat_archiver for last_failed_time: if it's recent, your backup pipeline is broken.
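One way to wire that gotcha into monitoring: pull the last-success and last-failure timestamps out of pg_stat_archiver and fail the check whenever the failure is newer. A sketch, with the comparison logic separated out and the psql query shown as illustrative wiring:

```shell
#!/usr/bin/env bash
# Cron-able check: is WAL archiving currently failing?
set -euo pipefail

archiver_healthy() {
  # Healthy when the last successful archive is newer than the last failure.
  local last_ok_epoch=$1 last_fail_epoch=$2
  if (( last_fail_epoch > last_ok_epoch )); then
    echo "ALERT: WAL archiving is failing (last failure after last success)"
    return 1
  fi
  echo "OK: WAL archiving healthy"
}

# Illustrative wiring against a real server (connection via PG* env vars):
#   psql -Atc "SELECT coalesce(extract(epoch from last_archived_time),0),
#                     coalesce(extract(epoch from last_failed_time),0)
#              FROM pg_stat_archiver;"
# feeds the two numbers into:
archiver_healthy 1700000600 1700000000
```

Alert on the non-zero exit; a failing archiver is a broken backup pipeline even though the database itself still looks healthy.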
Strategy 3: Physical replication + failover¶
Streaming replication keeps a hot standby synchronized:
- RPO: Seconds (asynchronous) or 0 (synchronous)
- RTO: Seconds to minutes (promote replica)
- Cons: Replica has the same data, including the corruption. If the bad migration runs on the primary, it replicates to the standby. You still need WAL archiving/PITR for "oops" recovery.
Mental Model: Replication protects against hardware failure ("the server died"). PITR protects against human error ("someone dropped the table"). You need both.
Kubernetes Backup with Velero¶
For Kubernetes workloads, Velero backs up:
- Kubernetes resources (Deployments, Services, ConfigMaps, Secrets)
- Persistent Volume data (via CSI snapshots)
# Install Velero
velero install --provider aws --bucket velero-backups --secret-file ./credentials
# Backup a namespace
velero backup create staging-backup --include-namespaces staging
# Schedule daily backups with 7-day retention
velero schedule create daily --schedule="0 2 * * *" --ttl 168h
# Restore
velero restore create --from-backup staging-backup
Gotcha: Velero backs up Kubernetes resources, not necessarily the data inside PersistentVolumes (unless you configure CSI snapshots). A Velero restore recreates the Deployment and PVC, but if the underlying storage is gone, the PVC binds to an empty volume. Test the full restore path.
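A hedged sketch of how that full-path test might look with the Velero CLI (backup and namespace names are illustrative): inspect the backup for volume snapshot entries, scan its logs, and restore into a scratch namespace instead of overwriting the original.

```shell
# Confirm volume data was actually captured, not just the manifests
velero backup describe staging-backup --details   # look for volume/CSI snapshot entries
velero backup logs staging-backup                 # partial failures surface here

# Restore into a scratch namespace so the test never touches the original
velero restore create --from-backup staging-backup \
  --namespace-mappings staging:staging-restore-test
```

If the describe output shows resources but no volume snapshots, you have a manifests-only backup and the PVCs will come back empty.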
The Restore Test Ritual¶
A backup you've never restored is not a backup. Schedule regular restore tests:
# Monthly restore test checklist:
# 1. Restore to a separate environment (never prod!)
# 2. Verify data integrity
# 3. Verify application works against restored data
# 4. Time the restore (does it meet RTO?)
# 5. Document the procedure (update the runbook)
# PostgreSQL restore test
createdb mydb_restore_test
pg_restore -d mydb_restore_test /backups/mydb-latest.dump
psql -d mydb_restore_test -c "SELECT count(*) FROM users;"
# → Does the count match production?
dropdb mydb_restore_test
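Step 2 ("verify data integrity") can be partially automated. A minimal sketch that compares a canary table's row count between production and the restored copy; the database names, the users table, and the counts are assumptions:

```shell
#!/usr/bin/env bash
# Compare a canary table's row count between source and restored database.
set -euo pipefail

counts_match() {
  local prod_count=$1 restored_count=$2
  if [ "$prod_count" != "$restored_count" ]; then
    echo "MISMATCH: prod=$prod_count restored=$restored_count"
    return 1
  fi
  echo "MATCH: $prod_count rows"
}

# Illustrative wiring (psql -Atc returns a bare number):
#   prod=$(psql -d mydb -Atc "SELECT count(*) FROM users;")
#   restored=$(psql -d mydb_restore_test -Atc "SELECT count(*) FROM users;")
#   counts_match "$prod" "$restored"
counts_match 41876 41876
```

Counts drift on a live system, so compare against a count taken at dump time (or accept a small tolerance) rather than against the moving production number.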
War Story: A team automated backups for 2 years. One day they needed to restore. The backup tool had been silently failing for 3 months — the S3 bucket policy was changed and uploads were denied, but the backup script's exit code wasn't checked. Three months of backups were empty files. The last good backup was 90 days old. Recovery meant losing 90 days of data — in practice, this meant manually recreating 90 days of transactions from payment processor records.
The 3-2-1 Rule¶
3 copies of your data, on 2 different media, with 1 offsite.
Copy 1: Production database (live)
Copy 2: Local backup server (same datacenter)
Copy 3: S3 / GCS / offsite storage (different region)
Two media: spinning disk + object storage (or tape)
One offsite: different building, different region
This protects against:
- Disk failure (Copy 2 and 3 survive)
- Datacenter failure (Copy 3 survives)
- Ransomware (Copy 3 is immutable/versioned)
- Accidental deletion (Copy 2 or 3 for restore)
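The "immutable/versioned" property of Copy 3 is something you configure, not something you get for free. A sketch with the AWS CLI (the bucket name is illustrative, and Object Lock must have been enabled when the bucket was created):

```shell
# Keep old object versions so a compromised credential can't destroy history
aws s3api put-bucket-versioning --bucket my-backups \
  --versioning-configuration Status=Enabled

# With Object Lock enabled on the bucket, enforce a retention floor
aws s3api put-object-lock-configuration --bucket my-backups \
  --object-lock-configuration \
  '{"ObjectLockEnabled":"Enabled","Rule":{"DefaultRetention":{"Mode":"COMPLIANCE","Days":35}}}'
```

COMPLIANCE mode means even the root account cannot shorten the retention, which is the point: backups that ransomware (or a panicked operator) cannot delete.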
Flashcard Check¶
Q1: RPO = 1 hour. What does this mean?
You can afford to lose up to 1 hour of data. Your backup strategy must ensure no more than 1 hour between the disaster and the last recoverable point.
Q2: Replication protects against ___. PITR protects against ___.
Replication: hardware failure (server dies, disk fails). PITR: human error (someone drops a table, bad migration). You need both.
Q3: Your pg_dump runs daily. What's the RPO?
24 hours. In the worst case (failure just before the next backup), you lose an entire day of data.
Q4: How do you know your backup actually works?
Restore it. Monthly restore tests to a separate environment. Check data integrity, application functionality, and restore time. If you haven't restored it, it's not a backup — it's a hope.
Q5: 3-2-1 rule — what are the 3, 2, and 1?
3 copies of data, 2 different media types, 1 copy offsite.
Cheat Sheet¶
PostgreSQL Backup Commands¶
| Task | Command |
|---|---|
| Full dump (custom format) | pg_dump -Fc mydb > backup.dump |
| Full dump (parallel, directory format) | pg_dump -Fd -j 4 -f backup.dir mydb |
| Restore from dump | pg_restore -d mydb backup.dump |
| Base backup for PITR | pg_basebackup -D /backups/base -Ft -z |
| Check archive status | psql -c "SELECT * FROM pg_stat_archiver;" |
| Check replication lag | psql -c "SELECT * FROM pg_stat_replication;" |
Velero Commands¶
| Task | Command |
|---|---|
| Create backup | velero backup create NAME --include-namespaces NS |
| Schedule backup | velero schedule create NAME --schedule="0 2 * * *" |
| List backups | velero backup get |
| Restore | velero restore create --from-backup NAME |
| Check backup status | velero backup describe NAME |
RPO/RTO Quick Reference¶
| Strategy | RPO | RTO | Complexity |
|---|---|---|---|
| Daily pg_dump | 24h | Hours | Low |
| WAL archiving + PITR | Minutes | 30-60min | Medium |
| Streaming replication | Seconds | Minutes | Medium-High |
| Synchronous replication | 0 | Seconds | High |
Takeaways¶
- An untested backup is not a backup. Schedule monthly restore tests. Time them. Verify data integrity. Update the runbook.
- Replication ≠ backup. Replication copies corruption instantly. PITR lets you go back in time to before the corruption. You need both.
- RPO and RTO drive the strategy. Know your business requirements first. "We need RPO of 1 hour" narrows options to WAL archiving or better.
- Monitor the backup pipeline. Check pg_stat_archiver, verify backup file sizes, alert on failures. The worst time to discover backups are broken is during a disaster.
- 3-2-1: three copies, two media, one offsite. Protects against hardware failure, datacenter failure, ransomware, and accidental deletion.
Related Lessons¶
- The Database That Wouldn't Start — when the database won't come back after a crash
- The Disk That Filled Up — when WAL archiving fills the disk
- The Terraform State Disaster — backing up infrastructure state