The Backup Nobody Tested
- lesson
- backup-strategies
- pitr
- restore-testing
- rpo/rto
- velero
- disaster-recovery
- l2

Topics: backup strategies, PITR, restore testing, RPO/RTO, Velero, disaster recovery
Level: L2 (Operations)
Time: 60–75 minutes
Prerequisites: None
The Mission¶
It's 4 AM. The database is corrupted. A bad migration truncated the wrong table. You need to restore from backup. You open your backup system and discover:
- The last successful backup was 72 hours ago
- Nobody has ever tested a restore
- The backup is in a format you don't have the tool to read
- The restore documentation says "see John" (John left 6 months ago)
Backups that have never been tested are not backups. They are hopes. This lesson teaches backup strategy, restore testing, and disaster recovery planning — because the time to discover your backup doesn't work is NOT during the disaster.
The Two Numbers That Matter¶
RPO — Recovery Point Objective¶
How much data can you afford to lose?
RPO = 0: No data loss acceptable (synchronous replication)
RPO = 1 hour: Can lose up to 1 hour of data (hourly WAL archiving)
RPO = 24 hours: Can lose up to 1 day (daily full backups)
RTO — Recovery Time Objective¶
How long can you be down?
RTO = 0: No downtime acceptable (active-active, auto-failover)
RTO = 15 min: Service must be back within 15 minutes (warm standby)
RTO = 4 hours: Service must be back within 4 hours (cold restore)
RTO = 24 hours: Can afford a day of downtime (restore from offline backup)
Every backup strategy maps to an RPO/RTO pair. The more aggressive the target, the more expensive and complex the solution.
| Strategy | RPO | RTO | Cost |
|---|---|---|---|
| No backups | ∞ | ∞ | $0 (until the disaster) |
| Daily pg_dump to S3 | 24h | 2-4h | Low |
| WAL archiving + PITR | Minutes | 30-60min | Medium |
| Synchronous replication | 0 | Seconds | High |
| Active-active multi-region | 0 | 0 | Very high |
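An RPO target is only real if something checks backup age against it. A minimal sketch in shell; the epoch timestamps are illustrative, and how you obtain the newest backup's mtime depends on your storage:

```shell
#!/usr/bin/env bash
# Check whether the newest backup still satisfies the RPO.
set -euo pipefail

rpo_ok() {
  local newest_epoch=$1   # mtime of newest backup (seconds since epoch)
  local rpo_seconds=$2    # e.g. 86400 for a 24h RPO
  local now_epoch=$3
  local age=$(( now_epoch - newest_epoch ))
  if (( age > rpo_seconds )); then
    echo "RPO VIOLATED: newest backup is ${age}s old (limit ${rpo_seconds}s)"
    return 1
  fi
  echo "OK: newest backup is ${age}s old"
}

# Illustrative: a backup taken 2 hours ago, checked against a 24h RPO
rpo_ok 1699992800 86400 1700000000
```

Run from cron and alert on a non-zero exit; the check is deliberately dumb so it cannot fail the same way the backup pipeline does.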
PostgreSQL Backup Strategies¶
Strategy 1: Logical backups (pg_dump)¶
# Full database dump
pg_dump -Fc mydb > mydb-$(date +%Y%m%d).dump
# Restore
pg_restore -d mydb mydb-20260322.dump
- RPO: However often you dump (typically daily = 24h RPO)
- RTO: Proportional to database size (100GB ≈ 30-60 minutes to restore)
- Pros: Portable across PostgreSQL versions. Human-inspectable when taken in plain format (-Fp).
- Cons: Slow for large databases. Holds only light ACCESS SHARE locks, but a long-running dump blocks DDL until it finishes. For parallel dumps, use the directory format (pg_dump -Fd -j 4); --jobs does not work with -Fc.
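The dump itself is one line; the part teams skip is verifying the result. A hedged sketch of a post-dump sanity check (paths and the size floor are assumptions), because a zero-byte "backup" is exactly how silent failures hide for months:

```shell
#!/usr/bin/env bash
# Refuse to call a dump successful if the file is missing, empty, or tiny.
set -euo pipefail

verify_backup_file() {
  local f=$1 min_bytes=${2:-1024}
  if [ ! -s "$f" ]; then
    echo "FATAL: backup $f is missing or empty" >&2
    return 1
  fi
  local size
  size=$(wc -c < "$f")
  if (( size < min_bytes )); then
    echo "FATAL: backup $f is only ${size} bytes" >&2
    return 1
  fi
  echo "OK: $f (${size} bytes)"
}

# Usage after a dump (illustrative path):
#   pg_dump -Fc mydb > /backups/mydb.dump && verify_backup_file /backups/mydb.dump
demo=$(mktemp)
printf '%2000s' ' ' > "$demo"   # stand-in for a real dump file
verify_backup_file "$demo" 1024
rm -f "$demo"
```

The size floor is crude; a stronger check is pg_restore --list against the file, which fails on truncated or corrupt dumps.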
Strategy 2: WAL archiving + PITR (Point-in-Time Recovery)¶
The Write-Ahead Log (WAL) records every change to the database. Archive WAL files continuously, and you can restore to any point in time:
Full base backup (weekly) + WAL archive (continuous)
↓ ↓
"The database as of Sunday" + "Every change since Sunday"
↓
Replay WAL up to "Tuesday 3:47 PM" = exact state at that moment
# Base backup (using pg_basebackup)
pg_basebackup -D /backups/base -Ft -z -P
# Continuous WAL archiving (postgresql.conf)
archive_mode = on
archive_command = 'aws s3 cp %p s3://backups/wal/%f'
# Restore to a specific point in time
# (PostgreSQL 12+: these settings go in postgresql.conf, plus an empty
#  recovery.signal file in the data directory to trigger recovery)
restore_command = 'aws s3 cp s3://backups/wal/%f %p'
recovery_target_time = '2026-03-22 15:47:00'
- RPO: Minutes (depends on WAL archive lag)
- RTO: 30-60 minutes (base restore + WAL replay)
- Pros: Can restore to any point in time. Minimal data loss.
- Cons: More complex to set up and monitor.
Gotcha: WAL archiving only works if archive_command succeeds. If S3 is unreachable, unarchived WAL files accumulate on disk. If the disk fills, PostgreSQL shuts down entirely; it panics when it can no longer write WAL rather than degrading to read-only. Monitor pg_stat_archiver for last_failed_time: if it's recent, your backup pipeline is broken.
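One way to wire that gotcha into monitoring: pull the last-success and last-failure timestamps out of pg_stat_archiver and fail the check whenever the failure is newer. A sketch, with the comparison logic separated out and the psql query shown as illustrative wiring:

```shell
#!/usr/bin/env bash
# Cron-able check: is WAL archiving currently failing?
set -euo pipefail

archiver_healthy() {
  # Healthy when the last successful archive is newer than the last failure.
  local last_ok_epoch=$1 last_fail_epoch=$2
  if (( last_fail_epoch > last_ok_epoch )); then
    echo "ALERT: WAL archiving is failing (last failure after last success)"
    return 1
  fi
  echo "OK: WAL archiving healthy"
}

# Illustrative wiring against a real server (connection via PG* env vars):
#   psql -Atc "SELECT coalesce(extract(epoch from last_archived_time),0),
#                     coalesce(extract(epoch from last_failed_time),0)
#              FROM pg_stat_archiver;"
# feeds the two numbers into:
archiver_healthy 1700000600 1700000000
```

Alert on the non-zero exit; a failing archiver is a broken backup pipeline even though the database itself still looks healthy.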
Strategy 3: Physical replication + failover¶
Streaming replication keeps a hot standby synchronized:
- RPO: Seconds (asynchronous) or 0 (synchronous)
- RTO: Seconds to minutes (promote replica)
- Cons: Replica has the same data, including the corruption. If the bad migration runs on the primary, it replicates to the standby. You still need WAL archiving/PITR for "oops" recovery.
Mental Model: Replication protects against hardware failure ("the server died"). PITR protects against human error ("someone dropped the table"). You need both.
Kubernetes Backup with Velero¶
For Kubernetes workloads, Velero backs up:
- Kubernetes resources (Deployments, Services, ConfigMaps, Secrets)
- Persistent Volume data (via CSI snapshots)
# Install Velero
velero install --provider aws --bucket velero-backups --secret-file ./credentials
# Backup a namespace
velero backup create staging-backup --include-namespaces staging
# Schedule daily backups with 7-day retention
velero schedule create daily --schedule="0 2 * * *" --ttl 168h
# Restore
velero restore create --from-backup staging-backup
Gotcha: Velero backs up Kubernetes resources, not necessarily the data inside PersistentVolumes (unless you configure CSI snapshots). A Velero restore recreates the Deployment and PVC, but if the underlying storage is gone, the PVC binds to an empty volume. Test the full restore path.
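A hedged sketch of how that full-path test might look with the Velero CLI (backup and namespace names are illustrative): inspect the backup for volume snapshot entries, scan its logs, and restore into a scratch namespace instead of overwriting the original.

```shell
# Confirm volume data was actually captured, not just the manifests
velero backup describe staging-backup --details   # look for volume/CSI snapshot entries
velero backup logs staging-backup                 # partial failures surface here

# Restore into a scratch namespace so the test never touches the original
velero restore create --from-backup staging-backup \
  --namespace-mappings staging:staging-restore-test
```

If the describe output shows resources but no volume snapshots, you have a manifests-only backup and the PVCs will come back empty.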
The Restore Test Ritual¶
A backup you've never restored is not a backup. Schedule regular restore tests:
# Monthly restore test checklist:
# 1. Restore to a separate environment (never prod!)
# 2. Verify data integrity
# 3. Verify application works against restored data
# 4. Time the restore (does it meet RTO?)
# 5. Document the procedure (update the runbook)
# PostgreSQL restore test
createdb mydb_restore_test
pg_restore -d mydb_restore_test /backups/mydb-latest.dump
psql -d mydb_restore_test -c "SELECT count(*) FROM users;"
# → Does the count match production?
dropdb mydb_restore_test
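Step 2 ("verify data integrity") can be partially automated. A minimal sketch that compares a canary table's row count between production and the restored copy; the database names, the users table, and the counts are assumptions:

```shell
#!/usr/bin/env bash
# Compare a canary table's row count between source and restored database.
set -euo pipefail

counts_match() {
  local prod_count=$1 restored_count=$2
  if [ "$prod_count" != "$restored_count" ]; then
    echo "MISMATCH: prod=$prod_count restored=$restored_count"
    return 1
  fi
  echo "MATCH: $prod_count rows"
}

# Illustrative wiring (psql -Atc returns a bare number):
#   prod=$(psql -d mydb -Atc "SELECT count(*) FROM users;")
#   restored=$(psql -d mydb_restore_test -Atc "SELECT count(*) FROM users;")
#   counts_match "$prod" "$restored"
counts_match 41876 41876
```

Counts drift on a live system, so compare against a count taken at dump time (or accept a small tolerance) rather than against the moving production number.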
War Story: A team automated backups for 2 years. One day they needed to restore. The backup tool had been silently failing for 3 months — the S3 bucket policy was changed and uploads were denied, but the backup script's exit code wasn't checked. Three months of backups were empty files. The last good backup was 90 days old. Recovery meant losing 90 days of data — in practice, this meant manually recreating 90 days of transactions from payment processor records.
The 3-2-1 Rule¶
3 copies of your data, on 2 different media, with 1 offsite.
Copy 1: Production database (live)
Copy 2: Local backup server (same datacenter)
Copy 3: S3 / GCS / offsite storage (different region)
Two media: spinning disk + object storage (or tape)
One offsite: different building, different region
This protects against:
- Disk failure (Copy 2 and 3 survive)
- Datacenter failure (Copy 3 survives)
- Ransomware (Copy 3 is immutable/versioned)
- Accidental deletion (Copy 2 or 3 for restore)
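The "immutable/versioned" property of Copy 3 is something you configure, not something you get for free. A sketch with the AWS CLI (the bucket name is illustrative, and Object Lock must have been enabled when the bucket was created):

```shell
# Keep old object versions so a compromised credential can't destroy history
aws s3api put-bucket-versioning --bucket my-backups \
  --versioning-configuration Status=Enabled

# With Object Lock enabled on the bucket, enforce a retention floor
aws s3api put-object-lock-configuration --bucket my-backups \
  --object-lock-configuration \
  '{"ObjectLockEnabled":"Enabled","Rule":{"DefaultRetention":{"Mode":"COMPLIANCE","Days":35}}}'
```

COMPLIANCE mode means even the root account cannot shorten the retention, which is the point: backups that ransomware (or a panicked operator) cannot delete.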
Flashcard Check¶
Q1: RPO = 1 hour. What does this mean?
You can afford to lose up to 1 hour of data. Your backup strategy must ensure no more than 1 hour between the disaster and the last recoverable point.
Q2: Replication protects against ___. PITR protects against ___.
Replication: hardware failure (server dies, disk fails). PITR: human error (someone drops a table, bad migration). You need both.
Q3: Your pg_dump runs daily. What's the RPO?
24 hours. In the worst case (failure just before the next backup), you lose an entire day of data.
Q4: How do you know your backup actually works?
Restore it. Monthly restore tests to a separate environment. Check data integrity, application functionality, and restore time. If you haven't restored it, it's not a backup — it's a hope.
Q5: 3-2-1 rule — what are the 3, 2, and 1?
3 copies of data, 2 different media types, 1 copy offsite.
Cheat Sheet¶
PostgreSQL Backup Commands¶
| Task | Command |
|---|---|
| Full dump (custom format) | pg_dump -Fc mydb > backup.dump |
| Full dump (parallel, directory format) | pg_dump -Fd -j 4 -f backup.dir mydb |
| Restore from dump | pg_restore -d mydb backup.dump |
| Base backup for PITR | pg_basebackup -D /backups/base -Ft -z |
| Check archive status | psql -c "SELECT * FROM pg_stat_archiver;" |
| Check replication lag | psql -c "SELECT * FROM pg_stat_replication;" |
Velero Commands¶
| Task | Command |
|---|---|
| Create backup | velero backup create NAME --include-namespaces NS |
| Schedule backup | velero schedule create NAME --schedule="0 2 * * *" |
| List backups | velero backup get |
| Restore | velero restore create --from-backup NAME |
| Check backup status | velero backup describe NAME |
RPO/RTO Quick Reference¶
| Strategy | RPO | RTO | Complexity |
|---|---|---|---|
| Daily pg_dump | 24h | Hours | Low |
| WAL archiving + PITR | Minutes | 30-60min | Medium |
| Streaming replication | Seconds | Minutes | Medium-High |
| Synchronous replication | 0 | Seconds | High |
Takeaways¶
- An untested backup is not a backup. Schedule monthly restore tests. Time them. Verify data integrity. Update the runbook.
- Replication ≠ backup. Replication copies corruption instantly. PITR lets you go back in time to before the corruption. You need both.
- RPO and RTO drive the strategy. Know your business requirements first. "We need RPO of 1 hour" narrows options to WAL archiving or better.
- Monitor the backup pipeline. Check pg_stat_archiver, verify backup file sizes, alert on failures. The worst time to discover backups are broken is during a disaster.
- 3-2-1: three copies, two media, one offsite. Protects against hardware failure, datacenter failure, ransomware, and accidental deletion.
Related Lessons¶
- The Database That Wouldn't Start — when the database won't come back after a crash
- The Disk That Filled Up — when WAL archiving fills the disk
- The Terraform State Disaster — backing up infrastructure state