Disaster Recovery & Backup Engineering - Primer¶
Why This Matters¶
Every production environment will eventually suffer data loss, hardware failure, or a catastrophic event that requires recovery from backups. The difference between a 4-hour recovery and a 4-day recovery is whether you engineered your DR strategy or just hoped for the best. I have seen companies lose weeks of customer data because their "backup system" was an untested cron job writing to a local disk that failed alongside the primary.
DR is not a project you finish. It is an ongoing discipline: designing backup pipelines, testing restores, measuring RTO/RPO, and running failover drills. If you have never restored from your backups, you do not have backups — you have hopes.
Core Concepts¶
1. RTO and RPO¶
These two numbers define your entire DR strategy.
```
RPO (Recovery Point Objective)       RTO (Recovery Time Objective)
How much data can you afford         How long can you be down
to lose?                             before business impact?

|---- data loss window ----|---- downtime window ----|
                           ^                         ^
                    disaster occurs           service restored
```
| Tier | RPO | RTO | Strategy |
|---|---|---|---|
| Tier 1 (critical) | < 1 hour | < 1 hour | Real-time replication, hot standby |
| Tier 2 (important) | < 4 hours | < 4 hours | Frequent snapshots, warm standby |
| Tier 3 (standard) | < 24 hours | < 24 hours | Daily backups, cold restore |
| Tier 4 (archival) | < 7 days | < 72 hours | Weekly backups, offsite only |
Remember the mnemonic: RPO = data, RTO = downtime. RPO looks backward ("how much data can I lose?"); RTO looks forward ("how long until service returns?"). If an interviewer asks for the difference, that is the answer: RPO is data-loss tolerance, RTO is downtime tolerance. The two drive different engineering decisions: RPO drives replication frequency, RTO drives failover architecture.
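Because RPO drives replication frequency, the simplest monitoring you can add is "is my newest backup older than the RPO window?". A minimal sketch (the repo path and RPO values are examples; GNU `find`/`date` assumed):

```shell
#!/bin/bash
# rpo_check: fail if the newest file under a backup dir is older than the RPO.
set -euo pipefail

rpo_check() {
    local dir=$1 rpo=$2 newest now age
    # Newest modification time anywhere under the dir (GNU find)
    newest=$(find "$dir" -type f -printf '%T@\n' | sort -n | tail -1)
    now=$(date +%s)
    age=$(( now - ${newest%.*} ))
    if [ "$age" -gt "$rpo" ]; then
        echo "FAIL: last backup is ${age}s old, RPO is ${rpo}s"
        return 1
    fi
    echo "PASS: last backup is ${age}s old (within RPO)"
}

# Demo against a throwaway directory with one fresh file (Tier 2: 4 h RPO)
demo=$(mktemp -d)
touch "$demo/archive"
rpo_check "$demo" 14400
rm -rf "$demo"
```

Wire the check into cron or your monitoring agent and alert on a non-zero exit code; a backup job that silently stops running is indistinguishable from one that works until you measure backup age.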
2. The 3-2-1 Backup Rule¶
- 3 copies of your data
- 2 different storage media
- 1 offsite copy

Example:

- Copy 1: Production database (live)
- Copy 2: Local backup server (borg repo on ZFS)
- Copy 3: Offsite S3 bucket (restic to Backblaze B2)
Fun fact: The 3-2-1 rule was popularized by photographer Peter Krogh in his 2005 book on digital asset management. It has since become the universal baseline for backup strategy across all of IT. Some organizations extend it to 3-2-1-1-0: 3 copies, 2 media types, 1 offsite, 1 offline (air-gapped, immune to ransomware), and 0 errors (verified restores). The offline copy is increasingly critical — ransomware specifically targets network-accessible backups.
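In practice the three copies from the example above come from two scheduled jobs plus a periodic integrity check. A cron sketch (script names and paths are placeholders for your own wrappers):

```shell
# Copy 2: nightly borg to a local backup server
0 2 * * *  /usr/local/sbin/backup-borg-local.sh

# Copy 3: restic push to the offsite bucket after the local run
0 3 * * *  /usr/local/sbin/backup-restic-offsite.sh

# Weekly integrity check of the local repo (alert on non-zero exit)
30 3 * * 0 borg check /backup/borg-repo
```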
3. Borg Backup¶
Who made it: Borg Backup was forked from Attic (created 2010) by Thomas Waldmann and the community in 2015. The fork happened because Attic's development had stalled. Borg added authenticated encryption (AEAD), better compression options, and faster deduplication. The name is a nod to Star Trek's Borg collective — "resistance is futile" when it comes to data loss. Restic, the main alternative, was created by Alexander Neumann in 2015 and took a different approach: native cloud backend support (S3, B2, Azure, GCS) without needing rclone or mount points.
Borg is a deduplicating backup tool. It is fast, space-efficient, and handles encryption natively. This is what you reach for when backing up Linux servers with large, slowly-changing datasets.
```shell
# Initialize a borg repository
borg init --encryption=repokey /backup/borg-repo

# Create a backup with pruning-friendly naming
borg create /backup/borg-repo::{hostname}-{now:%Y-%m-%d_%H:%M} \
    /etc /var/lib/postgresql /home \
    --exclude '*.tmp' \
    --exclude '/home/*/.cache' \
    --compression lz4

# List archives
borg list /backup/borg-repo

# Restore a specific path
borg extract /backup/borg-repo::myhost-2026-03-14_02:00 \
    home/deploy/app/config.yml

# Prune old backups (keep 7 daily, 4 weekly, 6 monthly)
borg prune /backup/borg-repo \
    --keep-daily=7 --keep-weekly=4 --keep-monthly=6

# Verify repository integrity
borg check /backup/borg-repo
```
Borg Deduplication¶
```
Backup 1: [A][B][C][D][E]    = 50 GB
Backup 2: [A][B][C][D'][E]   = only D' is new      = ~10 GB stored
Backup 3: [A][B][C'][D'][E'] = only C', E' are new = ~20 GB stored

Total logical:  150 GB
Total physical: ~80 GB (deduplication ratio depends on change rate)
```
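You can see the accounting idea with nothing but coreutils. A toy sketch using hypothetical fixed 1 MiB blocks (borg actually uses content-defined chunking, which also survives insertions, but the "only unique blocks cost storage" logic is the same):

```shell
#!/bin/bash
# Toy block-level dedup: hash 1 MiB blocks of two "backups" and count
# how many unique blocks would actually need to be stored.
set -euo pipefail

work=$(mktemp -d)

# "Backup 1": 4 MiB of identical data; "Backup 2": same, last MiB changed
head -c $((4 * 1024 * 1024)) /dev/zero > "$work/backup1"
cp "$work/backup1" "$work/backup2"
head -c $((1024 * 1024)) /dev/urandom |
    dd of="$work/backup2" bs=1M seek=3 conv=notrunc status=none

hash_blocks() {
    # hash each 1 MiB block; unique hashes = blocks needing physical storage
    split -b 1M "$1" "$work/chunk."
    sha256sum "$work"/chunk.* | awk '{print $1}'
    rm -f "$work"/chunk.*
}

unique=$( { hash_blocks "$work/backup1"; hash_blocks "$work/backup2"; } \
          | sort -u | wc -l )
echo "8 logical blocks, $unique unique blocks stored"
rm -rf "$work"
```

Here the zero-filled blocks all collapse to a single stored block, so two 4 MiB "backups" reduce to two unique blocks: the shared data and the changed MiB.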
4. Restic¶
Restic is the modern alternative to borg. It natively supports cloud backends (S3, B2, Azure, GCS) and is simpler to set up for offsite backups.
```shell
# Initialize a restic repo on S3
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
restic init -r s3:s3.amazonaws.com/my-backup-bucket

# Create a backup
restic backup /etc /var/lib/postgresql /home \
    -r s3:s3.amazonaws.com/my-backup-bucket \
    --exclude-caches \
    --tag production --tag database

# List snapshots
restic snapshots -r s3:s3.amazonaws.com/my-backup-bucket

# Restore to a target directory
restic restore latest -r s3:s3.amazonaws.com/my-backup-bucket \
    --target /restore --include /var/lib/postgresql

# Forget and prune (reclaim space)
restic forget -r s3:s3.amazonaws.com/my-backup-bucket \
    --keep-daily 7 --keep-weekly 4 --keep-monthly 12 --prune

# Check repository integrity
restic check -r s3:s3.amazonaws.com/my-backup-bucket
```
5. rsnapshot¶
rsnapshot uses rsync and hard links for space-efficient incremental backups. It is dead simple and battle-tested for filesystem-level backups.
```shell
# /etc/rsnapshot.conf (tabs between fields, not spaces!)
snapshot_root   /backup/rsnapshot/

retain  daily   7
retain  weekly  4
retain  monthly 6

backup  /etc/                   localhost/
backup  /home/                  localhost/
backup  /var/lib/postgresql/    localhost/

# Crontab
0 3 * * *   /usr/bin/rsnapshot daily
0 4 * * 1   /usr/bin/rsnapshot weekly
0 5 1 * *   /usr/bin/rsnapshot monthly
```
Directory structure after a week:
```
/backup/rsnapshot/
    daily.0/    <- today's backup
    daily.1/    <- yesterday
    daily.2/    <- 2 days ago
    ...
    weekly.0/   <- last Monday
```
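The space efficiency comes from hard links: rotating a snapshot is essentially `cp -al` (link, don't copy), after which rsync replaces only the files that changed. A toy demonstration of the mechanism (throwaway paths; rsnapshot does this for you):

```shell
#!/bin/bash
# Show the hard-link rotation rsnapshot is built on: unchanged files
# share one inode across snapshots, changed files get a new one.
set -euo pipefail

snap=$(mktemp -d)
mkdir "$snap/daily.1"
echo "unchanged config" > "$snap/daily.1/app.conf"
echo "old data"         > "$snap/daily.1/data.db"

# Rotate: clone yesterday's tree as hard links
cp -al "$snap/daily.1" "$snap/daily.0"

# Simulate rsync updating a changed file: write a new inode, then swap it in
echo "new data" > "$snap/daily.0/data.db.new"
mv "$snap/daily.0/data.db.new" "$snap/daily.0/data.db"

links=$(stat -c %h "$snap/daily.0/app.conf")   # shared inode: link count 2
dlinks=$(stat -c %h "$snap/daily.0/data.db")   # fresh inode: link count 1
echo "unchanged file links: $links, changed file links: $dlinks"
rm -rf "$snap"
```

Deleting `daily.6/` during rotation only frees blocks whose last hard link lived there, which is why seven dailies of a mostly static filesystem cost little more than one.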
6. Backup Verification¶
A backup you have never restored from is not a backup.
```shell
#!/bin/bash
# verify-backup.sh - automated restore test
set -euo pipefail

RESTORE_DIR=$(mktemp -d /tmp/restore-test.XXXXX)
BORG_REPO=/backup/borg-repo

LATEST=$(borg list "$BORG_REPO" --last 1 --format '{archive}')
echo "Testing restore of archive: $LATEST"

# Extract to temp directory (borg extract writes into the current directory)
( cd "$RESTORE_DIR" && borg extract "$BORG_REPO::$LATEST" )

# Verify critical files exist
for f in etc/hosts var/lib/postgresql/data/PG_VERSION; do
    if [ ! -f "$RESTORE_DIR/$f" ]; then
        echo "FAIL: missing $f"
        exit 1
    fi
done

# Verify the PostgreSQL data directory (needs a base backup with a manifest)
pg_verifybackup "$RESTORE_DIR/var/lib/postgresql/data" 2>/dev/null || \
    echo "WARN: pg_verifybackup failed or not available, skipping"

echo "PASS: restore verification complete"
rm -rf "$RESTORE_DIR"
```
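A restore test that nobody runs is as useless as a backup that nobody restores, so schedule it. A cron sketch (the alert command is a placeholder for your on-call tooling):

```shell
# Weekly automated restore drill; page on failure
0 6 * * 0  /usr/local/sbin/verify-backup.sh || \
           /usr/local/bin/alert "weekly restore test failed"
```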
7. DR Runbook Structure¶
Every DR runbook follows this skeleton:
```markdown
## DR Runbook: [Service Name]

### Classification
- RPO: X hours
- RTO: X hours
- Tier: 1/2/3/4
- Last tested: YYYY-MM-DD
- Owner: team-name

### Dependencies
- Upstream: [services this depends on]
- Downstream: [services that depend on this]

### Backup Details
- Tool: borg/restic/pg_dump/velero
- Schedule: every N hours
- Retention: X daily, Y weekly, Z monthly
- Location: primary and offsite paths
- Encryption: yes/no, key location

### Recovery Procedure
1. [Step-by-step restore instructions]
2. [Verification steps]
3. [DNS/routing cutover if applicable]
4. [Notification procedure]

### Failover Procedure (if hot/warm standby)
1. [Promote standby]
2. [Redirect traffic]
3. [Verify data integrity]

### Post-Recovery Checklist
- [ ] Verify data integrity
- [ ] Check replication lag
- [ ] Run smoke tests
- [ ] Notify stakeholders
- [ ] Update incident timeline
- [ ] Rebuild backup pipeline
```
8. Offsite Replication¶
```
   Primary Site                       Offsite / Cloud
+------------------+               +------------------+
| Production DB    |               | S3 / B2 / GCS    |
| Borg local repo  | --restic-->   | Offsite restic   |
| Application data |               | repo             |
+------------------+               +------------------+
        |                                  |
   borg prune                        restic forget
(local retention)                 (offsite retention)
```
Key decisions:

- Bandwidth: the initial seed may need physical media for large datasets
- Encryption: always encrypt before data leaves your network (borg and restic do this by default)
- Retention: offsite retention is usually longer than local
- Cost: S3 Standard vs Glacier vs B2 (roughly a 10x price difference per GB)
- Restore speed: Glacier retrieval takes hours; factor this into your RTO
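The bandwidth point is worth a back-of-envelope check before you commit to an offsite design. A minimal calculator (decimal GB, sustained line rate, protocol overhead ignored):

```shell
#!/bin/bash
# seed_days: rough days to complete the initial offsite upload.
set -euo pipefail

seed_days() {
    local dataset_gb=$1 uplink_mbps=$2
    # GB -> megabits, divide by line rate for seconds, round up to whole days
    local seconds=$(( dataset_gb * 8 * 1000 / uplink_mbps ))
    echo $(( (seconds + 86399) / 86400 ))
}

# Example: 10 TB over a 100 Mbit/s uplink
echo "days to seed 10 TB at 100 Mbit/s: $(seed_days 10000 100)"
```

At those numbers the first upload takes over a week of saturated uplink, which is exactly when shipping a seeded drive (or a provider import service) beats the network.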
War story: A company ran nightly borg backups to a local backup server for two years. When their primary database server's RAID controller failed, they discovered the backup server's disk had been silently corrupting data for months: `borg check` had never been run. The restore produced a database that started but had missing tables. The lesson: `borg check` (or `restic check`) must be part of your backup pipeline, not an afterthought. Schedule it weekly and alert on non-zero exit codes.
Common Pitfalls¶
- Backing up but never testing restore — The most common DR failure. Schedule monthly restore drills.
- Backing up to the same disk — When the server dies, your backups die with it. Always have offsite.
- No encryption on offsite backups — Your entire database is now in a cloud bucket protected only by IAM. Use client-side encryption.
- Forgetting to back up config alongside data — You restore the database but the application config, TLS certs, and secrets are gone.
- Backup window exceeds backup interval — Your daily backup takes 26 hours. You now have a gap. Monitor backup duration.
- Not versioning your DR runbooks — The runbook references a server that was decommissioned 6 months ago. Keep runbooks in git and review quarterly.
- Single point of failure in the backup pipeline — The backup server itself has no redundancy. If it dies, you lose your ability to back up AND restore.
- Retention policy that is too aggressive — You keep only 3 days of backups. Corruption happened 4 days ago. You now have 3 copies of corrupted data.
Wiki Navigation¶
Prerequisites¶
- Linux Ops (Topic Pack, L0)
- Storage Operations (Topic Pack, L2)
Related Content¶
- Backup Restore (Topic Pack, L1) — Backup & Restore
- Backup Restore Flashcards (CLI) (flashcard_deck, L1) — Backup & Restore
- Disaster Recovery Flashcards (CLI) (flashcard_deck, L1) — Disaster Recovery