Disaster Recovery & Backup Engineering - Street-Level Ops¶
What experienced DR engineers know from years of backup failures and 3 AM restore calls.
Quick Diagnosis Commands¶
# Check last borg backup timestamps (archive listing has no size field; use borg info for sizes)
borg list /backup/borg-repo --last 3 --format '{archive:<40} {time}{NL}'
# Check restic snapshot freshness
restic snapshots -r s3:s3.amazonaws.com/my-bucket --latest 3
# Verify borg repo integrity (fast mode)
borg check --repository-only /backup/borg-repo
# Verify restic repo integrity
restic check -r s3:s3.amazonaws.com/my-bucket
# Check rsnapshot last run
ls -lt /backup/rsnapshot/daily.0/
# Check if backup cron ran
grep -i backup /var/log/syslog | tail -20
systemctl status backup.timer
# Check disk space on backup volume
df -h /backup/
du -sh /backup/borg-repo/
# Check PostgreSQL WAL archiving status
psql -c "SELECT * FROM pg_stat_archiver;"
# Check replication lag (PostgreSQL)
psql -c "SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;"
Gotcha: Borg Lock File Prevents Backup¶
A previous borg process crashed or was killed, leaving a stale lock file. Every subsequent backup attempt fails with Failed to create/acquire the lock.
Fix:
# Check if a borg process is actually running
pgrep -af borg
# If no borg process is running, break the lock
borg break-lock /backup/borg-repo
# Then run a check to verify repo integrity
borg check /backup/borg-repo
Never blindly break the lock if borg is still running. You will corrupt the repository.
Default trap: Borg's default --encryption=repokey stores the encryption key inside the repo itself (encrypted with your passphrase). If the repo gets corrupted, you lose both data and key. Always run borg key export to a separate location immediately after repo creation.
Debug clue: If pgrep -af borg returns nothing but the lock persists, check whether the process died on a different machine (NFS-mounted repo). The lock file contains the hostname and PID of the holder; cat /backup/borg-repo/lock.exclusive to verify before breaking.
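The "is the holder still alive?" check can be scripted before any break-lock. A minimal sketch: it assumes you have already parsed the holder's hostname and PID out of the lock file yourself (the exact lock format differs between borg versions, so the parsing is left to you):

```shell
# lock_holder_alive HOLDER_HOST HOLDER_PID
# Returns 0 if the lock holder could still be alive (do not break),
# 1 if the holder is provably dead on this host (safe to break-lock).
lock_holder_alive() {
    holder_host=$1; holder_pid=$2
    if [ "$holder_host" != "$(uname -n)" ]; then
        # Held by another machine (NFS-mounted repo): cannot verify from here
        echo "lock held by $holder_host, verify on that host first"
        return 0
    fi
    if kill -0 "$holder_pid" 2>/dev/null; then
        echo "holder PID $holder_pid is alive, do NOT break the lock"
        return 0
    fi
    echo "holder PID $holder_pid is dead, safe to break-lock"
    return 1
}
```

Usage: if this returns 1 on the machine named in the lock file, `borg break-lock` is safe; any other outcome means investigate first.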
Gotcha: Restic Forgets but Does Not Prune¶
You run restic forget --keep-daily 7 and think you reclaimed space. You did not. forget only removes the snapshot metadata. The data blobs are still there.
Remember restic's two-step process: forget removes snapshot pointers, prune removes orphaned data blobs. Without prune, your backup storage never shrinks. Always use --prune with forget, or run prune separately on a schedule.
Fix:
# Always combine forget with prune
restic forget --keep-daily 7 --keep-weekly 4 --keep-monthly 12 --prune
# Or run prune separately if you want to batch it
restic forget --keep-daily 7 --keep-weekly 4
restic prune
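prune is slow and must never overlap a running backup against the same repo. A hypothetical cron wrapper that serializes via flock, with a DRY_RUN switch so the policy can be rehearsed; the repo path, lock path, and keep policy are placeholders, not prescriptions:

```shell
# prune_retention REPO
# Serialized forget+prune for cron. DRY_RUN=1 prints the command instead
# of running it, so the retention policy can be reviewed safely.
prune_retention() {
    repo=$1
    cmd="restic forget -r $repo --keep-daily 7 --keep-weekly 4 --keep-monthly 12 --prune"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "would run: $cmd"
        return 0
    fi
    # flock -n fails fast instead of queueing behind a stuck prune;
    # use the same lock file in your backup wrapper to prevent overlap
    flock -n /var/lock/restic-prune.lock $cmd
}
```

Run the backup itself under the same `flock` file and a prune can never start mid-backup.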
Gotcha: Backup Runs as Root but Restore Needs Different Permissions¶
Your backup cron runs as root. It captures files with root ownership. When you restore to a staging environment, everything is owned by root and the application user cannot read its own config files.
Fix:
# Restore and fix permissions in one pass
borg extract /backup/borg-repo::latest var/lib/myapp/
chown -R myapp:myapp /var/lib/myapp/
# Or use restic's --include to restore specific paths
restic restore latest --target / --include /var/lib/myapp/
chown -R myapp:myapp /var/lib/myapp/
Always have the ownership fix as an explicit step in your DR runbook.
Gotcha: Borg and restic both preserve Unix permissions and ownership by UID/GID number, not by username. If the myapp user has UID 1001 on the production server but UID 1005 on the restore target, files will be owned by the wrong user even though the restore "succeeded." Verify UID mappings match before restore, or chown explicitly afterward.
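The UID-mapping check is cheap enough to put in the restore runbook as a hard gate. A small sketch; record the expected UIDs from production (`id -u myapp`) ahead of time and check them on the restore target before extracting anything:

```shell
# uid_matches USER EXPECTED_UID
# Verifies USER resolves to EXPECTED_UID on this host before a restore.
uid_matches() {
    user=$1; expected=$2
    actual=$(id -u "$user" 2>/dev/null) || { echo "no such user: $user"; return 1; }
    if [ "$actual" != "$expected" ]; then
        echo "UID mismatch for $user: got $actual, expected $expected"
        return 1
    fi
    echo "UID ok for $user ($actual)"
}
```

Example gate in a runbook script: `uid_matches myapp 1001 || exit 1` (the name and UID here are illustrative).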
Gotcha: rsnapshot Config Uses Spaces Instead of Tabs¶
rsnapshot silently fails or produces cryptic errors when the config file uses spaces instead of tabs. The config format requires literal tab characters between fields.
Fix:
# Check for spaces where tabs should be
cat -A /etc/rsnapshot.conf | grep -v '^#' | head -20
# Tabs show as ^I, spaces show as regular spaces
# Validate config
rsnapshot configtest
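The cat -A inspection can be automated into a pre-flight check. A sketch that flags non-comment lines where a directive is followed by a space instead of a tab (the regex is a heuristic, not the full rsnapshot grammar):

```shell
# check_rsnapshot_tabs CONF
# Flags config lines whose directive is followed by a space instead of a
# tab -- the classic rsnapshot config mistake. Returns 1 if any are found.
check_rsnapshot_tabs() {
    conf=$1
    # directive name, then a literal space where a tab should be
    bad=$(grep -nE '^[a-zA-Z_]+ ' "$conf" || true)
    if [ -n "$bad" ]; then
        echo "space-separated lines (need tabs):"
        echo "$bad"
        return 1
    fi
    echo "no space-separated directives found"
}
```

Run it alongside `rsnapshot configtest`; the grep catches the whitespace problem even when configtest's error message is cryptic.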
Gotcha: Borg Passphrase Lost¶
You initialized a borg repo with --encryption=repokey but nobody documented the passphrase. The passphrase is required for every operation including restore. Your backups are now a pile of encrypted bytes.
Fix: Prevention only.
# Export the key and store it separately
borg key export /backup/borg-repo /secure/borg-key-backup.txt
# Store passphrase in a secrets manager, not in a script
# If using a password file, have borg read it via BORG_PASSCOMMAND:
chmod 600 /root/.borg-passphrase
echo 'BORG_PASSCOMMAND="cat /root/.borg-passphrase"' >> /etc/environment
Keep the key export and passphrase in different locations. If they are together, someone who compromises one location gets everything.
War story: A startup lost 6 months of backups because the borg passphrase was stored only in the engineer's head. When that engineer left the company, nobody could decrypt the repo. The passphrase wasn't in any password manager, secrets vault, or safe deposit box. Their "disaster recovery" became their disaster.
Pattern: Automated Backup Verification Pipeline¶
#!/bin/bash
# nightly-verify.sh - run after backups complete
set -euo pipefail
SLACK_WEBHOOK="${SLACK_WEBHOOK:-}"
RESTORE_DIR=$(mktemp -d)
RESULT="PASS"
# Test borg restore
LATEST=$(borg list /backup/borg-repo --last 1 --format '{archive}')
# borg extract writes into the current directory, so cd into the scratch dir
(cd "$RESTORE_DIR" && borg extract /backup/borg-repo::"$LATEST" \
    etc/passwd var/lib/postgresql/data/PG_VERSION) 2>/dev/null || RESULT="FAIL"
# Verify critical files
for f in etc/passwd var/lib/postgresql/data/PG_VERSION; do
[ -f "$RESTORE_DIR/$f" ] || RESULT="FAIL"
done
# Check backup age (fail if older than 25 hours)
BACKUP_AGE=$(borg info /backup/borg-repo::"$LATEST" --json | \
python3 -c "import sys, json, datetime; \
d = json.load(sys.stdin)['archives'][0]['start']; \
print(int((datetime.datetime.now() - datetime.datetime.fromisoformat(d)).total_seconds() / 3600))")
# use if, not &&: a failed test at top level would trip set -e
if [ "$BACKUP_AGE" -gt 25 ]; then RESULT="STALE"; fi
# Report
echo "Backup verify: $RESULT (archive=$LATEST, age=${BACKUP_AGE}h)"
# Alert on failure
if [ "$RESULT" != "PASS" ] && [ -n "$SLACK_WEBHOOK" ]; then
curl -s -X POST -H 'Content-type: application/json' "$SLACK_WEBHOOK" \
    -d "{\"text\":\"BACKUP VERIFY $RESULT: $LATEST (age: ${BACKUP_AGE}h)\"}"
fi
rm -rf "$RESTORE_DIR"
[ "$RESULT" = "PASS" ]
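Wiring the verify script into systemd gives you scheduling plus `systemctl status` visibility. A sketch; the unit names, script path, and run time are illustrative and should trail your backup window:

```ini
# /etc/systemd/system/backup-verify.service
[Unit]
Description=Nightly backup restore verification
After=backup.service

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/nightly-verify.sh

# /etc/systemd/system/backup-verify.timer
[Unit]
Description=Run backup verification after the nightly backup window

[Timer]
OnCalendar=*-*-* 05:30:00
Persistent=true

[Install]
WantedBy=timers.target
```

Enable with `systemctl enable --now backup-verify.timer`; Persistent=true reruns a missed window after downtime.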
Pattern: PostgreSQL Backup with Point-in-Time Recovery¶
# Base backup with pg_basebackup
pg_basebackup -D /backup/pg-base -Ft -z -P -U replication
# WAL archiving in postgresql.conf
# archive_mode = on
# archive_command = 'test ! -f /backup/pg-wal/%f && cp %p /backup/pg-wal/%f'
# Point-in-time recovery
# 1. Stop PostgreSQL
# 2. Replace data directory with base backup
# 3. Create recovery signal file
touch /var/lib/postgresql/data/recovery.signal
# 4. Set recovery target in postgresql.conf
# recovery_target_time = '2026-03-14 15:30:00'
# restore_command = 'cp /backup/pg-wal/%f %p'
# 5. Start PostgreSQL — it replays WAL to the target time
Debug clue: If PITR recovery stalls with "requested WAL segment has already been removed," your wal_keep_size (or wal_keep_segments in older PostgreSQL) was too small and WAL files were recycled before archiving. This is unrecoverable for that time window. Set archive_command AND monitor pg_stat_archiver.failed_count to catch archiving failures early.
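The failed_count monitoring reduces to comparing the counter between runs. A sketch that expects the two counters piped in, e.g. from `psql -Atc "SELECT archived_count, failed_count FROM pg_stat_archiver;"`; the state-file path is a placeholder:

```shell
# archiver_check STATE_FILE
# Reads "archived|failed" (psql -A format) on stdin and alerts when
# failed_count has grown since the previous run.
archiver_check() {
    state=$1
    IFS='|' read -r archived failed
    prev=$(cat "$state" 2>/dev/null || echo 0)
    echo "$failed" > "$state"
    if [ "$failed" -gt "$prev" ]; then
        echo "ALERT: WAL archive failures grew $prev -> $failed (archived=$archived)"
        return 1
    fi
    echo "OK: archived=$archived failed=$failed"
}
```

A cron line like `psql -Atc "SELECT archived_count, failed_count FROM pg_stat_archiver;" | archiver_check /var/tmp/archiver.state` pages you on the first failed archive instead of at restore time.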
Pattern: DR Failover Drill Checklist¶
PRE-DRILL:
[ ] Notify stakeholders of planned drill
[ ] Verify backup freshness (< RPO)
[ ] Confirm offsite backup accessibility
[ ] Have rollback plan if drill goes wrong
DURING DRILL:
[ ] Start timer (measuring actual RTO)
[ ] Restore from backup to standby environment
[ ] Verify data integrity (row counts, checksums)
[ ] Switch DNS/routing to standby
[ ] Run application smoke tests
[ ] Verify monitoring is working on standby
POST-DRILL:
[ ] Record actual RTO achieved
[ ] Record actual RPO (data loss window)
[ ] Document any issues encountered
[ ] Update runbook with lessons learned
[ ] Fail back to primary
[ ] Verify primary is healthy
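Measuring the actual RTO in the checklist above is easier with a tiny timer helper than with someone's wristwatch. A sketch using epoch seconds; the stamp-file path is a placeholder:

```shell
# drill_start / drill_stop: record and report elapsed wall-clock RTO.
DRILL_STAMP=${DRILL_STAMP:-/tmp/drill-start.stamp}
drill_start() {
    date +%s > "$DRILL_STAMP"
    echo "drill timer started"
}
drill_stop() {
    start=$(cat "$DRILL_STAMP")
    elapsed=$(( $(date +%s) - start ))
    echo "RTO: ${elapsed}s ($((elapsed / 60))m)"
}
```

Call `drill_start` the moment you declare the (simulated) disaster and `drill_stop` when smoke tests pass; the printed number goes straight into the post-drill record.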
One-liner: The only backup that matters is the one you have tested restoring from. Untested backups are Schrödinger's backups — they might contain your data, or they might contain corrupted garbage. You won't know until the worst possible moment.
Emergency: Complete Server Loss¶
# 1. Provision replacement server (Ansible/Terraform)
# Do NOT try to recover the dead hardware first
# 2. Install borg/restic on new server
apt-get update && apt-get install -y borgbackup
# 3. Restore from offsite backup (restic example)
export AWS_ACCESS_KEY_ID=... AWS_SECRET_ACCESS_KEY=...
restic restore latest \
-r s3:s3.amazonaws.com/my-bucket \
--target /
# 4. Fix permissions
chown -R postgres:postgres /var/lib/postgresql/
chown -R myapp:myapp /var/lib/myapp/
# 5. Start services in dependency order
systemctl start postgresql
systemctl start myapp
# 6. Verify
psql -c "SELECT count(*) FROM critical_table;"
curl -s http://localhost:8080/health
# 7. Update DNS if IP changed
# 8. Rebuild local backup pipeline to new server
# 9. Post-incident review
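Step 5 above ("start services in dependency order") usually needs a wait between services, because systemctl start returns before the service is actually ready. A generic retry helper; the probe command and timeout are whatever your stack needs:

```shell
# wait_for "CMD" TIMEOUT_SECONDS
# Polls CMD once per second until it succeeds or TIMEOUT expires.
wait_for() {
    cmd=$1; timeout=$2; waited=0
    until sh -c "$cmd" >/dev/null 2>&1; do
        waited=$((waited + 1))
        if [ "$waited" -ge "$timeout" ]; then
            echo "timed out after ${timeout}s waiting for: $cmd"
            return 1
        fi
        sleep 1
    done
    echo "ready after ${waited}s: $cmd"
}
```

Example sequencing (probe commands are illustrative): `systemctl start postgresql && wait_for "pg_isready -q" 60 && systemctl start myapp`.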
Emergency: Ransomware / Crypto Event¶
# 1. ISOLATE - disconnect affected systems from network immediately
# Do NOT shut down - you may lose forensic evidence in memory
# 2. Identify scope
find / -name "*.encrypted" -o -name "DECRYPT_README*" 2>/dev/null | head -50
# 3. Verify offsite backups are NOT compromised
# Use a CLEAN machine to check offsite backups
# Restic/borg repos are append-only if configured correctly
# 4. Check if backups were targeted
restic check -r s3:s3.amazonaws.com/my-bucket
borg check /offsite-mount/borg-repo
# 5. Restore from last known good backup to CLEAN infrastructure
# Do NOT restore to compromised machines
# 6. Change ALL credentials before bringing services online
# Backup encryption keys, cloud IAM keys, database passwords, SSH keys
# 7. Enable append-only mode for future protection
# Borg: borg config /backup/borg-repo append_only 1
# Restic: use S3 Object Lock
Quick Reference¶
- Runbook: Disaster Recovery