Backup & Restore - Street-Level Ops¶
Real-world workflows for backing up, restoring, and verifying data across servers and Kubernetes.
Quick Backup with Restic to S3¶
# Initialize a restic repo on S3
export RESTIC_REPOSITORY=s3:s3.amazonaws.com/my-backup-bucket
export RESTIC_PASSWORD="your-repo-password"
restic init
# Backup critical directories
restic backup /etc /var/lib/postgresql /home --exclude-caches --verbose
# Output:
# Files: 12345 new, 234 changed, 98765 unmodified
# Dirs: 567 new, 23 changed, 4567 unmodified
# Added: 1.234 GiB
# processed 111344 files, 45.678 GiB in 3:45
# List snapshots
restic snapshots
# Output:
# ID Time Host Tags Paths
# a1b2c3d4 2024-03-15 02:00:01 db-prod /etc, /var/lib/postgresql, /home
# e5f6a7b8 2024-03-14 02:00:01 db-prod /etc, /var/lib/postgresql, /home
# Restore latest snapshot to a specific directory
restic restore latest --target /restore/ --include /var/lib/postgresql
# Forget old snapshots (keep 7 daily, 4 weekly, 6 monthly)
restic forget --keep-daily 7 --keep-weekly 4 --keep-monthly 6 --prune
Gotcha:
`restic forget` without `--prune` marks snapshots for deletion but does not free disk space. The actual space reclamation happens during `prune`, which rewrites pack files to remove unreferenced chunks. On large repos (multi-TB), `prune` can take hours and temporarily doubles storage usage. Schedule `prune` during maintenance windows, not immediately after every `forget`.
Under the hood: Restic deduplicates at the block level using content-defined chunking (CDC). A 50 GB database backup where only 100 MB changed stores only ~100 MB of new data. Because the `prune` step is the expensive one, run `forget` daily but `prune` weekly.
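A minimal sketch of that forget-daily / prune-weekly split as cron entries, assuming the restic environment variables live in a root-readable `/etc/restic.env` (a path chosen here purely for illustration):

```shell
# Daily at 03:00: mark expired snapshots (cheap, metadata-only)
0 3 * * *  . /etc/restic.env && restic forget --keep-daily 7 --keep-weekly 4 --keep-monthly 6

# Sunday 04:00 maintenance window: reclaim space (expensive pack rewrite)
0 4 * * 0  . /etc/restic.env && restic prune
```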
Database Backup and Restore¶
# PostgreSQL: consistent dump with custom format (compressed)
pg_dump -Fc -Z 6 mydb > "/backup/mydb_$(date +%Y%m%d_%H%M).dump"
# PostgreSQL: restore to a new database
createdb mydb_restored
pg_restore -d mydb_restored /backup/mydb_20240315_0200.dump
# PostgreSQL: restore specific tables only
pg_restore -d mydb_restored -t users -t orders /backup/mydb_20240315_0200.dump
# MySQL: consistent backup with single-transaction
mysqldump --single-transaction --routines --triggers \
--all-databases > "/backup/full_$(date +%Y%m%d).sql"
# MySQL: restore
mysql < /backup/full_20240315.sql
# Verify backup integrity (check without restoring)
pg_restore --list /backup/mydb_20240315_0200.dump > /dev/null && echo "OK" || echo "CORRUPT"
Gotcha:
`pg_dump` always runs inside a single transaction with a consistent snapshot, so a dump of one database is consistent even while writes continue. What it does not give you is consistency across databases: separate `pg_dump` runs (or `pg_dumpall`) see different snapshots. The flag to remember is MySQL's: `mysqldump` without `--single-transaction` either locks tables or produces an inconsistent dump under concurrent writes. `--single-transaction` opens a `REPEATABLE READ` transaction so the backup reflects a single point in time, but only for transactional engines like InnoDB; MyISAM tables are copied as-is.
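On the restore side, `psql --single-transaction` makes a plain-SQL restore all-or-nothing, while custom-format archives need `pg_restore`. A small sketch (the helper name and paths are made up for this example) that picks the right tool by the dump's magic bytes, since custom-format archives begin with `PGDMP`:

```shell
#!/usr/bin/env bash
# Print the restore command appropriate for a PostgreSQL dump file.
# Custom-format archives (pg_dump -Fc) start with the magic bytes "PGDMP";
# anything else is treated as a plain SQL script.
restore_cmd() {
  local dump=$1 db=$2
  if [ "$(head -c 5 "$dump" 2>/dev/null)" = "PGDMP" ]; then
    printf 'pg_restore -d %s %s\n' "$db" "$dump"
  else
    # --single-transaction makes a plain-SQL restore all-or-nothing
    printf 'psql --single-transaction -d %s -f %s\n' "$db" "$dump"
  fi
}
```

For example, `restore_cmd /backup/mydb_20240315_0200.dump mydb_restored` would print the `pg_restore` form for a custom-format dump.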
Kubernetes Backup with Velero¶
# Install Velero
velero install --provider aws --bucket my-velero-bucket \
--secret-file ./credentials --backup-location-config region=us-east-1
# Backup a namespace
velero backup create prod-backup --include-namespaces production
# Check backup status
velero backup describe prod-backup
# Output:
# Phase: Completed
# Items backed up: 234
# Warnings: 0
# Errors: 0
> **Debug clue:** Velero backup status `PartiallyFailed` usually means some resources failed to back up — often PVs with no volume snapshotter configured, or resources with custom finalizers that block deletion. Check `velero backup logs prod-backup` for the specific failures. A `PartiallyFailed` backup may still be restorable for the resources that succeeded.
# Create a scheduled backup (daily at 2 AM, keep 30 days)
velero schedule create daily-prod --schedule="0 2 * * *" \
--include-namespaces production --ttl 720h
# Restore to a different namespace
velero restore create --from-backup prod-backup \
--namespace-mappings production:production-restored
# List all backups
velero backup get
Remember: Backup strategy mnemonic: 3-2-1 — keep 3 copies of data, on 2 different media types, with 1 copy offsite. Cloud object storage (S3, GCS) counts both as a different media type from local disk and, in another region, as the offsite copy.
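The rule can even be audited mechanically. A hedged sketch over a hypothetical inventory format (`location media offsite`, one copy per line) that is assumed here, not defined anywhere in this guide:

```shell
#!/usr/bin/env bash
# Check a backup-copy inventory against the 3-2-1 rule.
# Input lines on stdin: "<location> <media> <offsite: yes|no>".
check_321() {
  awk '
    { copies++
      if (!($2 in media)) { media[$2] = 1; nmedia++ }
      if ($3 == "yes") offsite++ }
    END {
      ok = (copies >= 3 && nmedia >= 2 && offsite >= 1)
      printf "copies=%d media=%d offsite=%d => %s\n", copies, nmedia, offsite, ok ? "PASS" : "FAIL"
      exit ok ? 0 : 1
    }'
}
```

Feeding it `local-disk disk no`, `s3-us-east object yes`, `s3-us-west object yes` passes; a single local copy fails the check.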
Snapshot-Based Backups¶
# AWS EBS snapshot
aws ec2 create-snapshot \
--volume-id vol-0abc123def456 \
--description "db-daily-$(date +%Y%m%d)" \
--tag-specifications "ResourceType=snapshot,Tags=[{Key=Purpose,Value=daily-backup}]"
# Copy snapshot to another region (offsite copy)
aws ec2 copy-snapshot \
--source-region us-east-1 \
--source-snapshot-id snap-0abc123 \
--destination-region us-west-2
# LVM snapshot for consistent filesystem backup
lvcreate --size 10G --snapshot --name db-snap /dev/vg0/db-data
mount /dev/vg0/db-snap /mnt/snapshot
rsync -a /mnt/snapshot/ /backup/db-data/
umount /mnt/snapshot
lvremove -f /dev/vg0/db-snap
Default trap: LVM snapshots degrade write performance because every first write to the origin volume triggers a copy-on-write into the snapshot, and the cost compounds with each active snapshot. Always remove snapshots promptly after the backup completes. A classic snapshot that fills its allocated space becomes invalid and is dropped, so the backup you were taking from it is lost; with thin-provisioned volumes, an exhausted pool can additionally force volumes read-only.
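A hedged monitoring sketch for that trap: `lvs` exposes snapshot fill as the `data_percent` field, so a small filter can flag snapshots approaching capacity. The parser reads `lvs` output from stdin, which also makes it testable with sample text:

```shell
#!/usr/bin/env bash
# Flag LVM snapshots whose copy-on-write space is nearly exhausted.
# Expects `lvs --noheadings -o lv_name,data_percent` output on stdin.
flag_full_snapshots() {
  local threshold=${1:-80}
  awk -v max="$threshold" '
    NF == 2 && $2 + 0 > max { print "WARN: snapshot " $1 " at " $2 "% - remove or extend it" }'
}

# On a real host this would be wired up as:
#   lvs --noheadings -o lv_name,data_percent vg0 | flag_full_snapshots 80
```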
Automated Backup Script¶
#!/usr/bin/env bash
set -euo pipefail
BACKUP_DIR="/backup"
DATE=$(date +%Y%m%d_%H%M)
RETENTION_DAYS=30
# 1. Database backup
echo "Backing up PostgreSQL..."
pg_dump -Fc mydb > "${BACKUP_DIR}/db_${DATE}.dump"
# 2. Config backup
echo "Backing up configs..."
tar czf "${BACKUP_DIR}/configs_${DATE}.tar.gz" /etc/nginx /etc/app
# 3. Upload to S3
echo "Uploading to S3..."
aws s3 cp "${BACKUP_DIR}/db_${DATE}.dump" s3://my-backups/db/
aws s3 cp "${BACKUP_DIR}/configs_${DATE}.tar.gz" s3://my-backups/configs/
# 4. Cleanup old local backups
find "${BACKUP_DIR}" -type f -mtime +${RETENTION_DAYS} -delete
# 5. Verify remote backup exists
aws s3 ls "s3://my-backups/db/db_${DATE}.dump" || { echo "UPLOAD FAILED"; exit 1; }
echo "Backup complete: ${DATE}"
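One thing the script above never proves is that the uploaded bytes match the local ones. A hedged addition, assuming GNU coreutils `sha256sum`: write a checksum manifest next to each backup, ship it alongside, and verify after any download:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Record a checksum manifest for a backup file; verify it after download.
# The manifest stores only the basename, so it verifies from any directory.
write_manifest() {
  ( cd "$(dirname "$1")" && sha256sum "$(basename "$1")" ) > "$1.sha256"
}

verify_manifest() {
  ( cd "$(dirname "$1")" && sha256sum -c "$(basename "$1").sha256" )
}
```

In the script, `write_manifest "${BACKUP_DIR}/db_${DATE}.dump"` would run before the S3 upload (copying the `.sha256` file too), and the restore-test job would call `verify_manifest` right after downloading.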
Restore Testing¶
#!/usr/bin/env bash
# Monthly restore verification script
set -euo pipefail
RESTORE_DIR=$(mktemp -d /tmp/restore-test.XXXXXXXX)
trap 'rm -rf "${RESTORE_DIR}"' EXIT
# 1. Download latest backup
LATEST=$(aws s3 ls s3://my-backups/db/ | sort | tail -1 | awk '{print $4}')
aws s3 cp "s3://my-backups/db/${LATEST}" "${RESTORE_DIR}/"
# 2. Restore to test database
createdb restore_test
pg_restore -d restore_test "${RESTORE_DIR}/${LATEST}"
# 3. Validate data
ROW_COUNT=$(psql -tA -c "SELECT count(*) FROM users" restore_test)
if (( ROW_COUNT > 0 )); then
echo "PASS: ${ROW_COUNT} rows in users table"
else
echo "FAIL: users table is empty"
exit 1
fi
# 4. Cleanup
dropdb restore_test
echo "Restore test passed: ${LATEST}"
War story: A team ran backup scripts nightly for two years but never tested a restore. When they needed to recover from a ransomware attack, they discovered the backup files were encrypted with a restic password stored only in the compromised server's environment variables. The backups were useless. Store backup encryption passwords in a separate, offline secret manager (e.g., sealed envelope in a safe, or a dedicated vault instance).
Monitoring Backup Health¶
# Check when last backup ran
ls -lt /backup/*.dump | head -3
# Check backup sizes (sudden drop = problem)
ls -lh /backup/*.dump | awk '{print $5, $NF}'
# Alert if no backup in last 26 hours
LATEST_BACKUP=$(find /backup -name "*.dump" -mmin -1560 | head -1)
if [[ -z "${LATEST_BACKUP}" ]]; then
echo "ALERT: No backup in last 26 hours" | mail -s "Backup Alert" ops@company.com
fi
# Verify S3 backup count matches expected
EXPECTED=7
ACTUAL=$(aws s3 ls s3://my-backups/db/ | wc -l)
if (( ACTUAL < EXPECTED )); then
echo "WARNING: Only ${ACTUAL} backups found (expected ${EXPECTED})"
fi
One-liner: The most important backup metric is not "did the backup run?" but "when was the last successful restore test?" A backup that has never been tested is a hope, not a backup. Schedule monthly automated restore-and-verify jobs and alert when they fail.
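To make that metric alertable, the restore-test script can publish its last success time through the node_exporter textfile collector. A sketch assuming the conventional `/var/lib/node_exporter/textfile_collector` directory (parameterized, since the path varies by install):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Emit a Prometheus metric recording the last successful restore test.
# Alert when: time() - backup_last_restore_test_timestamp_seconds > 35 days.
record_restore_test() {
  local dir=${1:-/var/lib/node_exporter/textfile_collector}
  local tmp
  tmp=$(mktemp "${dir}/restore_test.prom.XXXXXX")
  {
    printf '# HELP backup_last_restore_test_timestamp_seconds Unix time of last passing restore test\n'
    printf '# TYPE backup_last_restore_test_timestamp_seconds gauge\n'
    printf 'backup_last_restore_test_timestamp_seconds %s\n' "$(date +%s)"
  } > "$tmp"
  mv "$tmp" "${dir}/restore_test.prom"   # atomic rename so the scraper never sees a partial file
}
```

Calling `record_restore_test` as the final step of the restore-test script means the metric only advances on a full pass, which is exactly the signal worth alerting on.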