Backup & Restore - Street-Level Ops¶
Real-world workflows for backing up, restoring, and verifying data across servers and Kubernetes.
Quick Backup with Restic to S3¶
# Initialize a restic repo on S3
export RESTIC_REPOSITORY=s3:s3.amazonaws.com/my-backup-bucket
export RESTIC_PASSWORD="your-repo-password"
restic init
# Backup critical directories
restic backup /etc /var/lib/postgresql /home --exclude-caches --verbose
# Output:
# Files: 12345 new, 234 changed, 98765 unmodified
# Dirs: 567 new, 23 changed, 4567 unmodified
# Added: 1.234 GiB
# processed 111344 files, 45.678 GiB in 3:45
# List snapshots
restic snapshots
# Output:
# ID Time Host Tags Paths
# a1b2c3d4 2024-03-15 02:00:01 db-prod /etc, /var/lib/postgresql, /home
# e5f6a7b8 2024-03-14 02:00:01 db-prod /etc, /var/lib/postgresql, /home
# Restore latest snapshot to a specific directory
restic restore latest --target /restore/ --include /var/lib/postgresql
# Forget old snapshots (keep 7 daily, 4 weekly, 6 monthly)
restic forget --keep-daily 7 --keep-weekly 4 --keep-monthly 6 --prune
Gotcha:
`restic forget` without `--prune` marks snapshots for deletion but does not free disk space. The actual space reclamation happens during `prune`, which rewrites pack files to remove unreferenced chunks. On large repos (multi-TB), `prune` can take hours and temporarily doubles storage usage. Schedule `prune` during maintenance windows, not immediately after every `forget`.
Under the hood: Restic deduplicates at the block level using content-defined chunking (CDC). A 50 GB database backup where only 100 MB changed stores only ~100 MB of new data. Because the `prune` step is the expensive one, run `forget` daily but `prune` weekly.
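A minimal sketch of that forget-daily / prune-weekly split as cron entries, assuming the restic environment variables live in a root-readable `/etc/restic.env` (a path chosen here purely for illustration):

```shell
# Daily at 03:00: mark expired snapshots (cheap, metadata-only)
0 3 * * *  . /etc/restic.env && restic forget --keep-daily 7 --keep-weekly 4 --keep-monthly 6

# Sunday 04:00 maintenance window: reclaim space (expensive pack rewrite)
0 4 * * 0  . /etc/restic.env && restic prune
```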
Database Backup and Restore¶
# PostgreSQL: consistent dump with custom format (compressed)
pg_dump -Fc -Z 6 mydb > "/backup/mydb_$(date +%Y%m%d_%H%M).dump"
# PostgreSQL: restore to a new database
createdb mydb_restored
pg_restore -d mydb_restored /backup/mydb_20240315_0200.dump
# PostgreSQL: restore specific tables only
pg_restore -d mydb_restored -t users -t orders /backup/mydb_20240315_0200.dump
# MySQL: consistent backup with single-transaction
mysqldump --single-transaction --routines --triggers \
--all-databases > "/backup/full_$(date +%Y%m%d).sql"
# MySQL: restore
mysql < /backup/full_20240315.sql
# Verify backup integrity (check without restoring)
pg_restore --list /backup/mydb_20240315_0200.dump > /dev/null && echo "OK" || echo "CORRUPT"
Gotcha:
`pg_dump` always runs inside a single transaction with a consistent snapshot, so a dump of one database is consistent even while writes continue. What it does not give you is consistency across databases: separate `pg_dump` runs (or `pg_dumpall`) see different snapshots. The flag to remember is MySQL's: `mysqldump` without `--single-transaction` either locks tables or produces an inconsistent dump under concurrent writes. `--single-transaction` opens a `REPEATABLE READ` transaction so the backup reflects a single point in time, but only for transactional engines like InnoDB; MyISAM tables are copied as-is.
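On the restore side, `psql --single-transaction` makes a plain-SQL restore all-or-nothing, while custom-format archives need `pg_restore`. A small sketch (the helper name and paths are made up for this example) that picks the right tool by the dump's magic bytes, since custom-format archives begin with `PGDMP`:

```shell
#!/usr/bin/env bash
# Print the restore command appropriate for a PostgreSQL dump file.
# Custom-format archives (pg_dump -Fc) start with the magic bytes "PGDMP";
# anything else is treated as a plain SQL script.
restore_cmd() {
  local dump=$1 db=$2
  if [ "$(head -c 5 "$dump" 2>/dev/null)" = "PGDMP" ]; then
    printf 'pg_restore -d %s %s\n' "$db" "$dump"
  else
    # --single-transaction makes a plain-SQL restore all-or-nothing
    printf 'psql --single-transaction -d %s -f %s\n' "$db" "$dump"
  fi
}
```

For example, `restore_cmd /backup/mydb_20240315_0200.dump mydb_restored` would print the `pg_restore` form for a custom-format dump.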
Kubernetes Backup with Velero¶
# Install Velero
velero install --provider aws --bucket my-velero-bucket \
--secret-file ./credentials --backup-location-config region=us-east-1
# Backup a namespace
velero backup create prod-backup --include-namespaces production
# Check backup status
velero backup describe prod-backup
# Output:
# Phase: Completed
# Items backed up: 234
# Warnings: 0
# Errors: 0
> **Debug clue:** Velero backup status `PartiallyFailed` usually means some resources failed to back up — often PVs with no volume snapshotter configured, or resources with custom finalizers that block deletion. Check `velero backup logs prod-backup` for the specific failures. A `PartiallyFailed` backup may still be restorable for the resources that succeeded.
# Create a scheduled backup (daily at 2 AM, keep 30 days)
velero schedule create daily-prod --schedule="0 2 * * *" \
--include-namespaces production --ttl 720h
# Restore to a different namespace
velero restore create --from-backup prod-backup \
--namespace-mappings production:production-restored
# List all backups
velero backup get
Remember: Backup strategy mnemonic: 3-2-1 — keep 3 copies of data, on 2 different media types, with 1 copy offsite. Cloud object storage (S3, GCS) counts both as a different media type from local disk and, in another region, as the offsite copy.
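The rule can even be audited mechanically. A hedged sketch over a hypothetical inventory format (`location media offsite`, one copy per line) that is assumed here, not defined anywhere in this guide:

```shell
#!/usr/bin/env bash
# Check a backup-copy inventory against the 3-2-1 rule.
# Input lines on stdin: "<location> <media> <offsite: yes|no>".
check_321() {
  awk '
    { copies++
      if (!($2 in media)) { media[$2] = 1; nmedia++ }
      if ($3 == "yes") offsite++ }
    END {
      ok = (copies >= 3 && nmedia >= 2 && offsite >= 1)
      printf "copies=%d media=%d offsite=%d => %s\n", copies, nmedia, offsite, ok ? "PASS" : "FAIL"
      exit ok ? 0 : 1
    }'
}
```

Feeding it `local-disk disk no`, `s3-us-east object yes`, `s3-us-west object yes` passes; a single local copy fails the check.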
Snapshot-Based Backups¶
# AWS EBS snapshot
aws ec2 create-snapshot \
--volume-id vol-0abc123def456 \
--description "db-daily-$(date +%Y%m%d)" \
--tag-specifications "ResourceType=snapshot,Tags=[{Key=Purpose,Value=daily-backup}]"
# Copy snapshot to another region (offsite copy)
aws ec2 copy-snapshot \
--source-region us-east-1 \
--source-snapshot-id snap-0abc123 \
--destination-region us-west-2
# LVM snapshot for consistent filesystem backup
lvcreate --size 10G --snapshot --name db-snap /dev/vg0/db-data
mount /dev/vg0/db-snap /mnt/snapshot
rsync -a /mnt/snapshot/ /backup/db-data/
umount /mnt/snapshot
lvremove -f /dev/vg0/db-snap
Default trap: LVM snapshots degrade write performance because every first write to the origin volume triggers a copy-on-write into the snapshot, and the cost compounds with each active snapshot. Always remove snapshots promptly after the backup completes. A classic snapshot that fills its allocated space becomes invalid and is dropped, so the backup you were taking from it is lost; with thin-provisioned volumes, an exhausted pool can additionally force volumes read-only.
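A hedged monitoring sketch for that trap: `lvs` exposes snapshot fill as the `data_percent` field, so a small filter can flag snapshots approaching capacity. The parser reads `lvs` output from stdin, which also makes it testable with sample text:

```shell
#!/usr/bin/env bash
# Flag LVM snapshots whose copy-on-write space is nearly exhausted.
# Expects `lvs --noheadings -o lv_name,data_percent` output on stdin.
flag_full_snapshots() {
  local threshold=${1:-80}
  awk -v max="$threshold" '
    NF == 2 && $2 + 0 > max { print "WARN: snapshot " $1 " at " $2 "% - remove or extend it" }'
}

# On a real host this would be wired up as:
#   lvs --noheadings -o lv_name,data_percent vg0 | flag_full_snapshots 80
```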
Automated Backup Script¶
#!/usr/bin/env bash
set -euo pipefail
BACKUP_DIR="/backup"
DATE=$(date +%Y%m%d_%H%M)
RETENTION_DAYS=30
# 1. Database backup
echo "Backing up PostgreSQL..."
pg_dump -Fc mydb > "${BACKUP_DIR}/db_${DATE}.dump"
# 2. Config backup
echo "Backing up configs..."
tar czf "${BACKUP_DIR}/configs_${DATE}.tar.gz" /etc/nginx /etc/app
# 3. Upload to S3
echo "Uploading to S3..."
aws s3 cp "${BACKUP_DIR}/db_${DATE}.dump" s3://my-backups/db/
aws s3 cp "${BACKUP_DIR}/configs_${DATE}.tar.gz" s3://my-backups/configs/
# 4. Cleanup old local backups
find "${BACKUP_DIR}" -type f -mtime +${RETENTION_DAYS} -delete
# 5. Verify remote backup exists
aws s3 ls "s3://my-backups/db/db_${DATE}.dump" || { echo "UPLOAD FAILED"; exit 1; }
echo "Backup complete: ${DATE}"
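One thing the script above never proves is that the uploaded bytes match the local ones. A hedged addition, assuming GNU coreutils `sha256sum`: write a checksum manifest next to each backup, ship it alongside, and verify after any download:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Record a checksum manifest for a backup file; verify it after download.
# The manifest stores only the basename, so it verifies from any directory.
write_manifest() {
  ( cd "$(dirname "$1")" && sha256sum "$(basename "$1")" ) > "$1.sha256"
}

verify_manifest() {
  ( cd "$(dirname "$1")" && sha256sum -c "$(basename "$1").sha256" )
}
```

In the script, `write_manifest "${BACKUP_DIR}/db_${DATE}.dump"` would run before the S3 upload (copying the `.sha256` file too), and the restore-test job would call `verify_manifest` right after downloading.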
Restore Testing¶
#!/usr/bin/env bash
# Monthly restore verification script
set -euo pipefail
RESTORE_DIR=$(mktemp -d /tmp/restore-test.XXXXXXXX)
trap 'rm -rf "${RESTORE_DIR}"' EXIT
# 1. Download latest backup
LATEST=$(aws s3 ls s3://my-backups/db/ | sort | tail -1 | awk '{print $4}')
aws s3 cp "s3://my-backups/db/${LATEST}" "${RESTORE_DIR}/"
# 2. Restore to test database
createdb restore_test
pg_restore -d restore_test "${RESTORE_DIR}/${LATEST}"
# 3. Validate data
ROW_COUNT=$(psql -tA -c "SELECT count(*) FROM users" restore_test)
if (( ROW_COUNT > 0 )); then
echo "PASS: ${ROW_COUNT} rows in users table"
else
echo "FAIL: users table is empty"
exit 1
fi
# 4. Cleanup
dropdb restore_test
echo "Restore test passed: ${LATEST}"
War story: A team ran backup scripts nightly for two years but never tested a restore. When they needed to recover from a ransomware attack, they discovered the backup files were encrypted with a restic password stored only in the compromised server's environment variables. The backups were useless. Store backup encryption passwords in a separate, offline secret manager (e.g., sealed envelope in a safe, or a dedicated vault instance).
Monitoring Backup Health¶
# Check when last backup ran
ls -lt /backup/*.dump | head -3
# Check backup sizes (sudden drop = problem)
ls -lh /backup/*.dump | awk '{print $5, $NF}'
# Alert if no backup in last 26 hours
LATEST_BACKUP=$(find /backup -name "*.dump" -mmin -1560 | head -1)
if [[ -z "${LATEST_BACKUP}" ]]; then
echo "ALERT: No backup in last 26 hours" | mail -s "Backup Alert" ops@company.com
fi
# Verify S3 backup count matches expected
EXPECTED=7
ACTUAL=$(aws s3 ls s3://my-backups/db/ | wc -l)
if (( ACTUAL < EXPECTED )); then
echo "WARNING: Only ${ACTUAL} backups found (expected ${EXPECTED})"
fi
One-liner: The most important backup metric is not "did the backup run?" but "when was the last successful restore test?" A backup that has never been tested is a hope, not a backup. Schedule monthly automated restore-and-verify jobs and alert when they fail.
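To make that metric alertable, the restore-test script can publish its last success time through the node_exporter textfile collector. A sketch assuming the conventional `/var/lib/node_exporter/textfile_collector` directory (parameterized, since the path varies by install):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Emit a Prometheus metric recording the last successful restore test.
# Alert when: time() - backup_last_restore_test_timestamp_seconds > 35 days.
record_restore_test() {
  local dir=${1:-/var/lib/node_exporter/textfile_collector}
  local tmp
  tmp=$(mktemp "${dir}/restore_test.prom.XXXXXX")
  {
    printf '# HELP backup_last_restore_test_timestamp_seconds Unix time of last passing restore test\n'
    printf '# TYPE backup_last_restore_test_timestamp_seconds gauge\n'
    printf 'backup_last_restore_test_timestamp_seconds %s\n' "$(date +%s)"
  } > "$tmp"
  mv "$tmp" "${dir}/restore_test.prom"   # atomic rename so the scraper never sees a partial file
}
```

Calling `record_restore_test` as the final step of the restore-test script means the metric only advances on a full pass, which is exactly the signal worth alerting on.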