Level: L1: Foundations | Topics: Backup & Restore | Domain: Security
Backup & Restore Primer¶
Why This Matters¶
Backups are your last line of defense against data loss — ransomware, human error, hardware failure, or bad deployments. But a backup you have never restored is just a hope. The discipline is not in taking backups; it is in testing restores, meeting recovery targets, and knowing exactly what is and is not protected.
Core Concepts¶
The 3-2-1 Rule¶
The foundational backup strategy:
- 3 copies of your data (1 primary + 2 backups)
- 2 different storage media/types (e.g., disk + object storage)
- 1 offsite copy (different region, different provider, or physical location)
Modern variant (3-2-1-1-0): add 1 air-gapped/immutable copy and 0 errors in restore testing.
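The rule can be checked mechanically against an inventory of copies. The sketch below is one way to do that; the inventory format (one copy per line: name, media type, location), the `onsite` keyword, and the function name `check_321` are all assumptions made for illustration, not part of any tool.

```shell
# check_321: read "name media location" lines on stdin and report whether
# the inventory satisfies 3 copies, 2 media types, and 1 offsite copy.
# Any location other than the hypothetical keyword "onsite" counts as offsite.
check_321() {
    awk '
        NF >= 3 {
            copies++
            media[$2] = 1
            if ($3 != "onsite") offsite++
        }
        END {
            for (m in media) types++
            if (copies >= 3 && types >= 2 && offsite >= 1) {
                print "PASS"
            } else {
                printf "FAIL copies=%d media=%d offsite=%d\n", copies, types, offsite
                exit 1
            }
        }
    '
}

# Example inventory: primary on local disk, one local repo, one S3 copy.
printf '%s\n' \
    'primary disk onsite' \
    'borg-repo disk onsite' \
    's3-copy object-storage us-west-2' | check_321    # prints: PASS
```

The non-zero exit status on failure lets the check run from cron or CI and raise an alert when a copy is retired without a replacement.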
RPO and RTO¶
| Metric | Definition | Example |
|---|---|---|
| RPO (Recovery Point Objective) | Maximum acceptable data loss | RPO = 1 hour means you can lose at most 1 hour of data |
| RTO (Recovery Time Objective) | Maximum acceptable downtime | RTO = 4 hours means service must be back within 4 hours |
RPO determines backup frequency. RTO determines restore speed and automation level. Both are business decisions, not technical ones.
Remember: "RPO = how much data you can lose. RTO = how long you can be down." Mnemonic: RPO has a P for Point (point in time you recover to). RTO has a T for Time (time to recover). An RPO of 0 requires synchronous replication. An RTO of 0 requires active-active architecture. Both are expensive — define what the business actually needs.
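To see how RPO drives backup frequency, it helps to work out the worst case for a periodic schedule. The sketch below assumes each backup captures the state at its start time and only becomes usable when it finishes; the function names are hypothetical.

```shell
# worst_case_loss: worst-case data loss (seconds) for periodic backups.
# Under the stated assumption, a failure just before the next backup
# completes loses interval + duration seconds of data.
worst_case_loss() {
    interval=$1
    duration=$2
    echo $(( interval + duration ))
}

# meets_rpo RPO INTERVAL DURATION: succeed if the worst case fits the RPO.
meets_rpo() {
    rpo=$1
    [ "$(worst_case_loss "$2" "$3")" -le "$rpo" ]
}

# Hourly backups that take 10 minutes, checked against a 1-hour RPO.
if meets_rpo 3600 3600 600; then
    echo "RPO met"
else
    echo "RPO missed: worst case is $(worst_case_loss 3600 600)s"
fi
# prints: RPO missed: worst case is 4200s
```

An hourly schedule therefore does not quite meet a 1-hour RPO once backup duration is counted; the interval has to be shorter than the target.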
Gotcha: The most dangerous backup assumption is "we have backups" when nobody has ever tested a restore. Restores commonly fail because of corruption, missing dependencies, or changed schemas. Test restores quarterly at minimum. For databases, confirm that the restored data is actually queryable, not just that the file was copied.
Backup Types¶
| Type | What It Captures | Backup Speed | Storage Used |
|---|---|---|---|
| Full | Everything | Slow | Large |
| Incremental | Changes since last backup (any type) | Fast | Small |
| Differential | Changes since last full backup | Medium | Medium |
| Snapshot | Point-in-time filesystem/volume state | Very fast | Varies |
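The definitions in the table determine which backups a restore needs. The following sketch computes that restore chain from a chronological list of backup types (`F`, `I`, `D`); the function name `restore_chain` and the single-letter encoding are assumptions for illustration.

```shell
# restore_chain TARGET TYPE...: print the 1-based positions of the backups
# needed to restore to position TARGET, given backup types in
# chronological order (F = full, I = incremental, D = differential).
restore_chain() {
    target=$1
    shift
    printf '%s\n' "$@" | awk -v target="$target" '
        { type[NR] = $1 }
        END {
            i = target
            while (i >= 1) {
                chain = (chain == "" ? i : i " " chain)
                if (type[i] == "F") break
                if (type[i] == "D") {
                    # A differential only needs the most recent full.
                    i--
                    while (i >= 1 && type[i] != "F") i--
                } else {
                    # An incremental needs the backup right before it.
                    i--
                }
            }
            print chain
        }
    '
}

# Full, then three incrementals: restoring the last one needs all four.
restore_chain 4 F I I I      # prints: 1 2 3 4
# Full, then three differentials: restoring the last one needs two.
restore_chain 4 F D D D      # prints: 1 4
```

This is the trade-off behind the table: incrementals are quick to take but lengthen the restore chain, while differentials grow over time but restore from only two backups.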
Backup Tools¶
Borg Backup¶
Name origin: BorgBackup is named after the Borg from Star Trek — the cybernetic collective that assimilates everything. Fitting for a deduplicating backup tool that absorbs data efficiently. It was forked from Attic in 2015. The key innovation is content-defined chunking with a rolling hash (Buzhash), which means small changes to a large file only result in a few new chunks being stored, not a full copy.
Deduplicating, compressed, encrypted backup tool. Excellent for server-side backups:
# Initialize a repository
borg init --encryption=repokey /backup/repo
# Create a backup
borg create /backup/repo::daily-{now:%Y-%m-%d} /etc /var/lib/postgresql /home \
    --exclude '*.tmp' --exclude '/home/*/.cache'
# List archives
borg list /backup/repo
# Restore a specific archive (borg extract has no target option; it writes into the current directory)
cd /restore && borg extract /backup/repo::daily-2024-01-15
# Prune old backups (keep 7 daily, 4 weekly, 6 monthly)
borg prune --keep-daily=7 --keep-weekly=4 --keep-monthly=6 /backup/repo
# Borg 1.2 and later: run "borg compact /backup/repo" afterwards to actually free the space
Key features: chunk-level deduplication (content-defined chunking), compression (lz4/zstd), authenticated encryption.
Restic¶
Similar to Borg but with native cloud backend support (S3, GCS, Azure Blob, SFTP):
# Initialize a repo on S3
restic init --repo s3:s3.amazonaws.com/my-backup-bucket
# Backup
restic backup /etc /var/lib/postgresql \
    --repo s3:s3.amazonaws.com/my-backup-bucket \
    --exclude-caches
# List snapshots
restic snapshots --repo s3:s3.amazonaws.com/my-backup-bucket
# Restore
restic restore latest --target /restore/ \
    --repo s3:s3.amazonaws.com/my-backup-bucket
# Forget and prune
restic forget --keep-daily 7 --keep-weekly 4 --prune \
    --repo s3:s3.amazonaws.com/my-backup-bucket
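The "keep N daily" retention flags used by both Borg and restic can be modelled in a few lines. This is a simplified sketch of the selection logic only, assuming ISO 8601 timestamps as input; the function name `keep_daily` is hypothetical, and the real tools combine daily rules with weekly and monthly ones.

```shell
# keep_daily N: read ISO 8601 timestamps (one per line) on stdin and print
# the ones a "keep N daily" rule would retain: the newest backup from each
# of the N most recent days that have any backup.
keep_daily() {
    n=$1
    sort -r | awk -v n="$n" '
        {
            day = substr($1, 1, 10)      # YYYY-MM-DD prefix
            if (!(day in seen) && kept < n) {
                seen[day] = 1
                kept++
                print $1
            }
        }
    '
}

printf '%s\n' \
    2024-01-13T02:00:00 \
    2024-01-14T02:00:00 \
    2024-01-14T14:00:00 \
    2024-01-15T02:00:00 | keep_daily 2
# prints:
# 2024-01-15T02:00:00
# 2024-01-14T14:00:00
```

Days without any backup do not consume a slot in this model, so a week of failed jobs does not cause the older good backups to be pruned sooner.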
Velero (Kubernetes)¶
Name origin: Velero is Spanish for "sailboat", a nod to Heptio's nautical branding (Heptio was the company founded by Kubernetes co-creators Joe Beda and Craig McLuckie). VMware acquired Heptio in 2018. Velero was originally called "Ark" but was renamed to avoid trademark conflicts. It backs up both Kubernetes resource definitions (YAML) and persistent volume data, storing everything in object storage (S3, GCS, Azure Blob).
Backup and restore for Kubernetes cluster resources and persistent volumes:
# Install Velero with the AWS provider (Velero 1.2+ needs the provider plugin;
# AWS_PLUGIN_VERSION is a placeholder for the plugin release matching your Velero version)
velero install --provider aws \
    --plugins "velero/velero-plugin-for-aws:${AWS_PLUGIN_VERSION}" \
    --bucket my-velero-bucket \
    --secret-file ./credentials \
    --backup-location-config region=us-east-1
# Create a backup of a namespace
velero backup create staging-backup --include-namespaces staging
# Create a scheduled backup (daily at 02:00, kept for 720h = 30 days)
velero schedule create daily-backup --schedule="0 2 * * *" \
    --include-namespaces production --ttl 720h
# Restore from backup
velero restore create --from-backup staging-backup
# Restore to a different namespace
velero restore create --from-backup staging-backup \
    --namespace-mappings staging:staging-restored
# List backups
velero backup get
Snapshot Strategies¶
Snapshots (LVM/Cloud)¶
# LVM snapshot for consistent backup
lvcreate --size 10G --snapshot --name db-snap /dev/vg0/db-data
# AWS EBS snapshot
aws ec2 create-snapshot --volume-id vol-abc123 --description "db-daily"
Snapshots alone are not backups: an LVM snapshot lives in the same volume group as its origin, and a cloud snapshot stays in the same provider and region. Combine them with cross-region or off-host copies.
Database Backups¶
Gotcha: Copying database files while the database is running (e.g., `cp /var/lib/postgresql/` while PostgreSQL is active) produces an inconsistent, often unusable backup. The database has in-flight transactions, dirty buffers, and WAL entries that are not on disk yet. You need one of: (1) a logical dump (`pg_dump`) that reads consistent data through the database engine, (2) a filesystem snapshot taken while the database is quiesced (`FLUSH TABLES WITH READ LOCK` for MySQL), or (3) a base backup plus continuous WAL archiving for point-in-time recovery. Never `cp` a live database directory.
Databases need application-consistent backups, not just file copies:
# PostgreSQL logical backup
pg_dump -Fc mydb > mydb_$(date +%Y%m%d).dump
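# PostgreSQL restore (the target database must already exist)
pg_restore -d mydb mydb_20240115.dump
# MySQL (--single-transaction gives a consistent dump for InnoDB tables only)
mysqldump --single-transaction --all-databases > full_$(date +%Y%m%d).sql
# Point-in-time recovery (PostgreSQL WAL archiving)
# The lines below are postgresql.conf settings, not shell commands
wal_level = replica
archive_mode = on
# Refuse to overwrite a WAL segment that has already been archived
archive_command = 'test ! -f /backup/wal/%f && cp %p /backup/wal/%f'
# Used during recovery to fetch archived segments
restore_command = 'cp /backup/wal/%f %p'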
Ransomware-Resilient Backup Design¶
Ransomware encrypts your data and demands payment. Your backups are the primary defense — but only if the attacker cannot reach them.
Immutable Backups¶
Storage that cannot be modified or deleted for a defined retention period:
# AWS S3 Object Lock (compliance mode: no user, including root, can delete or
# overwrite a locked object version before retention expires; requires a versioned bucket)
aws s3api put-object-lock-configuration \
    --bucket my-backup-bucket \
    --object-lock-configuration '{"ObjectLockEnabled":"Enabled","Rule":{"DefaultRetention":{"Mode":"COMPLIANCE","Days":30}}}'
# Restic with S3 Object Lock
restic backup /data --repo s3:s3.amazonaws.com/my-backup-bucket
# Objects are immutable for 30 days — ransomware can't delete them
# Borg with append-only repository (remote server restricts to append)
# In ~/.ssh/authorized_keys on backup server:
# command="borg serve --append-only --restrict-to-path /backup/repo" ssh-rsa AAAA...
Air-Gapped Copies¶
A backup that an attacker on your production network cannot reach, either because it is physically disconnected or because it sits behind a separate trust boundary:
- Tape: LTO tape stored offsite (still used in enterprise)
- Offline disk: USB drive rotated weekly, stored in a safe
- Cloud with separate credentials: Different provider, different account, different auth chain
- Pull-based backup: Backup server pulls data from production (production can't write to backup)
Retention That Survives Encryption¶
Ransomware may sit dormant for weeks before activating. Your retention must outlast the dwell time:
- Keep at least 30 days of daily backups
- Keep at least 3 months of weekly backups
- Test restoration from oldest available backup, not just latest
- Monitor backup sizes — sudden size changes may indicate encrypted data being backed up
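Size monitoring can start as a simple comparison of the newest backup against the recent average. The sketch below assumes sizes are supplied oldest to newest in any consistent unit; the function name `size_anomaly` and the 30% threshold are arbitrary choices for illustration, not recommendations.

```shell
# size_anomaly THRESHOLD_PCT SIZE_OLDEST ... SIZE_NEWEST
# Compare the newest size with the average of the earlier ones and flag a
# change (in either direction) larger than THRESHOLD_PCT percent.
size_anomaly() {
    threshold=$1
    shift
    printf '%s\n' "$@" | awk -v t="$threshold" '
        { size[NR] = $1 }
        END {
            if (NR < 2) { print "not enough history"; exit 2 }
            for (i = 1; i < NR; i++) sum += size[i]
            avg = sum / (NR - 1)
            if (avg == 0) { print "no baseline"; exit 2 }
            change = (size[NR] - avg) / avg * 100
            if (change < 0) change = -change
            if (change > t) {
                printf "ANOMALY: latest size differs from average by %.0f%%\n", change
                exit 1
            }
            print "OK"
        }
    '
}

# Sizes in MB for five nightly backups; the last one jumps.
size_anomaly 30 1000 1010 990 1000 1600 || echo "alert would fire here"
# prints:
# ANOMALY: latest size differs from average by 60%
# alert would fire here
```

With deduplicating tools, the size of newly added data per run is usually a more sensitive signal than total repository size, because encrypted files no longer deduplicate or compress.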
The Ransomware Backup Checklist¶
- Backups are immutable (S3 Object Lock, append-only, WORM storage)
- At least one copy is air-gapped or on a separate credential chain
- Retention covers 30+ days (beyond typical ransomware dwell time)
- Backup credentials are separate from production credentials
- Restore tested monthly from a backup older than 7 days
- Backup integrity monitoring alerts on anomalies (size, duration, error rate)
- Backup network is segmented from production network
Restore Testing¶
Restore testing is the most important backup practice. Automate a monthly restore to a temporary location, validate that critical files exist, and check data integrity. Track the results over time. A backup that fails to restore is not a backup.
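One way to automate the integrity check is to record checksums at backup time and verify them after each test restore. This is a minimal sketch assuming GNU coreutils (`sha256sum`) and file names without newlines; the function names `make_manifest` and `check_restore` are hypothetical.

```shell
# make_manifest DIR: print SHA-256 checksums for every file under DIR, with
# paths relative to it. Run at backup time and store with the backup.
make_manifest() {
    ( cd "$1" && find . -type f -exec sha256sum {} + )
}

# check_restore MANIFEST DIR: verify a restored tree against a manifest.
# Succeeds only if every listed file exists and its checksum matches.
check_restore() {
    ( cd "$2" && sha256sum -c - ) < "$1"
}

# Demo with throwaway directories standing in for source and restore target.
src=$(mktemp -d) && dst=$(mktemp -d)
echo "critical data" > "$src/app.conf"
cp "$src/app.conf" "$dst/app.conf"
make_manifest "$src" > "$src.manifest"
check_restore "$src.manifest" "$dst" && echo "restore verified"
rm -rf "$src" "$dst" "$src.manifest"
```

Checksums only prove that files came back intact. For databases, a follow-up query against the restored instance is still needed to show the data is usable.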
Common Pitfalls¶
- No restore testing: The backup works until you need it — then you discover it does not
- Backing up the container, not the data: Containers are ephemeral; back up persistent volumes
- Ignoring RPO/RTO: Daily backups against a 1-hour RPO requirement are a gap, not a strategy
- Single-region backups: A regional or provider outage takes out your data and your backups together
- No encryption: Backup media is a theft target — encrypt at rest and in transit
- No monitoring: Backup jobs fail silently; alert on missed or failed backups
- Snapshot-only strategy: Snapshots in the same provider are not offsite copies
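For the monitoring pitfall, a freshness check that runs independently of the backup job catches jobs that fail silently or never start. The sketch below assumes backups land as files in a directory and that `find` supports `-mmin` (GNU, BSD, and BusyBox do); the path `/backup/daily` and the function name `backup_is_fresh` are placeholders.

```shell
# backup_is_fresh DIR MAX_AGE_MIN: succeed if DIR contains at least one file
# modified within the last MAX_AGE_MIN minutes.
backup_is_fresh() {
    dir=$1
    max_age_min=$2
    [ -n "$(find "$dir" -type f -mmin "-$max_age_min" 2>/dev/null | head -n 1)" ]
}

# 1500 minutes is 25 hours, which gives a daily job one hour of slack.
if backup_is_fresh /backup/daily 1500; then
    echo "backup is fresh"
else
    echo "ALERT: no backup in /backup/daily within the last 25 hours"
fi
```

Running this from a different host or scheduler than the backup job itself avoids the case where both the job and its monitor die together.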