Backup Restore

20 cards: 🟢 4 easy | 🟡 10 medium | 🔴 6 hard

🟢 Easy (4)

1. What does the 3-2-1 backup rule require, and what does the modern 3-2-1-1-0 variant add?

3 copies of data, 2 different media types, 1 offsite copy. The modern 3-2-1-1-0 variant adds 1 immutable or air-gapped copy and 0 errors in restore testing.

Remember: 3 copies, 2 media types, 1 offsite. That covers hardware failure and site disaster; the extra immutable or air-gapped copy in 3-2-1-1-0 is what defends against ransomware.

Example: production database + local snapshot + S3 cross-region replication = 3 copies, 2 media, 1 offsite.
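Example (sketch): the rule can be checked mechanically against an inventory of copies. The `BackupCopy` type, its field names, and the media labels below are illustrative assumptions, not part of any backup tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    name: str
    media: str     # hypothetical label, e.g. "block-storage" or "object-storage"
    offsite: bool  # True if the copy lives away from the primary site


def satisfies_3_2_1(copies: list[BackupCopy]) -> bool:
    """Return True if the inventory has >= 3 copies, >= 2 media types, >= 1 offsite copy."""
    enough_copies = len(copies) >= 3
    enough_media = len({c.media for c in copies}) >= 2
    has_offsite = any(c.offsite for c in copies)
    return enough_copies and enough_media and has_offsite


copies = [
    BackupCopy("production database", "block-storage", offsite=False),
    BackupCopy("local snapshot", "block-storage", offsite=False),
    BackupCopy("S3 cross-region replica", "object-storage", offsite=True),
]
print(satisfies_3_2_1(copies))
```

Dropping the S3 replica from the list makes the check fail on all three conditions at once, which is why the offsite copy usually carries the most weight.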

2. What is the difference between RPO and RTO?

RPO (Recovery Point Objective) is the maximum acceptable data loss measured in time. RTO (Recovery Time Objective) is the maximum acceptable downtime. RPO drives backup frequency; RTO drives restore speed and automation.

Remember: RPO = how much data can you lose? RPO 1h = you can lose up to 1 hour of data. Lower RPO = more frequent backups = higher cost.

Example: RPO=0 (no data loss) requires synchronous replication. RPO=24h means daily backups suffice.
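Example (sketch): a rough way to check a schedule against an RPO. It assumes each backup captures the state at the moment it starts, so the worst-case loss is the interval between backups plus the time a backup takes to finish. The function names are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta,
                         backup_duration: timedelta = timedelta(0)) -> timedelta:
    """Worst case: a failure strikes just before the next backup completes."""
    return backup_interval + backup_duration


def meets_rpo(backup_interval: timedelta, rpo: timedelta,
              backup_duration: timedelta = timedelta(0)) -> bool:
    """Return True if the schedule keeps worst-case data loss within the RPO."""
    return worst_case_data_loss(backup_interval, backup_duration) <= rpo


# Daily backups against a 24-hour RPO, ignoring backup duration.
print(meets_rpo(timedelta(hours=24), rpo=timedelta(hours=24)))
# Hourly backups that each take 10 minutes, against a 1-hour RPO.
print(meets_rpo(timedelta(hours=1), rpo=timedelta(hours=1),
                backup_duration=timedelta(minutes=10)))
```

The second call shows why a schedule that exactly matches the RPO can still miss it once backup duration is counted.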

3. What are the three main backup types and how do they differ?

Full captures everything (slow, large). Incremental captures changes since last backup of any type (fast, small). Differential captures changes since the last full backup (medium speed and size).
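Example (sketch): the three types differ only in the cutoff they compare changes against. This toy selector uses file modification times; real tools track changes in other ways, and the function name and timeline are assumptions for illustration.

```python
def files_to_back_up(mtimes: dict[str, float], kind: str,
                     last_full: float, last_backup: float) -> list[str]:
    """Select files by modification time for the given backup type."""
    if kind == "full":
        return sorted(mtimes)              # everything, regardless of age
    if kind == "differential":
        cutoff = last_full                 # changes since the last full backup
    elif kind == "incremental":
        cutoff = last_backup               # changes since the last backup of any type
    else:
        raise ValueError(f"unknown backup type: {kind}")
    return sorted(path for path, mtime in mtimes.items() if mtime > cutoff)


# Timeline: full backup at t=100, an incremental at t=200, now t=300.
mtimes = {"a.txt": 50, "b.txt": 150, "c.txt": 250}
for kind in ("full", "differential", "incremental"):
    print(kind, files_to_back_up(mtimes, kind, last_full=100, last_backup=200))
```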

4. What are the risks of manually triggered backups versus automated scheduled backups?

Manual backups are forgotten, inconsistently timed, and create unpredictable RPO gaps. Automated backups (cron, Velero schedules, managed service policies) run reliably and can be monitored for failures. Always automate, and alert when a scheduled backup does not complete within its expected window.

🟡 Medium (10)

1. What makes Borg Backup well-suited for server backups?

Borg provides block-level deduplication (only stores unique data chunks), compression (lz4/zstd), authenticated encryption, and efficient pruning of old archives. It significantly reduces storage requirements for incremental backups.
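Example (sketch): the idea behind chunk-level deduplication, reduced to a dictionary keyed by content hash. This is a simplified model, not Borg's implementation: it uses fixed-size chunks and a tiny chunk size for readability, whereas Borg chooses chunk boundaries from the content itself.

```python
import hashlib


def store_chunks(data: bytes, store: dict[str, bytes], chunk_size: int = 4) -> list[str]:
    """Split data into chunks, keep each unique chunk once, return the chunk IDs in order."""
    chunk_ids = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        chunk_id = hashlib.sha256(chunk).hexdigest()
        store.setdefault(chunk_id, chunk)  # no-op when the chunk is already stored
        chunk_ids.append(chunk_id)
    return chunk_ids


def restore(chunk_ids: list[str], store: dict[str, bytes]) -> bytes:
    """Reassemble the original data from its ordered chunk IDs."""
    return b"".join(store[chunk_id] for chunk_id in chunk_ids)


store: dict[str, bytes] = {}
first = store_chunks(b"AAAABBBBCCCC", store)
second = store_chunks(b"AAAABBBBDDDD", store)  # only the last chunk is new
print(len(store))
```

Two 12-byte archives end up sharing two of their three chunks, so the second backup adds only one new chunk to the store.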

2. How does restic differ from Borg in backend support?

Restic natively supports object-storage backends such as S3, GCS, and Azure Blob, as well as SFTP, making it well-suited for cloud-native backup strategies. Borg primarily targets local and SSH-based repositories.

3. How do you create a scheduled Kubernetes backup with Velero?

velero schedule create daily-backup --schedule="0 2 * * *" --include-namespaces production --ttl 720h

This backs up the production namespace daily at 2 AM and keeps each backup for 30 days (720h).

4. Why are filesystem-level copies insufficient for database backups?

Databases have data in memory buffers, write-ahead logs, and complex file relationships. A raw file copy may capture an inconsistent state. Use application-consistent tools like pg_dump, mysqldump, or snapshot with fsfreeze.

5. What are the key management risks when encrypting backups, and how do you mitigate them?

If you lose the encryption key, the backup is unrecoverable — encryption without key management is worse than no encryption. Mitigate: store keys separately from backups (never in the same bucket), use a KMS or key escrow service, document key rotation procedures, and test decryption as part of restore drills.

6. How do you estimate restore time before a disaster actually happens?

Measure: backup size / network bandwidth to get transfer time, add decompression and decryption overhead (benchmark with a sample restore), add application startup and validation time. For databases, factor in WAL replay time. Document the result as tested RTO and compare against the business RTO requirement. Re-test quarterly as data grows.
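Example (sketch): the arithmetic above as a function. Modelling decompression and decryption as a fractional overhead on transfer time is an assumption; measure the real figure with a sample restore. All parameter names and the sample numbers are hypothetical.

```python
def estimate_restore_seconds(backup_bytes: float,
                             bandwidth_bytes_per_s: float,
                             processing_overhead: float = 0.0,
                             startup_seconds: float = 0.0,
                             wal_replay_seconds: float = 0.0) -> float:
    """Estimate restore time: transfer, plus processing overhead, plus fixed costs."""
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    transfer = backup_bytes / bandwidth_bytes_per_s
    # processing_overhead is a fraction of transfer time (0.25 means 25% extra).
    return transfer * (1 + processing_overhead) + startup_seconds + wal_replay_seconds


# 500 GB over a 1 Gbit/s link (125e6 bytes/s), 25% overhead,
# 10 minutes of startup and validation, 15 minutes of WAL replay.
seconds = estimate_restore_seconds(500e9, 125e6, processing_overhead=0.25,
                                   startup_seconds=600, wal_replay_seconds=900)
print(f"{seconds / 3600:.2f} hours")
```

Compare the result with the business RTO; if the estimate is already close to the limit, data growth will push it over before the next quarterly test.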

7. How do you verify backup integrity using checksums, and when should you check them?

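Tools like restic and Borg automatically store and verify a content hash for each chunk (SHA-256 in restic). Run periodic integrity checks, for example restic check or borg check weekly. By default these checks focus on repository structure; add --read-data (restic) or --verify-data (Borg) to re-read and verify the stored data itself, which is slower but catches bit-rot. For raw file backups, generate SHA-256 manifests at backup time and verify after transfer. Check on both write (detect corruption during backup) and read (detect bit-rot in storage).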
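Example (sketch): a SHA-256 manifest for raw file backups, using only the Python standard library. The function names and the manifest layout (relative path mapped to hex digest) are assumptions for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB blocks so large backups do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(1024 * 1024), b""):
            digest.update(block)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return the relative paths that are missing or whose digest no longer matches."""
    bad = []
    for rel, expected in manifest.items():
        target = root / rel
        if not target.is_file() or sha256_of(target) != expected:
            bad.append(rel)
    return bad
```

Build the manifest at backup time, store it alongside (and ideally also apart from) the backup, then run `verify_manifest` after transfer and on a schedule; an empty list means every file still matches.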

Remember: 'An untested backup is not a backup.' Schedule regular restore drills. Verify data integrity after restore. Automate restoration testing.

Gotcha: the #1 backup failure mode is discovering your backups are corrupt or incomplete during an actual emergency.

8. What is RPO and how does it influence backup strategy?

Recovery Point Objective (RPO) is the maximum acceptable data loss measured in time. A 1-hour RPO requires backups at least hourly. Lower RPO = more frequent backups or continuous replication.

9. Describe the 3-2-1 backup rule.

Keep 3 copies of data, on 2 different media types, with 1 copy offsite. This protects against hardware failure, site disaster, and media-specific corruption.

10. What is the difference between incremental and differential backups?

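Incremental backs up only changes since the last backup (any type). Differential backs up all changes since the last full backup. Incremental is faster to create but slower to restore, because a restore needs the last full backup plus every incremental taken since. A differential restore needs only the last full backup and the latest differential.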

Remember: Full = everything. Differential = changes since last full. Incremental = changes since last backup (any type). Incremental is smallest but slowest to restore.

Remember: restore speed, fastest first: Full, Differential, Incremental. Backup speed and size, fastest and smallest first: Incremental, Differential, Full. It is a trade-off.
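Example (sketch): why restore cost differs. Given backups in chronological order, this hypothetical helper returns the ones a restore of the latest state has to read: the last full, then the latest differential after it (if any), then every incremental after that.

```python
def restore_chain(backups: list[tuple[str, str]]) -> list[str]:
    """Return the names of the backups needed to restore the most recent state.

    backups is a chronological list of (name, kind) pairs, where kind is
    "full", "differential", or "incremental".
    """
    full_positions = [i for i, (_, kind) in enumerate(backups) if kind == "full"]
    if not full_positions:
        raise ValueError("cannot restore without a full backup")
    start = full_positions[-1]
    chain = [backups[start][0]]
    later = backups[start + 1:]

    # A differential already contains every change since the full backup,
    # so only the latest one matters and anything before it can be skipped.
    diff_positions = [i for i, (_, kind) in enumerate(later) if kind == "differential"]
    if diff_positions:
        last_diff = diff_positions[-1]
        chain.append(later[last_diff][0])
        later = later[last_diff + 1:]

    # Every remaining incremental must be applied, in order.
    chain.extend(name for name, kind in later if kind == "incremental")
    return chain


incremental_week = [("sun-full", "full"), ("mon-inc", "incremental"),
                    ("tue-inc", "incremental"), ("wed-inc", "incremental")]
differential_week = [("sun-full", "full"), ("mon-diff", "differential"),
                     ("tue-diff", "differential"), ("wed-diff", "differential")]
print(restore_chain(incremental_week))
print(restore_chain(differential_week))
```

The incremental schedule needs all four backups to be intact, while the differential schedule needs only two, so a single corrupt incremental breaks everything after it.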

🔴 Hard (6)

1. Why is restore testing the most important backup practice?

A backup that cannot be restored is worthless. Silent corruption, missing files, incompatible versions, or changed schemas can make backups useless. Monthly automated restore tests with validation checks are the only way to confirm recoverability.

2. Why are cloud snapshots alone not a complete backup strategy?

Snapshots reside in the same provider and often the same region. A provider outage or account compromise can take both primary data and snapshots. Snapshots must be combined with cross-region or cross-provider copies to meet the offsite requirement.

Remember: snapshots are point-in-time copies, usually copy-on-write. Fast to create but not a replacement for offsite backups.

3. What are the consequences of not monitoring backup jobs?

Backup jobs fail silently — network issues, full disks, expired credentials, or permission changes can cause failures with no alert. Without monitoring, you discover the failure only when you need to restore, which is the worst possible time.
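Example (sketch): one simple monitoring signal is the age of the last successful run. A job that fails silently stops updating that timestamp, so it eventually shows up as stale. The job names, the 26-hour threshold, and the function name are assumptions; feed the result into whatever alerting you already use.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def stale_jobs(last_success: dict[str, Optional[datetime]],
               max_age: timedelta, now: datetime) -> list[str]:
    """Return jobs whose last successful backup is missing or older than max_age."""
    stale = []
    for job, finished_at in last_success.items():
        if finished_at is None or now - finished_at > max_age:
            stale.append(job)
    return sorted(stale)


now = datetime(2024, 1, 2, 6, 0, tzinfo=timezone.utc)
last_success = {
    "db-nightly": datetime(2024, 1, 2, 2, 0, tzinfo=timezone.utc),      # 4 hours old
    "files-nightly": datetime(2023, 12, 31, 2, 0, tzinfo=timezone.utc),  # 52 hours old
    "new-job": None,                                                     # never succeeded
}
# Nightly jobs get 26 hours: one day plus a margin for slow runs.
print(stale_jobs(last_success, max_age=timedelta(hours=26), now=now))
```

Alerting on the absence of success, rather than on the presence of an error, also catches jobs that never started.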

4. How do immutable backups protect against ransomware, and how do you implement them?

Immutable backups cannot be modified or deleted for a set retention period, even by admins. Ransomware that compromises admin credentials cannot encrypt or destroy them. Implement with: S3 Object Lock (Compliance mode), Azure immutable blob storage, or air-gapped tape/offline media. Test that immutability cannot be bypassed by your own admin accounts.

Remember: immutable backups (WORM / Object Lock) prevent deletion or modification. Essential defense against ransomware that encrypts backups.
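Example (sketch): for S3 Object Lock, the retention settings travel with each upload. The helper below only builds the two lock arguments; `ObjectLockMode` and `ObjectLockRetainUntilDate` are the parameter names boto3's `put_object` accepts, while the helper name, bucket, and key are hypothetical. It assumes the bucket already has Object Lock enabled.

```python
from datetime import datetime, timedelta, timezone


def object_lock_args(retention_days: int, now: datetime) -> dict:
    """Build Object Lock arguments that make an upload immutable for retention_days."""
    if retention_days <= 0:
        raise ValueError("retention_days must be positive")
    return {
        "ObjectLockMode": "COMPLIANCE",  # Compliance mode: no account can shorten or remove it
        "ObjectLockRetainUntilDate": now + timedelta(days=retention_days),
    }


args = object_lock_args(30, datetime(2024, 1, 1, tzinfo=timezone.utc))
print(args["ObjectLockRetainUntilDate"].isoformat())

# Hypothetical usage with boto3 (not executed here; names are placeholders):
#   import boto3
#   s3 = boto3.client("s3")
#   s3.put_object(Bucket="example-backups", Key="db/backup.dump", Body=data,
#                 **object_lock_args(30, datetime.now(timezone.utc)))
```

After uploading, try to delete the locked object version with an admin account; the request should be refused until the retain-until date passes.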

5. What is Point-in-Time Recovery (PITR) for databases, and what does it require?

PITR lets you restore a database to any specific moment (e.g., 1 second before a bad DELETE). It requires: a base backup (full snapshot) plus continuous archiving of write-ahead logs (WAL in PostgreSQL, binlogs in MySQL). The base backup is restored, then the archived logs are replayed up to the target timestamp. WAL archiving must be configured before the incident.
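Example (sketch): the planning step of a PITR, as a conceptual model rather than any database's API. It picks the newest base backup taken at or before the target time and confirms the log archive reaches the target. The function name and timestamps are hypothetical, and it assumes the archive is continuous from the chosen base backup onward.

```python
from datetime import datetime


def plan_pitr(base_backups: list[datetime], wal_archived_until: datetime,
              target: datetime) -> tuple[datetime, datetime]:
    """Return (base backup to restore, time to replay logs up to)."""
    candidates = [taken_at for taken_at in base_backups if taken_at <= target]
    if not candidates:
        raise ValueError("no base backup taken at or before the target time")
    if target > wal_archived_until:
        raise ValueError("archived logs do not reach the target time")
    # The newest eligible base backup minimises the amount of log to replay.
    return max(candidates), target


base_backups = [datetime(2024, 1, 1), datetime(2024, 1, 8)]
bad_delete_at = datetime(2024, 1, 9, 14, 30, 0)
target = datetime(2024, 1, 9, 14, 29, 59)  # one second before the bad DELETE
base, replay_until = plan_pitr(base_backups,
                               wal_archived_until=datetime(2024, 1, 9, 15, 0),
                               target=target)
print(f"restore base backup from {base}, replay logs until {replay_until}")
```

Both error paths correspond to gaps that cannot be fixed after the incident: no base backup old enough, or log archiving that was never enabled.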

6. Why must backup restores be tested regularly, and what is one common failure mode?

Untested backups may be corrupt, incomplete, or incompatible with the current environment. Common failure: backup succeeds but restore fails due to changed schema, missing dependencies, or permission drift.
