Disaster Recovery¶
20 cards — 🟢 4 easy | 🟡 9 medium | 🔴 7 hard
🟢 Easy (4)¶
1. What do RPO and RTO mean?
Show answer
RPO (Recovery Point Objective) is how much data you can afford to lose, measured as the time window before the disaster. RTO (Recovery Time Objective) is how long you can be down before business impact, measured as the time window after the disaster until service is restored.
2. What is the 3-2-1 backup rule?
Show answer
Keep 3 copies of your data, on 2 different storage media, with 1 copy offsite. For example: production database (live), local backup server (Borg repo on ZFS), and offsite S3 bucket (restic to Backblaze B2).
3. Why is the statement "if you have never restored from your backups, you do not have backups" important?
Show answer
The most common DR failure is backing up but never testing restores. Backups can silently fail, produce corrupt archives, or miss critical files. Only a verified restore proves the backup works. Schedule monthly restore drills.
4. How often should you test disaster recovery and what should each test cover?
Show answer
Minimum quarterly for critical services, annually for others. Tests should include: full restore from backup to a clean environment, verification that restored data is complete and consistent, timing the restore to confirm it meets RTO, testing failover to standby systems, and validating that runbooks are accurate and current. Document results and fix any gaps immediately.
🟡 Medium (9)¶
1. How does Borg backup achieve space efficiency through deduplication?
Show answer
Borg splits data into chunks and stores each unique chunk only once. If a 50 GB backup changes only 10 GB between runs, only the new 10 GB of changed chunks are stored. At that change rate, three 50 GB logical backups occupy only about 70 GB of physical storage (50 GB initial plus two 10 GB increments).
2. How does restic differ from Borg and when would you choose it?
Show answer
Restic natively supports cloud backends (S3, B2, Azure, GCS) and is simpler to set up for offsite backups. Choose restic when your primary need is pushing backups to cloud storage. Both support deduplication and encryption, but Borg is faster for local or LAN-based backups.
3. What are the four DR tiers and their corresponding RPO/RTO targets?
Show answer
Tier 1 (critical): RPO < 1 hour, RTO < 1 hour, uses real-time replication and hot standby.
Tier 2 (important): RPO < 4 hours, RTO < 4 hours, uses frequent snapshots and warm standby.
Tier 3 (standard): RPO < 24 hours, RTO < 24 hours, uses daily backups and cold restore.
Tier 4 (archival): RPO < 7 days, RTO < 72 hours, uses weekly offsite backups.
4. What are the essential sections of a DR runbook?
Show answer
Classification (RPO, RTO, tier, last tested, owner), Dependencies (upstream and downstream services), Backup Details (tool, schedule, retention, location, encryption), Recovery Procedure (step-by-step restore, verification, DNS cutover, notifications), Failover Procedure (promote standby, redirect traffic, verify integrity), and Post-Recovery Checklist.
5. How do you determine appropriate RTO and RPO targets for a service?
Show answer
Start from business impact: revenue loss per hour of downtime, contractual SLAs, regulatory requirements, and customer tolerance. A payment system may need RPO < 1 minute and RTO < 15 minutes, while an internal wiki may tolerate RPO < 24 hours and RTO < 8 hours. Document these in the service's DR runbook and validate that your backup frequency and restore speed actually meet the targets through regular testing.
6. What are the main database DR strategies and their RPO implications?
Show answer
Synchronous replication: RPO = 0 (zero data loss) but adds write latency.
Asynchronous replication: RPO = replication lag (seconds to minutes), better performance.
Periodic backups: RPO = backup interval (hours).
Point-in-time recovery (WAL archiving for PostgreSQL, binlog for MySQL): RPO = seconds, restores to any moment.
Best practice: combine async replication for fast failover with WAL archiving for point-in-time recovery.
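As a concrete sketch of the WAL-archiving option, PostgreSQL's PITR is enabled with a handful of settings in postgresql.conf (the archive path here is a placeholder):

```ini
# postgresql.conf — archive every completed WAL segment (path is illustrative)
wal_level = replica
archive_mode = on
archive_command = 'test ! -f /backup/wal/%f && cp %p /backup/wal/%f'
```

Paired with a periodic base backup (pg_basebackup), the archived WAL lets you replay the database to any moment since that backup.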
7. What makes a DR runbook effective and how do you keep it current?
Show answer
Effective runbooks have: exact commands (not "restore the database" but "pg_restore -d mydb -F c /backup/latest.dump"), contact info for escalation, decision trees for different failure modes, and verification steps after each action. Keep current by: updating after every DR test, requiring runbook review during on-call handoffs, versioning in git alongside the service code, and including a "last tested" date prominently at the top.
8. What is the trade-off between point-in-time recovery (PITR) and snapshot restore?
Show answer
Snapshots are fast to restore but only recover to the snapshot moment — you lose all changes after it. PITR replays WAL/binlog from a base backup to any point in time, recovering more data but taking longer.
Best practice: use snapshots for fast RTO, PITR for precise RPO.
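For instance, a PostgreSQL (12+) point-in-time restore points a restored base backup at the WAL archive and names the target moment; the timestamp and path below are placeholders:

```ini
# postgresql.conf on the restored base backup; then create an empty
# recovery.signal file and start the server to begin WAL replay
restore_command = 'cp /backup/wal/%f %p'
recovery_target_time = '2024-05-01 13:59:00'
recovery_target_action = 'promote'
```

Replay stops at the named timestamp and the server promotes, which is how you recover to just before a bad transaction rather than to the last snapshot.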
9. Why should you test disaster recovery plans with controlled failure injection?
Show answer
DR plans that are never tested tend to fail when needed — stale credentials, changed schemas, missing runbook steps. Chaos-DR exercises (killing a primary database, simulating region outage) validate the plan end-to-end and expose gaps before a real disaster. Schedule quarterly at minimum.
🔴 Hard (7)¶
1. What are the most dangerous pitfalls in disaster recovery engineering?
Show answer
Backing up to the same disk as production (both die together), no encryption on offsite backups, forgetting to back up config alongside data (TLS certs, secrets), backup window exceeding backup interval (creating gaps), not versioning DR runbooks (referencing decommissioned servers), single point of failure in the backup pipeline, and retention too aggressive (3 days of backups when corruption happened 4 days ago).
2. Describe an automated backup verification script pattern.
Show answer
Extract the latest Borg archive to a temp directory. Verify critical files exist (e.g., etc/hosts, PostgreSQL PG_VERSION). Run pg_verifybackup if available. Check file sizes and counts against expected baselines. Output PASS/FAIL with details. Clean up the temp directory. Schedule this as a cron job or CI pipeline to run after each backup completes.
3. What key decisions must be made when designing offsite backup replication?
Show answer
Bandwidth (large initial seeds may need physical media), encryption (always encrypt before leaving your network — Borg and restic do this by default), retention policy (offsite retention is usually longer than local), cost (S3 Standard vs Glacier vs B2 can differ by 10x), and restore speed (Glacier retrieval takes hours, which must be factored into your RTO).
4. What is the difference between active-passive and active-active DR architectures?
Show answer
Active-passive: standby site receives replicated data but serves no traffic until failover. Lower cost, simpler, but risk of stale standby and longer failover time.
Active-active: both sites serve traffic simultaneously with data replication between them. Faster failover (just remove one site from DNS/LB) but requires conflict resolution for writes and costs roughly 2x. Active-active is preferred for services with near-zero RTO requirements.
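The difference in failover behaviour can be sketched with a toy model (illustrative only, not any real DNS/LB API):

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    healthy: bool = True


class ActiveActive:
    """All sites serve traffic; failover = drop the unhealthy site from rotation."""

    def __init__(self, sites):
        self.sites = sites

    def serving(self):
        return [s.name for s in self.sites if s.healthy]


class ActivePassive:
    """Standby is idle; failover = promote it once the primary is down."""

    def __init__(self, primary, standby):
        self.primary, self.standby = primary, standby

    def serving(self):
        return [self.primary.name] if self.primary.healthy else [self.standby.name]


# Region failure: active-active simply stops routing to the dead site...
aa = ActiveActive([Site("us-east"), Site("us-west")])
aa.sites[0].healthy = False
print(aa.serving())  # ['us-west']

# ...while active-passive must detect the failure and promote the standby.
ap = ActivePassive(Site("primary"), Site("standby"))
ap.primary.healthy = False
print(ap.serving())  # ['standby']
```

What the toy model hides is the hard part: in active-active both sites were accepting writes, so dropping one still requires conflict resolution, while in active-passive the promotion step must also verify replication had caught up.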
5. What should a DR communication plan include?
Show answer
Internal notification chain (who is informed and in what order), customer communication templates (pre-written for common scenarios), status page update procedures, regulatory notification requirements (e.g., GDPR 72-hour breach notification), partner/vendor contact list, and executive briefing cadence during extended outages. Pre-drafted templates save critical minutes during an actual disaster when stress is high.
6. What are the key considerations for cloud region failover DR?
Show answer
Data replication lag between regions (async replication = potential data loss). DNS TTL (low TTLs enable faster failover but increase DNS query load). Stateful services (databases, caches) are the hardest to fail over. Infrastructure-as-code ensures the standby region can be provisioned identically. Test regularly because cross-region permissions, quotas, and AMI availability often differ. Cost: running warm standby in a second region roughly doubles infrastructure spend.
7. What causes split-brain in active-active database failover and how do you resolve it?