Runbook: Disaster Recovery Plan¶
Overview¶
This runbook covers recovery from major failures. Follow the appropriate section based on the failure type.
Failure Scenarios¶
Scenario 1: Single Node Failure¶
Impact: Pods rescheduled to other nodes. Temporary capacity reduction.
Recovery:
# 1. Check node status
kubectl get nodes
kubectl describe node <failed-node>
# 2. Pods should auto-reschedule if using Deployments/StatefulSets
kubectl get pods -A -o wide | grep <failed-node>
# 3. If node is recoverable:
# - Reboot and wait for kubelet to rejoin
# - Check: kubectl get node <name> (should become Ready)
# 4. If node is dead:
kubectl delete node <failed-node>
# - Provision replacement node
# - Join to cluster
RTO: Minutes (auto-reschedule). RPO: Zero for stateless, depends on storage for stateful.
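Step 3 above can be automated with a small polling helper. This is a hedged sketch, not part of the official recovery tooling: the node name and timeout are placeholders, and it assumes `kubectl` is configured against the cluster.

```shell
#!/usr/bin/env bash
# Sketch: wait for a rebooted node to report Ready, with a timeout.
# "$1" is the node name; "$2" (optional) is the timeout in seconds.
wait_for_node_ready() {
  local node="$1" timeout="${2:-300}" elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    # Read the Ready condition straight from the node object
    status=$(kubectl get node "$node" \
      -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' 2>/dev/null)
    if [ "$status" = "True" ]; then
      echo "node $node is Ready"
      return 0
    fi
    sleep 10; elapsed=$((elapsed + 10))
  done
  echo "timed out waiting for $node" >&2
  return 1
}
```

Call it as `wait_for_node_ready <failed-node> 600` after the reboot; a nonzero exit means the node never rejoined and you should fall back to step 4.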
Scenario 2: Control Plane Failure¶
Impact: Cannot create/modify resources. Existing workloads keep running.
Recovery:
# 1. Check control plane components
kubectl get pods -n kube-system
# 2. If kubeadm: check static pod manifests
ls /etc/kubernetes/manifests/
# 3. Check etcd health (cert paths shown are the kubeadm defaults; adjust if yours differ)
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health
# 4. Check API server logs
# (journalctl only works if kube-apiserver runs as a systemd unit;
#  on kubeadm it is a static pod, so use crictl instead)
journalctl -u kube-apiserver --since "10 min ago"    # systemd-managed clusters
crictl logs $(crictl ps --name kube-apiserver -q)    # kubeadm static pod
# 5. If etcd is corrupted: restore from backup (see etcd_backup_restore.md)
# 6. If certs expired: renew with kubeadm, then restart the control plane
#    components so they pick up the new certificates
kubeadm certs renew all
RTO: 15-60 minutes. RPO: Last etcd backup.
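Steps 1-2 can be wrapped in a quick triage helper. A minimal sketch, assuming a kubeadm layout; the manifest path defaults to the kubeadm location but is parameterized so it can be overridden:

```shell
#!/usr/bin/env bash
# Hypothetical triage sketch for a kubeadm control plane:
# verify the static pod manifests exist and the API server answers /readyz.
check_control_plane() {
  local manifests="${1:-/etc/kubernetes/manifests}"
  local f
  for f in kube-apiserver kube-controller-manager kube-scheduler etcd; do
    if [ -f "$manifests/$f.yaml" ]; then
      echo "manifest present: $f"
    else
      echo "manifest MISSING: $f" >&2
    fi
  done
  # /readyz returns ok when the API server is healthy
  if kubectl get --raw='/readyz' >/dev/null 2>&1; then
    echo "apiserver: ready"
  else
    echo "apiserver: NOT ready" >&2
  fi
}
```

A missing manifest points at step 2 (someone moved or deleted it); manifests present but the API server not ready points at steps 3-6 (etcd, logs, certs).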
Scenario 3: Full Cluster Loss¶
Impact: Everything is gone. Total rebuild needed.
Recovery:
# 1. Provision new cluster infrastructure (Terraform)
terraform apply
# 2. Bootstrap new cluster (kubeadm/k3s)
# 3. Install core infrastructure (cert-manager, ingress, monitoring)
# 4. Restore etcd from backup OR re-deploy via GitOps
# Option A: etcd restore (fastest if backup is recent)
# See etcd_backup_restore.md
# Option B: GitOps re-deploy (more reliable if backup is old)
# - Install ArgoCD
# - Point to Git repo
# - Sync all applications
# - Restore data from Velero/database backups
# 5. Restore application data
velero restore create --from-backup <latest>
# 6. Update DNS to point to new cluster
# 7. Verify all services
RTO: 1-4 hours (automated), 4-8 hours (manual). RPO: Last backup.
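Step 7 ("verify all services") can be partially scripted. The sketch below, an assumption rather than an official check, flags any Deployment whose ready replica count lags its desired count after the rebuild:

```shell
#!/usr/bin/env bash
# Sketch: after a rebuild, verify every Deployment reached its desired
# replica count. Exits nonzero and lists any Deployment still lagging.
verify_deployments() {
  local failed=0
  while read -r ns name desired ready; do
    # readyReplicas is absent until at least one pod is ready
    if [ "${ready:-0}" != "$desired" ]; then
      echo "NOT READY: $ns/$name (${ready:-0}/$desired)" >&2
      failed=1
    fi
  done < <(kubectl get deploy -A -o jsonpath='{range .items[*]}{.metadata.namespace}{" "}{.metadata.name}{" "}{.spec.replicas}{" "}{.status.readyReplicas}{"\n"}{end}')
  return $failed
}
```

Run it after the GitOps sync or Velero restore settles; pair it with spot checks of ingress and DNS, which this script does not cover.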
Scenario 4: Namespace Accidentally Deleted¶
Impact: All resources in the namespace are gone.
Recovery:
# 1. Restore from Velero backup
velero restore create ns-restore \
--from-backup <latest-backup> \
--include-namespaces <deleted-namespace>
# 2. If no Velero: restore from GitOps
# ArgoCD will recreate the namespace and all managed resources
argocd app sync <app-name>
# 3. Restore data (databases, PVCs)
# Deleting the namespace also deleted its PVCs; unless Velero captured
# volume snapshots, restore the data from database backups.
# See etcd_backup_restore.md or database-ops primer
# 4. Verify
kubectl get all -n <namespace>
RTO: 10-30 minutes. RPO: Last Velero/database backup.
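Velero restores are asynchronous, so step 4 should not run until the restore finishes. A hedged sketch for polling the Restore object, assuming Velero's default `velero` namespace; the restore name is whatever was passed to `velero restore create`:

```shell
#!/usr/bin/env bash
# Sketch: poll a Velero Restore object until it completes or fails.
wait_for_restore() {
  local name="$1" timeout="${2:-600}" elapsed=0 phase
  while [ "$elapsed" -lt "$timeout" ]; do
    # Restore is a CRD in the velero namespace; read its phase directly
    phase=$(kubectl get restore -n velero "$name" \
      -o jsonpath='{.status.phase}' 2>/dev/null)
    case "$phase" in
      Completed) echo "restore $name completed"; return 0 ;;
      Failed|PartiallyFailed)
        echo "restore $name finished with phase $phase" >&2; return 1 ;;
    esac
    sleep 15; elapsed=$((elapsed + 15))
  done
  echo "timed out waiting for restore $name" >&2
  return 1
}
```

On `PartiallyFailed`, inspect `velero restore describe <name>` before proceeding to the verification step.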
Scenario 5: Database Corruption/Loss¶
Impact: Application data lost or corrupted.
Recovery:
[!WARNING] Scaling to zero replicas is a destructive operation that takes the application fully offline. Confirm with the team before executing — in some cases, isolating the database (NetworkPolicy or PG connection kill) is safer than stopping all traffic.
# 1. Stop the application (prevent further damage)
kubectl scale deployment grokdevops -n grokdevops --replicas=0
# 2. Assess damage
kubectl exec -it postgres-0 -n grokdevops -- psql -U postgres -c "\l"
# 3. Restore from backup
# Option A: Velero PVC snapshot
velero restore create db-restore --from-backup <latest> \
--include-resources persistentvolumeclaims \
--selector app=postgres
# Option B: pg_restore (requires a pg_dump custom- or tar-format dump,
#   not a plain SQL file)
kubectl exec -i postgres-0 -n grokdevops -- \
  pg_restore -U postgres -d grokdevops --clean < backup.dump
# Option C: PITR (point-in-time recovery)
# Requires WAL archiving. See database-ops primer.
# 4. Restart application
kubectl scale deployment grokdevops -n grokdevops --replicas=3
# 5. Verify data integrity
kubectl exec -it postgres-0 -n grokdevops -- \
psql -U postgres -d grokdevops -c "SELECT count(*) FROM users;"
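The single `count(*)` in step 5 generalizes to a loop over critical tables. A minimal sketch; the table names you pass in are illustrative placeholders, and the pod/namespace follow the examples above:

```shell
#!/usr/bin/env bash
# Sketch: post-restore row-count sanity checks.
# Usage: check_table_counts <pod> <db> <table> [<table>...]
check_table_counts() {
  local pod="$1" db="$2" table count
  shift 2
  for table in "$@"; do
    # -tAc: tuples only, unaligned, run one command
    count=$(kubectl exec -i "$pod" -n grokdevops -- \
      psql -U postgres -d "$db" -tAc "SELECT count(*) FROM ${table};")
    echo "$table: $count rows"
  done
}
```

Compare the counts against the figures recorded at backup time; a large unexplained drop means the restore should be repeated from an older backup or via PITR.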
DR Readiness Checklist¶
Run quarterly:
- etcd backup: automated, tested restore within 30 days
- Velero backup: scheduled, tested restore within 30 days
- Database backup: automated (pg_dump + WAL), tested restore
- GitOps: all infrastructure and apps defined in Git
- DNS failover: tested with health check simulation
- Runbooks: reviewed and updated
- Team: DR drill conducted (tabletop or live)
- RTO/RPO: documented and agreed with stakeholders
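The "tested restore within 30 days" items in the checklist above are easy to let slip. One way to make the quarterly review mechanical is a freshness check over your backup inventory; this is a sketch under the assumption that you can export backups as `name<TAB>epoch-seconds` lines (e.g. from Velero or your etcd backup job), since that export step is environment-specific:

```shell
#!/usr/bin/env bash
# Sketch for the quarterly check: read "name<TAB>epoch-seconds" lines on
# stdin and flag any backup older than the given threshold in days.
check_backup_age() {
  local max_age_days="${1:-30}" now stale=0
  now=$(date +%s)
  while IFS=$'\t' read -r name ts; do
    age_days=$(( (now - ts) / 86400 ))
    if [ "$age_days" -gt "$max_age_days" ]; then
      echo "STALE: $name is ${age_days}d old" >&2
      stale=1
    else
      echo "OK: $name (${age_days}d)"
    fi
  done
  return $stale
}
```

A nonzero exit fails the quarterly check and tells you which backup stream needs attention.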
RTO/RPO Summary¶
| Scenario | RTO | RPO |
|---|---|---|
| Single node failure | ~5-7 min (pods tolerate a NotReady node for 300 s by default before rescheduling) | 0 (stateless) |
| Control plane failure | 15-60 min | Last etcd backup |
| Full cluster loss | 1-4 hours | Last backup |
| Namespace deleted | 10-30 min | Last Velero backup |
| Database corruption | 30-60 min | Last DB backup / PITR |
Post-Recovery¶
After any recovery:
1. Verify all services are healthy
2. Check monitoring dashboards for anomalies
3. Notify stakeholders
4. Write a postmortem if data loss occurred
5. Update runbooks with lessons learned
6. Schedule a follow-up to verify backup systems