Runbook: Disaster Recovery Plan¶
Overview¶
This runbook covers recovery from major failures. Follow the appropriate section based on the failure type.
Failure Scenarios¶
Scenario 1: Single Node Failure¶
Impact: Pods rescheduled to other nodes. Temporary capacity reduction.
Recovery:
# 1. Check node status
kubectl get nodes
kubectl describe node <failed-node>
# 2. Pods should auto-reschedule if using Deployments/StatefulSets
kubectl get pods -A -o wide | grep <failed-node>
# 3. If node is recoverable:
# - Reboot and wait for kubelet to rejoin
# - Check: kubectl get node <name> (should become Ready)
# 4. If node is dead:
kubectl delete node <failed-node>
# - Provision replacement node
# - Join to cluster
RTO: Minutes (auto-reschedule). RPO: Zero for stateless, depends on storage for stateful.
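Step 3 above can be automated with a small polling helper. This is a hedged sketch, not part of the official recovery tooling: the node name and timeout are placeholders, and it assumes `kubectl` is configured against the cluster.

```shell
#!/usr/bin/env bash
# Sketch: wait for a rebooted node to report Ready, with a timeout.
# "$1" is the node name; "$2" (optional) is the timeout in seconds.
wait_for_node_ready() {
  local node="$1" timeout="${2:-300}" elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    # Read the Ready condition straight from the node object
    status=$(kubectl get node "$node" \
      -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' 2>/dev/null)
    if [ "$status" = "True" ]; then
      echo "node $node is Ready"
      return 0
    fi
    sleep 10; elapsed=$((elapsed + 10))
  done
  echo "timed out waiting for $node" >&2
  return 1
}
```

Call it as `wait_for_node_ready <failed-node> 600` after the reboot; a nonzero exit means the node never rejoined and you should fall back to step 4.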
Scenario 2: Control Plane Failure¶
Impact: Cannot create/modify resources. Existing workloads keep running.
Recovery:
# 1. Check control plane components
kubectl get pods -n kube-system
# 2. If kubeadm: check static pod manifests
ls /etc/kubernetes/manifests/
# 3. Check etcd health (cert paths shown are the kubeadm defaults; adjust if yours differ)
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health
# 4. Check API server logs
# (journalctl only works if kube-apiserver runs as a systemd unit;
#  on kubeadm it is a static pod, so use crictl instead)
journalctl -u kube-apiserver --since "10 min ago"    # systemd-managed clusters
crictl logs $(crictl ps --name kube-apiserver -q)    # kubeadm static pod
# 5. If etcd is corrupted: restore from backup (see etcd_backup_restore.md)
# 6. If certs expired: renew with kubeadm, then restart the control plane
#    components so they pick up the new certificates
kubeadm certs renew all
RTO: 15-60 minutes. RPO: Last etcd backup.
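Steps 1-2 can be wrapped in a quick triage helper. A minimal sketch, assuming a kubeadm layout; the manifest path defaults to the kubeadm location but is parameterized so it can be overridden:

```shell
#!/usr/bin/env bash
# Hypothetical triage sketch for a kubeadm control plane:
# verify the static pod manifests exist and the API server answers /readyz.
check_control_plane() {
  local manifests="${1:-/etc/kubernetes/manifests}"
  local f
  for f in kube-apiserver kube-controller-manager kube-scheduler etcd; do
    if [ -f "$manifests/$f.yaml" ]; then
      echo "manifest present: $f"
    else
      echo "manifest MISSING: $f" >&2
    fi
  done
  # /readyz returns ok when the API server is healthy
  if kubectl get --raw='/readyz' >/dev/null 2>&1; then
    echo "apiserver: ready"
  else
    echo "apiserver: NOT ready" >&2
  fi
}
```

A missing manifest points at step 2 (someone moved or deleted it); manifests present but the API server not ready points at steps 3-6 (etcd, logs, certs).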
Scenario 3: Full Cluster Loss¶
Impact: Everything is gone. Total rebuild needed.
Recovery:
# 1. Provision new cluster infrastructure (Terraform)
terraform apply
# 2. Bootstrap new cluster (kubeadm/k3s)
# 3. Install core infrastructure (cert-manager, ingress, monitoring)
# 4. Restore etcd from backup OR re-deploy via GitOps
# Option A: etcd restore (fastest if backup is recent)
# See etcd_backup_restore.md
# Option B: GitOps re-deploy (more reliable if backup is old)
# - Install ArgoCD
# - Point to Git repo
# - Sync all applications
# - Restore data from Velero/database backups
# 5. Restore application data
velero restore create --from-backup <latest>
# 6. Update DNS to point to new cluster
# 7. Verify all services
RTO: 1-4 hours (automated), 4-8 hours (manual). RPO: Last backup.
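Step 7 ("verify all services") can be partially scripted. The sketch below, an assumption rather than an official check, flags any Deployment whose ready replica count lags its desired count after the rebuild:

```shell
#!/usr/bin/env bash
# Sketch: after a rebuild, verify every Deployment reached its desired
# replica count. Exits nonzero and lists any Deployment still lagging.
verify_deployments() {
  local failed=0
  while read -r ns name desired ready; do
    # readyReplicas is absent until at least one pod is ready
    if [ "${ready:-0}" != "$desired" ]; then
      echo "NOT READY: $ns/$name (${ready:-0}/$desired)" >&2
      failed=1
    fi
  done < <(kubectl get deploy -A -o jsonpath='{range .items[*]}{.metadata.namespace}{" "}{.metadata.name}{" "}{.spec.replicas}{" "}{.status.readyReplicas}{"\n"}{end}')
  return $failed
}
```

Run it after the GitOps sync or Velero restore settles; pair it with spot checks of ingress and DNS, which this script does not cover.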
Scenario 4: Namespace Accidentally Deleted¶
Impact: All resources in the namespace are gone.
Recovery:
# 1. Restore from Velero backup
velero restore create ns-restore \
--from-backup <latest-backup> \
--include-namespaces <deleted-namespace>
# 2. If no Velero: restore from GitOps
# ArgoCD will recreate the namespace and all managed resources
argocd app sync <app-name>
# 3. Restore data (databases, PVCs)
# Deleting the namespace also deleted its PVCs; unless Velero captured
# volume snapshots, restore the data from database backups.
# See etcd_backup_restore.md or database-ops primer
# 4. Verify
kubectl get all -n <namespace>
RTO: 10-30 minutes. RPO: Last Velero/database backup.
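Velero restores are asynchronous, so step 4 should not run until the restore finishes. A hedged sketch for polling the Restore object, assuming Velero's default `velero` namespace; the restore name is whatever was passed to `velero restore create`:

```shell
#!/usr/bin/env bash
# Sketch: poll a Velero Restore object until it completes or fails.
wait_for_restore() {
  local name="$1" timeout="${2:-600}" elapsed=0 phase
  while [ "$elapsed" -lt "$timeout" ]; do
    # Restore is a CRD in the velero namespace; read its phase directly
    phase=$(kubectl get restore -n velero "$name" \
      -o jsonpath='{.status.phase}' 2>/dev/null)
    case "$phase" in
      Completed) echo "restore $name completed"; return 0 ;;
      Failed|PartiallyFailed)
        echo "restore $name finished with phase $phase" >&2; return 1 ;;
    esac
    sleep 15; elapsed=$((elapsed + 15))
  done
  echo "timed out waiting for restore $name" >&2
  return 1
}
```

On `PartiallyFailed`, inspect `velero restore describe <name>` before proceeding to the verification step.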
Scenario 5: Database Corruption/Loss¶
Impact: Application data lost or corrupted.
Recovery:
[!WARNING] Scaling to zero replicas is a destructive operation that takes the application fully offline. Confirm with the team before executing — in some cases, isolating the database (NetworkPolicy or PG connection kill) is safer than stopping all traffic.
# 1. Stop the application (prevent further damage)
kubectl scale deployment grokdevops -n grokdevops --replicas=0
# 2. Assess damage
kubectl exec -it postgres-0 -n grokdevops -- psql -U postgres -c "\l"
# 3. Restore from backup
# Option A: Velero PVC snapshot
velero restore create db-restore --from-backup <latest> \
--include-resources persistentvolumeclaims \
--selector app=postgres
# Option B: pg_restore (requires a pg_dump custom- or tar-format dump,
#   not a plain SQL file)
kubectl exec -i postgres-0 -n grokdevops -- \
  pg_restore -U postgres -d grokdevops --clean < backup.dump
# Option C: PITR (point-in-time recovery)
# Requires WAL archiving. See database-ops primer.
# 4. Restart application
kubectl scale deployment grokdevops -n grokdevops --replicas=3
# 5. Verify data integrity
kubectl exec -it postgres-0 -n grokdevops -- \
psql -U postgres -d grokdevops -c "SELECT count(*) FROM users;"
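The single `count(*)` in step 5 generalizes to a loop over critical tables. A minimal sketch; the table names you pass in are illustrative placeholders, and the pod/namespace follow the examples above:

```shell
#!/usr/bin/env bash
# Sketch: post-restore row-count sanity checks.
# Usage: check_table_counts <pod> <db> <table> [<table>...]
check_table_counts() {
  local pod="$1" db="$2" table count
  shift 2
  for table in "$@"; do
    # -tAc: tuples only, unaligned, run one command
    count=$(kubectl exec -i "$pod" -n grokdevops -- \
      psql -U postgres -d "$db" -tAc "SELECT count(*) FROM ${table};")
    echo "$table: $count rows"
  done
}
```

Compare the counts against the figures recorded at backup time; a large unexplained drop means the restore should be repeated from an older backup or via PITR.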
DR Readiness Checklist¶
Run quarterly:
- etcd backup: automated, tested restore within 30 days
- Velero backup: scheduled, tested restore within 30 days
- Database backup: automated (pg_dump + WAL), tested restore
- GitOps: all infrastructure and apps defined in Git
- DNS failover: tested with health check simulation
- Runbooks: reviewed and updated
- Team: DR drill conducted (tabletop or live)
- RTO/RPO: documented and agreed with stakeholders
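The "tested restore within 30 days" items in the checklist above are easy to let slip. One way to make the quarterly review mechanical is a freshness check over your backup inventory; this is a sketch under the assumption that you can export backups as `name<TAB>epoch-seconds` lines (e.g. from Velero or your etcd backup job), since that export step is environment-specific:

```shell
#!/usr/bin/env bash
# Sketch for the quarterly check: read "name<TAB>epoch-seconds" lines on
# stdin and flag any backup older than the given threshold in days.
check_backup_age() {
  local max_age_days="${1:-30}" now stale=0
  now=$(date +%s)
  while IFS=$'\t' read -r name ts; do
    age_days=$(( (now - ts) / 86400 ))
    if [ "$age_days" -gt "$max_age_days" ]; then
      echo "STALE: $name is ${age_days}d old" >&2
      stale=1
    else
      echo "OK: $name (${age_days}d)"
    fi
  done
  return $stale
}
```

A nonzero exit fails the quarterly check and tells you which backup stream needs attention.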
RTO/RPO Summary¶
| Scenario | RTO | RPO |
|---|---|---|
| Single node failure | ~5-7 min (pods tolerate a NotReady node for 300 s by default before rescheduling) | 0 (stateless) |
| Control plane failure | 15-60 min | Last etcd backup |
| Full cluster loss | 1-4 hours | Last backup |
| Namespace deleted | 10-30 min | Last Velero backup |
| Database corruption | 30-60 min | Last DB backup / PITR |
Post-Recovery¶
After any recovery:
1. Verify all services are healthy
2. Check monitoring dashboards for anomalies
3. Notify stakeholders
4. Write a postmortem if data loss occurred
5. Update runbooks with lessons learned
6. Schedule a follow-up to verify backup systems