
Portal | Level: L2: Operations | Topics: Kubernetes Core | Domain: Kubernetes

Runbook: Disaster Recovery Plan

Overview

This runbook covers recovery from major failures. Follow the appropriate section based on the failure type.

Failure Scenarios

Scenario 1: Single Node Failure

Impact: Pods rescheduled to other nodes. Temporary capacity reduction.

Recovery:

# 1. Check node status
kubectl get nodes
kubectl describe node <failed-node>

# 2. Pods should auto-reschedule if using Deployments/StatefulSets
kubectl get pods -A -o wide | grep <failed-node>

# 3. If node is recoverable:
# - Reboot and wait for kubelet to rejoin
# - Check: kubectl get node <name> (should become Ready)

# 4. If node is dead:
kubectl delete node <failed-node>
# - Provision replacement node
# - Join to cluster

RTO: Minutes (auto-reschedule). RPO: Zero for stateless, depends on storage for stateful.
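Steps 3-4 above can be sketched as one small helper for the dead-node case: cordon first (so nothing lands on the node if it briefly flaps back), drain to evict lingering pod records, and only then delete the node object. The `KUBECTL` indirection and the `worker-2` node name are illustrative assumptions, not part of the standard procedure.

```shell
# Sketch: safely remove a dead node. NODE is a placeholder; KUBECTL lets you
# point the sketch at a stub for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

remove_dead_node() {
  local node="$1"
  # Cordon first so nothing new is scheduled if the node flaps back briefly.
  "$KUBECTL" cordon "$node"
  # Drain evicts remaining pod records (skips DaemonSets, drops emptyDir data).
  "$KUBECTL" drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=120s
  # Only then remove the node object itself.
  "$KUBECTL" delete node "$node"
}

# Usage: remove_dead_node worker-2
```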


Scenario 2: Control Plane Failure

Impact: Cannot create/modify resources. Existing workloads keep running.

Recovery:

# 1. Check control plane components (only works while the API server still responds)
kubectl get pods -n kube-system

# 2. If kubeadm: check static pod manifests
ls /etc/kubernetes/manifests/

# 3. Check etcd (flags assume kubeadm default paths; run on a control plane node)
ETCDCTL_API=3 etcdctl endpoint health \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# 4. Check API server logs
# With kubeadm the API server is a static pod, not a systemd unit, so use crictl:
crictl logs $(crictl ps --name kube-apiserver -q)
# The kubelet itself logs via systemd:
journalctl -u kubelet --since "10 min ago"

# 5. If etcd is corrupted: restore from backup (see etcd_backup_restore.md)
# 6. If certs expired: renew with kubeadm, then restart the control plane
#    static pods (e.g. briefly move the manifests out of
#    /etc/kubernetes/manifests/) so they pick up the new certificates
kubeadm certs renew all

RTO: 15-60 minutes. RPO: Last etcd backup.
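The expired-certificate case in step 6 is easier to catch before it takes the API server down. A sketch that scans the kubeadm PKI directory for certificates close to expiry; the path and the 30-day window are assumptions, and on a kubeadm host `kubeadm certs check-expiration` reports the same information:

```shell
# Sketch: list certificates under the kubeadm PKI directory that expire
# within a threshold (in seconds).
expiring_certs() {
  local dir="$1" threshold="$2" cert
  for cert in "$dir"/*.crt; do
    [ -e "$cert" ] || continue
    # openssl -checkend exits non-zero if the cert expires within $threshold
    if ! openssl x509 -in "$cert" -noout -checkend "$threshold" > /dev/null; then
      echo "$cert"
    fi
  done
}

expiring_certs /etc/kubernetes/pki "$((30 * 24 * 3600))"
```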


Scenario 3: Full Cluster Loss

Impact: All nodes, workloads, and etcd state are lost. A full rebuild is required.

Recovery:

# 1. Provision new cluster infrastructure (Terraform)
terraform apply

# 2. Bootstrap new cluster (kubeadm/k3s)
# 3. Install core infrastructure (cert-manager, ingress, monitoring)
# 4. Restore etcd from backup OR re-deploy via GitOps

# Option A: etcd restore (fastest if backup is recent)
# See etcd_backup_restore.md

# Option B: GitOps re-deploy (more reliable if backup is old)
# - Install ArgoCD
# - Point to Git repo
# - Sync all applications
# - Restore data from Velero/database backups

# 5. Restore application data
velero restore create --from-backup <latest>

# 6. Update DNS to point to new cluster
# 7. Verify all services

RTO: 1-4 hours (automated), 4-8 hours (manual). RPO: Last backup.
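Step 7 ("verify all services") can start with a blanket readiness check. A minimal sketch, assuming kubectl access to the rebuilt cluster and that the workloads of interest run as Deployments:

```shell
# Sketch: exit non-zero if any Deployment in the cluster is short of its
# desired replica count.
KUBECTL="${KUBECTL:-kubectl}"

all_deployments_ready() {
  # One line per Deployment: "namespace/name desired available"
  "$KUBECTL" get deploy -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name} {.spec.replicas} {.status.availableReplicas}{"\n"}{end}' |
    awk '$2 != $3 { bad = 1; print "NOT READY: " $1 } END { exit bad }'
}

# Usage: all_deployments_ready && echo "all Deployments available"
```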


Scenario 4: Namespace Accidentally Deleted

Impact: All resources in the namespace are gone.

Recovery:

# 1. Restore from Velero backup
velero restore create ns-restore \
  --from-backup <latest-backup> \
  --include-namespaces <deleted-namespace>

# 2. If no Velero: restore from GitOps
# ArgoCD will recreate the namespace and all managed resources
argocd app sync <app-name>

# 3. Restore data (databases, PVCs)
# PVCs are gone. Restore from database backup.
# See etcd_backup_restore.md or database-ops primer

# 4. Verify
kubectl get all -n <namespace>

RTO: 10-30 minutes. RPO: Last Velero/database backup.
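Velero restores run asynchronously, so step 1 is only done once the Restore object leaves the InProgress phase. A polling sketch, assuming Velero's default `velero` namespace and the `ns-restore` name used above; the retry count and sleep interval are arbitrary:

```shell
# Sketch: poll a Velero Restore object until it reaches a terminal phase,
# then print that phase.
KUBECTL="${KUBECTL:-kubectl}"

wait_restore() {
  local name="$1" phase=""
  for _ in $(seq 1 30); do
    phase=$("$KUBECTL" -n velero get restore "$name" -o jsonpath='{.status.phase}')
    case "$phase" in
      ""|InProgress) sleep 10 ;;  # not finished yet
      *) break ;;                 # Completed, PartiallyFailed, Failed, ...
    esac
  done
  echo "$phase"
}

# Usage: wait_restore ns-restore
```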


Scenario 5: Database Corruption/Loss

Impact: Application data lost or corrupted.

Recovery:

[!WARNING] Scaling to zero replicas is a destructive operation that takes the application fully offline. Confirm with the team before executing. In some cases, isolating the database instead (a NetworkPolicy, or killing PostgreSQL connections) is safer than stopping all traffic.

# 1. Stop the application (prevent further damage)
kubectl scale deployment grokdevops -n grokdevops --replicas=0

# 2. Assess damage
kubectl exec -it postgres-0 -n grokdevops -- psql -U postgres -c "\l"

# 3. Restore from backup
# Option A: Velero PVC snapshot
velero restore create db-restore --from-backup <latest> \
  --include-resources persistentvolumeclaims \
  --selector app=postgres

# Option B: pg_dump restore
kubectl exec -i postgres-0 -n grokdevops -- \
  pg_restore -U postgres -d grokdevops --clean < backup.dump

# Option C: PITR (point-in-time recovery)
# Requires WAL archiving. See database-ops primer.

# 4. Restart application
kubectl scale deployment grokdevops -n grokdevops --replicas=3

# 5. Verify data integrity
kubectl exec -it postgres-0 -n grokdevops -- \
  psql -U postgres -d grokdevops -c "SELECT count(*) FROM users;"
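Before any of the restore options in step 3 overwrite state, it is worth capturing the corrupted database for later forensics. A sketch using the same `postgres-0` pod, `grokdevops` namespace, and database names as the commands above; the output filename is an assumption:

```shell
# Sketch: dump the current (possibly corrupted) database to a local file
# before restoring over it.
KUBECTL="${KUBECTL:-kubectl}"

snapshot_before_restore() {
  local out="corrupt-$(date +%Y%m%d-%H%M%S).dump"
  "$KUBECTL" exec -i postgres-0 -n grokdevops -- \
    pg_dump -U postgres -Fc grokdevops > "$out" && echo "$out"
}

# Usage: snapshot_before_restore   # prints the file it wrote
```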

DR Readiness Checklist

Run quarterly:

  • etcd backup: automated, tested restore within 30 days
  • Velero backup: scheduled, tested restore within 30 days
  • Database backup: automated (pg_dump + WAL), tested restore
  • GitOps: all infrastructure and apps defined in Git
  • DNS failover: tested with health check simulation
  • Runbooks: reviewed and updated
  • Team: DR drill conducted (tabletop or live)
  • RTO/RPO: documented and agreed with stakeholders
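The backup items in the checklist above can be partially automated with a freshness check. A sketch, assuming backups land as files in a directory such as `/var/backups/etcd` (the path and the 30-day window are assumptions to match the checklist):

```shell
# Sketch: succeed only if the backup directory contains at least one file
# newer than the threshold.
backup_is_fresh() {
  local dir="$1" max_days="${2:-30}"
  [ -d "$dir" ] || return 1
  # grep -q . succeeds if find emits at least one recent file
  find "$dir" -type f -mtime -"$max_days" | grep -q .
}

backup_is_fresh /var/backups/etcd 30 || echo "STALE: no recent etcd backup"
```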

RTO/RPO Summary

| Scenario | RTO | RPO |
| --- | --- | --- |
| Single node failure | 2-5 min (auto-reschedule) | 0 (stateless) |
| Control plane failure | 15-60 min | Last etcd backup |
| Full cluster loss | 1-4 hours | Last backup |
| Namespace deleted | 10-30 min | Last Velero backup |
| Database corruption | 30-60 min | Last DB backup / PITR |

Post-Recovery

After any recovery:

1. Verify all services are healthy
2. Check monitoring dashboards for anomalies
3. Notify stakeholders
4. Write a postmortem if data loss occurred
5. Update runbooks with lessons learned
6. Schedule a follow-up to verify backup systems
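Step 1 of the list above can be scripted as a smoke test over externally visible health endpoints. The example URLs are placeholders; substitute your real services:

```shell
# Sketch: probe each URL and report OK/FAIL; exit non-zero if any probe fails.
check_endpoints() {
  local url rc=0
  for url in "$@"; do
    if curl -fsS --max-time 5 -o /dev/null "$url"; then
      echo "OK   $url"
    else
      echo "FAIL $url"; rc=1
    fi
  done
  return $rc
}

# Usage: check_endpoints https://app.example.com/healthz https://api.example.com/healthz
```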

