Incident Replay: Node Drain Blocked by PDB¶
Setup¶
- System context: Kubernetes cluster with a critical API service running 3 replicas across 3 nodes. PDB requires minAvailable: 2. One replica is already unhealthy, and you need to drain the node hosting one of the healthy replicas.
- Time: Tuesday 22:00 UTC
- Your role: Platform engineer / on-call SRE
Round 1: Alert Fires¶
[Pressure cue: "Node k8s-worker-05 has a critical kernel vulnerability and must be patched tonight. Drain is stuck. Security team says this cannot wait."]
What you see:
kubectl drain k8s-worker-05 is stuck with "Cannot evict pod as it would violate the pod's disruption budget." The api-service has 3 replicas, PDB minAvailable=2, but only 2 of 3 pods are healthy. Evicting one more would leave 1 healthy, violating minAvailable=2.
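Why the eviction is refused falls out of the budget arithmetic the PDB controller publishes as status.disruptionsAllowed. A minimal sketch of that calculation for this scenario:

```shell
# Sketch of the PDB controller's arithmetic for a minAvailable budget:
# disruptionsAllowed = currentHealthy - minAvailable, floored at 0.
min_available=2
current_healthy=2            # 3 desired, but one pod is in CrashLoopBackOff
allowed=$((current_healthy - min_available))
if [ "$allowed" -lt 0 ]; then allowed=0; fi
echo "disruptionsAllowed=$allowed"   # 0: evicting a healthy pod is refused
```

With zero disruptions allowed, the eviction API rejects every request for a healthy pod, which is exactly what the stuck drain is reporting.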
Choose your action:
- A) Force the drain with kubectl drain --force
- B) Investigate why the third api-service replica is unhealthy
- C) Temporarily reduce the PDB minAvailable to 1
- D) Patch the node without draining (risky in-place update)
If you chose B (recommended):¶
[Result:
kubectl get pods -o wide shows api-service-xxxxx on k8s-worker-02 in CrashLoopBackOff. kubectl logs reveals a configuration error from last week's deploy: a missing environment variable. Fixing the config would restore 3 healthy replicas, allowing the drain. Proceed to Round 2.]
If you chose A:¶
[Result:
--force allows the drain to delete pods that are not managed by a controller, but it does NOT bypass PDBs. The drain is still stuck on the PDB. No improvement. (The flag that does bypass PDBs is --disable-eviction, which switches from the eviction API to plain deletion and forfeits the budget's protection entirely.)]
If you chose C:¶
[Result: Reducing minAvailable allows the drain but leaves the API service with potentially only 1 healthy replica during the upgrade. If that one crashes, zero replicas serve traffic.]
If you chose D:¶
[Result: In-place kernel patching requires a reboot anyway. You would still need to drain or accept ungraceful pod termination.]
Round 2: First Triage Data¶
[Pressure cue: "Third replica has been unhealthy for a week. Nobody noticed because the service was running fine on 2 replicas. Fix it now."]
What you see:
The CrashLoopBackOff is caused by a missing DATABASE_URL environment variable. It was removed from the ConfigMap during a cleanup a week ago. The other 2 replicas still have it because env vars sourced from a ConfigMap are injected only at container start, and neither has restarted since the cleanup.
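The shape of the fix, sketched as a manifest (the ConfigMap name and connection string are assumptions for illustration; the real values will differ):

```yaml
# Hypothetical ConfigMap: restore the key the cleanup removed (value assumed).
apiVersion: v1
kind: ConfigMap
metadata:
  name: api-service-config
data:
  DATABASE_URL: "postgres://db.internal:5432/api"
```

Restoring the key and then recreating the crashing pod lets the Deployment controller bring it up with a working environment.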
Choose your action:
- A) Restore the DATABASE_URL in the ConfigMap and restart the unhealthy pod
- B) Redeploy the entire api-service from the last known-good config
- C) Add the env var directly to the pod spec as a fix
- D) Just delete the unhealthy pod and let the deployment create a new one
If you chose A (recommended):¶
[Result: ConfigMap updated with the missing env var. Unhealthy pod restarted. All 3 replicas now healthy. PDB shows disruptionsAllowed: 1. Drain can proceed. Proceed to Round 3.]
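A hedged sketch of the fix as commands (the ConfigMap name, key value, PDB name, and pod name are assumptions; substitute your own):

```shell
# Restore the missing key (name and value assumed for illustration)
kubectl patch configmap api-service-config \
  --type merge -p '{"data":{"DATABASE_URL":"postgres://db.internal:5432/api"}}'

# Delete the crashing pod; direct deletion is not gated by the PDB, and the
# Deployment controller recreates it with the repaired config.
kubectl delete pod api-service-xxxxx

# Confirm the budget has headroom before retrying the drain
kubectl get pdb api-service   # expect ALLOWED DISRUPTIONS: 1
```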
If you chose B:¶
[Result: A full redeploy works but cycles all 3 pods through restarts. Rolling updates are not gated by the PDB, but while pods cycle you can again drop below 2 healthy replicas and re-block the drain. Slower and riskier than fixing the one broken pod.]
If you chose C:¶
[Result: A running pod's env cannot be edited in place, and a hand-created replacement is reconciled away by the Deployment controller, which recreates pods from its template. Not a durable fix.]
If you chose D:¶
[Result: New pod gets the same missing env var from the ConfigMap and crashes. CrashLoopBackOff again.]
Round 3: Root Cause Identification¶
[Pressure cue: "3 replicas healthy. Drain proceeding. Why was this not caught earlier?"]
What you see: Root cause: ConfigMap cleanup removed a required env var. The 2 running replicas were never restarted so they retained the old config in memory. The third replica crashed on a node reboot a week ago and could not start with the broken config. No alert was configured for pod health below replica count.
Choose your action:
- A) Add alerting when healthy replicas drop below desired count
- B) Add ConfigMap change validation in CI
- C) Add a readiness probe that checks required env vars on startup
- D) All of the above
If you chose D (recommended):¶
[Result: Alerting catches future replica health drops. CI validation catches config deletions. Startup checks fail fast with clear error messages. Defense in depth. Proceed to Round 4.]
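For the alerting leg, one possible shape is a kube-state-metrics based rule. This is a sketch: it assumes kube-state-metrics is scraped and the Prometheus Operator's PrometheusRule CRD is installed; thresholds and labels are placeholders.

```yaml
# Hypothetical alert: fire when a Deployment runs below its desired replicas.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: replica-health
spec:
  groups:
    - name: replica-health
      rules:
        - alert: DeploymentReplicasBelowDesired
          expr: kube_deployment_status_replicas_available < kube_deployment_spec_replicas
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "{{ $labels.deployment }} has fewer available replicas than desired"
```

A `for: 15m` hold-off avoids paging on routine rollouts while still catching a replica that stays down, as this one did for a week.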
If you chose A:¶
[Result: Catches the symptom but not the root cause.]
If you chose B:¶
[Result: Catches the config change but does not detect runtime impact.]
If you chose C:¶
[Result: Fails fast but does not prevent the config error from being deployed.]
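Option C's fail-fast idea can be sketched as an entrypoint wrapper (hypothetical; the real image's entrypoint and required variables will differ). A missing variable aborts startup with a clear message instead of crash-looping on an opaque connection error:

```shell
#!/bin/sh
# Hypothetical fail-fast check run before exec'ing the server binary.
check_required() {
  for var in "$@"; do
    # POSIX-sh indirect expansion of the variable named in $var
    val=$(eval "printf '%s' \"\${$var:-}\"")
    if [ -z "$val" ]; then
      echo "FATAL: required env var $var is not set" >&2
      return 1
    fi
  done
}

# Demo: the missing var is caught...
unset DATABASE_URL
check_required DATABASE_URL || echo "refused to start"

# ...and startup proceeds once it is present.
DATABASE_URL="postgres://db.internal:5432/api"
export DATABASE_URL
check_required DATABASE_URL && echo "starting api-service"
```

The FATAL line lands in kubectl logs on the first restart, turning a week of silent CrashLoopBackOff into a one-line diagnosis.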
Round 4: Remediation¶
[Pressure cue: "Node drained, patched, and uncordoned. Verify everything."]
Actions:
1. Verify node is patched and uncordoned: kubectl get nodes
2. Verify all 3 api-service replicas are healthy: kubectl get pods
3. Verify PDB is satisfied: kubectl get pdb
4. Add replica health alerting to monitoring
5. Add ConfigMap diff review to the CI pipeline
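The verification steps above, as concrete commands (resource names and labels are assumptions):

```shell
# 1. Node back in service after the patch
kubectl get node k8s-worker-05 -o wide   # STATUS Ready, no SchedulingDisabled
# 2. All 3 replicas healthy
kubectl get pods -l app=api-service      # label selector assumed
# 3. Budget has headroom again
kubectl get pdb api-service              # expect ALLOWED DISRUPTIONS: 1
```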
Damage Report¶
- Total downtime: 0 (service ran on 2 replicas; drain delayed by 30 minutes)
- Blast radius: Node patch delayed by 30 minutes while fixing the third replica
- Optimal resolution time: 10 minutes (identify unhealthy replica -> fix config -> drain)
- If every wrong choice was made: 2+ hours with PDB workarounds and risk of service outage
Cross-References¶
- Primer: Kubernetes Ops
- Primer: Kubernetes Node Lifecycle
- Footguns: Kubernetes Ops