Incident Replay: Node Drain Blocked by PDB

Setup

  • System context: Kubernetes cluster with a critical API service running 3 replicas across 3 nodes. PDB requires minAvailable: 2. One replica is already unhealthy, and you need to drain the node hosting one of the healthy replicas.
  • Time: Tuesday 22:00 UTC
  • Your role: Platform engineer / on-call SRE
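
The PDB described in the setup might look like the following manifest (the name, namespace, and labels are illustrative, not taken from the incident):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-service
spec:
  minAvailable: 2          # at most 1 voluntary disruption when all 3 replicas are healthy
  selector:
    matchLabels:
      app: api-service     # must match the Deployment's pod labels
```

With 3 healthy replicas this PDB permits exactly one voluntary eviction; with only 2 healthy, it permits none.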

Round 1: Alert Fires

[Pressure cue: "Node k8s-worker-05 has a critical kernel vulnerability and must be patched tonight. Drain is stuck. Security team says this cannot wait."]

What you see: kubectl drain k8s-worker-05 is stuck with "Cannot evict pod as it would violate the pod's disruption budget." The api-service has 3 replicas, PDB minAvailable=2, but only 2 of 3 pods are healthy. Evicting one more would leave 1 healthy, violating minAvailable=2.
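Before choosing, it helps to confirm that the PDB really is the blocker. A quick check might look like this (label selector and resource names are assumed):

```shell
# ALLOWED DISRUPTIONS will be 0 when only 2 of 3 pods are healthy
kubectl get pdb api-service

# Identify which replica is unhealthy and which node it runs on
kubectl get pods -l app=api-service -o wide
```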

Choose your action:

  • A) Force the drain with kubectl drain --force
  • B) Investigate why the third api-service replica is unhealthy
  • C) Temporarily reduce the PDB minAvailable to 1
  • D) Patch the node without draining (risky in-place update)

[Result: kubectl get pods -o wide shows api-service-xxxxx on k8s-worker-02 is in CrashLoopBackOff. kubectl logs reveals a configuration error from last week's deploy — a missing environment variable. Fixing the config would restore 3 healthy replicas, allowing the drain. Proceed to Round 2.]

If you chose A:

[Result: --force allows deleting unmanaged (non-replicated) pods but does NOT bypass PDBs; only kubectl drain --disable-eviction does that, by using direct deletion instead of the eviction API. The drain is still stuck on the PDB. No improvement.]

If you chose C:

[Result: Reducing minAvailable allows the drain but leaves the API service with potentially only 1 healthy replica during the upgrade. If that one crashes, zero replicas serve traffic.]

If you chose D:

[Result: In-place kernel patching requires a reboot anyway. You would still need to drain or accept ungraceful pod termination.]

Round 2: First Triage Data

[Pressure cue: "Third replica has been unhealthy for a week. Nobody noticed because the service was running fine on 2 replicas. Fix it now."]

What you see: The CrashLoopBackOff is caused by a missing DATABASE_URL environment variable. It was removed from the ConfigMap during a cleanup a week ago. The other 2 replicas still have it because environment variables are injected when a container starts and persist until it restarts.
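The triage commands behind this finding might look like the following (the ConfigMap name and pod names are illustrative):

```shell
# Error output from the last crashed run of the unhealthy replica
kubectl logs api-service-xxxxx --previous

# Confirm DATABASE_URL is absent from the ConfigMap's data
kubectl get configmap api-service-config -o yaml

# Confirm a healthy replica still carries the old value in its environment
kubectl exec api-service-yyyyy -- printenv DATABASE_URL
```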

Choose your action:

  • A) Restore the DATABASE_URL in the ConfigMap and restart the unhealthy pod
  • B) Redeploy the entire api-service from the last known-good config
  • C) Add the env var directly to the pod spec as a fix
  • D) Just delete the unhealthy pod and let the deployment create a new one

[Result: ConfigMap updated with the missing env var. Unhealthy pod restarted. All 3 replicas now healthy. PDB shows disruptionsAllowed: 1. Drain can proceed. Proceed to Round 3.]
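A sketch of the fix, assuming the ConfigMap name and connection string shown here (both illustrative):

```shell
# Restore the missing key
kubectl patch configmap api-service-config --type merge \
  -p '{"data":{"DATABASE_URL":"postgres://db.internal:5432/api"}}'

# A direct delete (unlike an eviction) is not blocked by the PDB;
# the Deployment controller recreates the pod with the restored env var
kubectl delete pod api-service-xxxxx

# ALLOWED DISRUPTIONS should now report 1
kubectl get pdb api-service
```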

If you chose B:

[Result: Full redeploy works but cycles all 3 pods through restart. During the rollout, you may hit the PDB again.]

If you chose C:

[Result: Direct pod spec changes are overwritten by the deployment controller. Not a durable fix.]

If you chose D:

[Result: New pod gets the same missing env var from the ConfigMap and crashes. CrashLoopBackOff again.]

Round 3: Root Cause Identification

[Pressure cue: "3 replicas healthy. Drain proceeding. Why was this not caught earlier?"]

What you see: Root cause: a ConfigMap cleanup removed a required env var. The 2 running replicas were never restarted, so they retained the old value in their environment. The third replica crashed during a node reboot a week ago and could not start with the broken config. No alert was configured for healthy replicas dropping below the desired count.

Choose your action:

  • A) Add alerting when healthy replicas drop below desired count
  • B) Add ConfigMap change validation in CI
  • C) Add a readiness probe that checks required env vars on startup
  • D) All of the above

[Result: Alerting catches future replica health drops. CI validation catches config deletions. Startup checks fail fast with clear error messages. Defense in depth. Proceed to Round 4.]
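If the cluster runs kube-state-metrics and the Prometheus Operator, the replica-health alert from option A could be sketched like this (the threshold duration, labels, and deployment name are assumptions):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: api-service-replica-health
spec:
  groups:
    - name: replica-health
      rules:
        - alert: DeploymentReplicasBelowDesired
          # Fires when fewer replicas are available than the spec requests
          expr: |
            kube_deployment_status_replicas_available{deployment="api-service"}
              < kube_deployment_spec_replicas{deployment="api-service"}
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "api-service has fewer healthy replicas than desired"
```

This would have fired a week earlier, when the third replica first entered CrashLoopBackOff.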

If you chose A:

[Result: Catches the symptom but not the root cause.]

If you chose B:

[Result: Catches the config change but does not detect runtime impact.]

If you chose C:

[Result: Fails fast but does not prevent the config error from being deployed.]
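Option C's fail-fast check need not be a Kubernetes probe at all; a container entrypoint can refuse to start when required variables are missing, producing a clear error instead of a silent crash loop. A minimal sketch (the variable value and the simulation at the bottom are illustrative):

```shell
# check_required_env: abort startup if any required variable is unset or empty
check_required_env() {
  for var in "$@"; do
    eval "val=\${$var:-}"
    if [ -z "$val" ]; then
      echo "FATAL: required env var $var is not set" >&2
      return 1
    fi
  done
}

# Simulate the incident: broken config, then fixed config
unset DATABASE_URL
check_required_env DATABASE_URL || echo "startup blocked"

export DATABASE_URL="postgres://db.internal:5432/api"
check_required_env DATABASE_URL && echo "startup ok"
```

In a real image the entrypoint would call check_required_env and then exec the server binary, so a ConfigMap regression fails immediately with an explicit message in the pod logs.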

Round 4: Remediation

[Pressure cue: "Node drained, patched, and uncordoned. Verify everything."]

Actions:

  1. Verify the node is patched and uncordoned: kubectl get nodes
  2. Verify all 3 api-service replicas are healthy: kubectl get pods
  3. Verify the PDB is satisfied: kubectl get pdb
  4. Add replica health alerting to monitoring
  5. Add ConfigMap diff review to the CI pipeline
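The verification steps above, as commands (the label selector is assumed):

```shell
kubectl get nodes k8s-worker-05                # STATUS Ready, no SchedulingDisabled
kubectl get pods -l app=api-service -o wide    # all 3 replicas Running and Ready
kubectl get pdb api-service                    # ALLOWED DISRUPTIONS back to 1
```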

Damage Report

  • Total downtime: 0 (service ran on 2 replicas; drain delayed by 30 minutes)
  • Blast radius: Node patch delayed by 30 minutes while fixing the third replica
  • Optimal resolution time: 10 minutes (identify unhealthy replica -> fix config -> drain)
  • If every wrong choice was made: 2+ hours with PDB workarounds and risk of service outage

Cross-References