Solution
Triage
- Confirm the drain is stuck and identify which pod is blocking:
- List PodDisruptionBudgets in the namespace:
- Inspect the specific PDB:
- Check how many replicas the deployment has and how many are ready:
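The commands for the steps above can be sketched as follows. This is a minimal sketch, not the original runbook commands: the function name `triage_drain` is hypothetical, the node and namespace are placeholders you supply, and the Deployment name `payment-service` is an assumption inferred from the PDB name.

```shell
# Hypothetical helper; node and namespace are placeholders you supply.
# Assumes the Deployment is named "payment-service" (inferred from the PDB name).
triage_drain() {
  node="$1"
  ns="$2"

  # Pods still running on the node being drained (the blocker is among these).
  kubectl get pods --all-namespaces --field-selector "spec.nodeName=${node}" -o wide

  # PodDisruptionBudgets in the namespace; ALLOWED DISRUPTIONS of 0 means evictions are refused.
  kubectl get pdb -n "${ns}"

  # Details of the specific PDB, including its status.
  kubectl describe pdb payment-service-pdb -n "${ns}"

  # Desired versus ready replicas for the Deployment.
  kubectl get deployment payment-service -n "${ns}"
}
```

For example, `triage_drain <node-name> <namespace>` runs all four checks in order. The output of the stuck `kubectl drain` itself also names the pod it keeps retrying.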
Root Cause
The PDB payment-service-pdb specifies minAvailable: 1. The deployment has exactly 1 replica. The Kubernetes eviction API refuses to evict a pod when doing so would violate the PDB. Since evicting the only replica would bring available pods to 0 (below the minimum of 1), the drain operation blocks indefinitely waiting for the PDB condition to be satisfiable.
This is a configuration conflict: the PDB guarantees at least 1 pod is always available, but the deployment only runs 1 pod, making voluntary disruption impossible.
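The arithmetic behind the refusal can be illustrated with a small shell function. It is a sketch of the rule for a minAvailable budget, not the disruption controller's actual code, and the function name `allowed_disruptions` is hypothetical.

```shell
# Illustrative arithmetic only; mirrors the minAvailable rule, not the controller's code.
allowed_disruptions() {
  healthy="$1"        # pods currently Ready and matched by the PDB
  min_available="$2"  # minAvailable from the PDB spec
  allowed=$(( healthy - min_available ))
  # A budget can never allow a negative number of disruptions.
  if [ "${allowed}" -lt 0 ]; then
    allowed=0
  fi
  echo "${allowed}"
}

allowed_disruptions 1 1   # prints 0: the only replica cannot be evicted
allowed_disruptions 2 1   # prints 1: one of two replicas can be evicted
```

On a live cluster the same number is reported in the PDB's `status.disruptionsAllowed` field.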
Fix
Immediate (unblock the drain):
- Scale the deployment up so a second replica is running on another node:
- Wait for the new replica to become Ready:
- The drain should now proceed automatically, since evicting one of two pods still satisfies minAvailable: 1.
- After the drain completes and the node is cordoned, scale back if desired:
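The scale-up, wait, and scale-back steps above might look like the sketch below. The function names are hypothetical, the Deployment name `payment-service` is assumed from the PDB name, and the 300-second timeout is an arbitrary choice.

```shell
# Hypothetical helpers; the Deployment name "payment-service" is assumed from the PDB name.
scale_up_for_drain() {
  ns="$1"
  # Run a second replica so one pod can be evicted without violating minAvailable: 1.
  # The draining node is cordoned, so the new pod is scheduled elsewhere.
  kubectl scale deployment payment-service -n "${ns}" --replicas=2
  # Wait until the Deployment reports all replicas available, or give up after 5 minutes.
  kubectl rollout status deployment/payment-service -n "${ns}" --timeout=300s
}

scale_back_after_drain() {
  ns="$1"
  # Optional: return to a single replica once the drain has finished.
  kubectl scale deployment payment-service -n "${ns}" --replicas=1
}
```

If `kubectl rollout status` times out, check whether the new pod is stuck in Pending before retrying, since the drain stays blocked until a second replica is actually Ready.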
Alternative (if no capacity exists for a second replica):
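Cancel the drain and rerun it with --disable-eviction, which deletes pods directly instead of going through the eviction API and therefore skips PDB checks. Expect the service to be unavailable until the pod is rescheduled: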
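A minimal sketch of that rerun follows. The function name is hypothetical and the node name is a placeholder; `--ignore-daemonsets` is included as an assumption because drains commonly need it, so match whatever flags the original drain used.

```shell
# Hypothetical helper; pass the node that was being drained.
drain_without_eviction() {
  node="$1"
  # --disable-eviction makes drain delete pods directly, so PDBs are not consulted.
  # The single payment-service replica will be down until it is rescheduled.
  kubectl drain "${node}" --ignore-daemonsets --disable-eviction
}
```

Because this bypasses the budget entirely, it is best reserved for cases where brief downtime is acceptable.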
Rollback / Safety
- If the drain was started with --delete-emptydir-data, ensure no important data lives in emptyDir volumes.
- Verify that the payment-service pod is healthy on its new node after drain completes.
- If the service has a readiness probe, confirm it passes before declaring success.
- Do not uncordon the drained node until maintenance is complete.
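The post-drain checks above could be scripted roughly as shown here. The function name is hypothetical, the Deployment name `payment-service` is an assumption, and the 120-second timeout is arbitrary.

```shell
# Hypothetical helper; node and namespace are placeholders you supply.
verify_after_drain() {
  node="$1"
  ns="$2"

  # Show each pod in the namespace with the node it is now running on.
  kubectl get pods -n "${ns}" -o wide

  # Succeeds only once the Deployment's replicas are available, which requires
  # the readiness probe (if one is defined) to pass.
  kubectl rollout status deployment/payment-service -n "${ns}" --timeout=120s

  # The drained node should still show SchedulingDisabled until maintenance is done.
  kubectl get node "${node}"
}
```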
Common Traps
- Assuming --force bypasses PDBs. It does not. --force only handles pods not managed by a controller. Only --disable-eviction (Kubernetes 1.18+) bypasses PDB enforcement.
- Forgetting to check cluster capacity. Scaling to 2 replicas does nothing if the second pod is stuck in Pending due to insufficient resources.
- Setting minAvailable equal to the replica count. This is a common misconfiguration. Use maxUnavailable: 1 instead, or ensure replicas always exceeds minAvailable.
- Forgetting to uncordon after cancelling a hanging drain. If you Ctrl+C the drain command, the node remains cordoned. You must run kubectl uncordon to allow scheduling again if you abort.
- Ignoring PDBs in IaC. Fix the Helm chart or Terraform module that defines the PDB, not just the live object.
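Two of the traps above have direct command-line remedies, sketched below with hypothetical function names. The patch assumes the cluster allows PDB spec updates and uses a JSON merge patch, where `null` removes minAvailable so that it does not conflict with maxUnavailable. It changes only the live object, so the Helm chart or Terraform module still needs the same change.

```shell
# Hypothetical helpers; node and namespace are placeholders you supply.

# After aborting a drain with Ctrl+C, make the node schedulable again.
uncordon_after_abort() {
  node="$1"
  kubectl uncordon "${node}"
}

# Switch the live PDB from minAvailable to maxUnavailable: 1.
# With a single replica this lets a drain evict the pod, at the cost of brief downtime.
switch_pdb_to_max_unavailable() {
  ns="$1"
  kubectl patch pdb payment-service-pdb -n "${ns}" --type merge \
    -p '{"spec":{"minAvailable":null,"maxUnavailable":1}}'
}
```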