Solution

Triage

  1. Confirm the drain is stuck and identify which pod is blocking:
    kubectl get pods -n prod -o wide --field-selector spec.nodeName=node-3.internal
    
  2. List PodDisruptionBudgets in the namespace:
    kubectl get pdb -n prod
    
  3. Inspect the specific PDB:
    kubectl describe pdb payment-service-pdb -n prod
    
  4. Check how many replicas the deployment has and how many are ready:
    kubectl get deployment payment-service -n prod
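
    Representative output for step 2, assuming the names from this incident (the AGE value is illustrative). The key signal is ALLOWED DISRUPTIONS of 0, which means the eviction API will refuse every eviction request:
    
    NAME                  MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
    payment-service-pdb   1               N/A               0                     30d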
    

Root Cause

The PDB payment-service-pdb specifies minAvailable: 1. The deployment has exactly 1 replica. The Kubernetes eviction API refuses to evict a pod when doing so would violate the PDB. Since evicting the only replica would bring available pods to 0 (below the minimum of 1), the drain operation blocks indefinitely waiting for the PDB condition to be satisfiable.

This is a configuration conflict: the PDB guarantees at least 1 pod is always available, but the deployment only runs 1 pod, making voluntary disruption impossible.
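
A minimal sketch of the two conflicting objects (field names are standard Kubernetes; resource names and labels are assumed from the incident):

    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: payment-service-pdb
      namespace: prod
    spec:
      minAvailable: 1          # requires >= 1 available pod at all times
      selector:
        matchLabels:
          app: payment-service
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: payment-service
      namespace: prod
    spec:
      replicas: 1              # only 1 pod exists, so evicting it would
                               # drop availability to 0 and violate the PDB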

Fix

Immediate (unblock the drain):

  1. Scale the deployment up so a second replica is running on another node:
    kubectl scale deployment payment-service -n prod --replicas=2
    
  2. Wait for the new replica to become Ready:
    kubectl wait --for=condition=Ready pod -l app=payment-service -n prod --timeout=120s
    
  3. The drain should now proceed automatically, since evicting one of two pods still satisfies minAvailable: 1.
  4. After the drain completes (the node stays cordoned for maintenance), scale back down if desired:
    kubectl scale deployment payment-service -n prod --replicas=1
    

Alternative (if no capacity exists for a second replica):

Cancel the hung drain (Ctrl+C) and re-run it with --disable-eviction to bypass the eviction API entirely:

kubectl drain node-3.internal --ignore-daemonsets --disable-eviction

This deletes the pod directly instead of going through the eviction API, bypassing PDB checks. This WILL cause downtime.

Rollback / Safety

  • If the drain was started with --delete-emptydir-data, ensure no important data lives in emptyDir volumes.
  • Verify that the payment-service pod is healthy on its new node after drain completes.
  • If the service has a readiness probe, confirm it passes before declaring success.
  • Do not uncordon the drained node until maintenance is complete.

Common Traps

  • Assuming --force bypasses PDBs. It does not. --force only handles pods not managed by a controller. Only --disable-eviction (Kubernetes 1.18+) bypasses PDB enforcement.
  • Forgetting to check cluster capacity. Scaling to 2 replicas does nothing if the second pod is stuck in Pending due to insufficient resources.
  • Setting minAvailable equal to replicas count. This is a common misconfiguration. Use maxUnavailable: 1 instead, or ensure replicas always exceeds minAvailable.
  • Not cleaning up after an aborted drain. If you Ctrl+C the drain command, the node remains cordoned. Run kubectl uncordon node-3.internal to make the node schedulable again if you abandon the maintenance.
  • Ignoring PDBs in IaC. Fix the Helm chart or Terraform module that defines the PDB, not just the live object.
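
A PDB expressing the last two points, written the way the IaC fix would look (a sketch; names and labels assumed from the incident):

    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: payment-service-pdb
      namespace: prod
    spec:
      maxUnavailable: 1        # always permits one voluntary eviction,
                               # regardless of replica count
      selector:
        matchLabels:
          app: payment-service

With a single replica this still allows the eviction (at the cost of a brief outage while the pod reschedules), so pair it with replicas: 2 or more if drains must be zero-downtime.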