Thinking Out Loud: Kubernetes Node Lifecycle¶
A senior SRE's internal monologue while working through a real node maintenance task. This isn't a tutorial — it's a window into how experienced engineers actually think.
The Situation¶
AWS has sent a scheduled maintenance notification: three of our worker nodes will be retired in 48 hours for underlying hardware maintenance. I need to gracefully drain these nodes and ensure zero downtime for the services running on them.
The Monologue¶
Okay, AWS hardware maintenance on three nodes. This happens every few months. Let me first figure out what's running on these nodes so I know the blast radius.
kubectl get nodes ip-10-0-1-47.ec2.internal ip-10-0-2-83.ec2.internal ip-10-0-3-12.ec2.internal \
  -o custom-columns='NAME:.metadata.name,STATUS:.status.conditions[?(@.type=="Ready")].status,VERSION:.status.nodeInfo.kubeletVersion'
All three nodes are Ready and running the same kubelet version. Good. Now let me see what pods are on each.
for node in ip-10-0-1-47.ec2.internal ip-10-0-2-83.ec2.internal ip-10-0-3-12.ec2.internal; do
echo "=== $node ==="
kubectl get pods --all-namespaces --field-selector=spec.nodeName=$node -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,OWNER:.metadata.ownerReferences[0].kind' | grep -v kube-system
echo
done
Let me count the unique namespaces and check for any stateful workloads.
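One way to get that tally, reusing the three node names from the maintenance notice (the output shape will depend on the cluster):

```shell
# Count pods per (namespace, owner kind) across the retiring nodes.
# Pods with no ownerReferences show up with an empty owner column.
for node in ip-10-0-1-47.ec2.internal ip-10-0-2-83.ec2.internal ip-10-0-3-12.ec2.internal; do
  kubectl get pods --all-namespaces --field-selector=spec.nodeName="$node" \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{" "}{.metadata.ownerReferences[0].kind}{"\n"}{end}'
done | sort | uniq -c | sort -rn
```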
Alright, I see pods from the commerce, analytics, ml, and monitoring namespaces. Some are Deployment-owned, some are StatefulSet-owned (the Elasticsearch data nodes), and there are a few standalone pods with no owner reference (debug pods someone left running, by the look of it). The Elasticsearch pods concern me the most — those have local PVCs.
Before I start draining, I need to check if there's enough capacity on the remaining nodes to absorb these workloads.
Mental Model: Capacity Buffer Before Drain¶
Before draining nodes, calculate whether the remaining cluster has enough allocatable resources to absorb the evicted pods. If not, scale up new nodes FIRST, wait for them to be Ready, then drain. Draining without capacity planning causes a cascade of Pending pods.
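That arithmetic fits in a few lines. The figures below are placeholders; pull the real ones from the "Allocated resources" section of `kubectl describe nodes` or your metrics stack:

```shell
# Back-of-the-envelope capacity check before draining.
# All numbers are illustrative placeholders, not live cluster data.
total_nodes=8
nodes_to_drain=3
headroom_pct=25   # worst-case free headroom (memory) on the remaining nodes

# Fraction of cluster capacity being removed (integer math is fine here):
removed_pct=$(( nodes_to_drain * 100 / total_nodes ))

if [ "$removed_pct" -gt "$headroom_pct" ]; then
  echo "Not enough headroom (${removed_pct}% leaving vs ${headroom_pct}% free): scale up first"
else
  echo "Headroom OK: safe to drain"
fi
```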
The remaining 5 nodes have about 30% CPU headroom and 25% memory headroom. Three nodes' worth of pods is roughly 37% of the cluster. We don't have enough headroom. I need to add capacity first.
aws autoscaling update-auto-scaling-group \
--auto-scaling-group-name eks-workers-prod \
--desired-capacity 11
I'm bumping from 8 to 11 — 3 replacement nodes. Let me wait for them to join.
Three new nodes joining... NotReady... Ready. All three are up. Now I have 11 nodes total, and I'm going to drain the 3 old ones. Let me start with a cordon to prevent new pods from scheduling on them.
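The cordon step, spelled out for the three nodes from the maintenance notice:

```shell
# Mark the retiring nodes unschedulable; existing pods keep running.
for node in ip-10-0-1-47.ec2.internal ip-10-0-2-83.ec2.internal ip-10-0-3-12.ec2.internal; do
  kubectl cordon "$node"
done
```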
Nodes cordoned. Now the drain. I need to be careful here — let me check for PodDisruptionBudgets first. If services have PDBs, the drain will respect them and might block if the PDB would be violated.
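A quick look at the budgets across all namespaces:

```shell
# List every PodDisruptionBudget and its current allowed disruptions.
kubectl get pdb --all-namespaces
```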
There are PDBs for elasticsearch-data (minAvailable: 2), payment-service (maxUnavailable: 1), and recommendation-engine (minAvailable: 50%). Good — the drain command will respect these. But I need to make sure I'm draining one node at a time, not all three simultaneously, or I might violate a PDB.
Mental Model: Drain Sequentially, Not in Parallel¶
When draining multiple nodes, do them one at a time. PodDisruptionBudgets calculate availability across the current state. If you drain 3 nodes simultaneously and pods from the same StatefulSet are on different nodes, you can violate the PDB even though each individual drain would have been fine. Sequential drains let the cluster rebalance between each one.
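In script form, that pattern looks like the sketch below. The Pending-pod check is a rough proxy for "the cluster has rebalanced"; a stricter version would watch each PDB's `currentHealthy` status instead:

```shell
# Drain nodes one at a time, waiting for evicted pods to land before
# moving on to the next node. (Sketch, not a drop-in script.)
for node in ip-10-0-1-47.ec2.internal ip-10-0-2-83.ec2.internal ip-10-0-3-12.ec2.internal; do
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data \
    --timeout=300s --grace-period=60
  # Block until nothing is left Pending anywhere in the cluster.
  while kubectl get pods --all-namespaces \
      --field-selector=status.phase=Pending -o name | grep -q .; do
    sleep 10
  done
done
```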
Let me drain the first node. I'll use --delete-emptydir-data because some pods use emptyDir for scratch space (that data is lost on eviction), and --ignore-daemonsets because DaemonSet pods would just be recreated on the node by their controller, so drain skips them instead of evicting them.
kubectl drain ip-10-0-1-47.ec2.internal \
--delete-emptydir-data \
--ignore-daemonsets \
--timeout=300s \
--grace-period=60
Evicting pods... waiting for pod deletion... all pods evicted. Let me verify they landed somewhere healthy.
kubectl get pods --all-namespaces --field-selector=status.phase!=Running,status.phase!=Succeeded -o wide
No Pending or Failed pods. All evicted pods rescheduled successfully. Let me check the Elasticsearch cluster specifically since those are stateful.
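One way to check, via a temporary port-forward. The Service name and namespace here are assumptions; adjust them to match the actual Elasticsearch deployment:

```shell
# Query ES cluster health through a short-lived port-forward.
# "analytics" and "elasticsearch" are assumed names, not from the cluster.
kubectl -n analytics port-forward svc/elasticsearch 9200:9200 &
PF_PID=$!
sleep 2
curl -s http://localhost:9200/_cluster/health?pretty
kill "$PF_PID"
```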
Status: green. Good — ES handled the pod relocation. Now let me drain the second node.
kubectl drain ip-10-0-2-83.ec2.internal \
--delete-emptydir-data \
--ignore-daemonsets \
--timeout=300s \
--grace-period=60
Draining... "Cannot evict pod as it would violate the pod's disruption budget." Hmm. The PDB for elasticsearch-data is minAvailable: 2, and one of the ES pods was on the first node we drained and is still recovering on its new node. Let me check.
One pod is still in ContainerCreating on the new node — probably pulling the image or waiting for the PVC to attach. The PDB sees only 2 healthy pods out of 3, and draining this node would take another one out. So the drain is blocking. This is exactly what PDBs are supposed to do.
Let me wait for the ES pod to become Ready on its new node, then retry.
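Rather than polling by hand, `kubectl wait` can block on the condition. The namespace and label selector are assumptions; match them to the actual StatefulSet:

```shell
# Block until the ES data pods report Ready, up to 10 minutes.
# Namespace and label are assumed names, not from the cluster.
kubectl -n analytics wait pod -l app=elasticsearch-data \
  --for=condition=Ready --timeout=600s
```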
Pod is Ready. Now let me retry the drain.
kubectl drain ip-10-0-2-83.ec2.internal \
--delete-emptydir-data \
--ignore-daemonsets \
--timeout=300s \
--grace-period=60
This time it succeeded. One more node to go.
kubectl drain ip-10-0-3-12.ec2.internal \
--delete-emptydir-data \
--ignore-daemonsets \
--timeout=300s \
--grace-period=60
All three nodes drained. Let me verify the cluster state is fully healthy.
kubectl get nodes
kubectl get pods --all-namespaces --field-selector=status.phase!=Running,status.phase!=Succeeded | grep -v Completed
11 Ready nodes (3 of them SchedulingDisabled), all pods Running or Succeeded. The three old nodes are empty. I'll leave them cordoned until AWS retires them, then clean up the Node objects.
Now let me make a note to scale the ASG back down to 8 after the maintenance is complete and the old instances are terminated. I'll also set an alert on the node count so we don't accidentally forget and run extra nodes for weeks.
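The follow-up mirrors the scale-up from earlier:

```shell
# After AWS terminates the old instances: return the ASG to its
# original size and remove the stale Node objects.
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name eks-workers-prod \
  --desired-capacity 8
kubectl delete node ip-10-0-1-47.ec2.internal ip-10-0-2-83.ec2.internal ip-10-0-3-12.ec2.internal
```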
What Made This Senior-Level¶
| Junior Would... | Senior Does... | Why |
|---|---|---|
| Start draining immediately without checking capacity | Calculate remaining headroom, scale up replacement nodes first, then drain | Draining without capacity causes cascading Pending pods |
| Drain all three nodes simultaneously | Drain sequentially, waiting for PDB satisfaction between each | Parallel drains can violate PodDisruptionBudgets even when individual drains wouldn't |
| Be surprised when a drain blocks on a PDB | Anticipate PDB interactions and wait for pod rescheduling to complete between drains | PDBs exist precisely for this scenario — understand them and work with them |
| Forget about the extra nodes after maintenance | Note the need to scale back down and set a reminder | Orphaned capacity costs money and masks capacity planning issues |
Key Heuristics Used¶
- Capacity Before Drain: Always verify (or create) sufficient headroom on remaining nodes before starting a drain operation.
- Sequential Drain with PDB Awareness: Drain one node at a time and wait for rescheduled pods to become Ready before draining the next.
- Clean Up After Maintenance: Scale back down, remove cordoned node objects, and verify the cluster returns to its normal state.
Cross-References¶
- Primer — Node lifecycle states, cordon vs drain, and the eviction API
- Street Ops — The drain flags reference and PDB interaction patterns
- Footguns — Parallel drains violating PDBs and forgetting to scale back down