Thinking Out Loud: Kubernetes Pods & Scheduling

A senior SRE's internal monologue while working through a real scheduling task. This isn't a tutorial — it's a window into how experienced engineers actually think.

The Situation

The data engineering team reports their batch processing pods have been stuck in Pending for 40 minutes. They need to process the overnight data pipeline before business hours, and the window is closing. I'm the platform team's on-call.

The Monologue

Okay, Pending pods for 40 minutes. That's scheduling — the kubelet never even gets involved until the pod is scheduled. Let me see what's going on.

kubectl get pods -n data-eng -l job-name=overnight-etl --field-selector=status.phase=Pending

12 Pending pods. The job wants 12 parallel workers. Let me check the events — the scheduler always tells you why it can't place a pod.

kubectl describe pod overnight-etl-worker-7x9km -n data-eng | grep -A 10 Events

"0/8 nodes are available: 3 node(s) had taint {dedicated=ml-workloads:NoSchedule}, 2 node(s) had insufficient memory, 3 node(s) had insufficient cpu." That's pretty clear — no nodes with enough resources that also tolerate the taints.

But wait — we have 8 nodes and only 5 are excluded by resources. The other 3 are tainted for ML workloads. Those are the big GPU nodes. The ETL job doesn't need GPU, but those nodes have 64GB RAM and 16 cores each. Let me check what the ETL pods are actually requesting.

kubectl get pod overnight-etl-worker-7x9km -n data-eng -o jsonpath='{.spec.containers[0].resources}' | jq .

Requests: 4Gi memory, 2 CPU. That's not outrageous. So why are 5 nodes showing insufficient resources? Let me check actual allocations.

kubectl describe nodes | grep -A 5 "Allocated resources"

Mental Model: Requests vs Actual Usage

Kubernetes schedules based on resource requests, not actual usage. A node can be 10% utilized but fully allocated. Always check Allocated resources on the node, not CPU/memory usage dashboards, when debugging scheduling.

Right, there it is. Worker nodes 1-5 are 89-95% allocated on CPU even though actual utilization is probably 40%. The teams are over-requesting. That's a known problem but not one I can fix at 5 AM during an incident.
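The arithmetic the scheduler does here is worth making concrete — a toy sketch in millicores, with illustrative numbers rather than anything read from the cluster:

```shell
# Toy sketch of the scheduler's resource filter: it sums pod *requests*
# against node allocatable. Live utilization never enters into it.
allocatable=8000                            # 8 vCPU worker, in millicores
allocated=$((2000 + 2000 + 1500 + 2100))    # requests already bound to this node
incoming=2000                               # one ETL worker requests 2 CPU
echo "allocated: $((allocated * 100 / allocatable))%"
if [ $((allocated + incoming)) -le "$allocatable" ]; then
  echo "new pod fits"
else
  echo "new pod does not fit"
fi
```

That node could be sitting at 40% real CPU usage and the answer would not change.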

Okay, what are my options here? I need to get these 12 pods scheduled NOW. Let me think...

Option 1: Add a toleration to the ETL job so it can land on the ML nodes. Fast, but those nodes are tainted for a reason — ML team doesn't want noisy neighbors. I'd need approval.
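For the record, that would be a small patch to the Job's pod template — a sketch only, since it needs that sign-off:

```yaml
# Hypothetical toleration matching the dedicated=ml-workloads:NoSchedule taint.
# This would go in the Job's pod template; not applying it without ML-team approval.
spec:
  template:
    spec:
      tolerations:
      - key: dedicated
        operator: Equal
        value: ml-workloads
        effect: NoSchedule
```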

Option 2: Evict or reduce requests on lower-priority workloads. Dangerous, could cascade.

Option 3: Temporarily scale up the node pool. If we're on a cloud provider with cluster autoscaler...

kubectl get nodes --show-labels | grep -o 'node.kubernetes.io/instance-type=[^,]*' | sort | uniq -c

We're on EKS. The worker nodes are m5.2xlarge (8 vCPU, 32GB). Let me check the autoscaler.

kubectl get deployment cluster-autoscaler -n kube-system -o jsonpath='{.spec.replicas}'
kubectl logs deployment/cluster-autoscaler -n kube-system --tail=20

Autoscaler is running but... "max node group size reached (5/5)." There it is. The node group has a max of 5 and we're at 5. The autoscaler wants to add nodes but the ASG limit won't let it.

I have two choices now: bump the ASG max, or find another way. Bumping the ASG max is the right move, but it takes a few minutes for a new node to join.

aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name eks-workers-data-eng \
  --max-size 8

While that's propagating, let me also check if there's anything I can do to help the scheduler on the existing nodes. Sometimes there are completed pods or failed jobs holding allocations.

kubectl get pods -n data-eng --field-selector=status.phase=Succeeded -o name | head -20

Yeah, there are 30+ Succeeded pods from yesterday's jobs that were never cleaned up. They don't hold resources, actually — Succeeded pods release their allocations. Never mind. But those Failed ones might still be counted if they're in a weird state.

kubectl get pods --all-namespaces --field-selector=status.phase=Failed -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,NODE:.spec.nodeName' | head -20

Hmm, a few Failed pods but they shouldn't affect scheduling either. Let me go back to watching the autoscaler.
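One quick note for the follow-up list first: those stale Succeeded pods are clutter worth preventing. Assuming these are plain batch/v1 Jobs, a TTL on the Job spec would have the finished Jobs and their pods garbage-collected automatically — a sketch:

```yaml
# Hypothetical follow-up: let the TTL-after-finished controller delete the Job
# (and its pods) automatically once it finishes.
apiVersion: batch/v1
kind: Job
metadata:
  name: overnight-etl
  namespace: data-eng
spec:
  ttlSecondsAfterFinished: 3600  # clean up one hour after completion
```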

kubectl get nodes -w

There we go — new node ip-10-0-3-47.ec2.internal is joining. Status NotReady... now Ready. Let me watch the pending pods.

Mental Model: The Scheduling Pipeline

A pod goes through: API admission -> scheduler queue -> filter (hard constraints: taints, affinity, resources) -> score (soft preferences: spread, balance) -> bind (assign to node). If a pod stays Pending, the failure is almost always in the filter phase: no node passed every hard constraint. The scheduler's event message tells you exactly which filter rejected it.

kubectl get pods -n data-eng -l job-name=overnight-etl -w

Pods are scheduling onto the new nodes. 4 running now... 8... all 12 running. Good. Let me verify the ETL job is actually processing.

kubectl logs overnight-etl-worker-7x9km -n data-eng --tail=5

Processing records. We're good. But I need to fix this properly so it doesn't happen tomorrow. Let me file the follow-ups:

1. Bump the ASG max permanently (or better, set it based on peak job parallelism plus a buffer).
2. Add a PriorityClass for the ETL jobs so they can preempt lower-priority workloads.
3. Talk to teams about right-sizing their resource requests — we're wasting 50% of cluster capacity on over-provisioning.

Actually, let me set the PriorityClass now while I'm thinking about it. This is a two-minute fix that prevents the next 3 AM page.

cat <<EOF | kubectl apply -f -
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-critical
value: 100000
globalDefault: false
description: "Critical batch jobs that should preempt development workloads"
EOF

I'll update the job spec to use this PriorityClass in the morning. Right now the job is running, and that's what matters.
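When I do, the change is a single field in the pod template — a sketch, assuming the ETL workers run as a plain batch/v1 Job:

```yaml
# Hypothetical morning follow-up: point the ETL Job at the new PriorityClass.
apiVersion: batch/v1
kind: Job
metadata:
  name: overnight-etl
  namespace: data-eng
spec:
  template:
    spec:
      priorityClassName: batch-critical  # the PriorityClass applied above
      containers:
      - name: worker
        image: etl-worker:latest         # placeholder image name
```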

One last thing — I always check that the autoscaler will scale back down once the job finishes. Don't want to leave 3 extra nodes burning money.

kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml | grep -iA 3 scaledown

Scale-down is enabled with a 10-minute cooldown. Good. Once the job completes and pods are cleaned up, those nodes will drain and terminate.
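For reference, that behavior is governed by flags on the autoscaler deployment. The flag names below come from the upstream cluster-autoscaler project; the values are what I'd expect, not confirmed from our manifest:

```yaml
# Illustrative cluster-autoscaler container args for scale-down (values assumed):
command:
- ./cluster-autoscaler
- --scale-down-enabled=true           # allow removing unneeded nodes
- --scale-down-unneeded-time=10m      # node must be unneeded this long first
- --scale-down-delay-after-add=10m    # cooldown after any scale-up
```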

What Made This Senior-Level

| Junior Would... | Senior Does... | Why |
| --- | --- | --- |
| Read the Pending event and not know what to do next | Parse the scheduler message to identify which filter rejected the pod (resources, taints, affinity) | The scheduler message IS the diagnosis — you just have to read it |
| Check Grafana dashboards for node utilization | Check kubectl describe node for allocated resources | Scheduling is based on requests, not utilization — dashboards show the wrong metric |
| Try to fix resource allocation across the cluster at 5 AM | Scale up the node pool to unblock the job NOW, defer optimization | Right-sizing is a project, not an incident response action |
| Not think about scale-down after scaling up | Verify the cluster autoscaler will clean up the extra nodes | Leaving extra nodes running costs money and masks the underlying problem |

Key Heuristics Used

  1. Read the Scheduler Message: Pending pods always have a scheduler event explaining why — parse it instead of guessing.
  2. Requests vs Usage: Kubernetes schedules on requests, not actual utilization — check allocations, not monitoring dashboards.
  3. Unblock Now, Optimize Later: During incidents, expand capacity to unblock the workload, then address root causes (over-provisioning, ASG limits) as follow-up work.

Cross-References

  • Primer — Pod lifecycle, init containers, and resource request/limit mechanics
  • Street Ops — The scheduling decision flowchart and resource inspection commands
  • Footguns — Over-requesting resources and not setting PriorityClasses for critical workloads