Kubernetes Pods & Scheduling - Street Ops¶
What experienced Kubernetes operators know about pods and scheduling. The stuff that matters at 2am when your pods aren't starting.
Why Is My Pod Pending?¶
A Pending pod means the scheduler can't place it on any node. This is the single most common scheduling problem you'll encounter.
Step 1: Describe the Pod¶
Look at the Events section at the bottom. The scheduler will tell you exactly why it can't place the pod.
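The pod name and namespace below are placeholders — substitute your stuck pod:

```shell
# Full picture: spec, status, and scheduler events at the bottom
kubectl describe pod stuck-pod -n production

# Or pull just the events, most recent last
kubectl get events -n production \
  --field-selector involvedObject.name=stuck-pod \
  --sort-by=.lastTimestamp
```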
Insufficient Resources¶
The pod's resource requests exceed what's available on any node.
# Check what each node has available
kubectl describe nodes | grep -A 5 "Allocated resources"
# Compact view of resource usage
kubectl top nodes
# See exactly what's requested vs allocatable
kubectl get nodes -o custom-columns=\
NAME:.metadata.name,\
CPU_ALLOC:.status.allocatable.cpu,\
MEM_ALLOC:.status.allocatable.memory
Fixes:
- Reduce the pod's resource requests
- Add more nodes to the cluster
- Evict or resize over-requesting pods on existing nodes
- Check if you're accidentally requesting 1 CPU (1 full core) when you meant 100m (0.1 cores)
Taint Mismatch¶
Warning FailedScheduling 0/5 nodes are available:
3 node(s) had untolerated taint {dedicated: gpu}, 2 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }.
Every node has a taint that the pod doesn't tolerate.
# See all node taints
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
# See taints on a specific node
kubectl describe node worker-1 | grep -A 5 Taints
Fixes:
- Add matching tolerations to the pod spec
- Remove the taint from the node: kubectl taint nodes worker-1 dedicated=gpu:NoSchedule-
- Check if a cluster autoscaler should be adding untainted nodes
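A toleration matching the `dedicated: gpu` taint from the event above would look like this (key and value taken from that example):

```yaml
spec:
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
```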
Affinity Rules Can't Be Satisfied¶
Warning FailedScheduling 0/5 nodes are available:
5 node(s) didn't match Pod's node affinity/selector.
The pod has nodeSelector or nodeAffinity rules and no node matches.
# Check what labels the pod requires
kubectl get pod stuck-pod -o jsonpath='{.spec.nodeSelector}'
kubectl get pod stuck-pod -o jsonpath='{.spec.affinity}'
# Check what labels nodes actually have
kubectl get nodes --show-labels
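If the pod's nodeSelector expects a label no node carries, either drop the selector or label a node. The `disktype=ssd` label here is hypothetical:

```shell
# Add the missing label to a node
kubectl label nodes worker-1 disktype=ssd

# Verify which nodes now match
kubectl get nodes -l disktype=ssd
```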
PVC Not Bound¶
Warning FailedScheduling 0/5 nodes are available:
5 node(s) didn't find available persistent volumes to bind.
The pod references a PersistentVolumeClaim that isn't bound to a PersistentVolume.
# Check PVC status
kubectl get pvc -n production
kubectl describe pvc data-pvc -n production
# Check if there's a matching PV
kubectl get pv
Common causes: StorageClass doesn't exist, no available PV in the right zone, capacity exhausted, dynamic provisioner is broken.
Debugging CrashLoopBackOff¶
CrashLoopBackOff means the container starts, crashes, and Kubernetes keeps restarting it with increasing backoff delays (10s, 20s, 40s... up to 5 minutes).
Step 1: Check the Logs¶
# Current (possibly empty if it crashes too fast)
kubectl logs crashy-pod -n production
# Previous crashed container
kubectl logs crashy-pod -n production --previous
# If multi-container pod, specify the container
kubectl logs crashy-pod -n production -c app --previous
Step 2: Check the Describe Output¶
Look for:
- Last State — shows the exit code and reason of the previous crash
- Exit Code 1 — application error (check logs)
- Exit Code 137 — SIGKILL (OOMKilled or external kill)
- Exit Code 139 — SIGSEGV (segmentation fault)
- Exit Code 143 — SIGTERM (graceful shutdown that still exited non-zero)
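Exit codes above 128 encode a signal: code = 128 + signal number. bash's `kill -l` builtin decodes them, a quick sanity check during an incident:

```shell
# 137 - 128 = 9: SIGKILL (OOMKill or external kill)
kill -l 137   # prints KILL
# 139 - 128 = 11: SIGSEGV
kill -l 139   # prints SEGV
# 143 - 128 = 15: SIGTERM
kill -l 143   # prints TERM
```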
Step 3: Get Into the Container¶
If the container crashes too fast to exec into it:
# Override the command to keep it alive
kubectl run debug-pod --image=myapp:v2.1.0 --restart=Never \
--command -- sleep 3600
# Then exec in and investigate
kubectl exec -it debug-pod -- /bin/sh
# Check if the binary exists, config files are present, etc.
ls -la /app/
cat /app/config.yaml
env | sort
Common CrashLoopBackOff Causes¶
| Symptom | Likely Cause |
|---|---|
| Exit code 1, "file not found" in logs | Wrong image, missing config, bad entrypoint |
| Exit code 137, OOMKilled in describe | Container exceeds memory limit |
| Exit code 1, "connection refused" | Dependency not ready, wrong endpoint |
| No logs at all | Container exits before writing anything — check command/args |
| Exit code 126 | Permission denied on the entrypoint binary |
| Exit code 127 | Entrypoint binary not found in PATH |
Debugging ImagePullBackOff¶
The kubelet can't pull the container image.
Look for events like:
Warning Failed Failed to pull image "myapp:v99": rpc error: code = NotFound
Warning Failed Failed to pull image "private.registry.io/myapp:v2": unauthorized
Common Causes and Fixes¶
Image doesn't exist: Typo in image name or tag. Verify the image exists in the registry.
# Test with docker/crane/skopeo outside the cluster
docker pull myapp:v2.1.0
crane digest myapp:v2.1.0
skopeo inspect docker://myapp:v2.1.0
Private registry, missing credentials: Create an imagePullSecret.
kubectl create secret docker-registry regcred \
--docker-server=private.registry.io \
--docker-username=myuser \
--docker-password=mypass \
-n production
Then reference it in the pod spec:
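The field is `imagePullSecrets`, referencing the secret created above:

```yaml
spec:
  imagePullSecrets:
  - name: regcred
  containers:
  - name: myapp
    image: private.registry.io/myapp:v2
```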
Rate limiting: Docker Hub rate limits anonymous and free-tier pulls. You'll see toomanyrequests in the error. Use a paid plan, mirror the image, or use a pull-through cache.
Evicted Pods (Node Pressure)¶
When a node runs low on resources (memory, disk, PIDs), the kubelet starts evicting pods.
# Find evicted pods
kubectl get pods -A --field-selector status.phase=Failed | grep Evicted
# Check why
kubectl describe pod evicted-pod-xyz -n production
# Look for: "The node was low on resource: memory"
Eviction Priority Order¶
The kubelet evicts in this order:
1. BestEffort pods (no resource requests/limits)
2. Burstable pods exceeding their requests
3. Burstable pods within their requests
4. Guaranteed pods (last resort)
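To see where your pods sit in that order, print each pod's QoS class (Kubernetes records it in `.status.qosClass`):

```shell
kubectl get pods -A -o custom-columns=\
NS:.metadata.namespace,NAME:.metadata.name,QOS:.status.qosClass
```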
Investigating Node Pressure¶
# Check node conditions
kubectl describe node worker-2 | grep -A 3 Conditions
# Look for:
# MemoryPressure True
# DiskPressure True
# PIDPressure True
# See what's consuming resources on the node
kubectl top pods --sort-by=memory -A | head -20
# Check kubelet eviction thresholds (on the node)
grep -A 10 eviction /var/lib/kubelet/config.yaml
Default eviction thresholds: memory.available < 100Mi, nodefs.available < 10%, imagefs.available < 15%.
Cleaning Up Evicted Pods¶
Evicted pods stay in Failed state and clutter the namespace. Clean them up:
# Delete all evicted pods across all namespaces
kubectl get pods -A --field-selector status.phase=Failed -o json | \
kubectl delete -f -
# Or more targeted
kubectl delete pods --field-selector status.phase=Failed -n production
Pod Stuck in Terminating¶
You deleted a pod and it's been Terminating for minutes (or hours).
Check for Finalizers¶
Finalizers are metadata on the object that must be cleared before Kubernetes will remove it.
If there's a finalizer, something (a controller, operator) is supposed to do cleanup and remove it. If that controller is broken, the finalizer never gets cleared.
Nuclear option — remove the finalizer (understand the consequences first):
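Assuming the pod is `stuck-pod` in `production`, check and, if you must, clear the finalizers:

```shell
# See what finalizers are set
kubectl get pod stuck-pod -n production -o jsonpath='{.metadata.finalizers}'

# Nuclear option: clear them (the controller that owned them never ran its cleanup)
kubectl patch pod stuck-pod -n production \
  -p '{"metadata":{"finalizers":null}}'
```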
Force Delete¶
If the pod is stuck because the node is unreachable:
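The pod name here is a placeholder:

```shell
kubectl delete pod stuck-pod -n production --grace-period=0 --force
```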
This tells the API server to remove the pod object immediately. If the node comes back, the kubelet will clean up the actual container.
Warning: Force-deleting a pod from a StatefulSet can cause split-brain — two pods with the same identity running simultaneously. Only force-delete after confirming the old node is truly dead.
Debugging Init Container Failures¶
If a pod is stuck in Init:0/2 or Init:CrashLoopBackOff, an init container is failing.
# See which init container is failing
kubectl describe pod myapp -n production
# Look at Init Containers section — the first one with state Waiting or Terminated with non-zero exit
# Get logs from a specific init container
kubectl logs myapp -n production -c wait-for-db
kubectl logs myapp -n production -c run-migrations --previous
Common init container failures:
- DNS not resolving yet (init container runs before CoreDNS is ready in some bootstrap scenarios)
- Database not accepting connections (init container waiting for a service that isn't ready)
- Permission denied on config generation scripts
- Init container image wrong or missing
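A typical wait-for-db init container looks like this. The service name `db`, port, and images are placeholders:

```yaml
spec:
  initContainers:
  - name: wait-for-db
    image: busybox:1.36
    # Block until the database service accepts TCP connections
    command: ['sh', '-c', 'until nc -z db 5432; do echo waiting for db; sleep 2; done']
  containers:
  - name: app
    image: myapp:v2.1.0
```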
Resource Right-Sizing¶
Over-requesting wastes cluster capacity. Under-requesting causes OOMKills and throttling.
Using metrics-server¶
# Install metrics-server if not present
kubectl top pods -n production --sort-by=cpu
kubectl top pods -n production --sort-by=memory
# Compare actual usage to requests
kubectl top pod myapp -n production --containers
Using VPA Recommendations¶
The Vertical Pod Autoscaler can run in recommendation-only mode:
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Off"  # Just recommend, don't auto-apply
# Check VPA recommendations
kubectl describe vpa api-vpa -n production
# Look at: Lower Bound, Target, Upper Bound, Uncapped Target
Rules of Thumb¶
- Set CPU request to the P95 usage. Set CPU limit to 2-4x the request (or remove it — CPU throttling is often worse than overcommit).
- Set memory request to the P99 usage. Set memory limit to 1.5-2x the request.
- Review and adjust monthly. Workload patterns change.
- Never set memory limit == request for bursty workloads. You're one spike away from OOMKill.
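Applied to a container whose P95 CPU is around 200m and P99 memory around 400Mi (numbers illustrative):

```yaml
resources:
  requests:
    cpu: 200m        # roughly P95 usage
    memory: 400Mi    # roughly P99 usage
  limits:
    cpu: 800m        # 4x request, or omit to avoid throttling
    memory: 800Mi    # 2x request, headroom for spikes
```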
Scheduling Spread for HA¶
For production workloads, you want replicas spread across failure domains.
Zone Spread¶
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: api-server
This ensures no zone has more than 1 extra pod compared to any other zone. With 3 zones and 6 replicas, you get 2-2-2.
Node Spread + Zone Spread¶
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: api-server
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: api-server
Hard constraint: even zone spread. Soft constraint: even node spread within each zone.
Draining Nodes Safely¶
Before maintenance (kernel upgrades, instance replacement), drain the node properly.
Pre-Drain Checklist¶
# 1. Check PDBs — will drain be blocked?
kubectl get pdb -A
# 2. Check for bare pods (no controller) — they won't come back
kubectl get pods -A -o json | jq -r '.items[] |
select(.metadata.ownerReferences == null) |
"\(.metadata.namespace)/\(.metadata.name)"'
# 3. Check DaemonSets
kubectl get ds -A
# 4. Preview what will be evicted
kubectl drain worker-3 --dry-run=client --ignore-daemonsets --delete-emptydir-data
Execute the Drain¶
# Cordon first (stop new pods from being scheduled)
kubectl cordon worker-3
# Drain with safety flags
kubectl drain worker-3 \
--ignore-daemonsets \
--delete-emptydir-data \
--timeout=600s \
--pod-selector='app!=critical-singleton'
# After maintenance, uncordon
kubectl uncordon worker-3
If Drain Gets Stuck¶
# Check which pods are blocking
kubectl get pods -A --field-selector spec.nodeName=worker-3
# Check PDB status — is it blocking?
kubectl get pdb -A -o wide
# If a PDB is blocking, you might need to:
# 1. Scale up the deployment so drain can evict one pod
# 2. Temporarily increase maxUnavailable on the PDB
# 3. As a last resort, delete the PDB (and recreate after)
Inspecting Pod Resource Usage¶
# Cluster-wide — which pods are eating the most
kubectl top pods -A --sort-by=cpu | head -20
kubectl top pods -A --sort-by=memory | head -20
# Specific namespace
kubectl top pods -n production
# Per-container breakdown
kubectl top pod myapp -n production --containers
# Node-level resource usage
kubectl top nodes
# See resource requests vs actual usage (manual comparison)
kubectl describe node worker-1 | grep -A 20 "Allocated resources"
Detecting Resource Hogs¶
# Find pods with no resource limits (potential noisy neighbors)
# Find pods with no resource limits (potential noisy neighbors)
kubectl get pods -A -o json | jq -r '.items[] |
"\(.metadata.namespace)/\(.metadata.name)" as $pod |
.spec.containers[] |
select(.resources.limits == null) |
"\($pod)/\(.name): no limits set"'
# Find pods with BestEffort QoS
kubectl get pods -A -o json | jq -r '.items[] |
select(.status.qosClass == "BestEffort") |
"\(.metadata.namespace)/\(.metadata.name)"'
When Metrics-Server Isn't Enough¶
kubectl top uses metrics-server, which holds only the most recent scrape in memory (no history). For historical resource usage:
- Prometheus + Grafana: container_cpu_usage_seconds_total, container_memory_working_set_bytes
- kubectl with raw metrics: kubectl get --raw /apis/metrics.k8s.io/v1beta1/pods
- cAdvisor: direct access on each node at port 10250 (kubelet) or 4194 (standalone)
# CPU usage over time (PromQL)
rate(container_cpu_usage_seconds_total{namespace="production", pod="myapp-xyz"}[5m])
# Memory working set
container_memory_working_set_bytes{namespace="production", pod="myapp-xyz"}
# OOMKilled events
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}