# Runbook: Pod Eviction

## Symptoms

- Pods in `Evicted` status across multiple deployments
- `kubectl describe pod` shows `The node was low on resource: [memory|ephemeral-storage]`
- Node shows `MemoryPressure`, `DiskPressure`, or `PIDPressure` conditions
- Alert for pod evictions or node pressure
## Fast Triage

```bash
# Count evicted pods
kubectl get pods -A --field-selector status.phase=Failed | grep Evicted | wc -l

# See eviction details
kubectl get pods -A --field-selector status.phase=Failed -o json | \
  jq -r '.items[] | select(.status.reason=="Evicted") | "\(.metadata.namespace)/\(.metadata.name): \(.status.message)"'

# Check node conditions
kubectl describe nodes | grep -A5 "Conditions:"

# Check node resource usage
kubectl top nodes
```
## Likely Causes (ranked)

1. Ephemeral storage exhaustion — container logs, emptyDir volumes, or tmp files filling the node disk
2. Memory pressure — node running out of allocatable memory
3. Disk pressure — node root disk filling up (images, logs, containers)
4. PID exhaustion — too many processes on the node
5. BestEffort pods go first — pods without resource requests are the first eviction candidates (a quick check follows this list)
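To see which pods are exposed, list QoS classes directly; `qosClass` is a standard field in pod status:

```bash
# List pods by QoS class — BestEffort pods are the kubelet's first eviction targets
kubectl get pods -A -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,QOS:.status.qosClass'
```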
## Evidence Interpretation

What bad looks like:

```text
Status:  Failed
Reason:  Evicted
Message: The node was low on resource: ephemeral-storage.
         Container app was using 2Gi, which exceeds its request of 0.
```
Eviction priority order:
1. BestEffort pods (no requests/limits) — evicted first
2. Burstable pods (partial requests) — evicted next
3. Guaranteed pods (requests == limits) — evicted last
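For reference, a minimal pod spec that lands in the Guaranteed class, with requests equal to limits for every container (name and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-example   # hypothetical name
spec:
  containers:
    - name: app
      image: nginx:1.25      # placeholder image
      resources:
        requests:            # requests == limits for CPU and memory
          cpu: "250m"
          memory: "256Mi"
        limits:
          cpu: "250m"
          memory: "256Mi"
```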
## Fix

### 1. Clean up evicted pods

> [!TIP]
> Evicted pods are already dead — they will never run again. Don't troubleshoot them; delete them and focus on the root cause (node resource pressure) to prevent future evictions.
```bash
# Delete all evicted pods (they're already dead)
kubectl get pods -A --field-selector status.phase=Failed -o json | \
  jq -r '.items[] | select(.status.reason=="Evicted") | "\(.metadata.namespace) \(.metadata.name)"' | \
  xargs -n2 kubectl delete pod -n
```
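If deleting every Failed pod is acceptable (Evicted is just one Failed reason), the field selector alone does the job without jq:

```bash
# Removes all Failed pods, not only Evicted ones
kubectl delete pods -A --field-selector status.phase=Failed
```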
### 2. Fix ephemeral storage pressure

```bash
# On the node — check disk usage
df -h
du -sh /var/lib/containerd/*   # Container runtime storage
du -sh /var/log/pods/*         # Pod logs

# Clean up
crictl rmi --prune             # Remove unused images
journalctl --vacuum-size=500M  # Trim journal logs
```

In manifests, set `ephemeral-storage` requests and limits on each container:

```yaml
resources:
  requests:
    ephemeral-storage: "1Gi"
  limits:
    ephemeral-storage: "2Gi"
```
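If the pressure comes from `emptyDir` scratch volumes, a `sizeLimit` caps them: once the volume grows past the limit the kubelet evicts the pod instead of letting it fill the node. A minimal sketch, assuming hypothetical pod name, image, and mount path:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: scratch-capped        # hypothetical name
spec:
  containers:
    - name: app
      image: busybox:1.36     # placeholder image
      command: ["sleep", "infinity"]
      volumeMounts:
        - name: scratch
          mountPath: /scratch
      resources:
        requests:
          ephemeral-storage: "1Gi"
        limits:
          ephemeral-storage: "2Gi"
  volumes:
    - name: scratch
      emptyDir:
        sizeLimit: 1Gi        # exceeding this evicts the pod
```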
### 3. Fix memory pressure

```bash
# Find memory-hungry pods on the node
kubectl top pods -A --sort-by=memory | head -20

# Check which pods have no memory requests (BestEffort)
kubectl get pods -A -o json | jq -r '
  .items[] |
  select(.spec.containers[] | .resources.requests.memory == null) |
  "\(.metadata.namespace)/\(.metadata.name)"'

# Fix: add memory requests and limits to all workloads
```
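For existing workloads, `kubectl set resources` patches requests and limits in place; the deployment and namespace names below are placeholders, and the change triggers a rolling restart:

```bash
kubectl set resources deployment/<name> -n <namespace> \
  --requests=memory=256Mi --limits=memory=512Mi
```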
### 4. Fix disk pressure from images

```bash
# On the node:
crictl images | sort -k4 -rh | head -20  # Largest images (SIZE is the 4th column)
crictl rmi --prune                       # Remove unused images

# If container runtime disk is separate:
df -h /var/lib/containerd
# May need to expand the disk or move to a larger volume
```
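The kubelet also garbage-collects unused images on its own; lowering the thresholds makes it reclaim space before DiskPressure triggers. A sketch of the relevant KubeletConfiguration fields (the documented defaults are 85 and 80 percent):

```yaml
# In the kubelet config file (KubeletConfiguration)
imageGCHighThresholdPercent: 80  # start image GC above this disk usage
imageGCLowThresholdPercent: 70   # free images until usage drops to this
```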
### 5. Adjust eviction thresholds

```bash
# Check node allocatable (capacity minus system reservations and eviction thresholds)
kubectl get node <name> -o json | jq '.status.allocatable'
```

Eviction thresholds live in the kubelet configuration file (KubeletConfiguration):

```yaml
evictionHard:
  memory.available: "100Mi"
  nodefs.available: "10%"
  imagefs.available: "15%"
evictionSoft:
  memory.available: "200Mi"
evictionSoftGracePeriod:
  memory.available: "1m30s"
```
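To confirm what the running kubelet actually loaded, many clusters expose the node proxy `configz` endpoint; where it is enabled, this dumps the live eviction settings:

```bash
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/configz" | \
  jq '.kubeletconfig | {evictionHard, evictionSoft, evictionSoftGracePeriod}'
```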
## Verification

```bash
# Confirm no evicted pods remain
kubectl get pods -A --field-selector status.phase=Failed | grep Evicted
# Expected: no output

# Confirm node pressure conditions are cleared
kubectl describe node <node-name> | grep -A5 Conditions
# Expected: MemoryPressure=False, DiskPressure=False
```
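To sweep pressure conditions across every node at once rather than describing them one at a time:

```bash
kubectl get nodes -o json | jq -r '
  .items[] |
  "\(.metadata.name): " + ([.status.conditions[]
    | select(.type | test("Pressure"))
    | "\(.type)=\(.status)"] | join(" "))'
# Expected: MemoryPressure=False DiskPressure=False PIDPressure=False on every node
```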
## Prevention

- Always set resource requests and limits — prevents BestEffort pods (see the LimitRange sketch after this list)
- Set `ephemeral-storage` requests on all containers
- Monitor node disk usage and set alerts at 80%
- Use `emptyDir.sizeLimit` to prevent runaway temp file usage
- Configure log rotation (container runtime + application level)
- Right-size nodes to match workload requirements
- Use the cluster autoscaler to add capacity before evictions happen
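A namespace-scoped LimitRange assigns defaults to any container that ships without requests or limits, so nothing lands as BestEffort. A minimal sketch, assuming hypothetical name, namespace, and sizing:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-resources   # hypothetical name
  namespace: production     # hypothetical namespace
spec:
  limits:
    - type: Container
      defaultRequest:        # applied when a container sets no requests
        memory: "128Mi"
        ephemeral-storage: "512Mi"
      default:               # applied when a container sets no limits
        memory: "256Mi"
        ephemeral-storage: "1Gi"
```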
## Automated Cleanup

```bash
# CronJob to clean evicted pods every 6 hours
kubectl create cronjob cleanup-evicted \
  --image=bitnami/kubectl:latest \
  --schedule="0 */6 * * *" \
  -- /bin/sh -c 'kubectl get pods -A --field-selector status.phase=Failed -o json | jq -r ".items[] | select(.status.reason==\"Evicted\") | \"\(.metadata.namespace) \(.metadata.name)\"" | xargs -n2 kubectl delete pod -n'
```
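Note that the job runs under the namespace's default service account, which usually cannot list or delete pods cluster-wide. A minimal RBAC sketch, with all names hypothetical; the CronJob's pod template must also reference the service account via `serviceAccountName`:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: evicted-cleaner     # hypothetical name
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: evicted-cleaner
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: evicted-cleaner
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: evicted-cleaner
subjects:
  - kind: ServiceAccount
    name: evicted-cleaner
    namespace: default
```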