Incident Replay: Node Pressure Evictions¶
Setup¶
- System context: Kubernetes cluster with 8 worker nodes. Pods are being evicted from k8s-worker-06 with "The node was low on resource: ephemeral-storage." Applications are crashing and restarting on other nodes.
- Time: Friday 13:00 UTC
- Your role: Platform engineer / on-call SRE
Round 1: Alert Fires¶
[Pressure cue: "15 pods evicted from k8s-worker-06 in the last 10 minutes. Services are degraded while rescheduling. More evictions happening."]
What you see:
`kubectl get events --field-selector reason=Evicted` shows a stream of evictions from worker-06. `kubectl describe node k8s-worker-06` shows condition `DiskPressure=True` and "ephemeral-storage available: 1.2Gi (threshold: 2Gi)."
Choose your action:
- A) Cordon the node to prevent new pods from being scheduled
- B) Check what is consuming disk space on the node
- C) Lower the eviction threshold to stop the evictions
- D) Delete evicted pods to clean up the noise
If you chose B (recommended):¶
[Result: SSH to the node. `df -h /var/lib/kubelet` shows 94% full. `du -sh /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/*` reveals old container image layers consuming 45GB. Image garbage collection is not keeping up. Proceed to Round 2.]
If you chose A:¶
[Result: Cordoning prevents new scheduling but existing pods continue to be evicted. Partially helps.]
If you chose C:¶
[Result: Lowering the eviction threshold means pods run on a nearly full disk, risking complete disk exhaustion and a kubelet crash.]
If you chose D:¶
[Result: Evicted pods are already terminated. Deleting them cleans up kubectl output but does not fix the disk pressure.]
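The Round 1 triage on the node can be sketched as a short script. `disk_pct` is a hypothetical helper name, not a standard tool; the kubelet and containerd paths are the ones from the scenario.

```shell
#!/bin/sh
# Sketch of the Round 1 disk triage, run on the affected node.
# disk_pct is a hypothetical helper, not a standard tool.

# Print the Use% column for a mount point from POSIX `df` output.
disk_pct() {
  df -hP "$1" | awk 'NR==2 { gsub("%", "", $5); print $5 }'
}

# /var/lib/kubelet is the scenario's mount; fall back to / so the
# sketch also runs on machines without a kubelet.
mnt=/var/lib/kubelet
[ -d "$mnt" ] || mnt=/
echo "disk usage for $mnt: $(disk_pct "$mnt")%"

# Rank the biggest consumers under containerd's overlayfs snapshots.
du -sh /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/* \
  2>/dev/null | sort -rh | head -10
```

In the incident, the `du` ranking is what surfaced the 45GB of stale image layers.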
Round 2: First Triage Data¶
[Pressure cue: "Worker-06 is almost empty from evictions. Other nodes are picking up the load but getting crowded."]
What you see:
Container image garbage collection is configured with `imageGCHighThresholdPercent: 85` and `imageGCLowThresholdPercent: 80`. Disk usage hit 85%, but GC reclaimed little because many images are still referenced by running pods, and old unused layers are not being cleaned up fast enough to keep pace with accumulation.
Choose your action:
- A) Manually trigger image garbage collection: `crictl rmi --prune`
- B) Delete old container logs that are also consuming space
- C) Identify and clean up the largest unused image layers
- D) Both A and C — prune images and clean up large unused layers
If you chose D (recommended):¶
[Result: `crictl rmi --prune` removes 12GB of unused images. Additionally, stale build cache under `/var/lib/containerd` clears another 8GB. Disk usage drops to 68%. DiskPressure condition clears. Evictions stop. Proceed to Round 3.]
If you chose A:¶
[Result: Prune removes some images but the largest consumers are stale layers not caught by prune. Partial fix.]
If you chose B:¶
[Result: Container logs account for only 2GB. Helps marginally but the images are the main consumer.]
If you chose C:¶
[Result: Correct but slower than combining with automated prune.]
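The Round 2 decision can be sketched as a guarded cleanup. `should_prune` is a hypothetical helper, and 85 mirrors the scenario's `imageGCHighThresholdPercent`; the `crictl` call is gated so the sketch degrades gracefully off-node.

```shell
#!/bin/sh
# Sketch of the Round 2 cleanup decision. should_prune is a hypothetical
# helper; 85 mirrors the scenario's imageGCHighThresholdPercent.

# True when usage (an integer percent) has reached the threshold.
should_prune() {
  [ "$1" -ge "${2:-85}" ]
}

usage=94   # the scenario's observed usage on /var/lib/kubelet

if should_prune "$usage" 85; then
  echo "usage ${usage}% >= 85%, pruning unused images"
  # Prune removes images not referenced by any running container;
  # run only where crictl exists.
  if command -v crictl >/dev/null 2>&1; then
    crictl rmi --prune
  fi
fi
```

As Round 2 notes, prune alone misses stale layers; pairing it with a manual sweep of the largest unused snapshots is what cleared the pressure.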
Round 3: Root Cause Identification¶
[Pressure cue: "Disk pressure resolved. Why did GC fail to prevent this?"]
What you see:
Root cause: The cluster deploys many short-lived jobs that pull unique images. Each job creates image layers that are never reused. The GC threshold of 85% is too high for this workload pattern: by the time GC triggers, there is not enough headroom before hitting the eviction threshold (2Gi of available space).
Choose your action:
- A) Lower the GC high threshold to 70% and low threshold to 60%
- B) Add a CronJob to prune unused images nightly
- C) Use a shared base image for jobs to reduce unique layer proliferation
- D) All of the above
If you chose D (recommended):¶
[Result: Lower GC thresholds provide more headroom. Nightly prune catches anything GC misses. Shared base images reduce the rate of new layer accumulation. Proceed to Round 4.]
If you chose A:¶
[Result: Lower thresholds help but jobs still create many unique layers.]
If you chose B:¶
[Result: Nightly prune helps but GC should catch most cases automatically.]
If you chose C:¶
[Result: Reduces the problem rate but does not fix the GC configuration.]
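The threshold change from option A lands in the kubelet configuration. A minimal sketch, writing the fragment to `/tmp` as a stand-in; the real drop-in location depends on how kubelet is managed in your cluster (an assumption here).

```shell
#!/bin/sh
# Write a KubeletConfiguration fragment with the lowered image GC
# thresholds from Round 3. /tmp is a stand-in path; the real drop-in
# location depends on how kubelet is deployed (assumption).
cat > /tmp/kubelet-image-gc.yaml <<'EOF'
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
imageGCHighThresholdPercent: 70
imageGCLowThresholdPercent: 60
EOF
```

With these values, GC starts reclaiming at 70% usage and frees down to 60%, leaving headroom well above the 2Gi ephemeral-storage eviction threshold.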
Round 4: Remediation¶
[Pressure cue: "Node healthy. Uncordon and verify."]
Actions:
1. Uncordon the node: `kubectl uncordon k8s-worker-06`
2. Verify DiskPressure is False: `kubectl describe node k8s-worker-06`
3. Verify evicted pods have been rescheduled successfully
4. Apply the new GC thresholds across all nodes via the kubelet config
5. Deploy the image-pruning job so it runs on every node (for example, as a DaemonSet that prunes periodically, since a plain CronJob does not target each node)
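Step 2 can be scripted for repeated checks. `disk_pressure_clear` is a hypothetical helper that parses `kubectl describe node` output; the jsonpath variant queries the condition directly.

```shell
#!/bin/sh
# Sketch of the Round 4 verification. disk_pressure_clear is a
# hypothetical helper that reads `kubectl describe node` output on stdin
# and succeeds when the DiskPressure condition is False.
disk_pressure_clear() {
  grep -Eq '^[[:space:]]*DiskPressure[[:space:]]+False'
}

# Against a live cluster this would be:
#   kubectl uncordon k8s-worker-06
#   kubectl describe node k8s-worker-06 | disk_pressure_clear && echo OK
# or, querying the condition directly:
#   kubectl get node k8s-worker-06 \
#     -o jsonpath='{.status.conditions[?(@.type=="DiskPressure")].status}'
```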
Damage Report¶
- Total downtime: 0 (pods rescheduled to other nodes)
- Blast radius: 15 pods evicted from one node; brief service disruption during reschedule
- Optimal resolution time: 15 minutes (identify disk usage -> prune images -> clear pressure)
- If every wrong choice was made: 60+ minutes with cascading evictions across multiple nodes
Cross-References¶
- Primer: Kubernetes Node Lifecycle
- Primer: Kubernetes Ops
- Primer: Container Runtime Debug
- Footguns: Kubernetes Ops