Incident Replay: Node Pressure Evictions¶
Setup¶
- System context: Kubernetes cluster with 8 worker nodes. Pods are being evicted from k8s-worker-06 with "The node was low on resource: ephemeral-storage." Applications are crashing and restarting on other nodes.
- Time: Friday 13:00 UTC
- Your role: Platform engineer / on-call SRE
Round 1: Alert Fires¶
[Pressure cue: "15 pods evicted from k8s-worker-06 in the last 10 minutes. Services are degraded while rescheduling. More evictions happening."]
What you see:
`kubectl get events --field-selector reason=Evicted` shows a stream of evictions from worker-06. `kubectl describe node k8s-worker-06` shows condition `DiskPressure=True` and "ephemeral-storage available: 1.2Gi (threshold: 2Gi)."
Choose your action:
- A) Cordon the node to prevent new pods from being scheduled
- B) Check what is consuming disk space on the node
- C) Lower the eviction threshold to stop the evictions
- D) Delete evicted pods to clean up the noise
If you chose B (recommended):¶
[Result: SSH to the node. `df -h /var/lib/kubelet` shows 94% full. `du -sh /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/*` reveals old container image layers consuming 45GB. Image garbage collection is not keeping up. Proceed to Round 2.]
If you chose A:¶
[Result: Cordoning prevents new scheduling but existing pods continue to be evicted. Partially helps.]
If you chose C:¶
[Result: Lowering the eviction threshold means pods run on a nearly full disk, risking complete disk exhaustion and a kubelet crash.]
If you chose D:¶
[Result: Evicted pods are already terminated. Deleting them cleans up kubectl output but does not fix the disk pressure.]
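The Round 1 triage on the node can be sketched as a short script. `disk_pct` is a hypothetical helper name, not a standard tool; the kubelet and containerd paths are the ones from the scenario.

```shell
#!/bin/sh
# Sketch of the Round 1 disk triage, run on the affected node.
# disk_pct is a hypothetical helper, not a standard tool.

# Print the Use% column for a mount point from POSIX `df` output.
disk_pct() {
  df -hP "$1" | awk 'NR==2 { gsub("%", "", $5); print $5 }'
}

# /var/lib/kubelet is the scenario's mount; fall back to / so the
# sketch also runs on machines without a kubelet.
mnt=/var/lib/kubelet
[ -d "$mnt" ] || mnt=/
echo "disk usage for $mnt: $(disk_pct "$mnt")%"

# Rank the biggest consumers under containerd's overlayfs snapshots.
du -sh /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/* \
  2>/dev/null | sort -rh | head -10
```

In the incident, the `du` ranking is what surfaced the 45GB of stale image layers.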
Round 2: First Triage Data¶
[Pressure cue: "Worker-06 is almost empty from evictions. Other nodes are picking up the load but getting crowded."]
What you see:
Container image garbage collection is configured with `imageGCHighThresholdPercent: 85` and `imageGCLowThresholdPercent: 80`. Disk usage hit 85%, but GC reclaimed little because many images are still referenced by running pods, and old unused layers are not being cleaned up fast enough to keep pace with accumulation.
Choose your action:
- A) Manually trigger image garbage collection: `crictl rmi --prune`
- B) Delete old container logs that are also consuming space
- C) Identify and clean up the largest unused image layers
- D) Both A and C — prune images and clean up large unused layers
If you chose D (recommended):¶
[Result: `crictl rmi --prune` removes 12GB of unused images. Additionally, stale build cache under `/var/lib/containerd` clears another 8GB. Disk usage drops to 68%. DiskPressure condition clears. Evictions stop. Proceed to Round 3.]
If you chose A:¶
[Result: Prune removes some images but the largest consumers are stale layers not caught by prune. Partial fix.]
If you chose B:¶
[Result: Container logs account for only 2GB. Helps marginally but the images are the main consumer.]
If you chose C:¶
[Result: Correct but slower than combining with automated prune.]
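The Round 2 decision can be sketched as a guarded cleanup. `should_prune` is a hypothetical helper, and 85 mirrors the scenario's `imageGCHighThresholdPercent`; the `crictl` call is gated so the sketch degrades gracefully off-node.

```shell
#!/bin/sh
# Sketch of the Round 2 cleanup decision. should_prune is a hypothetical
# helper; 85 mirrors the scenario's imageGCHighThresholdPercent.

# True when usage (an integer percent) has reached the threshold.
should_prune() {
  [ "$1" -ge "${2:-85}" ]
}

usage=94   # the scenario's observed usage on /var/lib/kubelet

if should_prune "$usage" 85; then
  echo "usage ${usage}% >= 85%, pruning unused images"
  # Prune removes images not referenced by any running container;
  # run only where crictl exists.
  if command -v crictl >/dev/null 2>&1; then
    crictl rmi --prune
  fi
fi
```

As Round 2 notes, prune alone misses stale layers; pairing it with a manual sweep of the largest unused snapshots is what cleared the pressure.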
Round 3: Root Cause Identification¶
[Pressure cue: "Disk pressure resolved. Why did GC fail to prevent this?"]
What you see:
Root cause: The cluster deploys many short-lived jobs that pull unique images. Each job creates image layers that are never reused. The GC threshold of 85% is too high for this workload pattern: by the time GC triggers, there is not enough headroom before hitting the eviction threshold (2Gi of available space).
Choose your action:
- A) Lower the GC high threshold to 70% and low threshold to 60%
- B) Add a CronJob to prune unused images nightly
- C) Use a shared base image for jobs to reduce unique layer proliferation
- D) All of the above
If you chose D (recommended):¶
[Result: Lower GC thresholds provide more headroom. Nightly prune catches anything GC misses. Shared base images reduce the rate of new layer accumulation. Proceed to Round 4.]
If you chose A:¶
[Result: Lower thresholds help but jobs still create many unique layers.]
If you chose B:¶
[Result: Nightly prune helps but GC should catch most cases automatically.]
If you chose C:¶
[Result: Reduces the problem rate but does not fix the GC configuration.]
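The threshold change from option A lands in the kubelet configuration. A minimal sketch, writing the fragment to `/tmp` as a stand-in; the real drop-in location depends on how kubelet is managed in your cluster (an assumption here).

```shell
#!/bin/sh
# Write a KubeletConfiguration fragment with the lowered image GC
# thresholds from Round 3. /tmp is a stand-in path; the real drop-in
# location depends on how kubelet is deployed (assumption).
cat > /tmp/kubelet-image-gc.yaml <<'EOF'
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
imageGCHighThresholdPercent: 70
imageGCLowThresholdPercent: 60
EOF
```

With these values, GC starts reclaiming at 70% usage and frees down to 60%, leaving headroom well above the 2Gi ephemeral-storage eviction threshold.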
Round 4: Remediation¶
[Pressure cue: "Node healthy. Uncordon and verify."]
Actions:
1. Uncordon the node: `kubectl uncordon k8s-worker-06`
2. Verify DiskPressure is False: `kubectl describe node k8s-worker-06`
3. Verify evicted pods have been rescheduled successfully
4. Apply the new GC thresholds across all nodes via the kubelet config
5. Deploy the image-pruning job so it runs on every node (for example, as a DaemonSet that prunes periodically, since a plain CronJob does not target each node)
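Step 2 can be scripted for repeated checks. `disk_pressure_clear` is a hypothetical helper that parses `kubectl describe node` output; the jsonpath variant queries the condition directly.

```shell
#!/bin/sh
# Sketch of the Round 4 verification. disk_pressure_clear is a
# hypothetical helper that reads `kubectl describe node` output on stdin
# and succeeds when the DiskPressure condition is False.
disk_pressure_clear() {
  grep -Eq '^[[:space:]]*DiskPressure[[:space:]]+False'
}

# Against a live cluster this would be:
#   kubectl uncordon k8s-worker-06
#   kubectl describe node k8s-worker-06 | disk_pressure_clear && echo OK
# or, querying the condition directly:
#   kubectl get node k8s-worker-06 \
#     -o jsonpath='{.status.conditions[?(@.type=="DiskPressure")].status}'
```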
Damage Report¶
- Total downtime: 0 (pods rescheduled to other nodes)
- Blast radius: 15 pods evicted from one node; brief service disruption during reschedule
- Optimal resolution time: 15 minutes (identify disk usage -> prune images -> clear pressure)
- If every wrong choice was made: 60+ minutes with cascading evictions across multiple nodes
Cross-References¶
- Primer: Kubernetes Node Lifecycle
- Primer: Kubernetes Ops
- Primer: Container Runtime Debug
- Footguns: Kubernetes Ops