Incident Replay: Node Pressure Evictions

Setup

  • System context: Kubernetes cluster with 8 worker nodes. Pods are being evicted from k8s-worker-06 with "The node was low on resource: ephemeral-storage." Applications are crashing and restarting on other nodes.
  • Time: Friday 13:00 UTC
  • Your role: Platform engineer / on-call SRE

Round 1: Alert Fires

[Pressure cue: "15 pods evicted from k8s-worker-06 in the last 10 minutes. Services are degraded while rescheduling. More evictions happening."]

What you see: kubectl get events --field-selector reason=Evicted shows a stream of evictions from worker-06. kubectl describe node k8s-worker-06 shows condition "DiskPressure=True" and "ephemeral-storage available: 1.2Gi (threshold: 2Gi)."
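
To reproduce that view yourself, a quick confirmation pass might look like this (it assumes your kubectl context already points at the affected cluster):

# Recent evictions across the cluster, newest last.
kubectl get events -A --field-selector reason=Evicted --sort-by=.lastTimestamp | tail -n 20

# Node conditions; look for DiskPressure=True under Conditions.
kubectl describe node k8s-worker-06 | grep -A 8 'Conditions:'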

Choose your action:

  A) Cordon the node to prevent new pods from being scheduled
  B) Check what is consuming disk space on the node
  C) Lower the eviction threshold so evictions trigger later
  D) Delete evicted pods to clean up the noise

[Result: SSH to the node. df -h /var/lib/kubelet shows 94% full. du -sh /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/* reveals old container image layers consuming 45GB. Image garbage collection is not keeping up. Proceed to Round 2.]
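
On the node itself, a triage pass along these lines surfaces the same numbers (paths assume containerd with its default overlayfs snapshotter; adjust for your runtime and layout):

# How full are the filesystems backing the kubelet and the image store?
df -h /var/lib/kubelet /var/lib/containerd

# Ask the runtime directly how much the image filesystem is using.
crictl imagefsinfo

# Largest snapshot directories; unused image layers accumulate here.
du -sh /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/* \
  | sort -rh | head -n 20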

If you chose A:

[Result: Cordoning prevents new scheduling but existing pods continue to be evicted. Partially helps.]
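
For reference, cordoning is a single command and combines well with option B while you investigate:

# Keep the scheduler from placing new pods on the node.
kubectl cordon k8s-worker-06
kubectl get node k8s-worker-06    # STATUS now shows SchedulingDisabled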

If you chose C:

[Result: Lowering the eviction threshold means pods run on a nearly full disk — risk of complete disk exhaustion and kubelet crash.]

If you chose D:

[Result: Evicted pods are already terminated. Deleting them cleans up kubectl output but does not fix the disk pressure.]

Round 2: First Triage Data

[Pressure cue: "Worker-06 is almost empty from evictions. Other nodes are picking up the load but getting crowded."]

What you see: Container image garbage collection is configured with imageGCHighThresholdPercent: 85 and imageGCLowThresholdPercent: 80. Disk usage crossed 85%, but GC could not free enough space: many of the images it considered are still referenced by running pods, and the unused layers left behind by finished jobs accumulate faster than the periodic GC passes reclaim them.
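
One way to confirm those settings on a live node, without SSH, is the kubelet's configz endpoint (this needs RBAC on the node proxy subresource); the file path in the second command is the kubeadm default and may differ on your distribution:

# Read the running kubelet's configuration through the API server.
kubectl get --raw "/api/v1/nodes/k8s-worker-06/proxy/configz" \
  | python3 -m json.tool | grep -i -E 'imageGC|evictionHard' -A 3

# Or inspect the config file directly on the node.
grep -E 'imageGC|evictionHard|nodefs' -A 2 /var/lib/kubelet/config.yaml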

Choose your action:

  A) Manually prune unused images: crictl rmi --prune
  B) Delete old container logs that are also consuming space
  C) Identify and clean up the largest unused image layers
  D) Both A and C: prune images and clean up large unused layers

[Result: crictl rmi --prune removes 12GB of unused images. Cleaning up the stale snapshot directories identified in Round 1 frees another 8GB under /var/lib/containerd. Disk usage drops to 68%. The DiskPressure condition clears and evictions stop. Proceed to Round 3.]
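
A minimal version of that cleanup, assuming crictl is available on the node:

# Remove images not referenced by any container.
crictl rmi --prune

# Confirm the image filesystem recovered.
crictl imagefsinfo

# From your workstation: DiskPressure should flip back to False.
kubectl get node k8s-worker-06 \
  -o jsonpath='{.status.conditions[?(@.type=="DiskPressure")].status}{"\n"}'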

If you chose A:

[Result: Prune removes some images but the largest consumers are stale layers not caught by prune. Partial fix.]

If you chose B:

[Result: Container logs account for only 2GB. Helps marginally but the images are the main consumer.]

If you chose C:

[Result: Correct but slower than combining with automated prune.]

Round 3: Root Cause Identification

[Pressure cue: "Disk pressure resolved. Why did GC fail to prevent this?"]

What you see: Root cause: the cluster runs many short-lived jobs that each pull a unique image, so every job leaves behind layers that are never reused. The GC high threshold of 85% is too high for this workload pattern: by the time GC triggers, the burst of unique image pulls between passes can consume the remaining headroom and push free space below the 2Gi eviction threshold before GC reclaims anything.

Choose your action:

  A) Lower the GC high threshold to 70% and the low threshold to 60%
  B) Add a CronJob to prune unused images nightly
  C) Use a shared base image for jobs to reduce unique layer proliferation
  D) All of the above

[Result: Lower GC thresholds provide more headroom. Nightly prune catches anything GC misses. Shared base images reduce the rate of new layer accumulation. Proceed to Round 4.]
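
A sketch of the kubelet change from option A. The field names are real KubeletConfiguration fields; how you roll them out (kubeadm, config management, or a managed node pool's settings) depends on your cluster:

# On each node, set these fields in the kubelet config (kubeadm default path
# shown below), then restart the kubelet:
#
#   imageGCHighThresholdPercent: 70
#   imageGCLowThresholdPercent: 60
#   evictionHard:
#     nodefs.available: "2Gi"
#
sudoedit /var/lib/kubelet/config.yaml
sudo systemctl restart kubelet

The configz check from Round 2 is an easy way to confirm each node picked up the new thresholds.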

If you chose A:

[Result: Lower thresholds help but jobs still create many unique layers.]

If you chose B:

[Result: Nightly prune helps but GC should catch most cases automatically.]

If you chose C:

[Result: Reduces the problem rate but does not fix the GC configuration.]

Round 4: Remediation

[Pressure cue: "Node healthy. Uncordon and verify."]

Actions:

  1. Uncordon the node: kubectl uncordon k8s-worker-06
  2. Verify DiskPressure is False: kubectl describe node k8s-worker-06
  3. Verify the evicted pods have been rescheduled successfully
  4. Apply the new GC thresholds across all nodes via the kubelet config
  5. Deploy the nightly image-prune job on every node; a CronJob is not node-scoped, so run it as a DaemonSet (sketch below) or as one CronJob per node
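
A sketch of step 5, assuming you run the prune as a DaemonSet whose pod talks to the containerd socket and prunes once a day. The image name and the 24-hour interval are placeholders, not something this write-up prescribes:

kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: image-prune
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: image-prune
  template:
    metadata:
      labels:
        app: image-prune
    spec:
      containers:
      - name: prune
        # Placeholder image; anything that ships crictl works.
        image: registry.example.com/tools/crictl:latest
        command: ["/bin/sh", "-c"]
        args:
          - |
            while true; do
              crictl --runtime-endpoint unix:///run/containerd/containerd.sock rmi --prune
              sleep 86400
            done
        securityContext:
          runAsUser: 0   # the containerd socket is root-owned on most nodes
        volumeMounts:
        - name: containerd-sock
          mountPath: /run/containerd/containerd.sock
      volumes:
      - name: containerd-sock
        hostPath:
          path: /run/containerd/containerd.sock
          type: Socket
EOF

With the Round 3 threshold change in place, this job is a backstop rather than the primary defense.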

Damage Report

  • Total downtime: 0 (pods rescheduled to other nodes)
  • Blast radius: 15 pods evicted from one node; brief service disruption during reschedule
  • Optimal resolution time: 15 minutes (identify disk usage -> prune images -> clear pressure)
  • If every wrong choice was made: 60+ minutes with cascading evictions across multiple nodes

Cross-References