Symptoms: Disk Full Alert, Cause Is Runaway Logs, Fix Is Loki Retention
Domains: linux_ops | observability | devops_tooling
Level: L2
Estimated time: 30-45 min
Initial Alert
Nagios fires at 06:41 UTC:
CRITICAL: worker-node-03 — Disk usage /var at 97%
Host: worker-node-03.prod.internal (10.0.4.23)
Service: disk_/var
Status: CRITICAL — threshold 95%
Additional alerts within 5 minutes:
CRITICAL: worker-node-03 — Disk usage /var at 98%
CRITICAL: kubelet on worker-node-03 — DiskPressure condition True
WARNING: 3 pods evicted from worker-node-03 due to DiskPressure
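Before diving in, it is worth estimating how little runway this growth rate leaves. A back-of-envelope sketch, using the numbers from the alert and symptoms (200GB partition, 190GB used, ~70GB consumed per day) and assuming growth stays roughly linear:

```shell
# Rough time-to-full estimate for /var.
# Assumption: growth continues at the observed ~70 GB/day.
total_gb=200
used_gb=190
growth_gb_per_day=70

free_gb=$((total_gb - used_gb))
# hours until full = free / daily growth * 24 (integer arithmetic)
hours_left=$((free_gb * 24 / growth_gb_per_day))
echo "~${hours_left}h until /var is completely full"
```

At ~70GB/day against 10GB of headroom, the partition fills in roughly three hours, which is why this page is an incident response rather than a backlog ticket.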
Observable Symptoms
- /var partition on worker-node-03 is at 97% usage (190GB of 200GB).
- Kubelet has tainted the node with node.kubernetes.io/disk-pressure:NoSchedule.
- 3 pods have been evicted and rescheduled to other nodes.
- df -h shows /var is the only partition under pressure; / and /home are fine.
- The node was at 62% disk usage 24 hours ago; it consumed 70GB in one day.
- Other worker nodes (01, 02, 04, 05) are all between 58% and 65% disk usage.
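The symptoms above can be confirmed from a shell before forming any theory. A minimal verification sketch (the node name comes from the alert; the kubectl steps assume you have cluster access and are skipped if kubectl is not on PATH):

```shell
node=worker-node-03   # from the Nagios alert; adjust for your cluster

# 1. Confirm the partition numbers the alert reported.
df -h /var
usage_pct=$(df -P /var | awk 'NR==2 { sub("%", "", $5); print $5 }')
echo "/var usage: ${usage_pct}%"

# 2. Confirm the taint and the evictions via the Kubernetes API.
if command -v kubectl >/dev/null 2>&1; then
  kubectl describe node "$node" | grep -i taints        # expect disk-pressure:NoSchedule
  kubectl get events -A --field-selector reason=Evicted | grep "$node"
fi
```

Checking both df and the kubelet taint confirms the alert and the DiskPressure condition describe the same underlying problem, not two separate ones.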
The Misleading Signal

A single node filling /var at an alarming rate looks like a classic Linux disk space problem: a runaway process writing tmp files, a core dump, or a log file that is not being rotated. The engineer's instinct is to SSH into the node and start hunting for large files with du, check logrotate, and look for rogue processes. The fact that only one node is affected reinforces the idea that something node-specific is broken.
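That instinctive node-local hunt looks something like the sketch below. It is demonstrated against a scratch directory so it runs anywhere; on the real node you would point du at /var itself (for example `du -x --max-depth=2 /var | sort -rn | head`):

```shell
# Sketch of the classic "find what ate the disk" workflow, exercised on a
# scratch directory instead of /var so it is safe to run anywhere.
scratch=$(mktemp -d)
mkdir -p "$scratch/log/pods" "$scratch/lib"
# Stand-ins: a 1 MiB "runaway log" and a tiny unrelated file.
dd if=/dev/zero of="$scratch/log/pods/big.log" bs=1024 count=1024 2>/dev/null
dd if=/dev/zero of="$scratch/lib/small.bin" bs=1024 count=1 2>/dev/null

# List files by size, largest first, and keep the top offender.
biggest=$(find "$scratch" -type f -exec du -k {} + | sort -rn | head -n 1 | awk '{print $2}')
echo "largest file: $biggest"

rm -rf "$scratch"
```

On the real node this hunt does find the culprit (log files under /var), which is exactly what makes the signal misleading: the files are a symptom of what is being logged, not of broken rotation on that one host.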
A single node filling /var at an alarming rate looks like a classic Linux disk space problem — maybe a runaway process writing tmp files, a core dump, or a log file that is not being rotated. The engineer's instinct is to SSH into the node and start hunting for large files with du, check logrotate, and look for rogue processes. The fact that it is only one node reinforces the idea that something node-specific is broken.