
Investigation: Disk Full Alert, Cause Is Runaway Logs, Fix Is Loki Retention

Phase 1: Linux Ops Investigation (Dead End)

SSH into the node and hunt for the disk hog:

$ ssh worker-node-03.prod.internal
$ df -h /var
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3       200G  190G  7.8G  97% /var

$ du -sh /var/* | sort -rh | head -10
148G    /var/lib
22G     /var/log
14G     /var/cache
3.2G    /var/tmp
1.1G    /var/spool

/var/lib is 148GB. Drill down:

$ du -sh /var/lib/* | sort -rh | head -5
89G     /var/lib/docker
41G     /var/lib/loki
12G     /var/lib/kubelet
4.2G    /var/lib/containerd
1.8G    /var/lib/apt

/var/lib/docker at 89GB is expected for a busy node. But /var/lib/loki at 41GB is suspicious — this is a Loki ingester that stores log chunks locally before shipping to object storage.

$ ls -lh /var/lib/loki/chunks/
total 39G
drwxr-xr-x 2 loki loki 4.0K Mar 19 06:00 fake/
-rw-r--r-- 1 loki loki  38G Mar 19 06:40 data

$ du -shL /var/log/containers/
18G     /var/log/containers/

Container logs are 18GB. Check which containers are chatty:

$ du -shL /var/log/containers/*.log | sort -rh | head -5
14G     /var/log/containers/event-processor-5c8d9e7f2-q8n4j_prod_event-processor-abc123.log
1.8G    /var/log/containers/loki-0_monitoring_loki-def456.log
1.1G    /var/log/containers/payment-service-7f8b9c6d4-xk2nm_prod_payment-service-789xyz.log

The event-processor pod is writing 14GB of logs per day. This is the immediate disk consumer. But why only on this node?
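A quick back-of-the-envelope check shows how little headroom is left, using the 7.8G available from df and the ~14GB/day log rate observed above:

```shell
# Rough headroom estimate: 7.8G available on /var, event-processor
# writing ~14G of logs per day. Days until the disk is completely full:
avail_gb=7.8
daily_gb=14
awk -v a="$avail_gb" -v d="$daily_gb" 'BEGIN { printf "%.2f days\n", a / d }'
# -> 0.56 days
```

Less than a day of runway, which explains why this fired as a page rather than a ticket.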

The Pivot

Check if the event-processor is scheduled only on this node:

$ kubectl get pods -n prod -l app=event-processor -o wide
NAME                               READY   STATUS    RESTARTS   NODE
event-processor-5c8d9e7f2-q8n4j   1/1     Running   0          worker-node-03

Only one replica, pinned to worker-node-03 by a nodeSelector. But the real question is: why is Loki taking 41GB locally? Check Loki's retention configuration:

$ kubectl get configmap loki-config -n monitoring -o yaml | grep -B2 "retention"
    compactor:
      working_directory: /var/lib/loki/compactor
      retention_enabled: false

retention_enabled: false. Loki is never cleaning up old chunks.

Phase 2: Observability Investigation (Root Cause)

Loki was deployed 3 months ago with retention disabled. The event-processor generates ~14GB/day of logs. Loki's ingester on this node stores chunks locally in /var/lib/loki/ and never expires them, so the ingester has accumulated 41GB over roughly 90 days. Combined with the 18GB of container log files on disk, that steady growth finally pushed /var past the 97% alert threshold.
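The growth rates are worth separating: Loki compresses chunks before writing them, so the ingester's footprint grows far more slowly than the raw container logs. Assuming ~90 days since deployment:

```shell
# 41G of chunks accumulated over ~90 days of retention-less operation
awk 'BEGIN { printf "%.2f GB/day of compressed chunks\n", 41 / 90 }'
# -> 0.46 GB/day of compressed chunks
```

A slow leak, not a burst, which is why it took three months to surface as an alert.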

# Check when Loki was deployed and what retention was set
$ helm history loki -n monitoring
REVISION    UPDATED                     STATUS      CHART           APP VERSION
1           2025-12-20 10:00:00         deployed    loki-5.41.0     2.9.3

$ helm get values loki -n monitoring | grep -A5 "retention"
# (no retention section — using defaults, which means disabled)

The Loki Helm chart was deployed without configuring retention. The default for retention_enabled is false. Over 3 months, the ingester has accumulated 41GB of chunk data that nothing ever expires. The event-processor's verbose logging pushed the node past the threshold.
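The fix is to enable retention in the chart values. A minimal sketch of the values change; the key nesting here assumes the grafana/loki 5.x chart layout, so verify it against the chart's values.yaml before applying:

```yaml
# Sketch only: turns on the compactor's retention loop and sets a
# 30-day retention period. Key paths assume the loki-5.41.0 chart
# nests Loki config under "loki:" -- verify before applying.
loki:
  compactor:
    working_directory: /var/lib/loki/compactor
    retention_enabled: true
    retention_delete_delay: 2h
  limits_config:
    retention_period: 720h   # 30 days
```

Applied with something like `helm upgrade loki grafana/loki -n monitoring --reuse-values -f retention-values.yaml` (values filename hypothetical).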

# Confirm the event-processor's log volume
$ kubectl logs event-processor-5c8d9e7f2-q8n4j -n prod --tail=5
2026-03-19T06:40:58.122Z DEBUG  Processing event batch 849271: 847 events, 12 failed validation
2026-03-19T06:40:58.123Z DEBUG  Event 849271-001: user_id=u-38291, action=page_view, payload_size=2847
2026-03-19T06:40:58.123Z DEBUG  Event 849271-002: user_id=u-10382, action=purchase, payload_size=4102
2026-03-19T06:40:58.124Z DEBUG  Event 849271-003: user_id=u-55910, action=page_view, payload_size=1923
2026-03-19T06:40:58.124Z DEBUG  Batch 849271 complete: 12ms, 835 accepted, 12 rejected

DEBUG-level logging with full event payloads. This pod is logging every single event it processes, at DEBUG level, in production.
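The companion fix is dropping the event-processor to INFO in production. A sketch of the deployment change, assuming the app reads its level from an env var (the LOG_LEVEL name is hypothetical; check the deployment spec for the real knob):

```yaml
# Hypothetical: assumes event-processor honors a LOG_LEVEL env var
spec:
  template:
    spec:
      containers:
        - name: event-processor
          env:
            - name: LOG_LEVEL
              value: "info"
```

This cuts the ~14GB/day of raw logs at the source, independent of the Loki retention fix.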

Domain Bridge: Why This Crossed Domains

Key insight: The symptom was a disk-full alert on a Linux node (linux_ops), the root cause was Loki accumulating chunk data with retention disabled (observability), and the fix requires updating the Loki Helm chart's retention configuration (devops_tooling). This crossing is common: observability stacks consume infrastructure resources (disk, CPU, network), and without proper retention policies, log aggregation systems become the largest disk consumers on the very nodes they run on.

Root Cause

Two compounding issues: (1) Loki was deployed without retention enabled, causing local chunk storage to grow unbounded. (2) The event-processor pod runs at DEBUG log level in production, generating ~14GB/day. The Loki ingester on worker-node-03 has been accumulating log chunks for 3 months without cleanup, and the verbose logging accelerated the disk fill.