Remediation: Disk Full Alert, Cause Is Runaway Logs, Fix Is Loki Retention¶
Immediate Fix (DevOps Tooling — Domain C)¶
The fix touches two places: the Loki Helm values (enable retention) and the event-processor deployment configuration (drop the log level from DEBUG to INFO).
Step 1: Enable Loki retention via Helm values¶
$ cat >> devops/helm/values-monitoring.yaml <<'EOF'
loki:
  config:
    compactor:
      retention_enabled: true
      retention_delete_delay: 2h
      delete_request_cancel_period: 24h
    limits_config:
      retention_period: 720h  # 30 days
EOF
$ helm upgrade loki grafana/loki \
    -f devops/helm/values-monitoring.yaml \
    -n monitoring \
    --wait
Release "loki" has been upgraded.
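Before moving on, it can be worth confirming the computed values Helm actually applied, rather than trusting the file edit alone. A sketch (`--all` includes chart defaults merged with the override file):

```shell
# Show the release's computed values and confirm the compactor/retention
# settings survived the merge with chart defaults.
helm get values loki -n monitoring --all | grep -B1 -A4 'compactor:'
```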
Step 2: Manually clean up stale chunks¶
# SSH to the node and remove old chunks while Loki is stopped
$ ssh worker-node-03.prod.internal
$ sudo systemctl stop loki || true  # if Loki runs as a systemd unit on this node
# Or scale the Loki StatefulSet down instead (run from your workstation)
$ kubectl scale statefulset loki -n monitoring --replicas=0
# Remove chunks older than 30 days
$ sudo find /var/lib/loki/chunks/ -type f -mtime +30 -delete
# Removed approximately 28GB of stale chunk data
$ kubectl scale statefulset loki -n monitoring --replicas=1
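A destructive `find ... -delete` is safer with a dry run first. A sketch, where `CHUNK_DIR` stands in for the chunk path used above (GNU `find`/`du` assumed):

```shell
# Dry run: sum the size of what -delete would remove before actually deleting.
CHUNK_DIR="${CHUNK_DIR:-/var/lib/loki/chunks}"
if [ -d "$CHUNK_DIR" ]; then
  find "$CHUNK_DIR" -type f -mtime +30 -print0 \
    | du -ch --files0-from=- \
    | tail -n 1   # last line is the grand total
fi
```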
Step 3: Fix the event-processor log level¶
# Update the Helm values for event-processor
$ grep -n "LOG_LEVEL" devops/helm/values-prod.yaml
83: LOG_LEVEL: DEBUG
# Change to INFO
$ sed -i 's/LOG_LEVEL: DEBUG/LOG_LEVEL: INFO/' devops/helm/values-prod.yaml
$ helm upgrade event-processor devops/helm/grokdevops \
    -f devops/helm/values-prod.yaml \
    -n prod --wait
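After the upgrade, the rendered pod spec can be checked directly instead of trusting the values file. A sketch (container index 0 and a plain env var named `LOG_LEVEL` are assumptions about the chart):

```shell
# Print the LOG_LEVEL env var from the live Deployment spec.
kubectl -n prod get deploy event-processor \
  -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="LOG_LEVEL")].value}'
```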
Step 4: Clean up container log files¶
$ ssh worker-node-03.prod.internal
$ sudo truncate -s 0 /var/log/containers/event-processor-*.log
# Kubernetes will continue appending to the file from this point
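Truncation is used here rather than `rm` because the container runtime still holds the file open: unlinking it would leave the space allocated until the writer exits. Open-but-deleted files can be spotted with `lsof` (a sketch; `lsof` must be installed on the node):

```shell
# +L1 lists files with link count < 1 (unlinked but still open); the space
# they occupy is not reclaimed until the owning process closes the descriptor.
sudo lsof +L1 2>/dev/null | grep '/var/log' || true
```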
Verification¶
Domain A (Linux Ops) — Disk usage decreasing¶
$ ssh worker-node-03.prod.internal
$ df -h /var
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3       200G  121G   73G  62% /var
$ kubectl describe node worker-node-03 | grep -A6 "Conditions:" | grep DiskPressure
  DiskPressure   False   KubeletHasNoDiskPressure   kubelet has no disk pressure
Domain B (Observability) — Loki retention active¶
$ kubectl get configmap loki-config -n monitoring -o yaml | grep retention_enabled
retention_enabled: true
$ kubectl logs -n monitoring loki-0 --tail=5 | grep compactor
level=info ts=2026-03-19T07:15:00.123Z caller=compactor.go:142 msg="compactor started" retention_enabled=true
Domain C (DevOps Tooling) — Helm values correct¶
$ helm get values loki -n monitoring | grep retention_period
retention_period: 720h
$ helm get values event-processor -n prod | grep LOG_LEVEL
LOG_LEVEL: INFO
Prevention¶
- Monitoring: Add disk growth rate alerts. Fire a WARNING when a node's /var filesystem is predicted to fill within 24 hours at its current growth rate.
  - alert: DiskUsageGrowthRate
    expr: |
      predict_linear(node_filesystem_avail_bytes{mountpoint="/var"}[6h], 24*3600) < 0
    for: 30m
    labels:
      severity: warning
    annotations:
      summary: "Node {{ $labels.instance }} /var will fill within 24 hours at current rate"
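The alert above is a fragment; before loading it, it can be wrapped in a complete rule group and validated offline with `promtool` (ships with Prometheus). A sketch — the file path and group name are illustrative:

```shell
# Write the alert into a full rule-group file and check its syntax.
cat > /tmp/disk-growth-rules.yaml <<'EOF'
groups:
  - name: node-disk
    rules:
      - alert: DiskUsageGrowthRate
        expr: predict_linear(node_filesystem_avail_bytes{mountpoint="/var"}[6h], 24*3600) < 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Node {{ $labels.instance }} /var will fill within 24 hours at current rate"
EOF
# Validate only if promtool is available on this machine.
command -v promtool >/dev/null && promtool check rules /tmp/disk-growth-rules.yaml \
  || echo "promtool not found; skipping validation"
```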
- Runbook: All log aggregation deployments must have retention configured from day one. Default retention: 30 days for prod, 7 days for dev/staging.
- Architecture: Move Loki chunk storage to a dedicated volume or object storage (S3/GCS) to decouple log storage from node disk. Add per-tenant log rate limits in Loki to prevent a single chatty service from dominating storage.
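The per-tenant rate limits mentioned above live in the same `limits_config` block that Step 1 touched. A sketch of the fields involved (names from Loki's `limits_config`; the numbers are illustrative, not tuned recommendations):

```shell
# Illustrative snippet showing where rate limits would go. Merge these keys
# into Step 1's existing limits_config block — appending a second
# limits_config key is a duplicate YAML mapping key and will be rejected or
# silently overridden by most parsers.
cat > /tmp/loki-limits-snippet.yaml <<'EOF'
loki:
  config:
    limits_config:
      retention_period: 720h             # keep the retention from Step 1
      ingestion_rate_mb: 10              # per-tenant ingest cap (MB/s)
      ingestion_burst_size_mb: 20
      per_stream_rate_limit: 3MB         # caps a single chatty stream
      per_stream_rate_limit_burst: 15MB
EOF
```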