Remediation: Disk Full Alert, Cause Is Runaway Logs, Fix Is Loki Retention

Immediate Fix (DevOps Tooling — Domain C)

The fix is in the Helm values for Loki and the event-processor deployment configuration.

Step 1: Enable Loki retention via Helm values

$ cat >> devops/helm/values-monitoring.yaml <<'EOF'
loki:
  config:
    compactor:
      retention_enabled: true
      retention_delete_delay: 2h
      delete_request_cancel_period: 24h
    limits_config:
      retention_period: 720h  # 30 days
EOF
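One caveat before the upgrade below: `cat >>` appends blindly, so if values-monitoring.yaml already carried a top-level `loki:` key, the file now holds duplicates and most YAML parsers silently keep only the last copy. A quick pre-upgrade sanity check (a sketch; the `VALUES_FILE` default just mirrors the path above):

```shell
#!/bin/sh
# Flag duplicate top-level keys in a Helm values file before upgrading.
VALUES_FILE="${VALUES_FILE:-devops/helm/values-monitoring.yaml}"

if [ -f "$VALUES_FILE" ]; then
  # A top-level key is a line whose key starts at column 0
  awk '/^[A-Za-z0-9_-]+:/ { count[$1]++ }
       END { for (k in count) if (count[k] > 1) print "duplicate top-level key: " k }' \
      "$VALUES_FILE"
fi
```

Any output means the file needs a manual merge before `helm upgrade`.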

$ helm upgrade loki grafana/loki \
    -f devops/helm/values-monitoring.yaml \
    -n monitoring \
    --wait
Release "loki" has been upgraded.

Step 2: Manually clean up stale chunks

# Stop Loki before touching its chunk store: scale the StatefulSet down
$ kubectl scale statefulset loki -n monitoring --replicas=0

# Then SSH to the node to clean up the chunk directory
$ ssh worker-node-03.prod.internal
$ sudo systemctl stop loki 2>/dev/null || true  # only if Loki also runs as a systemd unit

# Remove chunks older than 30 days
$ sudo find /var/lib/loki/chunks/ -type f -mtime +30 -delete
# Removed approximately 28GB of stale chunk data

$ kubectl scale statefulset loki -n monitoring --replicas=1
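Next time, a dry run before the destructive `find ... -delete` shows what the cleanup would free. A minimal sketch (`CHUNK_DIR` and `AGE_DAYS` are illustrative defaults; GNU find is assumed):

```shell
#!/bin/sh
# Preview how much stale chunk data a retention cleanup would free.
CHUNK_DIR="${CHUNK_DIR:-/var/lib/loki/chunks}"
AGE_DAYS="${AGE_DAYS:-30}"

if [ -d "$CHUNK_DIR" ]; then
  # -printf '%s' instead of -delete: report sizes, touch nothing
  find "$CHUNK_DIR" -type f -mtime +"$AGE_DAYS" -printf '%s\n' |
    awk -v days="$AGE_DAYS" '{ total += $1 }
      END { printf "%.1f GB in %d file(s) older than %d days\n",
            total / 1024 / 1024 / 1024, NR, days }'
fi
```

Swap `-printf '%s\n' | awk ...` back to `-delete` only once the reported total matches expectations.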

Step 3: Fix the event-processor log level

# Update the Helm values for event-processor
$ grep -n "LOG_LEVEL" devops/helm/values-prod.yaml
83:      LOG_LEVEL: DEBUG

# Change to INFO
$ sed -i 's/LOG_LEVEL: DEBUG/LOG_LEVEL: INFO/' devops/helm/values-prod.yaml
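`sed -i` exits 0 even when the pattern never matched, so it is worth asserting the edit actually landed before upgrading. A sketch (the `VALUES_FILE` default mirrors the path above):

```shell
#!/bin/sh
# Fail loudly if the values file still carries a DEBUG log level after the edit.
VALUES_FILE="${VALUES_FILE:-devops/helm/values-prod.yaml}"

if [ -f "$VALUES_FILE" ] && grep -q 'LOG_LEVEL: DEBUG' "$VALUES_FILE"; then
  echo "LOG_LEVEL is still DEBUG in $VALUES_FILE; substitution did not apply" >&2
  exit 1
fi
```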

$ helm upgrade event-processor devops/helm/grokdevops \
    -f devops/helm/values-prod.yaml \
    -n prod --wait

Step 4: Clean up container log files

$ ssh worker-node-03.prod.internal
$ sudo truncate -s 0 /var/log/containers/event-processor-*.log
# Truncate rather than delete: the container runtime holds the file open
# and keeps appending from this point
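Truncation only buys time; a longer-term guard is to cap per-container log growth at the kubelet. A hedged sketch of the relevant KubeletConfiguration fields (the file path varies by distribution, and the values here are illustrative, not tuned):

```yaml
# e.g. /var/lib/kubelet/config.yaml (path varies by distribution)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Rotate each container's log at 50Mi and keep at most 3 files,
# bounding a runaway DEBUG logger to roughly 150Mi per container on disk
containerLogMaxSize: 50Mi
containerLogMaxFiles: 3
```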

Verification

Domain A (Linux Ops) — Disk usage decreasing

$ ssh worker-node-03.prod.internal
$ df -h /var
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3       200G  121G   73G  62% /var

$ kubectl describe node worker-node-03 | grep DiskPressure
  DiskPressure    False   KubeletHasNoDiskPressure   kubelet has no disk pressure

Domain B (Observability) — Loki retention active

$ kubectl get configmap loki-config -n monitoring -o yaml | grep retention_enabled
      retention_enabled: true

$ kubectl logs -n monitoring loki-0 --tail=5 | grep compactor
level=info ts=2026-03-19T07:15:00.123Z caller=compactor.go:142 msg="compactor started" retention_enabled=true

Domain C (DevOps Tooling) — Helm values correct

$ helm get values loki -n monitoring | grep retention_period
      retention_period: 720h

$ helm get values event-processor -n prod | grep LOG_LEVEL
      LOG_LEVEL: INFO

Prevention

  • Monitoring: Add predictive disk alerts. Fire WARNING when the last 6 hours' growth trend projects that a node's /var filesystem will fill within 24 hours:
- alert: DiskUsageGrowthRate
  expr: |
    predict_linear(node_filesystem_avail_bytes{mountpoint="/var"}[6h], 24*3600) < 0
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "Node {{ $labels.instance }} /var will fill within 24 hours at current rate"
  • Runbook: All log aggregation deployments must have retention configured from day one. Default retention: 30 days for prod, 7 days for dev/staging.

  • Architecture: Move Loki chunk storage to a dedicated volume or object storage (S3/GCS) to decouple log storage from node disk. Add per-tenant log rate limits in Loki to prevent a single chatty service from dominating storage.
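The per-tenant rate limits mentioned above can be sketched in the same `limits_config` block used for retention (the numbers are illustrative placeholders, not tuned values):

```yaml
loki:
  config:
    limits_config:
      # Cap ingest per tenant so one chatty service cannot dominate storage
      ingestion_rate_mb: 8            # sustained ingest per tenant, MB/s
      ingestion_burst_size_mb: 16     # short burst allowance
      per_stream_rate_limit: 3MB      # cap on any single log stream
      per_stream_rate_limit_burst: 15MB
```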