Diagnostic Questions

Before revealing the investigation path:

/var on one node is at 97% while other nodes are at 58-65%. What is your systematic approach to finding which directory is consuming the most space? What tools would you use?
You find that /var/lib/loki is consuming 41GB and /var/log/containers/ is consuming 18GB. How do you determine whether the problem is the log volume, the log retention, or both?
The event-processor pod is writing 14GB/day of DEBUG logs. Is the correct fix to change the log level, configure log rotation, add retention to Loki, or all three? What is the order of priority?
Loki was deployed 3 months ago without retention enabled. Why is this a DevOps tooling fix (Helm values) rather than a Linux fix (logrotate) or an observability fix (Loki API)?
How would you prevent this from recurring? What monitoring, deployment checks, or architectural changes would catch unbounded storage growth before it causes node evictions?
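
For the first question, one possible way to make "systematic" concrete is to rank the immediate subdirectories of a path by size and then repeat on the largest one, much as you would with `du` and `sort`. The sketch below is a minimal illustration under stated assumptions, not the investigation path itself: the function names `dir_size_bytes` and `largest_subdirs` are hypothetical, it sums apparent file sizes rather than allocated blocks, it does not stop at filesystem boundaries, and it cannot see space held by deleted-but-still-open files.

```python
import os


def dir_size_bytes(path):
    """Sum apparent file sizes under path; symlinks are not followed."""
    total = 0
    # onerror swallows unreadable directories instead of raising.
    for root, _dirs, files in os.walk(path, onerror=lambda err: None):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                pass  # file vanished or is unreadable; skip it
    return total


def largest_subdirs(parent, top=5):
    """Return (size_bytes, path) for immediate subdirectories, largest first."""
    sizes = []
    with os.scandir(parent) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                sizes.append((dir_size_bytes(entry.path), entry.path))
    return sorted(sizes, reverse=True)[:top]
```

Calling `largest_subdirs("/var")` and then calling it again on the top result narrows the search one level at a time. Because the totals are apparent sizes, they may differ from what `df` reports for the same filesystem.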