Diagnostic Questions
Before revealing the investigation path:
- `/var` on one node is at 97% while other nodes are at 58-65%. What is your systematic approach to finding which directory is consuming the most space? What tools would you use?
- You find that `/var/lib/loki` is consuming 41GB and `/var/log/containers/` is consuming 18GB. How do you determine whether the problem is the log volume, the log retention, or both?
- The event-processor pod is writing 14GB/day of DEBUG logs. Is the correct fix to change the log level, configure log rotation, add retention to Loki, or all three? What is the order of priority?
- Loki was deployed 3 months ago without retention enabled. Why is this a DevOps tooling fix (Helm values) rather than a Linux fix (logrotate) or an observability fix (Loki API)?
- How would you prevent this from recurring? What monitoring, deployment checks, or architectural changes would catch unbounded storage growth before it causes node evictions?
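
As a starting point for the first question, the top-down narrowing step (measure each child directory, descend into the largest, repeat) might be sketched as below. This is only an illustration, not the scenario's intended answer: the helper names `dir_size` and `largest_children` are hypothetical, and the sketch sums apparent file sizes rather than allocated disk blocks, so its totals can differ from what `du` or `df` report.

```python
import os


def dir_size(path):
    """Sum the apparent size in bytes of all files under path.

    Symlinks are not followed, and unreadable entries are skipped,
    so the result is a lower bound on a partially readable tree.
    """
    total = 0
    for root, _dirs, files in os.walk(path, onerror=lambda err: None):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                # File vanished or is unreadable: ignore it.
                pass
    return total


def largest_children(root, top=5):
    """Return up to `top` (size_bytes, path) pairs for the immediate
    subdirectories of root, largest first."""
    sizes = []
    with os.scandir(root) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                sizes.append((dir_size(entry.path), entry.path))
    sizes.sort(reverse=True)
    return sizes[:top]
```

Calling `largest_children("/var")` and then repeating the call on the largest result narrows the search one level at a time. Unlike `du -x`, this sketch does not stop at filesystem boundaries, so a mount point under the root is counted as part of its parent.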