Progressive Hints

Hint 1 (after 5 min)

Look at the log volume numbers. On Dec 11, the rate was 12,400 lines/min. On Dec 12, it jumped to 124,800 lines/min, a 10x increase. The Helm values show logging.level: DEBUG, annotated "Changed from INFO on Dec 11." Debug logging produces roughly 10x more log entries than INFO.
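If you want to verify the jump yourself, the arithmetic is a one-liner. A minimal sketch, using only the two rates given above (the ~10x DEBUG-vs-INFO multiplier is the claim being checked):

```python
# Sanity-check the log-volume jump from Hint 1.
rate_info = 12_400    # lines/min on Dec 11 (INFO)
rate_debug = 124_800  # lines/min on Dec 12 (DEBUG)

multiplier = rate_debug / rate_info
print(f"DEBUG produces {multiplier:.1f}x the volume of INFO")  # ~10.1x
```

The ratio lands almost exactly on the rule of thumb that DEBUG is an order of magnitude noisier than INFO.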

Hint 2 (after 10 min)

Follow the cascade: DEBUG logging at 124K lines/min writes to stdout, which the kubelet captures to /var/log/pods/. Three pods have accumulated 85GB of logs on node-3 (28 + 26 + 31 GB), and the /var/log partition is 94% full. Disk I/O wait has climbed from 2.1% to 38.1% over the week because the disk is constantly writing logs, and that rising I/O wait is driving application latency up, from 45ms to 287ms at p50.
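The disk-pressure math is worth working out. A short sketch of it; the per-pod log sizes and 94% usage are from the scenario, while the 100GB partition size is an assumption used only to illustrate, and the 7-day window comes from when the change landed (Dec 11):

```python
# Back-of-envelope: how fast is /var/log filling, and how long until it's full?
pod_logs_gb = [28, 26, 31]           # accumulated logs on node-3, per pod
total_gb = sum(pod_logs_gb)          # 85 GB
days = 7
gb_per_day = total_gb / days         # ~12.1 GB/day at the DEBUG rate

partition_gb = 100                   # ASSUMED partition size, for illustration
free_gb = partition_gb * (1 - 0.94)  # 94% full -> ~6 GB free
hours_to_full = free_gb / gb_per_day * 24
print(f"{total_gb} GB over {days} days = {gb_per_day:.1f} GB/day; "
      f"~{hours_to_full:.0f} h until /var/log is full")
```

Under that assumed partition size, the node has roughly half a day of headroom left, which is why this is urgent rather than a cleanup task for next sprint.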

Hint 3 (after 15 min)

This is a user service that handles authentication and profile lookups. On Dec 11, someone changed the log level to DEBUG, probably to investigate a bug, and forgot to revert it. DEBUG mode logs every SQL query, every cache lookup, and every HTTP header: roughly 10x the volume of INFO.

The logs go to stdout, the kubelet writes them to disk, and FluentBit ships them to Elasticsearch. But the disk write rate exceeds what the node can sustain. Over 7 days, 85GB accumulated on node-3's log partition, which is now 94% full. The I/O contention slows down everything on that node: not just logging, but database queries, cache lookups, and application request handling. The kubelet is now attempting ephemeral-storage eviction, which means it may start evicting pods.

The logs themselves are perfectly valid debug output. The content is fine; it is the volume that is killing the system.
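The eviction behavior mentioned above follows from the kubelet's defaults. A minimal sketch of the comparison the kubelet is effectively making; the nodefs.available<10% figure is the kubelet's default hard eviction threshold, and the 94% usage is from the scenario (the check assumes /var/log sits on the filesystem the kubelet tracks as nodefs):

```python
# Why the kubelet on node-3 starts ephemeral-storage eviction.
NODEFS_AVAILABLE_THRESHOLD = 0.10   # kubelet default: nodefs.available < 10%

usage = 0.94                        # /var/log partition on node-3
available = 1 - usage               # ~6% free

if available < NODEFS_AVAILABLE_THRESHOLD:
    print(f"{available:.0%} available < {NODEFS_AVAILABLE_THRESHOLD:.0%}: "
          "kubelet begins ephemeral-storage eviction")
```

Crossing that threshold puts the node under disk pressure, so even well-behaved pods on node-3 become eviction candidates until the partition is cleaned up or the log level is reverted.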