# Answer Key: The Slow Death Nobody Noticed
## The System
A user service handling authentication and profile management, with a centralized logging pipeline:
```
[user-service pods (3x)] --stdout--> [kubelet log capture]
   |                                     |
   | DEBUG logging                 /var/log/pods/ (85GB and growing)
   | 124K lines/min                      |
   |                               [FluentBit] --> [Elasticsearch]
   |
   Serves:  /api/v1/users, /api/v1/auth
   Backend: PostgreSQL + Redis cache

[Node-3 disk]
   /var/log: 94% full (94GB / 100GB)
   I/O wait: 38.1% (was 2.1% a week ago)
   Latency impact: p50 45ms -> 287ms
```
## What's Broken
Root cause: A log level change from INFO to DEBUG on December 11 triggered a cascading failure that took 7 days to become severe:
- Day 1 (Dec 11): Log level changed to DEBUG in Helm values. Pods redeployed.
- Days 1-3 (Dec 11-13): Log volume jumps 10x (12,400/min to 124,800/min). Each log entry includes full SQL queries, cache operations, and HTTP headers. The disk starts filling faster than rotation can handle. I/O wait begins climbing. No immediate user impact.
- Days 4-5 (Dec 14-15): /var/log fills past 50%. I/O wait hits 8.4%. Application latency starts increasing noticeably (45ms to 68ms p50). Still within SLO, so no alerts fire.
- Days 6-7 (Dec 16-18): I/O wait hits 38.1%. Application p50 latency reaches 287ms (6.4x normal). Every disk operation on node-3 contends with the log write load. The kubelet's eviction manager starts attempting to reclaim ephemeral-storage; if it cannot free enough space, it will evict pods from the node.
Why nobody noticed:
- The change was 7 days ago, so no one connects today's latency to last week's config change
- The latency increase is gradual (45ms to 287ms over 7 days), not a sudden spike
- The logs themselves are valid debug output; nothing looks "wrong" in the log content
- Disk usage alerts may be set at 90% (already passed) or may not exist for /var/log
- CPU and memory metrics look normal (850-920m CPU, 398-421Mi memory)
Key clue: the 10x jump in log volume on Dec 12, correlating with the logging.level: DEBUG change deployed on Dec 11, combined with disk usage at 94% and I/O wait climbing linearly.
## The Fix
### Immediate (stop the bleeding)
- Revert log level to INFO:
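A minimal sketch, assuming the service is deployed via Helm as a release named `user-service` in the `backend` namespace; the chart path is an assumption to adjust for your repo layout:

```shell
# Roll the log level back to INFO in place, keeping all other values.
# Release name and chart path are assumptions; adjust to your setup.
helm upgrade user-service ./charts/user-service \
  --namespace backend \
  --reuse-values \
  --set logging.level=INFO
```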
- Clean up accumulated logs:
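One way to reclaim space fast, assuming the kubelet's default rotation naming (rotated files look like `0.log.<timestamp>`, sometimes gzipped), which the `*.log.*` glob matches while leaving the live `*.log` files that the kubelet and FluentBit are tailing:

```shell
# Delete rotated (no-longer-tailed) container log files on node-3.
ssh node-3 'sudo find /var/log/pods -name "*.log.*" -type f -delete'

# Confirm space was reclaimed.
ssh node-3 'df -h /var/log'
```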
- If disk is critically full, temporarily reduce log retention:
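As a last resort before the eviction thresholds trip, the live log files can be truncated in place (open file handles keep working, so tailing readers survive). The glob below assumes the standard `/var/log/pods/<namespace>_<pod>_<uid>/<container>/` layout; note that FluentBit may lose any lines it had not yet shipped:

```shell
# Zero out the live user-service container logs on node-3.
ssh node-3 'sudo truncate -s 0 /var/log/pods/backend_user-service-*/user-service/*.log'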
### Permanent
- Revert the Helm values:
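Commit the revert in the chart's values so the next deploy does not silently restore DEBUG; the key name follows the IaC snippet from the incident:

```yaml
# values.yaml (user-service chart)
logging:
  level: INFO   # was: DEBUG (changed Dec 11)
```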
- Add log rotation limits to kubelet config:
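A KubeletConfiguration fragment; `containerLogMaxSize` and `containerLogMaxFiles` are the standard kubelet fields, and the exact limits here are suggested values, not from the incident:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Cap per-container log size and rotated-file count so a chatty
# container can never fill /var/log on its own.
containerLogMaxSize: "50Mi"
containerLogMaxFiles: 5
```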
- Add disk usage alerting:
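A Prometheus rule sketch, assuming node_exporter metrics are scraped; the predictive form fires when /var/log is projected to fill within 24h, which would have caught this incident days before users felt it:

```yaml
groups:
  - name: node-disk
    rules:
      - alert: VarLogFillingUp
        # Linear projection of the last 6h trend, 24h into the future.
        expr: |
          predict_linear(node_filesystem_avail_bytes{mountpoint="/var/log"}[6h], 24*3600) < 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "/var/log on {{ $labels.instance }} projected to fill within 24h"
```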
- Add a log volume anomaly alert:
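A sketch that compares current log throughput to the same window yesterday; the metric name assumes FluentBit's built-in Prometheus exporter, so swap in whichever counter your pipeline actually exposes:

```yaml
groups:
  - name: log-volume
    rules:
      - alert: LogVolumeAnomaly
        # Fires when throughput is >3x the same 10m window one day ago,
        # which would have flagged the Dec 12 jump within minutes.
        expr: |
          sum(rate(fluentbit_output_proc_records_total[10m]))
            > 3 * sum(rate(fluentbit_output_proc_records_total[10m] offset 1d))
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Log volume is >3x the same window yesterday"
```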
- Add a change management note that DEBUG logging in production requires a revert date:
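One lightweight convention (hypothetical, enforced by review rather than tooling): any production DEBUG setting must carry an owner and a revert-by date next to the value it changes:

```yaml
# values.yaml convention sketch
logging:
  level: DEBUG        # TEMP: investigating auth latency
  # revert-by: <date, no more than a few days out>
  # owner: <team or on-call handle>
```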
## Verification
```shell
# Verify log level is back to INFO
kubectl logs -n backend deploy/user-service --tail=5 | jq '.level'

# Check disk usage is recovering
ssh node-3 'df -h /var/log'

# Monitor I/O wait (should drop within minutes)
ssh node-3 'iostat -x 1 5'

# Check latency is recovering
curl -s http://user-service.backend:8080/metrics | grep http_request_duration_seconds

# Verify no eviction pressure
kubectl describe node node-3 | grep -A5 Conditions
```
## Artifact Decoder
| Artifact | What It Revealed | What Was Misleading |
|---|---|---|
| CLI Output | 85GB of logs on node-3, 94% disk full, all pods 1/1 Running | Pods look completely healthy — Running, no restarts, reasonable CPU/memory |
| Metrics | Latency 6.4x increase over 7 days; I/O wait from 2.1% to 38.1%; 10x log volume jump on Dec 12 | CPU and memory usage look normal; latency increase is gradual enough to miss |
| IaC Snippet | logging.level: DEBUG changed on Dec 11, the trigger for everything | The config looks like a simple, innocuous one-line change |
| Log Lines | Debug logs show every SQL query and cache lookup (massive volume); kubelet eviction manager is active | The log content is perfectly valid — nothing looks like an error or anomaly in the messages |
## Skills Demonstrated
- Connecting a configuration change to symptoms that appear days later
- Understanding the disk I/O impact of excessive logging
- Recognizing cascading failures (config change -> log volume -> disk fill -> I/O contention -> latency)
- Understanding kubelet ephemeral-storage eviction
- Designing monitoring to catch slow-moving degradation (predictive disk alerts, volume anomaly detection)
## Prerequisite Topic Packs
- Logging
- Log Pipelines
- Linux Performance
- Disk and Storage Ops
- Monitoring Fundamentals
- Change Management