Answer Key: The Slow Death Nobody Noticed

The System

A user service handling authentication and profile management, with a centralized logging pipeline:

[user-service pods (3x)] --stdout--> [kubelet log capture]
        |                                    |
   DEBUG logging                     /var/log/pods/ (85GB and growing)
   124K lines/min                            |
        |                            [FluentBit] --> [Elasticsearch]
        |
   Serves: /api/v1/users, /api/v1/auth
   Backend: PostgreSQL + Redis cache

[Node-3 disk]
  /var/log: 94% full (94GB / 100GB)
  I/O wait: 38.1% (was 2.1% a week ago)
  Latency impact: p50 45ms -> 287ms

What's Broken

Root cause: A log level change from INFO to DEBUG on December 11 triggered a cascading failure that took 7 days to become severe:

Day 1 (Dec 11): Log level changed to DEBUG in Helm values. Pods redeployed.

Day 1-3 (Dec 11-13): Log volume jumps 10x (12,400/min to 124,800/min). Each log entry includes full SQL queries, cache operations, HTTP headers. The disk starts filling faster than rotation can handle. I/O wait begins climbing. No immediate user impact.

Day 4-5 (Dec 14-15): /var/log fills past 50%. I/O wait hits 8.4%. Application latency starts increasing noticeably (45ms to 68ms p50). Still within SLO, no alerts fire.

Day 6-7 (Dec 16-18): I/O wait hits 38.1%. Application p50 latency reaches 287ms (6.4x normal). Every disk operation on node-3 contends with the log write load. The kubelet's eviction manager starts attempting to reclaim ephemeral-storage — if it cannot free enough space, it will evict pods from the node.

Why nobody noticed:

  • The change was 7 days ago; no one connects today's latency to last week's config change
  • The latency increase is gradual (45ms to 287ms over 7 days), not a sudden spike
  • The logs themselves are valid debug output; nothing looks "wrong" in the log content
  • Disk usage alerts may be set at 90% (already passed) or may not exist for /var/log
  • CPU and memory metrics look normal (850-920m CPU, 398-421Mi memory)

Key clue: The 10x jump in log volume starting Dec 12 correlates with the logging.level change to DEBUG on Dec 11, combined with disk usage at 94% and steadily climbing I/O wait.
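The volume clue can be confirmed from a terminal in seconds. A minimal sketch, using the deployment and namespace names from the diagram; the per-pod baseline is derived from the 12,400 lines/min figure spread across 3 pods:

```shell
# Sample one minute of output from one pod of the deployment
LINES=$(kubectl logs -n backend deploy/user-service --since=1m | wc -l)
# ~12,400 lines/min across 3 pods at INFO => roughly 4,133 lines/min per pod
BASELINE=$((12400 / 3))
echo "per-pod lines/min: ${LINES} (INFO-era baseline ~${BASELINE})"
# A sustained ~10x ratio over that baseline is the smoking gun for the DEBUG change
```

Note that `kubectl logs deploy/...` samples a single pod, which is enough here since all replicas share the same log level.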

The Fix

Immediate (stop the bleeding)

  1. Revert log level to INFO:

    kubectl set env deployment/user-service -n backend LOG_LEVEL=INFO
    # Or patch the Helm values and upgrade
    
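If the Helm route is preferred, a minimal sketch (the release name and chart path are assumptions for this environment; `--reuse-values` keeps every other value intact):

```shell
# Fix the value at the source of truth so the next deploy doesn't reintroduce DEBUG.
# Release name and chart path are placeholders; substitute your own.
helm upgrade user-service ./charts/user-service \
  -n backend \
  --reuse-values \
  --set logging.level=INFO
```

The Helm route is slower than `kubectl set env` but avoids drift between the chart values and the live deployment.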

  2. Clean up accumulated logs:

    # On node-3:
    # Truncate (don't delete) active log files to preserve file handles
    for f in /var/log/pods/backend_user-service*/user-service/0.log; do
      truncate -s 0 "$f"
    done
    

  3. If disk is critically full, temporarily reduce log retention:

    # Kubelet container log max settings
    # Edit /var/lib/kubelet/config.yaml on node-3:
    #   containerLogMaxSize: "50Mi"
    #   containerLogMaxFiles: 3
    systemctl restart kubelet
    
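As a sanity check on those numbers, the retention settings bound the worst-case log footprint kubelet will keep for these pods (pod count taken from the diagram):

```shell
# Worst-case space kubelet retains for the user-service pods after the change:
# 3 pods x 3 rotated files x 50Mi per file = 450Mi, versus 85GB today
PODS=3; FILES=3; SIZE_MI=50
echo "max log footprint: $((PODS * FILES * SIZE_MI))Mi on this node"
```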

Permanent

  1. Revert the Helm values:

    logging:
      level: INFO    # Reverted from DEBUG
      format: json
      output: stdout
    

  2. Add log rotation limits to kubelet config:

    # In kubelet configuration
    containerLogMaxSize: "100Mi"
    containerLogMaxFiles: 5
    

  3. Add disk usage alerting:

    - alert: NodeDiskFillingFast
      expr: |
        predict_linear(node_filesystem_avail_bytes{mountpoint="/var/log"}[6h], 24*3600) < 0
      for: 30m
      labels:
        severity: warning
      annotations:
        summary: "/var/log on {{ $labels.instance }} will fill within 24 hours"
    
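The expression fires when a linear fit over the last 6h of free space extrapolates to zero within 24h; for example, 6GB free shrinking at 0.5GB/h predicts 6 - 12 = -6GB. Before Prometheus loads the rule, it can be validated with promtool (the file path below is illustrative):

```shell
# Validate rule file syntax and expressions before Prometheus loads it
promtool check rules /etc/prometheus/rules/disk-alerts.yml
```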

  4. Add a log volume anomaly alert:

    - alert: LogVolumeSpike
      expr: |
        rate(container_log_entries_total[5m]) > 10 * avg_over_time(rate(container_log_entries_total[5m])[7d:1h])
      for: 15m
      labels:
        severity: warning
    
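The 7-day subquery above is evaluated on every rule cycle, which can be expensive at scale. One option is to precompute the baseline as a recording rule and alert against the recorded series instead. The rule name below is illustrative, as is container_log_entries_total itself, which assumes the log pipeline exports such a counter:

```yaml
# Precompute the 7-day baseline once per rule cycle;
# the alert can then compare rate(...) against this cheap series.
- record: namespace:container_log_rate:avg7d
  expr: avg_over_time(rate(container_log_entries_total[5m])[7d:1h])
```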

  5. Add a change management note that DEBUG logging in production requires a revert date:

    logging:
      level: INFO
      # NEVER change to DEBUG without a revert plan and time-boxed duration
      # See incident INC-2024-1218 for why
    

Verification

# Verify log level is back to INFO
kubectl logs -n backend deploy/user-service --tail=5 | jq '.level'

# Check disk usage is recovering
ssh node-3 'df -h /var/log'

# Monitor I/O wait (should drop within minutes)
ssh node-3 'iostat -x 1 5'

# Check latency is recovering
curl -s http://user-service.backend:8080/metrics | grep http_request_duration_seconds

# Verify no eviction pressure
kubectl describe node node-3 | grep -A5 Conditions

Artifact Decoder

  • CLI Output
      Revealed: 85GB of logs on node-3, 94% disk full, all pods 1/1 Running
      Misleading: Pods look completely healthy (Running, no restarts, reasonable CPU/memory)
  • Metrics
      Revealed: Latency up 6.4x over 7 days; I/O wait from 2.1% to 38.1%; 10x log volume jump on Dec 12
      Misleading: CPU and memory usage look normal; the latency increase is gradual enough to miss
  • IaC Snippet
      Revealed: logging.level: DEBUG changed on Dec 11, the trigger for everything
      Misleading: The config looks like a simple, innocuous one-line change
  • Log Lines
      Revealed: Debug logs show every SQL query and cache lookup (massive volume); kubelet eviction manager is active
      Misleading: The log content is perfectly valid; nothing looks like an error or anomaly in the messages

Skills Demonstrated

  • Connecting a configuration change to symptoms that appear days later
  • Understanding the disk I/O impact of excessive logging
  • Recognizing cascading failures (config change -> log volume -> disk fill -> I/O contention -> latency)
  • Understanding kubelet ephemeral-storage eviction
  • Designing monitoring to catch slow-moving degradation (predictive disk alerts, volume anomaly detection)

Prerequisite Topic Packs