Answer Key: The Slow Death Nobody Noticed

The System

A user service handling authentication and profile management, with a centralized logging pipeline:

[user-service pods (3x)] --stdout--> [kubelet log capture]
        |                                    |
   DEBUG logging                     /var/log/pods/ (85GB and growing)
   124K lines/min                            |
        |                            [FluentBit] --> [Elasticsearch]
        |
   Serves: /api/v1/users, /api/v1/auth
   Backend: PostgreSQL + Redis cache

[Node-3 disk]
  /var/log: 94% full (94GB / 100GB)
  I/O wait: 38.1% (was 2.1% a week ago)
  Latency impact: p50 45ms -> 287ms

What's Broken

Root cause: A log level change from INFO to DEBUG on December 11 triggered a cascading failure that took 7 days to become severe:

Day 1 (Dec 11): Log level changed to DEBUG in Helm values. Pods redeployed.

Day 1-3 (Dec 11-13): Log volume jumps 10x (12,400/min to 124,800/min). Each log entry includes full SQL queries, cache operations, HTTP headers. The disk starts filling faster than rotation can handle. I/O wait begins climbing. No immediate user impact.

Day 4-5 (Dec 14-15): /var/log fills past 50%. I/O wait hits 8.4%. Application latency starts increasing noticeably (45ms to 68ms p50). Still within SLO, no alerts fire.

Day 6-7 (Dec 16-18): I/O wait hits 38.1%. Application p50 latency reaches 287ms (6.4x normal). Every disk operation on node-3 contends with the log write load. The kubelet's eviction manager starts attempting to reclaim ephemeral-storage — if it cannot free enough space, it will evict pods from the node.

Why nobody noticed:

  • The change was 7 days ago; no one connects today's latency to last week's config change
  • The latency increase is gradual (45ms to 287ms over 7 days), not a sudden spike
  • The logs themselves are valid debug output; nothing looks "wrong" in the log content
  • Disk usage alerts may be set at 90% (already passed) or may not exist for /var/log
  • CPU and memory metrics look normal (850-920m CPU, 398-421Mi memory)

Key clue: The 10x jump in log volume starting Dec 12 correlates with the logging.level change to DEBUG on Dec 11, combined with disk usage at 94% and steadily climbing I/O wait.
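The volume clue can be confirmed from a terminal in seconds. A minimal sketch, using the deployment and namespace names from the diagram; the per-pod baseline is derived from the 12,400 lines/min figure spread across 3 pods:

```shell
# Sample one minute of output from one pod of the deployment
LINES=$(kubectl logs -n backend deploy/user-service --since=1m | wc -l)
# ~12,400 lines/min across 3 pods at INFO => roughly 4,133 lines/min per pod
BASELINE=$((12400 / 3))
echo "per-pod lines/min: ${LINES} (INFO-era baseline ~${BASELINE})"
# A sustained ~10x ratio over that baseline is the smoking gun for the DEBUG change
```

Note that `kubectl logs deploy/...` samples a single pod, which is enough here since all replicas share the same log level.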

The Fix

Immediate (stop the bleeding)

  1. Revert log level to INFO:

    kubectl set env deployment/user-service -n backend LOG_LEVEL=INFO
    # Or patch the Helm values and upgrade
    
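If the Helm route is preferred, a minimal sketch (the release name and chart path are assumptions for this environment; `--reuse-values` keeps every other value intact):

```shell
# Fix the value at the source of truth so the next deploy doesn't reintroduce DEBUG.
# Release name and chart path are placeholders; substitute your own.
helm upgrade user-service ./charts/user-service \
  -n backend \
  --reuse-values \
  --set logging.level=INFO
```

The Helm route is slower than `kubectl set env` but avoids drift between the chart values and the live deployment.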

  2. Clean up accumulated logs:

    # On node-3:
    # Truncate (don't delete) active log files to preserve file handles
    for f in /var/log/pods/backend_user-service*/user-service/0.log; do
      truncate -s 0 "$f"
    done
    

  3. If disk is critically full, temporarily reduce log retention:

    # Kubelet container log max settings
    # Edit /var/lib/kubelet/config.yaml on node-3:
    #   containerLogMaxSize: "50Mi"
    #   containerLogMaxFiles: 3
    systemctl restart kubelet
    
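As a sanity check on those numbers, the retention settings bound the worst-case log footprint kubelet will keep for these pods (pod count taken from the diagram):

```shell
# Worst-case space kubelet retains for the user-service pods after the change:
# 3 pods x 3 rotated files x 50Mi per file = 450Mi, versus 85GB today
PODS=3; FILES=3; SIZE_MI=50
echo "max log footprint: $((PODS * FILES * SIZE_MI))Mi on this node"
```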

Permanent

  1. Revert the Helm values:

    logging:
      level: INFO    # Reverted from DEBUG
      format: json
      output: stdout
    

  2. Add log rotation limits to kubelet config:

    # In kubelet configuration
    containerLogMaxSize: "100Mi"
    containerLogMaxFiles: 5
    

  3. Add disk usage alerting:

    - alert: NodeDiskFillingFast
      expr: |
        predict_linear(node_filesystem_avail_bytes{mountpoint="/var/log"}[6h], 24*3600) < 0
      for: 30m
      labels:
        severity: warning
      annotations:
        summary: "/var/log on {{ $labels.instance }} will fill within 24 hours"
    
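The expression fires when a linear fit over the last 6h of free space extrapolates to zero within 24h; for example, 6GB free shrinking at 0.5GB/h predicts 6 - 12 = -6GB. Before Prometheus loads the rule, it can be validated with promtool (the file path below is illustrative):

```shell
# Validate rule file syntax and expressions before Prometheus loads it
promtool check rules /etc/prometheus/rules/disk-alerts.yml
```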

  4. Add a log volume anomaly alert:

    - alert: LogVolumeSpike
      expr: |
        rate(container_log_entries_total[5m]) > 10 * avg_over_time(rate(container_log_entries_total[5m])[7d:1h])
      for: 15m
      labels:
        severity: warning
    
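The 7-day subquery above is evaluated on every rule cycle, which can be expensive at scale. One option is to precompute the baseline as a recording rule and alert against the recorded series instead. The rule name below is illustrative, as is container_log_entries_total itself, which assumes the log pipeline exports such a counter:

```yaml
# Precompute the 7-day baseline once per rule cycle;
# the alert can then compare rate(...) against this cheap series.
- record: namespace:container_log_rate:avg7d
  expr: avg_over_time(rate(container_log_entries_total[5m])[7d:1h])
```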

  5. Add a change management note that DEBUG logging in production requires a revert date:

    logging:
      level: INFO
      # NEVER change to DEBUG without a revert plan and time-boxed duration
      # See incident INC-2024-1218 for why
    

Verification

# Verify log level is back to INFO
kubectl logs -n backend deploy/user-service --tail=5 | jq '.level'

# Check disk usage is recovering
ssh node-3 'df -h /var/log'

# Monitor I/O wait (should drop within minutes)
ssh node-3 'iostat -x 1 5'

# Check latency is recovering
curl -s http://user-service.backend:8080/metrics | grep http_request_duration_seconds

# Verify no eviction pressure
kubectl describe node node-3 | grep -A5 Conditions

Artifact Decoder

  • CLI Output
      Revealed: 85GB of logs on node-3, 94% disk full, all pods 1/1 Running
      Misleading: Pods look completely healthy (Running, no restarts, reasonable CPU/memory)
  • Metrics
      Revealed: Latency up 6.4x over 7 days; I/O wait from 2.1% to 38.1%; 10x log volume jump on Dec 12
      Misleading: CPU and memory usage look normal; the latency increase is gradual enough to miss
  • IaC Snippet
      Revealed: logging.level: DEBUG changed on Dec 11, the trigger for everything
      Misleading: The config looks like a simple, innocuous one-line change
  • Log Lines
      Revealed: Debug logs show every SQL query and cache lookup (massive volume); kubelet eviction manager is active
      Misleading: The log content is perfectly valid; nothing looks like an error or anomaly in the messages

Skills Demonstrated

  • Connecting a configuration change to symptoms that appear days later
  • Understanding the disk I/O impact of excessive logging
  • Recognizing cascading failures (config change -> log volume -> disk fill -> I/O contention -> latency)
  • Understanding kubelet ephemeral-storage eviction
  • Designing monitoring to catch slow-moving degradation (predictive disk alerts, volume anomaly detection)

Prerequisite Topic Packs