Incident Replay: Runaway Logs Fill Disk
Setup
- System context: Production API server with a 100GB root partition. Application logs are written to /var/log/app/ with no log rotation configured. Disk is filling rapidly.
- Time: Sunday 23:00 UTC
- Your role: On-call SRE
Round 1: Alert Fires
[Pressure cue: "Disk usage alert — server api-prod-02 root partition at 92% and climbing. 1% increase every 10 minutes. At this rate, full in 80 minutes."]
What you see:
df -h / shows 92% used. du -sh /var/log/* shows /var/log/app/ at 68GB and growing. The application log file is 65GB and currently being written to.
Choose your action:
- A) Delete the log file: rm /var/log/app/application.log
- B) Truncate the log file: > /var/log/app/application.log
- C) Set up logrotate immediately
- D) Find out why the log is so large before taking action
If you chose B (recommended):
[Result: Truncation frees the space immediately while keeping the open file handle valid. Disk drops to about 27% (92GB used minus the 65GB file). The application continues writing to the same file without interruption. Immediate pressure relieved. Proceed to Round 2.]
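A minimal sketch of why truncation works where rm does not: the writer's open file descriptor stays valid, and an append-mode descriptor simply continues at the new end of file. This demo uses a scratch file as a stand-in for the real log.

```shell
LOG=$(mktemp)               # stand-in for /var/log/app/application.log
exec 3>>"$LOG"              # hold an append-mode fd open, as the app does
echo "old data" >&3
: > "$LOG"                  # truncate in place: blocks freed, inode kept
echo "new data" >&3         # the writer continues without error
out=$(cat "$LOG")           # only the post-truncation write remains
echo "$out"
exec 3>&-
rm -f "$LOG"
```

Note that a writer holding a non-append descriptor would keep its old offset and produce a sparse file after truncation, but the on-disk usage still drops either way.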
If you chose A:
[Result: rm removes the directory entry, but the application still holds the file handle open. df still shows 92% used. The space is not reclaimed until the process closes the file or is restarted. Classic pitfall.]
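The pitfall is easy to demonstrate. On Linux, a deleted-but-open file remains reachable through the process's fd links under /proc (lsof +L1 also lists such "(deleted)" files). A scratch file stands in for the real log here.

```shell
F=$(mktemp)                     # stand-in for the deleted log file
exec 3>>"$F"                    # the "application" holds the file open
head -c 4096 /dev/zero >&3
rm "$F"                         # directory entry gone, blocks still allocated
held=$(wc -c < /proc/$$/fd/3)   # Linux: reopen the inode via the fd link
echo "bytes still on disk: $held"
exec 3>&-                       # space is reclaimed only when the fd closes
```

In a real incident, truncating through the fd link (: > /proc/PID/fd/N, with PID and FD taken from lsof output) reclaims the space without restarting the process.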
If you chose D:
[Result: Good practice under normal circumstances, but the disk is 80 minutes from full. Act first, then investigate.]
If you chose C:
[Result: logrotate will not take effect until its next scheduled run (typically a daily cron job or systemd timer) or a manual invocation. The disk will be full before then.]
Round 2: First Triage Data
[Pressure cue: "Disk pressure relieved. Now find out why the logs are so large."]
What you see: The application log is growing at ~700MB/hour. Normal rate is ~50MB/hour. The application has been logging a stack trace for every request since 20:00 UTC — a new middleware component is catching exceptions and logging them verbosely.
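One quick way to confirm a growth rate is to sample the file size at a fixed interval and diff. A minimal sketch, using a scratch file as a stand-in for the real log (in production, point LOG at the actual file and drop the simulated write):

```shell
LOG=$(mktemp)                        # stand-in for /var/log/app/application.log
s1=$(wc -c < "$LOG")                 # first sample
head -c 700000 /dev/zero >> "$LOG"   # simulated burst of verbose logging
s2=$(wc -c < "$LOG")                 # second sample (in production: after sleep 60)
delta=$((s2 - s1))
echo "grew $delta bytes since last sample"
rm -f "$LOG"
```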
Choose your action:
- A) Identify and fix the middleware component causing excessive logging
- B) Set up logrotate to manage log size
- C) Redirect logs to /dev/null until the issue is fixed
- D) Reduce the log level from DEBUG to WARN
If you chose A (recommended):
[Result: The middleware was deployed at 19:45 UTC. It has a try/catch block that logs the full stack trace at INFO level for every request, including successful ones. It was meant to log only on errors. Fix the conditional and redeploy. Log rate drops to normal. Proceed to Round 3.]
If you chose B:
[Result: logrotate manages file size but does not fix the 14x increase in log volume. You would be rotating 700MB/hour of mostly garbage. Treat the cause, not the symptom.]
If you chose C:
[Result: Sending logs to /dev/null means you lose all observability. If something breaks, you are blind.]
If you chose D:
[Result: The excessive logging is at INFO level, not DEBUG. Dropping the level to WARN would silence the flood, but it would also suppress legitimate INFO logs across the application.]
Round 3: Root Cause Identification
[Pressure cue: "Logging fixed. Why was there no protection?"]
What you see: Root cause: No logrotate configuration existed for the application logs. The application has been running for 6 months with a single growing log file. The middleware bug accelerated the problem but even normal logging would have eventually filled the disk (~1.2GB/day, 100GB partition, ~80 day runway).
Choose your action:
- A) Add logrotate config: daily rotation, 7-day retention, 100MB max size, compress
- B) Send logs to a centralized log aggregator (ELK/Loki) and reduce local retention
- C) Add disk usage monitoring and alerting
- D) All of the above
If you chose D (recommended):
[Result: logrotate for local protection, centralized logging for analysis, disk alerting for early warning. Defense in depth. Proceed to Round 4.]
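A logrotate config matching option A might look like the following; the config filename is illustrative. copytruncate matters in this scenario because the application keeps a single file handle open and is not known to reopen its log on rotation.

```
# /etc/logrotate.d/application (filename illustrative)
/var/log/app/application.log {
    daily
    rotate 7
    maxsize 100M
    compress
    delaycompress
    missingok
    notifempty
    copytruncate
}
```

maxsize triggers rotation even between daily runs once the file exceeds 100MB; delaycompress keeps the most recent rotated file uncompressed so a still-writing process cannot corrupt the archive.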
If you chose A:
[Result: logrotate is essential but does not help with analysis or alerting.]
If you chose B:
[Result: Centralized logging is best practice but local retention is still needed for when the aggregator is unavailable.]
If you chose C:
[Result: Alerting catches the problem earlier but does not prevent it.]
Round 4: Remediation
[Pressure cue: "Protections in place. Verify and close."]
Actions:
1. Verify the log rate is back to normal: sample ls -lh /var/log/app/application.log a few minutes apart and confirm the growth rate
2. Verify logrotate config: logrotate -d /etc/logrotate.d/application
3. Verify disk usage alert: check monitoring dashboard
4. Test logrotate manually: logrotate -f /etc/logrotate.d/application
5. Review other servers for missing logrotate configs
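Step 5 can be partly automated: a sketch that flags log directories with no matching entry under logrotate's config directory. This is a self-contained demo using a temp dir; in production, point CONF_DIR at /etc/logrotate.d and list your real log locations.

```shell
CONF_DIR=$(mktemp -d)        # stand-in for /etc/logrotate.d
echo "/var/log/app/application.log { daily }" > "$CONF_DIR/application"
missing=0
for dir in /var/log/app /var/log/legacy; do   # illustrative log locations
    if grep -rqs "$dir" "$CONF_DIR"; then
        echo "$dir: covered"
    else
        echo "$dir: NO logrotate config"
        missing=$((missing + 1))
    fi
done
rm -rf "$CONF_DIR"
echo "uncovered directories: $missing"
```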
Damage Report
- Total downtime: 0 (disk was filling but had not yet caused service impact)
- Blast radius: Single server at risk; no user impact because truncation was done in time
- Optimal resolution time: 10 minutes (truncate -> identify verbose middleware -> fix -> add logrotate)
- If every wrong choice was made: 90+ minutes culminating in disk-full service outage
Cross-References
- Primer: Logging
- Primer: Linux Ops
- Primer: Disk & Storage Ops
- Footguns: Linux Ops