Incident Replay: Disk Full on Root Partition — Services Down¶
Setup¶
- System context: Production application server running multiple microservices. Root partition (/) is 50GB. Separate /var/log mount was planned but never implemented.
- Time: Tuesday 02:47 UTC
- Your role: On-call SRE
Round 1: Alert Fires¶
[Pressure cue: "PagerDuty fires. Multiple services failing health checks. Auto-escalation in 5 minutes."]
What you see:
Monitoring dashboard shows 4 services on host app-prod-03 returning 503 errors. SSH connection is sluggish. df -h shows / at 100%.
Choose your action:
- A) Immediately delete /var/log/*.log to free space
- B) Run df -h and du -sh /* to identify what is consuming space
- C) Restart all failing services to clear any cached data
- D) Expand the root partition using LVM
If you chose A:¶
[Result: You free 2GB, but several running services still hold open file handles to the deleted logs, so the space is not actually reclaimed until those processes are restarted. df still shows 100%. Confusion ensues.]
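The deleted-but-still-held behavior above can be reproduced safely. The sketch below uses a throwaway temp file and the Linux /proc filesystem; lsof's +L1 option is the usual way to hunt for such files host-wide.

```shell
# Demonstration on a throwaway temp file: a deleted file that a process still
# holds open keeps its disk blocks allocated until the descriptor is closed.
tmp=$(mktemp)
exec 3>"$tmp"            # hold the file open on descriptor 3
rm "$tmp"                # unlink it; the blocks are NOT freed yet
ls -l "/proc/$$/fd/3"    # Linux: shows "... -> /tmp/tmp.XXXX (deleted)"
# Host-wide, `lsof +L1` lists open files whose link count is zero, i.e. the
# deleted-but-still-held logs that keep df at 100%.
exec 3>&-                # closing the descriptor finally releases the space
```

This is why deleting an active log frees nothing until the writing process restarts or closes the file.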
If you chose B (recommended):¶
[Result: du -sh /* reveals /var/log is 38GB. Within that, a single application log file (app-debug.log) is 35GB. The application had DEBUG logging accidentally enabled in production. Proceed to Round 2.]
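A minimal triage sketch of option B, assuming GNU coreutils. The -x flag keeps du on the root filesystem so that other mounts do not skew the ranking:

```shell
# Rank space consumers on the full filesystem, largest first.
df -h /                                    # confirm which filesystem is full
du -xsh /* 2>/dev/null | sort -rh | head   # -x: stay on one filesystem
du -ah /var/log 2>/dev/null | sort -rh | head -5   # drill into the winner
```

Piping through `sort -rh` sorts human-readable sizes (K/M/G) correctly, which plain `sort -rn` does not.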
If you chose C:¶
[Result: Services fail to restart because they cannot write PID files or temp files to the full disk. You have made things worse.]
If you chose D:¶
[Result: Root is not on LVM — it is a standard partition. LVM expansion is not possible without downtime and disk restructuring. Wrong approach.]
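Whether option D was ever viable can be checked in seconds with util-linux tools, a sketch:

```shell
# What backs / ? findmnt and lsblk ship with util-linux.
findmnt -n -o SOURCE,FSTYPE /        # e.g. "/dev/sda2 ext4" = plain partition
lsblk -no NAME,TYPE 2>/dev/null      # any TYPE "lvm" rows would indicate LVM
# No "lvm" rows plus a /dev/sdXN source means growing / needs repartitioning.
```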
Round 2: First Triage Data¶
[Pressure cue: "Customer-facing API is returning errors. Support tickets are flooding in."]
What you see:
The 35GB debug log is from a deploy 6 hours ago that set LOG_LEVEL=DEBUG in the service config. The log is growing at ~6GB/hour.
Choose your action:
- A) Truncate the log file in place: > /var/log/app/app-debug.log
- B) Stop the application, delete the log, fix the log level, restart
- C) Set up log rotation to prevent future issues
- D) Move the log file to a different mount with more space
If you chose A (recommended):¶
[Result: Truncating in place frees the space immediately (inode preserved, file handles still valid). Disk drops to 15% used. Services begin recovering. Proceed to Round 3.]
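The truncation can be sketched on a throwaway file; in the incident the target was /var/log/app/app-debug.log. Because the inode survives, processes writing to the file keep valid handles and the blocks are freed immediately:

```shell
# In-place truncation demo: same inode before and after, space freed at once.
log=$(mktemp)
head -c 1048576 /dev/zero > "$log"   # simulate a 1 MiB log
du -k "$log"                         # space used before
: > "$log"                           # truncate (equivalent: truncate -s 0 "$log")
du -k "$log"                         # 0 after: space reclaimed immediately
rm -f "$log"
```

One caveat: a writer opened without O_APPEND keeps its old offset after truncation, leaving a sparse hole in the file; writers in append mode (the common case for logs) continue cleanly.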
If you chose B:¶
[Result: Works but stopping the application extends the outage by another 3-5 minutes. Unnecessary when truncation is available.]
If you chose C:¶
[Result: Log rotation is a prevention measure, not an immediate fix. The disk is still full right now.]
If you chose D:¶
[Result: Moving a 35GB file off a full disk fails: mv across filesystems is a copy followed by a delete, and there is not enough space for the copy.]
Round 3: Root Cause Identification¶
[Pressure cue: "Services are recovering. Find the root cause and prevent recurrence."]
What you see: A deployment 6 hours ago changed the log level from INFO to DEBUG via an environment variable. The change shipped in a config commit that also included a legitimate feature flag update, so the DEBUG setting slipped past code review.
Choose your action:
- A) Revert the log level to INFO in the running service config
- B) Roll back the entire deployment to the previous version
- C) Fix the config and implement log-level guardrails
- D) Just add log rotation and leave DEBUG on
If you chose C (recommended):¶
[Result: Config fixed to INFO. Additionally: add a CI check that flags LOG_LEVEL=DEBUG in production configs, implement log rotation as defense in depth, add disk usage alerting at 80%. Proceed to Round 4.]
If you chose A:¶
[Result: Fixes the immediate issue but no prevention. Will happen again.]
If you chose B:¶
[Result: Overkill — the feature flag change was needed. Only the log level was wrong.]
If you chose D:¶
[Result: DEBUG logging in production has significant performance overhead beyond just disk space. Not acceptable.]
Round 4: Remediation¶
[Pressure cue: "Services recovered. Close the incident."]
Actions:
1. Verify disk usage is healthy: df -h /
2. Verify all services are healthy: check health endpoints
3. Deploy config fix: LOG_LEVEL=INFO
4. Add logrotate config: daily rotation, 7 days retention, 1GB max size
5. Add disk usage alert at 80% threshold
6. Add CI lint rule to reject LOG_LEVEL=DEBUG in production configs
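Items 4 and 6 can be sketched as follows. The logrotate install path and the CI config directory (config/production/) are assumptions; the logrotate directives themselves are standard:

```shell
# Logrotate policy matching the remediation: daily, 7 days retained, 1GB cap.
cat <<'EOF' > app-logrotate.conf        # would live at /etc/logrotate.d/app
/var/log/app/*.log {
    daily
    rotate 7
    maxsize 1G
    compress
    missingok
    copytruncate
}
EOF

# CI lint rule: fail the build if any production config enables DEBUG logging.
if grep -rn 'LOG_LEVEL=DEBUG' config/production/ 2>/dev/null; then
    echo "ERROR: LOG_LEVEL=DEBUG found in production config" >&2
    exit 1
fi
```

copytruncate rotates without requiring the application to reopen its log file, the same inode-preserving trick used in Round 2.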
Damage Report¶
- Total downtime: 25 minutes (from first alert to services healthy)
- Blast radius: 4 microservices on one host; customer-facing API degraded
- Optimal resolution time: 8 minutes (identify large file -> truncate -> fix config)
- If every wrong choice was made: 60+ minutes plus risk of cascading failures from service restart attempts
Cross-References¶
- Primer: Linux Ops
- Primer: Disk & Storage Ops
- Primer: Logging
- Footguns: Linux Ops