Incident Replay: Disk Full on Root Partition — Services Down¶
Setup¶
- System context: Production application server running multiple microservices. Root partition (/) is 50GB. Separate /var/log mount was planned but never implemented.
- Time: Tuesday 02:47 UTC
- Your role: On-call SRE
Round 1: Alert Fires¶
[Pressure cue: "PagerDuty fires. Multiple services failing health checks. Auto-escalation in 5 minutes."]
What you see:
Monitoring dashboard shows 4 services on host app-prod-03 returning 503 errors. SSH connection is sluggish. df -h shows / at 100%.
Choose your action:
- A) Immediately delete /var/log/*.log to free space
- B) Run df -h and du -sh /* to identify what is consuming space
- C) Restart all failing services to clear any cached data
- D) Expand the root partition using LVM
If you chose A:¶
[Result: You free 2GB, but several running services still hold open file handles to the deleted logs, so the space is not actually reclaimed until those processes are restarted. df still shows 100%. Confusion ensues.]
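The deleted-but-still-held behavior above can be reproduced safely. The sketch below uses a throwaway temp file and the Linux /proc filesystem; lsof's +L1 option is the usual way to hunt for such files host-wide.

```shell
# Demonstration on a throwaway temp file: a deleted file that a process still
# holds open keeps its disk blocks allocated until the descriptor is closed.
tmp=$(mktemp)
exec 3>"$tmp"            # hold the file open on descriptor 3
rm "$tmp"                # unlink it; the blocks are NOT freed yet
ls -l "/proc/$$/fd/3"    # Linux: shows "... -> /tmp/tmp.XXXX (deleted)"
# Host-wide, `lsof +L1` lists open files whose link count is zero, i.e. the
# deleted-but-still-held logs that keep df at 100%.
exec 3>&-                # closing the descriptor finally releases the space
```

This is why deleting an active log frees nothing until the writing process restarts or closes the file.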
If you chose B (recommended):¶
[Result: du -sh /* reveals /var/log is 38GB. Within that, a single application log file (app-debug.log) is 35GB. The application had DEBUG logging accidentally enabled in production. Proceed to Round 2.]
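A minimal triage sketch of option B, assuming GNU coreutils. The -x flag keeps du on the root filesystem so that other mounts do not skew the ranking:

```shell
# Rank space consumers on the full filesystem, largest first.
df -h /                                    # confirm which filesystem is full
du -xsh /* 2>/dev/null | sort -rh | head   # -x: stay on one filesystem
du -ah /var/log 2>/dev/null | sort -rh | head -5   # drill into the winner
```

Piping through `sort -rh` sorts human-readable sizes (K/M/G) correctly, which plain `sort -rn` does not.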
If you chose C:¶
[Result: Services fail to restart because they cannot write PID files or temp files to the full disk. You have made things worse.]
If you chose D:¶
[Result: Root is not on LVM — it is a standard partition. LVM expansion is not possible without downtime and disk restructuring. Wrong approach.]
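Whether option D was ever viable can be checked in seconds with util-linux tools, a sketch:

```shell
# What backs / ? findmnt and lsblk ship with util-linux.
findmnt -n -o SOURCE,FSTYPE /        # e.g. "/dev/sda2 ext4" = plain partition
lsblk -no NAME,TYPE 2>/dev/null      # any TYPE "lvm" rows would indicate LVM
# No "lvm" rows plus a /dev/sdXN source means growing / needs repartitioning.
```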
Round 2: First Triage Data¶
[Pressure cue: "Customer-facing API is returning errors. Support tickets are flooding in."]
What you see:
The 35GB debug log is from a deploy 6 hours ago that set LOG_LEVEL=DEBUG in the service config. The log is growing at ~6GB/hour.
Choose your action:
- A) Truncate the log file in place: > /var/log/app/app-debug.log
- B) Stop the application, delete the log, fix the log level, restart
- C) Set up log rotation to prevent future issues
- D) Move the log file to a different mount with more space
If you chose A (recommended):¶
[Result: Truncating in place frees the space immediately (inode preserved, file handles still valid). Disk drops to 15% used. Services begin recovering. Proceed to Round 3.]
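The truncation can be sketched on a throwaway file; in the incident the target was /var/log/app/app-debug.log. Because the inode survives, processes writing to the file keep valid handles and the blocks are freed immediately:

```shell
# In-place truncation demo: same inode before and after, space freed at once.
log=$(mktemp)
head -c 1048576 /dev/zero > "$log"   # simulate a 1 MiB log
du -k "$log"                         # space used before
: > "$log"                           # truncate (equivalent: truncate -s 0 "$log")
du -k "$log"                         # 0 after: space reclaimed immediately
rm -f "$log"
```

One caveat: a writer opened without O_APPEND keeps its old offset after truncation, leaving a sparse hole in the file; writers in append mode (the common case for logs) continue cleanly.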
If you chose B:¶
[Result: Works but stopping the application extends the outage by another 3-5 minutes. Unnecessary when truncation is available.]
If you chose C:¶
[Result: Log rotation is a prevention measure, not an immediate fix. The disk is still full right now.]
If you chose D:¶
[Result: Moving a 35GB file off a full disk fails: mv across filesystems is a copy followed by a delete, and there is not enough space for the copy.]
Round 3: Root Cause Identification¶
[Pressure cue: "Services are recovering. Find the root cause and prevent recurrence."]
What you see: A deployment 6 hours ago changed the log level from INFO to DEBUG via an environment variable. The change shipped in a config commit that also included a legitimate feature flag update, so the DEBUG setting slipped past code review.
Choose your action:
- A) Revert the log level to INFO in the running service config
- B) Roll back the entire deployment to the previous version
- C) Fix the config and implement log-level guardrails
- D) Just add log rotation and leave DEBUG on
If you chose C (recommended):¶
[Result: Config fixed to INFO. Additionally: add a CI check that flags LOG_LEVEL=DEBUG in production configs, implement log rotation as defense in depth, add disk usage alerting at 80%. Proceed to Round 4.]
If you chose A:¶
[Result: Fixes the immediate issue but no prevention. Will happen again.]
If you chose B:¶
[Result: Overkill — the feature flag change was needed. Only the log level was wrong.]
If you chose D:¶
[Result: DEBUG logging in production has significant performance overhead beyond just disk space. Not acceptable.]
Round 4: Remediation¶
[Pressure cue: "Services recovered. Close the incident."]
Actions:
1. Verify disk usage is healthy: df -h /
2. Verify all services are healthy: check health endpoints
3. Deploy config fix: LOG_LEVEL=INFO
4. Add logrotate config: daily rotation, 7 days retention, 1GB max size
5. Add disk usage alert at 80% threshold
6. Add CI lint rule to reject LOG_LEVEL=DEBUG in production configs
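Items 4 and 6 can be sketched as follows. The logrotate install path and the CI config directory (config/production/) are assumptions; the logrotate directives themselves are standard:

```shell
# Logrotate policy matching the remediation: daily, 7 days retained, 1GB cap.
cat <<'EOF' > app-logrotate.conf        # would live at /etc/logrotate.d/app
/var/log/app/*.log {
    daily
    rotate 7
    maxsize 1G
    compress
    missingok
    copytruncate
}
EOF

# CI lint rule: fail the build if any production config enables DEBUG logging.
if grep -rn 'LOG_LEVEL=DEBUG' config/production/ 2>/dev/null; then
    echo "ERROR: LOG_LEVEL=DEBUG found in production config" >&2
    exit 1
fi
```

copytruncate rotates without requiring the application to reopen its log file, the same inode-preserving trick used in Round 2.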
Damage Report¶
- Total downtime: 25 minutes (from first alert to services healthy)
- Blast radius: 4 microservices on one host; customer-facing API degraded
- Optimal resolution time: 8 minutes (identify large file -> truncate -> fix config)
- If every wrong choice was made: 60+ minutes plus risk of cascading failures from service restart attempts
Cross-References¶
- Primer: Linux Ops
- Primer: Disk & Storage Ops
- Primer: Logging
- Footguns: Linux Ops