Solution: Disk Full Root - Services Down
Triage
- Confirm the situation:
- Find the largest consumers immediately:
- Check for deleted files still held open (these consume space but are invisible to du):
- Quick win -- free space immediately so services can restart:
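The four triage steps might look roughly like the sketch below. The specific commands and the helper names are assumptions rather than the original procedure; they use only standard df, du, and lsof options and shell truncation.

```shell
#!/bin/sh
# Triage helpers (hypothetical names); run as root on the affected host.

# 1. Confirm the situation: block usage and inode usage of a filesystem.
confirm_usage() {
    df -h "${1:-/}"
    df -i "${1:-/}"
}

# 2. Largest consumers one level down, staying on one filesystem (-x).
#    Sizes are in KiB so that a plain numeric sort works.
largest_consumers() {
    du -xk -d 1 "${1:-/}" 2>/dev/null | sort -rn | head -n 15
}

# 3. Deleted files still held open: lsof +L1 lists open files whose
#    link count is below 1.
deleted_but_open() {
    lsof +L1 2>/dev/null
}

# 4. Quick win: empty a file in place so any open descriptor stays valid.
truncate_in_place() {
    : > "$1"
}
```

A typical sequence would be confirm_usage /, then largest_consumers / followed by largest_consumers /var/log to drill down, then deleted_but_open, and finally truncate_in_place on the offending file once it has been identified.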
Root Cause
In this scenario, the application on api-gateway-03 had debug-level logging enabled (set 3 days ago during a troubleshooting session and never reverted). The application wrote ~38GB of debug logs to /var/log/app/debug.log over 72 hours. Log rotation was configured for weekly rotation with 7-day retention, so the file had not yet been rotated.
Combined with normal system logs and package cache, the 50GB root partition filled completely.
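The growth rate implied by these figures shows why weekly rotation could not help. The sketch below is a back-of-the-envelope check using only the numbers from this scenario (38 GB in 72 hours, 50 GB partition). It assumes a roughly constant write rate, and growth_report is a hypothetical helper name.

```shell
#!/bin/sh
# Growth-rate arithmetic for the scenario; assumes a constant write rate.
growth_report() {
    awk 'BEGIN {
        gb = 38; hours = 72; disk = 50
        per_day = gb / hours * 24
        printf "GB per hour: %.2f\n", gb / hours
        printf "GB per day: %.1f\n", per_day
        printf "GB per week: %.1f\n", per_day * 7
        printf "Days to fill %d GB from empty: %.1f\n", disk, disk / per_day
    }'
}
growth_report
```

At roughly 12.7 GB per day, one week of debug logs would need about 89 GB, more than the whole partition, so the first weekly rotation could never have arrived in time.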
Fix
- Immediate space recovery (get services running):
- Restart affected services:
- Fix the root cause -- disable debug logging:
- Improve log rotation:

      # /etc/logrotate.d/app-gateway
      /var/log/app/*.log {
          daily
          rotate 7
          compress
          maxsize 500M
          missingok
          notifempty
          postrotate
              systemctl reload app-gateway
          endscript
      }

  Adding maxsize 500M lets rotation trigger on size as well as on schedule. The size is only checked when logrotate itself runs, so a fast-growing log also needs logrotate to run more often than once a day (for example hourly).
- Add disk space monitoring threshold: Set alerts at 75% (warning) and 90% (critical) instead of only alerting at 100%.
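The 75% and 90% thresholds can be expressed as a small check that a cron job or monitoring agent might call. This is a minimal sketch rather than a drop-in for any particular monitoring system; the function names classify_usage and usage_pct are hypothetical.

```shell
#!/bin/sh
# Map a usage percentage to a severity using the thresholds from the text.
classify_usage() {
    pct=$1
    if [ "$pct" -ge 90 ]; then
        echo CRITICAL
    elif [ "$pct" -ge 75 ]; then
        echo WARNING
    else
        echo OK
    fi
}

# Current usage of a mount point as a bare integer, parsed from the
# POSIX output format of df (-P), where column 5 is the capacity.
usage_pct() {
    df -P "${1:-/}" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

pct=$(usage_pct /)
echo "root filesystem at ${pct}%: $(classify_usage "$pct")"
```

Separating the classification from the df parsing keeps the threshold logic easy to test and to reuse for other mount points such as /var.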
Rollback / Safety
- Never rm a file that a running process has open. Use truncation (: >) instead, which cuts the file to zero length but keeps the file descriptor valid.
- If you must delete and the process has the file open, restart the process afterward to release the space.
- Before cleaning anything in /var/log, check if there are compliance or audit retention requirements.
- Keep at least the most recent rotated logs for post-incident analysis.
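The difference between truncating and deleting an open file can be reproduced safely on a throwaway temp file. The sketch below uses file descriptor 3 to stand in for a logging process, and demo_truncate_vs_rm is a hypothetical name. It assumes the writer opened the log in append mode, which is common for log files; a writer without append mode would keep writing at its old offset after truncation and leave a sparse file.

```shell
#!/bin/sh
# Demo on a temp file: truncating vs deleting a file that is held open.
demo_truncate_vs_rm() {
    tmp=$(mktemp -d) || return 1
    log="$tmp/debug.log"

    # Hold the file open in append mode on fd 3, as a logger would.
    exec 3>>"$log"
    printf 'line one\n' >&3

    : > "$log"                     # truncate in place; fd 3 stays valid
    printf 'line two\n' >&3        # append mode: lands at the new end of file
    printf 'after truncate: %s\n' "$(cat "$log")"

    rm "$log"                      # unlinks the path; the inode lives on
    printf 'line three\n' >&3      # still succeeds, but is unreachable by path
    if [ -e "$log" ]; then
        echo "after rm: path still exists"
    else
        echo "after rm: path gone, space held until fd 3 closes"
    fi

    exec 3>&-                      # closing the descriptor releases the space
    rm -rf "$tmp"
}
demo_truncate_vs_rm
```

After the truncation the file contains only the second line and the writer carries on normally. After the rm, writes still succeed but the data is invisible to du and is only reclaimed when the descriptor is closed.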
Common Traps
- Trap: Deleting a large log file with rm while the process still has it open. Space is NOT freed until the process closes the file descriptor. Use lsof +L1 to find these.
- Trap: Only checking du output. Deleted-but-open files don't show in du but consume space. The df and du totals won't match.
- Trap: Cleaning space but forgetting to restart failed services. Systemd may not restart services that failed due to a full disk, for example when the unit has no Restart= policy or has already hit its start limit.
- Trap: Not finding the root cause. If you just clean up without fixing the debug logging, the disk will fill again in 3 days.
- Trap: Running apt autoremove without checking what it will remove -- verify the package list first.
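Several of these traps can be checked from the command line. The helpers below are a sketch with hypothetical names, built only on standard df, du, systemctl, and apt-get options. A df/du gap also has other causes (reserved blocks, filesystem metadata, files hidden beneath mount points), so treat a large gap as a hint to run lsof +L1, not as proof.

```shell
#!/bin/sh
# Gap in KiB between what df reports as used and what du can see.
# Pass a mount point; run as root so du can read every directory.
df_du_gap_kib() {
    mount="${1:-/}"
    used=$(df -Pk "$mount" | awk 'NR==2 { print $3 }')
    seen=$(du -sxk "$mount" 2>/dev/null | awk '{ print $1 }')
    echo $((used - seen))
}

# Units that are currently in the failed state and need a manual restart.
failed_units() {
    systemctl --failed --no-legend
}

# Show what autoremove would remove without changing anything.
preview_autoremove() {
    apt-get --simulate autoremove
}
```

For example, df_du_gap_kib / printing a value in the tens of millions of KiB would point toward deleted-but-open files, and failed_units gives the list of services to restart once space has been recovered.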