Solution: Root Disk Full - Services Down

Triage

  1. Confirm the situation:

    df -h /
    df -i /
    

  2. Find the largest consumers immediately:

    du -sh /var/* 2>/dev/null | sort -rh | head -10
    du -sh /var/log/* 2>/dev/null | sort -rh | head -10
    du -sh /tmp/* 2>/dev/null | sort -rh | head -5
    

  3. Check for deleted files still held open (these consume space but are invisible to du):

    # +L1 restricts output to files with a link count below 1, i.e. deleted but still open
    lsof +L1
    
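If lsof isn't installed on the box, the same information can be recovered from /proc. A minimal fallback sketch (it relies on Linux appending " (deleted)" to the targets of fd symlinks whose file has been unlinked):

```shell
# Every open fd is a symlink under /proc/<pid>/fd; unlinked targets
# carry a " (deleted)" suffix. Count them without needing lsof.
deleted_open=$(find /proc/[0-9]*/fd -type l 2>/dev/null \
    | xargs -r -n1 readlink 2>/dev/null \
    | grep -c ' (deleted)$' || true)
echo "open-but-deleted files: ${deleted_open}"
```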

  4. Quick win -- free space immediately so services can restart:

    # Truncate large log files (do NOT delete, as the process holding them open won't release space)
    : > /var/log/large-app-log.log
    
    # Clean systemd journal if it's large
    journalctl --vacuum-size=100M
    
    # Clean apt cache
    apt clean
    
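After the quick wins, a useful sanity check is comparing what df reports against what du can account for: a large gap usually means deleted-but-open files are still holding space (step 3). A sketch using POSIX-format output so the field positions are stable:

```shell
# Both figures in KiB. df asks the filesystem; du walks the tree, so
# space held by deleted-but-open files shows up only in the df number.
df_used_kb=$(df -P -k / | awk 'NR==2 {print $3}')
du_used_kb=$(du -sxk / 2>/dev/null | awk '{print $1}')
echo "df sees ${df_used_kb} KiB used; du can account for ${du_used_kb} KiB"
```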

Root Cause

In this scenario, the application on api-gateway-03 had debug-level logging enabled (set 3 days ago during a troubleshooting session and never reverted). The application wrote ~38GB of debug logs to /var/log/app/debug.log over 72 hours. Log rotation was configured with a 7-day retention and weekly rotation, so it had not yet rotated the file.

Combined with normal system logs and package cache, the 50GB root partition filled completely.
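A runaway writer like this is easy to confirm by sampling the suspect file's size twice: ~38GB over 72 hours works out to roughly 150 KB/s of sustained growth. A sketch (the path and interval in the usage line are illustrative):

```shell
# Estimate a file's growth rate in bytes/second by sampling its size twice.
growth_rate() {
    f=$1; interval=${2:-5}
    before=$(stat -c %s "$f")
    sleep "$interval"
    after=$(stat -c %s "$f")
    echo $(( (after - before) / interval ))
}
# Usage: growth_rate /var/log/app/debug.log 10
```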

Fix

  1. Immediate space recovery (get services running):

    # Truncate the oversized debug log (preserves file handle)
    : > /var/log/app/debug.log
    
    # Clean old journals
    journalctl --vacuum-size=200M
    
    # Remove old kernels
    apt autoremove -y
    
    # Clean package cache
    apt clean
    
    # Verify space recovered
    df -h /
    

  2. Restart affected services:

    systemctl restart nginx
    systemctl restart app-gateway
    systemctl status nginx app-gateway
    

  3. Fix the root cause -- disable debug logging:

    # In application config, change log level from DEBUG back to WARN/INFO
    sed -i 's/log_level: debug/log_level: warn/' /etc/app/config.yml
    systemctl restart app-gateway
    
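A stray sed edit on a live config is easy to get wrong, so a slightly safer variant backs the file up and verifies the substitution before you restart anything. The function name and the `log_level:` key format are assumptions taken from the scenario's config:

```shell
# Edit log_level in a YAML-style config with a backup and a post-check.
# Returns nonzero (so you can skip the restart) if the new level isn't present.
set_log_level() {
    conf=$1; from=$2; to=$3
    cp "$conf" "$conf.bak" || return 1
    sed -i "s/log_level: $from/log_level: $to/" "$conf"
    grep -q "log_level: $to" "$conf"
}
# Usage: set_log_level /etc/app/config.yml debug warn && systemctl restart app-gateway
```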

  4. Improve log rotation:

    # /etc/logrotate.d/app-gateway
    /var/log/app/*.log {
        daily
        rotate 7
        compress
        maxsize 500M
        missingok
        notifempty
        postrotate
            systemctl reload app-gateway
        endscript
    }
    
    Adding maxsize 500M ensures rotation triggers on size, not just schedule.

  5. Add disk space monitoring threshold: Set alerts at 75% (warning) and 90% (critical) instead of only alerting at 100%.
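Until proper monitoring is wired up, those thresholds can be enforced with a few lines of shell in cron. A minimal sketch (the percentages mirror this runbook; the echo lines are placeholders for your real alerting hook):

```shell
# Cron-friendly root-filesystem check; replace the echos with your alert hook.
WARN_PCT=75
CRIT_PCT=90
usage=$(df -P / | awk 'NR==2 {sub(/%/, "", $5); print $5}')
if [ "$usage" -ge "$CRIT_PCT" ]; then
    echo "CRITICAL: / at ${usage}%"
elif [ "$usage" -ge "$WARN_PCT" ]; then
    echo "WARNING: / at ${usage}%"
else
    echo "OK: / at ${usage}%"
fi
```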

Rollback / Safety

  • Never rm a file that a running process has open. Use truncation (: >) instead: it frees the blocks immediately while the writer's file descriptor stays valid.
  • If you must delete and the process has the file open, restart the process afterward to release the space.
  • Before cleaning anything in /var/log, check if there are compliance or audit retention requirements.
  • Keep at least the most recent rotated logs for post-incident analysis.

Common Traps

  • Trap: Deleting a large log file with rm while the process still has it open. Space is NOT freed until the process closes the file descriptor. Use lsof +L1 to find these.
  • Trap: Only checking du output. Deleted-but-open files don't show in du but consume space. The df and du totals won't match.
  • Trap: Cleaning space but forgetting to restart failed services. Units that exhausted their start-rate limit while the disk was full won't come back on their own: check systemctl --failed, then reset-failed and restart them.
  • Trap: Not finding the root cause. If you just clean up without fixing the debug logging, the disk will fill again in 3 days.
  • Trap: Running apt autoremove without checking what it will remove -- verify the package list first.
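That autoremove preview is one flag away: -s (--simulate) prints what apt-get would remove without touching anything. The guard just keeps the snippet harmless on non-Debian hosts:

```shell
# Preview removals before committing; -s/--simulate performs no changes.
if command -v apt-get >/dev/null 2>&1; then
    apt-get -s autoremove
else
    echo "apt-get not available on this host"
fi
```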