Solution: Root Disk Full - Services Down

Triage

  1. Confirm the situation:

    df -h /
    df -i /
    

  2. Find the largest consumers immediately:

    du -sh /var/* 2>/dev/null | sort -rh | head -10
    du -sh /var/log/* 2>/dev/null | sort -rh | head -10
    du -sh /tmp/* 2>/dev/null | sort -rh | head -5
    

  3. Check for deleted files still held open (these consume space but are invisible to du):

    # +L1 restricts output to files with a link count below 1, i.e. deleted but still open
    lsof +L1
    
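If lsof isn't installed on the box, the same information can be recovered from /proc. A minimal fallback sketch (it relies on Linux appending " (deleted)" to the targets of fd symlinks whose file has been unlinked):

```shell
# Every open fd is a symlink under /proc/<pid>/fd; unlinked targets
# carry a " (deleted)" suffix. Count them without needing lsof.
deleted_open=$(find /proc/[0-9]*/fd -type l 2>/dev/null \
    | xargs -r -n1 readlink 2>/dev/null \
    | grep -c ' (deleted)$' || true)
echo "open-but-deleted files: ${deleted_open}"
```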

  4. Quick win -- free space immediately so services can restart:

    # Truncate large log files (do NOT delete, as the process holding them open won't release space)
    : > /var/log/large-app-log.log
    
    # Clean systemd journal if it's large
    journalctl --vacuum-size=100M
    
    # Clean apt cache
    apt clean
    
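After the quick wins, a useful sanity check is comparing what df reports against what du can account for: a large gap usually means deleted-but-open files are still holding space (step 3). A sketch using POSIX-format output so the field positions are stable:

```shell
# Both figures in KiB. df asks the filesystem; du walks the tree, so
# space held by deleted-but-open files shows up only in the df number.
df_used_kb=$(df -P -k / | awk 'NR==2 {print $3}')
du_used_kb=$(du -sxk / 2>/dev/null | awk '{print $1}')
echo "df sees ${df_used_kb} KiB used; du can account for ${du_used_kb} KiB"
```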

Root Cause

In this scenario, the application on api-gateway-03 had debug-level logging enabled (set 3 days ago during a troubleshooting session and never reverted). The application wrote ~38GB of debug logs to /var/log/app/debug.log over 72 hours. Log rotation was configured with a 7-day retention and weekly rotation, so it had not yet rotated the file.

Combined with normal system logs and package cache, the 50GB root partition filled completely.
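A runaway writer like this is easy to confirm by sampling the suspect file's size twice: ~38GB over 72 hours works out to roughly 150 KB/s of sustained growth. A sketch (the path and interval in the usage line are illustrative):

```shell
# Estimate a file's growth rate in bytes/second by sampling its size twice.
growth_rate() {
    f=$1; interval=${2:-5}
    before=$(stat -c %s "$f")
    sleep "$interval"
    after=$(stat -c %s "$f")
    echo $(( (after - before) / interval ))
}
# Usage: growth_rate /var/log/app/debug.log 10
```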

Fix

  1. Immediate space recovery (get services running):

    # Truncate the oversized debug log (preserves file handle)
    : > /var/log/app/debug.log
    
    # Clean old journals
    journalctl --vacuum-size=200M
    
    # Remove old kernels
    apt autoremove -y
    
    # Clean package cache
    apt clean
    
    # Verify space recovered
    df -h /
    

  2. Restart affected services:

    systemctl restart nginx
    systemctl restart app-gateway
    systemctl status nginx app-gateway
    

  3. Fix the root cause -- disable debug logging:

    # In application config, change log level from DEBUG back to WARN/INFO
    sed -i 's/log_level: debug/log_level: warn/' /etc/app/config.yml
    systemctl restart app-gateway
    
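A stray sed edit on a live config is easy to get wrong, so a slightly safer variant backs the file up and verifies the substitution before you restart anything. The function name and the `log_level:` key format are assumptions taken from the scenario's config:

```shell
# Edit log_level in a YAML-style config with a backup and a post-check.
# Returns nonzero (so you can skip the restart) if the new level isn't present.
set_log_level() {
    conf=$1; from=$2; to=$3
    cp "$conf" "$conf.bak" || return 1
    sed -i "s/log_level: $from/log_level: $to/" "$conf"
    grep -q "log_level: $to" "$conf"
}
# Usage: set_log_level /etc/app/config.yml debug warn && systemctl restart app-gateway
```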

  4. Improve log rotation:

    # /etc/logrotate.d/app-gateway
    /var/log/app/*.log {
        daily
        rotate 7
        compress
        maxsize 500M
        missingok
        notifempty
        postrotate
            systemctl reload app-gateway
        endscript
    }
    
    Adding maxsize 500M ensures rotation triggers on size, not just schedule.

  5. Add disk space monitoring threshold: Set alerts at 75% (warning) and 90% (critical) instead of only alerting at 100%.
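Until proper monitoring is wired up, those thresholds can be enforced with a few lines of shell in cron. A minimal sketch (the percentages mirror this runbook; the echo lines are placeholders for your real alerting hook):

```shell
# Cron-friendly root-filesystem check; replace the echos with your alert hook.
WARN_PCT=75
CRIT_PCT=90
usage=$(df -P / | awk 'NR==2 {sub(/%/, "", $5); print $5}')
if [ "$usage" -ge "$CRIT_PCT" ]; then
    echo "CRITICAL: / at ${usage}%"
elif [ "$usage" -ge "$WARN_PCT" ]; then
    echo "WARNING: / at ${usage}%"
else
    echo "OK: / at ${usage}%"
fi
```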

Rollback / Safety

  • Never rm a file that a running process has open. Use truncation (: >) instead: it frees the blocks immediately while the writer's file descriptor stays valid.
  • If you must delete and the process has the file open, restart the process afterward to release the space.
  • Before cleaning anything in /var/log, check if there are compliance or audit retention requirements.
  • Keep at least the most recent rotated logs for post-incident analysis.

Common Traps

  • Trap: Deleting a large log file with rm while the process still has it open. Space is NOT freed until the process closes the file descriptor. Use lsof +L1 to find these.
  • Trap: Only checking du output. Deleted-but-open files don't show in du but consume space. The df and du totals won't match.
  • Trap: Cleaning space but forgetting to restart failed services. Units that exhausted their start-rate limit while the disk was full won't come back on their own: check systemctl --failed, then reset-failed and restart them.
  • Trap: Not finding the root cause. If you just clean up without fixing the debug logging, the disk will fill again in 3 days.
  • Trap: Running apt autoremove without checking what it will remove -- verify the package list first.
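That autoremove preview is one flag away: -s (--simulate) prints what apt-get would remove without touching anything. The guard just keeps the snippet harmless on non-Debian hosts:

```shell
# Preview removals before committing; -s/--simulate performs no changes.
if command -v apt-get >/dev/null 2>&1; then
    apt-get -s autoremove
else
    echo "apt-get not available on this host"
fi
```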