
Runbook: Disk Full

Domain: Linux
Alert: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.10 (< 10% free)
Severity: P1 if the filesystem is at 100%; P2 if it is over 90% full
Est. Resolution Time: 15–30 minutes
Escalation Timeout: 20 minutes; page if not resolved
Last Tested: 2026-03-19
Prerequisites: SSH access to the node; sudo or root access

Quick Assessment (30 seconds)

# Run this first — it tells you the scope of the problem
df -h
If output shows one filesystem at 100% → that is the culprit; note the mount point and proceed to Step 2.
If output shows all filesystems under 90% → this is likely inode exhaustion rather than block usage; run df -i and proceed to Step 2 using the mount point whose inodes are full.
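As a quick sketch of the same triage, the fullest filesystem can be pulled out programmatically (assumes GNU coreutils df with `--output` support; memory-backed filesystems are excluded):

```shell
# Hypothetical triage helper: print the mount point with the highest block
# usage, skipping tmpfs/devtmpfs which are memory-backed.
fullest=$(df --output=pcent,target -x tmpfs -x devtmpfs 2>/dev/null \
  | tail -n +2 | sort -rn | head -1 | awk '{print $2}')
echo "Most-full filesystem: ${fullest}"
```

Useful when triaging many nodes at once, since the output is a single machine-readable line.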

Step 1: Identify Which Filesystem Is Full

Why: A system has multiple filesystems. You need to know whether /, /var, /var/log, or a data partition is full — the fix differs for each.

# Show all filesystems with usage
df -h

# Also check inode usage (can be full even when blocks are free)
df -i

# List mount points to understand what lives where
lsblk -f
Expected output:
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        50G   47G  3.0G  94% /
/dev/sdb1       200G  180G   20G  90% /var/lib/kubelet
If this fails: If df hangs, a mount (often NFS or another network filesystem) may be unresponsive, or the disk may be failing. Check dmesg | tail -50 for I/O errors before proceeding.

Step 2: Find What Is Consuming Space

Why: You cannot safely delete files without knowing what they are and whether any process is actively writing to them.

# Find the top 20 directories by size, starting from the full mount point
sudo du -hx --max-depth=3 <MOUNT_POINT> | sort -rh | head -20

# Quick top-level scan
sudo du -hx --max-depth=1 <MOUNT_POINT>

# If ncdu is available (much faster interactive view)
sudo ncdu -x <MOUNT_POINT>

# Check specifically for large files
sudo find <MOUNT_POINT> -xdev -type f -size +500M -exec ls -lh {} \;
Expected output:
4.2G    /var/log
2.1G    /var/log/containers
800M    /var/log/pods
If this fails: If du is very slow, the filesystem may have millions of small files (common in /var/lib/kubelet). Use find with -maxdepth to narrow down rather than scanning everything.
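The narrowing pass mentioned above can be sketched as an entry count per top-level directory, which is fast even with millions of inodes (MOUNT_POINT here is an assumption; substitute the full mount point from Step 1):

```shell
# Hypothetical narrowing pass when du is too slow: count directory entries
# instead of sizing every file. Permission errors are silenced.
MOUNT_POINT=/var
report=$(for d in "$MOUNT_POINT"/*/; do
  printf '%8d  %s\n' "$(find "$d" -xdev -maxdepth 1 2>/dev/null | wc -l)" "$d"
done | sort -rn | head -10)
echo "$report"
```

Directories with abnormally high entry counts are the usual inode-exhaustion suspects.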

Step 3: Clear Logs and Temp Files Safely

Why: Log files and temp files are usually the safest things to delete — they are regenerated and rarely needed long-term.

# Check log sizes before deleting
ls -lh /var/log/

# Truncate (not delete) large log files that are actively written to
sudo truncate -s 0 /var/log/<LOGFILE_NAME>

# For journald, vacuum old logs (keep last 500MB or last 7 days)
sudo journalctl --vacuum-size=500M
sudo journalctl --vacuum-time=7d

# Clear old rotated/compressed logs
sudo find /var/log -name "*.gz" -mtime +7 -delete
sudo find /var/log -type f \( -name "*.1" -o -name "*.old" \) -delete

# Clear /tmp
sudo find /tmp -type f -atime +3 -delete
Expected output:
# df -h after cleanup should show freed space
Vacuuming done, freed 2.1G of archived journals from /var/log/journal.
If this fails: If log files are regenerated within minutes to fill the disk, a process is in a crash loop producing excessive logs. Find it: sudo du -hx /var/log/pods | sort -rh | head -10.
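To confirm whether a writer is refilling the disk faster than cleanup can free it, a rough growth check is to sample /var/log twice (a hedged sketch; the 5-second window and the unprivileged du are assumptions, so unreadable subtrees are skipped):

```shell
# Hypothetical growth check: two du samples a few seconds apart reveal
# whether something is actively filling /var/log.
a=$(du -sx /var/log 2>/dev/null | awk '{print $1}')
sleep 5
b=$(du -sx /var/log 2>/dev/null | awk '{print $1}')
echo "/var/log grew by $(( b - a )) KiB in 5 seconds"
```

Sustained growth of more than a few MiB per second points at a crash-looping process rather than stale logs.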

Step 4: Check for Deleted-But-Open Files

Why: When you delete a file that a running process still has open, the space is NOT freed on Linux until the process closes the file descriptor. This is a frequent gotcha.

# Find processes holding open deleted files
sudo lsof +L1 2>/dev/null | grep deleted

# Show the size of deleted-but-held files
sudo lsof +L1 2>/dev/null | awk 'NR>1 {print $7, $1, $2, $10}' | sort -rn | head -20
Expected output:
# If processes are holding deleted files, you'll see something like:
123456789 java 4521 /var/log/app.log (deleted)
If this fails: Once you find the culprit process, the safest fix is to restart it gracefully (not kill -9). After restart, the deleted file descriptor is released and the space is freed. Confirm with df -h after restart.
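The gotcha is easy to reproduce safely. The following sketch (all paths are temporary; assumes Linux /proc) shows a deleted file still held open by a running process:

```shell
# Hypothetical demonstration: a deleted file keeps consuming space until
# the process holding it open exits.
tmp=$(mktemp)
dd if=/dev/zero of="$tmp" bs=1M count=5 status=none
tail -f "$tmp" >/dev/null 2>&1 & holder=$!
sleep 1                                  # let tail open the file
rm "$tmp"                                # removed from the directory tree...
still_held=$(ls -l "/proc/$holder/fd" 2>/dev/null | grep -c '(deleted)')
kill "$holder" 2>/dev/null               # ...space returns once the holder exits
echo "deleted-but-held descriptors: ${still_held}"
```

The `(deleted)` marker in /proc/PID/fd is the same signal lsof +L1 surfaces across all processes.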

Step 5: Clean Docker/Container Images and Volumes

Why: On nodes that run container workloads, /var/lib/docker or /var/lib/containerd accumulates unused images, stopped containers, and anonymous volumes.

# Check container runtime disk usage
sudo du -hx --max-depth=2 /var/lib/docker 2>/dev/null | sort -rh | head -10
sudo du -hx --max-depth=2 /var/lib/containerd 2>/dev/null | sort -rh | head -10

# Docker: prune unused images, containers, networks, and build cache
sudo docker system prune -f
sudo docker image prune -a -f   # WARNING: removes ALL unused images, not just dangling

# containerd: remove unused snapshots
sudo crictl rmi --prune

# Check Kubernetes eviction threshold (do not go below kubelet reserved space)
grep -A5 eviction /var/lib/kubelet/config.yaml
Expected output:
Total reclaimed space: 12.5 GB
If this fails: If /var/lib/kubelet/pods is the main culprit, evict or reschedule the pods consuming the space: kubectl drain <NODE_NAME> --ignore-daemonsets --delete-emptydir-data.
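Before pruning, it can be worth snapshotting image usage so the reclaimed amount is verifiable afterwards. A hedged pre-flight sketch (assumes the Docker CLI; degrades gracefully when it is absent):

```shell
# Hypothetical pre-flight: record image disk usage before pruning so the
# "Total reclaimed space" figure can be cross-checked after.
if command -v docker >/dev/null 2>&1; then
  before=$(docker system df --format '{{.Size}}' 2>/dev/null | head -1)
else
  before="docker not installed"
fi
echo "Images size before prune: ${before:-unknown}"
```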

Step 6: Add Capacity or Move Data

Why: If cleanup frees enough space for now but the trend will fill the disk again within hours, a permanent fix is needed.

# Check disk growth rate from node metrics
# (In Prometheus: predict_linear(node_filesystem_avail_bytes[6h], 4 * 3600) < 0 predicts the filesystem fills within 4 hours)

# Resize cloud volume (AWS example — adjust for your cloud provider)
aws ec2 modify-volume --volume-id <VOLUME_ID> --size <NEW_SIZE_GB>

# After cloud resize, grow the partition and extend the filesystem
# (ext4 on a plain partition; for LVM, use lvextend -r on the logical volume instead)
sudo growpart /dev/<DISK_DEVICE> <PARTITION_NUMBER>
sudo resize2fs /dev/<PARTITION_DEVICE>

# For XFS:
sudo xfs_growfs <MOUNT_POINT>

# Verify new size
df -h <MOUNT_POINT>
Expected output:
/dev/sda1        100G   47G   53G  47% /
If this fails: If you cannot resize the volume (on-prem hardware), move the largest directory to a new mount point using a bind mount or symlink. Page infrastructure team for hardware capacity additions.
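When deciding between cleanup and resize, a back-of-envelope time-to-full estimate helps. A sketch under stated assumptions (the 2 GiB/hour fill rate is invented here; substitute the observed rate from metrics):

```shell
# Hypothetical estimate: hours until / fills, given current free space and
# an ASSUMED fill rate of 2 GiB per hour.
avail_bytes=$(df --output=avail -B1 / | tail -1 | tr -d ' ')
fill_rate=$((2 * 1024 * 1024 * 1024))
echo "Hours until / is full at the assumed rate: $(( avail_bytes / fill_rate ))"
```

If the answer is smaller than the time needed to resize, prioritize cleanup first.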

Verification

# Confirm the issue is resolved
df -h && df -i
Success looks like: all filesystems below 80% block usage and below 80% inode usage.
If still broken: escalate (see below).
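The success criterion can be scripted as a pass/fail check (a sketch assuming GNU df field names; memory-backed filesystems are excluded):

```shell
# Hypothetical pass/fail wrapper for the 80% block- and inode-usage target.
over=$(df --output=pcent,ipcent,target -x tmpfs -x devtmpfs 2>/dev/null \
  | tail -n +2 | awk '{gsub(/%/,""); if ($1+0 > 80 || $2+0 > 80) print $3}')
[ -z "$over" ] && echo "OK: all filesystems under 80%" \
               || echo "STILL OVER 80%: ${over}"
```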

Escalation

Condition: Not resolved in 20 min
Page: Infrastructure on-call
Say: "Disk full on node , filesystem , cannot free enough space, node may evict pods"

Condition: Data loss suspected
Page: Application team lead
Say: "Filesystem full on may have caused write failures to application data in "

Condition: Scope expanding to multiple nodes
Page: SRE lead
Say: "Multiple nodes hitting disk capacity — possible cluster-wide log/metrics explosion or storage misconfiguration"

Post-Incident

  • Update monitoring if alert was noisy or missing
  • File postmortem if P1/P2
  • Update this runbook if steps were wrong or incomplete

Common Mistakes

  1. Deleting files that are still open by processes: Space is not freed on Linux when you delete a file that a running process has open — the space is only released when the process closes or restarts. Always check with lsof +L1 after deleting large files.
  2. Not checking /var/log separately: The root filesystem / may show as 100% but /var/log is the real culprit mounted separately. Check both block usage and the mount layout with lsblk.
  3. Ignoring inode exhaustion: A filesystem can have 0% blocks used but still be "full" if all inodes are consumed. Always run df -i alongside df -h. Inode exhaustion is common in /var/lib/kubelet with many small files.
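Mistake 3 can be screened for in one pass by comparing the two percentages side by side (a sketch assuming GNU df; the 20-point gap threshold is an assumption):

```shell
# Hypothetical inode-vs-block comparison: flag filesystems whose inode
# usage runs well ahead of their block usage.
scan=$(df --output=ipcent,pcent,target -x tmpfs -x devtmpfs 2>/dev/null \
  | tail -n +2 \
  | awk '{gsub(/%/,""); if ($1+0 > $2+0 + 20) print $3}')
echo "inode-heavy filesystems: ${scan:-none}"
```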

Cross-References

