- linux
- l1
- runbook
- disk-troubleshooting
- filesystem

Portal | Level: L1: Foundations | Topics: Filesystems & Storage | Domain: Linux
Runbook: Disk Full¶
| Field | Value |
|---|---|
| Domain | Linux |
| Alert | node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.10 (< 10% free) |
| Severity | P1 (if filesystem at 100%), P2 (if >90% full) |
| Est. Resolution Time | 15-30 minutes |
| Escalation Timeout | 20 minutes — page if not resolved |
| Last Tested | 2026-03-19 |
| Prerequisites | SSH access to the node, sudo or root access |
Quick Assessment (30 seconds)¶
Run df -h.
- If output shows one filesystem at 100% → that is the culprit; note the mount point and proceed to Step 2.
- If output shows all filesystems under 90% → this is likely inode exhaustion, not block usage — run df -i and skip to Step 2 using the mount point whose inodes are full.
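The 30-second check can be scripted; a minimal sketch using GNU df's --output (the 90% threshold matches this runbook's P2 line):

```shell
# Flag any filesystem over 90% block or inode usage in one pass (GNU coreutils)
df -h --output=target,pcent -x tmpfs -x devtmpfs | awk 'NR>1 && int($2) >= 90 {print "BLOCKS", $0}'
df -i --output=target,ipcent -x tmpfs -x devtmpfs | awk 'NR>1 && int($2) >= 90 {print "INODES", $0}'
```

The awk int() coercion turns "94%" into 94, so no separate stripping of the percent sign is needed.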
Step 1: Identify Which Filesystem Is Full¶
Why: A system has multiple filesystems. You need to know whether /, /var, /var/log, or a data partition is full — the fix differs for each.
```shell
# Show all filesystems with usage
df -h
# Also check inode usage (can be full even when blocks are free)
df -i
# List mount points to understand what lives where
lsblk -f
```
Example output:
```
Filesystem      Size  Used  Avail Use%  Mounted on
/dev/sda1        50G   47G   3.0G  94%  /
/dev/sdb1       200G  180G    20G  90%  /var/lib/kubelet
```
If df hangs, the filesystem may be unmountable due to corruption. Check dmesg | tail -50 for I/O errors before proceeding.
Step 2: Find What Is Consuming Space¶
Why: You cannot safely delete files without knowing what they are and whether any process is actively writing to them.
```shell
# Find the top 20 directories by size, starting from the full mount point
sudo du -hx --max-depth=3 <MOUNT_POINT> | sort -rh | head -20
# Quick top-level scan
sudo du -hx --max-depth=1 <MOUNT_POINT>
# If ncdu is available (much faster interactive view)
sudo ncdu -x <MOUNT_POINT>
# Check specifically for large files (-xdev stays on one filesystem)
sudo find <MOUNT_POINT> -xdev -type f -size +500M -exec ls -lh {} \;
```
If du is very slow, the filesystem may have millions of small files (common in /var/lib/kubelet). Use find with -maxdepth to narrow down rather than scanning everything.
Step 3: Clear Logs and Temp Files Safely¶
Why: Log files and temp files are usually the safest things to delete — they are regenerated and rarely needed long-term.
```shell
# Check log sizes before deleting
ls -lh /var/log/
# Truncate (not delete) large log files that are actively written to
sudo truncate -s 0 /var/log/<LOGFILE_NAME>
# For journald, vacuum old logs (keep last 500MB or last 7 days)
sudo journalctl --vacuum-size=500M
sudo journalctl --vacuum-time=7d
# Clear old rotated/compressed logs
sudo find /var/log -name "*.gz" -mtime +7 -delete
sudo find /var/log -type f \( -name "*.1" -o -name "*.old" \) -delete
# Clear /tmp files not accessed in 3 days
sudo find /tmp -type f -atime +3 -delete
```
On Kubernetes nodes, container logs under /var/log/pods are often the biggest consumer; check with sudo du -hx /var/log/pods | sort -rh | head -10.
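The reason this step says truncate rather than delete for live logs: truncation keeps the same inode that writers hold open, so the space is freed immediately and the process keeps appending safely. A throwaway demo on a temp file:

```shell
# Demo: truncating a file frees its space immediately, even while it is held open
f=$(mktemp)
printf 'x%.0s' $(seq 1 100000) > "$f"   # simulate a ~100 KB log
ls -lh "$f"                             # non-zero size
truncate -s 0 "$f"
ls -lh "$f"                             # size is now 0
rm -f "$f"
```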
Step 4: Check for Deleted-But-Open Files¶
Why: When you delete a file that a running process still has open, the space is NOT freed on Linux until the process closes the file descriptor. This is a frequent gotcha.
```shell
# Find processes holding open deleted files
sudo lsof +L1 2>/dev/null | grep deleted
# Show the size of deleted-but-held files (size, command, PID, path)
sudo lsof +L1 2>/dev/null | awk 'NR>1 {print $7, $1, $2, $9}' | sort -rn | head -20
```
If processes are holding deleted files, you'll see something like:
```
123456789 java 4521 /var/log/app.log (deleted)
```
Restart (or gracefully reload) the offending process to release the space, then confirm with df -h after restart.
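If the holding process cannot be restarted right away, the deleted file can be emptied through its /proc fd symlink instead. A self-contained sketch of the mechanism, using fd 3 of the current shell to stand in for a daemon's log descriptor:

```shell
# Hold a file open, delete it, then reclaim the space via /proc without closing it
f=$(mktemp)
exec 3>"$f"                  # open fd 3 for writing (simulates a daemon's log fd)
printf 'x%.0s' $(seq 1 100000) >&3
rm "$f"                      # deleted, but the space is still held by fd 3
stat -L -c %s /proc/$$/fd/3  # deleted file is still ~100 KB
truncate -s 0 /proc/$$/fd/3  # space reclaimed; the process keeps its fd
stat -L -c %s /proc/$$/fd/3  # now 0
exec 3>&-                    # release the descriptor
```

On a real incident the PID and FD come from the lsof output above, e.g. sudo truncate -s 0 /proc/<PID>/fd/<FD>.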
Step 5: Clean Docker/Container Images and Volumes¶
Why: On nodes that run container workloads, /var/lib/docker or /var/lib/containerd accumulates unused images, stopped containers, and anonymous volumes.
```shell
# Check container runtime disk usage
sudo du -hx --max-depth=2 /var/lib/docker 2>/dev/null | sort -rh | head -10
sudo du -hx --max-depth=2 /var/lib/containerd 2>/dev/null | sort -rh | head -10
# Docker: prune unused images, containers, networks, and build cache
sudo docker system prune -f
sudo docker image prune -a -f  # WARNING: removes ALL unused images, not just dangling
# containerd/CRI: remove unused images
sudo crictl rmi --prune
# Check Kubernetes eviction thresholds (do not clean below kubelet reserved space)
grep -A5 eviction /var/lib/kubelet/config.yaml
```
If /var/lib/kubelet/pods is the main culprit, evict or reschedule the pods consuming the space: kubectl drain <NODE_NAME> --ignore-daemonsets --delete-emptydir-data.
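Before pruning, it is worth seeing how much space is actually reclaimable; docker system df reports exactly that (requires the Docker daemon, so run it on the node itself):

```shell
# Summarize image/container/volume usage and how much a prune could reclaim
sudo docker system df
# Verbose per-image and per-volume breakdown
sudo docker system df -v
```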
Step 6: Add Capacity or Move Data¶
Why: If cleanup frees enough space for now but the trend will fill the disk again within hours, a permanent fix is needed.
```shell
# Check the disk growth trend from node metrics
# (In Prometheus: predict_linear(node_filesystem_avail_bytes[6h], 4*3600) < 0
#  fires if the filesystem is projected to run out within 4 hours)
# Resize cloud volume (AWS example — adjust for your cloud provider)
aws ec2 modify-volume --volume-id <VOLUME_ID> --size <NEW_SIZE_GB>
# After the cloud resize, grow the partition and extend the filesystem (ext4 on a plain partition)
sudo growpart /dev/<DISK_DEVICE> <PARTITION_NUMBER>
sudo resize2fs /dev/<PARTITION_DEVICE>
# For XFS (must be mounted):
sudo xfs_growfs <MOUNT_POINT>
# Verify new size
df -h <MOUNT_POINT>
```
Verification¶
Success looks like: All filesystems below 80% block usage and below 80% inode usage. If still broken: escalate — see below.
Escalation¶
| Condition | Who to Page | What to Say |
|---|---|---|
| Not resolved in 20 min | Infrastructure on-call | "Disk full on node …" |
| Data loss suspected | Application team lead | "Filesystem full on …" |
| Scope expanding to multiple nodes | SRE lead | "Multiple nodes hitting disk capacity — possible cluster-wide log/metrics explosion or storage misconfiguration" |
Post-Incident¶
- Update monitoring if alert was noisy or missing
- File postmortem if P1/P2
- Update this runbook if steps were wrong or incomplete
Common Mistakes¶
- Deleting files that are still open by processes: Space is not freed on Linux when you delete a file that a running process has open — the space is only released when the process closes or restarts. Always check with lsof +L1 after deleting large files.
- Not checking /var/log separately: The root filesystem / may show as 100% but /var/log is the real culprit mounted separately. Check both block usage and the mount layout with lsblk.
- Ignoring inode exhaustion: A filesystem can have 0% blocks used but still be "full" if all inodes are consumed. Always run df -i alongside df -h. Inode exhaustion is common in /var/lib/kubelet with many small files.
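To find where the inodes went, count files per directory; a minimal sketch (the /var/lib/kubelet start point is an assumption; use the full mount point from df -i):

```shell
# Count regular files per directory; the top entries are the inode hogs
sudo find /var/lib/kubelet -xdev -type f -printf '%h\n' | sort | uniq -c | sort -rn | head -10
```

-printf '%h' emits each file's parent directory (GNU find), so uniq -c yields a per-directory file count.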
Cross-References¶
- Topic Pack: Linux Storage and Filesystems (deep background)
- Related Runbook: OOM Killer Activated
Related Content¶
- Case Study: Disk Full Root Services Down (Case Study, L1) — Filesystems & Storage
- Case Study: NVMe Drive Disappeared (Case Study, L2) — Filesystems & Storage
- Case Study: Runaway Logs Fill Disk (Case Study, L1) — Filesystems & Storage
- Case Study: Stuck NFS Mount (Case Study, L2) — Filesystems & Storage
- Deep Dive: Linux Filesystem Internals (deep_dive, L2) — Filesystems & Storage
- Deep Dive: Linux Performance Debugging (deep_dive, L2) — Filesystems & Storage
- Disk & Storage Ops (Topic Pack, L1) — Filesystems & Storage
- Inodes (Topic Pack, L1) — Filesystems & Storage
- Inodes Flashcards (CLI) (flashcard_deck, L1) — Filesystems & Storage
- Kernel Troubleshooting (Topic Pack, L3) — Filesystems & Storage