
Runbook: Disk Full

Domain: Linux
Alert: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.10 (< 10% free)
Severity: P1 if the filesystem is at 100%; P2 if it is over 90% full
Est. Resolution Time: 15–30 minutes
Escalation Timeout: 20 minutes; page if not resolved
Last Tested: 2026-03-19
Prerequisites: SSH access to the node; sudo or root access

Quick Assessment (30 seconds)

# Run this first — it tells you the scope of the problem
df -h
If output shows one filesystem at 100% → that is the culprit; note the mount point and proceed to Step 2.
If output shows all filesystems under 90% → this is likely inode exhaustion rather than block usage; run df -i and proceed to Step 2 using the mount point whose inodes are full.
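As a quick sketch of the same triage, the fullest filesystem can be pulled out programmatically (assumes GNU coreutils df with `--output` support; memory-backed filesystems are excluded):

```shell
# Hypothetical triage helper: print the mount point with the highest block
# usage, skipping tmpfs/devtmpfs which are memory-backed.
fullest=$(df --output=pcent,target -x tmpfs -x devtmpfs 2>/dev/null \
  | tail -n +2 | sort -rn | head -1 | awk '{print $2}')
echo "Most-full filesystem: ${fullest}"
```

Useful when triaging many nodes at once, since the output is a single machine-readable line.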

Step 1: Identify Which Filesystem Is Full

Why: A system has multiple filesystems. You need to know whether /, /var, /var/log, or a data partition is full — the fix differs for each.

# Show all filesystems with usage
df -h

# Also check inode usage (can be full even when blocks are free)
df -i

# List mount points to understand what lives where
lsblk -f
Expected output:
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        50G   47G  3.0G  94% /
/dev/sdb1       200G  180G   20G  90% /var/lib/kubelet
If this fails: If df hangs, a mount (often NFS or another network filesystem) may be unresponsive, or the disk may be failing. Check dmesg | tail -50 for I/O errors before proceeding.

Step 2: Find What Is Consuming Space

Why: You cannot safely delete files without knowing what they are and whether any process is actively writing to them.

# Find the top 20 directories by size, starting from the full mount point
sudo du -hx --max-depth=3 <MOUNT_POINT> | sort -rh | head -20

# Quick top-level scan
sudo du -hx --max-depth=1 <MOUNT_POINT>

# If ncdu is available (much faster interactive view)
sudo ncdu -x <MOUNT_POINT>

# Check specifically for large files
sudo find <MOUNT_POINT> -xdev -type f -size +500M -exec ls -lh {} \;
Expected output:
4.2G    /var/log
2.1G    /var/log/containers
800M    /var/log/pods
If this fails: If du is very slow, the filesystem may have millions of small files (common in /var/lib/kubelet). Use find with -maxdepth to narrow down rather than scanning everything.
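The narrowing pass mentioned above can be sketched as an entry count per top-level directory, which is fast even with millions of inodes (MOUNT_POINT here is an assumption; substitute the full mount point from Step 1):

```shell
# Hypothetical narrowing pass when du is too slow: count directory entries
# instead of sizing every file. Permission errors are silenced.
MOUNT_POINT=/var
report=$(for d in "$MOUNT_POINT"/*/; do
  printf '%8d  %s\n' "$(find "$d" -xdev -maxdepth 1 2>/dev/null | wc -l)" "$d"
done | sort -rn | head -10)
echo "$report"
```

Directories with abnormally high entry counts are the usual inode-exhaustion suspects.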

Step 3: Clear Logs and Temp Files Safely

Why: Log files and temp files are usually the safest things to delete — they are regenerated and rarely needed long-term.

# Check log sizes before deleting
ls -lh /var/log/

# Truncate (not delete) large log files that are actively written to
sudo truncate -s 0 /var/log/<LOGFILE_NAME>

# For journald, vacuum old logs (keep last 500MB or last 7 days)
sudo journalctl --vacuum-size=500M
sudo journalctl --vacuum-time=7d

# Clear old rotated/compressed logs
sudo find /var/log -name "*.gz" -mtime +7 -delete
sudo find /var/log -type f \( -name "*.1" -o -name "*.old" \) -delete

# Clear /tmp
sudo find /tmp -type f -atime +3 -delete
Expected output:
# df -h after cleanup should show freed space
Vacuuming done, freed 2.1G of archived journals from /var/log/journal.
If this fails: If log files are regenerated within minutes to fill the disk, a process is in a crash loop producing excessive logs. Find it: sudo du -hx /var/log/pods | sort -rh | head -10.
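To confirm whether a writer is refilling the disk faster than cleanup can free it, a rough growth check is to sample /var/log twice (a hedged sketch; the 5-second window and the unprivileged du are assumptions, so unreadable subtrees are skipped):

```shell
# Hypothetical growth check: two du samples a few seconds apart reveal
# whether something is actively filling /var/log.
a=$(du -sx /var/log 2>/dev/null | awk '{print $1}')
sleep 5
b=$(du -sx /var/log 2>/dev/null | awk '{print $1}')
echo "/var/log grew by $(( b - a )) KiB in 5 seconds"
```

Sustained growth of more than a few MiB per second points at a crash-looping process rather than stale logs.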

Step 4: Check for Deleted-But-Open Files

Why: When you delete a file that a running process still has open, the space is NOT freed on Linux until the process closes the file descriptor. This is a frequent gotcha.

# Find processes holding open deleted files
sudo lsof +L1 2>/dev/null | grep deleted

# Show the size of deleted-but-held files
sudo lsof +L1 2>/dev/null | awk 'NR>1 {print $7, $1, $2, $10}' | sort -rn | head -20
Expected output:
# If processes are holding deleted files, you'll see something like:
123456789 java 4521 /var/log/app.log (deleted)
If this fails: Once you find the culprit process, the safest fix is to restart it gracefully (not kill -9). After restart, the deleted file descriptor is released and the space is freed. Confirm with df -h after restart.
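The gotcha is easy to reproduce safely. The following sketch (all paths are temporary; assumes Linux /proc) shows a deleted file still held open by a running process:

```shell
# Hypothetical demonstration: a deleted file keeps consuming space until
# the process holding it open exits.
tmp=$(mktemp)
dd if=/dev/zero of="$tmp" bs=1M count=5 status=none
tail -f "$tmp" >/dev/null 2>&1 & holder=$!
sleep 1                                  # let tail open the file
rm "$tmp"                                # removed from the directory tree...
still_held=$(ls -l "/proc/$holder/fd" 2>/dev/null | grep -c '(deleted)')
kill "$holder" 2>/dev/null               # ...space returns once the holder exits
echo "deleted-but-held descriptors: ${still_held}"
```

The `(deleted)` marker in /proc/PID/fd is the same signal lsof +L1 surfaces across all processes.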

Step 5: Clean Docker/Container Images and Volumes

Why: On nodes that run container workloads, /var/lib/docker or /var/lib/containerd accumulates unused images, stopped containers, and anonymous volumes.

# Check container runtime disk usage
sudo du -hx --max-depth=2 /var/lib/docker 2>/dev/null | sort -rh | head -10
sudo du -hx --max-depth=2 /var/lib/containerd 2>/dev/null | sort -rh | head -10

# Docker: prune unused images, containers, networks, and build cache
sudo docker system prune -f
sudo docker image prune -a -f   # WARNING: removes ALL unused images, not just dangling

# containerd: remove unused snapshots
sudo crictl rmi --prune

# Check Kubernetes eviction threshold (do not go below kubelet reserved space)
grep -A5 eviction /var/lib/kubelet/config.yaml
Expected output:
Total reclaimed space: 12.5 GB
If this fails: If /var/lib/kubelet/pods is the main culprit, evict or reschedule the pods consuming the space: kubectl drain <NODE_NAME> --ignore-daemonsets --delete-emptydir-data.
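Before pruning, it can be worth snapshotting image usage so the reclaimed amount is verifiable afterwards. A hedged pre-flight sketch (assumes the Docker CLI; degrades gracefully when it is absent):

```shell
# Hypothetical pre-flight: record image disk usage before pruning so the
# "Total reclaimed space" figure can be cross-checked after.
if command -v docker >/dev/null 2>&1; then
  before=$(docker system df --format '{{.Size}}' 2>/dev/null | head -1)
else
  before="docker not installed"
fi
echo "Images size before prune: ${before:-unknown}"
```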

Step 6: Add Capacity or Move Data

Why: If cleanup frees enough space for now but the trend will fill the disk again within hours, a permanent fix is needed.

# Check disk growth rate from node metrics
# (In Prometheus: predict_linear(node_filesystem_avail_bytes[6h], 4 * 3600) < 0 predicts the filesystem fills within 4 hours)

# Resize cloud volume (AWS example — adjust for your cloud provider)
aws ec2 modify-volume --volume-id <VOLUME_ID> --size <NEW_SIZE_GB>

# After cloud resize, grow the partition and extend the filesystem
# (ext4 on a plain partition; for LVM, use lvextend -r on the logical volume instead)
sudo growpart /dev/<DISK_DEVICE> <PARTITION_NUMBER>
sudo resize2fs /dev/<PARTITION_DEVICE>

# For XFS:
sudo xfs_growfs <MOUNT_POINT>

# Verify new size
df -h <MOUNT_POINT>
Expected output:
/dev/sda1        100G   47G   53G  47% /
If this fails: If you cannot resize the volume (on-prem hardware), move the largest directory to a new mount point using a bind mount or symlink. Page infrastructure team for hardware capacity additions.
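When deciding between cleanup and resize, a back-of-envelope time-to-full estimate helps. A sketch under stated assumptions (the 2 GiB/hour fill rate is invented here; substitute the observed rate from metrics):

```shell
# Hypothetical estimate: hours until / fills, given current free space and
# an ASSUMED fill rate of 2 GiB per hour.
avail_bytes=$(df --output=avail -B1 / | tail -1 | tr -d ' ')
fill_rate=$((2 * 1024 * 1024 * 1024))
echo "Hours until / is full at the assumed rate: $(( avail_bytes / fill_rate ))"
```

If the answer is smaller than the time needed to resize, prioritize cleanup first.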

Verification

# Confirm the issue is resolved
df -h && df -i
Success looks like: all filesystems below 80% block usage and below 80% inode usage.
If still broken: escalate (see below).
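The success criterion can be scripted as a pass/fail check (a sketch assuming GNU df field names; memory-backed filesystems are excluded):

```shell
# Hypothetical pass/fail wrapper for the 80% block- and inode-usage target.
over=$(df --output=pcent,ipcent,target -x tmpfs -x devtmpfs 2>/dev/null \
  | tail -n +2 | awk '{gsub(/%/,""); if ($1+0 > 80 || $2+0 > 80) print $3}')
[ -z "$over" ] && echo "OK: all filesystems under 80%" \
               || echo "STILL OVER 80%: ${over}"
```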

Escalation

Condition: Not resolved in 20 min
Page: Infrastructure on-call
Say: "Disk full on node , filesystem , cannot free enough space, node may evict pods"

Condition: Data loss suspected
Page: Application team lead
Say: "Filesystem full on may have caused write failures to application data in "

Condition: Scope expanding to multiple nodes
Page: SRE lead
Say: "Multiple nodes hitting disk capacity — possible cluster-wide log/metrics explosion or storage misconfiguration"

Post-Incident

  • Update monitoring if alert was noisy or missing
  • File postmortem if P1/P2
  • Update this runbook if steps were wrong or incomplete

Common Mistakes

  1. Deleting files that are still open by processes: Space is not freed on Linux when you delete a file that a running process has open — the space is only released when the process closes or restarts. Always check with lsof +L1 after deleting large files.
  2. Not checking /var/log separately: The root filesystem / may show as 100% but /var/log is the real culprit mounted separately. Check both block usage and the mount layout with lsblk.
  3. Ignoring inode exhaustion: A filesystem can have 0% blocks used but still be "full" if all inodes are consumed. Always run df -i alongside df -h. Inode exhaustion is common in /var/lib/kubelet with many small files.
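Mistake 3 can be screened for in one pass by comparing the two percentages side by side (a sketch assuming GNU df; the 20-point gap threshold is an assumption):

```shell
# Hypothetical inode-vs-block comparison: flag filesystems whose inode
# usage runs well ahead of their block usage.
scan=$(df --output=ipcent,pcent,target -x tmpfs -x devtmpfs 2>/dev/null \
  | tail -n +2 \
  | awk '{gsub(/%/,""); if ($1+0 > $2+0 + 20) print $3}')
echo "inode-heavy filesystems: ${scan:-none}"
```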

Cross-References

