
Pattern: Disk Full (Reserved Blocks Gone)

ID: FP-003 Family: Resource Exhaustion Frequency: Common Blast Radius: Single Service to Multi-Service Detection Difficulty: Moderate

The Shape

ext4 and other Linux filesystems reserve 5% of blocks for the root user by default. This means the filesystem appears full to non-root processes at 95% utilization, not 100%. The gap between "disk full alert at 90%" and "actual failure at 95%" is often smaller than expected, and runaway log files or growing data can consume that buffer overnight.

How You'll See It

In Linux/Infrastructure

$ df -h /var/log
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        50G   47G     0  100% /var/log

$ tune2fs -l /dev/sda1 | grep "Reserved block"
Reserved block count:      655360    # 5% of 50G at 4KiB blocks = 2.5GB reserved for root

Services running as non-root (nginx, postgres, app processes) cannot create or extend files. Root can still write, drawing on the reserved blocks, so sudo touch /var/log/test succeeds while touch /var/log/test as the app user fails with ENOSPC. This asymmetry makes for confusing debugging.
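The reserved-block math can be checked with shell arithmetic; a minimal sketch using illustrative figures (a 50 GiB filesystem, 4 KiB blocks, the default 5% reservation — confirm "Block size" and "Reserved block count" from tune2fs -l on your own filesystem):

```shell
#!/bin/sh
# 5% of a 50 GiB filesystem with 4 KiB blocks (illustrative numbers).
FS_BLOCKS=$(( 50 * 1024 * 1024 * 1024 / 4096 ))    # 13107200 total blocks
RESERVED_BLOCKS=$(( FS_BLOCKS * 5 / 100 ))          # default 5% reservation
RESERVED_MIB=$(( RESERVED_BLOCKS * 4096 / 1024 / 1024 ))
echo "${RESERVED_BLOCKS} blocks = ${RESERVED_MIB} MiB reserved for root"
```

Running this prints 655360 blocks = 2560 MiB, which is the ~2.5GB buffer non-root processes can never touch.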

In Kubernetes

Node ephemeral storage fills from pod logs and writable container layers. The kubelet begins evicting pods when the nodefs.available signal drops below its eviction threshold. Pods are evicted even though df shows the node isn't at 100% — the reserved-block buffer accounts for the gap.
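One mitigation on the Kubernetes side is to cap per-pod ephemeral storage so a single pod's logs cannot fill the node; a sketch of the relevant pod-spec fragment (the 1Gi/2Gi figures are arbitrary examples, not recommendations):

```yaml
# Container spec fragment: the kubelet evicts the pod if its ephemeral
# usage (logs, emptyDir, writable layer) exceeds the limit.
resources:
  requests:
    ephemeral-storage: "1Gi"
  limits:
    ephemeral-storage: "2Gi"
```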

In CI/CD

Build artifacts accumulate in the agent workspace. The disk appears to have 5% free but all writes fail because the 5% is the reserved root block allocation, not actually available to the build process running as ci-user.
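A common guard here is a pre-build cleanup step that prunes stale artifacts; a minimal sketch, assuming artifacts are tarballs under a WORKSPACE directory and a 7-day retention window (both assumptions; the demo setup lines only exist to make the sketch self-contained):

```shell
#!/bin/sh
# Prune *.tar artifacts older than 7 days from the agent workspace (sketch).
WORKSPACE="${WORKSPACE:-/tmp/ci-workspace}"
mkdir -p "$WORKSPACE"
# Demo setup: one stale artifact, one fresh one.
touch -d '30 days ago' "$WORKSPACE/old-build.tar"
touch "$WORKSPACE/new-build.tar"
# The actual cleanup: delete tarballs not modified in the last 7 days.
find "$WORKSPACE" -name '*.tar' -mtime +7 -delete
ls "$WORKSPACE"
```

After the run only new-build.tar remains; wiring this in before the build step keeps the workspace below the reserved-block ceiling.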

The Tell

Non-root processes get ENOSPC while root can still write. df -h shows 100% used and "Avail" of 0, even though the filesystem isn't physically at 100% of raw capacity — the reserved blocks account for the gap.

Common Misdiagnosis

Looks Like              But Actually        How to Tell the Difference
Inode exhaustion        Block exhaustion    df -i shows inodes OK; df -h shows 100%
Permissions error       Disk full           strace shows ENOSPC; a write as root succeeds
Filesystem corruption   Block limit         fsck passes clean; df -h explains the failure
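The table's first two checks can be run in one pass; a minimal triage sketch (the target path defaults to /tmp and is a placeholder — point it at the failing mount):

```shell
#!/bin/sh
# ENOSPC triage: compare block usage vs inode usage on the affected path.
TARGET="${1:-/tmp}"
echo "== block usage =="
df -h "$TARGET"
echo "== inode usage =="
df -i "$TARGET"
# Blocks at 100%, inodes low    -> block (possibly reserved-block) exhaustion
# Inodes at 100%, blocks free   -> inode exhaustion
# Both look fine, writes fail   -> trace the write with strace; check quotas
```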

The Fix (Generic)

  1. Immediate: Delete or truncate large files (logs, core dumps, temp files). For logs of running processes, use truncate -s 0 /var/log/app.log rather than rm (avoid FP-029).
  2. Short-term: Tune reserved block percentage: tune2fs -m 1 /dev/sda1 (reduce to 1% for non-root filesystems). Implement log rotation with maxsize limits.
  3. Long-term: Separate /var/log onto its own filesystem to prevent log fills from affecting the root filesystem; add alerting at 70% and 85% to catch growth before it hits the ceiling.
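Step 1's truncate-over-rm advice can be demonstrated safely on a scratch file; a sketch using a stand-in path, not a real service log:

```shell
#!/bin/sh
# Why truncate, not rm: truncation frees the blocks immediately even if a
# process still holds the file open; rm leaves the space consumed until the
# last open descriptor closes (the FP-029 trap).
LOG=/tmp/demo-app.log              # stand-in for /var/log/app.log
dd if=/dev/zero of="$LOG" bs=1024 count=64 2>/dev/null
du -k "$LOG"                       # ~64 KiB on disk
truncate -s 0 "$LOG"               # in-place truncation
du -k "$LOG"                       # 0 KiB; path and open fds stay valid
```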

Real-World Examples

  • Example 1: Postgres WAL files grew faster than archiving could remove them. At 95%, postgres (non-root) could no longer create new WAL segments. Database went read-only. Root still had the reserved 5%.
  • Example 2: Docker image layers accumulated on a build node. At 95% df showed "100% used, 0 avail" for non-root builds, while root docker pull still worked.

War Story

Alert fired: "disk at 90%". We said we'd clean it up "in the morning." Overnight the WAL archiver fell behind and within 4 hours the filesystem hit 95%. Postgres stopped accepting writes. We spent 30 minutes confused because root could still write fine — even creating test files in /var/lib/postgresql/. The service account didn't have root's reserved blocks. Lesson: 90% is "clean it now," not "clean it tomorrow."

Cross-References