# The Disk That Filled Up

- Topics: filesystems, logging, log rotation, inodes, Docker storage, Kubernetes PVCs, LVM
- Level: L1–L2 (Foundations → Operations)
- Time: 60–90 minutes
- Prerequisites: None (everything is explained from scratch)
## The Mission
It's Monday morning. Alerts are firing: the API is returning 500 errors, PostgreSQL has gone into recovery mode, and Redis lost 20 minutes of session data. You SSH into the server and find that the disk is full. Everything is broken. And you need to fix it in the next 5 minutes, before the CEO's demo.
This lesson teaches you how to diagnose and fix disk space emergencies, then prevent them from happening again. Along the way, you'll learn about filesystems, inodes, log rotation, Docker storage, and Kubernetes PVCs — because a full disk touches all of them.
## Part 1: Emergency Response — Free Space Now
When the disk is 100% full, everything breaks at once. Databases can't write journals.
Logging daemons can't write logs (which means you can't see error messages about the disk
being full). Package managers can't run. Even vim might refuse to open a file because it
can't create a swap file.
Here's the emergency playbook, in order:
# Step 1: What's full?
df -h
# Look for anything at 100% or close
# Step 2: Where is the space going?
du -sh /* 2>/dev/null | sort -rh | head
# → 47G /var
# → 2.1G /usr
# → 512M /opt
# Step 3: Drill down into the biggest directory
du -sh /var/* 2>/dev/null | sort -rh | head
# → 47G /var/log
# du: read error on '/var/log/app': Input/output error ← might see this too
# Step 4: Find the specific large files
find /var/log -type f -size +100M -exec ls -lh {} \; 2>/dev/null | sort -k5 -rh | head
# → -rw-r--r-- 1 root root 47G Mar 22 09:00 /var/log/application.log
Found it. A 47GB log file. But here's where most people make a critical mistake:
# WRONG — this doesn't free the space!
rm /var/log/application.log
# The application still has the file open.
# The space is NOT freed until the process closes its file descriptor.
Under the Hood: When you `rm` a file, you're removing the directory entry (the name). The actual disk blocks aren't freed until all file descriptors pointing to those blocks are closed. If a process has the file open, the kernel keeps the inode and blocks allocated — the file is "deleted but still in use." This is one of the most common Linux gotchas.
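You can see this for yourself with a throwaway file — a minimal sketch (the path and fd number are arbitrary):

```shell
# Open fd 9 on a throwaway file and write some data through it
exec 9>/tmp/demo.log
head -c 10M /dev/zero >&9     # 10MB now on disk
rm /tmp/demo.log              # removes the name only
lsof +L1 | grep demo.log      # still listed, NLINK 0, marked "(deleted)"
df -h /tmp                    # the space is still consumed
exec 9>&-                     # close the fd — now the blocks are freed
```

Closing the last file descriptor is what actually releases the blocks, which is why restarting the offending process also works as a fix.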
# RIGHT — truncate the file in place
truncate -s 0 /var/log/application.log
# This zeros the file without removing it.
# The file descriptor stays valid, and the space is freed immediately.
# The application keeps writing to the same file, starting from offset 0.
Or if the file has already been deleted but space isn't freed:
# Find deleted-but-open files still consuming space
lsof +L1
# → COMMAND PID USER FD TYPE DEVICE SIZE/OFF NLINK NODE NAME
# → java 5678 app 1w REG 253,1 47000000 0 1234 /var/log/application.log (deleted)
# Truncate through the file descriptor
: > /proc/5678/fd/1
# The colon (:) is a no-op that returns true; > redirects nothing into the fd
War Story: A 47GB log file grew unchecked for 7 months on a 50GB root partition. Logrotate was configured, but the config had a path typo: it was rotating `/var/log/app/*.log` while the actual file was `/var/log/application.log`. DEBUG logging was left enabled in production, tripling normal log growth from 200MB/day to ~600MB/day. When the disk filled, PostgreSQL couldn't write WAL files and went into recovery mode, Redis couldn't persist its AOF file and lost 20 minutes of sessions, and Nginx started returning 500s because `error_log` couldn't write.
## Part 2: Why Logs Are the #1 Disk Killer
Logs fill disks more than anything else. Here's why, and how to stop it.
### How Linux logging works
Application → stdout/stderr or file
↓
systemd-journald (binary journal, size-limited)
↓
rsyslog / syslog-ng (text files in /var/log/)
↓
logrotate (rotation, compression, cleanup)
Three common logging paths, three things that can go wrong:
| Path | What breaks | Symptom |
|---|---|---|
| App writes directly to file | No rotation, infinite growth | One massive log file |
| systemd journal | Journal not size-limited | /var/log/journal/ grows forever |
| rsyslog | Facility misconfigured | Logs go to unexpected files |
### Fixing the journal
# Check journal disk usage
journalctl --disk-usage
# → Archived and active journals take up 4.0G on disk.
# Trim to 500MB immediately
journalctl --vacuum-size=500M
# Trim anything older than 7 days
journalctl --vacuum-time=7d
# Set permanent limits in /etc/systemd/journald.conf
# SystemMaxUse=500M
# MaxRetentionSec=7d
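Instead of editing `/etc/systemd/journald.conf` directly, systemd also accepts drop-in files, which survive package upgrades more cleanly — a sketch (the drop-in filename is an arbitrary example):

```shell
# Create a drop-in rather than editing the main config
sudo mkdir -p /etc/systemd/journald.conf.d
sudo tee /etc/systemd/journald.conf.d/99-size-limit.conf > /dev/null <<'EOF'
[Journal]
SystemMaxUse=500M
MaxRetentionSec=7d
EOF
# Restart journald to apply the new limits
sudo systemctl restart systemd-journald
journalctl --disk-usage   # verify the cap took effect
```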
### Fixing logrotate
Logrotate is the standard tool for rotating log files. It runs daily via cron or systemd timer. When it fails, logs grow until the disk fills.
# Test logrotate configuration (dry run)
sudo logrotate -d /etc/logrotate.conf
# Force rotation right now
sudo logrotate -f /etc/logrotate.conf
# Check a specific config
cat /etc/logrotate.d/nginx
A good logrotate config:
/var/log/application.log {
daily
rotate 7
compress
delaycompress
missingok
notifempty
copytruncate # ← Critical for apps that keep the file open
maxsize 500M # ← Rotate even mid-day if file exceeds this
}
The critical options:
| Option | What it does | When to use |
|---|---|---|
| `copytruncate` | Copy the log, then truncate the original | App keeps the file descriptor open (most apps) |
| `create 644 root root` | Delete old file, create new one | App reopens the file on rotation (nginx with `postrotate`) |
| `compress` | gzip old logs | Almost always — saves ~90% space |
| `delaycompress` | Don't compress the most recent rotated file | When you need to grep recent logs quickly |
| `maxsize` | Rotate if the file exceeds this size, regardless of the daily schedule | Prevent runaway growth between rotation intervals |
| `missingok` | Don't error if the file doesn't exist | Always — prevents logrotate from failing entirely |
Gotcha: The `copytruncate` vs `create` choice matters. If the app keeps its file descriptor open (most do), `create` doesn't work — the app keeps writing to the old, now-renamed file, and the new empty file stays empty. Use `copytruncate` for apps that don't reopen their log file, or use a `postrotate` script to signal the app (like `kill -USR1` for nginx).

Gotcha: Logrotate configs in `/etc/logrotate.d/` must match the actual log file paths exactly. A config for `/var/log/app/*.log` won't rotate `/var/log/application.log`. This typo is the #1 reason logrotate "stops working" — it's working fine, just on the wrong path.
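A quick sanity check for that class of bug — ask the shell and logrotate what they actually see (the config filename `application` here is an example):

```shell
# Does the glob from the config match anything at all?
ls -l /var/log/app/*.log 2>/dev/null || echo "glob matches nothing"
# Is the file you care about actually the one growing?
ls -lh /var/log/application.log
# Dry-run just that config; the "considering log ..." lines show
# exactly which files logrotate would act on
sudo logrotate -d /etc/logrotate.d/application 2>&1 | grep considering
```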
## Part 3: The Inode Trap — "Disk Full" But It Isn't
Here's a scenario that makes people question reality:
df -h /var
# Filesystem Size Used Avail Use% Mounted on
# /dev/sda3 20G 8G 12G 40% /var
# ↑ Plenty of space!
touch /var/test
# touch: cannot touch '/var/test': No space left on device
# ↑ BUT IT SAYS THERE'S SPACE!
The disk has free blocks but no free inodes. Every file and directory on the filesystem uses one inode to store its metadata (permissions, timestamps, block pointers). When you run out of inodes, you can't create new files even with gigabytes of free space.
# Check inode usage
df -i /var
# Filesystem Inodes IUsed IFree IUse% Mounted on
# /dev/sda3 1310720 1310720 0 100% /var
# ↑ Zero free inodes!
### What eats inodes
Millions of tiny files. Common culprits:
| Cause | Location | How many files |
|---|---|---|
| Mail queue explosion | `/var/spool/postfix/deferred/` | Millions of 1KB queue files |
| PHP session files | `/var/lib/php/sessions/` | One per user session, never cleaned |
| Package manager cache | `/var/cache/apt/` or `/var/cache/yum/` | Thousands of temp files |
| Container overlay layers | `/var/lib/docker/overlay2/` | Whiteout files from deleted layers |
| Monitoring agents | Various | Per-metric files, per-check files |
# Find which directory has the most files
find / -xdev -printf '%h\n' 2>/dev/null | sort | uniq -c | sort -rn | head -20
# → 2800000 /var/spool/postfix/deferred
# Ah ha — 2.8 million mail queue files
War Story: A mail relay accumulating 2.8 million deferred queue files exhausted all inodes on `/var` despite 70% of the disk space being free. Services couldn't write PID files, create temp files, or write logs. The fix: 40 minutes of `find /var/spool/postfix/deferred -type f -delete` (you can't use `rm *` because the argument list is too long — there are too many files for the shell to expand the glob).

Gotcha: `rm /path/*` fails with "Argument list too long" when there are too many files. The shell tries to expand `*` into a list of all filenames, and the list exceeds the kernel's `ARG_MAX` limit (~2MB on Linux). Use `find /path -type f -delete` instead — it processes files one at a time without shell expansion.
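You can see the limit, and the workarounds, directly (the postfix path comes from the war story above):

```shell
# The kernel's combined limit for arguments + environment, in bytes
getconf ARG_MAX      # commonly 2097152 (2MB) on Linux
# find -delete never builds an argument list at all — it unlinks per file
find /var/spool/postfix/deferred -type f -delete
# Alternative: xargs batches filenames into chunks under the limit
find /var/spool/postfix/deferred -type f -print0 | xargs -0 rm -f
```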
## Part 4: Docker Storage — The Hidden Disk Consumer
Docker can consume enormous amounts of disk space without you realizing it:
# See Docker's total disk usage
docker system df
# TYPE TOTAL ACTIVE SIZE RECLAIMABLE
# Images 45 5 12.5GB 9.8GB (78%)
# Containers 12 3 2.1GB 1.8GB (85%)
# Local Volumes 8 3 15.3GB 10.2GB (66%)
# Build Cache - - 3.4GB 3.4GB
# That's 33GB of Docker data, 25GB reclaimable
### Where Docker puts things
| What | Path | Grows because |
|---|---|---|
| Images | `/var/lib/docker/overlay2/` | Old images not pruned |
| Container writable layers | `/var/lib/docker/overlay2/` | Containers write to the ephemeral layer |
| Volumes | `/var/lib/docker/volumes/` | Data persists after container deletion |
| Build cache | `/var/lib/docker/` | Layer caching from `docker build` |
| Container logs | `/var/lib/docker/containers/<id>/` | JSON log driver with no limit |
The sneakiest one: container logs. Docker's default JSON log driver has no size limit.
A container writing to stdout fills /var/lib/docker/containers/<id>/<id>-json.log forever.
# Check container log sizes
find /var/lib/docker/containers -name "*.log" -exec ls -lh {} \; 2>/dev/null | sort -k5 -rh | head
# → -rw-r----- 1 root root 12G ... abc123-json.log
# Set log limits (per container)
docker run --log-opt max-size=100m --log-opt max-file=3 myimage
# Set log limits globally in /etc/docker/daemon.json
{
"log-driver": "json-file",
"log-opts": {
"max-size": "100m",
"max-file": "3"
}
}
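One caution worth adding: a JSON syntax error in `daemon.json` prevents the Docker daemon from starting at all, and the new defaults only apply to containers created after the restart. A sketch of a safe rollout (the `logtest` container name is an example):

```shell
# Validate the JSON first — a stray comma here can take Docker down
python3 -m json.tool /etc/docker/daemon.json
# Restart the daemon; existing containers keep their old log settings
sudo systemctl restart docker
# Confirm the defaults on a freshly created container
docker run -d --name logtest alpine sleep 60
docker inspect --format '{{.HostConfig.LogConfig}}' logtest
docker rm -f logtest
```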
### The nuclear cleanup
# Remove stopped containers, unused networks, dangling images, and build cache
docker system prune
# Also remove unused volumes (CAREFUL — this deletes data)
docker system prune --volumes
# Remove images older than 24 hours
docker image prune -a --filter "until=24h"
Gotcha: `docker system prune --volumes` deletes volumes that aren't attached to any container. If you stopped a database container but plan to restart it, the volume (with all your data) gets deleted. Always check `docker volume ls` before running prune with `--volumes`.

Under the Hood: Docker's overlay2 storage driver uses OverlayFS — a union filesystem that stacks read-only image layers with a writable container layer on top. When a container modifies a file from a lower layer, the file is copied up into the writable layer. This is why write-heavy workloads inside containers can be slow and consume unexpected disk space — use volumes for data-intensive operations instead.
## Part 5: Kubernetes PVCs and Ephemeral Storage
In Kubernetes, disk problems show up differently:
### Ephemeral storage eviction
Kubernetes monitors disk usage and evicts pods when the node runs low:
# Default eviction thresholds
# imagefs.available < 15% → evict pods (container images/writable layers)
# nodefs.available < 10% → evict pods (logs, emptyDir, etc.)
# nodefs.inodesFree < 5% → evict pods
When a pod is evicted for disk pressure, it shows:
kubectl describe pod myapp
# → Status: Failed
# → Reason: Evicted
# → Message: The node was low on resource: ephemeral-storage
### PVC troubleshooting
# PVC stuck in Pending
kubectl get pvc
# → NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS
# → data-pvc Pending standard
kubectl describe pvc data-pvc
# → Events:
# → Warning ProvisioningFailed no persistent volumes available for this claim
Common PVC failures:
| Symptom | Cause | Fix |
|---|---|---|
| PVC Pending | No matching PV or StorageClass | Check `kubectl get sc`, create the missing StorageClass |
| PVC Pending | `WaitForFirstConsumer` | Normal — PV is provisioned when the pod is scheduled |
| Multi-attach error on RWO | Previous node still has the volume attached | Force-detach via cloud API, or delete the old VolumeAttachment |
| Pod evicted | Ephemeral storage limit exceeded | Set `resources.limits.ephemeral-storage` |
Gotcha: Deleting a StatefulSet does NOT delete its PVCs. The PVCs and their data persist. This is by design (safety) but catches people who expect cleanup. You must manually delete PVCs after removing a StatefulSet.
Gotcha: `RWO` (ReadWriteOnce) means single node, not single pod. Multiple pods on the same node can mount an RWO volume simultaneously. This is not what most people expect.
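To guard against the eviction scenario above, you can cap a pod's ephemeral usage explicitly — a hypothetical fragment (the names `myapp` and `myimage` are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  containers:
  - name: app
    image: myimage
    resources:
      requests:
        ephemeral-storage: "1Gi"   # scheduler reserves this much node disk
      limits:
        ephemeral-storage: "4Gi"   # kubelet evicts the pod if it exceeds this
```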
## Part 6: Preventing Disk Emergencies
The emergency response gets you through today. Prevention keeps it from happening again.
### Separate partitions
The single most effective prevention: put /var/log, /var/lib/docker, and /tmp on
separate partitions (or LVM logical volumes). When logs fill /var/log, the root filesystem
is unaffected — the system keeps running, and you can still SSH in to fix it.
Partition Layout (recommended)
/ 20GB — OS, should never fill
/var/log 50GB — Logs, can fill without killing the OS
/var/lib/docker 100GB — Docker data, isolated
/tmp 10GB — Temp files, auto-cleared on boot
/home remaining — User data
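A sketch of carving out that `/var/log` volume with LVM on an existing volume group — all names (`vg_sys`, `lv_log`) are examples, and you'd do this in a maintenance window since services write to `/var/log` constantly:

```shell
sudo lvcreate -L 50G -n lv_log vg_sys        # new 50GB logical volume
sudo mkfs.ext4 /dev/vg_sys/lv_log            # put a filesystem on it
sudo mount /dev/vg_sys/lv_log /mnt
sudo rsync -a /var/log/ /mnt/                # copy existing logs over
sudo umount /mnt
# Persist the mount across reboots
echo '/dev/vg_sys/lv_log /var/log ext4 defaults 0 2' | sudo tee -a /etc/fstab
sudo mount /var/log                          # mounts over the old contents
```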
### Monitoring and alerting
# Prometheus node_exporter provides these metrics:
# node_filesystem_avail_bytes — free space
# node_filesystem_files_free — free inodes
# node_filesystem_size_bytes — total size
# Alert at 70% (warning) and 85% (critical)
# Also alert on GROWTH RATE — "disk will be full in 4 hours" is more useful
# than "disk is 86% full"
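Growth-rate alerting maps to PromQL's `predict_linear` — a sketch of a rule, assuming Prometheus with node_exporter (the alert name and label matcher are examples):

```yaml
groups:
- name: disk
  rules:
  - alert: DiskWillFillIn4Hours
    # Extrapolate the last 6h of free-space samples 4h into the future
    expr: predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[6h], 4 * 3600) < 0
    for: 30m
    labels:
      severity: critical
    annotations:
      summary: "Disk on {{ $labels.instance }} predicted full within 4 hours"
```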
### LVM for flexibility
If your disk fills up and you have LVM, you can expand online without downtime:
# Add a new disk to the volume group
sudo pvcreate /dev/sdc
sudo vgextend vg_data /dev/sdc
# Extend the logical volume and resize the filesystem in one command
sudo lvextend -L +50G --resizefs /dev/vg_data/lv_appdata
Under the Hood: LVM (Logical Volume Manager) adds a layer between physical disks and filesystems. Physical Volumes (PVs) are pooled into Volume Groups (VGs), which are divided into Logical Volumes (LVs). The filesystem sits on the LV. This indirection is what lets you resize without downtime — the filesystem doesn't know or care that the underlying storage just grew. ext4 and XFS both support online grow; only ext4 supports shrink (offline only). XFS cannot shrink at all — plan carefully.
## The Complete Diagnostic Ladder
"No space left on device"
│
├── Is it disk blocks or inodes?
│ ├── df -h shows full → blocks (the usual case)
│ │ │
│ │ ├── du -sh /* → find the biggest directory
│ │ ├── find -size +100M → find the biggest files
│ │ ├── lsof +L1 → find deleted-but-open files
│ │ └── truncate -s 0 file → free space immediately
│ │
│ └── df -i shows full → inodes
│ ├── find / -xdev -printf '%h\n' | sort | uniq -c | sort -rn
│ └── find /path -type f -delete (NOT rm *)
│
├── Is it Docker?
│ ├── docker system df → how much is Docker using?
│ ├── Container logs → find /var/lib/docker/containers -name "*.log"
│ ├── Old images → docker image prune -a
│ └── Build cache → docker builder prune
│
├── Is it Kubernetes?
│ ├── kubectl describe node → DiskPressure condition?
│ ├── kubectl describe pod → Evicted for ephemeral-storage?
│ └── kubectl get pvc → PVC stuck or full?
│
└── Prevention
├── Separate partitions for /var/log, /var/lib/docker
├── Logrotate with maxsize + copytruncate
├── Docker log limits (max-size, max-file)
├── Journal limits (SystemMaxUse)
└── Alerts at 70% (warning) and 85% (critical) + growth rate
## Flashcard Check
Q1: You `rm` a 10GB log file but `df -h` shows no change. Why?

A process still has the file open. The inode and blocks stay allocated until the file descriptor is closed. Use `lsof +L1` to find it, then truncate via `/proc/PID/fd/N` or restart the process.
Q2: `df -h` shows 40% used but `touch` says "No space left on device." What's wrong?

Inodes are exhausted. Check with `df -i`. Millions of tiny files consumed all inodes while leaving plenty of block space. Find and delete the tiny files.
Q3: Why use `truncate -s 0 file` instead of `rm file`?

`truncate` zeros the file in place — the file descriptor stays valid and space is freed immediately. `rm` removes the name but doesn't free space until all file descriptors are closed.
Q4: Docker container logs are 12GB. How do you prevent this?

Set log limits: `--log-opt max-size=100m --log-opt max-file=3` per container, or globally in `/etc/docker/daemon.json`.
Q5: `rm /path/*` fails with "Argument list too long." What do you use instead?

`find /path -type f -delete`. The shell can't expand millions of filenames into `*`, but `find` processes files one at a time.
Q6: Can you shrink an XFS filesystem?
No. XFS can only grow, never shrink. ext4 can shrink but only while unmounted. Plan partition sizes carefully when using XFS.
Q7: Kubernetes evicts your pod for ephemeral-storage. What triggered it?
The pod's total ephemeral storage (logs + emptyDir + container writable layer) exceeded the node's eviction threshold (default: nodefs.available < 10%).
Q8: What's the difference between logrotate's `copytruncate` and `create`?

`copytruncate`: copies the log then truncates the original — works when the app keeps the file descriptor open. `create`: deletes the old file and creates a new one — only works if the app reopens the file (via a `postrotate` signal).
## Exercises
### Exercise 1: The emergency drill (hands-on)
Simulate a full disk and practice the recovery:
# Create a big file to fill the disk (use a temp directory)
dd if=/dev/zero of=/tmp/bigfile bs=1M count=500
# Now practice the diagnostic sequence
df -h /tmp
du -sh /tmp/*
find /tmp -type f -size +100M -ls
Clean it up, then practice the `lsof +L1` scenario:
# Start a process that holds a file open
python3 -c "
import time
f = open('/tmp/testlog', 'w')
while True:
    f.write('x' * 1000 + '\n')
    f.flush()          # make the writes visible on disk
    time.sleep(0.1)
" &
# Delete the file
rm /tmp/testlog
# Verify space is NOT freed (file still open)
lsof +L1 | grep testlog
# Free space through the file descriptor (fd 3 here; confirm the number
# with: ls -l /proc/$(pgrep -f testlog)/fd)
: > /proc/$(pgrep -f testlog)/fd/3
# Kill the process
kill %1
What to observe: after `rm`, `lsof +L1` shows the file as `(deleted)` with its size still reported. The `df` output won't change until you either truncate via `/proc/PID/fd/N` or kill the process. This is the most important disk debugging skill.

### Exercise 2: Inode exhaustion (hands-on)
Create an inode exhaustion scenario and diagnose it:
# Create a directory with many tiny files
mkdir /tmp/inode-test
for i in $(seq 1 10000); do touch /tmp/inode-test/file$i; done
# Check inode usage
df -i /tmp
# Try the diagnostic command
find /tmp -xdev -printf '%h\n' | sort | uniq -c | sort -rn | head -5
# Clean up (note: rm * would fail with enough files)
find /tmp/inode-test -type f -delete
rmdir /tmp/inode-test
### Exercise 3: Docker disk audit (hands-on)
If you have Docker:
# Check current Docker disk usage
docker system df
# Check container log sizes
sudo find /var/lib/docker/containers -name "*.log" -exec ls -lh {} \; 2>/dev/null
# Run a container that writes to stdout endlessly
docker run -d --name logspam alpine sh -c 'while true; do echo "$(date) spam"; sleep 0.1; done'
# Wait 30 seconds, check the log size
sleep 30
docker inspect --format='{{.LogPath}}' logspam | xargs ls -lh
# Now run the same with log limits
docker rm -f logspam
docker run -d --name logspam --log-opt max-size=1m --log-opt max-file=2 \
alpine sh -c 'while true; do echo "$(date) spam"; sleep 0.1; done'
# Clean up
docker rm -f logspam
### Exercise 4: Write a disk monitoring one-liner (bash)
Write a command that checks all mounted filesystems and prints any that are over 80% full, by blocks or by inodes.
Solution
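One possible answer (not the only one) — parse `df`'s portable output twice, once for block usage and once for inode usage; the `$5+0` trick strips the `%` sign for the numeric comparison:

```shell
# Block usage over 80% ($5 = Use%, $6 = mount point in df -P output)
df -hP | awk 'NR>1 && $5+0 > 80 {print $6 " is " $5 " full (blocks)"}'
# Inode usage over 80% ($5 = IUse% in df -iP output)
df -iP | awk 'NR>1 && $5+0 > 80 {print $6 " is " $5 " full (inodes)"}'
```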
## Cheat Sheet
### Emergency Response
| Step | Command | Purpose |
|---|---|---|
| What's full? | `df -h` | Block usage per filesystem |
| Inodes full? | `df -i` | Inode usage per filesystem |
| Biggest dirs | `du -sh /* \| sort -rh \| head` | Find top consumers |
| Biggest files | `find / -xdev -type f -size +100M -ls 2>/dev/null` | Find large files |
| Deleted but open | `lsof +L1` | Files deleted but still held open |
| Free space now | `truncate -s 0 /path/to/file` | Zero a file in place |
| Many tiny files | `find /path -type f -delete` | Delete when `rm *` fails |
### Log Management
| Task | Command |
|---|---|
| Journal size | `journalctl --disk-usage` |
| Trim journal | `journalctl --vacuum-size=500M` |
| Test logrotate | `sudo logrotate -d /etc/logrotate.conf` |
| Force rotation | `sudo logrotate -f /etc/logrotate.conf` |
| Docker log size | `docker inspect --format='{{.LogPath}}' CONTAINER \| xargs ls -lh` |
### Docker Cleanup
| Command | What it removes |
|---|---|
| `docker system prune` | Stopped containers, unused networks, dangling images, build cache |
| `docker image prune -a` | All unused images (not just dangling) |
| `docker volume prune` | Volumes not attached to any container (data loss risk) |
| `docker system prune --volumes` | Everything above, including volumes |
### Filesystem Quick Reference
| Filesystem | Max file | Shrink? | Online grow? | Best for |
|---|---|---|---|---|
| ext4 | 16 TiB | Yes (offline) | Yes | General purpose |
| XFS | 8 EiB | No | Yes | Large files, databases |
| Btrfs | 16 EiB | Yes (online) | Yes | Snapshots, checksums |
## Takeaways
- **`truncate`, not `rm`.** Deleting an open file doesn't free space. Truncating it does. `lsof +L1` finds deleted-but-open files.
- **Inodes can run out independently of disk space.** `df -i` is the check everyone forgets. Millions of tiny files (mail queues, session files, temp files) consume inodes while leaving block space free.
- **Docker logs have no size limit by default.** Set `max-size` and `max-file` globally in `/etc/docker/daemon.json` or per container with `--log-opt`.
- **Logrotate only works if the path matches.** The #1 logrotate failure is a path typo. Test with `logrotate -d` (dry run).
- **Separate partitions save lives.** Put `/var/log` on its own partition so runaway logs can't fill the root filesystem and crash everything.
- **XFS cannot shrink.** Plan partition sizes carefully. If you might need to shrink later, use ext4 or LVM for flexibility.
## Related Lessons
- The Hanging Deploy — when processes and systemd cause problems
- Permission Denied — when you can't write to disk for a different reason