
The Disk That Filled Up

  • lesson
  • filesystems
  • logging
  • log-rotation
  • inodes
  • docker-storage
  • kubernetes-pvcs
  • lvm

Topics: filesystems, logging, log rotation, inodes, Docker storage, Kubernetes PVCs, LVM
Level: L1–L2 (Foundations → Operations)
Time: 60–90 minutes
Prerequisites: None (everything is explained from scratch)


The Mission

It's Monday morning. Alerts are firing: the API is returning 500 errors, PostgreSQL has gone into recovery mode, and Redis lost 20 minutes of session data. You SSH into the server and find this:

df -h /
# Filesystem      Size  Used Avail Use% Mounted on
# /dev/sda1        50G   50G     0 100% /

The disk is full. Everything is broken. And you need to fix it in the next 5 minutes before the CEO's demo.

This lesson teaches you how to diagnose and fix disk space emergencies, then prevent them from happening again. Along the way, you'll learn about filesystems, inodes, log rotation, Docker storage, and Kubernetes PVCs — because a full disk touches all of them.


Part 1: Emergency Response — Free Space Now

When the disk is 100% full, everything breaks at once. Databases can't write journals. Logging daemons can't write logs (which means you can't see error messages about the disk being full). Package managers can't run. Even vim might refuse to open a file because it can't create a swap file.

Here's the emergency playbook, in order:

# Step 1: What's full?
df -h
# Look for anything at 100% or close

# Step 2: Where is the space going?
du -sh /* 2>/dev/null | sort -rh | head
# → 47G  /var
# → 2.1G /usr
# → 512M /opt

# Step 3: Drill down into the biggest directory
du -sh /var/* 2>/dev/null | sort -rh | head
# → 47G  /var/log
# du: read error on '/var/log/app': Input/output error  ← might see this too

# Step 4: Find the specific large files
find /var/log -type f -size +100M -exec ls -lh {} \; 2>/dev/null | sort -k5 -rh | head
# → -rw-r--r-- 1 root root 47G Mar 22 09:00 /var/log/application.log

Found it. A 47GB log file. But here's where most people make a critical mistake:

# WRONG — this doesn't free the space!
rm /var/log/application.log
# The application still has the file open.
# The space is NOT freed until the process closes its file descriptor.

Under the Hood: When you rm a file, you're removing the directory entry (the name). The actual disk blocks aren't freed until all file descriptors pointing to those blocks are closed. If a process has the file open, the kernel keeps the inode and blocks allocated — the file is "deleted but still in use." This is one of the most common Linux gotchas.
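You can see this for yourself with a throwaway file. A minimal sketch (paths are placeholders, and it relies on Linux's /proc):

```shell
# Create a file, hold it open on fd 3, then remove its name.
tmpd=$(mktemp -d)
dd if=/dev/zero of="$tmpd/big" bs=1M count=10 2>/dev/null

exec 3<"$tmpd/big"     # stand-in for "a process holding the file open"
rm "$tmpd/big"         # removes the directory entry only

# The inode is still alive and reachable through /proc
ls -l "/proc/$$/fd/3"  # the link target ends in "(deleted)"

exec 3<&-              # close the descriptor; NOW the blocks are freed
rmdir "$tmpd"
```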

# RIGHT — truncate the file in place
truncate -s 0 /var/log/application.log
# This zeros the file without removing it.
# The file descriptor stays valid, and the space is freed immediately.
# If the app opened the log in append mode (most loggers do), writes resume
# at the new end of file; otherwise the fd keeps its old offset and the file
# becomes sparse — either way, the disk space is reclaimed.
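To see why truncation plays nicely with an append-mode writer, here is a small sketch (the path is a placeholder):

```shell
exec 3>>/tmp/trunc-demo.log    # open in append mode, like most loggers
echo "before" >&3
truncate -s 0 /tmp/trunc-demo.log
echo "after" >&3               # O_APPEND writes land at the new end (offset 0)
cat /tmp/trunc-demo.log        # → after
exec 3>&-
rm /tmp/trunc-demo.log
```

The "before" line is gone, the "after" line landed cleanly at the start, and the writer never noticed anything happened.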

Or if the file has already been deleted but space isn't freed:

# Find deleted-but-open files still consuming space
lsof +L1
# → COMMAND  PID USER  FD  TYPE DEVICE  SIZE/OFF NLINK NODE NAME
# → java    5678 app   1w  REG  253,1  50465865728     0 1234 /var/log/application.log (deleted)

# Truncate through the file descriptor
: > /proc/5678/fd/1
# The colon (:) is a builtin no-op; redirecting its empty output truncates the file

War Story: A 47GB log file grew unchecked for 7 months on a 50GB root partition. Logrotate was configured, but the config had a path typo: it was rotating /var/log/app/*.log but the actual file was /var/log/application.log — one character difference. DEBUG logging was left enabled in production, tripling normal log growth from 200MB/day to ~600MB/day. When the disk filled, PostgreSQL couldn't write WAL files and went into recovery mode, Redis couldn't persist its AOF file and lost 20 minutes of sessions, and Nginx started returning 500s because error_log couldn't write.


Part 2: Why Logs Are the #1 Disk Killer

Logs fill disks more than anything else. Here's why, and how to stop it.

How Linux logging works

Application → stdout/stderr or file
        ↓
systemd-journald (binary journal, size-limited)
        ↓
rsyslog / syslog-ng (text files in /var/log/)
        ↓
logrotate (rotation, compression, cleanup)

Three common logging paths, three things that can go wrong:

| Path | What breaks | Symptom |
|---|---|---|
| App writes directly to file | No rotation, infinite growth | One massive log file |
| systemd journal | Journal not size-limited | `/var/log/journal/` grows forever |
| rsyslog | Facility misconfigured | Logs go to unexpected files |

Fixing the journal

# Check journal disk usage
journalctl --disk-usage
# → Archived and active journals take up 4.0G on disk.

# Trim to 500MB immediately
journalctl --vacuum-size=500M

# Trim anything older than 7 days
journalctl --vacuum-time=7d

# Set permanent limits in /etc/systemd/journald.conf
# SystemMaxUse=500M
# MaxRetentionSec=7d

Fixing logrotate

Logrotate is the standard tool for rotating log files. It runs daily via cron or systemd timer. When it fails, logs grow until the disk fills.

# Test logrotate configuration (dry run)
sudo logrotate -d /etc/logrotate.conf

# Force rotation right now
sudo logrotate -f /etc/logrotate.conf

# Check a specific config
cat /etc/logrotate.d/nginx

A good logrotate config:

/var/log/application.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
    copytruncate        # ← Critical for apps that keep the file open
    maxsize 500M        # ← Rotate even mid-day if file exceeds this
}

The critical options:

| Option | What it does | When to use |
|---|---|---|
| `copytruncate` | Copy the log, then truncate the original | App keeps the file descriptor open (most apps) |
| `create 644 root root` | Rename the old file, create a new empty one | App reopens the file on rotation (e.g. nginx with `postrotate`) |
| `compress` | gzip old logs | Almost always — saves ~90% space |
| `delaycompress` | Don't compress the most recent rotated file | When you need to grep recent logs quickly |
| `maxsize` | Rotate if the file exceeds this size, regardless of the daily schedule | Prevent runaway growth between rotations |
| `missingok` | Don't error if the file doesn't exist | Always — prevents logrotate from failing entirely |

Gotcha: The copytruncate vs create choice matters. If the app keeps its file descriptor open (most do), create doesn't work — the app keeps writing to the old, now-renamed file, and the new empty file stays empty. Use copytruncate for apps that don't reopen their log file, or use a postrotate script to signal the app (like kill -USR1 for nginx).

Gotcha: Logrotate configs in /etc/logrotate.d/ must match the actual log file paths exactly. A config for /var/log/app/*.log won't rotate /var/log/application.log. This typo is the #1 reason logrotate "stops working" — it's working fine, just on the wrong path.
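You can rehearse a stanza safely by pointing logrotate at a private config and state file, so it never touches /etc. A sketch with placeholder paths:

```shell
# A standalone stanza for a throwaway log (size-based so it triggers on demand)
cat > /tmp/demo-rotate.conf <<'EOF'
/tmp/demo-app.log {
    size 1k
    rotate 3
    copytruncate
    missingok
}
EOF

head -c 4096 /dev/zero > /tmp/demo-app.log    # 4KB, over the 1k threshold

# Dry run with a private state file — prints what WOULD happen, touches nothing
logrotate -d -s /tmp/demo-rotate.state /tmp/demo-rotate.conf

# Force it for real: the original is truncated and a .1 copy appears
logrotate -f -s /tmp/demo-rotate.state /tmp/demo-rotate.conf
ls -l /tmp/demo-app.log*
```

This is also the fastest way to catch the path-typo failure above: if the dry run doesn't mention your log file by name, the stanza isn't matching it.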


Part 3: The Inode Trap — "Disk Full" But It Isn't

Here's a scenario that makes people question reality:

df -h /var
# Filesystem  Size  Used  Avail  Use%  Mounted on
# /dev/sda3   20G   8G    12G    40%   /var
#                                 ↑ Plenty of space!

touch /var/test
# touch: cannot touch '/var/test': No space left on device
#                                   ↑ BUT IT SAYS THERE'S SPACE!

The disk has free blocks but no free inodes. Every file and directory on the filesystem uses one inode to store its metadata (permissions, timestamps, block pointers). When you run out of inodes, you can't create new files even with gigabytes of free space.
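Each directory entry is just a name pointing at an inode, and `stat` prints the inode number directly. A quick sketch (paths are placeholders) showing that a hard link reuses the same inode:

```shell
touch /tmp/inode-demo
stat -c 'inode=%i links=%h size=%s' /tmp/inode-demo

# A hard link is a second name for the SAME inode — it costs no new inode
ln /tmp/inode-demo /tmp/inode-demo.link
stat -c '%i' /tmp/inode-demo /tmp/inode-demo.link   # same number, twice

rm /tmp/inode-demo /tmp/inode-demo.link
```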

# Check inode usage
df -i /var
# Filesystem   Inodes  IUsed   IFree  IUse%  Mounted on
# /dev/sda3   1310720  1310720      0  100%   /var
#                                    ↑ Zero free inodes!

What eats inodes

Millions of tiny files. Common culprits:

| Cause | Location | How many files |
|---|---|---|
| Mail queue explosion | `/var/spool/postfix/deferred/` | Millions of 1KB queue files |
| PHP session files | `/var/lib/php/sessions/` | One per user session, never cleaned |
| Package manager cache | `/var/cache/apt/` or `/var/cache/yum/` | Thousands of temp files |
| Container overlay layers | `/var/lib/docker/overlay2/` | Whiteout files from deleted layers |
| Monitoring agents | Various | Per-metric files, per-check files |

# Find which directory has the most files
find / -xdev -printf '%h\n' 2>/dev/null | sort | uniq -c | sort -rn | head -20
# → 2800000 /var/spool/postfix/deferred
#   Ah ha — 2.8 million mail queue files

War Story: A mail relay accumulating 2.8 million deferred queue files exhausted all inodes on /var despite 70% disk space being free. Services couldn't write PID files, create temp files, or write logs. The fix: 40 minutes of find /var/spool/postfix/deferred -type f -delete (can't use rm * because the argument list is too long — there are too many files for the shell to expand the glob).

Gotcha: rm /path/* fails with "Argument list too long" when there are too many files. The shell tries to expand * into a list of all filenames, and the list exceeds the kernel's ARG_MAX limit (~2MB). Use find /path -type f -delete instead — it processes files one at a time without shell expansion.
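You can check the limit and reproduce the fix safely at small scale. A sketch with a placeholder directory:

```shell
# The kernel's combined size limit for argv + environment
getconf ARG_MAX          # commonly 2097152 (2MB) on Linux

# Simulate a crowded directory, then delete it the robust way
mkdir -p /tmp/many-files
( cd /tmp/many-files && for i in $(seq 1 5000); do : > "f$i"; done )

ls /tmp/many-files | wc -l             # → 5000
find /tmp/many-files -type f -delete   # no glob expansion, no ARG_MAX risk
rmdir /tmp/many-files
```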


Part 4: Docker Storage — The Hidden Disk Consumer

Docker can consume enormous amounts of disk space without you realizing it:

# See Docker's total disk usage
docker system df
# TYPE          TOTAL  ACTIVE  SIZE     RECLAIMABLE
# Images        45     5       12.5GB   9.8GB (78%)
# Containers    12     3       2.1GB    1.8GB (85%)
# Local Volumes 8      3       15.3GB   10.2GB (66%)
# Build Cache   -      -       3.4GB    3.4GB

# That's 33GB of Docker data, 25GB reclaimable

Where Docker puts things

| What | Path | Grows because |
|---|---|---|
| Images | `/var/lib/docker/overlay2/` | Old images not pruned |
| Container writable layers | `/var/lib/docker/overlay2/` | Containers write to the ephemeral layer |
| Volumes | `/var/lib/docker/volumes/` | Data persists after container deletion |
| Build cache | `/var/lib/docker/` | Layer caching from `docker build` |
| Container logs | `/var/lib/docker/containers/<id>/` | JSON log driver with no limit |

The sneakiest one: container logs. Docker's default JSON log driver has no size limit. A container writing to stdout fills /var/lib/docker/containers/<id>/<id>-json.log forever.

# Check container log sizes
find /var/lib/docker/containers -name "*.log" -exec ls -lh {} \; 2>/dev/null | sort -k5 -rh | head
# → -rw-r----- 1 root root 12G ... abc123-json.log

# Set log limits (per container)
docker run --log-opt max-size=100m --log-opt max-file=3 myimage

# Set log limits globally in /etc/docker/daemon.json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m",
    "max-file": "3"
  }
}

The nuclear cleanup

# Remove stopped containers, unused networks, dangling images, and build cache
docker system prune

# Also remove unused volumes (CAREFUL — this deletes data)
docker system prune --volumes

# Remove images older than 24 hours
docker image prune -a --filter "until=24h"

Gotcha: docker system prune --volumes deletes volumes that aren't attached to any container. If you stopped a database container but plan to restart it, the volume (with all your data) gets deleted. Always check docker volume ls before running prune with --volumes.

Under the Hood: Docker's overlay2 storage driver uses OverlayFS — a union filesystem that stacks read-only image layers with a writable container layer on top. When a container modifies a file from a lower layer, the file is copied up into the writable layer. This is why write-heavy workloads inside containers can be slow and consume unexpected disk space — use volumes for data-intensive operations instead.


Part 5: Kubernetes PVCs and Ephemeral Storage

In Kubernetes, disk problems show up differently:

Ephemeral storage eviction

Kubernetes monitors disk usage and evicts pods when the node runs low:

# Default eviction thresholds
# imagefs.available < 15%  → evict pods (container images/writable layers)
# nodefs.available < 10%   → evict pods (logs, emptyDir, etc.)
# nodefs.inodesFree < 5%   → evict pods

When a pod is evicted for disk pressure, it shows:

kubectl describe pod myapp
# → Status: Failed
# → Reason: Evicted
# → Message: The node was low on resource: ephemeral-storage

PVC troubleshooting

# PVC stuck in Pending
kubectl get pvc
# → NAME      STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS
# → data-pvc  Pending                                       standard

kubectl describe pvc data-pvc
# → Events:
# →   Warning  ProvisioningFailed  no persistent volumes available for this claim

Common PVC failures:

| Symptom | Cause | Fix |
|---|---|---|
| PVC Pending | No matching PV or StorageClass | Check `kubectl get sc`, create the missing StorageClass |
| PVC Pending | `WaitForFirstConsumer` | Normal — PV is provisioned when the pod is scheduled |
| Multi-attach error on RWO | Previous node still has the volume attached | Force-detach via cloud API, or delete the stale VolumeAttachment |
| Pod evicted | Ephemeral storage limit exceeded | Set `resources.limits.ephemeral-storage` |

Gotcha: Deleting a StatefulSet does NOT delete its PVCs. The PVCs and their data persist. This is by design (safety) but catches people who expect cleanup. You must manually delete PVCs after removing a StatefulSet.

Gotcha: RWO (ReadWriteOnce) means single node, not single pod. Multiple pods on the same node can mount an RWO volume simultaneously. This is not what most people expect.
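To guard against disk-pressure evictions, you can cap each container's ephemeral storage in the pod spec. A minimal sketch; the pod and image names are placeholders:

```shell
cat > /tmp/myapp-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  containers:
  - name: app
    image: myimage
    resources:
      requests:
        ephemeral-storage: "1Gi"   # considered during scheduling
      limits:
        ephemeral-storage: "2Gi"   # pod is evicted if logs + emptyDir +
                                   # writable layer exceed this
EOF
# kubectl apply -f /tmp/myapp-pod.yaml   # apply when a cluster is available
```

With a per-pod limit set, a runaway container gets evicted on its own well before it can push the whole node into DiskPressure.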


Part 6: Preventing Disk Emergencies

The emergency response gets you through today. Prevention keeps it from happening again.

Separate partitions

The single most effective prevention: put /var/log, /var/lib/docker, and /tmp on separate partitions (or LVM logical volumes). When logs fill /var/log, the root filesystem is unaffected — the system keeps running, and you can still SSH in to fix it.

Partition Layout (recommended)
/           20GB   — OS, should never fill
/var/log    50GB   — Logs, can fill without killing the OS
/var/lib/docker  100GB  — Docker data, isolated
/tmp        10GB   — Temp files, auto-cleared on boot
/home       remaining — User data

Monitoring and alerting

# Prometheus node_exporter provides these metrics:
# node_filesystem_avail_bytes    — free space
# node_filesystem_files_free     — free inodes
# node_filesystem_size_bytes     — total size

# Alert at 70% (warning) and 85% (critical)
# Also alert on GROWTH RATE — "disk will be full in 4 hours" is more useful
# than "disk is 86% full"
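The growth-rate idea can be sketched in plain shell by sampling free space twice. The mount point and interval below are placeholders; real monitoring would use node_exporter metrics with PromQL's `predict_linear()` instead:

```shell
MOUNT=/tmp       # placeholder: the filesystem to watch
INTERVAL=5       # placeholder: seconds between samples

avail1=$(df -P "$MOUNT" | awk 'NR==2 {print $4}')   # available KiB, sample 1
sleep "$INTERVAL"
avail2=$(df -P "$MOUNT" | awk 'NR==2 {print $4}')   # sample 2

rate=$(( (avail1 - avail2) / INTERVAL ))   # KiB consumed per second
if [ "$rate" -gt 0 ]; then
    echo "~$(( avail2 / rate / 3600 )) hours until full at the current rate"
else
    echo "usage is flat or shrinking"
fi
```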

LVM for flexibility

If your disk fills up and you have LVM, you can expand online without downtime:

# Add a new disk to the volume group
sudo pvcreate /dev/sdc
sudo vgextend vg_data /dev/sdc

# Extend the logical volume and resize the filesystem in one command
sudo lvextend -L +50G --resizefs /dev/vg_data/lv_appdata

Under the Hood: LVM (Logical Volume Manager) adds a layer between physical disks and filesystems. Physical Volumes (PVs) are pooled into Volume Groups (VGs), which are divided into Logical Volumes (LVs). The filesystem sits on the LV. This indirection is what lets you resize without downtime — the filesystem doesn't know or care that the underlying storage just grew. ext4 and XFS both support online grow; of the two, only ext4 supports shrink, and only while unmounted. XFS cannot shrink at all — plan capacity carefully.


The Complete Diagnostic Ladder

"No space left on device"
├── Is it disk blocks or inodes?
│   ├── df -h shows full → blocks (the usual case)
│   │   │
│   │   ├── du -sh /* → find the biggest directory
│   │   ├── find -size +100M → find the biggest files
│   │   ├── lsof +L1 → find deleted-but-open files
│   │   └── truncate -s 0 file → free space immediately
│   │
│   └── df -i shows full → inodes
│       ├── find / -xdev -printf '%h\n' | sort | uniq -c | sort -rn
│       └── find /path -type f -delete (NOT rm *)
├── Is it Docker?
│   ├── docker system df → how much is Docker using?
│   ├── Container logs → find /var/lib/docker/containers -name "*.log"
│   ├── Old images → docker image prune -a
│   └── Build cache → docker builder prune
├── Is it Kubernetes?
│   ├── kubectl describe node → DiskPressure condition?
│   ├── kubectl describe pod → Evicted for ephemeral-storage?
│   └── kubectl get pvc → PVC stuck or full?
└── Prevention
    ├── Separate partitions for /var/log, /var/lib/docker
    ├── Logrotate with maxsize + copytruncate
    ├── Docker log limits (max-size, max-file)
    ├── Journal limits (SystemMaxUse)
    └── Alerts at 70% (warning) and 85% (critical) + growth rate

Flashcard Check

Q1: You rm a 10GB log file but df -h shows no change. Why?

A process still has the file open. The inode and blocks stay allocated until the file descriptor is closed. Use lsof +L1 to find it, then truncate via /proc/PID/fd/N or restart the process.

Q2: df -h shows 40% used but touch says "No space left on device." What's wrong?

Inodes are exhausted. Check with df -i. Millions of tiny files consumed all inodes while leaving plenty of block space. Find and delete the tiny files.

Q3: Why use truncate -s 0 file instead of rm file?

truncate zeros the file in place — the file descriptor stays valid and space is freed immediately. rm removes the name but doesn't free space until all file descriptors are closed.

Q4: Docker container logs are 12GB. How do you prevent this?

Set log limits: --log-opt max-size=100m --log-opt max-file=3 per container, or globally in /etc/docker/daemon.json.

Q5: rm /path/* fails with "Argument list too long." What do you use instead?

find /path -type f -delete. The shell can't expand millions of filenames into *, but find processes files one at a time.

Q6: Can you shrink an XFS filesystem?

No. XFS can only grow, never shrink. ext4 can shrink but only while unmounted. Plan partition sizes carefully when using XFS.

Q7: Kubernetes evicts your pod for ephemeral-storage. What triggered it?

The pod's total ephemeral storage (logs + emptyDir + container writable layer) exceeded the node's eviction threshold (default: nodefs.available < 10%).

Q8: What's the difference between logrotate's copytruncate and create?

copytruncate: copies the log then truncates the original — works when the app keeps the file descriptor open. create: deletes old file and creates new one — only works if the app reopens the file (via postrotate signal).


Exercises

Exercise 1: The emergency drill (hands-on)

Simulate a full disk and practice the recovery:

# Create a big file to fill the disk (use a temp directory)
dd if=/dev/zero of=/tmp/bigfile bs=1M count=500

# Now practice the diagnostic sequence
df -h /tmp
du -sh /tmp/*
find /tmp -type f -size +100M -ls

Clean it up, then practice the lsof +L1 scenario:

# Start a process that holds a file open and writes to it slowly
# (the original one-liner used `write(...) or sleep(...)`, but write()
# returns a truthy byte count, so the sleep never ran and nothing flushed)
python3 -c "
import time
f = open('/tmp/testlog', 'w')
while True:
    f.write('x' * 1000 + '\n')
    f.flush()
    time.sleep(0.1)
" &

# Delete the file
rm /tmp/testlog

# Verify space is NOT freed (file still open)
lsof +L1 | grep testlog

# Free space through the file descriptor
: > /proc/$(pgrep -f testlog)/fd/3

# Kill the process
kill %1
What to observe After `rm`, `lsof +L1` shows the file as `(deleted)` with its size still reported. The `df` output won't change until you either truncate via `/proc/PID/fd/N` or kill the process. This is the most important disk debugging skill.

Exercise 2: Inode exhaustion (hands-on)

Create an inode exhaustion scenario and diagnose it:

# Create a directory with many tiny files
mkdir /tmp/inode-test
for i in $(seq 1 10000); do touch /tmp/inode-test/file$i; done

# Check inode usage
df -i /tmp

# Try the diagnostic command
find /tmp -xdev -printf '%h\n' | sort | uniq -c | sort -rn | head -5

# Clean up (note: rm * would fail with enough files)
find /tmp/inode-test -type f -delete
rmdir /tmp/inode-test

Exercise 3: Docker disk audit (hands-on)

If you have Docker:

# Check current Docker disk usage
docker system df

# Check container log sizes
sudo find /var/lib/docker/containers -name "*.log" -exec ls -lh {} \; 2>/dev/null

# Run a container that writes to stdout endlessly
docker run -d --name logspam alpine sh -c 'while true; do echo "$(date) spam"; sleep 0.1; done'

# Wait 30 seconds, check the log size
sleep 30
docker inspect --format='{{.LogPath}}' logspam | xargs ls -lh

# Now run the same with log limits
docker rm -f logspam
docker run -d --name logspam --log-opt max-size=1m --log-opt max-file=2 \
  alpine sh -c 'while true; do echo "$(date) spam"; sleep 0.1; done'

# Clean up
docker rm -f logspam

Exercise 4: Write a disk monitoring one-liner (bash)

Write a command that checks all mounted filesystems and prints any that are over 80% full, by blocks or by inodes.

Solution
# Blocks over 80%
df -h | awk 'NR>1 && +$5 > 80 {print "DISK:", $6, $5}'

# Inodes over 80%
df -i | awk 'NR>1 && +$5 > 80 {print "INODE:", $6, $5}'

# Combined
{ df -h | awk 'NR>1 && +$5>80 {print "BLOCK "$6": "$5}'; \
  df -i | awk 'NR>1 && +$5>80 {print "INODE "$6": "$5}'; }

Cheat Sheet

Emergency Response

| Step | Command | Purpose |
|---|---|---|
| What's full? | `df -h` | Block usage per filesystem |
| Inodes full? | `df -i` | Inode usage per filesystem |
| Biggest dirs | `du -sh /* \| sort -rh \| head` | Find top consumers |
| Biggest files | `find / -xdev -type f -size +100M -ls 2>/dev/null` | Find large files |
| Deleted but open | `lsof +L1` | Files deleted but still held open |
| Free space now | `truncate -s 0 /path/to/file` | Zero a file in place |
| Many tiny files | `find /path -type f -delete` | Delete when `rm *` fails |

Log Management

| Task | Command |
|---|---|
| Journal size | `journalctl --disk-usage` |
| Trim journal | `journalctl --vacuum-size=500M` |
| Test logrotate | `sudo logrotate -d /etc/logrotate.conf` |
| Force rotation | `sudo logrotate -f /etc/logrotate.conf` |
| Docker log size | `docker inspect --format='{{.LogPath}}' CONTAINER \| xargs ls -lh` |

Docker Cleanup

| Command | What it removes |
|---|---|
| `docker system prune` | Stopped containers, unused networks, dangling images, build cache |
| `docker image prune -a` | All unused images (not just dangling) |
| `docker volume prune` | Volumes not attached to any container (data loss risk) |
| `docker system prune --volumes` | Everything above, including volumes |

Filesystem Quick Reference

| Filesystem | Max file | Shrink? | Online grow? | Best for |
|---|---|---|---|---|
| ext4 | 16 TiB | Yes (offline) | Yes | General purpose |
| XFS | 8 EiB | No | Yes | Large files, databases |
| Btrfs | 16 EiB | Yes (online) | Yes | Snapshots, checksums |

Takeaways

  1. truncate, not rm. Deleting an open file doesn't free space. Truncating it does. lsof +L1 finds deleted-but-open files.

  2. Inodes can run out independently of disk space. df -i is the check everyone forgets. Millions of tiny files (mail queues, session files, temp files) consume inodes while leaving block space free.

  3. Docker logs have no size limit by default. Set max-size and max-file globally in /etc/docker/daemon.json or per container with --log-opt.

  4. Logrotate only works if the path matches. The #1 logrotate failure is a path typo. Test with logrotate -d (dry run).

  5. Separate partitions save lives. Put /var/log on its own partition so runaway logs can't fill the root filesystem and crash everything.

  6. XFS cannot shrink. Plan partition sizes carefully. If you might need to shrink later, use ext4 or LVM for flexibility.


Related Lessons

  • The Hanging Deploy — when processes and systemd cause problems
  • Permission Denied — when you can't write to disk for a different reason