Decision Tree: Disk Is Filling Up¶
Category: Incident Triage
Starting Question: "Disk usage is high or growing — what's consuming space?"
Estimated traversal: 2-5 minutes
Domains: linux-performance, kubernetes, postgresql
The Tree¶
Disk usage is high or growing — what's consuming space?
│
├── First: which machine / volume?
│ │
│ ├── Kubernetes node → SSH to node first, then follow below
│ │ `kubectl get node <node> -o wide` → get node IP
│ │ `ssh ubuntu@<node-ip>`
│ │
│ └── Database pod / persistent volume → exec into pod
│ `kubectl exec -it <db-pod> -- bash`
│
├── `df -h` — which filesystem is full or near full?
│ │
│ ├── / (root filesystem)
│ │ `du -sh /* 2>/dev/null | sort -rh | head -15`
│ │ │
│ │ ├── /var/log is largest
│ │ │ `du -sh /var/log/* | sort -rh | head -10`
│ │ │ │
│ │ │ ├── Specific log file huge (e.g., syslog, auth.log)
│ │ │ │ └── ✅ ACTION: Rotate / Truncate Logs
│ │ │ │
│ │ │ └── Many container logs in /var/log/containers/
│ │ │ `ls -lhS /var/log/containers/ | head -10`
│ │ │ └── ✅ ACTION: Configure Container Log Rotation / Reduce verbosity
│ │ │
│ │ ├── /tmp is largest
│ │ │ `ls -lhS /tmp/ | head -10`
│ │ │ └── Old temp files / uncleaned job artifacts
│ │ │ └── ✅ ACTION: Clean /tmp (check before deleting)
│ │ │
│ │ └── /home or /opt or /srv is largest
│ │ → application data growth — check below
│ │
│ ├── /var/lib/containerd or /var/lib/docker (container storage)
│ │ `du -sh /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots`
│ │ `crictl images | sort -k4 -rh | head -10`
│ │ │
│ │ ├── Many unused / old images
│ │ │ └── ✅ ACTION: Prune Unused Container Images
│ │ │
│ │ ├── Many stopped containers accumulating layers
│ │ │ `crictl ps -a --state Exited | wc -l`
│ │ │ └── ✅ ACTION: Remove Stopped Containers
│ │ │
│ │ └── Container overlay snapshots growing (active containers writing logs)
│ │ → configure container log size limits in kubelet config
│ │ └── ✅ ACTION: Set Container Log Rotation Limits
│ │
│ ├── /var/lib/postgresql or /data (database volume)
│ │ │
│ │ ├── Is the DB a PostgreSQL instance?
│ │ │ `du -sh /var/lib/postgresql/*/main/pg_wal/`
│ │ │ │
│ │ │ ├── pg_wal is large (WAL accumulation)
│ │ │ │ Check: `SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn))
│ │ │ │ FROM pg_stat_replication;`
│ │ │ │ │
│ │ │ │ ├── Replica lagging → WAL retained for replica
│ │ │ │ │ └── ✅ ACTION: Fix Replica Lag / Adjust wal_keep_size
│ │ │ │ │
│ │ │ │ └── No replicas? Check archiving: `SELECT last_archived_wal, last_failed_wal FROM pg_stat_archiver;`
│ │ │ │ └── Archive stuck → ✅ ACTION: Fix WAL Archive / Clear pg_wal
│ │ │ │
│ │ │ ├── Dead tuples / table bloat
│ │ │ │ `SELECT relname, n_dead_tup, pg_size_pretty(pg_total_relation_size(relid))
│ │ │ │ FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 10;`
│ │ │ │ └── High dead tuple count → ✅ ACTION: Run VACUUM / VACUUM FULL
│ │ │ │
│ │ │ └── Large tables growing (unbounded data)
│ │ │ `SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
│ │ │ FROM pg_stat_user_tables ORDER BY pg_total_relation_size(relid) DESC LIMIT 10;`
│ │ │ └── ✅ ACTION: Implement Data Retention / Partitioning
│ │ │
│ │ └── Is it another database (MySQL, MongoDB, etc.)?
│ │ Check data dir for binary logs / oplog / slow query log size
│ │ → adapt queries above to the DB engine
│ │
│ └── /var/lib/kubelet (kubelet data directory)
│ `du -sh /var/lib/kubelet/pods/*/volumes/ | sort -rh | head -10`
│ │
│ ├── hostPath or emptyDir volumes accumulating data
│ │ → identify the pod by the UUID path component
│ │ `kubectl get pods -A -o custom-columns=UID:.metadata.uid,NS:.metadata.namespace,NAME:.metadata.name | grep <uuid>`
│ │ └── ✅ ACTION: Fix Pod Volume / Clean Up Orphaned Volumes
│ │
│ └── Orphaned pod directories (pod deleted but dir remains)
│ `kubectl get pods -A -o custom-columns=UID:.metadata.uid --no-headers > /tmp/pod-uids`
│ `ls /var/lib/kubelet/pods/ | while read id; do
│ grep -q "$id" /tmp/pod-uids || echo "orphan: $id"; done`
│ └── ✅ ACTION: Remove Orphaned Pod Volume Directories
│
└── Check inodes too! `df -i`
│
├── Inode usage >90% on any filesystem?
│ `find /var/log -maxdepth 3 -type f | wc -l`
│ `find /tmp -maxdepth 3 -type f | wc -l`
│ │
│ └── Thousands of tiny files → ✅ ACTION: Clean Up Inode-Consuming Files
│
└── Inodes fine → block usage is the issue (already addressed above)
Node Details¶
Check 1: Initial filesystem survey¶
Command: df -h && df -i — run both together. Block usage and inode usage require separate checks.
What you're looking for: Any filesystem at >85% block usage or >80% inode usage. Pay attention to the "Mounted on" column to understand which filesystem to investigate.
Common pitfall: Kubernetes nodes often mount /var/lib/containerd on a separate partition, so it won't appear under the / line of df -h. Check every mount, not just root.
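The survey above can be scripted. A small sketch (`flag_full` is a helper name introduced here, not a standard tool) that flags any filesystem at or above a use% threshold from `df -P` output:

```shell
# flag_full: read `df -P` output on stdin and print every filesystem at or
# above a use% threshold (default 85). POSIX `df -P` fixes the column
# layout: use% is field 5, mount point is field 6.
flag_full() {
  awk -v t="${1:-85}" 'NR > 1 { gsub(/%/, "", $5); if ($5 + 0 >= t) print $6, $5 "%" }'
}

# Usage on a node — run both the block and the inode check:
# df -P  | flag_full 85
# df -Pi | flag_full 80
```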
Check 2: Finding large directories¶
Command: du -sh /* 2>/dev/null | sort -rh | head -15 then drill down into the largest directory. Use du -sh /var/log/* | sort -rh | head -10 to go one level deeper.
What you're looking for: Any single directory or file consuming unexpectedly large space. A healthy node's top consumers should be predictable (OS, containerd, kubelet).
Common pitfall: du can take 30-60 seconds on large filesystems. If the node is in DiskPressure, this delay matters — start with df -h to identify the specific mount before running du.
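One way around the slow full-tree scan: descend one level at a time, always into the largest entry. A sketch (`drill_down` is a hypothetical helper, not a standard command):

```shell
# drill_down <dir>: `du` one level, step into the largest subdirectory,
# repeat — printing each hop. Stops when the largest entry is a file or the
# directory is empty. Much cheaper than one full-tree `du` under pressure.
drill_down() {
  dir=$1
  while :; do
    biggest=$(du -s "$dir"/* 2>/dev/null | sort -rn | head -1 | cut -f2)
    [ -n "$biggest" ] && [ -d "$biggest" ] || break
    dir=$biggest
    du -sh "$dir"
  done
}

# Usage: start at whichever mount `df -h` flagged, e.g.
# drill_down /var
```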
Check 3: PostgreSQL WAL directory¶
Command: du -sh $PGDATA/pg_wal/ and: SELECT count(*), pg_size_pretty(sum(size)) FROM pg_ls_waldir(); (requires superuser, PG 10+).
What you're looking for: WAL directory larger than ~1GB suggests either a lagging replica retaining WAL segments, a stuck WAL archive process, or wal_keep_size set too high.
Common pitfall: Never manually delete files from pg_wal/ — this corrupts the database. Use pg_archivecleanup or fix the root cause (replica lag / archive process) instead.
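A safe, read-only check for the stuck-archive case: PostgreSQL drops a `.ready` marker in `pg_wal/archive_status/` for each segment `archive_command` has not yet confirmed, so counting those markers shows the backlog. A sketch (`pending_wal` is a hypothetical helper; the data-directory path varies by distro and version):

```shell
# pending_wal <pg_wal-dir>: count WAL segments waiting on the archiver.
# Each .ready file in archive_status/ is one unarchived segment; a steadily
# growing count means archive_command is failing and WAL will accumulate.
pending_wal() {
  find "$1/archive_status" -maxdepth 1 -name '*.ready' 2>/dev/null | wc -l
}

# Usage (adjust the path to your PGDATA):
# pending_wal /var/lib/postgresql/16/main/pg_wal
```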
Check 4: PostgreSQL dead tuples / bloat¶
Command: SELECT relname, n_dead_tup, n_live_tup, last_vacuum, last_autovacuum FROM pg_stat_user_tables WHERE n_dead_tup > 10000 ORDER BY n_dead_tup DESC;
What you're looking for: Tables with n_dead_tup much larger than n_live_tup — these have significant bloat. Also check if last_autovacuum is recent (within hours for active tables).
Common pitfall: autovacuum may be disabled or its cost_delay may be too high, causing bloat to accumulate. Check: SHOW autovacuum_vacuum_cost_delay;
Check 5: Container images¶
Command: crictl images (containerd) or docker images --format "table {{.Repository}}\t{{.Tag}}\t{{.Size}}" — sorted by size.
What you're looking for: Multiple versions of large base images (node, python, ubuntu) that are no longer in use by any pod. Also look for images with <none> tag (dangling layers).
Common pitfall: crictl rmi --prune removes only images not referenced by any existing container, so it will not break running pods. The catch comes later: a pod that needs a pruned image must re-pull it, and will sit in ImagePullBackOff if the registry is unreachable at that moment.
Check 6: Container log size limits¶
Command: Check current kubelet config: sudo cat /var/lib/kubelet/config.yaml | grep -i log. Also: sudo ls -lhS /var/log/containers/ | head -5.
What you're looking for: Kubelet containerLogMaxSize (default 10Mi) and containerLogMaxFiles (default 5). If not set, a single verbose container can fill the disk.
Common pitfall: Setting containerLogMaxSize to a very small value causes logs to rotate so fast that kubectl logs returns almost no history. A good default is 50Mi with 3 files.
Terminal Actions¶
Action: Rotate / Truncate Logs¶
Do:
1. View largest logs: sudo ls -lhS /var/log/*.log | head -10
2. Force rotation: sudo logrotate -f /etc/logrotate.conf
3. For journald: sudo journalctl --vacuum-size=500M && sudo journalctl --vacuum-time=7d
4. If a specific log file must be truncated (and you've reviewed it): sudo truncate -s 0 /var/log/syslog
Verify: df -h shows free space recovered. sudo logrotate -d /etc/logrotate.conf runs without error.
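Why step 4 uses `truncate` rather than `rm`: deleting an open log leaves its blocks allocated until every writer closes the file, while truncation frees the space immediately and keeps the path valid for the writer. A self-contained demo (fd 3 stands in for a daemon appending to the log):

```shell
# Demo: truncating an open log frees its space in place; `rm` would free
# nothing until the writer closed the file.
tmp=$(mktemp -d)
log="$tmp/app.log"
head -c 1048576 /dev/zero > "$log"   # 1 MiB of existing "log" data
exec 3>>"$log"                       # fd 3 stands in for a daemon appending
truncate -s 0 "$log"                 # size drops to 0 immediately
echo "still writable" >&3            # the writer carries on, now at offset 0
wc -c < "$log"                       # → 15: only the post-truncate write remains
exec 3>&-
rm -rf "$tmp"
```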
Action: Prune Unused Container Images¶
Do:
1. Dry run (containerd): sudo crictl images | grep -v REPOSITORY — identify candidates
2. Prune: sudo crictl rmi --prune
3. For Docker: sudo docker system prune -a --volumes (be careful with --volumes on stateful nodes)
4. Check space freed: df -h /var/lib/containerd
Verify: df -h shows free space. kubectl get node <name> no longer shows DiskPressure.
Action: Remove Stopped Containers¶
Do:
1. List stopped containers: sudo crictl ps -a --state Exited
2. Remove all: sudo crictl rm $(sudo crictl ps -a -q --state Exited) 2>/dev/null || true
3. Verify: sudo crictl ps -a | grep Exited | wc -l — should be 0
Verify: Space freed. No impact to running containers.
Action: Set Container Log Rotation Limits¶
Do:
1. Edit kubelet config: sudo vim /var/lib/kubelet/config.yaml
2. Add or update: containerLogMaxSize: "50Mi" and containerLogMaxFiles: 3
3. Restart kubelet: sudo systemctl restart kubelet
4. Existing large logs (rotation only caps future growth): sudo truncate -s 0 /var/log/containers/<large-log-file>.log
Verify: sudo ls -lhS /var/log/containers/ | head shows no files >50Mi. kubectl logs still works.
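For reference, the resulting fragment of /var/lib/kubelet/config.yaml (values are the ones suggested above; the rest of the file stays as-is):

```yaml
# /var/lib/kubelet/config.yaml — log rotation fragment
kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
containerLogMaxSize: "50Mi"
containerLogMaxFiles: 3
```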
Action: Run VACUUM / VACUUM FULL¶
Do:
1. For routine cleanup (online, safe): VACUUM ANALYZE <table_name>;
2. For full space reclamation (requires exclusive lock, takes table offline): VACUUM FULL <table_name>;
3. Check progress: SELECT phase, heap_blks_total, heap_blks_scanned FROM pg_stat_progress_vacuum; (VACUUM FULL reports in pg_stat_progress_cluster instead, PG 12+)
4. Post-vacuum: SELECT pg_size_pretty(pg_total_relation_size('<table>')); — confirm size reduced
Verify: n_dead_tup in pg_stat_user_tables drops to near zero. Table size decreases (VACUUM FULL only).
Action: Fix Replica Lag / Adjust wal_keep_size¶
Do:
1. Check replica lag: SELECT client_addr, sent_lsn, replay_lsn, pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag FROM pg_stat_replication;
2. If replica is far behind, investigate replica health (network, disk, CPU)
3. Temporary fix to allow WAL cleanup: SELECT pg_drop_replication_slot('<slot_name>'); (only if replica is permanently gone)
4. Adjust: ALTER SYSTEM SET wal_keep_size = '512MB'; SELECT pg_reload_conf();
Verify: du -sh $PGDATA/pg_wal/ decreases after next checkpoint.
Action: Implement Data Retention / Partitioning¶
Do:
1. Identify oldest data you can delete: SELECT min(created_at) FROM <table>;
2. Delete in batches to avoid long locks — PostgreSQL's DELETE has no LIMIT, so go through ctid: DELETE FROM <table> WHERE ctid IN (SELECT ctid FROM <table> WHERE created_at < now() - interval '90 days' LIMIT 10000); — repeat until zero rows are affected
3. Long-term: implement table partitioning by date and use DROP TABLE partition_name instead of DELETE
4. Add a retention job (cron or pg_cron) to run cleanup automatically
Verify: Table size decreases. Autovacuum processes the freed pages. df -h shows free space.
Action: Clean Up Inode-Consuming Files¶
Do:
1. Find directories with most files: find / -xdev -type f | sed 's/\/[^\/]*$//' | sort | uniq -c | sort -rn | head -20
2. Common culprits: /var/log/journal/ (journald), /tmp/, application temp directories
3. For journald: sudo journalctl --vacuum-files=5
4. For tmp: sudo find /tmp -maxdepth 1 -mtime +7 -delete
Verify: df -i shows inode usage below 80%. No new file creation errors.
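Step 1's full-filesystem scan can be scoped once `df -i` points at a mount. A sketch (`files_per_dir` is a hypothetical helper):

```shell
# files_per_dir <root>: directories under <root> ranked by file count,
# staying on one filesystem (-xdev). Same idea as the full scan in step 1,
# scoped to the mount that `df -i` flagged.
files_per_dir() {
  find "$1" -xdev -type f 2>/dev/null \
    | sed 's|/[^/]*$||' \
    | sort | uniq -c | sort -rn | head -20
}

# Usage:
# files_per_dir /var/log
```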
Action: Clean /tmp (check before deleting)¶
Do:
1. ls -lhS /tmp/ | head -20 — identify largest files
2. stat /tmp/<file> — check modification time and owner
3. Delete old files: sudo find /tmp -maxdepth 1 -mtime +1 -delete
4. For application temp dirs: notify app owner before deleting
Verify: df -h /tmp shows free space. Application still functions.
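A dry-run variant of step 3, so the candidates can be reviewed before anything is deleted (`stale` is a hypothetical helper; note it restricts itself to regular files):

```shell
# stale <dir>: list top-level regular files not modified in over a day —
# the same selection step 3 deletes, printed for review first.
stale() {
  find "$1" -maxdepth 1 -type f -mtime +1
}

# Review, then delete the same selection:
# stale /tmp
# sudo find /tmp -maxdepth 1 -type f -mtime +1 -delete
```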
Edge Cases¶
- Disk fills suddenly in minutes: A process is actively writing (log flooding, runaway query result set, core dump). Use `lsof +D /var/lib/containerd` or `inotifywait -m /var/log` to catch it live.
- Disk usage high but `du` finds nothing: An open file handle is keeping a deleted file's disk blocks allocated. `sudo lsof | grep deleted | sort -k7 -rh | head -10`. Solution: restart the process holding the file handle.
- PostgreSQL disk full with no large tables: Check pg_wal, temporary sort files (`base/pgsql_tmp/`), and crash recovery files. A long-running sort/hash query can fill `pgsql_tmp` inside the Postgres data dir.
- Node DiskPressure after container image build: CI/CD pipelines that build images on Kubernetes nodes generate large intermediate layers. Use a dedicated build node or remote builder instead.
- PVC shows high usage but pod can't write: PVC may be ReadOnlyMany, or the pod may be writing to a different path than the mount point. Check `kubectl exec -it <pod> -- df -h` inside the pod.
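The "du finds nothing" case can be reproduced safely to see why the space seems to vanish (fd 3 stands in for the process holding the log):

```shell
# Demo: a deleted file's blocks stay allocated while any process holds an
# open descriptor to it — invisible to ls/du, shown by lsof as (deleted).
tmp=$(mktemp -d)
head -c 1048576 /dev/zero > "$tmp/ghost.log"
exec 3<"$tmp/ghost.log"      # a process still has the file open
rm "$tmp/ghost.log"          # gone from the directory, blocks not freed
ls -A "$tmp" | wc -l         # → 0: du over this dir now finds nothing
head -c 4 <&3 | wc -c        # → 4: the data is still readable via fd 3
exec 3<&-                    # closing the last fd finally releases the blocks
rm -rf "$tmp"
```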
Cross-References¶
- Topic Packs: linux-performance, linux-ops-storage, postgresql, k8s-storage, k8s-ops
- Runbooks: pod_eviction.md, node_not_ready.md