
Decision Tree: Disk Is Filling Up

Category: Incident Triage
Starting Question: "Disk usage is high or growing — what's consuming space?"
Estimated traversal: 2-5 minutes
Domains: linux-performance, kubernetes, postgresql


The Tree

Disk usage is high or growing → what's consuming space?
├── First: which machine / volume?
│   ├── Kubernetes node → SSH to the node first, then follow below
│   │   `kubectl get node <node> -o wide` → get the node IP
│   │   `ssh ubuntu@<node-ip>`
│   └── Database pod / persistent volume → exec into the pod
│       `kubectl exec -it <db-pod> -- bash`
├── `df -h` → which filesystem is full or near full?
│   ├── / (root filesystem)
│   │   `du -sh /* 2>/dev/null | sort -rh | head -15`
│   │   ├── /var/log is largest
│   │   │   `du -sh /var/log/* | sort -rh | head -10`
│   │   │   ├── Specific log file huge (e.g., syslog, auth.log)
│   │   │   │   └── → ACTION: Rotate / Truncate Logs
│   │   │   └── Many container logs in /var/log/containers/
│   │   │       `ls -lhS /var/log/containers/ | head -10`
│   │   │       └── → ACTION: Set Container Log Rotation Limits / Reduce verbosity
│   │   ├── /tmp is largest
│   │   │   `ls -lhS /tmp/ | head -10`
│   │   │   └── Old temp files / uncleaned job artifacts
│   │   │       └── → ACTION: Clean /tmp (check before deleting)
│   │   └── /home or /opt or /srv is largest
│   │       → application data growth; see the database branch below
│   ├── /var/lib/containerd or /var/lib/docker (container storage)
│   │   `du -sh /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots`
│   │   `crictl images | sort -k4 -rh | head -10`
│   │   ├── Many unused / old images
│   │   │   └── → ACTION: Prune Unused Container Images
│   │   ├── Many stopped containers accumulating layers
│   │   │   `crictl ps -a --state Exited | wc -l`
│   │   │   └── → ACTION: Remove Stopped Containers
│   │   └── Container overlay snapshots growing (active containers writing logs)
│   │       → configure container log size limits in the kubelet config
│   │       └── → ACTION: Set Container Log Rotation Limits
│   ├── /var/lib/postgresql or /data (database volume)
│   │   ├── Is the DB a PostgreSQL instance?
│   │   │   `du -sh /var/lib/postgresql/*/main/pg_wal/`
│   │   │   ├── pg_wal is large (WAL accumulation)
│   │   │   │   Check: `SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn))
│   │   │   │           FROM pg_stat_replication;`
│   │   │   │   ├── Replica lagging → WAL retained for the replica
│   │   │   │   │   └── → ACTION: Fix Replica Lag / Adjust wal_keep_size
│   │   │   │   └── No replicas? Check archiving: `SELECT * FROM pg_stat_archiver;`
│   │   │   │       └── Archive stuck → ACTION: Fix WAL Archive / Clear pg_wal
│   │   │   ├── Dead tuples / table bloat
│   │   │   │   `SELECT relname, n_dead_tup, pg_size_pretty(pg_total_relation_size(oid))
│   │   │   │    FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 10;`
│   │   │   │   └── High dead tuple count → ACTION: Run VACUUM / VACUUM FULL
│   │   │   └── Large tables growing (unbounded data)
│   │   │       `SELECT relname, pg_size_pretty(pg_total_relation_size(oid))
│   │   │        FROM pg_stat_user_tables ORDER BY pg_total_relation_size(oid) DESC LIMIT 10;`
│   │   │       └── → ACTION: Implement Data Retention / Partitioning
│   │   └── Is it another database (MySQL, MongoDB, etc.)?
│   │       Check the data dir for binary logs / oplog / slow query log size
│   │       → adapt the queries above to the DB engine
│   └── /var/lib/kubelet (kubelet data directory)
│       `du -sh /var/lib/kubelet/pods/*/volumes/ | sort -rh | head -10`
│       ├── hostPath or emptyDir volumes accumulating data
│       │   → identify the pod by the UUID path component
│       │   `kubectl get pod --all-namespaces -o yaml | grep -B5 <uuid>`
│       │   └── → ACTION: Fix Pod Volume / Clean Up Orphaned Volumes
│       └── Orphaned pod directories (pod deleted but dir remains)
│           `ls /var/lib/kubelet/pods/ | while read id; do
│              kubectl get pod --all-namespaces -o yaml | grep -q $id || echo "orphan: $id"; done`
│           └── → ACTION: Remove Orphaned Pod Volume Directories
└── Check inodes too! `df -i`
    ├── Inode usage >90% on any filesystem?
    │   `find /var/log -maxdepth 3 -type f | wc -l`
    │   `find /tmp -maxdepth 3 -type f | wc -l`
    │   └── Thousands of tiny files → ACTION: Clean Up Inode-Consuming Files
    └── Inodes fine → block usage is the issue (already addressed above)
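The orphaned-directory check near the bottom of the tree re-queries the API server once per directory, which is slow on busy nodes. It can be done in a single pass; a sketch, with the `orphan_dirs` helper name being illustrative rather than anything from this runbook:

```shell
# Given live pod UIDs on stdin (one per line), print kubelet pod
# directories that no longer correspond to any live pod.
orphan_dirs() {
  known=$(cat)
  for id in $(ls "${1:-/var/lib/kubelet/pods}"); do
    printf '%s\n' "$known" | grep -qx "$id" || echo "orphan: $id"
  done
}

# Usage — fetch all live pod UIDs once, then compare locally:
# kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.uid}{"\n"}{end}' \
#   | orphan_dirs /var/lib/kubelet/pods
```

Review the output by hand before removing anything — a UID may belong to a pod that is mid-creation.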

Node Details

Check 1: Initial filesystem survey

Command: `df -h && df -i` — run both together; block usage and inode usage require separate checks.
What you're looking for: any filesystem at >85% block usage or >80% inode usage. Pay attention to the "Mounted on" column to know which filesystem to investigate.
Common pitfall: Kubernetes nodes typically put container storage on a separate partition, so `df -h /` won't show /var/lib/containerd filling up. Check all mounts, not just root.
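The threshold check can be scripted for a quick first pass. A minimal sketch — the `df_over` name and the 85% default are illustrative, not part of this runbook:

```shell
# Print any mount whose usage percentage exceeds a threshold.
# Reads POSIX `df -P` output on stdin; threshold percent as $1 (default 85).
df_over() {
  awk -v t="${1:-85}" 'NR > 1 { gsub(/%/, "", $5); if ($5 + 0 > t) print $6, $5 "%" }'
}

# Usage: df -P | df_over 85
# Inodes too (column layout matches): df -iP | df_over 80
```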

Check 2: Finding large directories

Command: `du -sh /* 2>/dev/null | sort -rh | head -15`, then drill down into the largest directory. Use `du -sh /var/log/* | sort -rh | head -10` to go one level deeper.
What you're looking for: any single directory or file consuming unexpectedly large space. A healthy node's top consumers should be predictable (OS, containerd, kubelet).
Common pitfall: `du` can take 30-60 seconds on large filesystems. If the node is in DiskPressure, this delay matters — start with `df -h` to identify the specific mount before running `du`.
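The drill-down step repeats the same pipeline at each level, so it is worth wrapping. A sketch — the `largest` helper name is illustrative:

```shell
# List the largest immediate children of a directory, biggest first.
# $1 = directory (default: current dir), $2 = entries to show (default 10).
largest() {
  du -sh "${1:-.}"/* 2>/dev/null | sort -rh | head -"${2:-10}"
}

# Usage: largest / 15, then drill down: largest /var/log
```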

Check 3: PostgreSQL WAL directory

Command: `du -sh $PGDATA/pg_wal/` and: `SELECT count(*), pg_size_pretty(sum(size)) FROM pg_ls_waldir();` (requires superuser, PG 10+).
What you're looking for: a WAL directory larger than ~1GB suggests either a lagging replica retaining WAL segments, a stuck WAL archive process, or `wal_keep_size` set too high.
Common pitfall: never manually delete files from pg_wal/ — this corrupts the database. Use `pg_archivecleanup` or fix the root cause (replica lag / archive process) instead.
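If you only have shell access (no superuser SQL), the WAL size can be estimated from the segment count. A sketch — the `wal_estimate` name is illustrative, and it assumes the default 16MB `wal_segment_size` (confirm with `SHOW wal_segment_size;` if in doubt):

```shell
# Estimate pg_wal size from segment count without running SQL.
# WAL segment files are named with 24 hex characters; assumes the
# default 16MB wal_segment_size.
wal_estimate() {
  count=$(ls "$1" 2>/dev/null | grep -Ec '^[0-9A-F]{24}$')
  echo "${count} segments, ~$((count * 16))MB"
}

# Usage: wal_estimate /var/lib/postgresql/14/main/pg_wal
```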

Check 4: PostgreSQL dead tuples / bloat

Command: `SELECT relname, n_dead_tup, n_live_tup, last_vacuum, last_autovacuum FROM pg_stat_user_tables WHERE n_dead_tup > 10000 ORDER BY n_dead_tup DESC;`
What you're looking for: tables with n_dead_tup much larger than n_live_tup — these carry significant bloat. Also check that last_autovacuum is recent (within hours for active tables).
Common pitfall: autovacuum may be disabled, or its cost delay may be too high, causing bloat to accumulate. Check: `SHOW autovacuum_vacuum_cost_delay;`

Check 5: Container images

Command: `crictl images` (containerd) or `docker images --format "table {{.Repository}}\t{{.Tag}}\t{{.Size}}"` — sorted by size.
What you're looking for: multiple versions of large base images (node, python, ubuntu) no longer in use by any pod. Also look for images with a <none> tag (dangling layers).
Common pitfall: `crictl rmi --prune` only removes images not referenced by any container spec — it is safe. However, if you remove an image that a pod spec references and the registry is unreachable, the pod will fail with ImagePullBackOff.

Check 6: Container log size limits

Command: check the current kubelet config: `sudo cat /var/lib/kubelet/config.yaml | grep -i log`. Also: `sudo ls -lhS /var/log/containers/ | head -5`.
What you're looking for: the kubelet settings containerLogMaxSize (default 10Mi) and containerLogMaxFiles (default 5). If rotation is not effectively configured, a single verbose container can fill the disk.
Common pitfall: setting containerLogMaxSize very small makes logs rotate so fast that `kubectl logs` returns almost no history. A good default is 50Mi with 3 files.


Terminal Actions

Action: Rotate / Truncate Logs

Do:
1. View the largest logs: `sudo ls -lhS /var/log/*.log | head -10`
2. Force rotation: `sudo logrotate -f /etc/logrotate.conf`
3. For journald: `sudo journalctl --vacuum-size=500M && sudo journalctl --vacuum-time=7d`
4. If a specific log file must be truncated (and you've reviewed it): `sudo truncate -s 0 /var/log/syslog`
Verify: `df -h` shows the space recovered; `sudo logrotate -d /etc/logrotate.conf` (dry run) completes without error.

Action: Prune Unused Container Images

Do:
1. Dry run (containerd): `sudo crictl images | grep -v REPOSITORY` — identify candidates
2. Prune: `sudo crictl rmi --prune`
3. For Docker: `sudo docker system prune -a --volumes` (be careful with --volumes on stateful nodes)
4. Check the space freed: `df -h /var/lib/containerd`
Verify: `df -h` shows free space; `kubectl get node <name>` no longer shows DiskPressure.

Action: Remove Stopped Containers

Do:
1. List stopped containers: `sudo crictl ps -a --state Exited`
2. Remove them all: `sudo crictl rm $(sudo crictl ps -a -q --state Exited) 2>/dev/null || true`
3. Confirm: `sudo crictl ps -a | grep Exited | wc -l` — should be 0
Verify: space freed; no impact on running containers.

Action: Set Container Log Rotation Limits

Do:
1. Edit the kubelet config: `sudo vim /var/lib/kubelet/config.yaml`
2. Add or update: `containerLogMaxSize: "50Mi"` and `containerLogMaxFiles: 3`
3. Restart the kubelet: `sudo systemctl restart kubelet`
4. Existing large logs: `sudo truncate -s 100M /var/log/containers/<large-log-file>.log`
Verify: `sudo ls -lhS /var/log/containers/ | head` shows no files over 50Mi; `kubectl logs` still works.
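For reference, the relevant stanza of the kubelet config looks like this — a fragment only, assuming the stock config file location and the suggested values from the steps above:

```yaml
# /var/lib/kubelet/config.yaml (KubeletConfiguration)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
containerLogMaxSize: "50Mi"   # per-container log file cap (kubelet default: 10Mi)
containerLogMaxFiles: 3       # rotated files kept per container (kubelet default: 5)
```

On clusters managed by kubeadm or a cloud provider, this file may be regenerated on upgrade — set the values through the cluster's kubelet configuration mechanism where one exists.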

Action: Run VACUUM / VACUUM FULL

Do:
1. For routine cleanup (online, safe): `VACUUM ANALYZE <table_name>;`
2. For full space reclamation (requires an exclusive lock, takes the table offline): `VACUUM FULL <table_name>;`
3. Check progress: `SELECT phase, heap_blks_total, heap_blks_scanned FROM pg_stat_progress_vacuum;` (plain VACUUM; VACUUM FULL reports via pg_stat_progress_cluster on PG 12+)
4. Post-vacuum: `SELECT pg_size_pretty(pg_total_relation_size('<table>'));` — confirm the size is reduced
Verify: n_dead_tup in pg_stat_user_tables drops to near zero; the table size decreases (VACUUM FULL only).

Action: Fix Replica Lag / Adjust wal_keep_size

Do:
1. Check replica lag: `SELECT client_addr, sent_lsn, replay_lsn, pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag FROM pg_stat_replication;`
2. If a replica is far behind, investigate the replica's health (network, disk, CPU)
3. Temporary fix to allow WAL cleanup: `SELECT pg_drop_replication_slot('<slot_name>');` (only if the replica is permanently gone)
4. Adjust: `ALTER SYSTEM SET wal_keep_size = '512MB'; SELECT pg_reload_conf();`
Verify: `du -sh $PGDATA/pg_wal/` decreases after the next checkpoint.

Action: Implement Data Retention / Partitioning

Do:
1. Identify the oldest data you can delete: `SELECT min(created_at) FROM <table>;`
2. Delete in batches to avoid long locks — PostgreSQL's DELETE has no LIMIT clause, so batch through ctid: `DELETE FROM <table> WHERE ctid IN (SELECT ctid FROM <table> WHERE created_at < now() - interval '90 days' LIMIT 10000);` — repeat until no rows are affected
3. Long-term: implement table partitioning by date and use `DROP TABLE <partition_name>` instead of DELETE
4. Add a retention job (cron or pg_cron) to run the cleanup automatically
Verify: the table stops growing and autovacuum processes the freed pages. Note that plain DELETE returns space to the table for reuse, not to the OS — `df -h` improves only after VACUUM FULL or dropped partitions.

Action: Clean Up Inode-Consuming Files

Do:
1. Find the directories with the most files: `find / -xdev -type f | sed 's/\/[^\/]*$//' | sort | uniq -c | sort -rn | head -20`
2. Common culprits: /var/log/journal/ (journald), /tmp/, application temp directories
3. For journald: `sudo journalctl --vacuum-files=5`
4. For /tmp: `sudo find /tmp -maxdepth 1 -mtime +7 -delete`
Verify: `df -i` shows inode usage below 80%; no new file-creation errors.
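The pipeline in step 1 is dense enough to be worth wrapping in a function. A sketch — the `files_per_dir` name is illustrative:

```shell
# Count regular files per directory under a root, most crowded first.
# $1 = root to scan (default /), $2 = entries to show (default 20).
files_per_dir() {
  find "${1:-/}" -xdev -type f 2>/dev/null \
    | sed 's:/[^/]*$::' \
    | sort | uniq -c | sort -rn | head -"${2:-20}"
}

# Usage: files_per_dir /var/log
```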

Action: Clean /tmp (check before deleting)

Do:
1. `ls -lhS /tmp/ | head -20` — identify the largest files
2. `stat /tmp/<file>` — check the modification time and owner
3. Delete old files: `sudo find /tmp -maxdepth 1 -mtime +1 -delete`
4. For application temp dirs: notify the app owner before deleting
Verify: `df -h /tmp` shows free space; the application still functions.


Edge Cases

  • Disk fills suddenly in minutes: A process is actively writing (log flooding, runaway query result set, core dump). Use lsof +D /var/lib/containerd or inotifywait -m /var/log to catch it live.
  • Disk usage high but du finds nothing: An open file handle is keeping a deleted file's disk blocks allocated. sudo lsof | grep deleted | sort -k7 -rh | head -10. Solution: restart the process holding the file handle.
  • PostgreSQL disk full with no large tables: Check pg_wal, temporary sort files (base/pgsql_tmp/ inside the data directory), and crash recovery files. A long-running sort/hash query can fill the temp space inside the Postgres data dir.
  • Node DiskPressure after container image build: CI/CD pipelines that build images on Kubernetes nodes generate large intermediate layers. Use a dedicated build node or remote builder instead.
  • PVC shows high usage but pod can't write: PVC may be ReadOnlyMany or the pod may be writing to a different path than the mount point. Check kubectl exec -it <pod> -- df -h inside the pod.

Cross-References