
Decision Tree: Disk Is Filling Up

Category: Incident Triage
Starting Question: "Disk usage is high or growing — what's consuming space?"
Estimated traversal: 2-5 minutes
Domains: linux-performance, kubernetes, postgresql


The Tree

Disk usage is high or growing → what's consuming space?
├── First: which machine / volume?
│   ├── Kubernetes node → SSH to the node first, then follow below
│   │   `kubectl get node <node> -o wide` → get the node IP
│   │   `ssh ubuntu@<node-ip>`
│   └── Database pod / persistent volume → exec into the pod
│       `kubectl exec -it <db-pod> -- bash`
├── `df -h` → which filesystem is full or near full?
│   ├── / (root filesystem)
│   │   `du -sh /* 2>/dev/null | sort -rh | head -15`
│   │   ├── /var/log is largest
│   │   │   `du -sh /var/log/* | sort -rh | head -10`
│   │   │   ├── Specific log file huge (e.g., syslog, auth.log)
│   │   │   │   └── → ACTION: Rotate / Truncate Logs
│   │   │   └── Many container logs in /var/log/containers/
│   │   │       `ls -lhS /var/log/containers/ | head -10`
│   │   │       └── → ACTION: Set Container Log Rotation Limits / Reduce verbosity
│   │   ├── /tmp is largest
│   │   │   `ls -lhS /tmp/ | head -10`
│   │   │   └── Old temp files / uncleaned job artifacts
│   │   │       └── → ACTION: Clean /tmp (check before deleting)
│   │   └── /home or /opt or /srv is largest
│   │       → application data growth; see the database branch below
│   ├── /var/lib/containerd or /var/lib/docker (container storage)
│   │   `du -sh /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots`
│   │   `crictl images | sort -k4 -rh | head -10`
│   │   ├── Many unused / old images
│   │   │   └── → ACTION: Prune Unused Container Images
│   │   ├── Many stopped containers accumulating layers
│   │   │   `crictl ps -a --state Exited | wc -l`
│   │   │   └── → ACTION: Remove Stopped Containers
│   │   └── Container overlay snapshots growing (active containers writing logs)
│   │       → configure container log size limits in the kubelet config
│   │       └── → ACTION: Set Container Log Rotation Limits
│   ├── /var/lib/postgresql or /data (database volume)
│   │   ├── Is the DB a PostgreSQL instance?
│   │   │   `du -sh /var/lib/postgresql/*/main/pg_wal/`
│   │   │   ├── pg_wal is large (WAL accumulation)
│   │   │   │   Check: `SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn))
│   │   │   │           FROM pg_stat_replication;`
│   │   │   │   ├── Replica lagging → WAL retained for the replica
│   │   │   │   │   └── → ACTION: Fix Replica Lag / Adjust wal_keep_size
│   │   │   │   └── No replicas? Check archiving: `SELECT * FROM pg_stat_archiver;`
│   │   │   │       └── Archive stuck → ACTION: Fix WAL Archive / Clear pg_wal
│   │   │   ├── Dead tuples / table bloat
│   │   │   │   `SELECT relname, n_dead_tup, pg_size_pretty(pg_total_relation_size(oid))
│   │   │   │    FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 10;`
│   │   │   │   └── High dead tuple count → ACTION: Run VACUUM / VACUUM FULL
│   │   │   └── Large tables growing (unbounded data)
│   │   │       `SELECT relname, pg_size_pretty(pg_total_relation_size(oid))
│   │   │        FROM pg_stat_user_tables ORDER BY pg_total_relation_size(oid) DESC LIMIT 10;`
│   │   │       └── → ACTION: Implement Data Retention / Partitioning
│   │   └── Is it another database (MySQL, MongoDB, etc.)?
│   │       Check the data dir for binary logs / oplog / slow query log size
│   │       → adapt the queries above to the DB engine
│   └── /var/lib/kubelet (kubelet data directory)
│       `du -sh /var/lib/kubelet/pods/*/volumes/ | sort -rh | head -10`
│       ├── hostPath or emptyDir volumes accumulating data
│       │   → identify the pod by the UUID path component
│       │   `kubectl get pod --all-namespaces -o yaml | grep -B5 <uuid>`
│       │   └── → ACTION: Fix Pod Volume / Clean Up Orphaned Volumes
│       └── Orphaned pod directories (pod deleted but dir remains)
│           `ls /var/lib/kubelet/pods/ | while read id; do
│              kubectl get pod --all-namespaces -o yaml | grep -q $id || echo "orphan: $id"; done`
│           └── → ACTION: Remove Orphaned Pod Volume Directories
└── Check inodes too! `df -i`
    ├── Inode usage >90% on any filesystem?
    │   `find /var/log -maxdepth 3 -type f | wc -l`
    │   `find /tmp -maxdepth 3 -type f | wc -l`
    │   └── Thousands of tiny files → ACTION: Clean Up Inode-Consuming Files
    └── Inodes fine → block usage is the issue (already addressed above)
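The orphaned-directory check near the bottom of the tree re-queries the API server once per directory, which is slow on busy nodes. It can be done in a single pass; a sketch, with the `orphan_dirs` helper name being illustrative rather than anything from this runbook:

```shell
# Given live pod UIDs on stdin (one per line), print kubelet pod
# directories that no longer correspond to any live pod.
orphan_dirs() {
  known=$(cat)
  for id in $(ls "${1:-/var/lib/kubelet/pods}"); do
    printf '%s\n' "$known" | grep -qx "$id" || echo "orphan: $id"
  done
}

# Usage — fetch all live pod UIDs once, then compare locally:
# kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.uid}{"\n"}{end}' \
#   | orphan_dirs /var/lib/kubelet/pods
```

Review the output by hand before removing anything — a UID may belong to a pod that is mid-creation.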

Node Details

Check 1: Initial filesystem survey

Command: `df -h && df -i` — run both together; block usage and inode usage require separate checks.
What you're looking for: any filesystem at >85% block usage or >80% inode usage. Pay attention to the "Mounted on" column to know which filesystem to investigate.
Common pitfall: Kubernetes nodes typically put container storage on a separate partition, so `df -h /` won't show /var/lib/containerd filling up. Check all mounts, not just root.
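The threshold check can be scripted for a quick first pass. A minimal sketch — the `df_over` name and the 85% default are illustrative, not part of this runbook:

```shell
# Print any mount whose usage percentage exceeds a threshold.
# Reads POSIX `df -P` output on stdin; threshold percent as $1 (default 85).
df_over() {
  awk -v t="${1:-85}" 'NR > 1 { gsub(/%/, "", $5); if ($5 + 0 > t) print $6, $5 "%" }'
}

# Usage: df -P | df_over 85
# Inodes too (column layout matches): df -iP | df_over 80
```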

Check 2: Finding large directories

Command: `du -sh /* 2>/dev/null | sort -rh | head -15`, then drill down into the largest directory. Use `du -sh /var/log/* | sort -rh | head -10` to go one level deeper.
What you're looking for: any single directory or file consuming unexpectedly large space. A healthy node's top consumers should be predictable (OS, containerd, kubelet).
Common pitfall: `du` can take 30-60 seconds on large filesystems. If the node is in DiskPressure, this delay matters — start with `df -h` to identify the specific mount before running `du`.
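The drill-down step repeats the same pipeline at each level, so it is worth wrapping. A sketch — the `largest` helper name is illustrative:

```shell
# List the largest immediate children of a directory, biggest first.
# $1 = directory (default: current dir), $2 = entries to show (default 10).
largest() {
  du -sh "${1:-.}"/* 2>/dev/null | sort -rh | head -"${2:-10}"
}

# Usage: largest / 15, then drill down: largest /var/log
```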

Check 3: PostgreSQL WAL directory

Command: `du -sh $PGDATA/pg_wal/` and: `SELECT count(*), pg_size_pretty(sum(size)) FROM pg_ls_waldir();` (requires superuser, PG 10+).
What you're looking for: a WAL directory larger than ~1GB suggests either a lagging replica retaining WAL segments, a stuck WAL archive process, or `wal_keep_size` set too high.
Common pitfall: never manually delete files from pg_wal/ — this corrupts the database. Use `pg_archivecleanup` or fix the root cause (replica lag / archive process) instead.
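If you only have shell access (no superuser SQL), the WAL size can be estimated from the segment count. A sketch — the `wal_estimate` name is illustrative, and it assumes the default 16MB `wal_segment_size` (confirm with `SHOW wal_segment_size;` if in doubt):

```shell
# Estimate pg_wal size from segment count without running SQL.
# WAL segment files are named with 24 hex characters; assumes the
# default 16MB wal_segment_size.
wal_estimate() {
  count=$(ls "$1" 2>/dev/null | grep -Ec '^[0-9A-F]{24}$')
  echo "${count} segments, ~$((count * 16))MB"
}

# Usage: wal_estimate /var/lib/postgresql/14/main/pg_wal
```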

Check 4: PostgreSQL dead tuples / bloat

Command: `SELECT relname, n_dead_tup, n_live_tup, last_vacuum, last_autovacuum FROM pg_stat_user_tables WHERE n_dead_tup > 10000 ORDER BY n_dead_tup DESC;`
What you're looking for: tables with n_dead_tup much larger than n_live_tup — these carry significant bloat. Also check that last_autovacuum is recent (within hours for active tables).
Common pitfall: autovacuum may be disabled, or its cost delay may be too high, causing bloat to accumulate. Check: `SHOW autovacuum_vacuum_cost_delay;`

Check 5: Container images

Command: `crictl images` (containerd) or `docker images --format "table {{.Repository}}\t{{.Tag}}\t{{.Size}}"` — sorted by size.
What you're looking for: multiple versions of large base images (node, python, ubuntu) no longer in use by any pod. Also look for images with a <none> tag (dangling layers).
Common pitfall: `crictl rmi --prune` only removes images not referenced by any container spec — it is safe. However, if you remove an image that a pod spec references and the registry is unreachable, the pod will fail with ImagePullBackOff.

Check 6: Container log size limits

Command: check the current kubelet config: `sudo cat /var/lib/kubelet/config.yaml | grep -i log`. Also: `sudo ls -lhS /var/log/containers/ | head -5`.
What you're looking for: the kubelet settings containerLogMaxSize (default 10Mi) and containerLogMaxFiles (default 5). If rotation is not effectively configured, a single verbose container can fill the disk.
Common pitfall: setting containerLogMaxSize very small makes logs rotate so fast that `kubectl logs` returns almost no history. A good default is 50Mi with 3 files.


Terminal Actions

Action: Rotate / Truncate Logs

Do:
1. View the largest logs: `sudo ls -lhS /var/log/*.log | head -10`
2. Force rotation: `sudo logrotate -f /etc/logrotate.conf`
3. For journald: `sudo journalctl --vacuum-size=500M && sudo journalctl --vacuum-time=7d`
4. If a specific log file must be truncated (and you've reviewed it): `sudo truncate -s 0 /var/log/syslog`
Verify: `df -h` shows the space recovered; `sudo logrotate -d /etc/logrotate.conf` (dry run) completes without error.

Action: Prune Unused Container Images

Do:
1. Dry run (containerd): `sudo crictl images | grep -v REPOSITORY` — identify candidates
2. Prune: `sudo crictl rmi --prune`
3. For Docker: `sudo docker system prune -a --volumes` (be careful with --volumes on stateful nodes)
4. Check the space freed: `df -h /var/lib/containerd`
Verify: `df -h` shows free space; `kubectl get node <name>` no longer shows DiskPressure.

Action: Remove Stopped Containers

Do:
1. List stopped containers: `sudo crictl ps -a --state Exited`
2. Remove them all: `sudo crictl rm $(sudo crictl ps -a -q --state Exited) 2>/dev/null || true`
3. Confirm: `sudo crictl ps -a | grep Exited | wc -l` — should be 0
Verify: space freed; no impact on running containers.

Action: Set Container Log Rotation Limits

Do:
1. Edit the kubelet config: `sudo vim /var/lib/kubelet/config.yaml`
2. Add or update: `containerLogMaxSize: "50Mi"` and `containerLogMaxFiles: 3`
3. Restart the kubelet: `sudo systemctl restart kubelet`
4. Existing large logs: `sudo truncate -s 100M /var/log/containers/<large-log-file>.log`
Verify: `sudo ls -lhS /var/log/containers/ | head` shows no files over 50Mi; `kubectl logs` still works.
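For reference, the relevant stanza of the kubelet config looks like this — a fragment only, assuming the stock config file location and the suggested values from the steps above:

```yaml
# /var/lib/kubelet/config.yaml (KubeletConfiguration)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
containerLogMaxSize: "50Mi"   # per-container log file cap (kubelet default: 10Mi)
containerLogMaxFiles: 3       # rotated files kept per container (kubelet default: 5)
```

On clusters managed by kubeadm or a cloud provider, this file may be regenerated on upgrade — set the values through the cluster's kubelet configuration mechanism where one exists.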

Action: Run VACUUM / VACUUM FULL

Do:
1. For routine cleanup (online, safe): `VACUUM ANALYZE <table_name>;`
2. For full space reclamation (requires an exclusive lock, takes the table offline): `VACUUM FULL <table_name>;`
3. Check progress: `SELECT phase, heap_blks_total, heap_blks_scanned FROM pg_stat_progress_vacuum;` (plain VACUUM; VACUUM FULL reports via pg_stat_progress_cluster on PG 12+)
4. Post-vacuum: `SELECT pg_size_pretty(pg_total_relation_size('<table>'));` — confirm the size is reduced
Verify: n_dead_tup in pg_stat_user_tables drops to near zero; the table size decreases (VACUUM FULL only).

Action: Fix Replica Lag / Adjust wal_keep_size

Do:
1. Check replica lag: `SELECT client_addr, sent_lsn, replay_lsn, pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag FROM pg_stat_replication;`
2. If a replica is far behind, investigate the replica's health (network, disk, CPU)
3. Temporary fix to allow WAL cleanup: `SELECT pg_drop_replication_slot('<slot_name>');` (only if the replica is permanently gone)
4. Adjust: `ALTER SYSTEM SET wal_keep_size = '512MB'; SELECT pg_reload_conf();`
Verify: `du -sh $PGDATA/pg_wal/` decreases after the next checkpoint.

Action: Implement Data Retention / Partitioning

Do:
1. Identify the oldest data you can delete: `SELECT min(created_at) FROM <table>;`
2. Delete in batches to avoid long locks — PostgreSQL's DELETE has no LIMIT clause, so batch through ctid: `DELETE FROM <table> WHERE ctid IN (SELECT ctid FROM <table> WHERE created_at < now() - interval '90 days' LIMIT 10000);` — repeat until no rows are affected
3. Long-term: implement table partitioning by date and use `DROP TABLE <partition_name>` instead of DELETE
4. Add a retention job (cron or pg_cron) to run the cleanup automatically
Verify: the table stops growing and autovacuum processes the freed pages. Note that plain DELETE returns space to the table for reuse, not to the OS — `df -h` improves only after VACUUM FULL or dropped partitions.

Action: Clean Up Inode-Consuming Files

Do:
1. Find the directories with the most files: `find / -xdev -type f | sed 's/\/[^\/]*$//' | sort | uniq -c | sort -rn | head -20`
2. Common culprits: /var/log/journal/ (journald), /tmp/, application temp directories
3. For journald: `sudo journalctl --vacuum-files=5`
4. For /tmp: `sudo find /tmp -maxdepth 1 -mtime +7 -delete`
Verify: `df -i` shows inode usage below 80%; no new file-creation errors.
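The pipeline in step 1 is dense enough to be worth wrapping in a function. A sketch — the `files_per_dir` name is illustrative:

```shell
# Count regular files per directory under a root, most crowded first.
# $1 = root to scan (default /), $2 = entries to show (default 20).
files_per_dir() {
  find "${1:-/}" -xdev -type f 2>/dev/null \
    | sed 's:/[^/]*$::' \
    | sort | uniq -c | sort -rn | head -"${2:-20}"
}

# Usage: files_per_dir /var/log
```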

Action: Clean /tmp (check before deleting)

Do:
1. `ls -lhS /tmp/ | head -20` — identify the largest files
2. `stat /tmp/<file>` — check the modification time and owner
3. Delete old files: `sudo find /tmp -maxdepth 1 -mtime +1 -delete`
4. For application temp dirs: notify the app owner before deleting
Verify: `df -h /tmp` shows free space; the application still functions.


Edge Cases

  • Disk fills suddenly in minutes: A process is actively writing (log flooding, runaway query result set, core dump). Use lsof +D /var/lib/containerd or inotifywait -m /var/log to catch it live.
  • Disk usage high but du finds nothing: An open file handle is keeping a deleted file's disk blocks allocated. sudo lsof | grep deleted | sort -k7 -rh | head -10. Solution: restart the process holding the file handle.
  • PostgreSQL disk full with no large tables: Check pg_wal, temporary sort files (base/pgsql_tmp/ inside the data directory), and crash recovery files. A long-running sort/hash query can fill the temp space inside the Postgres data dir.
  • Node DiskPressure after container image build: CI/CD pipelines that build images on Kubernetes nodes generate large intermediate layers. Use a dedicated build node or remote builder instead.
  • PVC shows high usage but pod can't write: PVC may be ReadOnlyMany or the pod may be writing to a different path than the mount point. Check kubectl exec -it <pod> -- df -h inside the pod.

Cross-References