- devops
- l2
- runbook
- postgresql
- disk-troubleshooting

---

Portal | Level: L2: Operations | Topics: PostgreSQL Operations, Filesystems & Storage | Domain: DevOps & Tooling
Runbook: PostgreSQL Disk Space Critical¶
| Field | Value |
|---|---|
| Domain | Databases |
| Alert | node_filesystem_avail_bytes{mountpoint="/var/lib/postgresql"} < 10% |
| Severity | P1 |
| Est. Resolution Time | 20-45 minutes |
| Escalation Timeout | 20 minutes — page if not resolved |
| Last Tested | 2026-03-19 |
| Prerequisites | SSH access to DB server, psql superuser, sudo on the host |
Quick Assessment (30 seconds)¶
# Run this first — it tells you the scope of the problem
df -h /var/lib/postgresql/ && psql -h <DB_HOST> -U <DB_SUPERUSER> -c "SELECT pg_size_pretty(pg_database_size(current_database()));"
Step 1: Check All Database Sizes¶
Why: You need to know which databases are consuming the most space so you can target investigation at the right place rather than guessing.
psql -h <DB_HOST> -U <DB_SUPERUSER> -c "
SELECT pg_database.datname AS database,
pg_size_pretty(pg_database_size(pg_database.datname)) AS size,
pg_database_size(pg_database.datname) AS size_bytes
FROM pg_database
ORDER BY size_bytes DESC;
"
database | size | size_bytes
--------------+----------+------------
myapp_prod | 47 GB | 50491776000
myapp_stage | 2 GB | 2147483648
postgres | 8 MB | 8388608
Note: you need the CONNECT privilege (or superuser) — pg_database_size requires it for databases you do not own.
Step 2: Find the Largest Tables in the Problem Database¶
Why: A single table with unbounded growth (logs, events, audit records) is the most common cause of sudden disk consumption. Identifying it narrows the remediation path.
psql -h <DB_HOST> -U <DB_SUPERUSER> -d <DATABASE_NAME> -c "
SELECT schemaname,
tablename,
pg_size_pretty(pg_total_relation_size(schemaname || '.' || tablename)) AS total_size,
pg_size_pretty(pg_relation_size(schemaname || '.' || tablename)) AS table_size,
pg_size_pretty(pg_indexes_size(schemaname || '.' || tablename)) AS index_size,
pg_total_relation_size(schemaname || '.' || tablename) AS total_bytes
FROM pg_tables
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
ORDER BY total_bytes DESC
LIMIT 20;
"
schemaname | tablename | total_size | table_size | index_size
------------+--------------------+------------+------------+------------
public | events | 38 GB | 30 GB | 8 GB
public | audit_log | 6 GB | 5 GB | 1 GB
Note: connect to the problem database (-d <DATABASE_NAME>). Querying pg_tables without specifying the database shows system catalogs only.
Step 3: Check for Table Bloat and Dead Tuples¶
Why: Dead tuples from UPDATEs and DELETEs accumulate when VACUUM cannot run fast enough. A heavily-bloated table consumes disk space without containing live data — vacuuming it can reclaim space without deleting anything.
psql -h <DB_HOST> -U <DB_SUPERUSER> -d <DATABASE_NAME> -c "
SELECT schemaname,
tablename,
n_live_tup,
n_dead_tup,
round(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 1) AS dead_pct,
last_vacuum,
last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;
"
schemaname | tablename | n_live_tup | n_dead_tup | dead_pct | last_autovacuum
------------+------------+------------+------------+----------+-----------------
public | events | 5000000 | 3000000 | 37.5 | 2026-03-10 ...
A dead_pct above 20% on a large table means VACUUM has been falling behind.
If this fails: If last_autovacuum is NULL or very old, autovacuum may be disabled or misconfigured for this table.
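To check whether that is the case, inspect the global setting and any per-table overrides (a sketch; <TABLE_NAME> is a placeholder, as elsewhere in this runbook):

```sql
-- Is autovacuum enabled globally?
SHOW autovacuum;

-- Does this table carry per-table overrides (e.g. autovacuum_enabled = off)?
-- A NULL reloptions means the table uses the global defaults.
SELECT relname, reloptions
FROM pg_class
WHERE relname = '<TABLE_NAME>';
```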
Step 4: Run VACUUM to Reclaim Bloat¶
Why: VACUUM marks dead tuples as reusable space. VACUUM ANALYZE also updates query planner statistics. VACUUM FULL reclaims disk at the OS level but takes an exclusive lock on the table — never run it during peak hours without explicit DBA sign-off.
# Safe option: standard VACUUM (does NOT release space back to OS, but reclaims for reuse)
psql -h <DB_HOST> -U <DB_SUPERUSER> -d <DATABASE_NAME> -c "
VACUUM ANALYZE <SCHEMA_NAME>.<TABLE_NAME>;
"
# Nuclear option: VACUUM FULL (rewrites the table, reclaims OS space, but takes EXCLUSIVE LOCK)
# Only run during a maintenance window with application traffic stopped
psql -h <DB_HOST> -U <DB_SUPERUSER> -d <DATABASE_NAME> -c "
VACUUM FULL VERBOSE <SCHEMA_NAME>.<TABLE_NAME>;
"
If you run VACUUM FULL VERBOSE, you will see progress output listing pages removed.
If this fails: If you see ERROR: canceling statement due to conflict with recovery, you are connected to a replica. Always run VACUUM on the primary.
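If VACUUM runs but frees little, a long-open transaction may be pinning the dead tuples — VACUUM cannot remove rows still visible to any running transaction. One way to spot such sessions (a sketch; the one-hour threshold is an assumption, tune to taste):

```sql
-- Sessions with transactions open for over an hour; their snapshot
-- prevents VACUUM from reclaiming tuples deleted since the transaction began.
SELECT pid,
       usename,
       state,
       xact_start,
       now() - xact_start AS xact_age,
       left(query, 60) AS query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND now() - xact_start > interval '1 hour'
ORDER BY xact_start;
```

See long-running-query.md before terminating anything you find here.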
Step 5: Check WAL and Archive Log Accumulation¶
Why: PostgreSQL's Write-Ahead Log (WAL) can grow very large if archiving is failing, if max_wal_size is set too high, or if replication slots are holding WAL back. WAL lives in pg_wal/ and can silently fill the disk.
# Check WAL directory size
du -sh /var/lib/postgresql/data/pg_wal/
# Check WAL archiving status (if archiving is enabled)
psql -h <DB_HOST> -U <DB_SUPERUSER> -c "
SELECT archived_count,
last_archived_wal,
last_archived_time,
failed_count,
last_failed_wal,
last_failed_time
FROM pg_stat_archiver;
"
# Check current WAL settings
psql -h <DB_HOST> -U <DB_SUPERUSER> -c "
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('max_wal_size', 'min_wal_size', 'archive_mode', 'archive_command');
"
# du: pg_wal/ should be under 2-4x max_wal_size setting
# pg_stat_archiver: failed_count = 0 and last_archived_time is recent
If failed_count is high in pg_stat_archiver, the archive command is broken (e.g., S3 credentials expired, destination full). Fix the archive command before WAL fills the local disk.
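One way to gauge the archiving backlog directly (a sketch; pg_ls_dir requires superuser or an explicit grant, and the path is relative to the data directory):

```sql
-- Each .ready file in pg_wal/archive_status is a WAL segment waiting to be
-- archived; a count that keeps growing confirms the archiver is falling behind.
SELECT count(*) AS segments_awaiting_archive
FROM pg_ls_dir('pg_wal/archive_status') AS f
WHERE f LIKE '%.ready';
```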
Step 6: Check for Inactive Replication Slots Holding WAL¶
Why: An inactive replication slot (replica disconnected without dropping its slot) causes PostgreSQL to retain all WAL since the slot was last active, which can fill the disk completely.
psql -h <DB_HOST> -U <DB_SUPERUSER> -c "
SELECT slot_name,
slot_type,
active,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;
"
slot_name | slot_type | active | retained_wal
--------------+-----------+--------+--------------
replica_old | physical | f | 47 GB
If a slot shows active = f and retained_wal is large, it is the cause of WAL accumulation.
To drop an inactive slot (confirm with the team that the replica is truly decommissioned first):
psql -h <DB_HOST> -U <DB_SUPERUSER> -c "SELECT pg_drop_replication_slot('<SLOT_NAME>');"
If this fails: You cannot drop an active slot. If the slot shows active = t but lag is enormous, the replica is connected but severely behind — see postgres-replication-lag.md.
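On PostgreSQL 13 and later, a guardrail setting can cap how much WAL any slot is allowed to retain, preventing a repeat of this incident (a sketch; the 50GB figure is an assumption — size it to your volume):

```sql
-- Current cap; the default of -1 means unlimited retention
SHOW max_slot_wal_keep_size;

-- Cap slot retention; a slot exceeding the cap is invalidated instead of
-- filling the disk (its replica will then need a full resync)
ALTER SYSTEM SET max_slot_wal_keep_size = '50GB';
SELECT pg_reload_conf();
```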
Verification¶
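A minimal re-check of the disk numbers (a sketch; the mountpoint is taken from the alert label and can be overridden via MOUNT):

```shell
# Re-check usage on the PostgreSQL mountpoint (path from the alert label)
MOUNT="${MOUNT:-/var/lib/postgresql}"
USE=$(df --output=pcent "$MOUNT" | tail -n 1 | tr -d ' %')
if [ "$USE" -lt 85 ]; then
    echo "OK: ${USE}% used on $MOUNT"
else
    echo "STILL CRITICAL: ${USE}% used on $MOUNT"
fi
```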
Success looks like: Disk usage is below 85% and trending downward, and the database reports its expected size.
If still broken: Escalate — see below.
Escalation¶
| Condition | Who to Page | What to Say |
|---|---|---|
| Not resolved in 20 min | DBA on-call | "PostgreSQL disk at <N>% on <DB_HOST>; runbook steps exhausted, need DBA" |
| Data loss suspected | DBA / Data Lead | "Possible data loss: PostgreSQL on <DB_HOST>; requesting immediate review" |
| Scope expanding | Platform team | "Storage volume for <DB_HOST> still filling; co-located services may be affected" |
Post-Incident¶
- Update monitoring if alert was noisy or missing
- File postmortem if P1/P2
- Update this runbook if steps were wrong or incomplete
- Implement a data retention policy or partition + archive strategy for large tables
- Set up WAL monitoring: alert before pg_wal/ fills, not after
- Review autovacuum settings for high-churn tables (autovacuum_vacuum_scale_factor, autovacuum_vacuum_cost_delay)
- Evaluate table partitioning for tables growing without bound (events, logs, audit trails)
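For the partitioning item above, one possible shape for an unbounded events table (a sketch; table and column names are illustrative, not taken from this system):

```sql
-- Hypothetical schema: range-partition by timestamp so old data can be
-- dropped per partition instead of row by row.
CREATE TABLE events (
    event_id   bigint      NOT NULL,
    created_at timestamptz NOT NULL,
    payload    jsonb
) PARTITION BY RANGE (created_at);

-- One partition per month; create the next one ahead of time
-- (often automated via cron or the pg_partman extension).
CREATE TABLE events_2026_03 PARTITION OF events
    FOR VALUES FROM ('2026-03-01') TO ('2026-04-01');
CREATE TABLE events_2026_04 PARTITION OF events
    FOR VALUES FROM ('2026-04-01') TO ('2026-05-01');

-- Retention becomes a near-instant metadata operation instead of a
-- huge DELETE followed by VACUUM:
DROP TABLE events_2026_03;
```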
Common Mistakes¶
- Running VACUUM FULL during peak hours: VACUUM FULL takes an ACCESS EXCLUSIVE lock on the table, blocking all reads and writes for the duration. On a 30 GB table this can take 30+ minutes. Always schedule it for a maintenance window.
- Dropping a replication slot without verifying it is truly decommissioned: Even if a slot is inactive, it may belong to a replica that will reconnect. Dropping it forces a full base-backup resync of that replica, which is time-consuming and stressful on the primary.
- Not checking WAL accumulation when disk is nearly full: Tables and indexes are the obvious place to look, but WAL from a broken archive or an inactive slot is a common culprit that is easy to miss.
- Confusing pg_relation_size with pg_total_relation_size: pg_relation_size shows only the main table heap. pg_total_relation_size includes indexes and TOAST tables, which can easily double or triple the apparent table size.
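The difference between the size functions is easy to see side by side (a sketch; placeholders as in the steps above — pg_table_size, the intermediate, adds TOAST and free-space/visibility maps but not indexes):

```sql
SELECT pg_size_pretty(pg_relation_size('<SCHEMA_NAME>.<TABLE_NAME>'))       AS heap_only,
       pg_size_pretty(pg_table_size('<SCHEMA_NAME>.<TABLE_NAME>'))          AS heap_plus_toast,
       pg_size_pretty(pg_indexes_size('<SCHEMA_NAME>.<TABLE_NAME>'))        AS indexes_only,
       pg_size_pretty(pg_total_relation_size('<SCHEMA_NAME>.<TABLE_NAME>')) AS everything;
```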
Cross-References¶
- Topic Pack: PostgreSQL Storage and VACUUM (deep background on MVCC, dead tuples, WAL, and autovacuum tuning)
- Related Runbook: postgres-replication-lag.md — inactive replication slots are often the link between lag and disk pressure
- Related Runbook: long-running-query.md — long transactions prevent autovacuum from running, causing bloat to accumulate
Wiki Navigation¶
Related Content¶
- Case Study: Disk Full Root Services Down (Case Study, L1) — Filesystems & Storage
- Case Study: NVMe Drive Disappeared (Case Study, L2) — Filesystems & Storage
- Case Study: Runaway Logs Fill Disk (Case Study, L1) — Filesystems & Storage
- Case Study: Stuck NFS Mount (Case Study, L2) — Filesystems & Storage
- Deep Dive: Linux Filesystem Internals (deep_dive, L2) — Filesystems & Storage
- Deep Dive: Linux Performance Debugging (deep_dive, L2) — Filesystems & Storage
- Disk & Storage Ops (Topic Pack, L1) — Filesystems & Storage
- Inodes (Topic Pack, L1) — Filesystems & Storage
- Inodes Flashcards (CLI) (flashcard_deck, L1) — Filesystems & Storage
- Kernel Troubleshooting (Topic Pack, L3) — Filesystems & Storage
Pages that link here¶
- Inodes
- Kernel Troubleshooting
- Linux Filesystem Internals
- Linux Performance Debugging
- NVMe Drive Disappeared After Reboot
- Operational Runbooks
- PostgreSQL Operations - Primer
- Runbook: Disk Full
- Runbook: Long-Running Query / Lock Contention
- Runbook: PostgreSQL Connection Exhaustion
- Runbook: PostgreSQL Replication Lag
- Symptoms