- devops
- l2
- runbook
- postgresql
- disk-troubleshooting

---

Portal | Level: L2: Operations | Topics: PostgreSQL Operations, Filesystems & Storage | Domain: DevOps & Tooling
Runbook: PostgreSQL Disk Space Critical¶
| Field | Value |
|---|---|
| Domain | Databases |
| Alert | node_filesystem_avail_bytes{mountpoint="/var/lib/postgresql"} < 10% |
| Severity | P1 |
| Est. Resolution Time | 20-45 minutes |
| Escalation Timeout | 20 minutes — page if not resolved |
| Last Tested | 2026-03-19 |
| Prerequisites | SSH access to DB server, psql superuser, sudo on the host |
Quick Assessment (30 seconds)¶
# Run this first — it tells you the scope of the problem
df -h /var/lib/postgresql/ && psql -h <DB_HOST> -U <DB_SUPERUSER> -c "SELECT pg_size_pretty(pg_database_size(current_database()));"
Step 1: Check All Database Sizes¶
Why: You need to know which databases are consuming the most space so you can target investigation at the right place rather than guessing.
psql -h <DB_HOST> -U <DB_SUPERUSER> -c "
SELECT pg_database.datname AS database,
pg_size_pretty(pg_database_size(pg_database.datname)) AS size,
pg_database_size(pg_database.datname) AS size_bytes
FROM pg_database
ORDER BY size_bytes DESC;
"
database | size | size_bytes
--------------+----------+------------
myapp_prod | 47 GB | 50491776000
myapp_stage | 2 GB | 2147483648
postgres | 8 MB | 8388608
Note: you need the CONNECT privilege (or superuser) — pg_database_size requires it for databases you do not own.
Step 2: Find the Largest Tables in the Problem Database¶
Why: A single table with unbounded growth (logs, events, audit records) is the most common cause of sudden disk consumption. Identifying it narrows the remediation path.
psql -h <DB_HOST> -U <DB_SUPERUSER> -d <DATABASE_NAME> -c "
SELECT schemaname,
tablename,
pg_size_pretty(pg_total_relation_size(schemaname || '.' || tablename)) AS total_size,
pg_size_pretty(pg_relation_size(schemaname || '.' || tablename)) AS table_size,
pg_size_pretty(pg_indexes_size(schemaname || '.' || tablename)) AS index_size,
pg_total_relation_size(schemaname || '.' || tablename) AS total_bytes
FROM pg_tables
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
ORDER BY total_bytes DESC
LIMIT 20;
"
schemaname | tablename | total_size | table_size | index_size
------------+--------------------+------------+------------+------------
public | events | 38 GB | 30 GB | 8 GB
public | audit_log | 6 GB | 5 GB | 1 GB
Note: connect to the problem database (-d <DATABASE_NAME>). Querying pg_tables without specifying the database shows system catalogs only.
Step 3: Check for Table Bloat and Dead Tuples¶
Why: Dead tuples from UPDATEs and DELETEs accumulate when VACUUM cannot run fast enough. A heavily-bloated table consumes disk space without containing live data — vacuuming it can reclaim space without deleting anything.
psql -h <DB_HOST> -U <DB_SUPERUSER> -d <DATABASE_NAME> -c "
SELECT schemaname,
tablename,
n_live_tup,
n_dead_tup,
round(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 1) AS dead_pct,
last_vacuum,
last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;
"
schemaname | tablename | n_live_tup | n_dead_tup | dead_pct | last_autovacuum
------------+------------+------------+------------+----------+-----------------
public | events | 5000000 | 3000000 | 37.5 | 2026-03-10 ...
A dead_pct above 20% on a large table means VACUUM has been falling behind.
If this fails: If last_autovacuum is NULL or very old, autovacuum may be disabled or misconfigured for this table.
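To check whether that is the case, inspect the global setting and any per-table overrides (a sketch; <TABLE_NAME> is a placeholder, as elsewhere in this runbook):

```sql
-- Is autovacuum enabled globally?
SHOW autovacuum;

-- Does this table carry per-table overrides (e.g. autovacuum_enabled = off)?
-- A NULL reloptions means the table uses the global defaults.
SELECT relname, reloptions
FROM pg_class
WHERE relname = '<TABLE_NAME>';
```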
Step 4: Run VACUUM to Reclaim Bloat¶
Why: VACUUM marks dead tuples as reusable space. VACUUM ANALYZE also updates query planner statistics. VACUUM FULL reclaims disk at the OS level but takes an exclusive lock on the table — never run it during peak hours without explicit DBA sign-off.
# Safe option: standard VACUUM (does NOT release space back to OS, but reclaims for reuse)
psql -h <DB_HOST> -U <DB_SUPERUSER> -d <DATABASE_NAME> -c "
VACUUM ANALYZE <SCHEMA_NAME>.<TABLE_NAME>;
"
# Nuclear option: VACUUM FULL (rewrites the table, reclaims OS space, but takes EXCLUSIVE LOCK)
# Only run during a maintenance window with application traffic stopped
psql -h <DB_HOST> -U <DB_SUPERUSER> -d <DATABASE_NAME> -c "
VACUUM FULL VERBOSE <SCHEMA_NAME>.<TABLE_NAME>;
"
If you run VACUUM FULL VERBOSE, you will see progress output listing pages removed.
If this fails: If you see ERROR: canceling statement due to conflict with recovery, you are connected to a replica. Always run VACUUM on the primary.
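If VACUUM runs but frees little, a long-open transaction may be pinning the dead tuples — VACUUM cannot remove rows still visible to any running transaction. One way to spot such sessions (a sketch; the one-hour threshold is an assumption, tune to taste):

```sql
-- Sessions with transactions open for over an hour; their snapshot
-- prevents VACUUM from reclaiming tuples deleted since the transaction began.
SELECT pid,
       usename,
       state,
       xact_start,
       now() - xact_start AS xact_age,
       left(query, 60) AS query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND now() - xact_start > interval '1 hour'
ORDER BY xact_start;
```

See long-running-query.md before terminating anything you find here.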
Step 5: Check WAL and Archive Log Accumulation¶
Why: PostgreSQL's Write-Ahead Log (WAL) can grow very large if archiving is failing, if max_wal_size is set too high, or if replication slots are holding WAL back. WAL lives in pg_wal/ and can silently fill the disk.
# Check WAL directory size
du -sh /var/lib/postgresql/data/pg_wal/
# Check WAL archiving status (if archiving is enabled)
psql -h <DB_HOST> -U <DB_SUPERUSER> -c "
SELECT archived_count,
last_archived_wal,
last_archived_time,
failed_count,
last_failed_wal,
last_failed_time
FROM pg_stat_archiver;
"
# Check current WAL settings
psql -h <DB_HOST> -U <DB_SUPERUSER> -c "
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('max_wal_size', 'min_wal_size', 'archive_mode', 'archive_command');
"
# du: pg_wal/ should be under 2-4x max_wal_size setting
# pg_stat_archiver: failed_count = 0 and last_archived_time is recent
If failed_count is high in pg_stat_archiver, the archive command is broken (e.g., S3 credentials expired, destination full). Fix the archive command before WAL fills the local disk.
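One way to gauge the archiving backlog directly (a sketch; pg_ls_dir requires superuser or an explicit grant, and the path is relative to the data directory):

```sql
-- Each .ready file in pg_wal/archive_status is a WAL segment waiting to be
-- archived; a count that keeps growing confirms the archiver is falling behind.
SELECT count(*) AS segments_awaiting_archive
FROM pg_ls_dir('pg_wal/archive_status') AS f
WHERE f LIKE '%.ready';
```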
Step 6: Check for Inactive Replication Slots Holding WAL¶
Why: An inactive replication slot (replica disconnected without dropping its slot) causes PostgreSQL to retain all WAL since the slot was last active, which can fill the disk completely.
psql -h <DB_HOST> -U <DB_SUPERUSER> -c "
SELECT slot_name,
slot_type,
active,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;
"
slot_name | slot_type | active | retained_wal
--------------+-----------+--------+--------------
replica_old | physical | f | 47 GB
If a slot shows active = f and retained_wal is large, it is the cause of WAL accumulation.
To drop an inactive slot (confirm with the team that the replica is truly decommissioned first):
psql -h <DB_HOST> -U <DB_SUPERUSER> -c "SELECT pg_drop_replication_slot('<SLOT_NAME>');"
If this fails: You cannot drop an active slot. If the slot shows active = t but lag is enormous, the replica is connected but severely behind — see postgres-replication-lag.md.
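On PostgreSQL 13 and later, a guardrail setting can cap how much WAL any slot is allowed to retain, preventing a repeat of this incident (a sketch; the 50GB figure is an assumption — size it to your volume):

```sql
-- Current cap; the default of -1 means unlimited retention
SHOW max_slot_wal_keep_size;

-- Cap slot retention; a slot exceeding the cap is invalidated instead of
-- filling the disk (its replica will then need a full resync)
ALTER SYSTEM SET max_slot_wal_keep_size = '50GB';
SELECT pg_reload_conf();
```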
Verification¶
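A minimal re-check of the disk numbers (a sketch; the mountpoint is taken from the alert label and can be overridden via MOUNT):

```shell
# Re-check usage on the PostgreSQL mountpoint (path from the alert label)
MOUNT="${MOUNT:-/var/lib/postgresql}"
USE=$(df --output=pcent "$MOUNT" | tail -n 1 | tr -d ' %')
if [ "$USE" -lt 85 ]; then
    echo "OK: ${USE}% used on $MOUNT"
else
    echo "STILL CRITICAL: ${USE}% used on $MOUNT"
fi
```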
Success looks like: Disk usage is below 85% and trending downward, and the database reports its expected size.
If still broken: Escalate — see below.
Escalation¶
| Condition | Who to Page | What to Say |
|---|---|---|
| Not resolved in 20 min | DBA on-call | "PostgreSQL disk at <N>% on <DB_HOST>; runbook steps exhausted, need DBA" |
| Data loss suspected | DBA / Data Lead | "Possible data loss: PostgreSQL on <DB_HOST>; requesting immediate review" |
| Scope expanding | Platform team | "Storage volume for <DB_HOST> still filling; co-located services may be affected" |
Post-Incident¶
- Update monitoring if alert was noisy or missing
- File postmortem if P1/P2
- Update this runbook if steps were wrong or incomplete
- Implement a data retention policy or partition + archive strategy for large tables
- Set up WAL monitoring: alert before pg_wal/ fills, not after
- Review autovacuum settings for high-churn tables (autovacuum_vacuum_scale_factor, autovacuum_vacuum_cost_delay)
- Evaluate table partitioning for tables growing without bound (events, logs, audit trails)
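For the partitioning item above, one possible shape for an unbounded events table (a sketch; table and column names are illustrative, not taken from this system):

```sql
-- Hypothetical schema: range-partition by timestamp so old data can be
-- dropped per partition instead of row by row.
CREATE TABLE events (
    event_id   bigint      NOT NULL,
    created_at timestamptz NOT NULL,
    payload    jsonb
) PARTITION BY RANGE (created_at);

-- One partition per month; create the next one ahead of time
-- (often automated via cron or the pg_partman extension).
CREATE TABLE events_2026_03 PARTITION OF events
    FOR VALUES FROM ('2026-03-01') TO ('2026-04-01');
CREATE TABLE events_2026_04 PARTITION OF events
    FOR VALUES FROM ('2026-04-01') TO ('2026-05-01');

-- Retention becomes a near-instant metadata operation instead of a
-- huge DELETE followed by VACUUM:
DROP TABLE events_2026_03;
```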
Common Mistakes¶
- Running VACUUM FULL during peak hours: VACUUM FULL takes an ACCESS EXCLUSIVE lock on the table, blocking all reads and writes for the duration. On a 30 GB table this can take 30+ minutes. Always schedule it for a maintenance window.
- Dropping a replication slot without verifying it is truly decommissioned: Even if a slot is inactive, it may belong to a replica that will reconnect. Dropping it forces a full base-backup resync of that replica, which is time-consuming and stressful on the primary.
- Not checking WAL accumulation when disk is nearly full: Tables and indexes are the obvious place to look, but WAL from a broken archive or an inactive slot is a common culprit that is easy to miss.
- Confusing pg_relation_size with pg_total_relation_size: pg_relation_size shows only the main table heap. pg_total_relation_size includes indexes and TOAST tables, which can easily double or triple the apparent table size.
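The difference between the size functions is easy to see side by side (a sketch; placeholders as in the steps above — pg_table_size, the intermediate, adds TOAST and free-space/visibility maps but not indexes):

```sql
SELECT pg_size_pretty(pg_relation_size('<SCHEMA_NAME>.<TABLE_NAME>'))       AS heap_only,
       pg_size_pretty(pg_table_size('<SCHEMA_NAME>.<TABLE_NAME>'))          AS heap_plus_toast,
       pg_size_pretty(pg_indexes_size('<SCHEMA_NAME>.<TABLE_NAME>'))        AS indexes_only,
       pg_size_pretty(pg_total_relation_size('<SCHEMA_NAME>.<TABLE_NAME>')) AS everything;
```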
Cross-References¶
- Topic Pack: PostgreSQL Storage and VACUUM (deep background on MVCC, dead tuples, WAL, and autovacuum tuning)
- Related Runbook: postgres-replication-lag.md — inactive replication slots are often the link between lag and disk pressure
- Related Runbook: long-running-query.md — long transactions prevent autovacuum from running, causing bloat to accumulate
Wiki Navigation¶
Related Content¶
- Case Study: Disk Full Root Services Down (Case Study, L1) — Filesystems & Storage
- Case Study: NVMe Drive Disappeared (Case Study, L2) — Filesystems & Storage
- Case Study: Runaway Logs Fill Disk (Case Study, L1) — Filesystems & Storage
- Case Study: Stuck NFS Mount (Case Study, L2) — Filesystems & Storage
- Deep Dive: Linux Filesystem Internals (deep_dive, L2) — Filesystems & Storage
- Deep Dive: Linux Performance Debugging (deep_dive, L2) — Filesystems & Storage
- Disk & Storage Ops (Topic Pack, L1) — Filesystems & Storage
- Inodes (Topic Pack, L1) — Filesystems & Storage
- Inodes Flashcards (CLI) (flashcard_deck, L1) — Filesystems & Storage
- Kernel Troubleshooting (Topic Pack, L3) — Filesystems & Storage
Pages that link here¶
- Inodes
- Kernel Troubleshooting
- Linux Filesystem Internals
- Linux Performance Debugging
- NVMe Drive Disappeared After Reboot
- Operational Runbooks
- PostgreSQL Operations - Primer
- Runbook: Disk Full
- Runbook: Long-Running Query / Lock Contention
- Runbook: PostgreSQL Connection Exhaustion
- Runbook: PostgreSQL Replication Lag
- Symptoms