On-Call Survival: Databases (PostgreSQL)

Print this. Pin it. Read it at 3 AM.


Alert: Connection Exhaustion (too many clients)

Severity: P1

First command:

psql -c "SELECT count(*), state, wait_event_type, wait_event FROM pg_stat_activity GROUP BY state, wait_event_type, wait_event ORDER BY count DESC;"
What you're looking for: Total connections near max_connections; large number of idle or idle in transaction connections; what they're waiting on.

Decision tree:

Are most connections "idle in transaction"?
├── Yes → Application not closing transactions. Immediate fix:
│         SELECT pg_terminate_backend(pid) FROM pg_stat_activity
│           WHERE state = 'idle in transaction' AND state_change < now() - interval '5 min';
│         (state_change marks when the session went idle; query_start only marks its last statement.)
│         Long-term: app team must fix connection handling / add a connection pool.
└── No → Is total count > 90% of max_connections?
    ├── Yes → Is PgBouncer / a connection pooler running?
    │          If not: start it. If yes: check its pool_size and max_client_conn.
    └── No → Sudden spike? Check for a connection leak: new deploy?
              kubectl rollout undo deploy/<app> -n <ns>   (if a recent deploy caused it)
              Escalate: "PG connections at <N>/<max>, majority state: <state>"
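To put a number on "near max_connections", a tiny helper makes the 90% check explicit. This is a sketch: the helper name and the psql wiring in the comment are illustrative assumptions, not part of the alert definition.

```shell
# conn_pct USED MAX -> integer percentage of max_connections in use
# (hypothetical helper; the 90% threshold mirrors the decision tree above).
conn_pct() {
  echo $(( $1 * 100 / $2 ))
}

# Against a live server it might be wired up like this (connection flags omitted):
#   used=$(psql -Atc "SELECT count(*) FROM pg_stat_activity;")
#   max=$(psql -Atc "SHOW max_connections;")
#   [ "$(conn_pct "$used" "$max")" -ge 90 ] && echo "WARNING: ${used}/${max} connections"
```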

Escalation trigger: New connections being refused; application fully down; cannot connect even as superuser.

Safe actions: pg_stat_activity queries — read-only. Count connections.

Dangerous actions: pg_terminate_backend (kills active sessions), restart PostgreSQL, change max_connections (requires restart).


Alert: Replication Lag

Severity: P1 (lag > 30s and growing) / P2 (lag < 30s, stable)

First command:

# On primary:
psql -c "SELECT client_addr, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag_bytes FROM pg_stat_replication;"
What you're looking for: Lag in bytes (replay_lsn falling behind sent_lsn). pg_wal_lsn_diff works on all supported versions; direct pg_lsn subtraction requires PostgreSQL 14+. On 10+ the write_lag / flush_lag / replay_lag columns report lag as time intervals directly.

Decision tree:

Is the replica connected?
├── No → Is the replica pod/service running?
│         kubectl get pods -n <ns> | grep postgres
│         Replica down → start it. Lag will catch up automatically.
└── Yes → Is lag growing or stable?
    ├── Growing → Replica under I/O pressure? Check: iostat -x 1 5 on the replica host.
    │             Network saturation? iftop or nload on the replica.
    │             Escalate: "Replication lag growing, <N> bytes behind, replica I/O: <paste>"
    └── Stable (plateau) → Likely expected lag during a heavy write period.
                           Action: watch it for 10 min. Page if it crosses 60s or keeps growing.

Escalation trigger: Replica completely disconnected; lag > 5 min and growing; replica WAL slot bloating primary disk.

Safe actions: Query pg_stat_replication, check replica pod status — read-only.

Dangerous actions: Drop replication slot (WAL accumulates — disk risk), failover to replica.


Alert: Disk Full (PostgreSQL data volume)

Severity: P1

First command:

df -h /var/lib/postgresql   # or wherever PGDATA is mounted
du -sh /var/lib/postgresql/data/pg_wal/
What you're looking for: Filesystem % full. Is it the WAL directory eating space?

Decision tree:

Is pg_wal/ filling the disk?
├── Yes → Is there a replication slot that isn't consuming WAL?
│         psql -c "SELECT slot_name, active, restart_lsn FROM pg_replication_slots;"
│         Inactive slot? → SELECT pg_drop_replication_slot('<name>');  (lets accumulated WAL be recycled)
│         WARNING: only drop slots for confirmed-dead replicas → page the DBA first.
└── No → Is it base data (tables/indexes)?
    ├── Yes → Is a vacuum running? psql -c "SELECT pid, phase, heap_blks_scanned FROM pg_stat_progress_vacuum;"
    │          (Note: VACUUM FULL reports in pg_stat_progress_cluster, not here.)
    │          Bloat from updates/deletes? → Schedule VACUUM ANALYZE; not an emergency.
    └── No → Logs? find /var/log/postgresql -name "*.log" | xargs ls -lh | sort -k5 -rh | head
              Rotate logs. Escalate to DBA: "PG data volume full, breakdown: <paste du>"
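Before dropping anything, quantify how much WAL each slot is pinning. The pg_wal_lsn_diff query in the comment is standard; the helper name and the idea of comparing against your disk headroom are illustrative assumptions.

```shell
# slot_retained_gib BYTES -> whole GiB of WAL a slot is holding back
# (hypothetical helper; feed it pg_wal_lsn_diff output from the query below).
slot_retained_gib() {
  echo $(( $1 / 1024 / 1024 / 1024 ))
}

# Usage against a live primary (connection flags omitted):
#   psql -Atc "SELECT slot_name, pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
#              FROM pg_replication_slots WHERE NOT active;"
# then page the DBA if the retained GiB approaches your remaining disk headroom.
```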

Escalation trigger: PostgreSQL cannot write (ENOSPC); cannot connect; WAL archiving failing; cannot free space without DBA decision.

Safe actions: df -h, du, query pg_replication_slots — read-only.

Dangerous actions: Drop replication slots, VACUUM FULL (locks table), extend volume.


Alert: Long-Running Query (> 5 min)

Severity: P2 (holding locks) / P3 (no locks)

First command:

psql -c "SELECT pid, now() - query_start AS duration, state, wait_event, left(query, 100) FROM pg_stat_activity WHERE query_start < now() - interval '5 minutes' AND state != 'idle' ORDER BY duration DESC;"
What you're looking for: PID, duration, and whether other sessions are blocked behind it (waiters show wait_event_type = 'Lock').

Decision tree:

Is it blocking other queries (waiters with wait_event_type = 'Lock')?
├── Yes → Is it a known analytics/batch job that can be killed?
│         SELECT pg_cancel_backend(<pid>);   (graceful → SIGINT)
│         If that fails: SELECT pg_terminate_backend(<pid>);   (SIGTERM)
└── No → Is it consuming excessive CPU (check top)?
    ├── Yes → Runaway query. pg_cancel_backend(<pid>).
    │          App team: "Query PID <pid> running <duration>, consuming CPU, cancelled."
    └── No → Slow query (I/O bound). Not blocking. Monitor.
              If > 30 min: escalate to DBA for EXPLAIN ANALYZE review.
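The cancel / terminate / monitor choice above can be sketched as a function; the helper name and exact strings are assumptions that mirror the tree, not a prescribed tool.

```shell
# query_action BLOCKING(0|1) HIGH_CPU(0|1) -> recommended next step for a
# long-running query, encoding the decision tree above (hypothetical helper).
# pg_cancel_backend (SIGINT) comes first; pg_terminate_backend only as fallback.
query_action() {
  blocking=$1; high_cpu=$2
  if [ "$blocking" -eq 1 ]; then
    echo "cancel (terminate if cancel fails)"
  elif [ "$high_cpu" -eq 1 ]; then
    echo "cancel"
  else
    echo "monitor"
  fi
}
```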

Escalation trigger: Query holding AccessExclusiveLock on a critical table > 2 min; cannot cancel a blocking query; cascade of locked queries building up.

Safe actions: pg_stat_activity queries — read-only.

Dangerous actions: pg_terminate_backend (kills session), pg_cancel_backend (cancels query).


Alert: Lock Wait / Deadlock

Severity: P1 (blocking user transactions)

First command:

psql -c "SELECT blocked_locks.pid AS blocked_pid, blocking_locks.pid AS blocking_pid, blocked_activity.query AS blocked_query, blocking_activity.query AS blocking_query FROM pg_catalog.pg_locks blocked_locks JOIN pg_catalog.pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype AND blocking_locks.relation = blocked_locks.relation AND blocking_locks.pid != blocked_locks.pid AND blocking_locks.granted AND NOT blocked_locks.granted JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid;"
What you're looking for: blocking_pid and blocking_query — that's the process holding the lock. (This simplified join matches relation-level locks only; row-level and transactionid waits won't appear.)

Decision tree:

Is the blocking_pid an active user query?
├── Yes → How long has it been running?
│         If > 2 min and blocking production: pg_terminate_backend(<blocking_pid>)
│         Notify app team: "Terminated PID <pid> holding lock for <duration>"
└── No → Is it an idle transaction ("idle in transaction")?
    ├── Yes → pg_terminate_backend(<blocking_pid>) — idle transactions should not hold locks.
    └── No → Is it a DDL migration (ALTER TABLE, etc.)?
             → Contact the engineer running the migration.
             → If no one owns it: pg_terminate_backend(<blocking_pid>). Then investigate.
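The long join query above has a shorter alternative on PostgreSQL 9.6+: pg_blocking_pids() resolves the lock graph server-side, including the row-level and transactionid waits the relation-only join misses. The wrapper function below is ours (hypothetical); it just prints the SQL so it can be piped to psql.

```shell
# blocking_sql -> prints a query listing every blocked backend and the PIDs
# blocking it, via pg_blocking_pids() (server-side lock-graph walk, PG 9.6+).
blocking_sql() {
  cat <<'SQL'
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       left(query, 80) AS query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;
SQL
}

# Usage (connection flags omitted): psql -c "$(blocking_sql)"
```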

Escalation trigger: Lock chain > 5 PIDs; cannot terminate the blocking query; AccessExclusiveLock held > 5 min.

Safe actions: Lock query (pg_locks join) — read-only.

Dangerous actions: pg_terminate_backend (terminates session, rolls back transaction).


Quick Reference

Most Useful Commands

# Active connections by state
psql -c "SELECT state, count(*) FROM pg_stat_activity GROUP BY state;"

# Long-running queries
psql -c "SELECT pid, now()-query_start AS dur, state, left(query,80) FROM pg_stat_activity WHERE state != 'idle' ORDER BY dur DESC;"

# Replication lag
psql -c "SELECT client_addr, pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag_bytes FROM pg_stat_replication;"

# Lock waits
psql -c "SELECT pid, wait_event_type, wait_event, left(query,80) FROM pg_stat_activity WHERE wait_event_type = 'Lock';"

# Replication slots (WAL accumulation risk)
psql -c "SELECT slot_name, active, pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes FROM pg_replication_slots;"

# Cancel a query (graceful)
psql -c "SELECT pg_cancel_backend(<pid>);"

# Terminate a session (forced)
psql -c "SELECT pg_terminate_backend(<pid>);"

# DB sizes
psql -c "SELECT datname, pg_size_pretty(pg_database_size(datname)) FROM pg_database ORDER BY pg_database_size(datname) DESC;"

# Largest tables by total size (bloat suspects; measures size, not bloat itself)
psql -c "SELECT schemaname, relname, pg_size_pretty(pg_total_relation_size(relid)) FROM pg_statio_user_tables ORDER BY pg_total_relation_size(relid) DESC LIMIT 10;"

Escalation Contacts

Situation                   Team            Channel
Connection pool exhausted   App team + DBA  #db-oncall
Replication lag > 5 min     DBA             #db-oncall
Disk full (data volume)     DBA + Infra     #infra-oncall
Deadlock affecting users    DBA             #db-oncall
Failover decision           DBA lead        PagerDuty: db-critical

Safe vs Dangerous Actions

Safe (do without asking)      Dangerous (get approval)
Query pg_stat_activity        pg_terminate_backend
Query pg_locks                pg_cancel_backend (less risky)
Query pg_replication_slots    Drop replication slot
Check disk usage              VACUUM FULL
Monitor replication lag       Failover to replica
Count connections             Change max_connections

Shift Handoff Template

Status: [GREEN/YELLOW/RED]
Active incidents: [none / description]
Recent deploys: [list from last 24h]
Known flaky alerts: [list]
Things to watch: [anything unusual]