On-Call Survival: Databases (PostgreSQL)¶
Print this. Pin it. Read it at 3 AM.
Alert: Connection Exhaustion (too many clients)¶
Severity: P1
First command:
psql -c "SELECT count(*), state, wait_event_type, wait_event FROM pg_stat_activity GROUP BY state, wait_event_type, wait_event ORDER BY count DESC;"
What you're looking for: total count vs max_connections; a large number of idle or "idle in transaction" connections; what active sessions are waiting on.
Decision tree:
Are most connections "idle in transaction"?
├── Yes → Application not closing transactions. Immediate fix:
│ SELECT pg_terminate_backend(pid) FROM pg_stat_activity
│ WHERE state = 'idle in transaction' AND state_change < now() - interval '5 minutes';
│ (state_change marks when the session went idle in transaction; query_start only marks its last query.)
│ Long-term: app team must fix connection handling / add connection pool.
└── No → Is total count > 90% of max_connections?
├── Yes → Is PgBouncer / connection pooler running?
│ → If not: start it. If yes: check its pool_size and max_client_conn.
└── No → Sudden spike? → Check for connection leak: new deploy?
kubectl rollout undo deploy/<app> -n <ns> (if recent deploy caused it)
Escalate: "PG connections at <N>/<max>, majority state: <state>"
Escalation trigger: New connections being refused; application fully down; cannot connect even as superuser.
Safe actions: pg_stat_activity queries — read-only. Count connections.
Dangerous actions: pg_terminate_backend (kills active sessions), restart PostgreSQL, change max_connections (requires restart).
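A quick triage pattern for this alert: summarize states from tuple-only psql output before deciding to terminate anything. A minimal sketch, using a fabricated five-row sample in place of live output; in production you would populate `sample` from `psql -Atc "SELECT state FROM pg_stat_activity WHERE state IS NOT NULL"`:

```shell
#!/usr/bin/env bash
# Fabricated sample of per-connection states (one per line).
sample='active
idle
idle in transaction
idle in transaction
idle'
# Count idle-in-transaction sessions and the total.
idle_in_txn=$(printf '%s\n' "$sample" | grep -c '^idle in transaction$')
total=$(printf '%s\n' "$sample" | wc -l)
echo "total=$total idle_in_transaction=$idle_in_txn"
# Threshold is illustrative; tune to your fleet.
if [ "$idle_in_txn" -gt 1 ]; then
  echo "WARN: idle-in-transaction sessions piling up"
fi
```

If the warning fires, move to the terminate step in the tree above; otherwise check total against max_connections first.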
Alert: Replication Lag¶
Severity: P1 (lag > 30s and growing) / P2 (lag < 30s, stable)
First command:
# On primary:
psql -c "SELECT client_addr, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag_bytes FROM pg_stat_replication;"
(pg_wal_lsn_diff works on all supported versions; the bare sent_lsn - replay_lsn operator requires PG 14+.)
What you're looking for: replay_lsn falling behind sent_lsn; zero rows means the replica is not connected at all.
Decision tree:
Is the replica connected?
├── No → Is replica pod/service running?
│ kubectl get pods -n <ns> | grep postgres
│ Replica down → start it. Lag will catch up automatically.
└── Yes → Is lag growing or stable?
├── Growing → Replica under I/O pressure? Check: iostat -x 1 5 on replica host.
│ Network saturation? iftop or nload on replica.
│ Escalate: "Replication lag growing, <N> bytes behind, replica I/O: <paste>"
└── Stable (plateau) → Likely expected lag during heavy write period.
Action: watch it for 10 min. Page if lag crosses 60s or keeps growing.
Escalation trigger: Replica completely disconnected; lag > 5 min and growing; replica WAL slot bloating primary disk.
Safe actions: Query pg_stat_replication, check replica pod status — read-only.
Dangerous actions: Drop replication slot (WAL accumulates — disk risk), failover to replica.
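What the lag number actually is: a pg_lsn like 16/B374D848 is two hex halves, and the byte position is hi * 2^32 + lo, so lag is the difference of two such positions. A sketch with made-up LSNs (on the server, prefer pg_wal_lsn_diff() and let PostgreSQL do this):

```shell
#!/usr/bin/env bash
# Convert a pg_lsn "HI/LO" (hex) to a decimal byte offset: hi * 2^32 + lo.
lsn_bytes() {
  local hi=${1%%/*} lo=${1##*/}
  echo $(( 16#$hi * 4294967296 + 16#$lo ))
}
# Fabricated LSNs for illustration.
sent=$(lsn_bytes "16/B374D848")
replay=$(lsn_bytes "16/B3740000")
echo "lag_bytes=$((sent - replay))"   # prints lag_bytes=55368
```

Useful when you only have LSNs pasted into a ticket and no live session to run pg_wal_lsn_diff against.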
Alert: Disk Full (PostgreSQL data volume)¶
Severity: P1
First command:
df -h <PGDATA> && du -sh <PGDATA>/*
What you're looking for: Filesystem % full. Is it the WAL directory eating space?
Decision tree:
Is pg_wal/ filling the disk?
├── Yes → Is there a replication slot that isn't consuming WAL?
│ psql -c "SELECT slot_name, active, restart_lsn FROM pg_replication_slots;"
│ Inactive slot? → SELECT pg_drop_replication_slot('<name>'); (clears WAL accumulation)
│ WARNING: only drop slots for confirmed dead replicas — page DBA first.
└── No → Is it base data (tables/indexes)?
├── Yes → Is a VACUUM already running? psql -c "SELECT pid, phase, heap_blks_scanned FROM pg_stat_progress_vacuum;"
│ Bloat from updates/deletes? → Schedule VACUUM ANALYZE; not an emergency.
└── No → Logs? find /var/log/postgresql -name "*.log" | xargs ls -lh | sort -k5 -rh | head
Rotate logs. Escalate to DBA: "PG data volume full, breakdown: <paste du>"
Escalation trigger: PostgreSQL cannot write (ENOSPC); cannot connect; WAL archiving failing; cannot free space without DBA decision.
Safe actions: df -h, du, query pg_replication_slots — read-only.
Dangerous actions: Drop replication slots, VACUUM FULL (locks table), extend volume.
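The df check above reduces to "extract use% for the data volume, compare to a threshold". A minimal sketch; the df line and the 90% threshold are made up, and in production you would feed it from `df -P <PGDATA> | tail -1`:

```shell
#!/usr/bin/env bash
# Fabricated sample line from `df -P` (field 5 is use%).
line="/dev/sdb1  524288000 498073600 26214400  95% /var/lib/postgresql/data"
pct=$(echo "$line" | awk '{ gsub(/%/, "", $5); print $5 }')
echo "use_pct=$pct"
# Illustrative threshold: page before the volume actually fills.
if [ "$pct" -ge 90 ]; then
  echo "CRIT: data volume at ${pct}%, check pg_wal/ first"
fi
```

Wire it into your monitoring or run it by hand; either way, the pg_wal/ question in the tree above is still the first branch.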
Alert: Long-Running Query (> 5 min)¶
Severity: P2 (holding locks) / P3 (no locks)
First command:
psql -c "SELECT pid, now() - query_start AS duration, state, wait_event, left(query, 100) FROM pg_stat_activity WHERE query_start < now() - interval '5 minutes' AND state != 'idle' ORDER BY duration DESC;"
What you're looking for: long-running non-idle queries; whether anything is blocked on a lock (wait_event_type = 'Lock').
Decision tree:
Is it blocking other queries (other sessions show wait_event_type = 'Lock')?
├── Yes → Is it a known analytics/batch job that can be killed?
│ SELECT pg_cancel_backend(<pid>); (graceful — SIGINT)
│ If that fails: SELECT pg_terminate_backend(<pid>); (SIGTERM)
└── No → Is it consuming excessive CPU (check top)?
├── Yes → Runaway query. pg_cancel_backend(<pid>).
│ App team: "Query PID <pid> running <duration>, consuming CPU, cancelled."
└── No → Slow query (I/O bound). Not blocking. Monitor.
If > 30 min: escalate to DBA for EXPLAIN ANALYZE review.
Escalation trigger: Query holding AccessExclusiveLock on a critical table > 2 min; cannot cancel a blocking query; cascade of locked queries building up.
Safe actions: pg_stat_activity queries — read-only.
Dangerous actions: pg_terminate_backend (kills session), pg_cancel_backend (cancels query).
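Since both actions here are tagged dangerous, a dry-run pattern helps: generate the cancel statements first, eyeball them, then pipe to psql. A sketch; the PIDs are invented:

```shell
#!/usr/bin/env bash
# Fabricated PID list (in practice: pulled from the pg_stat_activity query above).
pids="4242 4311"
# Dry run: print the SQL instead of executing it.
for pid in $pids; do
  printf 'SELECT pg_cancel_backend(%s);\n' "$pid"
done
# After review, execute for real:
#   for pid in $pids; do psql -c "SELECT pg_cancel_backend($pid);"; done
# Escalate to pg_terminate_backend only if the cancel is ignored.
```

The same pattern works for pg_terminate_backend; the point is that you read the generated SQL before anything touches the database.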
Alert: Lock Wait / Deadlock¶
Severity: P1 (blocking user transactions)
First command:
psql -c "SELECT blocked_locks.pid AS blocked_pid, blocking_locks.pid AS blocking_pid, blocked_activity.query AS blocked_query, blocking_activity.query AS blocking_query FROM pg_catalog.pg_locks blocked_locks JOIN pg_catalog.pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype AND blocking_locks.relation = blocked_locks.relation AND blocking_locks.granted AND NOT blocked_locks.granted JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid;"
Decision tree:
Is the blocking_pid an active user query?
├── Yes → How long has it been running?
│ If > 2 min and blocking production: try pg_cancel_backend(<blocking_pid>) first, then pg_terminate_backend(<blocking_pid>)
│ Notify app team: "Terminated PID <pid> holding lock for <duration>"
└── No → Is it an idle transaction ("idle in transaction")?
├── Yes → pg_terminate_backend(<blocking_pid>) — idle transactions should not hold locks.
└── No → Is it a DDL migration (ALTER TABLE, etc.)?
→ Contact the engineer running the migration.
→ If no one owns it: pg_terminate_backend(<blocking_pid>). Then investigate.
Escalation trigger: Lock chain > 5 PIDs; cannot terminate the blocking query; AccessExclusiveLock held > 5 min.
Safe actions: Lock query (pg_locks join) — read-only.
Dangerous actions: pg_terminate_backend (terminates session, rolls back transaction).
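When the escalation trigger mentions a lock chain, the session to kill is the root blocker, not the first blocking_pid you see. A sketch that walks "blocked_pid blocking_pid" pairs (as the lock query above returns them) to the root; the pairs are invented:

```shell
#!/usr/bin/env bash
# Fabricated edges: "blocked_pid blocking_pid", one pair per line.
pairs='101 202
202 303
404 303'
# Follow blocked -> blocking until a PID that is not itself blocked: the root blocker.
root_blocker() {
  local pid=$1 next
  while next=$(printf '%s\n' "$pairs" | awk -v p="$pid" '$1 == p { print $2; exit }') \
        && [ -n "$next" ]; do
    pid=$next
  done
  echo "$pid"
}
root_blocker 101   # prints 303
```

On PG 9.6+ you can also get the edges server-side with `SELECT pid, pg_blocking_pids(pid) FROM pg_stat_activity WHERE cardinality(pg_blocking_pids(pid)) > 0;`.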
Quick Reference¶
Most Useful Commands¶
# Active connections by state
psql -c "SELECT state, count(*) FROM pg_stat_activity GROUP BY state;"
# Long-running queries
psql -c "SELECT pid, now()-query_start AS dur, state, left(query,80) FROM pg_stat_activity WHERE state != 'idle' ORDER BY dur DESC;"
# Replication lag
psql -c "SELECT client_addr, pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag_bytes FROM pg_stat_replication;"
# Lock waits
psql -c "SELECT pid, wait_event_type, wait_event, left(query,80) FROM pg_stat_activity WHERE wait_event_type = 'Lock';"
# Replication slots (WAL accumulation risk)
psql -c "SELECT slot_name, active, pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes FROM pg_replication_slots;"
# Cancel a query (graceful)
psql -c "SELECT pg_cancel_backend(<pid>);"
# Terminate a session (forced)
psql -c "SELECT pg_terminate_backend(<pid>);"
# DB sizes
psql -c "SELECT datname, pg_size_pretty(pg_database_size(datname)) FROM pg_database ORDER BY pg_database_size(datname) DESC;"
# Biggest tables by total size (bloat suspects, top 10)
psql -c "SELECT schemaname, relname, pg_size_pretty(pg_total_relation_size(relid)) FROM pg_statio_user_tables ORDER BY pg_total_relation_size(relid) DESC LIMIT 10;"
Escalation Contacts¶
| Situation | Team | Channel |
|---|---|---|
| Connection pool exhausted | App team + DBA | #db-oncall |
| Replication lag > 5 min | DBA | #db-oncall |
| Disk full (data volume) | DBA + Infra | #infra-oncall |
| Deadlock affecting users | DBA | #db-oncall |
| Failover decision | DBA lead | PagerDuty: db-critical |
Safe vs Dangerous Actions¶
| Safe (do without asking) | Dangerous (get approval) |
|---|---|
| Query pg_stat_activity | pg_terminate_backend |
| Query pg_locks | pg_cancel_backend (less risky) |
| Query pg_replication_slots | Drop replication slot |
| Check disk usage | VACUUM FULL |
| Monitor replication lag | Failover to replica |
| Count connections | Change max_connections |