---
tags:
  - devops
  - l2
  - runbook
  - postgresql
---
Portal | Level: L2: Operations | Topics: PostgreSQL Operations | Domain: DevOps & Tooling
# Runbook: PostgreSQL Connection Exhaustion
| Field | Value |
|---|---|
| Domain | Databases |
| Alert | pg_stat_activity_count >= pg_settings_max_connections * 0.90 or connection refused errors |
| Severity | P1 |
| Est. Resolution Time | 15-30 minutes |
| Escalation Timeout | 20 minutes — page if not resolved |
| Last Tested | 2026-03-19 |
| Prerequisites | psql access to PostgreSQL, superuser or monitoring role, PgBouncer access if applicable |
## Quick Assessment (30 seconds)

```shell
# Run this first — it tells you the scope of the problem
psql -h <DB_HOST> -U <DB_USER> -c "SELECT count(*) FROM pg_stat_activity;"
```

If output shows: count at or near max_connections → Continue to Step 1
If output shows: count is low → This is a different problem — the application may be failing to connect for other reasons; check network or authentication, see the ingress_404 runbook
## Step 1: Check Current Connection Count vs Maximum

Why: You need to know exactly how full the connection pool is before deciding how aggressively to act.

```shell
psql -h <DB_HOST> -U <DB_USER> -c "
SELECT count(*) AS active_connections,
       max_conn,
       round(100.0 * count(*) / max_conn, 1) AS pct_used
FROM pg_stat_activity,
     (SELECT setting::int AS max_conn
      FROM pg_settings
      WHERE name = 'max_connections') mc
GROUP BY max_conn;
"
```
If this fails: Your role may lack permission to read pg_settings. Try connecting as superuser or a role with pg_monitor membership.
## Step 2: Identify Which Apps/Users Are Holding Connections

Why: Pinpointing the offending application lets you target the fix without disrupting unrelated traffic.

```shell
psql -h <DB_HOST> -U <DB_USER> -c "
SELECT client_addr,
       usename,
       application_name,
       state,
       count(*) AS conn_count
FROM pg_stat_activity
GROUP BY client_addr, usename, application_name, state
ORDER BY conn_count DESC;
"
```
Expected output:

```
  client_addr   | usename  | application_name | state  | conn_count
----------------+----------+------------------+--------+------------
 10.0.1.45      | app_user | myapp-web        | idle   |         47
 10.0.1.46      | app_user | myapp-worker     | active |         12
```
If this fails: Non-superusers may see only their own sessions because of restricted pg_stat_activity visibility. In PostgreSQL 10+, grant pg_monitor to the monitoring user.
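A minimal sketch of that grant, assuming a monitoring role named `monitor_user` (substitute your actual role name):

```sql
-- Hypothetical role name; replace with your monitoring role.
-- Requires superuser (or, in PG 16+, ADMIN on pg_monitor).
GRANT pg_monitor TO monitor_user;
```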
## Step 3: Find Long-Idle Connections

Why: Idle connections that have been open for minutes or hours are the most common cause of exhaustion and are safe to terminate.

```shell
psql -h <DB_HOST> -U <DB_USER> -c "
SELECT pid,
       usename,
       application_name,
       client_addr,
       now() - state_change AS idle_for,
       left(query, 80) AS last_query
FROM pg_stat_activity
WHERE state = 'idle'
ORDER BY idle_for DESC
LIMIT 20;
"
```
Expected output:

```
  pid  | usename  | application_name | client_addr  | idle_for | last_query
-------+----------+------------------+--------------+----------+------------
 12345 | app_user | myapp-web        | 10.0.1.45    | 00:47:23 | SELECT 1
```
## Step 4: Terminate Long-Idle Connections

Why: Connections idle for more than 10 minutes are almost certainly leaked — terminating them immediately frees capacity without affecting active work.

```shell
psql -h <DB_HOST> -U <DB_USER> -c "
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
  AND state_change < NOW() - INTERVAL '10 minutes';
"
```
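A slightly more defensive variant of the query above (a sketch, not from the original runbook) also skips your own session and any non-client backends:

```sql
-- Assumption: PostgreSQL 10+, where backend_type is available.
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
  AND state_change < NOW() - INTERVAL '10 minutes'
  AND pid <> pg_backend_pid()          -- never kill your own session
  AND backend_type = 'client backend'; -- leave replication/autovacuum alone
```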
If this fails: You lack permission to call pg_terminate_backend. You need superuser or membership in the pg_signal_backend role. Contact the DBA on-call to perform this step.
## Step 5: Check PgBouncer Pool Status (if PgBouncer is in use)
Why: If PgBouncer is deployed, it should be absorbing connection pressure. If PgBouncer itself is saturated, terminating database-side connections will not fix the problem — the root cause is upstream.
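A typical way to get the pool status shown below, assuming PgBouncer's admin console listens on its default port 6432 and you connect as a user listed in admin_users (both assumptions; check your deployment):

```sql
-- Connect first, e.g.: psql -h <PGBOUNCER_HOST> -p 6432 -U pgbouncer pgbouncer
-- Then inspect per-database pools:
SHOW POOLS;
```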
Expected output:

```
   database   |   user    | cl_active | cl_waiting | sv_active | sv_idle | sv_used | maxwait
--------------+-----------+-----------+------------+-----------+---------+---------+---------
 myapp        | app_user  |        20 |          0 |        20 |       5 |       3 |       0
```

What to check: Watch cl_waiting — if it is nonzero, clients are queuing. If maxwait is high (>5 seconds), PgBouncer is overwhelmed.
If this fails: PgBouncer admin console is password-protected. Check the pgbouncer.ini for the admin_users setting and use the correct credentials.
## Step 6: Increase max_connections or Enforce Connection Pooling
Why: If connection exhaustion is a recurring issue, the architectural fix is either to increase max_connections (with a memory tradeoff) or to route all traffic through PgBouncer in transaction-pool mode.
Check current setting and estimate memory impact:

```shell
psql -h <DB_HOST> -U <DB_USER> -c "
SELECT name, setting, unit, context
FROM pg_settings
WHERE name IN ('max_connections', 'work_mem', 'shared_buffers');
"
```
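As a rough back-of-the-envelope check, using the ~5-10 MB per-connection figure cited in this runbook (8 MB here is an assumed midpoint; the proposed value is illustrative):

```shell
# Rough RAM estimate for a proposed max_connections value.
# PER_CONN_MB=8 is an assumed midpoint of the ~5-10 MB overhead figure.
NEW_MAX=500
PER_CONN_MB=8
echo "Estimated backend RAM: $((NEW_MAX * PER_CONN_MB)) MB"
```

Compare the result against free RAM after shared_buffers and OS overhead before approving an increase.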
To update max_connections (requires a full restart — coordinate with the team):

```shell
# Edit postgresql.conf — do NOT do this without DBA sign-off during an incident
# max_connections = <NEW_VALUE>   # each connection carries ~5-10 MB RAM overhead
# Note: pg_reload_conf() applies only reloadable parameters; max_connections
# takes effect only after a full PostgreSQL restart.
psql -h <DB_HOST> -U <DB_SUPERUSER> -c "SELECT pg_reload_conf();"
```
Warning: max_connections changes require a full PostgreSQL restart — this is a significant operational action. Escalate to the DBA team.
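If the team opts for the pooling route instead, a minimal pgbouncer.ini sketch in transaction-pool mode (all values illustrative, not from this runbook):

```ini
[databases]
myapp = host=<DB_HOST> port=5432 dbname=myapp

[pgbouncer]
pool_mode = transaction   ; transaction-pool mode, as recommended above
max_client_conn = 1000    ; client-side cap (illustrative)
default_pool_size = 20    ; server connections per database/user pair
```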
## Verification

```shell
# Confirm the issue is resolved
psql -h <DB_HOST> -U <DB_USER> -c "
SELECT count(*) AS total,
       sum(CASE WHEN state = 'active' THEN 1 ELSE 0 END) AS active,
       sum(CASE WHEN state = 'idle' THEN 1 ELSE 0 END) AS idle
FROM pg_stat_activity;
"
```
Success criteria: total is well below max_connections (ideally under 75%) and the idle count has dropped significantly.
If still broken: Escalate — see below.
## Escalation
| Condition | Who to Page | What to Say |
|---|---|---|
| Not resolved in 20 min | DBA on-call | "PostgreSQL connection exhaustion on <DB_HOST>" |
| Data loss suspected | DBA / Data Lead | "Possible data loss: terminated connections may have interrupted in-flight transactions on <DB_HOST>" |
| Scope expanding | Platform team | "Connection exhaustion spreading to multiple DB hosts; possible PgBouncer or network failure" |
## Post-Incident
- Update monitoring if alert was noisy or missing
- File postmortem if P1/P2
- Update this runbook if steps were wrong or incomplete
- Verify the application's connection pool configuration (pool size, idle timeout, max lifetime)
- Consider deploying PgBouncer in transaction-pool mode if not already in use
- Set `idle_in_transaction_session_timeout` in PostgreSQL to auto-kill stuck sessions
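A sketch of that idle_in_transaction_session_timeout setting (the 10-minute value is an assumption; pick one that suits your workload):

```sql
-- This parameter is reloadable, so no restart is needed.
ALTER SYSTEM SET idle_in_transaction_session_timeout = '10min';
SELECT pg_reload_conf();
```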
## Common Mistakes

- Terminating active queries instead of idle connections: Always filter by `state = 'idle'` first. Killing active queries can corrupt application state and cause data inconsistency.
- Not checking if PgBouncer is saturated: If all traffic goes through PgBouncer and PgBouncer has hit its `max_client_conn`, terminating database-side connections does nothing — the bottleneck is in PgBouncer.
- Restarting PostgreSQL as a quick fix: A restart kills every active connection and transaction simultaneously, causing cascading application failures. It is a last resort, not a first response.
- Setting max_connections too high: Each PostgreSQL connection reserves ~5-10 MB of RAM. Setting `max_connections = 1000` on a server with 8 GB RAM may cause out-of-memory crashes under load.
## Cross-References
- Topic Pack: PostgreSQL Administration (deep background on connection management, PgBouncer, and tuning)
- Related Runbook: postgres-replication-lag.md — high connection counts can cause replication lag
- Related Runbook: long-running-query.md — long-running transactions hold connections open
## Wiki Navigation

### Related Content
- PostgreSQL Operations (Topic Pack, L2) — PostgreSQL Operations
- Runbook: Long-Running Query / Lock Contention (Runbook, L2) — PostgreSQL Operations
- Runbook: PostgreSQL Disk Space Critical (Runbook, L2) — PostgreSQL Operations
- Runbook: PostgreSQL Replication Lag (Runbook, L2) — PostgreSQL Operations