---
tags:
  - devops
  - l2
  - runbook
  - postgresql
---
Portal | Level: L2: Operations | Topics: PostgreSQL Operations | Domain: DevOps & Tooling
# Runbook: PostgreSQL Connection Exhaustion
| Field | Value |
|---|---|
| Domain | Databases |
| Alert | pg_stat_activity_count >= pg_settings_max_connections * 0.90 or connection refused errors |
| Severity | P1 |
| Est. Resolution Time | 15-30 minutes |
| Escalation Timeout | 20 minutes — page if not resolved |
| Last Tested | 2026-03-19 |
| Prerequisites | psql access to PostgreSQL, superuser or monitoring role, PgBouncer access if applicable |
## Quick Assessment (30 seconds)

```shell
# Run this first — it tells you the scope of the problem
psql -h <DB_HOST> -U <DB_USER> -c "SELECT count(*) FROM pg_stat_activity;"
```

If output shows: count at or near max_connections → Continue to Step 1
If output shows: count is low → This is a different problem — the application may be failing to connect for other reasons; check network or authentication, see the ingress_404 runbook
## Step 1: Check Current Connection Count vs Maximum

Why: You need to know exactly how full the connection pool is before deciding how aggressively to act.

```shell
psql -h <DB_HOST> -U <DB_USER> -c "
SELECT count(*) AS active_connections,
       max_conn,
       round(100.0 * count(*) / max_conn, 1) AS pct_used
FROM pg_stat_activity,
     (SELECT setting::int AS max_conn
      FROM pg_settings
      WHERE name = 'max_connections') mc
GROUP BY max_conn;
"
```
If this fails: Your role may lack permission to read pg_settings. Try connecting as superuser or a role with pg_monitor membership.
## Step 2: Identify Which Apps/Users Are Holding Connections

Why: Pinpointing the offending application lets you target the fix without disrupting unrelated traffic.

```shell
psql -h <DB_HOST> -U <DB_USER> -c "
SELECT client_addr,
       usename,
       application_name,
       state,
       count(*) AS conn_count
FROM pg_stat_activity
GROUP BY client_addr, usename, application_name, state
ORDER BY conn_count DESC;
"
```
Expected output:

```
  client_addr   | usename  | application_name | state  | conn_count
----------------+----------+------------------+--------+------------
 10.0.1.45      | app_user | myapp-web        | idle   |         47
 10.0.1.46      | app_user | myapp-worker     | active |         12
```
If this fails: Non-superusers may see only their own sessions because of restricted pg_stat_activity visibility. In PostgreSQL 10+, grant pg_monitor to the monitoring user.
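A minimal sketch of that grant, assuming a monitoring role named `monitor_user` (substitute your actual role name):

```sql
-- Hypothetical role name; replace with your monitoring role.
-- Requires superuser (or, in PG 16+, ADMIN on pg_monitor).
GRANT pg_monitor TO monitor_user;
```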
## Step 3: Find Long-Idle Connections

Why: Idle connections that have been open for minutes or hours are the most common cause of exhaustion and are safe to terminate.

```shell
psql -h <DB_HOST> -U <DB_USER> -c "
SELECT pid,
       usename,
       application_name,
       client_addr,
       now() - state_change AS idle_for,
       left(query, 80) AS last_query
FROM pg_stat_activity
WHERE state = 'idle'
ORDER BY idle_for DESC
LIMIT 20;
"
```
Expected output:

```
  pid  | usename  | application_name | client_addr  | idle_for | last_query
-------+----------+------------------+--------------+----------+------------
 12345 | app_user | myapp-web        | 10.0.1.45    | 00:47:23 | SELECT 1
```
## Step 4: Terminate Long-Idle Connections

Why: Connections idle for more than 10 minutes are almost certainly leaked — terminating them immediately frees capacity without affecting active work.

```shell
psql -h <DB_HOST> -U <DB_USER> -c "
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
  AND state_change < NOW() - INTERVAL '10 minutes';
"
```
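A slightly more defensive variant of the query above (a sketch, not from the original runbook) also skips your own session and any non-client backends:

```sql
-- Assumption: PostgreSQL 10+, where backend_type is available.
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
  AND state_change < NOW() - INTERVAL '10 minutes'
  AND pid <> pg_backend_pid()          -- never kill your own session
  AND backend_type = 'client backend'; -- leave replication/autovacuum alone
```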
If this fails: You lack permission to call pg_terminate_backend. You need superuser or membership in the pg_signal_backend role. Contact the DBA on-call to perform this step.
## Step 5: Check PgBouncer Pool Status (if PgBouncer is in use)
Why: If PgBouncer is deployed, it should be absorbing connection pressure. If PgBouncer itself is saturated, terminating database-side connections will not fix the problem — the root cause is upstream.
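A typical way to get the pool status shown below, assuming PgBouncer's admin console listens on its default port 6432 and you connect as a user listed in admin_users (both assumptions; check your deployment):

```sql
-- Connect first, e.g.: psql -h <PGBOUNCER_HOST> -p 6432 -U pgbouncer pgbouncer
-- Then inspect per-database pools:
SHOW POOLS;
```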
Expected output:

```
   database   |   user    | cl_active | cl_waiting | sv_active | sv_idle | sv_used | maxwait
--------------+-----------+-----------+------------+-----------+---------+---------+---------
 myapp        | app_user  |        20 |          0 |        20 |       5 |       3 |       0
```

What to check: Watch cl_waiting — if it is nonzero, clients are queuing. If maxwait is high (>5 seconds), PgBouncer is overwhelmed.
If this fails: PgBouncer admin console is password-protected. Check the pgbouncer.ini for the admin_users setting and use the correct credentials.
## Step 6: Increase max_connections or Enforce Connection Pooling
Why: If connection exhaustion is a recurring issue, the architectural fix is either to increase max_connections (with a memory tradeoff) or to route all traffic through PgBouncer in transaction-pool mode.
Check current setting and estimate memory impact:

```shell
psql -h <DB_HOST> -U <DB_USER> -c "
SELECT name, setting, unit, context
FROM pg_settings
WHERE name IN ('max_connections', 'work_mem', 'shared_buffers');
"
```
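As a rough back-of-the-envelope check, using the ~5-10 MB per-connection figure cited in this runbook (8 MB here is an assumed midpoint; the proposed value is illustrative):

```shell
# Rough RAM estimate for a proposed max_connections value.
# PER_CONN_MB=8 is an assumed midpoint of the ~5-10 MB overhead figure.
NEW_MAX=500
PER_CONN_MB=8
echo "Estimated backend RAM: $((NEW_MAX * PER_CONN_MB)) MB"
```

Compare the result against free RAM after shared_buffers and OS overhead before approving an increase.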
To update max_connections (requires a full restart — coordinate with the team):

```shell
# Edit postgresql.conf — do NOT do this without DBA sign-off during an incident
# max_connections = <NEW_VALUE>   # each connection carries ~5-10 MB RAM overhead
# Note: pg_reload_conf() applies only reloadable parameters; max_connections
# takes effect only after a full PostgreSQL restart.
psql -h <DB_HOST> -U <DB_SUPERUSER> -c "SELECT pg_reload_conf();"
```
Warning: max_connections changes require a full PostgreSQL restart — this is a significant operational action. Escalate to the DBA team.
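If the team opts for the pooling route instead, a minimal pgbouncer.ini sketch in transaction-pool mode (all values illustrative, not from this runbook):

```ini
[databases]
myapp = host=<DB_HOST> port=5432 dbname=myapp

[pgbouncer]
pool_mode = transaction   ; transaction-pool mode, as recommended above
max_client_conn = 1000    ; client-side cap (illustrative)
default_pool_size = 20    ; server connections per database/user pair
```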
## Verification

```shell
# Confirm the issue is resolved
psql -h <DB_HOST> -U <DB_USER> -c "
SELECT count(*) AS total,
       sum(CASE WHEN state = 'active' THEN 1 ELSE 0 END) AS active,
       sum(CASE WHEN state = 'idle' THEN 1 ELSE 0 END) AS idle
FROM pg_stat_activity;
"
```
Success criteria: total is well below max_connections (ideally under 75%) and the idle count has dropped significantly.
If still broken: Escalate — see below.
## Escalation
| Condition | Who to Page | What to Say |
|---|---|---|
| Not resolved in 20 min | DBA on-call | "PostgreSQL connection exhaustion on <DB_HOST>" |
| Data loss suspected | DBA / Data Lead | "Possible data loss: terminated connections may have interrupted in-flight transactions on <DB_HOST>" |
| Scope expanding | Platform team | "Connection exhaustion spreading to multiple DB hosts; possible PgBouncer or network failure" |
## Post-Incident
- Update monitoring if alert was noisy or missing
- File postmortem if P1/P2
- Update this runbook if steps were wrong or incomplete
- Verify the application's connection pool configuration (pool size, idle timeout, max lifetime)
- Consider deploying PgBouncer in transaction-pool mode if not already in use
- Set `idle_in_transaction_session_timeout` in PostgreSQL to auto-kill stuck sessions
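A sketch of that idle_in_transaction_session_timeout setting (the 10-minute value is an assumption; pick one that suits your workload):

```sql
-- This parameter is reloadable, so no restart is needed.
ALTER SYSTEM SET idle_in_transaction_session_timeout = '10min';
SELECT pg_reload_conf();
```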
## Common Mistakes

- Terminating active queries instead of idle connections: Always filter by `state = 'idle'` first. Killing active queries can corrupt application state and cause data inconsistency.
- Not checking if PgBouncer is saturated: If all traffic goes through PgBouncer and PgBouncer has hit its `max_client_conn`, terminating database-side connections does nothing — the bottleneck is in PgBouncer.
- Restarting PostgreSQL as a quick fix: A restart kills every active connection and transaction simultaneously, causing cascading application failures. It is a last resort, not a first response.
- Setting max_connections too high: Each PostgreSQL connection reserves ~5-10 MB of RAM. Setting `max_connections = 1000` on a server with 8 GB RAM may cause out-of-memory crashes under load.
## Cross-References
- Topic Pack: PostgreSQL Administration (deep background on connection management, PgBouncer, and tuning)
- Related Runbook: postgres-replication-lag.md — high connection counts can cause replication lag
- Related Runbook: long-running-query.md — long-running transactions hold connections open
## Wiki Navigation

### Related Content
- PostgreSQL Operations (Topic Pack, L2) — PostgreSQL Operations
- Runbook: Long-Running Query / Lock Contention (Runbook, L2) — PostgreSQL Operations
- Runbook: PostgreSQL Disk Space Critical (Runbook, L2) — PostgreSQL Operations
- Runbook: PostgreSQL Replication Lag (Runbook, L2) — PostgreSQL Operations