Pattern: Connection Pool Exhaustion
ID: FP-002 | Family: Resource Exhaustion | Frequency: Very Common | Blast Radius: Multi-Service | Detection Difficulty: Moderate
The Shape
A pool of persistent connections (to a database, cache, or downstream service) is finite. When all connections are checked out and not returned quickly enough, new requests queue or fail. The upstream service appears healthy (its own CPU and memory are fine) while the downstream connection count sits at its hard ceiling. Traffic spikes, slow queries, or connections that are checked out but never returned (a leak) all trigger it.
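The core shape can be simulated in a few lines of stdlib Python. `TinyPool` is a toy stand-in (not any real driver's API): a fixed number of tokens in a queue, where checkout fails once every token is out — the same behavior an app sees as "too many connections" when the real pool or database hits its ceiling.

```python
import queue

class TinyPool:
    """Toy connection pool: a fixed number of slot tokens in a queue.

    Real pools behave the same way at this level of abstraction:
    checkout blocks (or fails) once all slots are checked out, and a
    slot only frees up when a holder returns it.
    """
    def __init__(self, size, checkout_timeout=0.01):
        self._slots = queue.Queue(maxsize=size)
        self._timeout = checkout_timeout
        for conn_id in range(size):
            self._slots.put(conn_id)

    def checkout(self):
        try:
            return self._slots.get(timeout=self._timeout)
        except queue.Empty:
            # What the app reports as "too many connections"
            raise RuntimeError("pool exhausted")

    def checkin(self, conn):
        self._slots.put(conn)

pool = TinyPool(size=2)
a = pool.checkout()
b = pool.checkout()          # pool now empty
try:
    pool.checkout()          # third request fails: the pattern's core shape
    exhausted = False
except RuntimeError:
    exhausted = True
pool.checkin(a)              # returning a connection frees a slot
c = pool.checkout()          # succeeds again
print(exhausted, c == a)     # -> True True
```

Note that nothing in the simulation is "overloaded" — the failure comes purely from slots not being returned fast enough, which is why infrastructure metrics look normal while requests fail.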
How You'll See It
In Kubernetes
All pods show healthy readiness probes, but FATAL: remaining connection slots are reserved
(Postgres) or too many connections (MySQL) errors spike in app logs. HPA scales up more
pods — each pod tries to establish its own connections — which makes the problem worse.
kubectl top pods shows low CPU and memory, so auto-scaling looks like it should be helping.
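The arithmetic behind the HPA death spiral can be sketched directly (the numbers here are hypothetical, chosen to mirror the scenario above): total connection demand is pods × per-pod pool size, so every scale-up event moves the cluster further past the database's fixed ceiling.

```python
# Why HPA scale-out deepens the exhaustion (illustrative numbers).
MAX_CONNECTIONS = 100   # Postgres max_connections (fixed)
PER_POD_POOL = 5        # each pod opens its own pool

def total_demand(pods: int) -> int:
    """Connections the cluster will try to hold open."""
    return pods * PER_POD_POOL

for pods in (10, 20, 30, 40):
    demand = total_demand(pods)
    status = "ok" if demand <= MAX_CONNECTIONS else "EXHAUSTED"
    print(f"{pods} pods -> {demand} connections ({status})")
```

The database's capacity is constant while demand scales linearly with pod count, which is why "add more replicas" is exactly the wrong reflex for this failure.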
In Linux/Infrastructure
Application server process count is normal. netstat -an | grep :5432 | grep ESTABLISHED | wc -l
equals max_connections in postgresql.conf. New app processes block on connect(). Timeouts
cascade upstream.
In CI/CD
Integration tests run in parallel; each test opens its own DB connection. With 50 parallel
jobs and a shared dev Postgres with max_connections=100, tests intermittently fail with
"could not connect" — only under parallelism, never locally.
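A back-of-envelope budget explains the "only under parallelism" intermittency. The helper below is a sketch (the `reserved` and `conns_per_job` values are assumptions — a job can easily hold more than one connection once fixtures or migrations are involved):

```python
# Parallelism budget for a shared dev database (illustrative).
def max_safe_parallelism(max_connections: int, reserved: int, conns_per_job: int) -> int:
    """How many CI jobs can run concurrently before the shared DB
    hits its connection ceiling. 'reserved' models slots kept back
    for superuser/admin sessions."""
    return (max_connections - reserved) // conns_per_job

# 50 jobs against max_connections=100:
print(max_safe_parallelism(100, reserved=3, conns_per_job=1))  # 97 -> fine
print(max_safe_parallelism(100, reserved=3, conns_per_job=2))  # 48 -> 50 jobs intermittently fail
```

With one connection per job there is headroom; at two, failures depend on how many jobs happen to overlap — which is why the tests pass locally and fail only under full parallelism.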
In Networking
The connection bottleneck can appear as a network timeout even though the network itself is fine. Packet capture shows the TCP handshake completing, but the database sends a reset or the app closes the connection immediately with an error banner.
The Tell
The connection count for max_connections (Postgres) or max_user_connections (MySQL) is at its ceiling. Application error logs say "too many connections" or "connection refused" while infrastructure metrics (CPU, memory, network) look normal.
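Since the tell is "at the ceiling while everything else looks normal", it pays to alert on the ratio before the ceiling is reached. A minimal sketch (the 0.8 threshold and function name are assumptions; in practice `active` would come from counting rows in pg_stat_activity and `max_connections` from server config):

```python
def connection_alert(active: int, max_connections: int, warn_ratio: float = 0.8) -> str:
    """Classify connection usage so the alert fires *before* exhaustion.

    active          -- current connection count (e.g. from pg_stat_activity)
    max_connections -- the server's hard ceiling
    warn_ratio      -- hypothetical threshold; tune to your spike profile
    """
    if active >= max_connections:
        return "critical"   # already at the ceiling: the pattern is live
    if active / max_connections >= warn_ratio:
        return "warning"    # headroom shrinking: act now, cheaply
    return "ok"

print(connection_alert(42, 100))   # ok
print(connection_alert(85, 100))   # warning
print(connection_alert(100, 100))  # critical
```

The "warning" band is the whole point: at "critical" you are already debugging an outage instead of adjusting a pool size.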
Common Misdiagnosis
| Looks Like | But Actually | How to Tell the Difference |
|---|---|---|
| Database overloaded | Pool exhaustion | DB CPU is low; active queries near zero; connection count at max |
| Network issue | Connection refusal at DB layer | TCP connects fine; error comes from DB application protocol |
| App memory leak | Leaked connections | Connection count at ceiling before memory is high; check pool config |
The Fix (Generic)
- Immediate: Kill idle connections:
  SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND state_change < now() - interval '5 minutes';
- Short-term: Add a connection pooler (PgBouncer, ProxySQL) in front of the database; set pool size per app instance = max_connections / num_app_instances - headroom.
- Long-term: Use a shared connection pool; add pool_size and max_overflow limits in the ORM; monitor pg_stat_activity connection counts and alert before reaching max.
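The short-term sizing rule above can be expressed as a helper. This is one reading of the formula (headroom reserved per instance; clamping to a minimum of 1 is an added safety assumption, not from the rule itself):

```python
def per_instance_pool_size(max_connections: int, num_instances: int, headroom: int = 1) -> int:
    """Pool size per app instance = max_connections / num_app_instances - headroom.

    headroom keeps slots free for migrations, admin psql sessions, and
    superuser-reserved connections. Clamp to 1 so an instance can still work.
    """
    return max(max_connections // num_instances - headroom, 1)

# 100-connection Postgres, 20 app pods:
print(per_instance_pool_size(100, 20))  # 4 -> 20 * 4 = 80, comfortable headroom
# After a deploy doubles pods to 40, pools must shrink, not stay fixed:
print(per_instance_pool_size(100, 40))  # 1
```

The second call is the real lesson: pool size is a function of instance count, so any static per-pod setting silently breaks the moment the deployment scales.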
Real-World Examples
- Example 1: E-commerce app with 20 pods, each with Django CONN_MAX_AGE=600 (persistent connections). After a deploy that doubled pod count to 40, Postgres max_connections=100 was immediately exhausted. Scaling the app made the database unreachable.
- Example 2: Data pipeline that opens a new DB connection per message processed (100 msg/sec). Works fine at low load; at peak, 100 in-flight connections held open for the full processing time hit the ceiling.
War Story
We scaled from 10 to 30 pods to handle a traffic spike — and the site went more down, not less. Took 15 minutes to realize: each pod had its own connection pool of 5, so 30 pods × 5 = 150 connections, but Postgres max was 100. The "fix" (more pods) was making it worse. We emergency-deployed PgBouncer in transaction mode (pool of 20), restarted the pods, and went from 0% success to 99.9% in 90 seconds.
Cross-References
- Topic Packs: database-ops, k8s-ops
- Case Studies: ops-archaeology/04-postgres-replica-lag/
- Footguns: database-ops/footguns.md — "Connection pool exhaustion"
- Related Patterns: FP-019 (no circuit breaker — connection exhaustion is often the triggering failure), FP-023 (thread pool exhaustion — same shape, different resource)