Pattern: Connection Pool Exhaustion
ID: FP-002 | Family: Resource Exhaustion | Frequency: Very Common | Blast Radius: Multi-Service | Detection Difficulty: Moderate
The Shape
A pool of persistent connections (to a database, cache, or downstream service) is finite. When all connections are checked out and not returned quickly enough, new requests queue or fail. The upstream service appears healthy (its own CPU and memory are fine) while the downstream connection count sits at its hard ceiling. Traffic spikes, slow queries, or connections that are checked out but never returned (a leak) all trigger it.
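The core shape can be simulated in a few lines of stdlib Python. `TinyPool` is a toy stand-in (not any real driver's API): a fixed number of tokens in a queue, where checkout fails once every token is out — the same behavior an app sees as "too many connections" when the real pool or database hits its ceiling.

```python
import queue

class TinyPool:
    """Toy connection pool: a fixed number of slot tokens in a queue.

    Real pools behave the same way at this level of abstraction:
    checkout blocks (or fails) once all slots are checked out, and a
    slot only frees up when a holder returns it.
    """
    def __init__(self, size, checkout_timeout=0.01):
        self._slots = queue.Queue(maxsize=size)
        self._timeout = checkout_timeout
        for conn_id in range(size):
            self._slots.put(conn_id)

    def checkout(self):
        try:
            return self._slots.get(timeout=self._timeout)
        except queue.Empty:
            # What the app reports as "too many connections"
            raise RuntimeError("pool exhausted")

    def checkin(self, conn):
        self._slots.put(conn)

pool = TinyPool(size=2)
a = pool.checkout()
b = pool.checkout()          # pool now empty
try:
    pool.checkout()          # third request fails: the pattern's core shape
    exhausted = False
except RuntimeError:
    exhausted = True
pool.checkin(a)              # returning a connection frees a slot
c = pool.checkout()          # succeeds again
print(exhausted, c == a)     # -> True True
```

Note that nothing in the simulation is "overloaded" — the failure comes purely from slots not being returned fast enough, which is why infrastructure metrics look normal while requests fail.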
How You'll See It
In Kubernetes
All pods show healthy readiness probes, but FATAL: remaining connection slots are reserved
(Postgres) or too many connections (MySQL) errors spike in app logs. HPA scales up more
pods — each pod tries to establish its own connections — which makes the problem worse.
kubectl top pods shows low CPU and memory, so auto-scaling looks like it should be helping.
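The arithmetic behind the HPA death spiral can be sketched directly (the numbers here are hypothetical, chosen to mirror the scenario above): total connection demand is pods × per-pod pool size, so every scale-up event moves the cluster further past the database's fixed ceiling.

```python
# Why HPA scale-out deepens the exhaustion (illustrative numbers).
MAX_CONNECTIONS = 100   # Postgres max_connections (fixed)
PER_POD_POOL = 5        # each pod opens its own pool

def total_demand(pods: int) -> int:
    """Connections the cluster will try to hold open."""
    return pods * PER_POD_POOL

for pods in (10, 20, 30, 40):
    demand = total_demand(pods)
    status = "ok" if demand <= MAX_CONNECTIONS else "EXHAUSTED"
    print(f"{pods} pods -> {demand} connections ({status})")
```

The database's capacity is constant while demand scales linearly with pod count, which is why "add more replicas" is exactly the wrong reflex for this failure.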
In Linux/Infrastructure
Application server process count is normal. netstat -an | grep :5432 | grep ESTABLISHED | wc -l
equals max_connections in postgresql.conf. New app processes block on connect(). Timeouts
cascade upstream.
In CI/CD
Integration tests run in parallel; each test opens its own DB connection. With 50 parallel
jobs and a shared dev Postgres with max_connections=100, tests intermittently fail with
"could not connect" — only under parallelism, never locally.
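A back-of-envelope budget explains the "only under parallelism" intermittency. The helper below is a sketch (the `reserved` and `conns_per_job` values are assumptions — a job can easily hold more than one connection once fixtures or migrations are involved):

```python
# Parallelism budget for a shared dev database (illustrative).
def max_safe_parallelism(max_connections: int, reserved: int, conns_per_job: int) -> int:
    """How many CI jobs can run concurrently before the shared DB
    hits its connection ceiling. 'reserved' models slots kept back
    for superuser/admin sessions."""
    return (max_connections - reserved) // conns_per_job

# 50 jobs against max_connections=100:
print(max_safe_parallelism(100, reserved=3, conns_per_job=1))  # 97 -> fine
print(max_safe_parallelism(100, reserved=3, conns_per_job=2))  # 48 -> 50 jobs intermittently fail
```

With one connection per job there is headroom; at two, failures depend on how many jobs happen to overlap — which is why the tests pass locally and fail only under full parallelism.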
In Networking
The connection bottleneck can appear as a network timeout even though the network itself is fine. Packet capture shows the TCP handshake completing, but the database sends a reset or the app closes the connection immediately with an error banner.
The Tell
The connection count for max_connections (Postgres) or max_user_connections (MySQL) is at its ceiling. Application error logs say "too many connections" or "connection refused" while infrastructure metrics (CPU, memory, network) look normal.
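Since the tell is "at the ceiling while everything else looks normal", it pays to alert on the ratio before the ceiling is reached. A minimal sketch (the 0.8 threshold and function name are assumptions; in practice `active` would come from counting rows in pg_stat_activity and `max_connections` from server config):

```python
def connection_alert(active: int, max_connections: int, warn_ratio: float = 0.8) -> str:
    """Classify connection usage so the alert fires *before* exhaustion.

    active          -- current connection count (e.g. from pg_stat_activity)
    max_connections -- the server's hard ceiling
    warn_ratio      -- hypothetical threshold; tune to your spike profile
    """
    if active >= max_connections:
        return "critical"   # already at the ceiling: the pattern is live
    if active / max_connections >= warn_ratio:
        return "warning"    # headroom shrinking: act now, cheaply
    return "ok"

print(connection_alert(42, 100))   # ok
print(connection_alert(85, 100))   # warning
print(connection_alert(100, 100))  # critical
```

The "warning" band is the whole point: at "critical" you are already debugging an outage instead of adjusting a pool size.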
Common Misdiagnosis
| Looks Like | But Actually | How to Tell the Difference |
|---|---|---|
| Database overloaded | Pool exhaustion | DB CPU is low; active queries near zero; connection count at max |
| Network issue | Connection refusal at DB layer | TCP connects fine; error comes from DB application protocol |
| App memory leak | Leaked connections | Connection count at ceiling before memory is high; check pool config |
The Fix (Generic)
- Immediate: Kill idle connections:
  SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND state_change < now() - interval '5 minutes';
- Short-term: Add a connection pooler (PgBouncer, ProxySQL) in front of the database; set pool size per app instance = max_connections / num_app_instances - headroom.
- Long-term: Use a shared connection pool; add pool_size and max_overflow limits in the ORM; monitor pg_stat_activity connection counts and alert before reaching max.
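The short-term sizing rule above can be expressed as a helper. This is one reading of the formula (headroom reserved per instance; clamping to a minimum of 1 is an added safety assumption, not from the rule itself):

```python
def per_instance_pool_size(max_connections: int, num_instances: int, headroom: int = 1) -> int:
    """Pool size per app instance = max_connections / num_app_instances - headroom.

    headroom keeps slots free for migrations, admin psql sessions, and
    superuser-reserved connections. Clamp to 1 so an instance can still work.
    """
    return max(max_connections // num_instances - headroom, 1)

# 100-connection Postgres, 20 app pods:
print(per_instance_pool_size(100, 20))  # 4 -> 20 * 4 = 80, comfortable headroom
# After a deploy doubles pods to 40, pools must shrink, not stay fixed:
print(per_instance_pool_size(100, 40))  # 1
```

The second call is the real lesson: pool size is a function of instance count, so any static per-pod setting silently breaks the moment the deployment scales.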
Real-World Examples
- Example 1: E-commerce app with 20 pods, each with Django CONN_MAX_AGE=600 (persistent connections). After a deploy that doubled pod count to 40, Postgres max_connections=100 was immediately exhausted. Scaling the app made the database unreachable.
- Example 2: Data pipeline that opens a new DB connection per message processed (100 msg/sec). Works fine at low load; at peak, 100 in-flight connections held open for the full processing time hit the ceiling.
War Story
We scaled from 10 to 30 pods to handle a traffic spike — and the site went more down, not less. Took 15 minutes to realize: each pod had its own connection pool of 5, so 30 pods × 5 = 150 connections, but Postgres max was 100. The "fix" (more pods) was making it worse. We emergency-deployed PgBouncer in transaction mode (pool of 20), restarted the pods, and went from 0% success to 99.9% in 90 seconds.
Cross-References
- Topic Packs: database-ops, k8s-ops
- Case Studies: ops-archaeology/04-postgres-replica-lag/
- Footguns: database-ops/footguns.md — "Connection pool exhaustion"
- Related Patterns: FP-019 (no circuit breaker — connection exhaustion is often the triggering failure), FP-023 (thread pool exhaustion — same shape, different resource)