Pattern: Thread Pool Exhaustion
ID: FP-023 · Family: Cascading Failure · Frequency: Common · Blast Radius: Single Service · Detection Difficulty: Moderate
The Shape
A service uses a fixed-size thread pool (or goroutine pool, or connection pool) to handle concurrent requests. Each thread is assigned one request at a time and holds it until the request completes. If requests are slow (due to blocking I/O, slow downstream, or heavy computation), the pool fills with in-progress requests. New requests queue. The queue fills. New incoming connections are rejected. The service is functionally down while all threads are busy — just waiting.
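The mechanics can be reproduced with a toy pool in Python (the pool size and latencies here are illustrative, not tuned values): once every worker is blocked on a slow downstream call, even a trivially cheap request inherits the downstream latency.

```python
import time
from concurrent.futures import ThreadPoolExecutor

POOL_SIZE = 4  # stand-in for a real pool like Tomcat's 200 threads

def handle_request(downstream_latency_s):
    # The worker thread holds the request for the full duration
    # of the blocking downstream call.
    time.sleep(downstream_latency_s)
    return "ok"

pool = ThreadPoolExecutor(max_workers=POOL_SIZE)

# Saturate the pool with slow requests: every worker is now blocked.
slow = [pool.submit(handle_request, 2.0) for _ in range(POOL_SIZE)]

# A cheap request submitted now queues behind the slow ones, even
# though its own work takes 1 ms.
t0 = time.monotonic()
fast = pool.submit(handle_request, 0.001)
fast.result()
queued_for = time.monotonic() - t0
print(f"cheap request observed latency: {queued_for:.1f}s")
```

The cheap request's observed latency is roughly the slow requests' 2 s: latency tracks the downstream call, not the actual work done.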
How You'll See It
In Kubernetes
Java service with server.tomcat.threads.max=200. Downstream DB query spikes from 50ms
to 5s. 200 threads blocked for 5s each = throughput drops from 4,000 req/s to 40 req/s.
Request queue fills. Tomcat rejects connections: "Too many open connections." CPU: 5%
(threads are waiting, not computing). Looks like a network problem from the outside.
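The throughput collapse above is simple capacity arithmetic: a fully occupied blocking pool can complete at most pool_size / per-request latency requests per second (a special case of Little's law). Checking the numbers:

```python
def max_throughput(pool_size, latency_s):
    # Steady-state ceiling when every thread is busy for
    # latency_s per request.
    return pool_size / latency_s

print(max_throughput(200, 0.050))  # healthy: 50 ms queries -> 4000.0 req/s
print(max_throughput(200, 5.0))    # degraded: 5 s queries  -> 40.0 req/s
```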
In Linux/Infrastructure
Nginx + uWSGI with processes=8. Upstream app takes 30s per request (blocking on
external API). All 8 uWSGI workers blocked. Nginx queue fills (listen.backlog). New
requests return 502. uWSGI logs: "worker killed after 30s" (if harakiri is set). Without
harakiri, workers are blocked indefinitely.
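harakiri can recover the workers because each one is a whole process that can be killed from outside. A rough Python analogue of that deadline-kill behavior, using multiprocessing (function names and deadlines are illustrative):

```python
import multiprocessing as mp
import time

def blocking_handler(seconds):
    time.sleep(seconds)  # stands in for a request blocked on an external API

def run_with_deadline(target, args, deadline_s):
    # Run the handler in a child process; if it overruns the deadline,
    # kill it so the worker slot is freed (the harakiri idea).
    p = mp.Process(target=target, args=args)
    p.start()
    p.join(deadline_s)
    if p.is_alive():
        p.terminate()
        p.join()
        return "killed"
    return "completed"

if __name__ == "__main__":
    print(run_with_deadline(blocking_handler, (10,), deadline_s=0.5))
    print(run_with_deadline(blocking_handler, (0.1,), deadline_s=0.5))
```

Without the kill, the parent would simply wait, which is exactly the indefinite blocking described above.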
In CI/CD
CI executor pool with 4 executors. 4 long-running integration tests each take 30 minutes. No new builds can start for 30 minutes. Build queue grows. CI appears "stuck."
The Tell
CPU utilization is low (5–20%) but the service is unresponsive. Thread count is at maximum pool size; all threads show blocking I/O wait. Request latency matches the downstream call timeout, not the actual processing time. JVM:
jstack shows all threads in WAITING or TIMED_WAITING state on I/O.
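For non-JVM services the same evidence can be collected by hand. A minimal Python sketch of a jstack-style thread dump: snapshot every thread's current frame and look for workers all parked on the same blocking call.

```python
import sys
import threading
import time
import traceback

def blocked_worker():
    time.sleep(60)  # stands in for a blocking downstream call

def dump_threads():
    # Snapshot every live thread's current stack, jstack-style.
    frames = sys._current_frames()
    report = []
    for t in threading.enumerate():
        frame = frames.get(t.ident)
        if frame is None:
            continue
        stack = "".join(traceback.format_stack(frame))
        report.append(f"--- {t.name} ---\n{stack}")
    return "\n".join(report)

t = threading.Thread(target=blocked_worker, name="worker-1", daemon=True)
t.start()
time.sleep(0.1)  # let the worker reach its blocking call
report = dump_threads()
print("worker-1" in report and "blocked_worker" in report)
```

In a real incident, every worker thread shows the same downstream call at the top of its stack, which is the tell.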
Common Misdiagnosis
| Looks Like | But Actually | How to Tell the Difference |
|---|---|---|
| Service overloaded (high CPU) | Thread pool exhausted (low CPU, waiting) | CPU is low; threads are blocked on I/O, not computing |
| Network failure | Thread pool full, rejecting new connections | Existing connections respond (slowly); new connections refused |
| Memory pressure | Many blocked threads | Thread stack memory grows but RSS is below limit; GC is not struggling |
The Fix (Generic)
- Immediate: Add a short timeout on blocking operations to free threads; scale up the pool size temporarily.
- Short-term: Implement async I/O (non-blocking) for downstream calls; use separate thread pools per downstream dependency to contain the blast radius.
- Long-term: Move to a reactive/async model (Netty, Vert.x, async/await, goroutines with bounded channels); add jstack/goroutine-dump alerting when pool saturation is detected.
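The per-dependency pool idea (a bulkhead) can be sketched in Python; the dependency names, pool sizes, and 2 s timeout are illustrative. Note that `future.result(timeout=...)` frees only the caller: the worker thread stays blocked until the downstream call returns, which is precisely why each dependency needs its own pool.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as PoolTimeout

# Bulkhead: one small pool per downstream dependency, so a slow
# fraud-check API can only exhaust its own pool.
pools = {
    "payments": ThreadPoolExecutor(max_workers=4),
    "fraud": ThreadPoolExecutor(max_workers=4),
}

def call(dependency, fn, timeout_s=2.0):
    future = pools[dependency].submit(fn)
    # Frees the caller after timeout_s; the pool's worker thread
    # remains occupied until fn actually returns.
    return future.result(timeout=timeout_s)

def slow_fraud_check():
    time.sleep(3)  # downstream outage
    return "checked"

def fast_payment():
    return "charged"

# Saturate the fraud pool; payment calls are unaffected.
for _ in range(4):
    pools["fraud"].submit(slow_fraud_check)
print(call("payments", fast_payment))  # -> charged
```

With a single shared pool, the four blocked fraud checks would have consumed threads needed for payments; with the bulkhead, only fraud calls time out.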
Real-World Examples
- Example 1: Tomcat with threads.max=200. Downstream microservice had a GC pause (15s). 200 threads blocked. New user requests rejected at the load balancer level. CPU: 3%. Engineers added 200 more threads (increased threads.max to 400), which bought 15 more seconds before the same problem recurred at larger scale.
- Example 2: Python Django with 4 gunicorn workers + sync I/O. External payment API slowed to 8s/request. All 4 workers blocked. Site returned 502. Added async payment calls + retry logic; worker count became irrelevant.
War Story
4am pager: "payment service is down." CPU 2%, memory fine, pods running, health check passing. But every request returned 504.
jstack on the JVM: all 200 Tomcat threads in TIMED_WAITING, all waiting on the external fraud-check API, which had suffered a network partition. Our timeout was 30 seconds; threads were just sitting there waiting. No circuit breaker (FP-019). Adding a 2s timeout on the fraud check plus a circuit breaker cut the blast radius: 95% of orders processed without a fraud check (accepted risk during the outage), 5% blocked. Service recovered immediately.
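A minimal sketch of what that fix looks like in code. This is illustrative Python, not the actual service: a short timeout frees the calling thread quickly, and a breaker that opens after consecutive failures skips the fraud check entirely for the duration of the outage.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as PoolTimeout

class CircuitBreaker:
    """Opens after `threshold` consecutive failures; stays open for
    `cooldown_s`, then lets one trial call through (half-open)."""

    def __init__(self, threshold=3, cooldown_s=30.0):
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None

    def is_open(self):
        if self.opened_at is None:
            return False
        if time.monotonic() - self.opened_at >= self.cooldown_s:
            self.opened_at = None  # half-open: allow a trial call
            self.failures = 0
            return False
        return True

    def record(self, ok):
        if ok:
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = time.monotonic()

breaker = CircuitBreaker(threshold=2, cooldown_s=30.0)
fraud_pool = ThreadPoolExecutor(max_workers=2)

def checked_order(fraud_check, timeout_s=2.0):
    if breaker.is_open():
        return "processed-without-fraud-check"  # accepted risk during outage
    future = fraud_pool.submit(fraud_check)
    try:
        future.result(timeout=timeout_s)  # short timeout frees this thread
        breaker.record(ok=True)
        return "processed"
    except PoolTimeout:
        breaker.record(ok=False)
        return "processed-without-fraud-check"
```

With the 2 s timeout, a hung fraud check costs a calling thread at most 2 s instead of 30; once the breaker opens, subsequent orders skip the dependency entirely and the pool never saturates.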
Cross-References
- Topic Packs: distributed-systems, k8s-ops
- Related Patterns: FP-019 (no circuit breaker — the protection layer), FP-002 (connection pool exhaustion — same shape, connection resource)