Pattern: Thread Pool Exhaustion

ID: FP-023 | Family: Cascading Failure | Frequency: Common | Blast Radius: Single Service | Detection Difficulty: Moderate

The Shape

A service uses a fixed-size thread pool (or goroutine pool, or connection pool) to handle concurrent requests. Each thread is assigned one request at a time and holds it until the request completes. If requests are slow (due to blocking I/O, slow downstream, or heavy computation), the pool fills with in-progress requests. New requests queue. The queue fills. New incoming connections are rejected. The service is functionally down while all threads are busy — just waiting.
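The shape above can be reproduced in a few lines. This is a hedged, minimal sketch: a 4-thread pool stands in for a 200-thread Tomcat pool, and `time.sleep` stands in for slow blocking I/O; all names are illustrative.

```python
import time
from concurrent.futures import ThreadPoolExecutor

POOL_SIZE = 4
pool = ThreadPoolExecutor(max_workers=POOL_SIZE)

def handle_request(i):
    time.sleep(0.5)  # blocking I/O: the thread is held, not computing
    return i

start = time.monotonic()
# Submit twice as many requests as there are threads.
futures = [pool.submit(handle_request, i) for i in range(POOL_SIZE * 2)]
results = [f.result() for f in futures]
elapsed = time.monotonic() - start

# The second batch queued behind the first: total time is ~two downstream
# latencies, not one, even though the CPU was idle the whole time.
print(f"{len(results)} requests in {elapsed:.1f}s")
```

Scale the numbers up and the same queuing delay becomes rejected connections once the queue bound is hit.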

How You'll See It

In Kubernetes

Java service with server.tomcat.threads.max=200. Downstream DB query spikes from 50ms to 5s. 200 threads blocked for 5s each = throughput drops from 4,000 req/s to 40 req/s. Request queue fills. Tomcat rejects connections: "Too many open connections." CPU: 5% (threads are waiting, not computing). Looks like a network problem from the outside.
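The throughput collapse follows directly from Little's Law: sustainable throughput equals pool size divided by per-request latency. A quick check of the numbers above:

```python
# Little's Law: throughput (req/s) = concurrent threads / latency per request (s)
threads = 200

healthy_tput = threads / 0.050   # 50 ms downstream query
degraded_tput = threads / 5.0    # same query at 5 s

print(f"{healthy_tput:.0f} req/s -> {degraded_tput:.0f} req/s")
```

No amount of CPU helps here; only latency or pool size changes the result.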

In Linux/Infrastructure

Nginx + uWSGI with processes=8. Upstream app takes 30s per request (blocking on external API). All 8 uWSGI workers blocked. Nginx queue fills (listen.backlog). New requests return 502. uWSGI logs: "worker killed after 30s" (if harakiri is set). Without harakiri, workers are blocked indefinitely.
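A uWSGI configuration for this scenario might look like the sketch below. The `harakiri` and `listen` options are real uWSGI settings; the specific values are illustrative and should be tuned to your app.

```ini
[uwsgi]
processes = 8
; Kill any worker stuck on a single request longer than 30s, freeing the slot.
harakiri = 30
; Bound the kernel accept backlog so overload fails fast instead of piling up.
listen = 128
```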

In CI/CD

CI executor pool with 4 executors. 4 long-running integration tests each take 30 minutes. No new builds can start for 30 minutes. Build queue grows. CI appears "stuck."

The Tell

CPU utilization is low (5–20%) but the service is unresponsive. Thread count is at maximum pool size; all threads show blocking I/O wait. Request latency matches the downstream call timeout, not the actual processing time. JVM: jstack shows all threads in WAITING or TIMED_WAITING state on I/O.
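The saturation tell can also be surfaced directly instead of inferred from thread dumps. A hedged sketch, with illustrative names: wrap submissions so an in-flight counter exposes "pool full while CPU is idle" as a single boolean you can alert on.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

POOL_SIZE = 4
pool = ThreadPoolExecutor(max_workers=POOL_SIZE)
in_flight = 0
lock = threading.Lock()

def tracked_submit(fn, *args):
    """Submit a task while tracking how many are in flight."""
    global in_flight
    with lock:
        in_flight += 1
    def wrapper():
        global in_flight
        try:
            return fn(*args)
        finally:
            with lock:
                in_flight -= 1
    return pool.submit(wrapper)

def slow_io():
    time.sleep(0.5)  # blocked on I/O, CPU idle

futs = [tracked_submit(slow_io) for _ in range(POOL_SIZE)]
time.sleep(0.1)  # let workers pick the tasks up

saturated = in_flight >= POOL_SIZE  # the tell: every slot busy, CPU near zero
print(saturated)
for f in futs:
    f.result()
```

In a real service this counter would feed a gauge metric rather than a print statement.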

Common Misdiagnosis

Looks Like | But Actually | How to Tell the Difference
Service overloaded (high CPU) | Thread pool exhausted (low CPU, waiting) | CPU is low; threads are blocked on I/O, not computing
Network failure | Thread pool full, rejecting new connections | Existing connections respond (slowly); new connections are refused
Memory pressure | Many blocked threads | Thread stack memory grows but RSS stays below the limit; GC is not struggling

The Fix (Generic)

  1. Immediate: Add a short timeout on blocking operations to free threads; scale up the pool size temporarily.
  2. Short-term: Implement async I/O (non-blocking) for downstream calls; use separate thread pools per downstream dependency to contain the blast radius.
  3. Long-term: Move to a reactive/async model (Netty, Vert.x, async/await, goroutines with bounded channels); add jstack/goroutine dump alerting when pool saturation is detected.
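Steps 1 and 2 can be sketched together: a short caller-side timeout plus one bounded pool per downstream dependency ("bulkheads"), so a stalled dependency exhausts only its own pool. Pool sizes, names, and the 1s stall are all illustrative assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

fraud_pool = ThreadPoolExecutor(max_workers=2)     # bulkhead per dependency
payments_pool = ThreadPoolExecutor(max_workers=2)

def stalled_call():
    time.sleep(1.0)  # simulated hung dependency

def healthy_call():
    return "ok"

# Saturate the fraud pool with hung calls...
for _ in range(4):
    fraud_pool.submit(stalled_call)

# ...while the payments pool stays responsive.
healthy = payments_pool.submit(healthy_call).result(timeout=1)
print(healthy)

# Step 1, the short timeout: give up fast instead of holding a thread.
timed_out = False
try:
    fraud_pool.submit(stalled_call).result(timeout=0.2)
except FutureTimeout:
    timed_out = True
    print("fraud check timed out; fail fast / fallback")

fraud_pool.shutdown(wait=False, cancel_futures=True)
```

Note the honest caveat: the timeout frees the caller, but the worker thread inside the bulkhead stays blocked until the downstream call returns, which is why the bulkhead must be bounded.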

Real-World Examples

  • Example 1: Tomcat with threads.max=200. Downstream microservice had a 15s GC pause. All 200 threads blocked. New user requests were rejected at the load balancer. CPU: 3%. Engineers raised threads.max to 400, which bought 15 more seconds before the same failure recurred at larger scale.
  • Example 2: Python Django with 4 gunicorn workers + sync I/O. External payment API slowed to 8s/request. All 4 workers blocked. Site returned 502. Added async payment calls + retry logic; worker count became irrelevant.

War Story

4am pager: "payment service is down." CPU 2%, memory fine, pods running, health check passing. But every request returned 504. jstack on the JVM: all 200 Tomcat threads in TIMED_WAITING — all waiting on the external fraud-check API that had a network partition. Our timeout was 30 seconds; threads were just sitting there waiting. No circuit breaker (FP-019). Adding a 2s timeout on fraud check + circuit breaker cut the blast radius: 95% of orders processed without fraud check (accepted risk during outage), 5% blocked. Service recovered immediately.
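The fix in that story combines the 2s timeout with a circuit breaker (FP-019). A hedged, minimal breaker sketch; thresholds and names are illustrative, and production code would use a maintained library instead:

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, fallback):
        # While open, skip the dependency entirely and take the fallback.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback()
            self.opened_at = None  # half-open: try the dependency again
            self.failures = 0
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback()
        self.failures = 0
        return result

def fraud_check():
    raise TimeoutError("fraud API unreachable")  # simulated outage

breaker = CircuitBreaker()
outcomes = [breaker.call(fraud_check, lambda: "accepted-without-fraud-check")
            for _ in range(5)]
print(outcomes[-1])  # once open, orders flow without the fraud check
```

After the threshold trips, every call returns the fallback immediately, which is exactly what turned a full outage into the 95%-accepted degraded mode described above.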

Cross-References

  • Topic Packs: distributed-systems, k8s-ops
  • Related Patterns: FP-019 (no circuit breaker — the protection layer), FP-002 (connection pool exhaustion — same shape, connection resource)