# Pattern: No Circuit Breaker
**ID:** FP-019 · **Family:** Cascading Failure · **Frequency:** Very Common · **Blast Radius:** Multi-Service · **Detection Difficulty:** Moderate
## The Shape
A service calls a downstream dependency synchronously. The dependency becomes slow (not down, but slow). Callers block waiting for responses. Thread/goroutine pools fill up with waiting requests. New incoming requests queue. The upstream service, which was healthy, becomes unresponsive because all its workers are blocked on a slow downstream. Without a circuit breaker to short-circuit calls to the slow dependency, the upstream service acts as a cascade amplifier.
## How You'll See It

### In Kubernetes
Service A (100 pods) calls Service B synchronously (5s timeout). Service B starts responding slowly (4.9s per request). Service A's 100 threads are each blocked for 4.9s. Throughput of Service A drops from 1,000 req/s to ~20 req/s (100 threads / 4.9s). Service C, which calls Service A, starts seeing timeouts. The cascade continues upward.
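The collapse follows from simple pool arithmetic (Little's law): a fixed pool of synchronous workers can sustain at most pool size ÷ per-request latency. A sketch with the numbers above (the 100 ms healthy latency is an assumption implied by 1,000 req/s across 100 threads):

```go
package main

import "fmt"

// capacity returns the maximum sustainable throughput (req/s) of a
// fixed pool of synchronous workers: pool size / per-request latency.
func capacity(workers, latencySeconds float64) float64 {
	return workers / latencySeconds
}

func main() {
	// Healthy: assumed 100 ms per request across 100 threads.
	fmt.Printf("healthy:  %.0f req/s\n", capacity(100, 0.1)) // 1000 req/s
	// Degraded: downstream takes 4.9 s, just under the 5 s timeout.
	fmt.Printf("degraded: %.1f req/s\n", capacity(100, 4.9)) // ~20 req/s
}
```

Note the timeout never fires (4.9 s < 5 s), so every request "succeeds" while throughput drops fiftyfold.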
### In Linux/Infrastructure
Application server uses a thread-per-request model. Downstream DB queries average 200ms normally but spike to 10s during a large batch job. All 200 request-handler threads block for 10s. New connections queue, then time out. The application server appears "down" while the database is merely slow.
### In CI/CD
Build pipeline calls an external linting service synchronously. The service is slow (not down). All parallel build jobs block, filling the CI runner queue. CI system appears saturated when the underlying issue is a single slow external call.
### In Networking

The TCP keepalive timeout (default 2 hours on Linux: `net.ipv4.tcp_keepalive_time` = 7200) means connections to a slow service are held open for a long time. Without application-level timeouts, a connection pool fills with connections that are "open" but not responsive.
## The Tell
Upstream service health check passes (it's responding), but latency is very high. Thread/goroutine/connection pool is at or near maximum. Downstream service is slow (not down): it eventually responds. The upstream service's metrics show 100% of workers in "waiting" state.
## Common Misdiagnosis
| Looks Like | But Actually | How to Tell the Difference |
|---|---|---|
| Upstream overloaded | Upstream blocked on slow downstream | Upstream CPU low; all threads waiting on I/O |
| Downstream down | Downstream slow | Downstream responds eventually; upstream timeout fires before that |
| Traffic spike | Cascade from slow dependency | Incoming traffic flat; response time increase exactly tracks downstream latency |
## The Fix (Generic)
- Immediate: Add a short timeout on the downstream call; kill requests that take too long; free the worker.
- Short-term: Implement a circuit breaker: after N failures in a window, stop calling the downstream for T seconds (short-circuit to error/cached response); try again after the break.
- Long-term: Use a circuit breaker library (resilience4j for the JVM, hystrix-go for Go; Netflix's original Hystrix is in maintenance mode); set timeouts at each service boundary; instrument the circuit state (open/half-open/closed) as a metric.
## Real-World Examples
- Example 1: Payment service called an address validation API (synchronous) with a 30s timeout. During an incident in which the API's geocoding database slowed down, all 50 payment service threads blocked for the full 30s. Checkout was completely unavailable for 8 minutes.
- Example 2: API gateway called 4 downstream services sequentially. One service became slow (3s instead of 200ms). 100% of gateway threads blocked. The gateway returned 504 for all requests, regardless of which downstream was needed.
## War Story
It was a Friday afternoon deploy — not even our service. Someone deployed a slow database migration on the user-auth service. Auth started responding in 8s instead of 200ms. Within 90 seconds, our API gateway was completely unresponsive. Every request hit auth (because every endpoint required auth); all 200 goroutines were blocked waiting for auth responses. We had a 10s timeout (barely longer than auth's actual response time) but no circuit breaker. The auth service was working — just slowly — and it was killing us. Circuit breakers would have opened after 5 failures and returned "auth unavailable" immediately, preserving the API gateway for non-auth paths.
## Cross-References
- Topic Packs: distributed-systems, k8s-ops
- Footguns: distributed-systems/footguns.md — "No circuit breakers for downstream calls"
- Related Patterns: FP-020 (missing backpressure — the queue-side of the same problem), FP-023 (thread pool exhaustion — the mechanism), FP-009 (retry storm — what happens when you add retries without circuit breakers)