
Mental Model: Circuit Breaker

Category: Architecture & Design
Origin: Michael Nygard, Release It! (2007); popularized by Martin Fowler
One-liner: When a downstream service is failing, stop calling it immediately and fail fast — protecting both the caller and the callee from cascading overload.

The Model

A circuit breaker in electrical systems does one thing: when current exceeds a safe threshold, it opens the circuit before the wire melts. It doesn't try to diagnose the root cause. It doesn't retry. It just stops current from flowing, preventing the failure from spreading to adjacent systems. Michael Nygard applied this exact logic to distributed systems: when a remote call is failing at a high rate, the caller should stop making that call immediately, return an error or a fallback response, and wait before trying again.

The critical insight is about what happens without a circuit breaker. A slow or failing downstream service causes callers to queue up requests, waiting for timeouts. Each waiting thread holds resources — memory, connections, file descriptors. If the timeout is 30 seconds and requests arrive at 100/second, within 30 seconds you have 3,000 threads waiting. Your service runs out of threads. Now your service is unavailable. The downstream failure has cascaded upstream. The circuit breaker prevents this by failing immediately once the failure threshold is crossed, rather than letting callers pile up waiting.
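The arithmetic above is Little's law: the number of concurrently waiting callers is roughly the arrival rate times how long each one waits. A quick sketch (the function name is illustrative):

```python
def concurrent_waiters(arrival_rate_per_s: float, wait_s: float) -> float:
    # Little's law: L = lambda * W.
    # Callers arriving at `arrival_rate_per_s`, each blocking for `wait_s`,
    # accumulate to roughly this many simultaneous waiters.
    return arrival_rate_per_s * wait_s

# 100 req/s against a 30-second timeout, as in the text:
print(concurrent_waiters(100, 30))  # 3000 threads stuck waiting
```

This is why shortening the timeout alone helps but doesn't suffice: halving the timeout only halves the pile-up, while an open circuit drops the wait to near zero.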

The circuit breaker has three states. Closed (normal operation): calls pass through, successes and failures are counted. Open (failing fast): calls are rejected immediately without attempting the remote call; a timeout is started. Half-Open (testing recovery): after the timeout expires, a limited number of probe calls are allowed through; if they succeed, the breaker closes; if they fail, the breaker re-opens and the timeout resets. The state transitions are driven by configurable thresholds: failure rate percentage (e.g., open when >50% of calls in the last 10 seconds fail) and minimum request volume (don't open on 1 failure out of 2 requests).
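To make the three states and their transitions concrete, here is a minimal, illustrative sketch of the state machine (class and parameter names are our own, not any particular library's API; real implementations add sliding windows, thread safety, and usually require all half-open probes to succeed, not just one):

```python
import time

CLOSED, OPEN, HALF_OPEN = "closed", "open", "half-open"

class SimpleBreaker:
    def __init__(self, failure_threshold=0.5, min_calls=10,
                 open_seconds=30.0, half_open_probes=3, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # e.g., open when >50% fail
        self.min_calls = min_calls                  # minimum request volume
        self.open_seconds = open_seconds            # wait before probing
        self.half_open_probes = half_open_probes    # probe budget in half-open
        self.clock = clock
        self.state = CLOSED
        self.successes = 0
        self.failures = 0
        self.opened_at = 0.0
        self.probes_left = 0

    def allow_request(self) -> bool:
        if self.state == OPEN:
            if self.clock() - self.opened_at < self.open_seconds:
                return False                        # fail fast, no remote call
            self.state = HALF_OPEN                  # timeout expired: probe
            self.probes_left = self.half_open_probes
        if self.state == HALF_OPEN:
            if self.probes_left == 0:
                return False                        # probe budget spent
            self.probes_left -= 1
        return True

    def record(self, success: bool) -> None:
        if self.state == HALF_OPEN:
            if success:
                self.state = CLOSED                 # recovery confirmed
            else:
                self._trip()                        # re-open, restart timeout
            return
        self.successes += int(success)
        self.failures += int(not success)
        total = self.successes + self.failures
        if total >= self.min_calls and self.failures / total > self.failure_threshold:
            self._trip()

    def _trip(self) -> None:
        self.state = OPEN
        self.opened_at = self.clock()
        self.successes = self.failures = 0
```

Callers pair the two methods: only issue the remote call when `allow_request()` returns True, then report the outcome with `record()`.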

The boundary conditions matter. Circuit breakers are meaningful only for calls that cross a network boundary or an unreliable resource boundary — database connections, HTTP clients, gRPC stubs, message queue producers. Wrapping in-memory operations in a circuit breaker adds overhead with no benefit. Circuit breakers also require that callers can meaningfully handle the "open circuit" response — either with a fallback (serve cached data, return a degraded response) or with graceful error propagation. If the only option is to crash anyway, the circuit breaker merely defers the failure without improving it.

Circuit breakers interact with timeouts in an important way: a circuit breaker without timeouts doesn't protect against slow calls, only failed ones. A timeout without a circuit breaker still lets every caller wait the full timeout before failing. The two must be combined: short timeouts bound resource consumption per call; the circuit breaker prevents even short waits from accumulating into system-wide overload.

Visual

STATE MACHINE:

                   failure rate > threshold
                   (e.g., >50% over 10s)
   ┌─────────────┐ ───────────────────────► ┌─────────────┐
   │   CLOSED    │                          │    OPEN     │
   │  (normal)   │                          │ (fail fast) │
   └─────────────┘                          └─────────────┘
         ▲                                        │
         │ probe succeeds                         │ after timeout
         │                                        │ (e.g., 30s)
         │           ┌─────────────┐              │
         └────────── │  HALF-OPEN  │ ◄────────────┘
                     │  (probing)  │ ── probe fails ──► back to OPEN
                     └─────────────┘

  (CLOSED: all calls pass through; OPEN: calls rejected immediately)

CALL FLOW (CLOSED state):
  caller ──► circuit breaker ──► downstream service
               counts successes/failures
               if failure_rate > threshold  transition to OPEN

CALL FLOW (OPEN state):
  caller ──► circuit breaker ──X  (immediate rejection, no network call)
               returns fallback or error immediately
               starts recovery timeout countdown

CONFIGURATION KNOBS:
┌──────────────────────────────────────────────────────┐
│ failure_rate_threshold       = 50%                   │
│ minimum_number_of_calls      = 10   (per window)     │
│ sliding_window_size          = 10s  (or 10 calls)    │
│ wait_duration_in_open_state  = 30s                   │
│ permitted_calls_in_half_open = 3                     │
└──────────────────────────────────────────────────────┘

The same state machine in Mermaid notation:

stateDiagram-v2
    [*] --> Closed
    Closed --> Open : failure rate > threshold
    Open --> HalfOpen : timeout expires
    HalfOpen --> Closed : probe succeeds
    HalfOpen --> Open : probe fails

    Closed : Normal operation
    Closed : Calls pass through
    Open : Fail fast
    Open : Reject immediately
    HalfOpen : Limited probe calls
    HalfOpen : Testing recovery

When to Reach for This

  • Any service that makes synchronous calls to a downstream dependency (database, external API, microservice, DNS resolver)
  • You've experienced cascading failures where one service going down took out multiple others due to thread exhaustion or connection pool exhaustion
  • You're implementing retries and want to avoid retry storms — the circuit breaker prevents retrying against a system that's clearly down
  • You need graceful degradation: when the circuit is open, serve a cached response, a static fallback, or a "service temporarily unavailable" message rather than hanging
  • You're in a service mesh environment and want application-level circuit breaking in addition to the mesh-level circuit breaking (defense in depth)
  • Your service has hard latency SLAs and a single slow downstream call cannot be allowed to consume the entire timeout budget
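The retry-storm bullet above is worth showing in code: retries and breakers compose by checking circuit state before every attempt. A hedged sketch (the function names and the `is_open` probe are illustrative, not a specific library's API):

```python
import time

class CircuitOpen(Exception):
    """Raised instead of retrying into a circuit that is already open."""

def call_with_retries(call, is_open, max_attempts=3, base_delay=0.05):
    # Check the breaker BEFORE each attempt: if the circuit is open,
    # the dependency is known-bad and retrying would only add load.
    last_exc = None
    for attempt in range(max_attempts):
        if is_open():
            raise CircuitOpen("circuit open; skipping retries")
        try:
            return call()
        except Exception as exc:
            last_exc = exc
            if attempt < max_attempts - 1:
                time.sleep(base_delay * (2 ** attempt))  # exponential backoff
    raise last_exc
```

With the check inside the loop, a circuit that opens mid-retry-sequence also cuts the remaining attempts short, instead of letting the backoff schedule play out against a dead dependency.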

When NOT to Use This

  • Wrapping calls that are already in-process (local function calls, in-memory caches) — there's no network latency to protect against; you add overhead with no benefit
  • Using a circuit breaker as a substitute for fixing the downstream service — a circuit breaker manages failure, it doesn't fix root causes; the on-call engineer still needs to investigate
  • Setting thresholds so low that the breaker opens on normal transient errors (1–2 failures) — this creates false positives that break healthy traffic and are indistinguishable from real outages in metrics
  • Relying on the circuit breaker to protect against slow calls without also setting timeouts — an open circuit on a 10% failure rate doesn't help if the other 90% of calls are hanging for 30 seconds each

Applied Examples

Example 1: Python service calling a flaky external payments API

Using the pybreaker library:

import pybreaker
import requests

# Configure: open after 5 consecutive failures; try again after 30s
payment_breaker = pybreaker.CircuitBreaker(
    fail_max=5,
    reset_timeout=30,
    exclude=[requests.exceptions.HTTPError]  # HTTP status errors (4xx/5xx from raise_for_status) don't count as breaker failures
)

@payment_breaker
def charge_card(order_id: str, amount_cents: int) -> dict:
    response = requests.post(
        "https://payments.example.com/charge",
        json={"order_id": order_id, "amount": amount_cents},
        timeout=5.0  # timeout + circuit breaker, always paired
    )
    response.raise_for_status()
    return response.json()

def process_order(order):
    try:
        result = charge_card(order.id, order.total_cents)
        return {"status": "charged", "transaction_id": result["id"]}
    except pybreaker.CircuitBreakerError:
        # Circuit is OPEN — fail fast with a clear user message
        return {"status": "payment_unavailable", "retry_after": 30}
    except requests.exceptions.Timeout:
        return {"status": "payment_timeout"}

When the payments API starts failing, charge_card accumulates failures. After the 5th failure, CircuitBreakerError is raised immediately on every subsequent call — no network request is made. The caller gets a fast response it can handle, rather than waiting 5 seconds per call while requests pile up.

Example 2: Kubernetes DNS resolution and CoreDNS

A service makes many short-lived DNS lookups for dynamic service discovery. CoreDNS becomes overloaded. Without a circuit breaker, every service instance blocks on DNS resolution for the full 5-second timeout, exhausting goroutines/threads. With a circuit breaker around the DNS-dependent code path:

// Using the sony/gobreaker library (imports assumed: context, net, time,
// github.com/sony/gobreaker; localCache is an application-level cache
// defined elsewhere)
var dnsBreaker = gobreaker.NewCircuitBreaker(gobreaker.Settings{
    Name:        "dns-lookup",
    MaxRequests: 3,                    // half-open: 3 probes
    Interval:    10 * time.Second,     // count window
    Timeout:     30 * time.Second,     // open → half-open after 30s
    ReadyToTrip: func(counts gobreaker.Counts) bool {
        return counts.ConsecutiveFailures > 5
    },
})

func resolveServiceEndpoint(serviceName string) (string, error) {
    result, err := dnsBreaker.Execute(func() (interface{}, error) {
        ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
        defer cancel()
        addrs, err := net.DefaultResolver.LookupHost(ctx, serviceName)
        if err != nil {
            return nil, err
        }
        return addrs[0], nil
    })
    if err != nil {
        if err == gobreaker.ErrOpenState || err == gobreaker.ErrTooManyRequests {
            // Circuit is open (or the half-open probe budget is spent):
            // serve from a local DNS cache if available
            return localCache.Get(serviceName)
        }
        return "", err // genuine lookup failure
    }
    return result.(string), nil
}

The circuit breaker prevents the DNS overload from amplifying into application thread exhaustion — a common pattern in Kubernetes environments where CoreDNS becomes a bottleneck under load.

Example 3: Instrumenting circuit breaker state as an operational metric

A circuit breaker that opens silently is nearly useless operationally — you won't know the payments API is down until users complain. The breaker state must emit metrics:

import pybreaker
import prometheus_client as prom

# Prometheus metrics for circuit breaker observability
CB_STATE = prom.Gauge(
    "circuit_breaker_state",
    "Circuit breaker state: 0=closed, 1=open, 2=half-open",
    ["service"]
)
CB_TRANSITIONS = prom.Counter(
    "circuit_breaker_state_transitions_total",
    "Number of state transitions",
    ["service", "from_state", "to_state"]
)

class InstrumentedListener(pybreaker.CircuitBreakerListener):
    def __init__(self, service_name: str):
        self.service_name = service_name

    def state_change(self, cb, old_state, new_state):
        state_map = {"closed": 0, "open": 1, "half-open": 2}
        CB_STATE.labels(service=self.service_name).set(
            state_map.get(new_state.name, -1)
        )
        CB_TRANSITIONS.labels(
            service=self.service_name,
            from_state=old_state.name,
            to_state=new_state.name,
        ).inc()

payment_breaker = pybreaker.CircuitBreaker(
    fail_max=5,
    reset_timeout=30,
    listeners=[InstrumentedListener("payments-api")],
)

With this instrumentation, a Grafana alert fires the moment circuit_breaker_state{service="payments-api"} == 1. The on-call engineer sees the payments circuit opened at 14:32 UTC, correlates with a spike in payment gateway 503s, and routes the alert to the payments team — all before the first user complaint arrives. The circuit breaker becomes an early warning system, not just a protection mechanism.
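The alert described above can be expressed as a Prometheus alerting rule. A sketch, assuming the gauge exported earlier (rule, severity, and team labels are illustrative):

```yaml
groups:
  - name: circuit-breakers
    rules:
      - alert: PaymentsCircuitBreakerOpen
        # circuit_breaker_state is the gauge from the listener: 1 == open
        expr: circuit_breaker_state{service="payments-api"} == 1
        for: 0m                  # page immediately; an open circuit is already damped
        labels:
          severity: page
          team: payments         # route to the owning team, not the caller
        annotations:
          summary: "Circuit breaker for payments-api is OPEN"
          description: "Callers are failing fast; investigate the payments gateway."
```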

The Junior vs Senior Gap

Junior: Adds unlimited retries with exponential backoff and wonders why the downstream service gets worse under failure
Senior: Pairs retries with a circuit breaker; retries operate only when the circuit is closed

Junior: Sets the same timeout for all downstream calls regardless of their SLA
Senior: Calibrates per-dependency: tight timeout on the payment API (5s), looser on the batch export API (30s)

Junior: Monitors service errors in aggregate; misses that one downstream dependency is causing 95% of failures
Senior: Instruments circuit breaker state transitions as metrics; alerts on circuit-opening events

Junior: Treats a circuit opening as an incident
Senior: Treats a circuit opening as a signal that a different team's service has an incident; routes the alert correctly

Junior: Returns HTTP 500 when the circuit is open
Senior: Returns HTTP 503 with a Retry-After header, allowing upstream load balancers to handle it gracefully

Junior: Wraps every function in a circuit breaker "for safety"
Senior: Applies circuit breakers selectively to I/O-bound, network-crossing calls only
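The 500-vs-503 distinction is worth making concrete. A framework-agnostic sketch (the handler shape and the `CircuitOpenError` name are illustrative stand-ins for your web framework and breaker library):

```python
class CircuitOpenError(Exception):
    """Stand-in for the breaker library's open-circuit exception."""

def handle_checkout(charge, retry_after_s: int = 30):
    # Returns (status, headers, body). 503 plus Retry-After tells upstream
    # load balancers and clients this is a temporary, retryable condition;
    # a bare 500 would look like a bug in THIS service and pollute its
    # error-rate SLOs.
    try:
        return 200, {}, charge()
    except CircuitOpenError:
        headers = {"Retry-After": str(retry_after_s)}
        return 503, headers, {"error": "payments temporarily unavailable"}
```

Keeping the Retry-After value in sync with the breaker's open-state duration means clients that honor the header naturally come back around the time the circuit enters half-open.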

Connections

  • Complements: Bulkhead (circuit breakers stop calls when the failure rate spikes; bulkheads prevent one consumer's heavy load from exhausting the connection pool for other consumers; deploy both for defense in depth against dependency failures)
  • Complements: Idempotency (when the circuit closes and calls resume, you need idempotent operations so that retried requests don't cause duplicate side effects)
  • Tensions: Strangler Fig (during a migration where the new system receives partial traffic, a circuit breaker on the new system that opens frequently may mask legitimate bugs rather than protect against genuine overload; don't paper over new-system defects with breaker state)
  • Topic Packs: kubernetes, service-mesh
  • Case Studies: coredns-timeout-pod-dns (DNS timeouts cascading into application thread exhaustion — circuit breaker around DNS-dependent paths would have bounded the blast radius), dns-resolution-slow (slow resolution without fail-fast behavior causes latency to bleed into unrelated request paths)