# Postmortem: Missing Circuit Breaker Lets Redis Failure Cascade to All Services
| Field | Value |
|---|---|
| ID | PM-011 |
| Date | 2025-03-04 |
| Severity | SEV-2 |
| Duration | 33m (first error to post-incident all-clear) |
| Time to Detect | ~2m |
| Time to Mitigate | ~19m |
| Customer Impact | ~14,000 authenticated users unable to make API requests for 19 minutes; all 8 customer-facing services returned 503 or hung indefinitely |
| Revenue Impact | ~$31,000 estimated (19 minutes of API unavailability × average transaction rate) |
| Teams Involved | Platform Engineering, Backend Services, SRE, Redis Infrastructure |
| Postmortem Author | Danielle Osei |
| Postmortem Date | 2025-03-07 |
## Executive Summary
On March 4, 2025, the Redis primary instance for session caching exhausted its memory allocation and entered a hard-blocked state due to a `noeviction` policy. All 8 microservices that depend on Redis for session validation began blocking on 30-second connection timeouts, exhausting their shared thread pools within approximately 2 minutes and rendering the entire API layer unresponsive — including endpoints with no Redis dependency. Redis Sentinel promoted a replica to primary within 90 seconds, but by that point the thread pools across all services were saturated and the services could not self-recover. Full restoration required roughly 10 minutes of rolling restarts across all affected services. The incident exposed a systemic design flaw: no circuit breakers existed to isolate Redis-dependent code paths from the broader service thread pools.
## Timeline (All times UTC)
| Time | Event |
|---|---|
| 09:14:02 | Redis primary (cache-primary-01) memory usage reaches 99.8% of its 8 GB allocation; the `noeviction` policy causes new write commands to return `OOM command not allowed` |
| 09:14:11 | First application errors logged: user-service begins receiving `READONLY` and `OOM` errors on `SET SESSION:*` calls |
| 09:14:30 | user-service, order-service, notification-service all queuing requests waiting for Redis connections to free |
| 09:15:18 | gateway-service health checks start timing out; upstream load balancer begins logging 5xx responses |
| 09:15:44 | PagerDuty alert fires: "API error rate > 10% for 60 seconds" — on-call SRE Tomasz Wiśniewski is paged |
| 09:15:55 | Redis Sentinel detects primary unavailability (write failures), begins failover sequence |
| 09:16:28 | Redis Sentinel promotes cache-replica-02 to primary; replication lag at promotion: 2.1 seconds |
| 09:16:45 | Tomasz acknowledges page, begins investigation; observes all services returning 503 |
| 09:17:10 | Tomasz confirms Redis promotion succeeded; expected self-healing did not occur |
| 09:17:52 | Escalation to Platform Engineering lead Claudia Hartmann |
| 09:18:30 | Claudia identifies thread pool exhaustion in user-service metrics: 0/200 threads available |
| 09:19:00 | War room opened; Backend Services team joins |
| 09:21:15 | Root cause confirmed: all 8 services have exhausted thread pools waiting on hung Redis connections from before the failover |
| 09:23:00 | Rolling restart initiated, starting with user-service (highest traffic) |
| 09:25:40 | user-service recovers; partial API functionality restored |
| 09:31:10 | 6 of 8 services restarted and healthy; analytics-service and audit-service restarted last (lower priority) |
| 09:33:00 | Error rate returns to baseline; all services healthy; incident declared mitigated |
| 09:47:00 | Post-incident monitoring confirms no recurrence; Redis memory usage stabilized at 61% after automatic eviction policy change applied by Redis Infra |
## Impact

### Customer Impact
Approximately 14,200 authenticated users received 503 errors or experienced hung HTTP connections for 19 minutes. Unauthenticated endpoints (e.g., public landing pages, health checks) were also affected because they shared thread pools with authenticated endpoints in 6 of the 8 services. The checkout flow was completely unavailable; approximately 340 in-progress transactions were interrupted and had to be retried by customers.
### Internal Impact
- SRE team: 2 engineers × 1.5 hours = 3 engineering-hours on incident response
- Platform Engineering: 3 engineers × 2 hours = 6 engineering-hours (root cause analysis + rolling restarts)
- Backend Services: 4 engineers × 1 hour = 4 engineering-hours (service-specific investigation)
- Postmortem and remediation planning: ~8 additional engineering-hours
- Two sprint stories were delayed due to engineers being pulled into incident response
### Data Impact
No durable data loss occurred. Redis replica promotion completed with 2.1 seconds of replication lag; session writes made in that window were lost, requiring those users to re-authenticate. Estimated affected sessions: ~80 users.
## Root Cause

### What Happened (Technical)
Redis was configured with `maxmemory 8gb` and `maxmemory-policy noeviction`. The `noeviction` policy means Redis returns an error on write commands when memory is full rather than evicting existing keys. Over the preceding 6 hours, session key TTLs had been extended from 1 hour to 8 hours as part of a "remember me" feature rollout, which tripled the average memory footprint of the session keyspace. The memory limit was reached at 09:14:02.
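For reference, the eviction policy can be inspected and changed at runtime with standard `redis-cli` commands; this mirrors the remediation in action item PM-011-01. The host name below is taken from the timeline, and the change must also be persisted in `redis.conf` to survive a restart:

```shell
# Inspect the live policy (values shown mirror the incident config).
redis-cli -h cache-primary-01 CONFIG GET maxmemory-policy
# 1) "maxmemory-policy"
# 2) "noeviction"

# Switch to LRU eviction: a full cache now evicts old sessions
# (forcing a re-login) instead of rejecting all writes.
redis-cli -h cache-primary-01 CONFIG SET maxmemory-policy allkeys-lru

# Persist the same settings in redis.conf:
#   maxmemory 8gb
#   maxmemory-policy allkeys-lru
```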
All 8 microservices use a shared Redis client library (`libsession-go` v1.4.2) with a default connection timeout of 30 seconds and a pool size of 200 connections each. When Redis began returning `OOM` errors on writes, the session middleware in each service treated this as a transient error and held the connection open, waiting for the configured timeout before releasing it back to the pool. Because all connections were occupied, new inbound HTTP requests queued for an available thread — which never came. Within approximately 2 minutes, all thread pools across all 8 services were fully saturated.
No circuit breaker existed at any layer. The services had no mechanism to detect that Redis was unhealthy and to fast-fail or fall back to a degraded mode. Critically, HTTP endpoints that have no Redis dependency (e.g., `/healthz`, `/metrics`, internal admin endpoints) are served by the same Goroutine/thread pool as session-authenticated endpoints. The Redis failure effectively starved all request processing capacity regardless of whether the request needed Redis at all.
Redis Sentinel performed a correct and timely failover — the replica was promoted within 90 seconds. However, by that time the thread pools were already saturated, and the newly promoted primary could not accept new connections fast enough to drain the backlog. The hung connections from the pre-failover period did not close until their 30-second timeout elapsed or the process was restarted.
Rolling restarts were the only recovery path because restarting a service cleared its thread pool and allowed it to establish fresh connections to the new Redis primary with a clean state.
### Contributing Factors
- **`noeviction` policy on a session cache**: The `noeviction` policy is appropriate for caches storing data that cannot be safely evicted (e.g., rate-limit counters). Session caches, where eviction simply forces a re-authentication, are a poor fit. The policy was set at initial provisioning and never revisited as the use case evolved.
- **30-second connection timeout with no circuit breaker**: A 30-second timeout is appropriate for durable storage but is far too long for an in-memory cache. With 200 threads per service and 8 services, a 30-second timeout creates a window of up to 30 seconds in which 1,600 threads can be consumed by a single downstream failure. No circuit breaker existed to detect the failure and reject new Redis operations immediately.
- **Shared thread pool between Redis-dependent and Redis-independent endpoints**: Services had no bulkheads between request paths that require Redis and those that do not. When the Redis connection pool was exhausted, the failure propagated to the entire service rather than being isolated to session-dependent endpoints.
### What We Got Lucky About
- Redis Sentinel was configured and working correctly. An unmonitored standalone Redis instance would have had no automatic failover, and the outage would have lasted until manual intervention — likely 30–60 additional minutes at minimum.
- The Redis replica's replication lag at promotion was only 2.1 seconds, meaning almost no session data was lost. If the lag had been larger, we would have faced a mass re-authentication wave on top of the outage.
- No persistent data stores were involved. If Redis had been used for anything other than session caching (e.g., as a primary write store), the data loss implications of the OOM condition would have been significantly more severe.
## Detection

### How We Detected
The incident was detected by a PagerDuty alert configured on the API gateway's 5xx error rate exceeding 10% for 60 consecutive seconds. The alert fired at 09:15:44, approximately 102 seconds after the first Redis OOM error appeared in application logs.
### Why We Didn't Detect Sooner
Redis memory utilization was monitored, but the alert threshold was set at 95% with a 5-minute evaluation window. Memory had grown slowly (~0.5% per hour) over the previous 6 hours; a morning traffic spike then pushed it from 92% to 100% in under 3 minutes, too fast for the 5-minute evaluation window to catch before the OOM condition hit. Additionally, there was no alerting on Redis OOM command errors directly; those were surfaced in application logs but not aggregated into a metric.
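A sketch of Prometheus rules that would close this gap, along the lines of action item PM-011-04. `redis_memory_used_bytes` and `redis_memory_max_bytes` are standard redis_exporter gauges; `app_redis_oom_errors_total` is a hypothetical application-side counter, since Redis OOM errors currently appear only in logs:

```yaml
groups:
  - name: redis-session-cache
    rules:
      - alert: RedisMemoryGrowingTooFast
        # Fires on >2%-of-max growth per minute: deriv() is per-second,
        # so compare against 2% of max divided by 60. This catches the
        # fast ramp a static 95% threshold with a 5m window misses.
        expr: deriv(redis_memory_used_bytes[2m]) > 0.02 * redis_memory_max_bytes / 60
        for: 1m
      - alert: RedisOOMErrors
        # Fires on >5 OOM command errors per minute (rate() is per-second).
        expr: rate(app_redis_oom_errors_total[1m]) * 60 > 5
        for: 1m
```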
## Response

### What Went Well
- The PagerDuty escalation path worked as designed; the on-call SRE acknowledged within 61 seconds of the alert.
- Claudia Hartmann identified thread pool exhaustion as the mechanism (rather than just "Redis is down") within 3 minutes of joining the war room, which directed the team immediately to rolling restarts as the correct recovery action.
- The restart runbook (`runbook-svc-rolling-restart.md`) was current and accurate; each service was restarted in under 4 minutes with no additional errors.
### What Went Poorly
- The roughly 100-second gap between the first OOM error and the PagerDuty alert was still long enough for thread pools to fully saturate before anyone was paged. Earlier detection would have allowed a faster response before the cascade completed.
- The team initially spent 5 minutes investigating whether the Redis failover had failed before pivoting to the thread pool exhaustion theory. Better runbook guidance on "Redis promoted but services still down" would have saved this time.
- No one had tested the Redis failure mode. The assumption that "Sentinel will handle it" was untested. The team discovered during the incident that Sentinel failover and application recovery are two separate problems.
## Action Items
| ID | Action | Priority | Owner | Status | Due Date |
|---|---|---|---|---|---|
| PM-011-01 | Change Redis session cache `maxmemory-policy` from `noeviction` to `allkeys-lru` across all environments | P0 | Redis Infra (Kofi Mensah) | Done | 2025-03-05 |
| PM-011-02 | Reduce Redis connection timeout in `libsession-go` from 30s to 500ms; add circuit breaker using go-resilience library | P0 | Platform Engineering (Claudia Hartmann) | In Progress | 2025-03-18 |
| PM-011-03 | Implement bulkhead isolation in all 8 services: separate Goroutine pools for Redis-dependent vs. Redis-independent request paths | P1 | Backend Services (Ravi Subramaniam) | In Progress | 2025-03-28 |
| PM-011-04 | Add Redis OOM error rate and memory-utilization-change-rate (derivative) alerts to Prometheus; threshold: >5 OOM errors/min or >2%/min growth | P1 | SRE (Tomasz Wiśniewski) | In Progress | 2025-03-14 |
| PM-011-05 | Add Redis failure mode to quarterly chaos engineering exercise; test: kill Redis primary, verify circuit breaker fast-fails within 1 second | P1 | Platform Engineering (Claudia Hartmann) | Planned | 2025-04-15 |
| PM-011-06 | Update runbook `runbook-redis-failover.md` to include section: "Sentinel promoted replica but services are still unhealthy — check thread pool exhaustion" | P2 | SRE (Tomasz Wiśniewski) | Planned | 2025-03-14 |
## Lessons Learned
- **Sentinel failover and application recovery are independent concerns**: Redis Sentinel correctly promoted a replica in 90 seconds, yet the application layer was still broken for 19 minutes. These are two separate systems with two separate failure modes, and both need to be designed and tested.
- **Default timeouts are designed for availability, not for blast radius**: The 30-second default connection timeout exists to ride out transient network blips. In a thread-pool-based server, a 30-second stall per thread, multiplied by pool size, becomes a failure amplification mechanism. Every default timeout should be evaluated in the context of how many threads it can consume at once.
- **Endpoints that don't use a dependency can still be killed by it**: Shared thread pools make the blast radius of any single dependency equal to the entire service. Bulkhead patterns (separate pools per dependency class) should be considered a baseline reliability requirement for any service with mixed dependency profiles.
## Cross-References

- Failure Patterns: Cascading Failure via Thread Pool Exhaustion; Dependency Without Circuit Breaker
- Topic Packs: `resilience-patterns` (circuit breakers, bulkheads, timeouts), `redis-operations` (eviction policies, Sentinel), `kubernetes-reliability` (pod disruption, resource limits)
- Runbooks: `runbook-redis-failover.md`, `runbook-svc-rolling-restart.md`
- Decision Tree: API layer returning 503 → check downstream dependency health → check thread pool utilization → if pools exhausted and dependency recovered, rolling restart required