Pattern: Retry Storm¶
ID: FP-009 Family: Thundering Herd Frequency: Very Common Blast Radius: Multi-Service to Cluster-Wide Detection Difficulty: Moderate
The Shape¶
When a downstream service becomes slow or unavailable, all upstream callers begin retrying simultaneously. If the retry logic uses fixed intervals (no jitter, no exponential backoff), each retry wave arrives at the same time. The downstream service recovers briefly, gets hit by a synchronized retry wave, fails again, and the cycle repeats. The upstream's "recovery" behavior is what prevents the downstream from actually recovering.
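The synchronization effect described above is easy to demonstrate. Below is a small, self-contained Python sketch (all names and numbers are illustrative, not from any real incident) comparing the peak per-second load from 500 clients retrying at a fixed 5s interval against the same clients with random jitter added to each retry:

```python
import random
from collections import Counter

def retry_times(n_clients, interval, waves, jitter=0.0):
    """Return the arrival time of every retry from every client."""
    times = []
    for _ in range(n_clients):
        for wave in range(1, waves + 1):
            # Each retry wave lands at wave*interval, plus optional random jitter.
            times.append(wave * interval + random.uniform(0, jitter))
    return times

def peak_load(times, bucket=1.0):
    """Largest number of requests landing in any single time bucket."""
    return max(Counter(int(t / bucket) for t in times).values())

random.seed(42)
fixed = peak_load(retry_times(500, 5.0, 4))                 # no jitter: waves align
jittered = peak_load(retry_times(500, 5.0, 4, jitter=5.0))  # spread across the interval

print(f"peak load, fixed interval: {fixed} req/s")
print(f"peak load, with jitter:    {jittered} req/s")
```

With fixed intervals, every one of the 500 clients lands in the same one-second bucket; with jitter spanning the retry interval, the same total load spreads out to a fraction of that peak.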
How You'll See It¶
In Kubernetes¶
Upstream pods all time out at the same interval (e.g., 5s). At t=5, 10, 15, 20s, a wave of 500 requests hits the downstream pod simultaneously. kubectl top shows CPU spikes on the downstream service every 5 seconds, and the downstream pod's latency histogram shows spikes at regular intervals, synchronized across all upstream pods.
In Linux/Infrastructure¶
Service A (500 instances) retries failed HTTP calls every 1s. Service B recovers, gets
hit by 500 simultaneous requests at exactly t=1.000s, falls over again. In netstat,
you see connection counts spike and drain in a regular sawtooth pattern.
In CI/CD¶
Artifact registry becomes slow during a deploy. All 50 parallel build jobs retry downloading base images simultaneously after 30s. Registry falls over under the load. Build logs show all jobs failing at the same timestamp with connection errors.
In Networking¶
TCP retransmission storms: a lossy network link causes all TCP sessions to retransmit simultaneously (after the same RTO). The retransmission burst adds to the congestion that caused the original loss.
The Tell¶
Downstream service error rate has a regular sawtooth pattern — spikes at fixed intervals corresponding to the upstream retry interval. The pattern is synchronized across all upstream instances. The downstream never has a quiet period to recover.
Common Misdiagnosis¶
| Looks Like | But Actually | How to Tell the Difference |
|---|---|---|
| Downstream service overloaded | Retry storm from upstream | Load is periodic, not sustained; interval matches upstream timeout |
| DDoS | Self-inflicted retry storm | Traffic source is internal; intervals are perfectly regular |
| Capacity issue | Synchronization problem | Same load spread out over time (with jitter) would be handled fine |
The Fix (Generic)¶
- Immediate: Circuit-break the upstream callers; add `sleep $((RANDOM % 30))` before retries to break synchronization.
- Short-term: Implement exponential backoff with jitter: `sleep = base * 2^attempt + random(0, base)`.
- Long-term: Add circuit breakers; use a library (resilience4j, go-retry with jitter) rather than hand-rolled retry logic; test retry behavior under failure injection.
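The short-term fix can be sketched in a few lines of Python, implementing the `base * 2^attempt + random(0, base)` formula above with a cap on the maximum delay (names are illustrative; in production, prefer a vetted library as noted above):

```python
import random
import time

def backoff_delay(attempt, base=0.5, cap=30.0):
    # sleep = base * 2^attempt + random(0, base), capped so late
    # attempts don't wait unboundedly long.
    return min(cap, base * (2 ** attempt) + random.uniform(0, base))

def call_with_retries(fn, max_attempts=5, base=0.5):
    """Call fn(), retrying on exception with exponential backoff + jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure
            time.sleep(backoff_delay(attempt, base))
```

The jitter term is what desynchronizes the herd: even if all 500 callers fail at the same instant, their next attempts land at different times.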
Real-World Examples¶
- Example 1: 200 microservice instances, each with `retry_interval=2s`. Database went down for maintenance. All 200 instances retried at t=2, 4, 6s. Database restarted at t=3s and was immediately overwhelmed at t=4s, crashing again. Outage extended from 30s to 8 minutes.
- Example 2: CDN origin servers went slow. All CDN edge nodes (globally synchronized by NTP) hit the same retry timeout. The synchronized wave from 3,000 edge nodes overwhelmed the origin for 12 minutes.
War Story¶
Downstream service was in brownout: 30% of requests timing out. Upstream had 500 pods, each retrying at exactly 5s. I watched the downstream metrics: every 5 seconds, a wall of traffic. The downstream would almost recover, then get hit again. We killed the retry logic entirely (circuit open) and the downstream recovered in under a minute. We re-enabled retries with exponential backoff plus 20% jitter. The downstream never saw another synchronized wave.
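The "circuit open" move in the story — stop calling the downstream entirely, then probe cautiously — is the core of a circuit breaker. A minimal sketch (illustrative only, not the team's actual implementation; real services should use resilience4j or an equivalent library):

```python
import time

class CircuitBreaker:
    """Minimal closed/open/half-open breaker. Illustrative, not production-ready."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            # Reset timeout elapsed: half-open, let one trial request through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip: shed load off the downstream
            raise
        else:
            self.failures = 0
            self.opened_at = None  # trial succeeded: close the circuit
            return result
```

While open, callers fail fast instead of piling retries onto the struggling downstream, which is exactly the quiet period the downstream needs to recover.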
Cross-References¶
- Topic Packs: distributed-systems, k8s-ops
- Case Studies: ops-archaeology/01-redis-oom-crashloop/
- Footguns: distributed-systems/footguns.md — "Retrying without exponential backoff"
- Related Patterns: FP-010 (cache stampede — same thundering herd shape), FP-019 (no circuit breaker — the missing protection)