Pattern: Retry Amplification
ID: FP-021 Family: Cascading Failure Frequency: Common Blast Radius: Multi-Service Detection Difficulty: Moderate
The Shape
A multi-tier system where each tier retries independently. If Service A retries 3×, Service B retries 3×, and Service C retries 3×, a single user request that fails at the bottom can trigger 3×3×3 = 27 actual attempts at the lowest tier. The retry multiplication is invisible to the user (they see one request) but the downstream service sees an amplified load that can exceed capacity.
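The multiplication can be seen in a toy simulation (hypothetical code, not from any real service; the tier count and per-tier retry count match the 3×3×3 example above). Every attempt at an upper tier fans out into a full retry loop at the tier below, so the bottom tier absorbs the product of all the retry counts:

```python
# Simulate three retrying tiers stacked on a failing bottom tier.
# Each tier makes up to `retries_per_tier` attempts into the tier below.

def call(tier: int, retries_per_tier: int, attempts: list) -> bool:
    """Simulate one call into `tier`; returns True on success."""
    if tier == 0:
        attempts.append(1)  # count one attempt at the bottom tier
        return False        # bottom tier is failing during the incident
    for _ in range(retries_per_tier):
        if call(tier - 1, retries_per_tier, attempts):
            return True
    return False

attempts = []
call(3, 3, attempts)   # 3 retrying tiers, 3 attempts each
print(len(attempts))   # 3 * 3 * 3 = 27 attempts at the bottom tier
```

One user request, one visible failure, 27 hits on the bottom tier.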
How You'll See It
In Kubernetes
User → API Gateway (3 retries) → Service A (3 retries) → Database. A single slow DB query causes: 1 user request → 3 gateway attempts × 3 service A attempts = 9 DB queries per user request. With 100 concurrent users, the DB sees 900 simultaneous queries instead of 100. The DB crashes under the amplified load.
In Linux/Infrastructure
Three-tier application: frontend → middleware → backend DB. Each tier has a 5s timeout and 3 retries. The backend hits a 5-minute disk I/O stall. The frontend user waits 5s + 5s + 5s = 15s, then sees an error. The backend DB receives 3 × 3 = 9 requests per user request during the stall. At 1,000 users, the DB receives 9,000 queries over the 5-minute stall.
In CI/CD
CI pipeline: the test runner retries flaky tests 3× within each step, and each failing step is retried 3× by the pipeline. A flaky integration test against a shared database causes 3 × 3 = 9 actual DB test runs per CI job invocation. With 20 parallel jobs: 180 DB test runs from what looks like 20 "retried" tests.
The Tell
Request count at lower tiers is significantly higher than at upper tiers. The amplification factor matches the product of retry counts across tiers. Load at the bottom tier spikes before any user-visible error appears.
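As a rough illustration of checking the tell, assuming you can sample per-tier request counts from your metrics system (the tier names, counts, and retry limits below are made up):

```python
# If the bottom-tier request count is an integer multiple of the
# user-facing count, and that multiple equals the product of the
# configured retry limits, suspect retry amplification rather than
# a genuine traffic spike.

from math import prod

# Requests observed at each tier over the same window (hypothetical).
requests_per_tier = {"gateway": 100, "service_a": 300, "database": 900}
retry_limits = [3, 3]  # retries configured at gateway and service_a

user_requests = requests_per_tier["gateway"]
bottom = requests_per_tier["database"]

amplification = bottom / user_requests  # 900 / 100 = 9.0
expected = prod(retry_limits)           # 3 * 3 = 9
print(amplification == expected)        # True -> matches the retry product
```

If the ratio matches the retry product, the "traffic spike" is self-inflicted.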
Common Misdiagnosis
| Looks Like | But Actually | How to Tell the Difference |
|---|---|---|
| Traffic spike | Retry amplification | User-facing request count flat; DB query count is multiple of user requests |
| Bottom-tier overloaded | Self-inflicted amplification | Load at bottom tier is N× higher than tier above; N = product of retry counts |
| Cascading failure from bad deployment | Retry storm from existing code | No deploy; failure triggered by a transient slowdown that was amplified |
The Fix (Generic)
- Immediate: Disable retries at intermediate tiers during the incident; only the outermost tier should retry.
- Short-term: Implement retry budgets: count retries across the call chain using a header (X-Retry-Budget: 3); each tier decrements the budget; no retries once the budget reaches 0.
- Long-term: Retry at only one layer (typically the edge/gateway); use idempotency keys; propagate request context (trace IDs) so retries can be detected and coordinated.
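A minimal sketch of the retry-budget idea, assuming the X-Retry-Budget header named above; the handle and failing helpers are hypothetical, not from any real framework. The key property: no matter how many tiers run this loop, the chain as a whole spends at most the initial budget on retries:

```python
# One tier's retry loop, constrained by a budget shared across the chain.
# Each retry spends one unit; the decremented budget is forwarded
# downstream in the header so lower tiers cannot multiply it.

HEADER = "X-Retry-Budget"

def handle(headers: dict, downstream) -> tuple:
    budget = int(headers.get(HEADER, "3"))
    while True:
        ok = downstream({**headers, HEADER: str(budget)})
        if ok or budget == 0:
            return ok, budget
        budget -= 1  # spend one unit of the chain-wide budget

# Usage: a downstream that always fails exhausts the budget after
# exactly `budget` retries instead of multiplying across tiers.
calls = []
def failing(headers):
    calls.append(headers[HEADER])
    return False

ok, remaining = handle({HEADER: "3"}, failing)
print(len(calls), remaining)  # 4 attempts (1 initial + 3 retries), budget 0
```

Compare with independent per-tier retries, where three tiers with 3 retries each would produce 27 bottom-tier attempts for the same failure.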
Real-World Examples
- Example 1: E-commerce checkout: 3 retries at edge + 3 retries at cart service + 3 retries at inventory service = 27× amplification. A 5s inventory DB slowdown caused 2,700 DB queries from 100 users. DB crashed.
- Example 2: Data pipeline with 3 stages, each retrying 5×: 125× amplification. A flaky S3 read caused 125 S3 API calls per pipeline run. S3 rate limit (429) triggered for the entire team's AWS account.
War Story
Database was at 5% CPU normally. During a 10-minute network blip between the app and DB, we saw DB CPU spike to 400% (saturated). How? We had 3 app tiers, each with 5 retries. Each user request triggered 5×5×5 = 125 DB connections in the worst case. 500 users × 125 = 62,500 DB connections against a max of 500. We hadn't even noticed the retry configuration in the middleware; it was a "default" from the HTTP client library. Removing retries from tiers 2 and 3 (keeping only at the edge) dropped the amplification to 5×. DB recovered in minutes.
Cross-References
- Topic Packs: distributed-systems
- Footguns: distributed-systems/footguns.md
- Related Patterns: FP-009 (retry storm — same retry mechanism, synchronization variant), FP-019 (no circuit breaker — the protection that stops amplification)