Postmortem: Unbounded Retry Storm Takes Down Payment Processing

ID: PM-005
Date: 2025-11-12
Severity: SEV-1
Duration: 1h 54m (incident start at 14:00 UTC to incident close at 15:54 UTC)
Time to Detect: 8m
Time to Mitigate: 54m (incident start to full restoration of payment success rate at 14:54 UTC)
Customer Impact: 100% of payment attempts failed during the 54-minute retry storm (14:00–14:54 UTC). Approximately 22,000 customers were unable to complete purchases. All checkout flows displayed error pages or spinner timeouts.
Revenue Impact: ~$163,000 estimated (1.9h × avg $86k/h payment transaction volume)
Teams Involved: Payments Engineering, Platform Engineering, Service Mesh Team, SRE, Customer Success, Finance
Postmortem Author: Marcus Truong (Staff SRE)
Postmortem Date: 2025-11-16

Executive Summary

On 2025-11-12, the company's payment gateway operator (Clearpath Payments) performed a planned 20-minute maintenance window. All 14 internal services that call the payment gateway API began retrying failed requests immediately and aggressively, using the company's shared HTTP client library defaults of 3 retries with no backoff and no jitter. The combined retry behavior amplified traffic to the payment gateway to approximately 10× baseline, so that when Clearpath's maintenance completed and the gateway came back online, the sustained load prevented recovery. The gateway remained overloaded and continued returning 503s for a further 26 minutes after maintenance ended (14:20–14:46 UTC). Manual intervention (rate-limiting all 14 callers at the API gateway layer) was required to allow the payment gateway to drain its backlog and recover. No payment data was corrupted; all requests were rejected cleanly at the transport layer before any transaction was initiated.


Timeline (All times UTC)

Time Event
14:00 Clearpath Payments begins planned maintenance on payment gateway API (api.clearpathpay.io). Returns HTTP 503 with Retry-After: 1200 header on all requests. Clearpath had filed a maintenance notice via email to vendor-notifications@company.com 48h prior.
14:00 All 14 internal services begin receiving 503 responses. HTTP client library (http-client-go v2.3.1, shared across all services) interprets 503 as a transient error and immediately begins retry sequence: attempt 1 at T+0, attempt 2 at T+1s, attempt 3 at T+2s. No exponential backoff. No jitter. Retry-After header is not parsed.
14:01 Retry storm is already at full amplitude: each failed payment attempt across the 14 services generates 4 gateway requests (the original plus 3 immediate retries) and is then re-enqueued at the application layer, driving aggregate traffic toward roughly 9,100 requests/min against a gateway returning 503.
14:02 Payment gateway request rate (measured at Clearpath, per post-incident communication): 9,100 requests/min, 10× above the normal 910 requests/min baseline. Clearpath had expected zero traffic during the maintenance window.
14:07 SRE on-call (Marcus Truong) notices checkout conversion drop on Datadog dashboard. Payment success rate at 0%. Pages Payments Engineering on-call (Adaeze Okonkwo).
14:08 Payments Engineering on-call acknowledges. PagerDuty alert for payment gateway 503 rate fires.
14:09 Adaeze checks Clearpath status page (status.clearpathpay.io). Shows "Scheduled Maintenance in Progress. Expected completion: 14:20 UTC." Payments Engineering team was not aware of this maintenance window — the notification email went to vendor-notifications@company.com, a distribution list with no subscribers.
14:11 SEV-1 declared. War room opened. Initial assessment: maintenance will complete at 14:20 UTC; retry storm should resolve naturally at that point. Decision: monitor and wait.
14:20 Clearpath maintenance completes. Payment gateway comes back online. However, the 10× request rate immediately overwhelms the freshly started gateway. Clearpath's gateway has a connection queue limit of 1,200 concurrent requests; incoming rate at 14:20 is approximately 9,100 requests/min. Queue saturates within 4 seconds. Gateway continues returning 503.
14:22 It becomes clear the gateway is not recovering. War room reassesses. Service Mesh on-call (Priya Sundaram) joins.
14:24 Hypothesis: Clearpath's maintenance is running long. Checked against status page — maintenance marked complete at 14:20. Hypothesis updated: retry storm is preventing gateway recovery.
14:27 Service Mesh Team identifies that Istio service mesh has no circuit breaker configured for clearpathpay.io external service entry. Circuit breaker would have opened the circuit after the first wave of 503s, shedding retry load.
14:29 Platform Engineering on-call (James Kowalczyk) joins. Proposes implementing Istio circuit breaker on the Clearpath service entry immediately. Estimated time to implement and deploy: 20–25 minutes.
14:31 Alternative proposed: rate-limit all 14 callers at the API gateway (Kong) by reducing their upstream request rate to 5% of normal. Estimated time: 10–12 minutes. Selected as faster mitigation.
14:33 James begins applying Kong rate-limiting plugin to all 14 service routes pointing at the payment gateway integration service.
14:41 Rate limiting applied to 11 of 14 services. Clearpath gateway request rate drops from 9,100 to 1,200 requests/min. Clearpath gateway begins returning 200s intermittently.
14:44 Rate limiting applied to remaining 3 services. Clearpath request rate at 890 requests/min (below normal baseline).
14:46 Clearpath payment gateway fully recovered. Processing normally. Adaeze begins gradually lifting rate limits.
14:54 Rate limits removed from all services. Payment success rate returns to 99.4% (normal baseline: 99.6%; small gap from in-flight abandoned sessions).
15:00 Marcus sends incident update to war room: "Payment processing restored. Monitoring for 30m before standing down."
15:54 Extended monitoring complete. All payment metrics nominal. Incident closed.
16:10 Finance confirms no payment records show partial or duplicate transactions during the incident window.

Impact

Customer Impact

  • Approximately 22,000 customers were affected during the incident window (14:00–15:54 UTC, including the recovery monitoring period); active payment failures spanned 14:00–14:54 UTC
  • 100% of checkout flows failed during active retry storm (14:00–14:54 UTC, 54 minutes)
  • No partial or duplicate payments recorded. Clearpath returned 503 before any transaction was initiated; all failures were clean transport-layer rejections
  • Approximately 3,400 customers who were mid-checkout at 14:00 UTC encountered spinner timeouts and abandoned their sessions; Customer Success estimated ~40% of these completed checkout after recovery
  • 1 enterprise B2B customer (Holloway Industrial) had a scheduled invoice batch payment at 14:15 UTC that failed entirely; Finance issued manual payment instructions

Internal Impact

  • SRE: 2 engineers × 2.5h = 5 engineer-hours
  • Payments Engineering: 2 engineers × 2.5h = 5 engineer-hours
  • Platform Engineering: 1 engineer × 2h = 2 engineer-hours
  • Service Mesh Team: 1 engineer × 1.5h = 1.5 engineer-hours
  • Customer Success: 6 agents × 1.5h = 9 person-hours on checkout-failure ticket triage
  • Finance: 1 analyst × 1h = 1 person-hour on payment reconciliation review
  • vendor-notifications@company.com distribution list audit initiated — revealed 23 additional vendor notification addresses with no active subscribers

Data Impact

None. All payment gateway requests during the incident were rejected at the HTTP layer with 503 before any payment data was written to Clearpath's systems. Finance confirmed via Clearpath's transaction API that no transactions were initiated, partially committed, or duplicated during the window. Order records in the internal database show all payment attempts with status PAYMENT_GATEWAY_UNAVAILABLE, which is a clean terminal state with no downstream side effects.


Root Cause

What Happened (Technical)

The http-client-go library (version 2.3.1) is a shared HTTP client wrapper used by all 14 services that call external payment APIs. Its default retry policy is MaxRetries=3, RetryDelay=0 (immediate), no exponential backoff, no jitter, and no handling of the Retry-After response header. This policy was appropriate for the idempotent GET requests against internal high-availability services for which the library was originally designed. Over time, it was adopted by all service teams as the standard HTTP client, including for non-idempotent calls to external payment APIs, without the defaults being reviewed or adjusted.

When Clearpath began returning 503 at 14:00 UTC, each of the 14 calling services immediately exhausted its retry budget (3 retries in 2 seconds) and then re-enqueued the original request at the application layer, repeating the cycle. The net effect was that each incoming payment attempt generated approximately 4 requests to the Clearpath gateway within a 2-second window (the original plus 3 retries), and the re-enqueue loop kept regenerating that traffic for as long as the 503s continued. Against an aggregate baseline of approximately 910 payment requests per minute, this sustained approximately 9,100 requests per minute, 10× the normal baseline.

When Clearpath's maintenance completed at 14:20 UTC and the gateway came back online, it resumed accepting connections with a cold connection pool and an empty queue. Within 4 seconds, the 10× retry load saturated its connection queue (capacity: 1,200 concurrent connections). The gateway's circuit protection reacted by continuing to return 503, trapping the system in an amplification loop: retries caused overload, overload caused 503s, 503s caused more retries.
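The trap at 14:20 can be illustrated with a toy queueing model. The drain rate below is an assumption (roughly the 910 requests/min baseline expressed per second); the 1,200-slot queue and the ~152 req/s storm rate come from the figures above. The point is qualitative, not a reconstruction of Clearpath's actual queue behavior: under the retry storm the queue pins at capacity and every excess request gets a 503, while under a 5% throttle arrivals fall below the drain rate and the queue never builds.

```go
package main

import "fmt"

// simulate is a toy discrete-time model of the gateway's connection
// queue. Assumptions: the gateway drains ~15 req/s (≈910 req/min
// baseline) and rejects with 503 once 1,200 requests are queued.
func simulate(arrivalPerSec float64, seconds int) (queued float64, saturated bool) {
	const drainPerSec = 15.0
	const queueCap = 1200.0
	for t := 0; t < seconds; t++ {
		queued += arrivalPerSec
		queued -= drainPerSec
		if queued < 0 {
			queued = 0 // queue fully drained this tick
		}
		if queued >= queueCap {
			queued = queueCap // excess requests are rejected with 503
			saturated = true
		}
	}
	return queued, saturated
}

func main() {
	// Retry storm: ~9,100 req/min ≈ 152 req/s. The queue pins at
	// capacity within seconds and stays there.
	q, sat := simulate(152, 30)
	fmt.Printf("retry storm after 30s: queue=%.0f saturated=%v\n", q, sat)

	// Throttled to ~5% of normal: arrivals stay below the drain
	// rate, so the queue never builds and the gateway can recover.
	q, sat = simulate(0.76, 30)
	fmt.Printf("throttled after 30s:  queue=%.0f saturated=%v\n", q, sat)
}
```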

The Istio service mesh was deployed in all service namespaces but had no DestinationRule configured for the clearpathpay.io external service entry. A circuit breaker with an outlierDetection policy would have ejected the Clearpath host from the load balancing pool after the first N consecutive 5xx responses, causing the service mesh to fast-fail subsequent requests without reaching the gateway. This would have reduced the load amplification to essentially zero. The service mesh team had configured circuit breakers for internal service-to-service calls but had not extended the configuration to external service entries.
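As a concrete illustration, the missing configuration is roughly the following DestinationRule, sketched here using the ejection parameters later proposed in AI-002 (the metadata name is hypothetical, and the real object would need to match the existing ServiceEntry for api.clearpathpay.io):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: clearpathpay-circuit-breaker   # hypothetical name
spec:
  host: api.clearpathpay.io
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5     # eject after 5 consecutive 5xx responses
      interval: 10s               # evaluated over a 10s analysis window
      baseEjectionTime: 30s       # ejection duration grows on repeated ejections
      maxEjectionPercent: 100     # allow ejecting the sole external host
```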

The fundamental design flaw is that there is no global invariant preventing a calling service from amplifying load against a downstream dependency that is already saturated. Exponential backoff with jitter is the standard mitigation; it ensures that retry storms decay over time rather than sustaining constant load. A properly configured backoff (e.g., base 1s, max 60s, jitter ±50%) would have reduced the retry load to near-zero within 2–3 minutes.

Contributing Factors

  1. http-client-go retry defaults designed for internal HA calls, applied to external payment API: The library was initially designed for internal service-to-service calls against load-balanced, always-available endpoints. Its aggressive defaults (immediate retry, no backoff) are appropriate for that use case. When adopted for external payment gateway calls — where the downstream has maintenance windows, rate limits, and no internal redundancy from the caller's perspective — the defaults became dangerous. No review or override was mandated when teams adopted the library for payment calls.

  2. No circuit breaker on external service mesh entries: Istio circuit breakers were configured for internal services but the configuration was never extended to external service entries (ServiceEntry objects in Istio). The service mesh team was aware of the gap but had not prioritized it, as no previous incident had made the risk visible.

  3. Vendor maintenance notification distribution list had no subscribers: Clearpath sent a maintenance notification 48h in advance to vendor-notifications@company.com per their standard process. This distribution list was created 18 months ago during a vendor onboarding and was never maintained. At the time of the incident, it had zero subscribers. No human or system received the notification. If Payments Engineering had received the notice, they would have either coordinated with Clearpath to schedule maintenance outside business hours or implemented a temporary circuit break before the window.

What We Got Lucky About

  1. Clearpath's payment gateway rejected all requests with a clean 503 before any transaction was initiated. A 503 response is returned at the point of connection acceptance, before any payment data is transmitted or processed. There were no partial transactions, no duplicate charges, and no orphaned payment states requiring reconciliation. If Clearpath's gateway had failed in a way that accepted the connection and then errored mid-transaction, the outcome would have included potential duplicate charges requiring customer contact and bank coordination.
  2. The rate-limiting mitigation (Kong plugin application across 14 services) was implementable in approximately 11 minutes. If the 14 callers had each had unique API gateway configurations or had not been behind a common gateway layer, manual throttling would have required deploying configuration changes to each service individually — a process that could have taken 1–2 hours per service.
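For reference, the per-route mitigation above (and the cluster-level policy later proposed in AI-004) maps onto Kong's bundled rate-limiting plugin. A decK-style declarative sketch follows; the service name and per-minute limit are illustrative placeholders, not the values used during the incident:

```yaml
_format_version: "3.0"
plugins:
- name: rate-limiting              # Kong's bundled plugin
  service: clearpath-integration   # hypothetical service name
  config:
    minute: 50       # illustrative: set to ~5% of the route's normal rate
    policy: local    # per-node counters; fastest to apply under pressure
    limit_by: service
```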

Detection

How We Detected

The SRE on-call detected the incident through a Datadog dashboard showing the checkout conversion rate dropping to 0% at T+7m. A PagerDuty alert for a payment gateway 503 rate >10% sustained for 5 consecutive minutes fired at T+8m. The 8-minute detection lag was due to the alert threshold requiring 5 consecutive minutes of elevated 503 rate before firing.

Why We Didn't Detect Sooner

The primary gap was that there was no monitoring on the vendor-notifications@company.com inbox — if there had been, the Clearpath maintenance notice would have been routed to Payments Engineering 48 hours before the incident. With advance notice, the team could have implemented a pre-emptive circuit break or coordinated with Clearpath to reschedule. The alert threshold (5 consecutive minutes of elevated 503 rate) was designed to avoid alert fatigue from transient errors; a shorter threshold (1–2 minutes) would have reduced detection time but potentially increased false positives.


Response

What Went Well

  1. The identification that the retry storm was preventing Clearpath recovery (rather than Clearpath maintenance simply running long) was made within 4 minutes of maintenance completing at 14:20 UTC. This required connecting two pieces of information — the maintenance completion time from the status page and the sustained request rate from Datadog — which the war room did quickly.
  2. The Kong rate-limiting approach was selected over the Istio circuit breaker approach specifically because it was faster (11 minutes vs 20–25 minutes) and had lower implementation risk under time pressure. Picking the faster, lower-risk mitigation was the correct call.
  3. Finance's post-incident confirmation that no payment records were corrupted was completed within 2 hours of incident close. Having a clear payment reconciliation procedure and executing it promptly prevented any ambiguity about customer impact.

What Went Poorly

  1. The initial decision at T+11m to "monitor and wait" for Clearpath maintenance to complete was incorrect and cost approximately 9 minutes of additional outage. The retry storm was already at full amplitude — waiting for maintenance to complete would not have resolved the overload condition. The war room did not model what would happen when the gateway came back online under 10× load.
  2. The 14-service manual rate-limiting operation took 11 minutes because each service's Kong configuration required a separate plugin application command. A single global rate-limiting policy applied at the gateway's upstream-cluster level would have taken 1–2 minutes. The absence of a global upstream rate-limit for external payment APIs reflects a gap in the API gateway configuration architecture.
  3. Payments Engineering's lack of awareness of the Clearpath maintenance window meant the team spent 9 minutes diagnosing a situation that could have been avoided entirely with proper vendor notification routing.

Action Items

  • AI-001 [P0] Update http-client-go v2.3.1 defaults for external (non-internal) calls: set RetryDelay to exponential backoff (base 1s, multiplier 2, max 60s), add ±30% jitter, parse and honor the Retry-After header; cut v2.4.0 and mandate adoption by all services calling external APIs. Owner: Platform Engineering (James Kowalczyk). Status: In Progress. Due: 2025-11-26.
  • AI-002 [P0] Configure Istio DestinationRule with an outlierDetection circuit breaker for all external ServiceEntry objects, including clearpathpay.io; eject the host after 5 consecutive 503s within 10s, for 30s with a multiplier. Owner: Service Mesh Team (Priya Sundaram). Status: In Progress. Due: 2025-11-22.
  • AI-003 [P0] Audit all vendor notification distribution lists; assign a human owner and a monitoring alias to each; route vendor-notifications@company.com to Payments Engineering and SRE Slack channels with a 15-minute acknowledgment SLA. Owner: Adaeze Okonkwo (Payments Engineering). Status: Not Started. Due: 2025-11-19.
  • AI-004 [P1] Implement a global Kong rate-limiting policy for the clearpath-payments upstream cluster at 110% of normal peak request rate; apply it at the cluster level so it does not require per-service configuration during incidents. Owner: Platform Engineering. Status: Not Started. Due: 2025-11-26.
  • AI-005 [P1] Add a payment gateway availability alert with a 1-minute threshold (down from 5 minutes) and route it to the Payments Engineering primary as well as SRE on-call to ensure faster detection and domain-expert involvement. Owner: Marcus Truong (SRE). Status: Not Started. Due: 2025-11-22.
  • AI-006 [P2] Establish a quarterly vendor maintenance coordination review: Payments Engineering reviews upcoming Clearpath, Stripe, and Adyen maintenance windows and coordinates scheduling or pre-emptive circuit breaks. Owner: Adaeze Okonkwo (Payments Engineering). Status: Not Started. Due: 2025-12-05.

Lessons Learned

  1. Retry policies designed for one context are dangerous when applied to a different context without review. The http-client-go library's aggressive defaults were appropriate for internal HA calls. They were catastrophically inappropriate for external payment API calls during a maintenance window. Libraries that encode operational behavior (retry, timeout, backoff) must be designed with the understanding that callers operate in different contexts. Defaults should be conservative (favor backoff and jitter over immediacy), and any deviation toward aggressiveness should require explicit opt-in with documented rationale.

  2. A downstream going offline does not mean the load disappears — it means the load accumulates. The team's initial reaction to "Clearpath maintenance expected to complete at 14:20" was to wait for 14:20. This mental model assumes that retry traffic is held in check during maintenance and resumes normally after. The reality is the opposite: retry storms accumulate traffic that is then delivered in a burst at the moment the downstream recovers, potentially overwhelming it. Recovery from a dependency outage is not automatic; the caller-side traffic shape must be managed explicitly, either through circuit breakers, exponential backoff, or manual throttling.

  3. Vendor operational communications must have verified, active subscribers. A vendor sending advance maintenance notice is performing a service. That service has zero value if the notification goes unread. Distribution lists for vendor communications should be treated with the same operational rigor as PagerDuty schedules: regular audits, verified subscribers, and an acknowledged receipt mechanism. The 48-hour notice that Clearpath sent could have prevented this incident entirely — it reached no one.


Cross-References

  • Failure Pattern: Design Flaw — Unbounded Retry with No Backoff (Thundering Herd); Process Gap — Vendor Notification Not Reaching Responsible Team
  • Topic Packs: retry-patterns, circuit-breakers, exponential-backoff-jitter, service-mesh-istio, api-gateway-rate-limiting
  • Runbook: runbooks/payments/payment-gateway-unavailable.md (to be created); must include step for immediate circuit break and Kong rate-limit as first response
  • Decision Tree: Triage path — "Payment gateway returning 503, maintenance window active or recently ended" → do NOT wait for maintenance to complete → immediately apply Kong rate-limiting to all callers → confirm Clearpath request rate drops to <110% normal baseline → only then gradually lift rate limits as gateway stabilizes