Postmortem: Race Condition in Distributed Lock Manager Corrupts Shared State¶
| Field | Value |
|---|---|
| ID | PM-003 |
| Date | 2025-05-21 |
| Severity | SEV-1 |
| Duration | 6h 15m (detection to resolution) |
| Time to Detect | 14m |
| Time to Mitigate | 6h 15m |
| Customer Impact | Order processing failed or produced incorrect results for 4h 22m during active contention. Approximately 9,400 orders entered an inconsistent state; 2,100 were visibly incorrect to customers (wrong status, duplicate line items, or missing fulfillment records). Manual reconciliation required for all 9,400. |
| Revenue Impact | ~$420,000 estimated (direct: ~$95k in refunds and credits issued; indirect: ~$325k in delayed fulfillment and engineering time for reconciliation) |
| Teams Involved | Order Platform, Distributed Systems, SRE, Data Engineering, Customer Success, Legal |
| Postmortem Author | Ingrid Solberg (Staff Engineer, Order Platform) |
| Postmortem Date | 2025-05-25 |
Executive Summary¶
On 2025-05-21, a race condition in the company's hand-rolled distributed lock manager caused two concurrent writers to simultaneously acquire the same lock for the same order record during a GC pause in one of the lock holder processes. The race corrupted shared state in the order processing pipeline across three services (order service, inventory service, and fulfillment service), causing ~9,400 orders to enter inconsistent states. The underlying TOCTOU (time-of-check to time-of-use) flaw had existed in the codebase for over two years but was masked at normal traffic volumes; a marketing campaign on the day of the incident increased contention 10× above baseline, surfacing the bug. Full recovery required 6 hours 15 minutes due to the complexity of three-service data reconciliation. No payment records were corrupted.
Timeline (All times UTC)¶
| Time | Event |
|---|---|
| 09:00 | Marketing campaign "Summer Kickoff" launches, sending promotional email to 1.4M subscribers. Order ingestion rate climbs from baseline 420 orders/min to 4,100 orders/min over 40 minutes. |
| 09:41 | Order processing error rate begins rising (3% → 8%) on POST /v2/orders/fulfill. SRE on-call (Bastian Richter) begins investigation. |
| 09:47 | Application logs show interleaved lock acquisitions from two worker pods for the same order ID: `[order-svc] LOCK ACQUIRED order_id=ord-9927431 worker=pod-7f` and `[order-svc] LOCK ACQUIRED order_id=ord-9927431 worker=pod-3c`. Lock invariant violated. |
| 09:51 | Bastian pages Distributed Systems on-call (Chen Wei). SEV-1 declared. War room opened in Slack. |
| 09:55 | Chen Wei reviews lock manager code. Identifies that lock acquisition uses Redis SET NX EX (set if not exists, with TTL), but the check-and-act sequence is not atomic: the code issues a GET, checks client-side whether the key is empty, then issues the SET, with an unguarded window between the operations. Under a GC pause, the TTL can expire inside that window. |
| 10:02 | First hypothesis: Redis connection pool exhaustion causing delayed NX responses. Redis metrics show pool healthy, latency p99 at 4ms. Hypothesis rejected. |
| 10:08 | Chen Wei confirms the TOCTOU: a JVM GC pause of 280–400ms in the order service outlasts the lock TTL, so the key expires while the paused holder is still inside its critical section. A second worker then sees an empty key on GET and succeeds on SET NX. Both workers believe they hold the lock. |
| 10:14 | Order Platform on-call (Lena Fischer) joins. Confirms state corruption in database: orders table has rows with status=PROCESSING and status=FULFILLED for the same order ID in the audit log — both written within 200ms of each other. |
| 10:19 | Mitigation option A considered: roll back lock manager to previous version. No previous version — hand-rolled implementation has not been meaningfully changed in 2.1 years. |
| 10:23 | Mitigation option B considered: implement Redlock across 3 Redis nodes to make lock acquisition atomic. Estimated implementation time: 3–4 hours. Rejected for immediate mitigation; flagged for permanent fix. |
| 10:28 | Mitigation option C selected: temporarily route all order traffic through a single order service pod (disable horizontal scaling, force serial processing). Lena begins draining all but one replica. |
| 10:41 | Order service scaled to 1 replica. Contention eliminated. Error rate drops from 18% to 0.2% (residual errors from in-flight corrupted orders). |
| 10:44 | Decision: leave single-replica mode in place while Data Engineering identifies the full scope of corrupted orders. Marketing campaign traffic is declining naturally. |
| 11:03 | Data Engineering (Prisha Mehta) runs reconciliation query across orders, inventory_reservations, and fulfillment_records tables. Identifies 9,412 orders with state inconsistencies. |
| 11:20 | Legal joins war room to assess customer notification obligations. |
| 11:35 | Customer Success begins individual outreach to customers with visibly incorrect order states (2,143 orders). |
| 12:10 | Data Engineering completes reconciliation script for non-visible inconsistencies (7,269 orders with internal metadata mismatches but no customer-visible error). Script validated in staging. |
| 12:44 | Reconciliation script run against production. 7,241 of 7,269 orders corrected automatically. 28 orders require manual review (conflicting inventory signals). |
| 14:18 | 28 manually reconciled orders resolved by Order Platform team in coordination with Inventory team. |
| 15:41 | Data Engineering confirms all 9,412 orders are in consistent state. Order service scaled back to 8 replicas. Monitoring shows normal contention levels under reduced post-campaign traffic. Incident resolved. |
| 16:00 | Post-incident monitoring watch begins (48h extended watch given data corruption severity). |
Impact¶
Customer Impact¶
- 9,412 orders entered inconsistent state during active contention window (09:41–10:41 UTC)
- 2,143 customers received visibly incorrect order state: wrong status (e.g., "Delivered" while still in warehouse), duplicate line items, or missing fulfillment confirmation emails
- ~7,269 orders had internal metadata inconsistencies not visible to customers but requiring automated reconciliation
- Customer refunds and goodwill credits issued: ~$95,000 (tracked by Customer Success)
- 3 business accounts with high order volumes flagged for dedicated account manager follow-up
Internal Impact¶
- Order Platform: 4 engineers × 7h = 28 engineer-hours on incident response + reconciliation
- Distributed Systems: 2 engineers × 5h = 10 engineer-hours
- Data Engineering: 3 engineers × 6h = 18 engineer-hours on reconciliation queries and scripting
- SRE: 2 engineers × 6h = 12 engineer-hours
- Customer Success: 8 agents × 4h = 32 agent-hours on outreach
- Legal: 2h consultation
- Q2 Distributed Systems sprint effectively consumed by Redlock implementation (displaced 2 planned features)
Data Impact¶
- 9,412 order records corrupted across three tables (`orders`, `inventory_reservations`, `fulfillment_records`)
- All 9,412 records were reconciled to a consistent state within 6h 15m of incident declaration
- No payment data was corrupted. Payment records are stored in a separate service (`payment-svc`) backed by its own isolated PostgreSQL instance; the distributed lock manager does not govern payment state
- 28 orders required manual business judgment to reconcile (ambiguous inventory signal); all were resolved conservatively (order honored, inventory re-checked manually)
- No permanent data loss; all corruption was within order metadata, not financial or personally identifiable data
Root Cause¶
What Happened (Technical)¶
The distributed lock manager is a hand-rolled Go implementation that uses Redis as its backing store. The lock acquisition sequence is: (1) issue a `GET <lock_key>` command; (2) if the returned value is empty, issue `SET <lock_key> <worker_id> NX EX <ttl_seconds>`. This sequence has a classic TOCTOU (time-of-check to time-of-use) flaw: the check (GET) and the act (SET, followed by the protected writes) are separated in time, and nothing re-validates the lock once it is acquired. If a holder is paused for longer than the TTL (for example, by a GC pause), the key silently expires while the holder is still inside its critical section. The next caller's GET then sees an empty key, its SET NX succeeds, and both workers proceed to write, each believing it holds the lock.
Under normal operating conditions at 420 orders/minute, lock contention is low enough that the window opened by a TTL expiry (~200ms) was essentially never occupied by a second caller. GC pauses in the order service, which runs on the JVM (the lock manager itself is written in Go and runs alongside it as a sidecar), averaged 40–80ms at baseline traffic, too short to outlast the TTL.
At 4,100 orders/minute, a 10× increase from the marketing campaign, lock contention increased dramatically. Redis key churn rose, and more importantly, multiple workers were queueing on the same lock keys simultaneously. Under these conditions, a GC pause of 280–400ms in the order service JVM was long enough to outlast the lock TTL, expiring the key while the original holder was still mid-write. Two workers, pod-7f and pod-3c, simultaneously held the lock for order ord-9927431; both wrote to the order record, and both wrote to the inventory reservation table. The duplicate writes produced contradictory state: two different inventory lot IDs reserved for the same order, and two different fulfillment records created.
Because the fulfillment service consumes order state asynchronously via a message queue, the corruption propagated downstream before the lock manager detected any anomaly. The fulfillment service has no idempotency guard at the record creation layer; it created a fulfillment record for each valid-looking message it received.
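The missing guard can be sketched as a dedupe check at the record-creation layer, keyed on a message ID. This is an illustrative, in-memory sketch only, not the fulfillment service's real API; all names are assumptions:

```go
// Illustrative sketch: an idempotency guard that turns replayed or
// duplicate messages into no-ops instead of duplicate records.
package main

import (
	"fmt"
	"sync"
)

type fulfillmentStore struct {
	mu   sync.Mutex
	seen map[string]bool // message IDs already processed
}

// CreateOnce creates a fulfillment record only the first time a given
// message ID is seen; it reports whether a record was created.
func (s *fulfillmentStore) CreateOnce(msgID string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.seen[msgID] {
		return false // duplicate delivery: do not create a second record
	}
	s.seen[msgID] = true
	// ...insert the fulfillment record here...
	return true
}

func main() {
	s := &fulfillmentStore{seen: map[string]bool{}}
	fmt.Println(s.CreateOnce("msg-ord-9927431")) // true: first delivery creates a record
	fmt.Println(s.CreateOnce("msg-ord-9927431")) // false: duplicate ignored
}
```

In production the dedupe set would live in the database (e.g., a unique constraint on the message ID), so the guard survives restarts; the in-memory map is only for illustration.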
The proper solution for Redis-based distributed locking is either the Redlock algorithm (acquire lock on N/2+1 independent Redis nodes) or, more simply, the atomic SET NX command issued as a single operation (not preceded by a GET). The existing implementation's GET + SET NX pattern is documented as an anti-pattern in the Redis documentation for exactly this reason.
Contributing Factors¶
- **Lock implementation used a non-atomic GET + SET NX pattern:** The implementation was written against Redis documentation that predates the single-command atomic `SET key value NX EX ttl` syntax (available since Redis 2.6.12). The original author may not have known about the atomic form, or the code predated it. The implementation was never reviewed against current Redis best practices.
- **No fencing tokens:** A robust distributed lock implementation issues a monotonically increasing fencing token with each lock grant. Writers must pass the token to the data store, which rejects writes from stale token holders. Without fencing tokens, a second lock holder can write after the first without any rejection at the storage layer. The order service's database write path had no such guard.
- **10× traffic spike was not modeled in capacity planning or load tests:** The marketing team launched the Summer Kickoff campaign without filing a capacity review with Platform Engineering. The order platform's load tests topped out at 2× baseline (840 orders/min). The 10× spike was entirely outside tested operational parameters. Had the campaign been flagged in advance, the lock contention issue might have been identified during load testing.
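A minimal sketch of the fencing-token idea (all type and method names here are illustrative, not the production API): the lock service hands out monotonically increasing tokens, and the data store refuses any write carrying a token older than the newest one it has seen for that key.

```go
// Illustrative sketch of fencing tokens: a stale lock holder's write
// is rejected at the storage layer, even if it believes it holds the lock.
package main

import (
	"fmt"
	"sync"
)

type lockService struct {
	mu   sync.Mutex
	next int64
}

// Grant issues a monotonically increasing fencing token with each lock grant.
func (l *lockService) Grant() int64 {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.next++
	return l.next
}

type store struct {
	mu      sync.Mutex
	highest map[string]int64 // record key -> newest token seen
}

// Write succeeds only if the caller's token is at least as new as any
// token previously used for this key, fencing out stale holders.
func (s *store) Write(key string, token int64) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if token < s.highest[key] {
		return false // stale token: the lock was reacquired after this grant
	}
	s.highest[key] = token
	return true
}

func main() {
	locks := &lockService{}
	db := &store{highest: map[string]int64{}}
	t1 := locks.Grant()                // first holder (e.g., paused in GC)
	t2 := locks.Grant()                // second holder acquires after TTL expiry
	fmt.Println(db.Write("ord-1", t2)) // true: newest token accepted
	fmt.Println(db.Write("ord-1", t1)) // false: stale holder rejected
}
```

In the incident scenario, pod-7f's resumed write would have carried the older token and been rejected, confining the damage to one worker's view instead of corrupting shared state.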
What We Got Lucky About¶
- The corruption was confined to order metadata (status, line items, fulfillment records) — all of which are correctable by replaying or reconciling from the audit log. Payment records, which would have required bank reconciliation and potential regulatory reporting, are stored in an entirely separate service with its own data store and were not touched by the lock manager.
- The audit log tables (`order_events`, `inventory_events`) use append-only writes with timestamps, which gave Data Engineering a reliable event stream to use for reconciliation. Had the application used destructive updates (UPDATE without audit), reconciliation would have been impossible rather than merely time-consuming.
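As a sketch of why append-only events make reconciliation mechanical (the field names and statuses below are illustrative, not the real schema, and the actual reconciliation logic was more involved): replaying a record's events in timestamp order yields one authoritative final state even when the table rows contradict each other.

```go
// Illustrative sketch: derive an order's authoritative status by
// replaying its append-only audit events in timestamp order.
package main

import (
	"fmt"
	"sort"
)

type event struct {
	ts     int64  // event timestamp (ms); audit rows are append-only
	status string // status recorded by this event
}

// reconcile sorts events by timestamp and returns the status implied
// by the latest event — the tie-breaker when table rows conflict.
func reconcile(events []event) string {
	sort.Slice(events, func(i, j int) bool { return events[i].ts < events[j].ts })
	if len(events) == 0 {
		return ""
	}
	return events[len(events)-1].status
}

func main() {
	// Two conflicting writes landed within 200ms of each other; the
	// audit log still gives them a usable order.
	fmt.Println(reconcile([]event{
		{ts: 1000, status: "PROCESSING"},
		{ts: 1180, status: "FULFILLED"},
	})) // FULFILLED
}
```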
Detection¶
How We Detected¶
SRE's Datadog dashboard showed the order fulfillment error rate climbing from baseline (< 0.5%) to 8% over approximately 6 minutes, triggering a P2 alert at T+14m. The on-call SRE pulled application logs and saw the lock acquisition collision messages within 4 minutes of alert acknowledgment. The log pattern — two different pod names claiming the same lock key within milliseconds — was unambiguous.
Why We Didn't Detect Sooner¶
The error rate alert threshold was set at 5% for 3 consecutive minutes before firing. Under the traffic ramp, the threshold was crossed approximately 8 minutes after the first lock collision. An absolute error count alert (e.g., >50 errors/minute regardless of rate) would have fired sooner during the ramp-up. Additionally, there was no Redis-level monitoring for concurrent lock holders on the same key; an alert on such a metric would have fired on the first collision.
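As a sketch of what AI-004's absolute-count alert could look like (shown in Prometheus rule syntax for illustration; the production alerting lives in Datadog, and the metric name here is an assumption):

```yaml
groups:
  - name: order-fulfillment
    rules:
      - alert: OrderFulfillErrorsAbsoluteCount
        # Fire on >50 errors/min sustained for 1 minute, regardless of
        # total traffic — catches ramps that a percentage threshold hides.
        expr: sum(rate(order_fulfill_errors_total[1m])) * 60 > 50
        for: 1m
        labels:
          severity: page
        annotations:
          summary: "Order fulfillment errors above 50/min (absolute count)"
```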
Response¶
What Went Well¶
- Log messages were sufficiently detailed to confirm the lock collision within minutes of the alert. Both the lock key and the acquiring worker identities were logged at INFO level, which meant the TOCTOU failure was visible in standard log output without enabling debug logging.
- The mitigation of scaling to a single replica, while operationally undesirable, was identified and implemented quickly (17 minutes from war room open to single-replica mode). It was a creative and effective short-circuit that eliminated contention without requiring a code change.
- Data Engineering's audit log tables provided a complete and reliable event history for reconciliation. The reconciliation script was developed, validated in staging, and run in production within approximately 3 hours of scope identification.
- The decision to keep the order service in single-replica mode until full reconciliation was complete — rather than restoring scale and re-exposing the bug — was correct and prevented further corruption during the recovery window.
What Went Poorly¶
- Two mitigation options (rollback, Redlock implementation) were explored and rejected before the single-replica solution was identified. This consumed approximately 25 minutes of war room time. A pre-written emergency mitigation playbook for "suspected lock contention" would have listed traffic throttling and single-instance mode as immediate options.
- The marketing campaign launched with no capacity review or notification to Platform Engineering. If the traffic spike had been anticipated, the incident likely would have been prevented entirely.
- The 28 orders requiring manual business judgment took 4 hours to resolve because the process involved three teams (Order Platform, Inventory, Customer Success) with no pre-defined escalation path for ambiguous reconciliation cases.
Action Items¶
| ID | Action | Priority | Owner | Status | Due Date |
|---|---|---|---|---|---|
| AI-001 | Replace hand-rolled lock manager with Redlock implementation using 3 independent Redis nodes; add fencing token support to all write paths that consume locks; ship with load test at 5× baseline | P0 | Chen Wei (Distributed Systems) | In Progress | 2025-06-13 |
| AI-002 | Require capacity review ticket (filed ≥48h in advance) for any marketing campaign expected to increase order rate >50% above recent baseline; create intake form and SLA with Platform Engineering | P0 | Lena Fischer (Order Platform) + Marketing Engineering | Not Started | 2025-06-06 |
| AI-003 | Add Redis monitor for concurrent lock holders on same key (use Redis keyspace notifications + a Prometheus counter); alert immediately on >1 holder | P1 | Bastian Richter (SRE) | Not Started | 2025-06-06 |
| AI-004 | Change order fulfillment error rate alert from percentage-based to absolute count (>50 errors/min for 1 min = alert) to catch issues earlier during traffic ramps | P1 | SRE | Not Started | 2025-05-30 |
| AI-005 | Document emergency mitigation playbook for lock contention: immediate options are (1) scale to 1 replica, (2) rate-limit inbound traffic at API gateway, (3) enable Redis WATCH-based optimistic locking as interim | P1 | Ingrid Solberg (Order Platform) | Not Started | 2025-06-06 |
| AI-006 | Define and document the escalation path for ambiguous order reconciliation cases, including ownership, decision authority, and SLA (target: <1h from identification to resolution) | P2 | Order Platform + Data Engineering | Not Started | 2025-06-13 |
Lessons Learned¶
- **Hand-rolled distributed systems primitives accumulate hidden assumptions.** The lock manager was written with a specific Redis command sequence that was subtly wrong from the start. Two years of passing tests and low-traffic operation provided false confidence. Standard libraries and algorithms (Redlock, ZooKeeper leases, etcd distributed locks) have been battle-tested at scale and reviewed by the community. Building your own is rarely justified and requires the same rigor as a cryptographic implementation — which this did not receive.
- **Traffic spikes are a different operational regime, not just more of the same.** A system that works at 1× can fail in qualitatively different ways at 10×, not just be slower. Race conditions, GC pressure, connection pool exhaustion, and queue depth all exhibit nonlinear behavior under high load. Load tests must cover the realistic upper bound of traffic, not just a comfortable multiple of baseline, and marketing campaigns are a predictable source of large spikes that must be part of that planning.
- **Audit logs are a recovery superpower if you have them.** The difference between "we can reconcile this" and "we have irrecoverable data loss" was the presence of append-only audit log tables. The decision to implement audit logging years ago, for compliance reasons, turned out to be the single most important factor in limiting the incident's impact. Audit logs should be treated as critical infrastructure, not as a compliance checkbox.
Cross-References¶
- Failure Pattern: Software Bug — TOCTOU Race in Distributed Lock; Design Flaw — Missing Fencing Tokens; Process Gap — No Traffic Spike Communication Channel
- Topic Packs: distributed-locking, redis-patterns, race-conditions, incident-response-data-corruption
- Runbook: `runbooks/platform/order-service-lock-contention-mitigation.md` (to be created per AI-005)
- Decision Tree: Triage path — "Order fulfillment errors rising, logs show duplicate lock acquisition on same key" → immediately scale order service to 1 replica to eliminate contention → page Distributed Systems → do not attempt to roll back application until scope of corruption is understood