Distributed Systems Footguns

Mistakes that cause data loss, split-brain, cascading failures, and silent inconsistency.


1. Assuming a request that timed out did not execute

You send a write request to a service. No response after 5 seconds. You assume it failed and retry. But the server received and committed the request — the response was lost in the network. Now you've made the same write twice. For non-idempotent operations (create order, charge card), this produces duplicates.

Fix: Make all operations idempotent. Use an idempotency key derived from the business domain. On retry, send the same key. The server deduplicates using the key. Never assume timeout = not executed.
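A minimal server-side sketch of key-based deduplication (the `create_order` name and in-memory store are illustrative; a real implementation needs a durable store with a unique constraint on the key):

```python
import uuid

# Hypothetical in-memory dedup store; production would use a database
# table with a unique constraint on the idempotency key.
_processed: dict[str, dict] = {}

def create_order(idempotency_key: str, payload: dict) -> dict:
    """Execute the write at most once per idempotency key."""
    if idempotency_key in _processed:
        # Retry after a lost response: return the original result,
        # do not execute the write again
        return _processed[idempotency_key]
    result = {"order_id": str(uuid.uuid4()), "items": payload["items"]}
    _processed[idempotency_key] = result
    return result
```

Deriving the key from the business domain (cart ID, payment intent) means even retries from a different client instance deduplicate correctly.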


2. Using wall-clock timestamps to order distributed events

Two services write events to a shared store. Service A's event is stamped 2024-01-01T12:00:00.000Z; Service B's is stamped 2024-01-01T11:59:59.999Z, so sorting by timestamp puts B first. But Service A's clock is 50ms fast: its event really occurred at 11:59:59.950Z, before B's. The event that looks later actually happened first, and no sort on wall-clock time can tell you that.

Fix: Don't use wall-clock time for event ordering across nodes. Use logical clocks (Lamport timestamps), vector clocks, or a monotonic sequence from a single authoritative source (database sequence, Snowflake ID).
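A Lamport clock fits in a few lines. This is a sketch of the standard algorithm, not tied to any particular store:

```python
class LamportClock:
    """Logical clock: orders events causally, independent of wall time."""

    def __init__(self):
        self.time = 0

    def tick(self) -> int:
        # Local event: advance the clock
        self.time += 1
        return self.time

    def receive(self, remote_time: int) -> int:
        # On message receipt: jump past the sender's clock, then tick
        self.time = max(self.time, remote_time) + 1
        return self.time
```

Every message carries the sender's timestamp, so a causally later event always gets a larger number, no matter how skewed the nodes' wall clocks are.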


3. Two-node etcd clusters

You deploy etcd with 2 nodes because "that's redundant." Quorum for 2 nodes requires both to agree. When one node fails (crash, maintenance, network issue), etcd loses quorum and becomes read-only. Your Kubernetes cluster loses the ability to schedule pods or update deployments. A single-node cluster is actually more available during planned maintenance.

Fix: Always use odd numbers: 3, 5, or 7 etcd nodes. 3 nodes tolerate 1 failure. 5 nodes tolerate 2 failures. For most clusters, 3 nodes is the right tradeoff.
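The quorum arithmetic behind this is short enough to write down:

```python
def quorum(n: int) -> int:
    """Majority needed for a cluster of n voting members."""
    return n // 2 + 1

def tolerated_failures(n: int) -> int:
    """Nodes that can fail while the cluster keeps quorum."""
    return n - quorum(n)
```

Adding a fourth or sixth node raises the quorum size without raising fault tolerance, which is why the recommendation is always an odd count.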


4. Retrying without exponential backoff

A downstream service goes down. Your service retries every 1 second. 500 instances of your service retry simultaneously. The downstream recovers and immediately receives 500 RPS of retries while still starting up. It falls over again. This cycle repeats indefinitely (a "retry storm").

Fix: Use exponential backoff with jitter. Start small, grow exponentially, add randomness to desynchronize retries across instances. Cap at a maximum interval.

import random, time

class TransientError(Exception):
    """Your app's retryable error type (timeouts, 503s, etc.)."""

def retry(func, max_attempts=5, base=0.5, cap=30):
    for attempt in range(max_attempts):
        try:
            return func()
        except TransientError:
            if attempt == max_attempts - 1:
                raise
            delay = base * (2 ** attempt)
            time.sleep(min(cap, delay * (0.5 + random.random())))  # ±50% jitter, capped

5. Reading from async replicas after a write and expecting consistency

You write a record to the primary database. Immediately after, you redirect the user to a page that reads from a replica. The replica hasn't received the write yet (replication lag). The user sees their old data. They click "save" again. You now have two conflicting writes.

Fix: For reads-after-writes that must be consistent, route to the primary, or use a session-consistent replica read: record the primary's log position (e.g. Postgres LSN) at write time and only read from a replica that has replayed past it. Design your application so user-visible confirmation pages always read from the primary. Or use synchronous replication — with the understanding that it adds write latency.
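One pragmatic sketch of session-scoped read-your-writes routing. The lag budget and helper names are illustrative; tracking the replica's replayed log position is more precise than a time bound:

```python
import time

REPLICA_LAG_BUDGET = 2.0  # seconds; assumed upper bound on replication lag

# Hypothetical session store: user_id -> monotonic time of their last write
_last_write: dict[str, float] = {}

def record_write(user_id: str) -> None:
    _last_write[user_id] = time.monotonic()

def choose_db(user_id: str) -> str:
    """Route reads to the primary while the user's last write may
    still be in flight to the replicas."""
    last = _last_write.get(user_id)
    if last is not None and time.monotonic() - last < REPLICA_LAG_BUDGET:
        return "primary"
    return "replica"
```

Users who just wrote see their own data; everyone else still gets the cheap replica read.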


6. Not implementing circuit breakers for downstream calls

Your service calls a slow downstream. The downstream gets overwhelmed and starts timing out after 30 seconds. You have 200 concurrent users. All their requests are now held open for 30 seconds each, waiting for the downstream. Your thread pool exhausts. Your service becomes unavailable, even though the downstream is only a non-critical dependency.

Fix: Wrap all downstream calls in a circuit breaker. When the circuit opens (failure threshold exceeded), immediately return an error or fallback instead of waiting for timeout. This fails fast and preserves your service's capacity for requests that don't need the broken downstream.

import requests
from requests.exceptions import RequestException
from circuitbreaker import circuit  # pip install circuitbreaker

@circuit(failure_threshold=5, recovery_timeout=30, expected_exception=RequestException)
def call_recommendation_service(user_id: str) -> list:
    return requests.get(f"http://recommendations/api/{user_id}", timeout=2).json()

7. Performing non-atomic read-modify-write across network calls

You read a record, modify it in application code, and write it back. Between your read and write, another process modifies the same record. Your write overwrites their change. Last write wins — silently.

Fix: Use atomic operations where available (UPDATE ... WHERE version = ?), optimistic locking (check version before write, retry on conflict), or pessimistic locking (SELECT FOR UPDATE). Never read-modify-write without protection.

-- Optimistic locking: fail if someone else updated since our read
UPDATE accounts
SET balance = balance + 100, version = version + 1
WHERE id = 42 AND version = 7;  -- fails if version changed
-- Check rows_affected: if 0, someone else won the race — retry
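Wrapped in a retry loop, the pattern looks like this. SQLite is used here only to keep the sketch self-contained; the schema and function name are illustrative:

```python
import sqlite3

def credit_account(conn: sqlite3.Connection, account_id: int, amount: int,
                   max_attempts: int = 3) -> bool:
    """Optimistic-locking retry loop around a versioned UPDATE."""
    for _ in range(max_attempts):
        row = conn.execute(
            "SELECT balance, version FROM accounts WHERE id = ?",
            (account_id,)).fetchone()
        if row is None:
            return False
        balance, version = row
        cur = conn.execute(
            "UPDATE accounts SET balance = ?, version = version + 1 "
            "WHERE id = ? AND version = ?",
            (balance + amount, account_id, version))
        conn.commit()
        if cur.rowcount == 1:   # our version was still current: write landed
            return True
        # rowcount == 0: someone else won the race; re-read and retry
    return False
```

The re-read on each attempt is what makes the retry safe: the new balance is computed from the current row, not the stale one.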

8. Treating eventual consistency like "just slightly delayed consistency"

You use DynamoDB with eventual consistency because "it's just a few milliseconds behind." Then you read an item immediately after writing it and get stale data. You read-modify-write and lose the write. A user sees their old settings after saving. You say "it'll be consistent eventually" — but the user thinks it's broken.

Fix: Model your application around eventual consistency. Don't read-after-write and expect to see your write (use conditional writes instead). Use strongly consistent reads when you need them (ConsistentRead=True in DynamoDB costs 2x read units). Be explicit with your team about which operations are eventually consistent.

import boto3

dynamodb = boto3.client("dynamodb")

# DynamoDB: explicit strong read when you need it
response = dynamodb.get_item(
    TableName="users",
    Key={"user_id": {"S": user_id}},
    ConsistentRead=True,  # costs 2x but guarantees you see latest write
)

9. Not handling the "forgotten node" in split-brain recovery

A network partition splits your cluster: 3 nodes on side A, 2 nodes on side B. Side A has quorum and continues as leader. Side B should detect that it lacks quorum and stop accepting writes — but its leadership-detection logic is buggy and it keeps accepting them. The partition heals and side B rejoins. You now have conflicting data that must be reconciled by hand.

Fix: Use a proper consensus system (etcd, ZooKeeper, Raft-based DB) for anything that requires strong consistency. Implement STONITH (fencing) for systems that can't afford split-brain. Verify your replication protocol handles partition recovery correctly — don't implement your own.
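One well-known fencing pattern, sketched below: the lock service hands each new leader a monotonically increasing token, and storage rejects writes carrying a stale one. Class and field names are illustrative:

```python
class FencedStore:
    """Storage that rejects writes carrying a stale fencing token.
    Tokens come from the lock/lease service and strictly increase
    with each new leader."""

    def __init__(self):
        self.highest_token = 0
        self.data: dict[str, str] = {}

    def write(self, token: int, key: str, value: str) -> bool:
        if token < self.highest_token:
            return False  # stale leader from before the partition
        self.highest_token = token
        self.data[key] = value
        return True
```

Even a "zombie" leader that never noticed it lost the lease can't corrupt state, because its old token is rejected at the storage layer.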


10. Deep health checks that create health check cascades

You implement a /ready endpoint that checks all 8 downstream dependencies. When one dependency is slow (say 5 seconds), all health checks to your service time out. Kubernetes marks your pods as not ready and stops sending traffic. But your service works fine for 7/8 use cases. The slow dependency is only needed for one endpoint. You've made a partial failure into a full outage.

Fix: Design health checks to reflect whether the service can handle traffic, not whether all dependencies are healthy. Use separate degraded/healthy states. Don't fail readiness for non-critical dependencies.

from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

@app.get("/ready")
async def readiness():
    # Critical path: fail readiness if these are down
    db_ok = await check_db(timeout=2.0)
    if not db_ok:
        return JSONResponse({"status": "not ready", "reason": "db"}, status_code=503)

    # Non-critical: degrade gracefully, don't fail readiness
    cache_ok = await check_cache(timeout=1.0)
    recommendations_ok = await check_recommendations(timeout=1.0)

    return JSONResponse({
        "status": "ready",
        "degraded": [s for s, ok in [("cache", cache_ok), ("recommendations", recommendations_ok)] if not ok],
    })

11. Using a single global timeout for all operations

You set timeout=5s for all HTTP calls. A batch API endpoint is expected to take 30 seconds. It gets killed at 5 seconds. An internal health check is expected to take < 100ms. It has 5 seconds of slack, during which unhealthy nodes aren't detected.

Fix: Set timeouts per operation type, based on expected latency and SLO. Use separate connect timeout and read timeout. Log when operations approach their timeout.

import requests

# Per-operation timeouts
TIMEOUTS = {
    "health_check": (1, 1),        # (connect, read) seconds
    "user_lookup": (0.5, 2),
    "search": (1, 10),
    "batch_import": (2, 120),
}

def call(endpoint: str, operation: str):
    timeout = TIMEOUTS.get(operation, (2, 30))
    return requests.get(endpoint, timeout=timeout)

12. Ignoring the thundering herd on service restart

You restart all 20 pods of your service simultaneously (rolling restart with maxSurge=100%). All 20 pods start up and simultaneously try to warm their caches, establish DB connection pools, sync state from the primary, and process the backlog of traffic that queued during restart. The database connection pool limit is 100 total connections. 20 pods × 10 connections each = 200 connection attempts. Half fail. Pods crash-loop. Restart cascades.

Fix: Use staged restarts (maxUnavailable=1, maxSurge=1). Set connection pool sizes accounting for maximum pod count. Add a startup delay or jitter between pods using minReadySeconds. Implement connection pool pre-warming with limits.

spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  minReadySeconds: 30  # wait 30s after pod is ready before proceeding
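The pool-sizing arithmetic from the scenario above, plus a startup-jitter sketch. The limits are the assumed numbers from this example, not recommendations:

```python
import random
import time

DB_MAX_CONNECTIONS = 100   # database-side hard limit (assumed)
MAX_PODS = 20              # peak pod count during a surge rollout

# Size each pod's pool so the worst case stays under the DB limit
POOL_SIZE = DB_MAX_CONNECTIONS // MAX_PODS   # 5 per pod, 100 total

def startup_jitter(max_delay: float = 10.0) -> None:
    """Desynchronize cold-start work (cache warm, pool fill) across pods."""
    time.sleep(random.uniform(0, max_delay))
```

Size for the surge maximum, not the steady-state pod count — rollouts are exactly when the extra pods exist.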

13. Conflating "the service is up" with "the service is correct"

Your health check returns 200. Your monitoring says 0 alerts. But a race condition means 5% of writes are silently dropped. Users don't notice because the write "succeeded" (HTTP 200 returned before the async write committed). Your service is up, healthy, and losing data.

Fix: Implement business-level correctness checks. Track write success rates at the business layer (not just HTTP status). Use end-to-end synthetic transactions that verify data actually persisted. Alert on discrepancies between writes reported by the application and writes confirmed by the database.

# Write with verification
async def write_order(order: Order) -> dict:
    result = await db.insert(order)
    order_id = result["id"]

    # Verify the write actually stuck (catch silent failures);
    # read from the primary here, or replica lag will raise false alarms
    verified = await db.get(order_id)
    if verified is None:
        metrics.increment("order.write_lost")
        logger.error("Write lost", extra={"order_id": order_id})
        raise DataIntegrityError(f"Order {order_id} was not persisted")

    return result