
Distributed Systems Fundamentals — Street-Level Ops

Quick Diagnosis Commands

# etcd cluster health (used by Kubernetes, Consul, CockroachDB)
etcdctl endpoint health --cluster \
  --endpoints=$ETCD_ENDPOINTS \
  --cacert=/etc/ssl/etcd/ca.crt \
  --cert=/etc/ssl/etcd/peer.crt \
  --key=/etc/ssl/etcd/peer.key

# etcd leader and status
etcdctl endpoint status --cluster --write-out=table

# Check etcd latency (high latency → leader elections → cluster instability)
etcdctl check perf --endpoints=$ETCD_ENDPOINTS

# Watch for etcd leader changes (a changing leader ID is a sign of instability)
# (etcdctl watch only shows key changes, not elections — poll the status instead)
while true; do
  etcdctl endpoint status --cluster --write-out=json | jq -r '.[0].Status.leader'
  sleep 5
done

# Check CockroachDB cluster status
cockroach node status --host=localhost:26257 --insecure

# Check Kafka partition leadership distribution
kafka-topics.sh --bootstrap-server localhost:9092 --describe | \
  grep -o 'Leader: [0-9-]*' | sort | uniq -c | sort -rn

# Redis Sentinel: check master/replica status
redis-cli -p 26379 SENTINEL masters
redis-cli -p 26379 SENTINEL replicas mymaster  # "SENTINEL slaves" on Redis < 5

# Check network partition symptoms: high RTT between nodes
for node in 10.0.1.10 10.0.1.11 10.0.1.12; do
  echo -n "$node: "; ping -c 3 -q $node | tail -1
done

One-liner: In distributed systems, the two hardest problems are: 1) exactly-once delivery, 2) guaranteed ordering, and 3) agreeing on how many problems there are.

Gotcha: 2-Node etcd Is Worse Than 1 Node

Rule: A 2-node etcd cluster has zero fault tolerance. Quorum is a strict majority, so 2 nodes require 2/2 up. If either node fails, the cluster loses quorum and rejects writes (linearizable reads fail too). A 1-node cluster at least fails clearly. Use 3, 5, or 7 nodes.

# Verify your etcd cluster size
etcdctl endpoint status --cluster --write-out=json | \
  jq '[.[] | .Status.header.member_id] | length'

# Check member count
etcdctl member list --write-out=table

# Add a third member if you only have two
etcdctl member add etcd3 --peer-urls=https://10.0.1.12:2380
# Then start etcd on the new node with ETCD_INITIAL_CLUSTER_STATE=existing
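
The arithmetic behind the rule: quorum is a strict majority, and fault tolerance is N minus quorum. A quick sketch:

```python
# Quorum = strict majority of N voting members; fault tolerance = N - quorum.
def quorum(n: int) -> int:
    return n // 2 + 1

def tolerated_failures(n: int) -> int:
    return n - quorum(n)

for n in (1, 2, 3, 4, 5, 7):
    print(f"{n} nodes: quorum={quorum(n)}, survives {tolerated_failures(n)} failure(s)")
# 2 nodes: quorum=2, survives 0 failure(s). No better than 1 node,
# with twice the hardware that can fail.
```

Note that even cluster sizes never help: 4 nodes tolerate the same single failure as 3.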

Gotcha: Thundering Herd on Cache Restart

Rule: When a cache (Redis, Memcached) restarts, or a batch of keys expires at the same instant, all clients miss simultaneously and hammer the backend. Add jitter to your TTLs so expirations don't line up.

# BAD: 1000 threads all check cache, all miss, all query DB at once
def get_user(user_id: str) -> dict:
    cached = cache.get(f"user:{user_id}")
    if cached:
        return json.loads(cached)
    user = db.query("SELECT * FROM users WHERE id = %s", [user_id])
    cache.setex(f"user:{user_id}", 300, json.dumps(user))
    return user

# GOOD: add jitter to TTL to stagger expirations
import random
def get_user(user_id: str) -> dict:
    cached = cache.get(f"user:{user_id}")
    if cached:
        return json.loads(cached)
    user = db.query("SELECT * FROM users WHERE id = %s", [user_id])
    # TTL: 300s ± 30s (10% jitter)
    ttl = 300 + random.randint(-30, 30)
    cache.setex(f"user:{user_id}", ttl, json.dumps(user))
    return user
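
Jitter staggers expirations, but after a full cache restart every key is cold at once. A complementary pattern is single-flight repopulation: a per-key lock (Redis SET NX) so only one caller hits the DB per key. A sketch, where FakeCache and fetch_user_from_db are stand-ins for your real Redis client and DB query:

```python
import json
import random
import time

class FakeCache:
    """In-memory stand-in for a Redis client (real code would use redis.Redis)."""
    def __init__(self):
        self.store = {}
    def get(self, key):
        return self.store.get(key)
    def set(self, key, value, nx=False, ex=None):
        if nx and key in self.store:
            return None          # mirror redis-py: SET NX fails if the key exists
        self.store[key] = value
        return True
    def setex(self, key, ttl, value):
        self.store[key] = value
    def delete(self, key):
        self.store.pop(key, None)

cache = FakeCache()

def fetch_user_from_db(user_id: str) -> dict:
    # placeholder for the real DB query
    return {"id": user_id, "name": "example"}

def get_user_single_flight(user_id: str) -> dict:
    """Only one caller per key rebuilds the cache; the rest wait for its fill."""
    key = f"user:{user_id}"
    for _ in range(50):                      # ~5 s worst case before giving up
        cached = cache.get(key)
        if cached:
            return json.loads(cached)
        # SET NX is atomic: exactly one concurrent caller wins the rebuild lock
        if cache.set(f"lock:{key}", "1", nx=True, ex=10):
            user = fetch_user_from_db(user_id)
            cache.setex(key, 300 + random.randint(-30, 30), json.dumps(user))
            cache.delete(f"lock:{key}")      # release so future rebuilds can run
            return user
        time.sleep(0.1)                      # someone else is filling; re-check soon
    raise TimeoutError(f"cache fill for {key} timed out")
```

The lock TTL (10 s here) bounds how long a crashed filler can block others; pick it larger than your worst-case DB query.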

Pattern: Implementing Idempotency with Redis

import redis
import json
from functools import wraps

r = redis.Redis(host='redis', port=6379, decode_responses=True)

def idempotent(ttl: int = 86400):
    """Decorator: make a function idempotent using Redis."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, idempotency_key: str, **kwargs):
            cache_key = f"idem:{func.__name__}:{idempotency_key}"

            # Check for existing result
            existing = r.get(cache_key)
            if existing:
                result = json.loads(existing)
                result['_idempotent'] = True
                return result

            # Execute and cache. Note: get-then-setex is not atomic, so two
            # concurrent requests with the same key can still both execute;
            # for strict once-only semantics, claim the key first with SET NX.
            result = func(*args, idempotency_key=idempotency_key, **kwargs)
            r.setex(cache_key, ttl, json.dumps(result))
            return result
        return wrapper
    return decorator

@idempotent(ttl=3600)
def create_order(user_id: str, items: list, idempotency_key: str) -> dict:
    # DB insert happens only once even if called multiple times with same key
    order_id = db.insert("INSERT INTO orders ...")
    return {"order_id": order_id, "status": "created"}

# Client:
order = create_order(
    user_id="user-123",
    items=[{"product_id": "prod-456", "qty": 2}],
    idempotency_key=f"order-{user_id}-{cart_hash}",  # deterministic from business data
)

Pattern: Health Check That Detects Partial Failure

from fastapi import FastAPI
from fastapi.responses import JSONResponse
import asyncio

# db, redis, and kafka below are assumed async clients, initialized elsewhere
app = FastAPI()

@app.get("/health")
async def liveness():
    """Shallow: just proves process is alive."""
    return {"status": "ok"}

@app.get("/ready")
async def readiness():
    """Deep: proves service can handle traffic."""
    checks = {}
    failed = []

    # Database check with timeout
    try:
        await asyncio.wait_for(db.execute("SELECT 1"), timeout=2.0)
        checks["database"] = "ok"
    except asyncio.TimeoutError:
        checks["database"] = "timeout"
        failed.append("database")
    except Exception as e:
        checks["database"] = f"error: {type(e).__name__}"
        failed.append("database")

    # Cache check
    try:
        await asyncio.wait_for(redis.ping(), timeout=1.0)
        checks["cache"] = "ok"
    except Exception as e:
        checks["cache"] = f"error: {type(e).__name__}"
        # Cache is non-critical — warn but don't fail readiness
        # (depends on your service's cache dependency)

    # Message queue check
    try:
        await asyncio.wait_for(kafka.list_topics(), timeout=2.0)
        checks["kafka"] = "ok"
    except Exception as e:
        checks["kafka"] = f"error: {type(e).__name__}"
        failed.append("kafka")

    status_code = 200 if not failed else 503
    return JSONResponse(
        content={"status": "degraded" if failed else "ok", "checks": checks, "failed": failed},
        status_code=status_code,
    )

Scenario: Diagnosing a Split-Brain Symptom

Symptoms: users report stale or conflicting data across different browser sessions; database replicas show different values for the same row; application logs show "stale read" errors.

# 1. Check replication lag (PostgreSQL)
psql -h replica -c "SELECT now() - pg_last_xact_replay_timestamp() AS lag;"

# 2. Check if replica is behind by how many transactions
psql -h primary -c "SELECT pg_current_wal_lsn();" -t
psql -h replica -c "SELECT pg_last_wal_replay_lsn();" -t
# Compare — large difference = replica very behind
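
To express that difference in bytes instead of eyeballing hex, decode the LSN format (`hi/lo`, both hex; position = hi * 2^32 + lo). A sketch; on the server itself `SELECT pg_wal_lsn_diff(...)` does the same:

```python
def lsn_to_bytes(lsn: str) -> int:
    """Decode a PostgreSQL LSN like '16/B374D848' into an absolute byte position."""
    hi, lo = lsn.strip().split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def replication_lag_bytes(primary_lsn: str, replica_lsn: str) -> int:
    """Bytes of WAL the replica has yet to replay."""
    return lsn_to_bytes(primary_lsn) - lsn_to_bytes(replica_lsn)

print(replication_lag_bytes("16/B374D848", "16/B3748000"))  # → 22600
```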

# 3. Check if replica is in recovery (should be true for a healthy standby)
psql -h replica -c "SELECT pg_is_in_recovery();"

# 4. Check for two primaries (split-brain)
for host in primary replica-1 replica-2; do
  echo -n "$host: "
  psql -h $host -c "SELECT pg_is_in_recovery();" -t 2>/dev/null || echo "unreachable"
done
# If two nodes return 'f' (false = primary), you have split-brain

# 5. etcd split-brain: check if two leaders claim leadership
etcdctl endpoint status --cluster --write-out=json | \
  jq '[.[] | select(.Status.leader == .Status.header.member_id)] | length'
# Should be exactly 1. More than 1 = split-brain.

# 6. Immediate mitigation: stop writes to all but one node
# (This depends on your architecture — Pacemaker, cloud provider HA, etc.)

Scenario: Retry Storm After a Downstream Outage

# Symptom: downstream recovered but is immediately re-overwhelmed by retries

# 1. Check your retry configuration
# Are you using exponential backoff with jitter?
grep -r "retry\|backoff\|sleep" src/ | grep -v test | head -20
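
If the grep shows bare fixed-delay retries, that is the storm. A minimal exponential-backoff sketch using the "full jitter" variant (constants are illustrative):

```python
import random
import time

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """'Full jitter': random delay in [0, min(cap, base * 2^attempt)]."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def call_with_retries(fn, max_attempts: int = 5):
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise                          # out of attempts: surface the error
            time.sleep(backoff_delay(attempt)) # randomized, so callers don't sync up
```

Full jitter spreads retries uniformly over the backoff window, so a recovering downstream sees a trickle rather than synchronized waves.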

# 2. Check circuit breaker state
curl -s http://localhost:8080/actuator/circuitbreakers | jq '.'
# If circuit is HALF_OPEN and failing → it's retrying but downstream isn't ready

# 3. If using Resilience4j: the state metric is a per-state gauge (tag "state"),
#    with value 1 while a breaker is in that state
curl -s 'http://localhost:8080/actuator/metrics/resilience4j.circuitbreaker.state?tag=state:open' | \
  jq '.measurements[] | select(.statistic == "VALUE") | .value'
# nonzero = at least one breaker is OPEN; repeat with state:half_open, state:closed

# 4. Temporarily increase the circuit breaker's open-state wait to give the
#    downstream more recovery time (environment-specific config change)
kubectl set env deployment/my-service CIRCUIT_BREAKER_WAIT_DURATION=60s
# kubectl set env triggers a rolling restart on its own

# 5. Check if load balancer is distributing retry traffic
kubectl top pods -l app=downstream-service
# If one pod is getting all traffic while others are idle → load balancer issue

Emergency: etcd Has No Quorum (Kubernetes Control Plane Down)

# Symptom: kubectl commands hang or return "etcdserver: no leader"

# 1. Check how many etcd members are up
etcdctl member list 2>&1

# 2. Check which members are healthy
etcdctl endpoint health --endpoints=https://10.0.1.10:2379,https://10.0.1.11:2379,https://10.0.1.12:2379

# 3. If you have 3-node cluster and 2 are down:
# You cannot recover without restoring from backup. Start here:
# https://kubernetes.io/docs/tasks/administer-cluster/configure-upgrade-etcd/#restoring-an-etcd-cluster

# 4. If only 1 node is down (quorum maintained):
# Remove the dead member
etcdctl member remove <dead-member-id>
# Repair the dead node and re-add
etcdctl member add etcd2 --peer-urls=https://10.0.1.11:2380

# 5. Restore from snapshot (last resort; on etcd >= 3.5 prefer `etcdutl snapshot restore`)
etcdctl snapshot restore /backup/etcd-snapshot.db \
  --name=etcd1 \
  --initial-cluster="etcd1=https://10.0.1.10:2380" \
  --initial-advertise-peer-urls=https://10.0.1.10:2380 \
  --data-dir=/var/lib/etcd

# 6. Verify after recovery
etcdctl endpoint status --cluster --write-out=table
kubectl get nodes  # should work again within 30-60s

Useful One-Liners

# Check network latency between all pairs of nodes
for src in 10.0.1.10 10.0.1.11 10.0.1.12; do
  for dst in 10.0.1.10 10.0.1.11 10.0.1.12; do
    [ "$src" = "$dst" ] && continue
    echo -n "$src -> $dst: "
    ssh $src "ping -c 3 -q $dst 2>/dev/null | tail -1"
  done
done

# Check for TCP retransmits (sign of network instability)
netstat -s | grep -i retransmit
ss -ti | grep retrans

# Debug clue: clock skew > 500ms between nodes can cause consensus failures.
# etcd, Consul, and CockroachDB all assume roughly synchronized clocks.
# If leader elections keep firing, check NTP before blaming the network.

# Check message queue lag (Kafka consumer group)
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group my-consumer-group | \
  awk '$6 ~ /^[0-9]+$/ {lag+=$6} END {print "Total lag:", lag}'

# Watch etcd key changes in real time (useful during k8s debugging)
etcdctl watch --prefix /registry/pods/default

# Check Redis replication lag
redis-cli info replication | grep -E "role|connected_slaves|slave0|master_repl_offset|slave_repl_offset"

# Find clock skew across nodes (second granularity only; use `chronyc tracking` for sub-second)
for node in 10.0.1.10 10.0.1.11 10.0.1.12; do
  echo -n "$node: "; ssh $node "date +%s"
done | awk '/[0-9]+$/ {print; ts[NR]=$NF} END {max=ts[1];min=ts[1]; for(i in ts){if(ts[i]>max)max=ts[i];if(ts[i]<min)min=ts[i]} print "Max skew:", max-min, "seconds"}'

# Check if idempotency keys are working (Redis) — use SCAN; KEYS blocks the server
redis-cli --scan --pattern "idem:*" | wc -l
redis-cli --scan --pattern "idem:*" | head -5 | xargs redis-cli mget

# Simulate a partial failure for testing (block traffic to one node)
iptables -A INPUT -s 10.0.1.11 -j DROP
iptables -A OUTPUT -d 10.0.1.11 -j DROP
# Restore:
iptables -D INPUT -s 10.0.1.11 -j DROP
iptables -D OUTPUT -d 10.0.1.11 -j DROP