# The Split-Brain Nightmare

Topics: distributed consensus, network partitions, quorum, etcd, CAP theorem, fencing
Level: L2 (Operations)
Time: 60–75 minutes
Prerequisites: None (distributed systems concepts explained from scratch)
## The Mission
Your Redis cluster has two primaries. Both are accepting writes. Both think they're the leader. Users on one side of the network partition see order A. Users on the other side see order B. Neither side knows the other exists.
When the partition heals, you have two divergent datasets and no way to automatically reconcile them. This is split-brain — the most dangerous failure mode in distributed systems.
## What Split-Brain Is
A system splits into two (or more) partitions that each believe they are the authoritative copy. Both accept writes. When the partition heals, conflicting writes must be resolved — and for many systems, this is impossible without data loss.
```
Normal:     Client → Primary → Replica → Replica
            (one leader, consistent)

Partition:  [Client A → Primary A]  |WALL|  [Client B → Primary B]
            (two leaders, divergent data)

Heal:       Primary A has writes X, Y, Z
            Primary B has writes P, Q, R
            Which version is "correct"? Both. Neither.
```
## Why It Happens: The CAP Theorem (Made Practical)
The CAP theorem (Eric Brewer, 2000) says that during a network partition, a distributed system must choose between:
- Consistency: Every read returns the most recent write (or an error)
- Availability: Every request receives a response (but it might be stale)
You can't have both during a partition. The choice defines the system:
| Choice | Behavior during partition | Example |
|---|---|---|
| CP (choose consistency) | Reject writes on the minority side | etcd, ZooKeeper, PostgreSQL with synchronous replication |
| AP (choose availability) | Accept writes on both sides (risk split-brain) | Cassandra, DynamoDB, Redis Sentinel (misconfigured) |
Mental Model: Imagine a company with offices in New York and London. The network link between them goes down. A CP system says "nobody in London can approve purchase orders until the link is restored" — consistent but some work stops. An AP system says "both offices keep approving" — work continues but purchases might conflict.
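The CP/AP choice can be made concrete with a toy node model (illustrative only — `Node`, `write_cp`, and `write_ap` are invented names for this sketch, not a real database API):

```python
# Toy model of the CAP choice during a partition (illustrative only).
class Node:
    def __init__(self, cluster_size):
        self.cluster_size = cluster_size
        self.reachable_peers = cluster_size - 1  # healthy network
        self.data = {}

    def has_quorum(self):
        # This node plus the peers it can reach must form a majority.
        return (1 + self.reachable_peers) > self.cluster_size // 2

    def write_cp(self, key, value):
        # CP: refuse the write rather than risk divergence.
        if not self.has_quorum():
            raise RuntimeError("no quorum: write rejected")
        self.data[key] = value

    def write_ap(self, key, value):
        # AP: always accept; conflicts must be reconciled after the heal.
        self.data[key] = value

node = Node(cluster_size=3)
node.reachable_peers = 0       # partitioned away from both peers
node.write_ap("order", "B")    # AP keeps working (and risks divergence)
try:
    node.write_cp("order", "A")
except RuntimeError as e:
    print(e)                   # → no quorum: write rejected
```

The same node, under the same partition, behaves completely differently depending on which rule it follows — that is the whole theorem in practice.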
## The Quorum: How Systems Prevent Split-Brain
Quorum = majority. A cluster of N nodes requires (N/2)+1 nodes to agree before making a decision. With 3 nodes, quorum is 2. With 5, quorum is 3.
- 3-node cluster: quorum = 2 (tolerates 1 failure)
- 5-node cluster: quorum = 3 (tolerates 2 failures)
- 7-node cluster: quorum = 4 (tolerates 3 failures)
During a network partition:
```
[Node A] [Node B]  |  PARTITION  |  [Node C]
     2 nodes       |             |   1 node

Side with A+B: has quorum (2/3) → continues operating
Side with C:   no quorum (1/3) → goes read-only or stops
```
The minority side CANNOT elect a new leader because it can't get a majority vote. This prevents two leaders from existing simultaneously.
Gotcha: Even-numbered clusters don't improve fault tolerance. A 4-node cluster has quorum 3 — it tolerates 1 failure, same as a 3-node cluster. But a network split can create two groups of 2, neither with quorum — both halves go down. Always use odd numbers for consensus clusters.
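The sizing rules above reduce to two one-liners; a quick sketch:

```python
def quorum(n: int) -> int:
    """Majority of an n-node cluster: floor(n/2) + 1."""
    return n // 2 + 1

def tolerated_failures(n: int) -> int:
    """Nodes that can fail while the rest still form a quorum."""
    return n - quorum(n)

for n in (3, 4, 5, 7):
    print(f"{n} nodes: quorum={quorum(n)}, tolerates {tolerated_failures(n)}")
# 4 nodes tolerate 1 failure — no better than 3 nodes — and an even
# 2+2 split leaves neither half with quorum(4) == 3.
```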
## Real-World Split-Brain: Redis Sentinel
Redis Sentinel manages Redis high-availability. It monitors the primary, and if the primary fails, Sentinel promotes a replica.
The split-brain scenario:
```
Configuration: 1 primary + 2 replicas + 3 Sentinels (quorum: 2)

Normal:
[Sentinel1]  [Sentinel2]  [Sentinel3]
     ↓            ↓            ↓
[Primary]    [Replica1]   [Replica2]

Network partition:
[Sentinel1]  [Sentinel2]  |  PARTITION  |  [Sentinel3]
     ↓            ↓                           ↓
[Primary]    [Replica1]                  [Replica2]

Sentinel3 (alone) sees the Primary as down.
With quorum=1 (MISCONFIGURED), Sentinel3 promotes Replica2 to primary.
Now TWO primaries are accepting writes.
```
War Story: A production Redis cluster had Sentinel quorum set to 1 (should have been 2). A network partition occurred. The isolated Sentinel promoted a replica. Both sides accepted writes for 40 minutes. 15,000 users lost their sessions. Payment data diverged. Reconciliation took 5 days. The postmortem documented the fix: set quorum to 2. Three months later, the exact same incident happened — the postmortem action item was never completed.
Fix: Always set Sentinel quorum to (N_sentinels / 2) + 1. With 3 Sentinels, quorum is 2. With 5, quorum is 3.
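In `sentinel.conf`, the quorum is the last argument of the `sentinel monitor` directive. A sketch with a hypothetical master name and address (the timeouts shown are examples, not recommendations):

```
# sentinel.conf — master name and address are hypothetical examples
# sentinel monitor <master-name> <ip> <port> <quorum>
sentinel monitor mymaster 10.0.0.5 6379 2
sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 60000
```

With 3 Sentinels, the `2` in the monitor line is what stops a single isolated Sentinel from declaring the primary down on its own.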
## etcd and Raft: How Kubernetes Avoids Split-Brain
etcd (the key-value store behind Kubernetes) uses the Raft consensus algorithm. Raft guarantees that at most one leader exists at any time:
- Leader sends heartbeats to followers
- If followers don't receive heartbeats, they start an election
- A candidate needs votes from a majority to become leader
- Only the leader accepts writes; followers redirect clients to the leader
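The majority-vote rule above can be sketched as a tiny helper (drastically simplified — real Raft also tracks terms, log completeness, and randomized election timeouts):

```python
def wins_election(votes: int, cluster_size: int) -> bool:
    """A candidate becomes leader only with a strict majority of the
    full cluster, its own vote included (simplified Raft rule)."""
    return votes > cluster_size // 2

# 3-node cluster split 2/1: only the majority side can elect a leader.
print(wins_election(votes=2, cluster_size=3))  # → True  (majority side)
print(wins_election(votes=1, cluster_size=3))  # → False (isolated node)
```

Because two disjoint majorities cannot exist in the same cluster, this rule is what makes "at most one leader" a mathematical guarantee rather than a hope.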
```sh
# Check etcd cluster health
etcdctl endpoint health --cluster
# → https://etcd-0:2379 is healthy: successfully committed proposal
# → https://etcd-1:2379 is healthy: successfully committed proposal
# → https://etcd-2:2379 is healthy: successfully committed proposal

# Check who's the leader
etcdctl endpoint status --cluster -w table
# → ENDPOINT      ID      IS LEADER
# → etcd-0:2379   abc123  true
# → etcd-1:2379   def456  false
# → etcd-2:2379   789abc  false
```
During a partition:
```
[etcd-0 (leader)]  [etcd-1]  |  PARTITION  |  [etcd-2]

etcd-0 + etcd-1: have quorum (2/3) → leader continues, writes succeed
etcd-2:          no quorum → stops accepting writes → the Kubernetes API
                 server pointed at this node returns errors
```
Gotcha: etcd needs fast disk. A slow disk causes heartbeat timeouts, which trigger unnecessary leader elections, which cause brief write unavailability. If your Kubernetes API is intermittently slow, check etcd disk latency with `etcdctl check perf`.
## Fencing: The Last Line of Defense
Even with quorum, there's a window where split-brain can occur: the old leader hasn't realized it's been deposed yet. It's still accepting writes for a few seconds while the new leader also accepts writes.
Fencing tokens solve this. When a new leader is elected, it gets a monotonically increasing token number. Any write to shared storage must include the current token. Storage rejects writes with old tokens.
```
Leader A (token 5): writes to storage ← accepted (5 ≥ 5)
Network partition... Leader B elected (token 6)
Leader A (token 5): writes to storage ← REJECTED (5 < 6)
Leader B (token 6): writes to storage ← accepted (6 ≥ 6)
```
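The storage-side check is tiny; a minimal sketch (`FencedStorage` is an illustrative toy, not a real library):

```python
class FencedStorage:
    """Toy shared store that rejects writes carrying a stale fencing token."""

    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token: int, key: str, value: str) -> bool:
        if token < self.highest_token:
            return False  # deposed leader: reject the stale write
        self.highest_token = token
        self.data[key] = value
        return True

store = FencedStorage()
print(store.write(token=5, key="x", value="from A"))  # → True
print(store.write(token=6, key="x", value="from B"))  # → True  (new leader)
print(store.write(token=5, key="x", value="late"))    # → False (old leader fenced)
```

The key design point: the *storage* enforces the check, so even a leader that hasn't yet noticed its own deposition cannot do damage.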
PostgreSQL's replication slots and Kafka's epoch-based leadership use similar mechanisms.
## Flashcard Check
Q1: What is split-brain?
Two or more nodes in a cluster both believe they are the leader and accept writes simultaneously. When the partition heals, conflicting writes can't be automatically reconciled.
Q2: How does quorum prevent split-brain?
A leader needs majority approval (N/2 + 1 votes). During a partition, only the side with the majority can elect a leader. The minority side goes read-only.
Q3: Why should consensus clusters have an odd number of nodes?
Even numbers don't improve fault tolerance. A 4-node cluster (quorum 3) tolerates 1 failure — same as 3 nodes. But an even split (2+2) leaves neither side with quorum.
Q4: CAP theorem — what's the practical choice?
During a partition: CP (reject writes, stay consistent) or AP (accept writes, risk divergence). Most databases choose CP. Most caches and DNS choose AP.
Q5: Redis Sentinel quorum was set to 1. What happened?
A single isolated Sentinel promoted a replica to primary. Two primaries accepted writes simultaneously. Data diverged. Always set quorum to (N/2)+1.
## Cheat Sheet

### Cluster Sizing
| Nodes | Quorum | Tolerates |
|---|---|---|
| 1 | 1 | 0 failures |
| 3 | 2 | 1 failure |
| 5 | 3 | 2 failures |
| 7 | 4 | 3 failures |
### etcd Health
| Task | Command |
|---|---|
| Cluster health | `etcdctl endpoint health --cluster` |
| Leader status | `etcdctl endpoint status --cluster -w table` |
| Performance | `etcdctl check perf` |
| Member list | `etcdctl member list -w table` |
## Takeaways

- Split-brain is the worst distributed failure. Two leaders accepting divergent writes. Recovery is manual and painful. Prevention is everything.
- Quorum prevents split-brain. Majority vote ensures only one side of a partition can operate. Always use odd-numbered clusters.
- CAP is a spectrum, not a binary. Most systems choose CP for writes (reject rather than diverge) and AP for reads (serve stale rather than error).
- Redis Sentinel quorum misconfiguration is the #1 cause of split-brain in production. Set quorum = (N_sentinels / 2) + 1. Test with `redis-cli DEBUG sleep 30`.
- etcd is the brain of Kubernetes. If etcd has split-brain, Kubernetes has split-brain. Run it on fast disks, odd-numbered clusters, and monitor election frequency.
## Related Lessons

- The Cascading Timeout — when split-brain cascades through services
- The Database That Wouldn't Start — single-node database recovery
- What Happens When You `kubectl apply` — etcd's role in the Kubernetes control plane