# Pattern: Two-Node Quorum Trap

| ID | Family | Frequency | Blast Radius | Detection Difficulty |
|---|---|---|---|---|
| FP-014 | Split Brain | Common | Cluster-Wide | Obvious |
## The Shape
Distributed consensus systems (etcd, Consul, ZooKeeper; anything built on a protocol like Raft or Paxos) require a quorum, a strict majority of members, to make decisions. With 2 nodes, the quorum is 2/2: both of them. Any single node failure therefore costs the cluster quorum; it becomes read-only and refuses to schedule new workloads. The "redundancy" of having two nodes is illusory: both must be available for the cluster to function at all. Three nodes require only 2/3, which is genuine redundancy (see the sketch below).
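A minimal sketch of the arithmetic, in plain Go with no dependencies (`quorum` and `tolerated` are illustrative names, not any library's API):

```go
package main

import "fmt"

// quorum returns the minimum number of members that must agree:
// a strict majority of n.
func quorum(n int) int { return n/2 + 1 }

// tolerated returns how many members can fail while the cluster
// still retains quorum.
func tolerated(n int) int { return n - quorum(n) }

func main() {
	for _, n := range []int{1, 2, 3, 4, 5} {
		fmt.Printf("members=%d  quorum=%d  tolerated failures=%d\n",
			n, quorum(n), tolerated(n))
	}
}
```

Two members tolerate zero failures, exactly like one; four tolerate one, exactly like three. Even member counts buy nothing, which is why the rule is an odd count of at least 3.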
## How You'll See It

### In Kubernetes
etcd with 2 members. One etcd pod crashes or a node reboots. The Kubernetes API server starts returning 500 errors ("etcdserver: request timed out"), and `kubectl get pods` fails with the same etcdserver timeout.
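A quick way to confirm that the control plane, not the network, is frozen is to probe the API server's health endpoints. A sketch in Go, assuming an API server at `https://10.0.0.1:6443` (a placeholder address) and that anonymous access to `/readyz` is allowed, which is the default RBAC grant:

```go
package main

import (
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	// Skipping cert verification is acceptable for a one-off
	// diagnostic probe, not for production tooling.
	client := &http.Client{
		Timeout: 5 * time.Second,
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}

	// /readyz?verbose lists each readiness check individually,
	// including the "etcd" check.
	resp, err := client.Get("https://10.0.0.1:6443/readyz?verbose")
	if err != nil {
		fmt.Println("API server unreachable:", err)
		return
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Printf("HTTP %d\n%s", resp.StatusCode, body)
}
```

During quorum loss the etcd check fails and `/readyz` returns 500 even though the node and its network are fine, which separates this pattern from a partition.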
### In Linux/Infrastructure
Two-node Galera cluster (MySQL). Network blip causes both nodes to consider the other unavailable. Neither has quorum; both go read-only. The split-brain protection works as designed — but the protection is indistinguishable from a complete database outage.
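To tell "split-brain protection engaged" from "database dead", ask either node for Galera's cluster status. A sketch using Go's `database/sql` with the `go-sql-driver/mysql` driver; the DSN is a placeholder:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql" // registers the "mysql" driver
)

func main() {
	// Placeholder credentials and address; point at either node.
	db, err := sql.Open("mysql", "monitor:secret@tcp(10.0.0.2:3306)/")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	var name, value string
	// Galera reports "Primary" when the node belongs to a quorate
	// component, "non-Primary" when it has lost quorum.
	err = db.QueryRow("SHOW STATUS LIKE 'wsrep_cluster_status'").Scan(&name, &value)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%s = %s\n", name, value)
}
```

If both nodes answer but report `non-Primary`, the MySQL processes are alive and it is quorum that failed. For two-node Galera specifically, a `garbd` arbitrator on a third host is the usual way to get a real majority.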
### In CI/CD
Two-node Consul cluster used for service discovery. One node fails. Consul loses quorum; service registrations and health check updates stop propagating. Services start using stale information; new deploys fail to register.
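With Consul the same condition surfaces as the cluster having no leader, which the official Go client (`github.com/hashicorp/consul/api`) can check; the default config talking to a local agent on `127.0.0.1:8500` is an assumption about your environment:

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	// DefaultConfig targets the local agent at 127.0.0.1:8500.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Leader() returns the current leader's address, or an empty
	// string when the servers cannot elect one.
	leader, err := client.Status().Leader()
	if err != nil {
		log.Fatal(err)
	}
	if leader == "" {
		fmt.Println("no leader: cluster has likely lost quorum")
		return
	}
	fmt.Println("leader:", leader)
}
```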
## The Tell
Cluster has exactly 2 members. One member is down/unreachable. All control-plane operations fail; data-plane (existing processes) may continue. Error messages explicitly mention "quorum" or "leader election failed."
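For etcd itself, the surviving member will answer a status query locally while reporting that it has no leader. A sketch using the official client (`go.etcd.io/etcd/client/v3`); the endpoint is a placeholder, and the response fields should be verified against your client version:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	endpoint := "http://127.0.0.1:2379" // placeholder: the surviving member

	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{endpoint},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Status is served by the member itself, so it responds even
	// when the cluster as a whole has lost quorum.
	resp, err := cli.Status(ctx, endpoint)
	if err != nil {
		log.Fatal(err)
	}
	if resp.Leader == 0 {
		fmt.Println("member up, but no leader: quorum lost")
		return
	}
	fmt.Printf("leader id: %x\n", resp.Leader)
}
```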
## Common Misdiagnosis
| Looks Like | But Actually | How to Tell the Difference |
|---|---|---|
| Network partition | Quorum loss on 2-node cluster | Single-node failure; network between remaining node and clients is fine |
| Software bug | 2-node quorum limitation | Failure mode is architectural, not a bug; adding a third node resolves it |
| Full cluster failure | Single node failure with quorum loss | Data plane (pods, services) still functional; only control plane is frozen |
## The Fix (Generic)
- Immediate: Restore the failed node; quorum is re-established automatically.
- Short-term: Operate in degraded mode (read-only); avoid manual data surgery on the remaining single node.
- Long-term: Always deploy distributed consensus systems with an odd number of members, at least 3. For etcd: 3 nodes (tolerates 1 failure), 5 nodes (tolerates 2 failures). Use `etcd --force-new-cluster` only as a last-resort disaster-recovery option.
## Real-World Examples
- Example 1: k3s cluster with 2 etcd members (one control plane node + embedded etcd). Control plane node rebooted for kernel update. Second etcd member lost quorum; API server unavailable for 10 minutes until the first node came back.
- Example 2: Two-node ZooKeeper for Kafka. One ZooKeeper node failed. Kafka brokers couldn't elect a controller; all producer/consumer operations failed for the duration of the outage.
## War Story
Management wanted to "save money" by running etcd on 2 nodes instead of 3. I explained the quorum math. They said "but we have two — that's redundant, right?" I drew the diagram: 2 nodes, both need to be up, one failure = cluster frozen. We went with 2 nodes anyway. Three months later, node 2's disk filled up (FP-003) and crashed etcd on that node. The API server froze. 45-minute incident. After the post-mortem, the third etcd node was approved immediately.
## Cross-References
- Topic Packs: distributed-systems, k8s-ops
- Case Studies: ops-archaeology/14-split-brain-etcd/
- Footguns: distributed-systems/footguns.md — "Two-node etcd cluster"
- Related Patterns: FP-015 (stale leader — the other split-brain shape), FP-016 (dual-write divergence — what happens when split-brain isn't detected)