Pattern: Two-Node Quorum Trap

ID: FP-014 | Family: Split Brain | Frequency: Common | Blast Radius: Cluster-Wide | Detection Difficulty: Obvious

The Shape

Distributed consensus protocols (Raft, Paxos) and the systems built on them (etcd, Consul, ZooKeeper) require a quorum — a majority of members — to make decisions. With 2 nodes, the quorum is 2/2: both. Any single node failure therefore costs the cluster quorum; it becomes read-only and refuses to schedule new workloads. The "redundancy" of having two nodes is illusory, because both must be available for the cluster to function at all. Three nodes need only 2/3 — genuine redundancy.
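The quorum arithmetic above can be checked directly. A minimal sketch in plain Python (no cluster required):

```python
def quorum(n: int) -> int:
    """Majority quorum: the smallest integer strictly greater than n/2."""
    return n // 2 + 1

def tolerated_failures(n: int) -> int:
    """How many members can fail while the cluster still holds quorum."""
    return n - quorum(n)

for n in (2, 3, 5):
    print(f"{n} members: quorum={quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
# -> 2 members: quorum=2, tolerates 0 failure(s)
# -> 3 members: quorum=2, tolerates 1 failure(s)
# -> 5 members: quorum=3, tolerates 2 failure(s)
```

The 2-member row is the whole pattern: zero tolerated failures, so "two for redundancy" buys nothing.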

How You'll See It

In Kubernetes

etcd with 2 members. One etcd pod crashes or a node reboots. The Kubernetes API server starts returning 500 errors, and kubectl get pods fails with:

Error from server: etcdserver: request timed out, possibly due to connection reset

New pods cannot be scheduled. Existing pods keep running (kubelet is autonomous), but no control-plane operation works: the cluster is frozen.

In Linux/Infrastructure

Two-node Galera cluster (MySQL). A network blip causes each node to consider the other unavailable. Neither side has quorum, so both go read-only. The split-brain protection works exactly as designed, but from the outside it is indistinguishable from a complete database outage.
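The Galera scenario reduces to a partition check: in a 1-1 split neither side holds a strict majority, while a 3-node cluster split 2-1 keeps a working majority side. A hypothetical sketch of that rule, not Galera's actual implementation:

```python
def has_quorum(partition_size: int, cluster_size: int) -> bool:
    # A partition may continue only if it holds a strict majority of members.
    return partition_size > cluster_size // 2

# 2-node cluster split 1 / 1 by a network blip: both sides go read-only.
print([has_quorum(1, 2), has_quorum(1, 2)])  # -> [False, False]

# 3-node cluster split 2 / 1: the pair keeps quorum, the loner goes read-only.
print([has_quorum(2, 3), has_quorum(1, 3)])  # -> [True, False]
```

This is why the same blip that freezes a two-node cluster merely degrades a three-node one.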

In CI/CD

Two-node Consul cluster used for service discovery. One node fails. Consul loses quorum; service registrations and health check updates stop propagating. Services start using stale information; new deploys fail to register.

The Tell

Cluster has exactly 2 members. One member is down/unreachable. All control-plane operations fail; data-plane (existing processes) may continue. Error messages explicitly mention "quorum" or "leader election failed."
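The error strings above can be spotted mechanically during triage. A toy filter — the patterns are illustrative, not an exhaustive list of what consensus systems emit:

```python
# Substrings that typically indicate quorum loss rather than a generic outage.
QUORUM_HINTS = ("quorum", "leader election", "etcdserver: request timed out")

def looks_like_quorum_loss(log_line: str) -> bool:
    """Return True if a log line matches a known quorum-loss signature."""
    line = log_line.lower()
    return any(hint in line for hint in QUORUM_HINTS)

print(looks_like_quorum_loss(
    "Error from server: etcdserver: request timed out, possibly due to connection reset"
))  # -> True
print(looks_like_quorum_loss("ImagePullBackOff for pod web-7f9"))  # -> False
```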

Common Misdiagnosis

Looks Like           | But Actually                         | How to Tell the Difference
Network partition    | Quorum loss on a 2-node cluster      | A single node failed; the network between the remaining node and clients is fine
Software bug         | 2-node quorum limitation             | The failure mode is architectural, not a bug; adding a third node resolves it
Full cluster failure | Single-node failure with quorum loss | Data plane (pods, services) still works; only the control plane is frozen

The Fix (Generic)

  1. Immediate: Restore the failed node; quorum is re-established automatically.
  2. Short-term: Operate in degraded mode (read-only); avoid manual data surgery on the remaining single node.
  3. Long-term: Always deploy distributed consensus systems with an odd number of members, at least 3. For etcd: 3 members tolerate 1 failure, 5 tolerate 2. Treat etcd's --force-new-cluster as a last-resort disaster-recovery option only.
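The "odd number of members" rule in step 3 follows from the same arithmetic: an even member count raises the quorum without raising fault tolerance. A quick check (the helper is redefined here so the snippet stands alone):

```python
def tolerated(n: int) -> int:
    # Majority quorum is n // 2 + 1; tolerance is whatever members remain.
    return n - (n // 2 + 1)

for n in range(2, 8):
    print(f"{n} members -> tolerates {tolerated(n)} failure(s)")
# 3 and 4 both tolerate 1; 5 and 6 both tolerate 2.
# The even sizes add cost and coordination overhead but no resilience.
```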

Real-World Examples

  • Example 1: k3s cluster with 2 server nodes, each running embedded etcd. One server was rebooted for a kernel update; the remaining etcd member lost quorum, and the API server was unavailable for 10 minutes until the rebooted node came back.
  • Example 2: Two-node ZooKeeper ensemble for Kafka. One ZK node failed; the survivor could not form a quorum, Kafka brokers couldn't elect a controller, and all producer/consumer operations failed for the duration.

War Story

Management wanted to "save money" by running etcd on 2 nodes instead of 3. I explained the quorum math. They said "but we have two — that's redundant, right?" I drew the diagram: 2 nodes, both need to be up, one failure = cluster frozen. We went with 2 nodes. Three months later, node 2's disk filled up (FP-003) and crashed etcd on that node. The API server froze. 45-minute incident. After the post-mortem, the third etcd node was approved immediately.

Cross-References