Pattern: Two-Node Quorum Trap

ID: FP-014 | Family: Split Brain | Frequency: Common | Blast Radius: Cluster-Wide | Detection Difficulty: Obvious

The Shape

Distributed consensus protocols (Raft, Paxos) and the systems built on them (etcd, Consul, ZooKeeper) require a quorum — a majority of members — to make decisions. With 2 nodes, the quorum is 2/2: both. Any single node failure therefore costs the cluster quorum; it becomes read-only and refuses to schedule new workloads. The "redundancy" of having two nodes is illusory, because both must be available for the cluster to function at all. Three nodes need only 2/3 — genuine redundancy.
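The quorum arithmetic above can be checked directly. A minimal sketch in plain Python (no cluster required):

```python
def quorum(n: int) -> int:
    """Majority quorum: the smallest integer strictly greater than n/2."""
    return n // 2 + 1

def tolerated_failures(n: int) -> int:
    """How many members can fail while the cluster still holds quorum."""
    return n - quorum(n)

for n in (2, 3, 5):
    print(f"{n} members: quorum={quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
# -> 2 members: quorum=2, tolerates 0 failure(s)
# -> 3 members: quorum=2, tolerates 1 failure(s)
# -> 5 members: quorum=3, tolerates 2 failure(s)
```

The 2-member row is the whole pattern: zero tolerated failures, so "two for redundancy" buys nothing.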

How You'll See It

In Kubernetes

etcd with 2 members. One etcd pod crashes or a node reboots. The Kubernetes API server starts returning 500 errors, and kubectl get pods fails with:

Error from server: etcdserver: request timed out, possibly due to connection reset

New pods cannot be scheduled. Existing pods keep running (kubelet is autonomous), but no control-plane operation works: the cluster is frozen.

In Linux/Infrastructure

Two-node Galera cluster (MySQL). A network blip causes each node to consider the other unavailable. Neither side has quorum, so both go read-only. The split-brain protection works exactly as designed, but from the outside it is indistinguishable from a complete database outage.
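The Galera scenario reduces to a partition check: in a 1-1 split neither side holds a strict majority, while a 3-node cluster split 2-1 keeps a working majority side. A hypothetical sketch of that rule, not Galera's actual implementation:

```python
def has_quorum(partition_size: int, cluster_size: int) -> bool:
    # A partition may continue only if it holds a strict majority of members.
    return partition_size > cluster_size // 2

# 2-node cluster split 1 / 1 by a network blip: both sides go read-only.
print([has_quorum(1, 2), has_quorum(1, 2)])  # -> [False, False]

# 3-node cluster split 2 / 1: the pair keeps quorum, the loner goes read-only.
print([has_quorum(2, 3), has_quorum(1, 3)])  # -> [True, False]
```

This is why the same blip that freezes a two-node cluster merely degrades a three-node one.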

In CI/CD

Two-node Consul cluster used for service discovery. One node fails. Consul loses quorum; service registrations and health check updates stop propagating. Services start using stale information; new deploys fail to register.

The Tell

Cluster has exactly 2 members. One member is down/unreachable. All control-plane operations fail; data-plane (existing processes) may continue. Error messages explicitly mention "quorum" or "leader election failed."
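The error strings above can be spotted mechanically during triage. A toy filter — the patterns are illustrative, not an exhaustive list of what consensus systems emit:

```python
# Substrings that typically indicate quorum loss rather than a generic outage.
QUORUM_HINTS = ("quorum", "leader election", "etcdserver: request timed out")

def looks_like_quorum_loss(log_line: str) -> bool:
    """Return True if a log line matches a known quorum-loss signature."""
    line = log_line.lower()
    return any(hint in line for hint in QUORUM_HINTS)

print(looks_like_quorum_loss(
    "Error from server: etcdserver: request timed out, possibly due to connection reset"
))  # -> True
print(looks_like_quorum_loss("ImagePullBackOff for pod web-7f9"))  # -> False
```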

Common Misdiagnosis

Looks Like           | But Actually                         | How to Tell the Difference
Network partition    | Quorum loss on a 2-node cluster      | A single node failed; the network between the remaining node and clients is fine
Software bug         | 2-node quorum limitation             | The failure mode is architectural, not a bug; adding a third node resolves it
Full cluster failure | Single-node failure with quorum loss | Data plane (pods, services) still works; only the control plane is frozen

The Fix (Generic)

  1. Immediate: Restore the failed node; quorum is re-established automatically.
  2. Short-term: Operate in degraded mode (read-only); avoid manual data surgery on the remaining single node.
  3. Long-term: Always deploy distributed consensus systems with an odd number of members, at least 3. For etcd: 3 members tolerate 1 failure, 5 tolerate 2. Treat etcd's --force-new-cluster as a last-resort disaster-recovery option only.
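The "odd number of members" rule in step 3 follows from the same arithmetic: an even member count raises the quorum without raising fault tolerance. A quick check (the helper is redefined here so the snippet stands alone):

```python
def tolerated(n: int) -> int:
    # Majority quorum is n // 2 + 1; tolerance is whatever members remain.
    return n - (n // 2 + 1)

for n in range(2, 8):
    print(f"{n} members -> tolerates {tolerated(n)} failure(s)")
# 3 and 4 both tolerate 1; 5 and 6 both tolerate 2.
# The even sizes add cost and coordination overhead but no resilience.
```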

Real-World Examples

  • Example 1: k3s cluster with 2 server nodes, each running embedded etcd. One server was rebooted for a kernel update; the remaining etcd member lost quorum, and the API server was unavailable for 10 minutes until the rebooted node came back.
  • Example 2: Two-node ZooKeeper ensemble for Kafka. One ZK node failed; the survivor could not form a quorum, Kafka brokers couldn't elect a controller, and all producer/consumer operations failed for the duration.

War Story

Management wanted to "save money" by running etcd on 2 nodes instead of 3. I explained the quorum math. They said "but we have two — that's redundant, right?" I drew the diagram: 2 nodes, both need to be up, one failure = cluster frozen. We went with 2 nodes. Three months later, node 2's disk filled up (FP-003) and crashed etcd on that node. The API server froze. 45-minute incident. After the post-mortem, the third etcd node was approved immediately.

Cross-References