# Pattern: Stale Leader
**ID:** FP-015 · **Family:** Split Brain · **Frequency:** Common · **Blast Radius:** Multi-Service · **Detection Difficulty:** Actively Misleading
## The Shape
A leader/primary node is partitioned from the cluster but does not yet know it. It still believes it is the leader and keeps accepting writes. Meanwhile, the rest of the cluster elects a new leader, which also begins accepting writes. Two leaders now operate in parallel, writing to what each believes is the same authoritative state. When the partition heals, the conflicting writes must be reconciled, and in many systems the resolution is "last write wins," which silently discards data.
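The "silently discards data" part is the dangerous bit, and it is easy to see in miniature. The sketch below is hypothetical (the key names and timestamps are illustrative): two nodes each accept a write to the same key during a partition, and a last-write-wins merge throws one write away without any error.

```python
# Hypothetical sketch: two "leaders" accept writes to the same key during a
# partition; last-write-wins reconciliation silently discards one side.
def reconcile_lww(a, b):
    """Merge two per-key stores of (value, timestamp), keeping the newer write."""
    merged = {}
    for key in set(a) | set(b):
        wa = a.get(key, (None, -1.0))
        wb = b.get(key, (None, -1.0))
        merged[key] = wa if wa[1] >= wb[1] else wb
    return merged

# During the partition, both nodes believe they are the leader.
old_leader = {"order:42": ("placed", 100.0)}     # stale leader's write
new_leader = {"order:42": ("cancelled", 101.0)}  # new leader's later write

merged = reconcile_lww(old_leader, new_leader)
# The stale leader's write is gone; no error was ever raised.
print(merged["order:42"])  # prints: ('cancelled', 101.0)
```

Note that nothing in the merge distinguishes "a legitimate newer write" from "a write that only exists because a stale leader kept accepting traffic"; that is why this failure is classed as actively misleading.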
## How You'll See It
### In Kubernetes
etcd split-brain (largely prevented by Raft, but possible under edge cases). More commonly: Kubernetes operators (e.g., cert-manager, or any controller using leader election) fail to release their lease. During a pod restart, two instances of the operator briefly both believe they are the leader and issue conflicting reconciliation commands.
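The overlap window exists because leases expire by clock, not by notification. The sketch below is hypothetical and does not use client-go's actual API; the names and the 15-second duration are illustrative. The point is the check itself: an instance may act as leader only while its lease is unexpired by its own clock, and an instance that skips this check keeps reconciling after losing the lease.

```python
# Hypothetical sketch of lease-based leader election (not client-go's real API).
LEASE_DURATION = 15.0  # seconds; illustrative value

def may_act_as_leader(lease, node, now):
    """True only if `node` holds the lease and the lease has not expired."""
    return lease["holder"] == node and (now - lease["renewed_at"]) < LEASE_DURATION

lease = {"holder": "operator-a", "renewed_at": 100.0}

# operator-a is partitioned at t=100 and can no longer renew.
assert may_act_as_leader(lease, "operator-a", 110.0)      # still inside its lease
assert not may_act_as_leader(lease, "operator-a", 120.0)  # must stop acting here

# The cluster re-grants the lease to operator-b after expiry.
lease = {"holder": "operator-b", "renewed_at": 116.0}
assert may_act_as_leader(lease, "operator-b", 120.0)
```

A controller that caches "I am the leader" as a boolean, instead of re-evaluating the expiry on every reconcile, is exactly the stale-leader shape in miniature.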
### In Linux/Infrastructure
The primary MySQL node in a 3-node cluster becomes network-partitioned, but continues accepting writes from application servers stranded on the same side of the partition. The cluster on the other side elects a new primary. When the partition heals, there is a conflict: both primaries accepted writes to the same rows during the partition.
### In Networking
After VRRP advertisements are lost, both routers in a VRRP pair believe they are the MASTER. Both answer for the virtual IP. The ARP entry for the virtual IP flaps, so traffic goes to whichever node most recently responded to ARP. Applications connected to the "wrong" node experience session drops.
## The Tell
- Two nodes claim to be primary/leader simultaneously.
- Writes are being accepted by two systems to the same logical dataset.
- The ARP table shows two MACs for the same IP (VRRP/CARP case).
- Replication shows "both sides are ahead of each other" (the split applies to different rows).
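The first tell is the cheapest one to automate: ask every node for its own view of its role and alarm when more than one claims current leadership. The sketch below is hypothetical; in a real system the views would come from each node's admin API or status command, not a dict.

```python
# Hypothetical detection sketch: flag split-brain when more than one node
# self-reports as leader. `views` stands in for per-node admin API responses.
def find_claimed_leaders(views):
    """views maps node name -> that node's self-reported role."""
    return sorted(n for n, role in views.items() if role == "leader")

views = {
    "db-1": "leader",   # stale leader, partitioned but still serving writes
    "db-2": "leader",   # newly elected on the majority side
    "db-3": "replica",
}
leaders = find_claimed_leaders(views)
if len(leaders) > 1:
    print("SPLIT BRAIN:", leaders)  # prints: SPLIT BRAIN: ['db-1', 'db-2']
```

The critical design point: each node must be polled directly, over whatever network path still reaches it. Polling through the cluster's own consensus view will only ever show you the majority side's (single) leader.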
## Common Misdiagnosis
| Looks Like | But Actually | How to Tell the Difference |
|---|---|---|
| Temporary slowness | Data divergence accumulating silently | Writes appear to succeed; divergence only visible after partition heals |
| Network blip | Stale leader accepting writes | Check leader lease timestamps on both nodes; both claim current leadership |
| Replication lag | Active divergence | Replica is not behind; it has different data (ahead on some rows) |
## The Fix (Generic)
- Immediate: Identify which node accepted writes after the partition; fence the stale leader (STONITH, remove it from the load balancer, shut it down).
- Short-term: Compare data between nodes; apply the correct writes from the authoritative node; discard conflicting writes (with business logic approval).
- Long-term: Use fencing tokens (lock services that increment a version; old leader's writes are rejected by storage if the token is stale); implement STONITH for database HA.
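The long-term fix above is worth spelling out, because it moves the safety decision from the leader (which may be stale and cannot know it) to the storage layer (which sees every write). A minimal sketch, with hypothetical names: the lock service hands out a monotonically increasing token with each leadership grant, and storage rejects any write carrying a token older than the highest it has seen.

```python
# Hypothetical sketch of a fencing token. The lock service issues token 1 to
# the first leader and token 2 to its successor; storage enforces monotonicity.
class FencedStore:
    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token, key, value):
        # Reject writes from any leader holding an older token than the
        # newest one this store has already seen.
        if token < self.highest_token:
            raise PermissionError(f"stale fencing token {token}")
        self.highest_token = token
        self.data[key] = value

store = FencedStore()
store.write(1, "order:42", "placed")       # old leader, token 1
store.write(2, "order:42", "cancelled")    # new leader elected, token 2
try:
    store.write(1, "order:42", "shipped")  # stale leader retries
except PermissionError as e:
    print(e)  # prints: stale fencing token 1
```

Note the contrast with the last-write-wins merge: here the stale leader's write fails loudly at write time, instead of succeeding and being silently discarded at reconciliation time.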
## Real-World Examples
- Example 1: Redis Sentinel failed to demote the old primary after a failover. Both old primary and new primary accepted writes for 45 seconds. After resolution, 1,200 cache entries had conflicting values; application showed users stale data for 2 hours until cache expiry.
- Example 2: Two VRRP masters appeared on the same VLAN after a cable was reconnected incorrectly. The "active" and "backup" virtual IPs pointed to different physical IPs. Half the connections went to each node; users experienced non-deterministic behavior (some operations worked, some didn't).
## War Story
We thought the failover was clean. Old primary was unreachable, new primary elected, apps reconnected. 20 minutes later: data inconsistencies. Users were reporting "order I just placed is gone." Turns out the old primary had a one-way network partition: it could reach the app servers but not the cluster management network. Apps were writing to the old primary; the cluster on the other side elected a new one. Old primary had 20 minutes of "authoritative" writes that had to be manually reconciled. We now do a hard fence (IPMI power-off) before any failover.
## Cross-References
- Topic Packs: distributed-systems, database-ops
- Footguns: distributed-systems/footguns.md — "Not handling forgotten node in split-brain recovery"
- Related Patterns: FP-014 (two-node quorum — the setup that enables this), FP-016 (dual-write divergence — the data consequence), FP-017 (clock skew — often confuses timestamps during reconciliation)