The Split-Brain Nightmare

Category: The Incident · Domains: distributed-systems, database-ops · Read time: ~5 min


Setting the Scene

I was a database engineer at a financial services company — about 600 employees, heavily regulated, zero tolerance for data inconsistency. We ran a three-node Galera cluster (MariaDB) for our core transaction ledger. Galera uses synchronous replication with a quorum-based approach: as long as a majority of nodes can talk to each other, the cluster accepts writes. In theory, split-brain is impossible with Galera because minority partitions go read-only. In theory.
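The quorum rule behind that "in theory" is plain majority arithmetic. A minimal sketch of the idea (not Galera's actual implementation — Galera additionally supports weighted quorum via `pc.weight`):

```python
def has_quorum(reachable_nodes: int, cluster_size: int) -> bool:
    """A partition may stay Primary (accept writes) only if it holds
    a strict majority of the configured cluster membership."""
    return reachable_nodes > cluster_size // 2

# Three-node cluster: a 2-node partition keeps quorum; a lone node
# goes Non-Primary. An even split (2 of 4) has no majority at all,
# which is one reason odd-sized clusters are preferred.
assert has_quorum(2, 3) is True
assert has_quorum(1, 3) is False
assert has_quorum(2, 4) is False
```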

What Happened

Thursday 2:00 PM — Our datacenter network team is performing maintenance on the core switches. They're upgrading firmware on the spine switches one at a time. All three Galera nodes are on the same spine-leaf fabric, spread across three racks.

2:15 PM — The first spine switch reboots. Network reconverges in about 4 seconds. Galera doesn't blink. Smooth.

2:30 PM — The second spine switch reboots. This one has a firmware issue and takes 90 seconds to come back. During those 90 seconds, the network topology is such that rack 3 (with Galera node C) is isolated from racks 1 and 2 (nodes A and B).

2:31 PM — Galera works correctly: nodes A and B form a quorum (2 of 3), continue accepting writes. Node C detects it's in the minority partition, goes to Non-Primary state, rejects writes. So far, everything by the book.

2:33 PM — The spine switch comes back. Node C reconnects to the cluster. But here's the problem: during the 90 seconds of partition, an operator saw node C in Non-Primary state and — trying to be helpful — ran SET GLOBAL wsrep_cluster_address='gcomm://' on node C to bootstrap it as a new single-node cluster. This command tells node C: "You are the entire cluster now."

2:33 PM, continued — Node C is now a standalone cluster, accepting writes. Nodes A and B are still a two-node cluster, also accepting writes. We have split-brain. Galera's quorum protection was bypassed by a manual bootstrap command.

2:35 PM — I get alerted to the Galera state inconsistency. I check SHOW STATUS LIKE 'wsrep_cluster_size' on each node. Nodes A and B show cluster_size=2. Node C shows cluster_size=1. My blood runs cold.
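The tell is that the nodes disagree about the cluster they're in. The comparison logic behind that check can be sketched like this (a simplification — the real check queries `SHOW STATUS LIKE 'wsrep_cluster_size'` over a connection to each node; here the reported sizes are passed in directly):

```python
from collections import Counter

def split_brain_suspects(reports: dict[str, int]) -> list[str]:
    """Given each node's reported wsrep_cluster_size, return the nodes
    whose view disagrees with the majority view. A disagreeing node
    that is still accepting writes is a strong split-brain signal."""
    majority_size, _ = Counter(reports.values()).most_common(1)[0]
    return [node for node, size in reports.items() if size != majority_size]

# The 2:35 PM readings: A and B agree on a 2-node cluster, C claims
# to be a cluster of one.
assert split_brain_suspects({"A": 2, "B": 2, "C": 1}) == ["C"]
```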

2:40 PM — I immediately fence node C: SET GLOBAL wsrep_on=OFF so no further writes flow through Galera replication, then pull node C out of the app's connection pool so it stops receiving traffic entirely. But by then we've had about 7 minutes of split-brain writes.

3:00 PM — 11:00 PM — The reconciliation. We export all writes from node C during the split-brain window and compare them against nodes A/B. There are 847 transactions on node C that don't exist on A/B, and 2,100 transactions on A/B that don't exist on C. Twelve of them are conflicting updates to the same rows — same account, different amounts. Each conflict requires manual review by the operations team to determine which transaction is the "real" customer intent.
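The mechanical part of that reconciliation is a three-way classification of the writes from the partition window. A toy sketch, assuming each transaction is reduced to (row key, new value) — the real export worked from binlogs and needed far more context per row:

```python
def classify_split_brain_writes(minority: dict, majority: dict):
    """minority/majority: {txn_id: (row_key, new_value)} captured during
    the partition window. Returns (minority_only, majority_only, conflicts),
    where conflicts are row keys updated to different values on both sides
    -- the cases that need a human to decide the real customer intent."""
    minority_only = set(minority) - set(majority)
    majority_only = set(majority) - set(minority)
    min_rows = {row: val for (row, val) in minority.values()}
    maj_rows = {row: val for (row, val) in majority.values()}
    conflicts = {row for row in min_rows.keys() & maj_rows.keys()
                 if min_rows[row] != maj_rows[row]}
    return minority_only, majority_only, conflicts

# A miniature version of the incident: one account updated to
# different amounts on both sides of the partition.
minority = {"c1": ("acct:42", 100), "c2": ("acct:7", 55)}
majority = {"m1": ("acct:42", 250), "m2": ("acct:9", 10)}
m_only, j_only, conflicts = classify_split_brain_writes(minority, majority)
assert conflicts == {"acct:42"}
```

In the actual incident the equivalent of `minority_only` held 847 transactions, `majority_only` 2,100, and `conflicts` the twelve rows that ate most of the eight hours.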

The Moment of Truth

Galera's split-brain protection worked perfectly. The operator's manual override broke it. The gcomm:// bootstrap command is essentially a big red button that says "I am the cluster, ignore everyone else." It exists for disaster recovery, but used during a network partition, it creates exactly the scenario it's designed to recover from.

The Aftermath

We documented the gcomm:// bootstrap command as a "two-person, manager-approved" operation. We added a wrapper script that checks cluster state before allowing a bootstrap and refuses if other nodes are reachable. We also created a split-brain runbook with the cardinal rule: during a partition, the minority side must wait, not bootstrap. The 8 hours of manual reconciliation cost us about $40,000 in staff time and delayed customer transactions by a day. We set up network monitoring for inter-rack connectivity with 10-second alerting, so we'd know about partitions before operators started "fixing" things.
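The actual wrapper was a shell script, but its gate amounts to one rule, sketched here (the reachability results are passed in; how you probe peers — ping, TCP connect to the Galera port — is an implementation detail):

```python
def safe_to_bootstrap(peer_reachable: dict[str, bool]) -> bool:
    """Gate for SET GLOBAL wsrep_cluster_address='gcomm://': if ANY
    peer answers a probe, a live cluster exists somewhere and
    bootstrapping this node would create split-brain. Refuse."""
    return not any(peer_reachable.values())

# Operator on node C at 2:31 PM: peers unreachable from C's side,
# so the probe alone cannot prove the cluster is dead.
assert safe_to_bootstrap({"A": True, "B": True}) is False
assert safe_to_bootstrap({"A": False, "B": False}) is True
```

Note the limitation: from inside a partition, the minority node's probes fail too, so the script would still have let our operator through. It catches the careless case, not the partitioned one — which is why the runbook's cardinal rule remains "wait", backed by the two-person approval, rather than any automated check.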

The Lessons

  1. Understand your consensus protocol: Every team member who can touch the database must understand how quorum works and what commands bypass it. Galera's gcomm:// bootstrap is a loaded gun — label it as such.
  2. Network partitions will happen: Design your operations around the assumption that nodes will lose contact. Have clear procedures for what to do (wait) and what NOT to do (bootstrap the minority).
  3. Have a split-brain runbook: When split-brain occurs, you need a written, rehearsed process for fencing the minority, identifying conflicting writes, and reconciling data. Don't figure this out during the incident.

What I'd Do Differently

I'd implement an automated split-brain detector that triggers fencing automatically — if a node detects it's been bootstrapped while other nodes are reachable, it should refuse to accept writes and alert loudly. I'd also move to a five-node cluster across three failure domains, making it much harder for a single network event to create an ambiguous partition.
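The detector's trigger condition is small enough to state as code. A sketch of the rule I have in mind (inputs would come from polling `wsrep_cluster_size` and probing peers; the function names are mine, not Galera's):

```python
def should_self_fence(reported_cluster_size: int, peers_reachable: int) -> bool:
    """A node reporting wsrep_cluster_size == 1 while it can still reach
    peer nodes has almost certainly been bootstrapped into its own
    cluster: it should stop accepting writes and page a human."""
    return reported_cluster_size == 1 and peers_reachable > 0

assert should_self_fence(1, 2) is True   # node C after the 2:33 PM reconnect
assert should_self_fence(1, 0) is False  # genuinely the last node standing
assert should_self_fence(3, 2) is False  # healthy cluster
```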

The Quote

"The database did exactly what we told it to do. Unfortunately, two of us told it different things."
