Pattern: Replication Lag at Failover

ID: FP-026 | Family: Silent Corruption | Frequency: Common | Blast Radius: Single Service to Multi-Service | Detection Difficulty: Subtle

The Shape

An asynchronous database replica usually trails the primary by seconds or minutes. During normal operation, this lag is acceptable. During a failover (the primary crashes and a replica is promoted), the lag becomes real data loss: transactions committed on the primary after the replica's last applied position are gone. If the failover is automatic, with no human check of the lag, the data loss happens silently. Users see recent writes disappear.

How You'll See It

In Linux/Infrastructure

Postgres primary crashes. Automatic failover promotes the replica (which was 5 minutes behind). Users see orders they placed in the last 5 minutes disappear. Payment system shows successful charges but the orders don't exist in the database.

-- On the promoted replica:
SELECT * FROM orders WHERE created_at > now() - interval '10 minutes';
-- Returns only orders older than 5 minutes; the rest are gone.

In Kubernetes

MySQL with a Kubernetes operator doing automatic failover. Primary pod OOMKilled (FP-004). Operator promotes the replica that has the highest replication position. But "highest position" was still 200 transactions behind the primary's last committed position. 200 writes are gone.
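The gap in this scenario can be made explicit with a small sketch. It assumes replication positions can be compared as plain integers, which is a simplification (real binlog coordinates and GTID sets are not single numbers), and the names `Replica` and `pick_promotion_candidate` are hypothetical. It also assumes the primary's last committed position is known, which after a crash is often only true if its logs survived.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    name: str
    applied_position: int  # last transaction position this replica has applied


def pick_promotion_candidate(replicas, primary_last_committed):
    """Return (best_replica, missing_transactions).

    The best replica is the one with the highest applied position. The gap
    between that position and the primary's last committed position is the
    number of transactions that promotion would discard.
    """
    if not replicas:
        raise ValueError("no replicas available for promotion")
    best = max(replicas, key=lambda r: r.applied_position)
    missing = max(0, primary_last_committed - best.applied_position)
    return best, missing


# Illustrative positions only: the most advanced replica is still 200 behind.
replicas = [Replica("replica-0", 10_650), Replica("replica-1", 10_800)]
best, missing = pick_promotion_candidate(replicas, primary_last_committed=11_000)
print(f"promote {best.name}; {missing} transactions would be lost")
```

Picking the most advanced replica minimises the loss but does not remove it; the second return value is the number an operator would want to see before promotion goes ahead.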

In CI/CD

Database used for CI state (build history, test results) is replicated for DR. Primary fails. CI system fails over. Last 3 build results are lost; the CI dashboard shows those builds as never completed because their completion records existed only in the primary-only window.

The Tell

Recent writes (past N minutes) are missing after a failover. The replica's Seconds_Behind_Master (MySQL; renamed Seconds_Behind_Source in 8.0.22) or the replay lag reported in pg_stat_replication (Postgres) was non-zero at the time of promotion. Users report seeing "saved" data disappear: the data existed on the primary but had not yet been replicated.

Common Misdiagnosis

| Looks Like | But Actually | How to Tell the Difference |
|---|---|---|
| Application bug (writes not persisted) | Replication lag data loss | Application logs show successful writes; missing from DB after failover |
| User error | Data loss at failover | Multiple users missing data from the same time window |
| Corruption | Lag-based data loss | Missing data is contiguous by time (not random); corresponds to lag window |
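The "contiguous by time" test can be sketched as a clean cut-off check. This assumes you have timestamps for writes the application logged as successful, split into those still present in the database after failover and those missing. The function name `is_clean_cutoff` is hypothetical, and clock skew between application servers may blur the boundary in practice.

```python
from datetime import datetime, timedelta


def is_clean_cutoff(surviving_times, missing_times):
    """True when every missing write is later than every surviving write.

    A single cut-off in time is the signature of lag-based loss; missing
    writes interleaved with surviving ones point to a different cause.
    """
    if not missing_times:
        return False  # nothing is missing, so there is no loss to classify
    if not surviving_times:
        return True
    return min(missing_times) > max(surviving_times)


base = datetime(2024, 1, 1, 12, 0, 0)  # illustrative timestamps
surviving = [base, base + timedelta(minutes=1)]
missing = [base + timedelta(minutes=2), base + timedelta(minutes=3)]
print(is_clean_cutoff(surviving, missing))
```

If the check fails because missing and surviving writes overlap in time, the application-bug or corruption rows of the table deserve a closer look.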

The Fix (Generic)

  1. Immediate: Check if the primary's WAL/binlog is still accessible; attempt point-in-time recovery from the primary's last position.
  2. Short-term: Before promoting a replica, always check replication lag; accept controlled data loss only with explicit human approval; document the data-loss window in the incident.
  3. Long-term: Use synchronous replication for critical data (in Postgres, set synchronous_standby_names; synchronous_commit=on by itself does not wait for a standby); implement PITR (FP-027); monitor replication lag continuously and alert when lag exceeds the acceptable RPO.
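The short-term and long-term steps above can be sketched as a promotion gate plus a lag alert. This is a minimal illustration, not any operator's actual API: `promotion_decision`, `should_alert`, and the `Decision` enum are hypothetical names, and lag is assumed to be available as a number of seconds (or `None` when it cannot be measured).

```python
from enum import Enum


class Decision(Enum):
    PROMOTE = "promote"
    NEEDS_APPROVAL = "needs_approval"


def promotion_decision(lag_seconds, rpo_seconds, human_approved=False):
    """Gate promotion on replication lag.

    Promote automatically only when the measured lag is within the RPO.
    Unknown lag (None) or lag beyond the RPO requires explicit approval.
    """
    if human_approved:
        return Decision.PROMOTE
    if lag_seconds is None:
        return Decision.NEEDS_APPROVAL
    if lag_seconds <= rpo_seconds:
        return Decision.PROMOTE
    return Decision.NEEDS_APPROVAL


def should_alert(lag_seconds, rpo_seconds):
    """Alert when lag is unknown or exceeds the acceptable RPO."""
    return lag_seconds is None or lag_seconds > rpo_seconds


print(promotion_decision(lag_seconds=300, rpo_seconds=5))
print(should_alert(lag_seconds=2, rpo_seconds=5))
```

Treating unknown lag the same as excessive lag is a deliberate choice in this sketch: a failover that cannot measure its own data-loss window should not proceed silently.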

Real-World Examples

  • Example 1: E-commerce primary DB crashed during peak. Auto-failover promoted a replica that was 4 minutes behind. 847 orders placed in those 4 minutes were lost (the orders existed as successful HTTP responses to users, but the DB records were gone). Business impact: manual reconciliation from payment processor records.
  • Example 2: MySQL replica was 30 seconds behind. Network partition caused primary to be fenced. Replica promoted. 30 seconds of user session writes gone. Users were logged out (session records written in the primary-only window were lost).

War Story

We were proud of our "less than 10-second failover." What we didn't advertise was the lag. The primary was getting 500 writes/second. When it crashed, the replica was 8 seconds behind — 4,000 writes. The failover was fast; the data loss was invisible. We found out 2 hours later when a user filed a support ticket: "I submitted my form, got a success message, and now it's gone." We checked the primary's WAL backup (we had that, thankfully) and did a surgical PITR restore of just those 4,000 transactions. Added synchronous replication for our critical order table the next week. Latency went up 5ms; we accepted it.
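The arithmetic in the story generalises to a rough exposure estimate: writes at risk are approximately the write rate multiplied by the lag. The sketch below assumes a steady write rate, which real traffic rarely has, and `writes_at_risk` is a hypothetical helper name.

```python
def writes_at_risk(writes_per_second, lag_seconds):
    """Approximate count of writes committed on the primary but not yet on the replica."""
    if writes_per_second < 0 or lag_seconds < 0:
        raise ValueError("rate and lag must be non-negative")
    return round(writes_per_second * lag_seconds)


# 500 writes/second with the replica 8 seconds behind.
print(writes_at_risk(500, 8))  # 4000
```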

Cross-References