
Symptoms: Database Replication Lag, Root Cause Is RAID Degradation

Domains: kubernetes_ops | linux_ops | datacenter_ops
Level: L3 | Estimated time: 45 min

Initial Alert

PostgreSQL monitoring fires at 04:15 UTC:

CRITICAL: pg_replication_lag > 60s
  primary: pg-primary-0 (prod namespace)
  replica: pg-replica-0 (prod namespace)
  lag: 127 seconds (threshold: 60s)
  trend: increasing

Follow-up alerts:

WARNING: pg-replica-0 — WAL receiver falling behind (replay_lag: 247MB)
WARNING: Read query latency on replica > 2000ms (baseline: 15ms)
CRITICAL: pg_replication_lag > 300s — replica is 312 seconds behind primary
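The replay_lag figure in the WAL receiver alert comes from comparing write-ahead-log positions (LSNs). PostgreSQL reports an LSN as two hex words, `high/low`, encoding a 64-bit byte position in the WAL stream; lag in bytes is simply the difference between the sent and replayed positions. A minimal sketch, with illustrative LSN values (not taken from this incident) chosen to reproduce the ~247 MB figure:

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL LSN like '2A/16B3740' to a 64-bit byte offset.

    The word before '/' is the upper 32 bits; the word after is the lower 32.
    """
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def replay_lag_mb(sent_lsn: str, replay_lsn: str) -> float:
    """MiB of WAL the replica has received but not yet replayed."""
    return (lsn_to_bytes(sent_lsn) - lsn_to_bytes(replay_lsn)) / (1024 * 1024)

# Illustrative positions: primary has sent up to 2A/20000000,
# the replica has only replayed through 2A/10928C00.
print(f"{replay_lag_mb('2A/20000000', '2A/10928C00'):.0f} MiB behind")  # → 247 MiB behind
```

In a live cluster the two positions would come from `sent_lsn` and `replay_lsn` in `pg_stat_replication` on the primary; the arithmetic above is all the monitoring check is doing.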

Observable Symptoms

  • PostgreSQL streaming replication lag is growing steadily — 127s at first alert, 312s ten minutes later.
  • The replica's WAL receiver is running and streaming WAL from the primary, but replay cannot keep up with the rate at which the primary generates WAL.
  • Write load on the primary is normal (2,400 TPS, consistent with past week).
  • Read queries routed to the replica are returning stale data (312+ seconds old).
  • pg_stat_replication on the primary shows the replica is connected and receiving WAL, but replay is slow.
  • Both primary and replica pods are Running and passing health checks.
  • No database schema changes, parameter changes, or deployments in the past 48 hours.
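The stale-reads symptom can be measured directly on the replica: `now() - pg_last_xact_replay_timestamp()` gives the age of the newest replayed transaction. A minimal sketch of the routing decision an application-side health check might make with that number; the function names are hypothetical, and the 60-second threshold mirrors the alert above:

```python
from datetime import datetime, timedelta, timezone

# Mirrors the pg_replication_lag alert threshold (60 s).
STALENESS_THRESHOLD = timedelta(seconds=60)

def replica_staleness(last_replay: datetime, now: datetime) -> timedelta:
    """Age of the newest transaction the replica has replayed.

    In production last_replay would come from
    SELECT pg_last_xact_replay_timestamp() on the replica.
    """
    return now - last_replay

def route_reads_to_primary(last_replay: datetime, now: datetime) -> bool:
    """Fail reads over to the primary once replica data is too stale."""
    return replica_staleness(last_replay, now) > STALENESS_THRESHOLD

# Illustrative timestamps matching the 312 s lag in the second CRITICAL alert.
now = datetime(2024, 1, 1, 4, 25, tzinfo=timezone.utc)
last_replay = now - timedelta(seconds=312)
print(route_reads_to_primary(last_replay, now))  # → True
```

A check like this is why the read-latency and staleness alerts fire together: queries either wait on a replica that is busy replaying, or get rerouted and pile onto the primary.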

The Misleading Signal

Growing replication lag with a healthy primary looks like a database performance issue on the replica — maybe a long-running query holding locks, a vacuum blocking replay, or insufficient replica resources. Engineers start investigating PostgreSQL internals: pg_stat_activity, pg_locks, recovery parameters, and WAL replay configuration. The database-level investigation finds no obvious blockers.
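A first pass over that checklist is usually scripted. The sketch below filters rows shaped like `pg_stat_activity` output for long-running active queries that could be holding back WAL replay; the row dictionaries and the 5-minute cutoff are illustrative, not data from this incident:

```python
from datetime import datetime, timedelta, timezone

# Arbitrary cutoff for this sketch; tune to the workload.
LONG_QUERY_THRESHOLD = timedelta(minutes=5)

def long_running_queries(rows, now):
    """Flag active backends whose current query has exceeded the threshold.

    Each row mimics a few columns of pg_stat_activity:
    pid, state, query_start, query.
    """
    return [
        r for r in rows
        if r["state"] == "active"
        and r["query_start"] is not None
        and now - r["query_start"] > LONG_QUERY_THRESHOLD
    ]

now = datetime(2024, 1, 1, 4, 30, tzinfo=timezone.utc)
rows = [
    {"pid": 101, "state": "active",
     "query_start": now - timedelta(minutes=42), "query": "SELECT * FROM reports"},
    {"pid": 102, "state": "idle",
     "query_start": now - timedelta(hours=2), "query": "COMMIT"},
    {"pid": 103, "state": "active",
     "query_start": now - timedelta(seconds=3), "query": "SELECT 1"},
]
print([r["pid"] for r in long_running_queries(rows, now)])  # → [101]
```

In this scenario, checks like this come back clean on the replica, which is exactly the cue to stop looking inside PostgreSQL and start looking at the layer beneath it.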