
Symptoms: Database Replication Lag, Root Cause Is RAID Degradation

Domains: kubernetes_ops | linux_ops | datacenter_ops
Level: L3 | Estimated time: 45 min

Initial Alert

PostgreSQL monitoring fires at 04:15 UTC:

CRITICAL: pg_replication_lag > 60s
  primary: pg-primary-0 (prod namespace)
  replica: pg-replica-0 (prod namespace)
  lag: 127 seconds (threshold: 60s)
  trend: increasing

Follow-up alerts:

WARNING: pg-replica-0 — WAL receiver falling behind (replay_lag: 247MB)
WARNING: Read query latency on replica > 2000ms (baseline: 15ms)
CRITICAL: pg_replication_lag > 300s — replica is 312 seconds behind primary
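The replay_lag figure in the WAL receiver alert comes from comparing write-ahead-log positions (LSNs). PostgreSQL reports an LSN as two hex words, `high/low`, encoding a 64-bit byte position in the WAL stream; lag in bytes is simply the difference between the sent and replayed positions. A minimal sketch, with illustrative LSN values (not taken from this incident) chosen to reproduce the ~247 MB figure:

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL LSN like '2A/16B3740' to a 64-bit byte offset.

    The word before '/' is the upper 32 bits; the word after is the lower 32.
    """
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def replay_lag_mb(sent_lsn: str, replay_lsn: str) -> float:
    """MiB of WAL the replica has received but not yet replayed."""
    return (lsn_to_bytes(sent_lsn) - lsn_to_bytes(replay_lsn)) / (1024 * 1024)

# Illustrative positions: primary has sent up to 2A/20000000,
# the replica has only replayed through 2A/10928C00.
print(f"{replay_lag_mb('2A/20000000', '2A/10928C00'):.0f} MiB behind")  # → 247 MiB behind
```

In a live cluster the two positions would come from `sent_lsn` and `replay_lsn` in `pg_stat_replication` on the primary; the arithmetic above is all the monitoring check is doing.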

Observable Symptoms

  • PostgreSQL streaming replication lag is growing steadily — 127s at first alert, 312s ten minutes later.
  • The replica's WAL receiver is running and streaming WAL from the primary, but replay cannot keep up with the rate at which the primary generates WAL.
  • Write load on the primary is normal (2,400 TPS, consistent with past week).
  • Read queries routed to the replica are returning stale data (312+ seconds old).
  • pg_stat_replication on the primary shows the replica is connected and receiving WAL, but replay is slow.
  • Both primary and replica pods are Running and passing health checks.
  • No database schema changes, parameter changes, or deployments in the past 48 hours.
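The stale-reads symptom can be measured directly on the replica: `now() - pg_last_xact_replay_timestamp()` gives the age of the newest replayed transaction. A minimal sketch of the routing decision an application-side health check might make with that number; the function names are hypothetical, and the 60-second threshold mirrors the alert above:

```python
from datetime import datetime, timedelta, timezone

# Mirrors the pg_replication_lag alert threshold (60 s).
STALENESS_THRESHOLD = timedelta(seconds=60)

def replica_staleness(last_replay: datetime, now: datetime) -> timedelta:
    """Age of the newest transaction the replica has replayed.

    In production last_replay would come from
    SELECT pg_last_xact_replay_timestamp() on the replica.
    """
    return now - last_replay

def route_reads_to_primary(last_replay: datetime, now: datetime) -> bool:
    """Fail reads over to the primary once replica data is too stale."""
    return replica_staleness(last_replay, now) > STALENESS_THRESHOLD

# Illustrative timestamps matching the 312 s lag in the second CRITICAL alert.
now = datetime(2024, 1, 1, 4, 25, tzinfo=timezone.utc)
last_replay = now - timedelta(seconds=312)
print(route_reads_to_primary(last_replay, now))  # → True
```

A check like this is why the read-latency and staleness alerts fire together: queries either wait on a replica that is busy replaying, or get rerouted and pile onto the primary.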

The Misleading Signal

Growing replication lag with a healthy primary looks like a database performance issue on the replica — maybe a long-running query holding locks, a vacuum blocking replay, or insufficient replica resources. Engineers start investigating PostgreSQL internals: pg_stat_activity, pg_locks, recovery parameters, and WAL replay configuration. The database-level investigation finds no obvious blockers.
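A first pass over that checklist is usually scripted. The sketch below filters rows shaped like `pg_stat_activity` output for long-running active queries that could be holding back WAL replay; the row dictionaries and the 5-minute cutoff are illustrative, not data from this incident:

```python
from datetime import datetime, timedelta, timezone

# Arbitrary cutoff for this sketch; tune to the workload.
LONG_QUERY_THRESHOLD = timedelta(minutes=5)

def long_running_queries(rows, now):
    """Flag active backends whose current query has exceeded the threshold.

    Each row mimics a few columns of pg_stat_activity:
    pid, state, query_start, query.
    """
    return [
        r for r in rows
        if r["state"] == "active"
        and r["query_start"] is not None
        and now - r["query_start"] > LONG_QUERY_THRESHOLD
    ]

now = datetime(2024, 1, 1, 4, 30, tzinfo=timezone.utc)
rows = [
    {"pid": 101, "state": "active",
     "query_start": now - timedelta(minutes=42), "query": "SELECT * FROM reports"},
    {"pid": 102, "state": "idle",
     "query_start": now - timedelta(hours=2), "query": "COMMIT"},
    {"pid": 103, "state": "active",
     "query_start": now - timedelta(seconds=3), "query": "SELECT 1"},
]
print([r["pid"] for r in long_running_queries(rows, now)])  # → [101]
```

In this scenario, checks like this come back clean on the replica, which is exactly the cue to stop looking inside PostgreSQL and start looking at the layer beneath it.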