Symptoms: Database Replication Lag, Root Cause Is RAID Degradation

Domains: kubernetes_ops | linux_ops | datacenter_ops
Level: L3
Estimated time: 45 min
Initial Alert
PostgreSQL monitoring fires at 04:15 UTC:
CRITICAL: pg_replication_lag > 60s
primary: pg-primary-0 (prod namespace)
replica: pg-replica-0 (prod namespace)
lag: 127 seconds (threshold: 60s)
trend: increasing
Follow-up alerts:
WARNING: pg-replica-0 — WAL receiver falling behind (replay_lag: 247MB)
WARNING: Read query latency on replica > 2000ms (baseline: 15ms)
CRITICAL: pg_replication_lag > 300s — replica is 312 seconds behind primary
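Byte-based figures such as replay_lag: 247MB come from LSN arithmetic: PostgreSQL exposes write-ahead-log positions as pg_lsn values of the form "16/B374D848", and the lag is the byte difference between the sent and replayed positions. A minimal sketch of that computation (the sample LSN values are invented for illustration; on a live system you would ask the server directly with pg_wal_lsn_diff() in SQL):

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a pg_lsn like '16/B374D848' to an absolute byte position.

    The part before the slash is the high 32 bits, the part after is the
    low 32 bits, both hexadecimal.
    """
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def lag_bytes(sent_lsn: str, replay_lsn: str) -> int:
    """Same result as SQL pg_wal_lsn_diff(sent_lsn, replay_lsn)."""
    return lsn_to_bytes(sent_lsn) - lsn_to_bytes(replay_lsn)

# Hypothetical positions from pg_stat_replication on the primary:
print(lag_bytes("16/B374D848", "16/A4C3B120") / 1024**2)  # ~235 MiB behind
```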
Observable Symptoms
- PostgreSQL streaming replication lag is growing steadily — 127s at first alert, 312s ten minutes later.
- The replica's WAL receiver process is running but cannot keep up with the primary's WAL generation.
- Write load on the primary is normal (2,400 TPS, consistent with past week).
- Read queries routed to the replica are returning stale data (312+ seconds old).
- pg_stat_replication on the primary shows the replica is connected and receiving WAL, but replay is slow.
- Both primary and replica pods are Running and passing health checks.
- No database schema changes, parameter changes, or deployments in the past 48 hours.
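The lag trend above already quantifies how far behind replay is falling. Under the simplifying assumption that the primary generates WAL at a steady rate, a lag that grows from 127 s to 312 s over ten minutes means the replica replayed only part of the WAL produced in that window. A small sketch of the arithmetic (the function name and structure are illustrative, not from any tooling):

```python
def replay_ratio(lag_start_s: float, lag_end_s: float, window_s: float) -> float:
    # While window_s seconds of WAL were produced, the replica fell a further
    # (lag_end_s - lag_start_s) seconds behind, so it replayed only the rest.
    replayed_s = window_s - (lag_end_s - lag_start_s)
    return replayed_s / window_s

# 127 s lag at first alert, 312 s lag ten minutes (600 s) later:
print(round(replay_ratio(127, 312, 600), 2))  # 0.69
```

Replaying at roughly 69% of the WAL generation rate means the lag will keep growing without bound until replay throughput is restored.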
The Misleading Signal
Growing replication lag with a healthy primary looks like a database performance issue on the replica — maybe a long-running query holding locks, a vacuum blocking replay, or insufficient replica resources. Engineers start investigating PostgreSQL internals: pg_stat_activity, pg_locks, recovery parameters, and WAL replay configuration. The database-level investigation finds no obvious blockers.
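The replica-side check for a long-running statement holding up replay is typically a filter over pg_stat_activity. The SQL below is that standard filter; the Python helper applies the same predicate offline to rows already fetched as dicts, and the five-minute threshold is an arbitrary illustration, not a recommended value:

```python
from datetime import datetime, timedelta, timezone

# Query engineers commonly run on the replica while hunting replay blockers
# (column names match pg_stat_activity; the interval is an assumption):
LONG_QUERY_SQL = """
SELECT pid, state, now() - query_start AS runtime, query
FROM pg_stat_activity
WHERE state <> 'idle' AND now() - query_start > interval '5 minutes';
"""

def long_running(rows, now, threshold=timedelta(minutes=5)):
    """Apply the same filter to pre-fetched rows: non-idle sessions whose
    current statement has been running longer than the threshold."""
    return [r for r in rows
            if r["state"] != "idle" and now - r["query_start"] > threshold]

now = datetime(2024, 1, 1, 4, 30, tzinfo=timezone.utc)
rows = [
    {"pid": 101, "state": "active", "query_start": now - timedelta(minutes=9)},
    {"pid": 102, "state": "idle",   "query_start": now - timedelta(hours=2)},
]
print([r["pid"] for r in long_running(rows, now)])  # [101]
```

In this scenario the check comes back empty, which is exactly why the database-level investigation stalls: nothing inside PostgreSQL explains the slow replay.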