Diagnostic Questions¶

Before revealing the investigation path:¶

PostgreSQL replication lag is increasing steadily. The primary has normal write load and the replica has no blocking queries. What other factors besides database configuration could cause WAL replay to fall behind?
iostat on the replica's node shows 98.7% disk utilization with 142ms await time. The pod's CPU and memory are fine. How does disk I/O bottleneck cause replication lag?
/proc/mdstat shows [U_] for the RAID-1 array. What does this mean, and how does losing one mirror in RAID-1 affect I/O performance?
The failed disk has 2847 reallocated sectors and SMART status FAILED. The failure occurred 12 hours ago but replication lag only appeared during morning peak. Why was there a delay between disk failure and visible database impact?
The fix is a physical disk replacement (datacenter ops). What monitoring would catch this failure before it impacts database performance? Why should RAID health be a first-class alert, not just a weekly check?