Investigation: Database Replication Lag, Root Cause Is RAID Degradation

Phase 1: Kubernetes Investigation (Dead End)

Check the replica pod:

$ kubectl exec pg-replica-0 -n prod -- psql -U postgres -c "SELECT pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn(), pg_last_xact_replay_timestamp();"
 pg_last_wal_receive_lsn | pg_last_wal_replay_lsn | pg_last_xact_replay_timestamp
-------------------------+------------------------+-------------------------------
 3A/8F000060             | 3A/7C000098            | 2026-03-19 04:10:03.182+00
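
The size of the gap can be computed from the two LSNs: a PostgreSQL LSN such as 3A/8F000060 is a 64-bit byte position written as two hex words. A minimal sketch using the values above (server-side, `pg_wal_lsn_diff()` does the same calculation):

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL LSN like '3A/8F000060' into an absolute byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

# Values from pg_last_wal_receive_lsn() / pg_last_wal_replay_lsn() above
lag_bytes = lsn_to_bytes("3A/8F000060") - lsn_to_bytes("3A/7C000098")
print(f"replay lag: {lag_bytes / 2**20:.0f} MiB")  # replay lag: 304 MiB
```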

The replica is receiving WAL but replay is roughly 300 MB behind (receive LSN 3A/8F000060 vs. replay LSN 3A/7C000098). Check for blocking queries:

$ kubectl exec pg-replica-0 -n prod -- psql -U postgres -c \
    "SELECT pid, state, query_start, query FROM pg_stat_activity WHERE state != 'idle' AND backend_type = 'client backend';"
 pid | state  | query_start | query
-----+--------+-------------+-------
(0 rows)

No active queries blocking replay. Check recovery settings:

$ kubectl exec pg-replica-0 -n prod -- psql -U postgres -c "SHOW hot_standby_feedback;"
 hot_standby_feedback
----------------------
 on

$ kubectl exec pg-replica-0 -n prod -- psql -U postgres -c "SHOW max_standby_streaming_delay;"
 max_standby_streaming_delay
-----------------------------
 30s

Settings are reasonable. No query conflicts. Check the pod's resource usage:

$ kubectl top pod pg-replica-0 -n prod
NAME           CPU(cores)   MEMORY(bytes)
pg-replica-0   850m         3842Mi

# CPU and memory look fine. But check disk I/O from the pod:
$ kubectl exec pg-replica-0 -n prod -- iostat -x 1 3
Device  r/s     w/s     rkB/s    wkB/s   await   %util
sda     245.0   128.0   24576.0  8192.0  142.3   98.7

Average I/O wait (await) is 142 ms and utilization is 98.7%: the disk is saturated. The replica cannot replay WAL fast enough because every write queues behind slow I/O.
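
When many devices are in play, the await/%util pair can be checked mechanically. A minimal sketch that parses an `iostat -x` data line like the one above (the 90% / 50 ms thresholds are illustrative assumptions, not iostat defaults):

```python
# Columns as printed by `iostat -x` above: Device r/s w/s rkB/s wkB/s await %util
SAMPLE = "sda     245.0   128.0   24576.0  8192.0  142.3   98.7"

def is_saturated(line: str, util_pct: float = 90.0, await_ms: float = 50.0) -> bool:
    """Flag a device whose utilization and average wait both exceed the thresholds."""
    fields = line.split()
    await_, util = float(fields[5]), float(fields[6])
    return util > util_pct and await_ > await_ms

print(is_saturated(SAMPLE))  # True: 98.7% util, 142.3 ms await
```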

The Pivot

Why is the disk so slow? Check the underlying storage:

$ kubectl get pvc -n prod -l app=postgresql-replica
NAME                     STATUS   VOLUME     CAPACITY   ACCESS MODES   STORAGECLASS
data-pg-replica-0        Bound    pv-xyz789  100Gi      RWO            local-storage

$ kubectl get pv pv-xyz789 -o jsonpath='{.spec.local.path}'
/mnt/data/pg-replica

# Check which node the replica pod is on
$ kubectl get pod pg-replica-0 -n prod -o jsonpath='{.spec.nodeName}'
worker-node-09

The replica uses local storage on worker-node-09. SSH to the node:

$ ssh worker-node-09
$ iostat -x 1 3 | grep sd
sda     245.0  128.0  24576.0  8192.0  142.3  98.7
sdb     0.0    0.0    0.0      0.0     0.0    0.0
sdc     0.0    0.0    0.0      0.0     0.0    0.0
sdd     412.0  0.0    52736.0  0.0     0.3    12.1

sda (where the PG data lives) is pegged at 98.7% utilization with 142ms await. Other disks are fine. Check what sda is:

$ lsblk /dev/sda
NAME    MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINTS
sda       8:0    0 931.5G  0 disk
├─sda1    8:1    0   512M  0 part  /boot/efi
├─sda2    8:2    0     1G  0 part  /boot
└─sda3    8:3    0   930G  0 part
  └─md0   9:0    0 929.9G  0 raid1 /mnt/data

/mnt/data is on an md0 RAID-1 array. Check the RAID status:

$ cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sda3[0] sdb3[2](F)
      975730688 blocks super 1.2 [2/1] [U_]

unused devices: <none>

[U_] means only the first member of the RAID-1 mirror is up: sdb3 is marked (F), failed. The array is running degraded on a single disk.
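
The [2/1] [U_] pair encodes "configured members / active members", so a degraded-array check can be scripted by parsing that bracket. A minimal sketch against the mdstat output above:

```python
import re

MDSTAT = """\
Personalities : [raid1]
md0 : active raid1 sda3[0] sdb3[2](F)
      975730688 blocks super 1.2 [2/1] [U_]
"""

def degraded_arrays(mdstat: str) -> list:
    """Return md device names whose active member count is below the configured count."""
    found, current = [], None
    for line in mdstat.splitlines():
        m = re.match(r"^(md\d+)\s*:", line)
        if m:
            current = m.group(1)
        m = re.search(r"\[(\d+)/(\d+)\]", line)  # [configured/active]
        if m and current and int(m.group(2)) < int(m.group(1)):
            found.append(current)
    return found

print(degraded_arrays(MDSTAT))  # ['md0']
```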

Phase 2: Linux Ops Investigation (Root Cause)

Check the failed disk:

$ smartctl -a /dev/sdb
=== START OF INFORMATION SECTION ===
Device Model:     ST1000NM0055-1V410C
Serial Number:    ZBS1K2V4
Firmware Version: SN04

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED
  ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE     UPDATED  WHEN_FAILED  RAW_VALUE
  Reallocated_Sector_Ct   0x0033   001   001   036    Old_age  Always   FAILING_NOW  2847
  Current_Pending_Sector  0x0012   100   100   000    Old_age  Always   -            14
  Offline_Uncorrectable   0x0010   100   100   000    Old_age  Offline  -            3
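
FAILING_NOW is reported when a normalized attribute value drops to or below its threshold, here 001 against a threshold of 036. A minimal sketch of that comparison over the attribute rows above:

```python
# SMART attribute rows from the smartctl output above:
# (name, normalized value, worst, threshold, raw value)
ATTRS = [
    ("Reallocated_Sector_Ct", 1, 1, 36, 2847),
    ("Current_Pending_Sector", 100, 100, 0, 14),
    ("Offline_Uncorrectable", 100, 100, 0, 3),
]

def failing_now(attrs):
    """An attribute is failing when its normalized value has fallen to/below a nonzero threshold."""
    return [name for name, value, worst, thresh, raw in attrs
            if thresh > 0 and value <= thresh]

print(failing_now(ATTRS))  # ['Reallocated_Sector_Ct']
```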

$ dmesg | grep sdb | tail -10
[48291.442] sd 1:0:0:0: [sdb] tag#42 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=30s
[48291.443] sd 1:0:0:0: [sdb] tag#42 Sense Key : Medium Error [current]
[48291.444] blk_update_request: I/O error, dev sdb, sector 482910234 op 0x0 (READ) flags 0x0
[48291.445] md/raid1:md0: read error corrected (8 sectors at 482910234 on sdb3)
[48291.890] md/raid1:md0: Disk failure on sdb3, disabling device.

The disk has 2847 reallocated sectors, its SMART overall health is FAILED, and the RAID array kicked it out after medium errors. With only one disk left in the mirror, the surviving drive must absorb the full read load that was previously split across two spindles while still taking every write, so I/O queues deepen under load.

The dmesg timestamps put the failure roughly 12 hours earlier, but replication lag only became visible now because write load increased during the morning batch-processing window. The lone surviving disk cannot sustain WAL-replay writes and the read queries simultaneously.
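
Rough arithmetic from the iostat numbers above shows why the surviving disk drowns: in a healthy RAID-1 pair, reads are balanced across both spindles while writes hit each member, so losing the mirror dumps the other half of the read stream onto one disk. A sketch (the even read split is a simplifying assumption):

```python
reads_per_s, writes_per_s = 245.0, 128.0  # r/s and w/s from the iostat sample above

# Healthy RAID-1: reads balanced across both members, writes duplicated to each.
healthy_per_disk = reads_per_s / 2 + writes_per_s   # 250.5 IOPS per disk
# Degraded to one member: the surviving disk serves the entire stream.
degraded_per_disk = reads_per_s + writes_per_s      # 373.0 IOPS

increase = degraded_per_disk / healthy_per_disk
print(f"{increase:.2f}x per-disk load after losing the mirror")  # 1.49x
```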

Domain Bridge: Why This Crossed Domains

Key insight: the symptom was PostgreSQL replication lag visible in Kubernetes (kubernetes_ops); the root cause was disk I/O degradation from a RAID-1 member failure (linux_ops); and the fix requires replacing the failed disk (datacenter_ops). This pattern is common because database performance depends on disk I/O, and disk I/O depends on RAID health. A degraded array is invisible to Kubernetes and PostgreSQL alike: the filesystem still works, just more slowly, and the slowdown only manifests under load.

Root Cause

A hard drive in the RAID-1 array on worker-node-09 failed due to accumulated bad sectors. The RAID array degraded to a single disk, halving I/O throughput and eliminating read parallelism. The PostgreSQL replica, using local storage on this array, could no longer keep up with WAL replay during peak write load, causing steadily increasing replication lag.