Remediation: Database Replication Lag, Root Cause Is RAID Degradation¶
Immediate Fix (Datacenter Ops — Domain C)¶
The fix requires replacing the failed disk and rebuilding the RAID array.
Step 1: Order and install replacement disk¶
# Check the disk bay and model for replacement
$ lshw -class disk -short
H/W path Device Class Description
================================================
/0/100/1f.2/0 /dev/sda disk 1TB ST1000NM0055
/0/100/1f.2/1 /dev/sdb disk 1TB ST1000NM0055 (FAILED)
# Physical disk replacement (datacenter hands-on):
# 1. Identify the disk bay (usually slot 1 based on controller mapping)
# 2. Hot-swap the failed drive with a replacement ST1000NM0055 or compatible
# 3. Verify the new disk is detected
$ lsblk /dev/sdb
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sdb 8:16 0 931.5G 0 disk
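A subtle failure mode at this stage is a replacement drive that is a few sectors smaller than the surviving mirror, which mdadm will reject when the partition is added to the array. A minimal pre-check sketch; the `check_replacement_size` helper and the byte counts are illustrative, and on a live system the values would come from `blockdev --getsize64`:

```shell
# Reject a replacement disk smaller than the surviving mirror.
# Live values: blockdev --getsize64 /dev/sda  /  blockdev --getsize64 /dev/sdb
check_replacement_size() {
  good_bytes="$1"; new_bytes="$2"
  if [ "$new_bytes" -lt "$good_bytes" ]; then
    echo "REJECT: replacement ($new_bytes B) smaller than surviving mirror ($good_bytes B)"
    return 1
  fi
  echo "OK: replacement capacity sufficient"
}

check_replacement_size 1000204886016 1000204886016
```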
Step 2: Partition and add to RAID¶
# Copy partition layout from good disk
$ sfdisk -d /dev/sda | sfdisk /dev/sdb
# Add to RAID array
$ mdadm --manage /dev/md0 --add /dev/sdb3
mdadm: added /dev/sdb3
# Monitor rebuild
$ watch cat /proc/mdstat
md0 : active raid1 sda3[0] sdb3[1]
975730688 blocks super 1.2 [2/1] [U_]
[>....................] recovery = 0.3% (2932736/975730688) finish=142.0min speed=114189K/sec
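The recovery line can also be parsed for dashboards or status updates instead of being watched by hand. A sketch that extracts percent complete and the ETA from a captured mdstat line (the sample line is inlined here; on a live node the input would be `grep recovery /proc/mdstat`):

```shell
# Extract rebuild progress and ETA from an mdstat recovery line.
line='[>....................]  recovery =  0.3% (2932736/975730688) finish=142.0min speed=114189K/sec'
pct=$(printf '%s\n' "$line" | grep -o 'recovery = *[0-9.]*%' | grep -o '[0-9.]*')
eta=$(printf '%s\n' "$line" | grep -o 'finish=[0-9.]*min' | grep -o '[0-9.]*')
echo "rebuild ${pct}% complete, ~${eta} min remaining"
```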
Step 3: While rebuild is in progress, reduce load on replica¶
# Route read traffic away from the replica at the application or load-balancer
# layer (e.g. remove the replica from the read endpoint pool) so queries do not
# compete with rebuild I/O, and make sure the primary will accept the reads:
$ kubectl exec pg-primary-0 -n prod -- psql -U postgres -c \
"ALTER SYSTEM SET default_transaction_read_only = off;"
$ kubectl exec pg-primary-0 -n prod -- psql -U postgres -c \
"SELECT pg_reload_conf();"
# (Application falls back to the primary for reads during the maintenance window)
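Whether the replica belongs in the read pool can also be decided mechanically from its lag. A hedged sketch; the `route_reads` helper and the 5-second threshold are assumptions for illustration, and live lag would come from `extract(epoch from replay_lag)` in `pg_stat_replication`:

```shell
# Keep or drop the replica from the read pool based on lag in seconds.
route_reads() {
  lag="$1"; threshold=5   # threshold is an assumed SLO, tune per workload
  if awk -v l="$lag" -v t="$threshold" 'BEGIN { exit !(l > t) }'; then
    echo "drop replica from read pool (lag ${lag}s > ${threshold}s)"
  else
    echo "keep replica in read pool (lag ${lag}s)"
  fi
}

route_reads 142     # during the incident
route_reads 0.002   # after recovery
```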
Step 4: Wait for RAID rebuild to complete¶
$ cat /proc/mdstat
md0 : active raid1 sda3[0] sdb3[1]
975730688 blocks super 1.2 [2/2] [UU]
# [UU] — both disks are online
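Rather than polling mdstat by eye, completion can be gated in a script. A sync-check sketch (fed sample lines here; live usage would be something like `until raid_synced < /proc/mdstat; do sleep 60; done`):

```shell
# Succeed only when no md status field contains an underscore (a down mirror).
raid_synced() {
  ! grep -q '\[[U_]*_[U_]*\]'
}

printf '975730688 blocks super 1.2 [2/2] [UU]\n' | raid_synced && echo "synced"
printf '975730688 blocks super 1.2 [2/1] [U_]\n' | raid_synced || echo "still degraded"
```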
Step 5: Verify disk health¶
$ smartctl -t long /dev/sdb
# Wait for test to complete
$ smartctl -a /dev/sdb | grep -E "overall|Reallocated"
SMART overall-health self-assessment test result: PASSED
Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
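The two checks above can be combined into a single pass/fail gate for automation. A parsing sketch over captured smartctl output (sample lines are piped in here; on the host it would be `smartctl -a /dev/sdb | smart_gate`, and the `smart_gate` name is illustrative):

```shell
# Exit nonzero if health is not PASSED or any sectors have been reallocated.
smart_gate() {
  awk '
    /overall-health/ && !/PASSED/              { bad = 1 }
    /Reallocated_Sector_Ct/ && $NF + 0 > 0     { bad = 1 }
    END { exit bad }'
}

printf 'SMART overall-health self-assessment test result: PASSED\n' | smart_gate \
  && echo "disk healthy"
```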
Verification¶
Domain A (Kubernetes) — Replication lag recovered¶
$ kubectl exec pg-primary-0 -n prod -- psql -U postgres -c \
"SELECT client_addr, state, sent_lsn, replay_lsn,
extract(epoch from replay_lag) as lag_seconds
FROM pg_stat_replication;"
client_addr | state | sent_lsn | replay_lsn | lag_seconds
--------------+-----------+---------------+---------------+-------------
10.244.3.45 | streaming | 3B/12000060 | 3B/12000060 | 0.002
Replication lag: 2ms. Caught up.
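The same verification can be scripted by comparing LSNs directly instead of the lag column. PostgreSQL LSNs are `high/low` hex pairs; a comparison sketch (the helper names are illustrative, and live values would come from `sent_lsn` and `replay_lsn` in `pg_stat_replication`):

```shell
# Convert a PostgreSQL LSN ("hi/lo" hex) to a 64-bit integer.
lsn_to_int() {
  hi=${1%%/*}; lo=${1##*/}
  echo $(( 0x$hi * 4294967296 + 0x$lo ))
}

# Caught up when replay_lsn has reached sent_lsn (args: sent replay).
caught_up() {
  [ "$(lsn_to_int "$2")" -ge "$(lsn_to_int "$1")" ] && echo "caught up" || echo "lagging"
}

caught_up 3B/12000060 3B/12000060
caught_up 3B/12000060 3B/11FFFFF0
```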
Domain B (Linux Ops) — Disk I/O normal¶
Await time 2.1ms (was 142ms). Utilization 35% (was 98.7%). Disk I/O back to normal with both RAID mirrors active.
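To keep this from regressing silently, any device whose await exceeds a threshold can be flagged automatically. A sketch over a simplified two-column sample (real `iostat -x` output has more columns, so the field index would need adjusting; the 10 ms threshold is an assumption):

```shell
# Flag devices whose await (ms) exceeds a threshold; input is "device await".
check_await() {
  awk -v t=10 'NR > 1 && $2 + 0 > t { print $1 " await " $2 "ms exceeds " t "ms" }'
}

printf 'Device await\nsdb 2.1\nsda 142.0\n' | check_await
```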
Domain C (Datacenter Ops) — RAID healthy¶
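On the node itself, the source of truth for this check is `mdadm --detail /dev/md0`, whose `State :` line should read `clean` with no qualifiers. A parsing sketch over a captured line (sample inlined here; the `raid_state_ok` name is illustrative):

```shell
# Succeed when the reported array state is exactly "clean".
raid_state_ok() {
  grep -Eq '^[[:space:]]*State[[:space:]]*:[[:space:]]*clean[[:space:]]*$'
}

printf '       State : clean\n' | raid_state_ok && echo "array healthy"
```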
Prevention¶
- Monitoring: Add RAID health monitoring to all nodes. Alert immediately on array degradation.
  - alert: RAIDArrayDegraded
    expr: node_md_disks{state="failed"} > 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "RAID array degraded on {{ $labels.instance }} — {{ $labels.device }}"
- Runbook: Check SMART status of all disks weekly via cron. Replace disks proactively when reallocated sector count exceeds 100 or SMART status is not PASSED.
- Architecture: Use RAID-10 for database storage instead of RAID-1 to maintain performance during degradation. Consider moving to cloud block storage (EBS/GCE PD) where hardware failures are handled transparently. Add disk I/O latency to the database replication monitoring dashboard.
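The Runbook item above can be implemented as a weekly cron sweep. A sketch; the cron placement (e.g. a script under `/etc/cron.weekly/`) and the `needs_replacement` helper are assumptions, while the 100-sector threshold follows the runbook text. A sample smartctl line is piped in here rather than live output:

```shell
# Flag a disk for proactive replacement if health != PASSED or the raw
# reallocated-sector count exceeds 100. Intended cron body:
#   for dev in /dev/sd?; do smartctl -a "$dev" | needs_replacement && notify "$dev"; done
needs_replacement() {
  awk '
    /overall-health/ && !/PASSED/                { bad = 1 }
    /Reallocated_Sector_Ct/ && $NF + 0 > 100     { bad = 1 }
    END { exit !bad }'
}

printf 'Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 250\n' \
  | needs_replacement && echo "replace proactively"
```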