Remediation: Database Replication Lag, Root Cause Is RAID Degradation¶
Immediate Fix (Datacenter Ops — Domain C)¶
The fix requires replacing the failed disk and rebuilding the RAID array.
Step 1: Order and install replacement disk¶
# Check the disk bay and model for replacement
$ lshw -class disk -short
H/W path Device Class Description
================================================
/0/100/1f.2/0 /dev/sda disk 1TB ST1000NM0055
/0/100/1f.2/1 /dev/sdb disk 1TB ST1000NM0055 (FAILED)
# Physical disk replacement (datacenter hands-on):
# 1. Identify the disk bay (usually slot 1 based on controller mapping)
# 2. Hot-swap the failed drive with a replacement ST1000NM0055 or compatible
# 3. Verify the new disk is detected
$ lsblk /dev/sdb
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sdb 8:16 0 931.5G 0 disk
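A subtle failure mode at this stage is a replacement drive that is a few sectors smaller than the surviving mirror, which mdadm will reject when the partition is added to the array. A minimal pre-check sketch; the `check_replacement_size` helper and the byte counts are illustrative, and on a live system the values would come from `blockdev --getsize64`:

```shell
# Reject a replacement disk smaller than the surviving mirror.
# Live values: blockdev --getsize64 /dev/sda  /  blockdev --getsize64 /dev/sdb
check_replacement_size() {
  good_bytes="$1"; new_bytes="$2"
  if [ "$new_bytes" -lt "$good_bytes" ]; then
    echo "REJECT: replacement ($new_bytes B) smaller than surviving mirror ($good_bytes B)"
    return 1
  fi
  echo "OK: replacement capacity sufficient"
}

check_replacement_size 1000204886016 1000204886016
```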
Step 2: Partition and add to RAID¶
# Copy partition layout from good disk
$ sfdisk -d /dev/sda | sfdisk /dev/sdb
# Add to RAID array
$ mdadm --manage /dev/md0 --add /dev/sdb3
mdadm: added /dev/sdb3
# Monitor rebuild
$ watch cat /proc/mdstat
md0 : active raid1 sda3[0] sdb3[1]
975730688 blocks super 1.2 [2/1] [U_]
[>....................] recovery = 0.3% (2932736/975730688) finish=142.0min speed=114189K/sec
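The recovery line can also be parsed for dashboards or status updates instead of being watched by hand. A sketch that extracts percent complete and the ETA from a captured mdstat line (the sample line is inlined here; on a live node the input would be `grep recovery /proc/mdstat`):

```shell
# Extract rebuild progress and ETA from an mdstat recovery line.
line='[>....................]  recovery =  0.3% (2932736/975730688) finish=142.0min speed=114189K/sec'
pct=$(printf '%s\n' "$line" | grep -o 'recovery = *[0-9.]*%' | grep -o '[0-9.]*')
eta=$(printf '%s\n' "$line" | grep -o 'finish=[0-9.]*min' | grep -o '[0-9.]*')
echo "rebuild ${pct}% complete, ~${eta} min remaining"
```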
Step 3: While rebuild is in progress, reduce load on replica¶
# Route read traffic away from the replica at the application or load-balancer
# layer (e.g. remove the replica from the read endpoint pool) so queries do not
# compete with rebuild I/O, and make sure the primary will accept the reads:
$ kubectl exec pg-primary-0 -n prod -- psql -U postgres -c \
"ALTER SYSTEM SET default_transaction_read_only = off;"
$ kubectl exec pg-primary-0 -n prod -- psql -U postgres -c \
"SELECT pg_reload_conf();"
# (Application falls back to the primary for reads during the maintenance window)
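Whether the replica belongs in the read pool can also be decided mechanically from its lag. A hedged sketch; the `route_reads` helper and the 5-second threshold are assumptions for illustration, and live lag would come from `extract(epoch from replay_lag)` in `pg_stat_replication`:

```shell
# Keep or drop the replica from the read pool based on lag in seconds.
route_reads() {
  lag="$1"; threshold=5   # threshold is an assumed SLO, tune per workload
  if awk -v l="$lag" -v t="$threshold" 'BEGIN { exit !(l > t) }'; then
    echo "drop replica from read pool (lag ${lag}s > ${threshold}s)"
  else
    echo "keep replica in read pool (lag ${lag}s)"
  fi
}

route_reads 142     # during the incident
route_reads 0.002   # after recovery
```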
Step 4: Wait for RAID rebuild to complete¶
$ cat /proc/mdstat
md0 : active raid1 sda3[0] sdb3[1]
975730688 blocks super 1.2 [2/2] [UU]
# [UU] — both disks are online
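Rather than polling mdstat by eye, completion can be gated in a script. A sync-check sketch (fed sample lines here; live usage would be something like `until raid_synced < /proc/mdstat; do sleep 60; done`):

```shell
# Succeed only when no md status field contains an underscore (a down mirror).
raid_synced() {
  ! grep -q '\[[U_]*_[U_]*\]'
}

printf '975730688 blocks super 1.2 [2/2] [UU]\n' | raid_synced && echo "synced"
printf '975730688 blocks super 1.2 [2/1] [U_]\n' | raid_synced || echo "still degraded"
```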
Step 5: Verify disk health¶
$ smartctl -t long /dev/sdb
# Wait for test to complete
$ smartctl -a /dev/sdb | grep -E "overall|Reallocated"
SMART overall-health self-assessment test result: PASSED
Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
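The two checks above can be combined into a single pass/fail gate for automation. A parsing sketch over captured smartctl output (sample lines are piped in here; on the host it would be `smartctl -a /dev/sdb | smart_gate`, and the `smart_gate` name is illustrative):

```shell
# Exit nonzero if health is not PASSED or any sectors have been reallocated.
smart_gate() {
  awk '
    /overall-health/ && !/PASSED/              { bad = 1 }
    /Reallocated_Sector_Ct/ && $NF + 0 > 0     { bad = 1 }
    END { exit bad }'
}

printf 'SMART overall-health self-assessment test result: PASSED\n' | smart_gate \
  && echo "disk healthy"
```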
Verification¶
Domain A (Kubernetes) — Replication lag recovered¶
$ kubectl exec pg-primary-0 -n prod -- psql -U postgres -c \
"SELECT client_addr, state, sent_lsn, replay_lsn,
extract(epoch from replay_lag) as lag_seconds
FROM pg_stat_replication;"
client_addr | state | sent_lsn | replay_lsn | lag_seconds
--------------+-----------+---------------+---------------+-------------
10.244.3.45 | streaming | 3B/12000060 | 3B/12000060 | 0.002
Replication lag: 2ms. Caught up.
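The same verification can be scripted by comparing LSNs directly instead of the lag column. PostgreSQL LSNs are `high/low` hex pairs; a comparison sketch (the helper names are illustrative, and live values would come from `sent_lsn` and `replay_lsn` in `pg_stat_replication`):

```shell
# Convert a PostgreSQL LSN ("hi/lo" hex) to a 64-bit integer.
lsn_to_int() {
  hi=${1%%/*}; lo=${1##*/}
  echo $(( 0x$hi * 4294967296 + 0x$lo ))
}

# Caught up when replay_lsn has reached sent_lsn (args: sent replay).
caught_up() {
  [ "$(lsn_to_int "$2")" -ge "$(lsn_to_int "$1")" ] && echo "caught up" || echo "lagging"
}

caught_up 3B/12000060 3B/12000060
caught_up 3B/12000060 3B/11FFFFF0
```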
Domain B (Linux Ops) — Disk I/O normal¶
Await time 2.1ms (was 142ms). Utilization 35% (was 98.7%). Disk I/O back to normal with both RAID mirrors active.
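To keep this from regressing silently, any device whose await exceeds a threshold can be flagged automatically. A sketch over a simplified two-column sample (real `iostat -x` output has more columns, so the field index would need adjusting; the 10 ms threshold is an assumption):

```shell
# Flag devices whose await (ms) exceeds a threshold; input is "device await".
check_await() {
  awk -v t=10 'NR > 1 && $2 + 0 > t { print $1 " await " $2 "ms exceeds " t "ms" }'
}

printf 'Device await\nsdb 2.1\nsda 142.0\n' | check_await
```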
Domain C (Datacenter Ops) — RAID healthy¶
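On the node itself, the source of truth for this check is `mdadm --detail /dev/md0`, whose `State :` line should read `clean` with no qualifiers. A parsing sketch over a captured line (sample inlined here; the `raid_state_ok` name is illustrative):

```shell
# Succeed when the reported array state is exactly "clean".
raid_state_ok() {
  grep -Eq '^[[:space:]]*State[[:space:]]*:[[:space:]]*clean[[:space:]]*$'
}

printf '       State : clean\n' | raid_state_ok && echo "array healthy"
```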
Prevention¶
- Monitoring: Add RAID health monitoring to all nodes. Alert immediately on array degradation.
  - alert: RAIDArrayDegraded
    expr: node_md_disks{state="failed"} > 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "RAID array degraded on {{ $labels.instance }} — {{ $labels.device }}"
- Runbook: Check SMART status of all disks weekly via cron. Replace disks proactively when reallocated sector count exceeds 100 or SMART status is not PASSED.
- Architecture: Use RAID-10 for database storage instead of RAID-1 to maintain performance during degradation. Consider moving to cloud block storage (EBS/GCE PD) where hardware failures are handled transparently. Add disk I/O latency to the database replication monitoring dashboard.
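The Runbook item above can be implemented as a weekly cron sweep. A sketch; the cron placement (e.g. a script under `/etc/cron.weekly/`) and the `needs_replacement` helper are assumptions, while the 100-sector threshold follows the runbook text. A sample smartctl line is piped in here rather than live output:

```shell
# Flag a disk for proactive replacement if health != PASSED or the raw
# reallocated-sector count exceeds 100. Intended cron body:
#   for dev in /dev/sd?; do smartctl -a "$dev" | needs_replacement && notify "$dev"; done
needs_replacement() {
  awk '
    /overall-health/ && !/PASSED/                { bad = 1 }
    /Reallocated_Sector_Ct/ && $NF + 0 > 100     { bad = 1 }
    END { exit !bad }'
}

printf 'Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 250\n' \
  | needs_replacement && echo "replace proactively"
```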