Pattern: RAID Rebuild I/O Saturation

ID: FP-008 Family: Resource Exhaustion Frequency: Common Blast Radius: Single Service to Multi-Service Detection Difficulty: Moderate

The Shape

After a disk failure, RAID array reconstruction reads every block from the surviving disks and writes the reconstructed data to the replacement. The rebuild competes directly with production I/O on the same disks. On a heavily loaded storage system it can saturate I/O bandwidth, pushing application latency up 5–10x for hours to days. The array is also vulnerable for the entire rebuild window: in a single-parity layout such as RAID 5, a second disk failure before the rebuild completes means complete data loss.

How You'll See It

In Linux/Infrastructure

$ cat /proc/mdstat
md0 : active raid5 sda[0] sdb[1] sdc[2](S) sdd[3]
      2929936384 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UUU_]
      [=====>...............]  recovery = 27.3% (133442560/488978844) finish=143.5min speed=41287K/sec

iostat -x 1 shows all member disks at 100% utilization during the rebuild. Application queries that normally complete in 2ms now take 200ms. The latency spike starts exactly when the rebuild begins, not when the disk failed.

In Kubernetes

StatefulSets backed by local persistent volumes (using the node's RAID array) show dramatically increased write/read latency after a disk failure. PVs on the affected node report high latency; pods may be evicted if readiness probes time out.

In Datacenter

RAID rebuild on a storage array serving multiple VMs causes all VMs to experience I/O latency simultaneously. The symptom looks like a network issue to the operators monitoring the VMs (all VMs slow at once) when the root cause is the shared storage layer rebuilding a degraded disk.

The Tell

/proc/mdstat or RAID controller status shows "recovery" or "rebuild" in progress. All disks in the array show high I/O utilization simultaneously. Application latency spike begins at the exact time the rebuild started (after disk replacement or array hot-spare activation), not when the disk failed.

Common Misdiagnosis

Looks Like                But Actually              How to Tell the Difference
Network congestion        Storage I/O saturation    iostat shows disks at 100%; network metrics normal
Application slowdown      Shared storage rebuild    All apps on the same storage slow simultaneously
Disk failure (original)   Rebuild of replacement    Original failure was earlier; latency spike is new and coincides with rebuild start

The Fix (Generic)

  1. Immediate: Throttle the rebuild to leave I/O headroom for production: echo 10000 > /proc/sys/dev/raid/speed_limit_max (caps the rebuild at ~10 MB/s). This extends rebuild time but keeps applications responsive.
  2. Short-term: Schedule disk replacements during low-traffic windows; throttle rebuild speed with the dev.raid.speed_limit_min/speed_limit_max sysctls, or per-array via /sys/block/mdX/md/sync_speed_max. A write-intent bitmap (mdadm --grow --bitmap=internal) does not throttle a full rebuild, but it makes the resync after an unclean shutdown near-instant.
  3. Long-term: Prefer RAID 10 over RAID 5 (faster rebuild with no parity reconstruction: only the failed disk's mirror partner is read, not every member); provision dedicated rebuild I/O bandwidth; monitor /proc/mdstat and alert on degraded state so the rebuild can be scheduled into a low-traffic window.

Real-World Examples

  • Example 1: Production database RAID 5 array started rebuilding after disk replacement at 9am (peak traffic). Application query latency spiked from 5ms to 50ms for 6 hours. The rebuild finished at 3pm; latency returned to normal.
  • Example 2: Hypervisor SAN serving 40 VMs had a disk failure. Rebuild started automatically. All 40 VMs experienced I/O stalls simultaneously, causing a wave of incidents across unrelated services.

War Story

We replaced the failed disk and congratulated ourselves on quick response time. Then the monitoring went crazy: every service on that storage tier spiked latency. We thought we'd somehow broken the array during replacement. cat /proc/mdstat showed 27% rebuild at 41MB/s — and iostat showed all three surviving disks pegged at 100%. The replacement disk was fine; the rebuild was killing us. We throttled to 10MB/s: latency dropped from 200ms to 8ms immediately, rebuild would finish in 8 hours instead of 2.5, but services were usable again.

Cross-References