Pattern: RAID Rebuild I/O Saturation¶
ID: FP-008 | Family: Resource Exhaustion | Frequency: Common | Blast Radius: Single Service to Multi-Service | Detection Difficulty: Moderate
The Shape¶
After a disk failure, RAID array reconstruction reads every block from surviving disks and writes them to the replacement. This rebuild process competes directly with production I/O on the same disks. On a heavily loaded storage system, the rebuild can saturate I/O bandwidth, causing application latency to spike 5–10x for hours to days. Notably, during rebuild the array is also vulnerable: a second disk failure causes complete data loss.
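The exposure window falls out of simple arithmetic on the mdstat numbers: remaining blocks divided by rebuild speed. A back-of-envelope sketch (the per-device size, progress, and speed are illustrative values matching the example output below):

```shell
#!/bin/sh
# Back-of-envelope rebuild time: remaining blocks / rebuild speed.
# Values are illustrative; read the real ones from /proc/mdstat.
DEVICE_KB=488978844      # per-device size in 1K blocks (mdstat's recovery denominator)
DONE_KB=133442560        # blocks already rebuilt
SPEED_KBS=41287          # current rebuild speed in KB/s

awk -v total="$DEVICE_KB" -v done="$DONE_KB" -v speed="$SPEED_KBS" 'BEGIN {
    remaining = total - done
    minutes = remaining / speed / 60
    printf "remaining: %.1f GiB, finish in %.1f min\n", remaining / 1048576, minutes
}'
# prints: remaining: 339.1 GiB, finish in 143.5 min
```

Note the dependency: halving the rebuild speed doubles the window during which a second disk failure loses the array.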
How You'll See It¶
In Linux/Infrastructure¶
```
$ cat /proc/mdstat
md0 : active raid5 sdc[4] sda[0] sdb[1] sdd[3]
      1466936532 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UUU_]
      [=====>...............]  recovery = 27.3% (133442560/488978844) finish=143.5min speed=41287K/sec
```
`iostat -x 1` shows all member disks at 100% utilization during the rebuild. Application queries that normally complete in 2ms now take 200ms. The latency spike starts exactly when the rebuild begins, not when the disk failed.
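A minimal detector for this state can be sketched by parsing mdstat-format output. `report_rebuild` is a hypothetical helper, not a robust parser (mdstat's layout varies somewhat across kernel versions):

```shell
#!/bin/sh
# Rough rebuild detector: scan mdstat-format output and report any
# resync/recovery in flight with its progress and speed.
report_rebuild() {
    awk '
        /^md/ { array = $1 }                 # md device owning the lines below
        /recovery|resync/ && /%/ {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /%$/)       pct = $i
                if ($i ~ /^speed=/)  spd = substr($i, 7)
            }
            printf "%s: rebuild at %s, %s\n", array, pct, spd
        }
    ' "$1"
}

# Usage: report_rebuild /proc/mdstat
if [ -r /proc/mdstat ]; then report_rebuild /proc/mdstat; fi
```

On the example array above this prints `md0: rebuild at 27.3%, 41287K/sec`; silence means no rebuild is running.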
In Kubernetes¶
StatefulSets backed by local persistent volumes (using the node's RAID array) show dramatically increased write/read latency after a disk failure. PVs on the affected node report high latency; pods may be evicted if readiness probes time out.
In Datacenter¶
RAID rebuild on a storage array serving multiple VMs causes all VMs to experience I/O latency simultaneously. The symptom looks like a network issue to the operators monitoring the VMs (all VMs slow at once) when the root cause is the shared storage layer rebuilding a degraded disk.
The Tell¶
`/proc/mdstat` or RAID controller status shows "recovery" or "rebuild" in progress. All disks in the array show high I/O utilization simultaneously. The application latency spike begins at the exact time the rebuild started (after disk replacement or hot-spare activation), not when the disk failed.
Common Misdiagnosis¶
| Looks Like | But Actually | How to Tell the Difference |
|---|---|---|
| Network congestion | Storage I/O saturation | iostat shows disk at 100%; network metrics normal |
| Application slowdown | Shared storage rebuild | All apps on same storage slow simultaneously |
| Disk failure (original) | Rebuild of replacement | Original failure was earlier; latency spike is new, coincides with rebuild start |
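To settle the "network vs storage" row quickly, one rough approach is to scan captured `iostat -x` output for saturated devices. `flag_saturated` is a hypothetical helper, and it assumes `%util` is the last column, which holds for recent sysstat versions but is worth checking on yours:

```shell
#!/bin/sh
# Triage helper: flags block devices whose %util is at or above a
# threshold in captured `iostat -x` output. Column order varies across
# sysstat versions, so %util is taken as the last field of each row.
flag_saturated() {
    limit="$1"; file="$2"
    awk -v limit="$limit" '
        $1 ~ /^(sd|vd|nvme|md)/ && $NF + 0 >= limit {
            printf "%s %.1f%%\n", $1, $NF
        }
    ' "$file"
}

# Usage: iostat -x 1 3 > /tmp/iostat.cap; flag_saturated 95 /tmp/iostat.cap
```

If every member of one array shows up here while network metrics stay flat, you are looking at the storage layer, not the network.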
The Fix (Generic)¶
- Immediate: Throttle the rebuild to leave I/O headroom for production: `echo 10000 > /proc/sys/dev/raid/speed_limit_max` (the value is in KB/s, so this caps the rebuild at roughly 10MB/s). This extends rebuild time but keeps applications responsive.
- Short-term: Schedule disk replacements during low-traffic windows; throttle rebuild speed per array via `/sys/block/mdX/md/sync_speed_max` or globally via the `dev.raid.speed_limit_max` sysctl.
- Long-term: Use RAID 10 instead of RAID 5 (faster rebuild, no parity calculation overhead); provision dedicated rebuild I/O bandwidth; monitor `/proc/mdstat` and alert on the degraded state before the rebuild begins.
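The long-term monitoring item can be sketched as a cron-able check. `check_degraded` is a hypothetical name, and it keys on the `_` that marks a down member in mdstat's `[UUU_]` status string:

```shell
#!/bin/sh
# Degraded-array check for cron/monitoring: prints any md array whose
# status string (e.g. [UUU_]) shows a down member, and exits non-zero.
check_degraded() {
    awk '
        /^md/ { array = $1 }
        /\[[U_]+\][ \t]*$/ && /_/ { print array " degraded"; bad = 1 }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}

# Usage from cron (the alert command is illustrative):
#   check_degraded /proc/mdstat || alert "md array degraded"
```

Catching the degraded state early lets you pick the rebuild window instead of having the hot-spare pick it for you.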
Real-World Examples¶
- Example 1: Production database RAID 5 array started rebuilding after disk replacement at 9am (peak traffic). Application query latency spiked from 5ms to 50ms for 6 hours. The rebuild finished at 3pm; latency returned to normal.
- Example 2: Hypervisor SAN serving 40 VMs had a disk failure. Rebuild started automatically. All 40 VMs experienced I/O stalls simultaneously, causing a wave of incidents across unrelated services.
War Story¶
We replaced the failed disk and congratulated ourselves on quick response time. Then the monitoring went crazy: every service on that storage tier spiked latency. We thought we'd somehow broken the array during replacement.
`cat /proc/mdstat` showed the rebuild at 27% and 41MB/s, and `iostat` showed all three surviving disks pegged at 100%. The replacement disk was fine; the rebuild was killing us. We throttled to 10MB/s: latency dropped from 200ms to 8ms immediately, and the rebuild would finish in 8 hours instead of 2.5, but services were usable again.
Cross-References¶
- Topic Packs: disk-and-storage-ops, datacenter
- Case Studies: datacenter_ops/raid-degraded-rebuild-latency/
- Footguns: disk-and-storage-ops/footguns.md — "RAID rebuild during peak hours"
- Related Patterns: FP-003 (disk full — another storage resource limit), FP-022 (dependency chain collapse — RAID serves many services)