# Solution: RAID Degraded Rebuild Latency

## Triage
- Confirm the array state immediately with `cat /proc/mdstat` and `mdadm --detail`.
- Check current rebuild progress and ETA. A RAID-6 rebuild on 2TB drives can take 12-24+ hours.
- Verify application impact with `iostat -xz 1 5` -- look at the `await` and `%util` columns.
- Check `dmesg -T | grep -i "error\|fault\|reset"` for additional disk errors on surviving drives.
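A minimal triage sequence, assuming the array is `/dev/md0` (adjust device names to your system):

```shell
# Overall array state: look for a failed slot (e.g. [U_U]) and a "recovery =" progress line
cat /proc/mdstat

# Detailed state: "State : ... degraded, recovering", Rebuild Status, failed/spare devices
mdadm --detail /dev/md0

# Per-device latency and utilization: watch the await and %util columns
iostat -xz 1 5

# Any new errors on the surviving drives?
dmesg -T | grep -i "error\|fault\|reset"
```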
## Root Cause
The RAID array is in a degraded-rebuild state. During rebuild, every read must reconstruct data from parity across all surviving drives, and the rebuild process itself generates heavy sequential I/O. This competes with production random I/O, causing latency spikes.
With default kernel settings, `dev.raid.speed_limit_max` is often set to 200000 KB/s, which can saturate the I/O bus on older controllers. Meanwhile `dev.raid.speed_limit_min` guarantees a floor of rebuild throughput even under load.
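To see the current throttle values on a given host (these are the standard Linux md sysctls):

```shell
# Rebuild throttle, in KB/s per device
sysctl dev.raid.speed_limit_min dev.raid.speed_limit_max

# Or read the procfs files directly
cat /proc/sys/dev/raid/speed_limit_min /proc/sys/dev/raid/speed_limit_max
```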
## Fix
- Tune rebuild speed to reduce production impact.
  Trade-off: a slower rebuild means a longer window of vulnerability.
- Increase the stripe cache if memory allows.
  This uses more RAM but makes rebuild I/O more efficient.
- Check and set the I/O scheduler.
- Reduce application load if possible:
  - Fail over reads to a replica if available.
  - Defer batch jobs, backups, and analytics queries.
- Coordinate with the application team for a maintenance window if needed.
- Verify new drive health: confirm no reallocated sectors or pending errors on the replacement.
- Monitor surviving drives for SMART warnings while the rebuild runs.
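The steps above can be sketched as follows. The device names (`md0`, `sdb`, `sdd`), the 50000/10000 KB/s limits, and the 8192-page stripe cache are illustrative assumptions -- tune them for your hardware:

```shell
# 1. Throttle the rebuild so production I/O wins (values in KB/s; assumed starting points)
sysctl -w dev.raid.speed_limit_max=50000
sysctl -w dev.raid.speed_limit_min=10000

# 2. Larger stripe cache for md0 (RAM cost: pages x 4 KiB x number of member disks)
echo 8192 > /sys/block/md0/md/stripe_cache_size

# 3. I/O scheduler on each member disk; a deadline-style scheduler often behaves
#    better than CFQ/BFQ under mixed rebuild + random production load
cat /sys/block/sdb/queue/scheduler        # brackets mark the active scheduler
echo mq-deadline > /sys/block/sdb/queue/scheduler

# (Step 4, reducing application load, is operational rather than a command.)

# 5. Replacement drive health: no reallocated or pending sectors
smartctl -a /dev/sdd | grep -Ei 'reallocated|pending|uncorrect'

# 6. Quick health pass over the surviving members
for d in /dev/sda /dev/sdb /dev/sdc; do smartctl -H "$d"; done
```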
## Rollback / Safety
- If the rebuild fails or another drive shows errors, RAID-6 can survive one more drive loss. RAID-5 cannot -- this distinction is critical.
- If a second drive fails during a RAID-6 rebuild, the array is still functional but now critically degraded. Stop all non-essential I/O and expedite the rebuild.
- Keep a full backup verified and accessible before making any tuning changes.
- If the rebuild must be restarted: `mdadm --manage /dev/md0 --re-add /dev/sdd1`
- Document the failed drive's serial number and RMA it.
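A hedged sketch of the restart-and-verify sequence. The member partition `/dev/sdd1` is an assumption, and `--re-add` only succeeds when the kernel still recognizes the member (typically with a write-intent bitmap); otherwise fall back to `--add`:

```shell
# Re-attach the member and let recovery resume
mdadm --manage /dev/md0 --re-add /dev/sdd1   # or --add if re-add is refused

# Confirm the recovery line reappears in mdstat and the percentage advances
watch -n 10 cat /proc/mdstat
```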
## Common Traps
- Trap: Cranking `speed_limit_max` to the maximum to finish faster -- this can make the application completely unusable during rebuild.
- Trap: Ignoring SMART warnings on other drives. After one drive fails in an array of the same age/batch, siblings are statistically more likely to fail soon.
- Trap: Not monitoring for UREs. A single Unrecoverable Read Error on a surviving drive during RAID-5 rebuild = data loss. RAID-6 provides one more layer of protection.
- Trap: Forgetting to re-enable normal speed limits after rebuild completes.
- Trap: Not verifying the rebuild actually completed -- check `mdadm --detail` for "State : clean" or "active" (not "degraded" or "recovering").
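A post-rebuild checklist, assuming `/dev/md0` and stock kernel defaults for the throttles (min 1000, max 200000 KB/s):

```shell
# Confirm the rebuild actually finished: no "degraded"/"recovering" in State,
# and no lingering "Rebuild Status" line
mdadm --detail /dev/md0 | grep -E 'State :|Rebuild Status'

# Restore the default rebuild throttles if they were changed during the incident
sysctl -w dev.raid.speed_limit_min=1000
sysctl -w dev.raid.speed_limit_max=200000
```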