Solution: RAID Degraded Rebuild Latency

Triage

  1. Confirm the array state immediately:
    cat /proc/mdstat
    mdadm --detail /dev/md0
    
  2. Check current rebuild progress and ETA. A RAID-6 rebuild on 2TB drives can take 12-24+ hours.
  3. Verify application impact with iostat -xz 1 5 -- look at await and %util columns.
  4. Check dmesg -T | grep -i "error\|fault\|reset" for additional disk errors on surviving drives.
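The progress line from step 2 can be parsed straight out of /proc/mdstat; a minimal sketch against a sample recovery line (the figures shown are hypothetical):

```shell
# Parse rebuild progress and ETA from an mdstat recovery line.
# The sample stands in for `grep recovery /proc/mdstat` on a live system.
sample='      [=====>...............]  recovery = 27.9% (546101248/1953383424) finish=214.0min speed=109564K/sec'
progress=$(printf '%s\n' "$sample" | sed -n 's/.*recovery = \([0-9.]*%\).*/\1/p')
eta=$(printf '%s\n' "$sample" | sed -n 's/.*finish=\([0-9.]*min\).*/\1/p')
echo "rebuild: $progress, ETA: $eta"   # rebuild: 27.9%, ETA: 214.0min
```

Polling this in a loop (or under watch) gives a cheap progress dashboard without re-running mdadm --detail.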

Root Cause

The RAID array is in a degraded-rebuild state. During rebuild, every read must reconstruct data from parity across all surviving drives, and the rebuild process itself generates heavy sequential I/O. This competes with production random I/O, causing latency spikes.

With default kernel settings, speed_limit_max is 200000 KB/s -- high enough for the rebuild to saturate the I/O path on older controllers -- while speed_limit_min (default 1000 KB/s) guarantees a floor of rebuild throughput even under production load.
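These two knobs bound the rebuild duration directly; a back-of-the-envelope estimate (drive size and throttle value are illustrative):

```shell
# Rebuild time ~= per-drive capacity / rebuild speed.
# Hypothetical figures: a 2 TB member rebuilt at a throttled 50000 KB/s.
capacity_kb=$((2 * 1000 * 1000 * 1000))   # 2 TB expressed in KB
speed_kb_s=50000
hours=$((capacity_kb / speed_kb_s / 3600))
echo "~${hours} hours"                    # ~11 hours
```

At the unthrottled 200000 KB/s ceiling the same drive would finish in under 3 hours -- in theory; in practice production I/O contention drives real rebuilds toward the 12-24+ hour range noted above.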

Fix

  1. Tune rebuild speed to reduce production impact:

    # Lower the max rebuild speed (default 200000 KB/s)
    echo 50000 > /proc/sys/dev/raid/speed_limit_max
    # This slows rebuild but frees I/O for production
    
    Trade-off: slower rebuild = longer window of vulnerability.

  2. Increase stripe cache if memory allows:

    echo 8192 > /sys/block/md0/md/stripe_cache_size
    
    This uses more RAM but makes rebuild I/O more efficient.

  3. Check and set I/O scheduler:

    cat /sys/block/sda/queue/scheduler
    # For RAID on spinning disks, deadline or mq-deadline is usually best
    echo mq-deadline > /sys/block/sda/queue/scheduler
    

  4. Reduce application load if possible:

    • Fail over reads to a replica if available.
    • Defer batch jobs, backups, or analytics queries.
    • Coordinate with the application team for a maintenance window if needed.

  5. Verify new drive health:

    smartctl -a /dev/sdd

    Confirm no reallocated sectors or pending errors on the replacement.

  6. Monitor surviving drives:

    for d in sd{a,b,c,e,f,g,h}; do smartctl -H /dev/$d; done

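The tuning steps can be bundled into one reviewable script; a dry-run sketch that only prints the commands (the array and member-disk names here are examples; substitute your own):

```shell
#!/bin/sh
# Dry run: emits the tuning commands without executing them.
# Review the output, then pipe it to `sudo sh` to apply.
emit_tuning() {
    md=$1 max_kbs=$2
    echo "echo $max_kbs > /proc/sys/dev/raid/speed_limit_max"
    echo "echo 8192 > /sys/block/$md/md/stripe_cache_size"
    for dev in sda sdb sdc sde sdf sdg sdh; do   # example member names
        echo "echo mq-deadline > /sys/block/$dev/queue/scheduler"
    done
}
emit_tuning md0 50000
```

Keeping the changes as a generated script also gives you a record of exactly what was tuned, which helps when reverting after the rebuild completes.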

Rollback / Safety

  • If the rebuild fails or another drive shows errors: a singly-degraded RAID-6 can still survive one more drive loss; a degraded RAID-5 cannot -- this distinction is critical.
  • If a second drive fails during RAID-6 rebuild, the array is still functional but now critically degraded. Stop all non-essential I/O and expedite rebuild.
  • Keep a full backup verified and accessible before making any tuning changes.
  • If rebuild must be restarted: mdadm --manage /dev/md0 --re-add /dev/sdd1.
  • Document the failed drive serial number and RMA it.
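For the RMA record, the serial number can be pulled from smartctl output; a sketch against a sample line (the serial shown is made up):

```shell
# Extract the drive serial for the RMA ticket. The sample stands in for
# `smartctl -i /dev/sdd` output on a live system; the serial is hypothetical.
sample='Serial Number:    WD-WCC4N1234567'
serial=$(printf '%s\n' "$sample" | awk -F': *' '/Serial Number/ {print $2}')
echo "RMA serial: $serial"   # RMA serial: WD-WCC4N1234567
```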

Common Traps

  • Trap: Cranking speed_limit_max to maximum to finish faster -- this can make the application completely unusable during rebuild.
  • Trap: Ignoring SMART warnings on other drives. After one drive fails in an array of the same age/batch, siblings are statistically more likely to fail soon.
  • Trap: Not monitoring for UREs. A single Unrecoverable Read Error on a surviving drive during a RAID-5 rebuild means data loss; RAID-6's second parity block can correct it and continue.
  • Trap: Forgetting to re-enable normal speed limits after rebuild completes.
  • Trap: Not verifying the rebuild actually completed -- check mdadm --detail for a "State : clean" or "State : active" line with no "degraded" or "recovering" flags.
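The last two traps can be caught with one post-rebuild check; a sketch driven by a sample State line (swap in live `mdadm --detail /dev/md0` output):

```shell
# Post-rebuild verification: the State line must carry no degraded/recovering
# flags. The sample stands in for `mdadm --detail /dev/md0 | grep 'State :'`.
state_line='             State : clean, degraded, recovering'
case "$state_line" in
    *degraded*|*recovering*)
        echo "rebuild NOT complete" ;;
    *)
        # Only now restore the default rebuild ceiling:
        # echo 200000 > /proc/sys/dev/raid/speed_limit_max
        echo "rebuild complete" ;;
esac
```

Tying the speed-limit reset to this check (rather than to memory) is what prevents the "forgot to re-enable" trap.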