Incident Replay: RAID Degraded — Rebuild Latency

Setup

  • System context: Production storage server with 12-drive RAID-6 array (PERC H740). One drive failed overnight and the array is rebuilding. Application I/O latency has tripled.
  • Time: Wednesday 07:30 UTC
  • Your role: On-call SRE / storage engineer

Round 1: Alert Fires

[Pressure cue: "Storage monitoring fires — RAID array degraded on stor-prod-04. Rebuild in progress at 12%. Application team reports 3x read latency. Morning traffic spike approaching."]

What you see: megacli -LDInfo -Lall -aALL shows VD0 in a Degraded state with one physical drive rebuilding. Application I/O and rebuild I/O are competing for the same spindles.

Choose your action:

  • A) Increase the rebuild rate to 100% to finish faster
  • B) Check the current rebuild priority and I/O scheduling settings
  • C) Pause the rebuild until after peak hours
  • D) Migrate workloads off this storage server immediately

[Result: megacli -AdpGetProp RbldRate -aALL shows rebuild rate is at 30% (default). The controller is trying to balance rebuild and I/O, but peak traffic makes both slow. Proceed to Round 2.]

If you chose A:

[Result: Setting rebuild to 100% makes application I/O nearly unusable. Latency goes from 3x to 10x. Users scream. You revert immediately.]

If you chose C:

[Result: Pausing the rebuild prolongs the window in which the array runs with reduced redundancy. RAID-6 can absorb a second drive failure, but a second failure during this degraded period would leave zero parity protection, and the odds of another failure grow the longer the rebuild is deferred. Risky with 12 drives.]

If you chose D:

[Result: Migrating workloads is safe but takes 30+ minutes. The rebuild benefits from less I/O, but you need to weigh the migration disruption.]

Round 2: First Triage Data

[Pressure cue: "Rebuild is at 15% after 2 hours. At this rate, it will take 12+ hours. Risk of a second drive failure increases with time."]
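The "12+ hours" figure in the cue is simple linear extrapolation; a minimal sketch (hypothetical helper, taking the cue's 15%-in-2-hours at face value):

```python
def rebuild_eta_hours(percent_done: float, hours_elapsed: float) -> float:
    """Linearly extrapolate the remaining rebuild time from progress so far."""
    if percent_done <= 0 or hours_elapsed <= 0:
        raise ValueError("need positive progress and elapsed time to extrapolate")
    rate = percent_done / hours_elapsed        # percent per hour
    return (100.0 - percent_done) / rate       # hours remaining

# 15% done after 2 hours -> 7.5 %/hr -> ~11.3 hours still to go
print(round(rebuild_eta_hours(15.0, 2.0), 1))  # 11.3
```

Roughly doubling the effective rebuild rate roughly halves the remaining time, which is what makes a compromise setting attractive in the next step.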

What you see: Rebuild rate is 30%. Application I/O is heavy (peak hours). The RAID controller's I/O policy for the VD is set to "Direct I/O" with caching disabled while the array is degraded (a safety measure).

Choose your action:

  • A) Set rebuild rate to 60% — a compromise between speed and I/O
  • B) Enable write-back caching during the degraded rebuild
  • C) Throttle application I/O using Linux I/O scheduling (ionice/cgroups)
  • D) Add the hot spare to accelerate the rebuild

[Result: megacli -AdpSetProp RbldRate 60 -aALL — rebuild rate increases. Application latency goes from 3x to 4x (acceptable) and rebuild ETA drops from 12 hours to 6 hours. A reasonable trade-off. Proceed to Round 3.]

If you chose B:

[Result: Enabling write-back caching on a degraded array is dangerous: if the BBU fails while the array is in this state, you risk losing cached writes. Not recommended.]

If you chose C:

[Result: ionice and cgroups shape application-level I/O, but the rebuild itself runs at the controller level, below the OS scheduler. Reducing contention helps somewhat, but the effect is limited.]

If you chose D:

[Result: The hot spare is already being used — it is the drive being rebuilt onto. There is no additional spare.]

Round 3: Root Cause Identification

[Pressure cue: "Rebuild at 50%. What caused the original drive failure?"]

What you see: Root cause: The failed drive had 4.2 years of power-on time. SMART data (captured before failure) showed increasing reallocated sector count over the past month. The predictive failure alert was configured but the threshold was too high — it did not fire until the drive actually failed.

Choose your action:

  • A) Lower the SMART predictive failure alert threshold
  • B) Implement a proactive drive replacement policy at 4 years
  • C) Order spare drives and check SMART status on all drives in the array
  • D) All of the above

[Result: Lower alert threshold catches failing drives earlier. Age-based replacement prevents surprise failures. Immediate fleet audit identifies 3 more drives approaching failure. Proceed to Round 4.]

If you chose A:

[Result: Earlier alerts help but drives can fail without SMART warnings too.]

If you chose B:

[Result: Age-based replacement is good but some drives last 7+ years while others fail at 2.]

If you chose C:

[Result: Good immediate action but needs the policy changes to be systematic.]
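The fleet audit is easy to script once periodic SMART samples exist; a minimal sketch that flags drives whose reallocated sector count is trending upward (the sample data and the threshold of 10 are illustrative assumptions, not vendor guidance):

```python
def flag_failing_drives(samples, min_increase=10):
    """samples: {drive: [reallocated sector counts, oldest..newest]}.
    Flag drives whose count grew by min_increase or more over the window."""
    flagged = []
    for drive, counts in samples.items():
        if len(counts) >= 2 and counts[-1] - counts[0] >= min_increase:
            flagged.append(drive)
    return flagged

fleet = {
    "sda": [0, 0, 0, 0],     # stable: healthy
    "sdb": [4, 12, 37, 88],  # climbing: replace proactively
    "sdc": [2, 2, 3, 2],     # noise: below threshold
}
print(flag_failing_drives(fleet))  # ['sdb']
```

Trend-based flagging like this complements (rather than replaces) the age-based policy: it catches the drives that degrade early, while the age policy catches the ones that fail without SMART warning signs.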

Round 4: Remediation

[Pressure cue: "Rebuild complete. Verify and close."]

Actions:

  1. Verify the RAID array is Optimal: megacli -LDInfo -Lall -aALL
  2. Verify I/O latency has returned to baseline
  3. Reset the rebuild rate to the default: megacli -AdpSetProp RbldRate 30 -aALL
  4. Order replacement drives for the 3 aging drives identified in the audit
  5. Update SMART alert thresholds in monitoring
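Step 1 above can be folded into a post-incident check; a minimal sketch that scans saved megacli -LDInfo output for each virtual drive's State line (the output layout shown is approximate and varies by controller firmware, so treat the parsing as an assumption):

```python
import re

def vd_states(ldinfo_output: str) -> list:
    """Extract the 'State' field of each virtual drive from megacli -LDInfo text."""
    return re.findall(r"^State\s*:\s*(\S+)", ldinfo_output, flags=re.MULTILINE)

sample = """\
Virtual Drive: 0 (Target Id: 0)
RAID Level          : Primary-6, Secondary-0
State               : Optimal
Number Of Drives    : 12
"""
assert all(state == "Optimal" for state in vd_states(sample))
print(vd_states(sample))  # ['Optimal']
```

Running this against a capture taken after the rebuild gives a recorded artifact for the incident ticket, rather than a one-off eyeball check.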

Damage Report

  • Total downtime: 0 (degraded but operational)
  • Blast radius: Application read latency 3-4x higher for 6 hours during rebuild
  • Optimal resolution time: 6 hours (rebuild with tuned rate)
  • If every wrong choice was made: 12+ hours of degraded performance, and a widening window in which further drive failures could erase the array's remaining redundancy

Cross-References