Incident Replay: RAID Degraded — Rebuild Latency¶
Setup¶
- System context: Production storage server with 12-drive RAID-6 array (PERC H740). One drive failed overnight and the array is rebuilding. Application I/O latency has tripled.
- Time: Wednesday 07:30 UTC
- Your role: On-call SRE / storage engineer
Round 1: Alert Fires¶
[Pressure cue: "Storage monitoring fires — RAID array degraded on stor-prod-04. Rebuild in progress at 12%. Application team reports 3x read latency. Morning traffic spike approaching."]
What you see:
megacli -LDInfo -Lall -aALL shows VD0 degraded, with one drive in a Rebuild state. The controller's rebuild rate setting is 30% (a priority knob, distinct from the 12% completion figure in the alert). Application I/O and rebuild I/O are competing for the same spindles.
Choose your action:

- A) Increase the rebuild rate to 100% to finish faster
- B) Check the current rebuild priority and I/O scheduling settings
- C) Pause the rebuild until after peak hours
- D) Migrate workloads off this storage server immediately
If you chose B (recommended):¶
[Result: megacli -AdpGetProp RebuildRate -aALL shows the rebuild rate is 30% (the default). The controller is trying to balance rebuild and application I/O, but peak traffic makes both slow. Proceed to Round 2.]
If you chose A:¶
[Result: Setting rebuild to 100% makes application I/O nearly unusable. Latency goes from 3x to 10x. Users scream. You revert immediately.]
If you chose C:¶
[Result: Pausing the rebuild leaves the array exposed — while degraded, RAID-6 is down to a single drive of redundancy, and two more failures before the rebuild completes would lose the array. With 12 aging spindles, extending that window is a real risk.]
If you chose D:¶
[Result: Migrating workloads is safe but takes 30+ minutes. The rebuild benefits from less I/O, but you need to weigh the migration disruption.]
Round 2: First Triage Data¶
[Pressure cue: "Rebuild is at 15% after 2 hours. At this rate, it will take 12+ hours. Risk of a second drive failure increases with time."]
What you see: Rebuild rate is 30%. Application I/O is heavy (peak hours). The RAID controller's I/O policy for the VD is set to "Direct I/O" with no caching during degraded state (safety measure).
Choose your action:

- A) Set rebuild rate to 60% — a compromise between speed and I/O
- B) Enable write-back caching during the degraded rebuild
- C) Throttle application I/O using Linux I/O scheduling (ionice/cgroups)
- D) Add the hot spare to accelerate the rebuild
If you chose A (recommended):¶
[Result: megacli -AdpSetProp RebuildRate 60 -aALL raises the rebuild rate. Application latency goes from 3x to 4x (acceptable) and the rebuild ETA drops from roughly 12 hours to 6 hours. A reasonable trade-off. Proceed to Round 3.]
If you chose B:¶
[Result: Enabling write-back cache during a degraded array is dangerous. If the BBU fails during this state, you risk data loss. Not recommended.]
If you chose C:¶
[Result: ionice and cgroups shape only the I/O the OS submits; the rebuild runs inside the RAID controller, below the OS scheduler, so the effect on rebuild contention is limited.]
If you chose D:¶
[Result: The hot spare is already being used — it is the drive being rebuilt onto. There is no additional spare.]
Round 3: Root Cause Identification¶
[Pressure cue: "Rebuild at 50%. What caused the original drive failure?"]
What you see: The failed drive had 4.2 years of power-on time. SMART data captured before the failure showed a rising reallocated-sector count over the past month. A predictive-failure alert was configured, but its threshold was too high — it did not fire until the drive actually failed.
Choose your action:

- A) Lower the SMART predictive failure alert threshold
- B) Implement a proactive drive replacement policy at 4 years
- C) Order spare drives and check SMART status on all drives in the array
- D) All of the above
If you chose D (recommended):¶
[Result: Lower alert threshold catches failing drives earlier. Age-based replacement prevents surprise failures. Immediate fleet audit identifies 3 more drives approaching failure. Proceed to Round 4.]
If you chose A:¶
[Result: Earlier alerts help but drives can fail without SMART warnings too.]
If you chose B:¶
[Result: Age-based replacement is good but some drives last 7+ years while others fail at 2.]
If you chose C:¶
[Result: A good immediate action, but it needs the policy changes in A and B to be systematic rather than a one-off audit.]
Round 4: Remediation¶
[Pressure cue: "Rebuild complete. Verify and close."]
Actions:
1. Verify RAID array is Optimal: megacli -LDInfo -Lall -aALL
2. Verify I/O latency returned to baseline
3. Reset rebuild rate to default: megacli -AdpSetProp RebuildRate 30 -aALL
4. Order replacement drives for the 3 aging drives identified in audit
5. Update SMART alert thresholds in monitoring
Damage Report¶
- Total downtime: 0 (degraded but operational)
- Blast radius: Application read latency 3-4x higher for 6 hours during rebuild
- Optimal resolution time: 6 hours (rebuild with tuned rate)
- If every wrong choice was made: 12+ hours of degraded performance and an extended window at reduced redundancy, risking array loss if further drives failed before rebuild completion
Cross-References¶
- Primer: Storage Ops
- Primer: Datacenter & Server Hardware
- Deep Dive: RAID and Storage Internals
- Footguns: Storage Ops