Grading Checklist: RAID Degraded Rebuild Latency

A good response must include:

  • Checked /proc/mdstat to confirm array state, rebuild progress, and estimated completion time
  • Reviewed dmesg / smartctl for signs of additional drive degradation
  • Identified the rebuild speed limits (/proc/sys/dev/raid/speed_limit_min and speed_limit_max) and recommended tuning
  • Considered adjusting stripe_cache_size to balance rebuild speed vs. application I/O
  • Checked the I/O scheduler in use and evaluated whether a change would help
  • Assessed the risk of a second drive failure during rebuild (RAID-6 tolerates a second failure; RAID-5 does not)
  • Proposed a plan to reduce production I/O load during rebuild (read replicas, failover, traffic shifting)
  • Used smartctl to verify that the replacement drive is healthy
  • Mentioned monitoring for UREs (unrecoverable read errors) during rebuild
  • Documented a rollback or escalation plan if the rebuild fails or another drive drops
  • Considered whether ionice or cgroup I/O throttling could help prioritize application I/O
  • Communicated timeline and risk to stakeholders (DB team, application owners)
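
The first, second, and sixth checks above lend themselves to a small script. What follows is a minimal Python sketch, not a definitive tool: it parses text in the usual /proc/mdstat layout for array state, rebuild progress, and estimated completion time, and reads the two speed limit files named in the checklist. The sample text, the function names, the dictionary keys, and the failure-tolerance table are all illustrative assumptions, and the regular expressions cover only the common raidN line shapes.

```python
import re
from pathlib import Path

# Hypothetical sample in the usual /proc/mdstat layout (not from a real incident).
SAMPLE_MDSTAT = """\
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sdd1[4] sdc1[2] sdb1[1] sda1[0]
      2929890816 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UUU_]
      [=>...................]  recovery =  8.5% (83286016/976630272) finish=193.4min speed=76960K/sec

unused devices: <none>
"""

# "md0 : active raid5 ..." (an optional "(auto-read-only)" style note is skipped)
ARRAY_RE = re.compile(r"^(md\S+) : (\w+)(?: \([^)]*\))? (raid\d+)")
# "[4/3] [UUU_]" means 4 expected members, 3 active, last slot missing
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\] \[([U_]+)\]")
# "recovery =  8.5% (...) finish=193.4min speed=76960K/sec"
PROGRESS_RE = re.compile(
    r"(recovery|resync|reshape|check)\s*=\s*([\d.]+)%.*?"
    r"finish=([\d.]+)min\s+speed=(\d+)K/sec"
)

# Assumption: simplified table of how many missing members each level survives.
FAILURE_TOLERANCE = {"raid5": 1, "raid6": 2}


def parse_mdstat(text):
    """Return {array_name: info} for each raidN array found in mdstat text."""
    arrays = {}
    current = None
    for line in text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            current = {
                "state": header.group(2),
                "level": header.group(3),
                "degraded": False,
                "missing": 0,
                "rebuild": None,
            }
            arrays[header.group(1)] = current
            continue
        if current is None:
            continue
        status = STATUS_RE.search(line)
        if status:
            expected, active = int(status.group(1)), int(status.group(2))
            current["missing"] = expected - active
            current["degraded"] = active < expected or "_" in status.group(3)
        progress = PROGRESS_RE.search(line)
        if progress:
            current["rebuild"] = {
                "kind": progress.group(1),
                "percent": float(progress.group(2)),
                "finish_min": float(progress.group(3)),
                "speed_kps": int(progress.group(4)),
            }
    return arrays


def can_survive_another_failure(level, missing):
    """True/False for levels in the table above, None when the level is unknown."""
    tolerance = FAILURE_TOLERANCE.get(level)
    if tolerance is None:
        return None
    return missing < tolerance


def read_speed_limits(base="/proc/sys/dev/raid"):
    """Read speed_limit_min and speed_limit_max as integers."""
    limits = {}
    for name in ("speed_limit_min", "speed_limit_max"):
        limits[name] = int((Path(base) / name).read_text().strip())
    return limits


if __name__ == "__main__":
    for name, info in parse_mdstat(SAMPLE_MDSTAT).items():
        safe = can_survive_another_failure(info["level"], info["missing"])
        print(name, info, "survives another failure:", safe)
```

On a live host, the text would come from Path("/proc/mdstat").read_text() rather than the sample, and read_speed_limits() would supply the current limits to compare against the observed rebuild speed. The script only reads state; changing the limits or stripe_cache_size remains a deliberate manual step.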