
Portal | Level: L1: Foundations | Topics: RAID, Server Hardware | Domain: Datacenter & Hardware

Scenario: RAID 5 Array Degraded with Predictive Failure

Situation

At 09:47 AM, Nagios fired a CRITICAL alert for host app-node-07: "RAID array /c0/v0 is DEGRADED -- 1 drive predictive failure." The server is a Dell R640 running a RAID 5 array across six 1.2TB 10K SAS drives on a PERC H740P controller. This machine runs a stateful application tier with local storage for session data. The application is still serving traffic, but there is no redundancy left if another disk fails.

What You Know

  • RAID 5 with 6 drives, one showing predictive failure (SMART threshold exceeded)
  • No hot spare is configured in this array (cost-cutting decision from the original build)
  • The server has been in production for 3 years with the original drives
  • Last full backup was 6 hours ago; application data changes frequently
  • A spare drive of the same model is available in the datacenter parts cage

Investigation Steps

1. Identify the Degraded Array and Failed Disk

Command(s):

# Check overall controller and virtual drive status (Dell PERC / LSI MegaRAID)
sudo storcli /c0/vall show

# Get detailed info on all physical drives -- look for "Predictive Failure"
sudo storcli /c0/eall/sall show all | grep -E "Drive|State|Media Error|Predictive|Firmware state|Slot"

# Alternative with older megacli
sudo megacli -LDInfo -Lall -aALL
sudo megacli -PDList -aALL | grep -E "Slot|Firmware state|Predictive|Media Error|Drive has"

# Check Linux kernel messages for I/O errors
sudo dmesg | grep -iE "error|fault|reset|abort" | tail -30

What to look for: The virtual drive (/c0/v0) should report a Dgrd (degraded) state, and one physical drive should show a failed/offline firmware state or the predictive-failure flag set. Note its enclosure ID and slot number (e.g., /c0/e252/s3). Check Media Error Count and Other Error Count -- if they are climbing between checks, the drive is actively failing. Kernel messages with sd device errors or SCSI resets confirm that I/O problems are reaching the OS layer.
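On a controller with many drives, eyeballing this output is error-prone. A minimal sketch of scripting the check -- the storcli invocation in the comment is real, but the sample text below is a hypothetical, trimmed excerpt of its output so the parsing can be shown without hardware:

```shell
#!/bin/sh
# Sketch: surface any physical drive reporting a nonzero Media Error Count.
# In production, pipe the real output in instead of the sample:
#   sudo storcli /c0/eall/sall show all | awk '...'
# "sample" below is a hypothetical, trimmed excerpt for illustration.
sample='Drive /c0/e252/s3 Device attributes :
Media Error Count = 12
Drive /c0/e252/s4 Device attributes :
Media Error Count = 0'

flagged=$(printf '%s\n' "$sample" | awk '
  /^Drive \// { drive = $2 }              # remember which drive we are under
  /Media Error Count/ && $NF + 0 > 0 {    # nonzero error count -> report it
    print drive ": media errors = " $NF
  }')
echo "$flagged"
```

Wired into Nagios or cron, a nonzero result here catches a deteriorating drive before the controller declares a predictive failure.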

2. Verify SMART Data and Assess Drive Health

Command(s):

# Map the failing drive to the Device Id (DID) that smartctl's megaraid,N addressing uses
sudo storcli /c0/e252/s3 show all | grep -i "device id"

# Pull SMART data through the controller using smartctl's megaraid passthrough device type
sudo smartctl -a -d megaraid,3 /dev/sda

# Check specifically for reallocated sectors and pending sectors
sudo smartctl -A -d megaraid,3 /dev/sda | grep -E "Reallocated|Pending|Uncorrect"

What to look for: A Reallocated_Sector_Ct raw value climbing toward or past its threshold is the classic predictive-failure trigger. Current_Pending_Sector above zero means sectors are waiting to be remapped on the next write. Offline_Uncorrectable above zero means data on those sectors has already been lost. A high and still-rising Reallocated_Sector_Ct means the drive surface is deteriorating.
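These attribute checks script the same way. A sketch that flags any nonzero raw counter -- the sample lines are a hypothetical excerpt of a `smartctl -A` attribute table, standing in for the real command's output:

```shell
#!/bin/sh
# Sketch: flag surface-health SMART attributes with nonzero raw values.
# "sample" is a hypothetical excerpt; in production pipe in the output of
#   sudo smartctl -A -d megaraid,3 /dev/sda | grep -E "Realloc|Pending|Uncorrect"
sample='  5 Reallocated_Sector_Ct   0x0033   081   081   036    Pre-fail  Always       -       1523
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0'

# Column 2 is the attribute name; the last column is the raw value.
bad=$(printf '%s\n' "$sample" | awk '$NF + 0 > 0 { print $2 "=" $NF }')
echo "$bad"
```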

3. Check Rebuild Risk and Estimate URE Probability

Command(s):

# Get the drive model and capacity -- the URE (Unrecoverable Read Error)
# spec comes from the vendor datasheet for that model
sudo smartctl -i -d megaraid,3 /dev/sda | grep -i "rotation\|model\|capacity"

# Check current array consistency state
sudo storcli /c0/v0 show cc

# Verify the replacement drive is compatible
sudo storcli /c0/e252/s7 show all  # assuming spare is in slot 7

What to look for: Enterprise SAS drives typically spec URE at 1 in 10^16 bits read; consumer SATA drives at 1 in 10^14 (100x worse). Rebuilding this array means reading all five surviving 1.2TB drives in full, about 6TB (~4.8 x 10^13 bits). At a URE rate of 10^14 that works out to roughly a 38% chance of hitting at least one unrecoverable read error during the rebuild; at 10^16 it is under 1%. A URE on any surviving drive during a RAID 5 rebuild kills the entire array, which is why RAID 5 with large (or consumer-grade) drives is dangerous.
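The rebuild risk can be made concrete with a quick calculation. A worked sketch using the drive count and size from this scenario and the typical vendor URE specs (treat the outputs as order-of-magnitude estimates, not guarantees):

```shell
#!/bin/sh
# Worked example: P(at least one URE) while rebuilding this array.
# A RAID 5 rebuild reads the 5 surviving 1.2TB drives in full.
bits=$(awk 'BEGIN { printf "%.0f", 5 * 1.2e12 * 8 }')   # ~4.8e13 bits read

# Model UREs as independent per-bit events: P ~= 1 - exp(-bits_read / bits_per_URE)
p_sas=$(awk -v b="$bits"  'BEGIN { printf "%.4f", 1 - exp(-b / 1e16) }')
p_sata=$(awk -v b="$bits" 'BEGIN { printf "%.4f", 1 - exp(-b / 1e14) }')

echo "enterprise SAS  (1 per 10^16 bits): P(URE during rebuild) = $p_sas"
echo "consumer SATA   (1 per 10^14 bits): P(URE during rebuild) = $p_sata"
```

The two orders of magnitude in the URE spec are the difference between a sub-1% rebuild risk and a coin-flip-adjacent one, which is the quantitative case for RAID 6 on large arrays.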

Root Cause

One of the six SAS drives (enclosure 252, slot 3) exceeded its SMART Reallocated_Sector_Ct threshold after 3 years of 24/7 operation. The drive firmware flagged a predictive failure, and the PERC controller took the drive out of service, leaving the RAID 5 virtual drive degraded with no parity redundancy. No hot spare was configured, so an automatic rebuild could not begin. The remaining five drives are the same age and likely the same manufacturing batch, increasing the risk of a second failure or an unrecoverable read error (URE) during the rebuild.

Fix

Immediate:

# Step 1: Trigger a fresh backup BEFORE touching anything
# (application-specific -- snapshot, rsync, pg_basebackup, etc.)

# Step 2: Physically install the replacement drive in an empty slot (e.g., slot 6)
# The datacenter tech inserts the drive; it should appear as "Unconfigured Good"

# Step 3: Verify the new drive is detected
sudo storcli /c0/eall/sall show | grep "UGood"

# Step 4: Replace the failing drive -- start rebuild
# Option A: If you can hot-swap the bad drive's slot
sudo storcli /c0/e252/s3 set offline
sudo storcli /c0/e252/s3 set missing
# (physically replace drive in slot 3)
# Rebuild is started per physical drive; it may auto-start if autorebuild is enabled
sudo storcli /c0/e252/s3 start rebuild

# Option B: Add the new drive as a dedicated hot spare and let controller auto-rebuild
sudo storcli /c0/e252/s6 add hotsparedrive dgs=0

# Step 5: Monitor rebuild progress on the rebuilding drive (can take 4-8 hours on 1.2TB drives)
sudo storcli /c0/e252/s3 show rebuild

Preventive:

  • Always configure a global or dedicated hot spare so rebuilds start immediately
  • Set up SMART monitoring with smartd to alert on attribute degradation before the threshold is crossed
  • For arrays larger than 4 drives with 1TB+ disks, use RAID 6 (dual parity) or RAID 10 instead of RAID 5 to survive a URE during rebuild
  • Replace drives proactively at 3-4 years in 24/7 environments -- stagger replacements to avoid same-batch failures
  • Monitor rebuild progress and reduce competing I/O load during the rebuild if possible; the rebuild rate is set at the controller level (storcli /c0 set rebuildrate=60)
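The smartd recommendation can be sketched as a config fragment. The device-id range 0-5 and the mail address are assumptions for this 6-drive example -- map the real ids with storcli /c0/eall/sall show:

```
# /etc/smartd.conf -- sketch for disks sitting behind a PERC controller.
# -d megaraid,N addresses the Nth physical disk behind /dev/sda;
# -a monitors all SMART attributes; -m mails on failure/prefail events.
/dev/sda -d megaraid,0 -a -m ops@example.com
/dev/sda -d megaraid,1 -a -m ops@example.com
/dev/sda -d megaraid,2 -a -m ops@example.com
/dev/sda -d megaraid,3 -a -m ops@example.com
/dev/sda -d megaraid,4 -a -m ops@example.com
/dev/sda -d megaraid,5 -a -m ops@example.com
```

With this in place, the predictive-failure alert arrives from smartd while the attribute is still trending, rather than from Nagios after the controller has already degraded the array.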

Common Mistakes

  • Panicking and immediately yanking the predictive-failure drive -- always verify which slot, and always back up first
  • Not backing up before starting the rebuild -- if a URE occurs during rebuild, the array is lost
  • Pulling the wrong drive -- physically label and verify the slot number matches the controller's enclosure/slot mapping
  • Ignoring the age of the remaining drives -- if they are all from the same batch, they will fail around the same time
  • Attempting to rebuild a RAID 5 with consumer SATA drives larger than 2TB -- the URE probability makes this nearly suicidal

Interview Angle

Q: You get an alert that a RAID 5 array is degraded. What do you do, and what are the risks of rebuilding?

Good answer shape: First identify which disk failed using the RAID controller CLI (storcli/megacli) and verify with SMART data. Back up before doing anything destructive. The key risk with RAID 5 is that during a rebuild you have zero redundancy: if any remaining drive hits an Unrecoverable Read Error (URE), the entire array is lost. This is why RAID 5 is not recommended for large drives. A strong answer mentions checking whether a hot spare exists, monitoring rebuild progress, reducing I/O during the rebuild, and recommending RAID 6 or RAID 10 for future deployments.

