
Portal | Level: L1: Foundations | Topics: RAID, Server Hardware | Domain: Datacenter & Hardware

Scenario: RAID 5 Array Degraded with Predictive Failure

Situation

At 09:47 AM, Nagios fired a CRITICAL alert for host app-node-07: "RAID array /c0/v0 is DEGRADED -- 1 drive predictive failure." The server is a Dell R640 running a RAID 5 array across six 1.2TB 10K SAS drives on a PERC H740P controller. This machine runs a stateful application tier with local storage for session data. The application is still serving traffic, but there is no redundancy left if another disk fails.

What You Know

  • RAID 5 with 6 drives, one showing predictive failure (SMART threshold exceeded)
  • No hot spare is configured in this array (cost-cutting decision from the original build)
  • The server has been in production for 3 years with the original drives
  • Last full backup was 6 hours ago; application data changes frequently
  • A spare drive of the same model is available in the datacenter parts cage

Investigation Steps

1. Identify the Degraded Array and Failed Disk

Command(s):

# Check overall controller and virtual drive status (Dell PERC / LSI MegaRAID)
sudo storcli /c0/vall show

# Get detailed info on all physical drives -- look for "Predictive Failure"
sudo storcli /c0/eall/sall show all | grep -E "Drive|State|Media Error|Predictive|Firmware state|Slot"

# Alternative with older megacli
sudo megacli -LDInfo -Lall -aALL
sudo megacli -PDList -aALL | grep -E "Slot|Firmware state|Predictive|Media Error|Drive has"

# Check Linux kernel messages for I/O errors
sudo dmesg | grep -iE "error|fault|reset|abort" | tail -30

What to look for: The virtual drive (/c0/v0) should report a Dgrd (degraded) state, and one physical drive should show a failed/offline firmware state or the predictive-failure flag set. Note its enclosure ID and slot number (e.g., /c0/e252/s3). Check Media Error Count and Other Error Count -- if they are climbing between checks, the drive is actively failing. Kernel messages with sd device errors or SCSI resets confirm that I/O problems are reaching the OS layer.
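On a controller with many drives, eyeballing this output is error-prone. A minimal sketch of scripting the check -- the storcli invocation in the comment is real, but the sample text below is a hypothetical, trimmed excerpt of its output so the parsing can be shown without hardware:

```shell
#!/bin/sh
# Sketch: surface any physical drive reporting a nonzero Media Error Count.
# In production, pipe the real output in instead of the sample:
#   sudo storcli /c0/eall/sall show all | awk '...'
# "sample" below is a hypothetical, trimmed excerpt for illustration.
sample='Drive /c0/e252/s3 Device attributes :
Media Error Count = 12
Drive /c0/e252/s4 Device attributes :
Media Error Count = 0'

flagged=$(printf '%s\n' "$sample" | awk '
  /^Drive \// { drive = $2 }              # remember which drive we are under
  /Media Error Count/ && $NF + 0 > 0 {    # nonzero error count -> report it
    print drive ": media errors = " $NF
  }')
echo "$flagged"
```

Wired into Nagios or cron, a nonzero result here catches a deteriorating drive before the controller declares a predictive failure.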

2. Verify SMART Data and Assess Drive Health

Command(s):

# Map the failing drive to the Device Id (DID) that smartctl's megaraid,N addressing uses
sudo storcli /c0/e252/s3 show all | grep -i "device id"

# Pull SMART data through the controller using smartctl's megaraid passthrough device type
sudo smartctl -a -d megaraid,3 /dev/sda

# Check specifically for reallocated sectors and pending sectors
sudo smartctl -A -d megaraid,3 /dev/sda | grep -E "Reallocated|Pending|Uncorrect"

What to look for: A Reallocated_Sector_Ct raw value climbing toward or past its threshold is the classic predictive-failure trigger. Current_Pending_Sector above zero means sectors are waiting to be remapped on the next write. Offline_Uncorrectable above zero means data on those sectors has already been lost. A high and still-rising Reallocated_Sector_Ct means the drive surface is deteriorating.
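These attribute checks script the same way. A sketch that flags any nonzero raw counter -- the sample lines are a hypothetical excerpt of a `smartctl -A` attribute table, standing in for the real command's output:

```shell
#!/bin/sh
# Sketch: flag surface-health SMART attributes with nonzero raw values.
# "sample" is a hypothetical excerpt; in production pipe in the output of
#   sudo smartctl -A -d megaraid,3 /dev/sda | grep -E "Realloc|Pending|Uncorrect"
sample='  5 Reallocated_Sector_Ct   0x0033   081   081   036    Pre-fail  Always       -       1523
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0'

# Column 2 is the attribute name; the last column is the raw value.
bad=$(printf '%s\n' "$sample" | awk '$NF + 0 > 0 { print $2 "=" $NF }')
echo "$bad"
```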

3. Check Rebuild Risk and Estimate URE Probability

Command(s):

# Get the drive model and capacity -- the URE (Unrecoverable Read Error)
# spec comes from the vendor datasheet for that model
sudo smartctl -i -d megaraid,3 /dev/sda | grep -i "rotation\|model\|capacity"

# Check current array consistency state
sudo storcli /c0/v0 show cc

# Verify the replacement drive is compatible
sudo storcli /c0/e252/s7 show all  # assuming spare is in slot 7

What to look for: Enterprise SAS drives typically spec URE at 1 in 10^16 bits read; consumer SATA drives at 1 in 10^14 (100x worse). Rebuilding this array means reading all five surviving 1.2TB drives in full, about 6TB (~4.8 x 10^13 bits). At a URE rate of 10^14 that works out to roughly a 38% chance of hitting at least one unrecoverable read error during the rebuild; at 10^16 it is under 1%. A URE on any surviving drive during a RAID 5 rebuild kills the entire array, which is why RAID 5 with large (or consumer-grade) drives is dangerous.
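The rebuild risk can be made concrete with a quick calculation. A worked sketch using the drive count and size from this scenario and the typical vendor URE specs (treat the outputs as order-of-magnitude estimates, not guarantees):

```shell
#!/bin/sh
# Worked example: P(at least one URE) while rebuilding this array.
# A RAID 5 rebuild reads the 5 surviving 1.2TB drives in full.
bits=$(awk 'BEGIN { printf "%.0f", 5 * 1.2e12 * 8 }')   # ~4.8e13 bits read

# Model UREs as independent per-bit events: P ~= 1 - exp(-bits_read / bits_per_URE)
p_sas=$(awk -v b="$bits"  'BEGIN { printf "%.4f", 1 - exp(-b / 1e16) }')
p_sata=$(awk -v b="$bits" 'BEGIN { printf "%.4f", 1 - exp(-b / 1e14) }')

echo "enterprise SAS  (1 per 10^16 bits): P(URE during rebuild) = $p_sas"
echo "consumer SATA   (1 per 10^14 bits): P(URE during rebuild) = $p_sata"
```

The two orders of magnitude in the URE spec are the difference between a sub-1% rebuild risk and a coin-flip-adjacent one, which is the quantitative case for RAID 6 on large arrays.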

Root Cause

One of the six SAS drives (enclosure 252, slot 3) exceeded its SMART Reallocated_Sector_Ct threshold after 3 years of 24/7 operation. The drive firmware flagged a predictive failure, and the PERC controller took the drive out of service, leaving the RAID 5 virtual drive degraded with no parity redundancy. No hot spare was configured, so an automatic rebuild could not begin. The remaining five drives are the same age and likely the same manufacturing batch, increasing the risk of a second failure or an unrecoverable read error (URE) during the rebuild.

Fix

Immediate:

# Step 1: Trigger a fresh backup BEFORE touching anything
# (application-specific -- snapshot, rsync, pg_basebackup, etc.)

# Step 2: Physically install the replacement drive in an empty slot (e.g., slot 6)
# The datacenter tech inserts the drive; it should appear as "Unconfigured Good"

# Step 3: Verify the new drive is detected
sudo storcli /c0/eall/sall show | grep "UGood"

# Step 4: Replace the failing drive -- start rebuild
# Option A: If you can hot-swap the bad drive's slot
sudo storcli /c0/e252/s3 set offline
sudo storcli /c0/e252/s3 set missing
# (physically replace drive in slot 3)
# Rebuild is started per physical drive; it may auto-start if autorebuild is enabled
sudo storcli /c0/e252/s3 start rebuild

# Option B: Add the new drive as a dedicated hot spare and let controller auto-rebuild
sudo storcli /c0/e252/s6 add hotsparedrive dgs=0

# Step 5: Monitor rebuild progress on the rebuilding drive (can take 4-8 hours on 1.2TB drives)
sudo storcli /c0/e252/s3 show rebuild

Preventive:

  • Always configure a global or dedicated hot spare so rebuilds start immediately
  • Set up SMART monitoring with smartd to alert on attribute degradation before the threshold is crossed
  • For arrays larger than 4 drives with 1TB+ disks, use RAID 6 (dual parity) or RAID 10 instead of RAID 5 to survive a URE during rebuild
  • Replace drives proactively at 3-4 years in 24/7 environments -- stagger replacements to avoid same-batch failures
  • Monitor rebuild progress and reduce competing I/O load during the rebuild if possible; the rebuild rate is set at the controller level (storcli /c0 set rebuildrate=60)
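The smartd recommendation can be sketched as a config fragment. The device-id range 0-5 and the mail address are assumptions for this 6-drive example -- map the real ids with storcli /c0/eall/sall show:

```
# /etc/smartd.conf -- sketch for disks sitting behind a PERC controller.
# -d megaraid,N addresses the Nth physical disk behind /dev/sda;
# -a monitors all SMART attributes; -m mails on failure/prefail events.
/dev/sda -d megaraid,0 -a -m ops@example.com
/dev/sda -d megaraid,1 -a -m ops@example.com
/dev/sda -d megaraid,2 -a -m ops@example.com
/dev/sda -d megaraid,3 -a -m ops@example.com
/dev/sda -d megaraid,4 -a -m ops@example.com
/dev/sda -d megaraid,5 -a -m ops@example.com
```

With this in place, the predictive-failure alert arrives from smartd while the attribute is still trending, rather than from Nagios after the controller has already degraded the array.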

Common Mistakes

  • Panicking and immediately yanking the predictive-failure drive -- always verify which slot, and always back up first
  • Not backing up before starting the rebuild -- if a URE occurs during rebuild, the array is lost
  • Pulling the wrong drive -- physically label and verify the slot number matches the controller's enclosure/slot mapping
  • Ignoring the age of the remaining drives -- if they are all from the same batch, they will fail around the same time
  • Attempting to rebuild a RAID 5 with consumer SATA drives larger than 2TB -- the URE probability makes this nearly suicidal

Interview Angle

Q: You get an alert that a RAID 5 array is degraded. What do you do, and what are the risks of rebuilding?

Good answer shape: First identify which disk failed using the RAID controller CLI (storcli/megacli) and verify with SMART data. Back up before doing anything destructive. The key risk with RAID 5 is that during a rebuild you have zero redundancy: if any remaining drive hits an Unrecoverable Read Error (URE), the entire array is lost. This is why RAID 5 is not recommended for large drives. A strong answer mentions checking whether a hot spare exists, monitoring rebuild progress, reducing I/O during the rebuild, and recommending RAID 6 or RAID 10 for future deployments.

