Thinking Out Loud: Disk & Storage Ops

A senior SRE's internal monologue while working through a real storage issue. This isn't a tutorial — it's a window into how experienced engineers actually think.

The Situation

A RAID-6 array on a database server is showing degraded status. One drive failed last night and monitoring shows a second drive with increasing SMART error counts. I need to replace the failed drive and assess the risk before the second drive fails.

The Monologue

RAID-6 degraded with a potentially failing second drive. RAID-6 can survive two drive failures, but we're already at one. If the second drive goes before the rebuild completes, we're down to zero redundancy: at that point any unreadable sector on a surviving drive means data loss, and a third drive failure is total loss.

Let me assess the situation.

cat /proc/mdstat

md0 : active raid6 sda[0] sdb[1] sdc[2](F) sdd[3] sde[4] sdf[5] — 6-drive RAID-6, sdc is the failed drive. Let me check the health of the other drives, especially the one with increasing SMART errors.

sudo smartctl -a /dev/sdd | grep -E "Reallocated|Current_Pending|Offline_Uncorrectable|Temperature"

Reallocated_Sector_Ct: 48. Current_Pending_Sector: 12. That's bad. A healthy drive has 0 for both. 48 reallocated sectors means the drive has already moved data from failing sectors to spares, and 12 pending sectors means there are sectors that failed reads and haven't been relocated yet. This drive is dying.
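sdd is the known suspect, but a rebuild stresses every surviving member, so the same counters are worth pulling for all of them. A quick sketch; the drive list matches the mdstat output above, with sdc excluded because it has already failed:

```shell
# Pull the critical SMART counters for every surviving array member.
# Drive list matches the mdstat output; sdc is the failed member.
for d in sda sdb sdd sde sdf; do
    echo "== /dev/$d =="
    sudo smartctl -A "/dev/$d" \
        | grep -E "Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable"
done
```

Anything non-zero on any of these drives changes the risk calculus for the rebuild.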

Mental Model: SMART Error Interpretation

Reallocated sectors = sectors the drive has already moved to spare areas (past failures). Pending sectors = sectors that failed a read and are waiting to be reallocated on the next write. Offline uncorrectable = sectors that can't be read OR written. If any of these are non-zero AND increasing, the drive is failing. It's not a question of if, but when. On a degraded RAID array, this is an emergency.
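This triage can be scripted. A minimal sketch, assuming smartctl's standard attribute-table layout where the raw value is the tenth field; it prints a warning for each non-zero critical counter and exits non-zero if anything was flagged:

```shell
# Flag non-zero critical SMART counters from smartctl's attribute table.
# Assumes the standard "smartctl -A" layout (raw value in field 10);
# exits non-zero if any critical counter is above zero.
sudo smartctl -A /dev/sdd | awk '
/Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable/ {
    if ($10 + 0 > 0) { print "WARN: " $2 " = " $10; bad = 1 }
}
END { exit bad }'
```

On sdd this would print warnings for both the 48 reallocated and 12 pending sectors.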

The standard playbook is: replace the failed drive first, start the rebuild, and then assess the second drive. But with sdd in this state, I'm worried about the rebuild. RAID rebuilds read every sector on every surviving drive — if sdd has unreadable sectors, the rebuild could fail partway through, and that's worse than not starting it.

Let me check how bad sdd really is. Can I read all its sectors?

sudo badblocks -nsv /dev/sdd 2>&1 | tail -5

Wait — I should NOT run a write test of any kind on a drive that's part of an active RAID array. badblocks -n is a "non-destructive" read-write test: it reads each block, writes a test pattern, then writes the original data back. Those restore writes race with the array's own I/O and can corrupt the RAID. Let me just do a plain read test.

sudo dd if=/dev/sdd of=/dev/null bs=1M count=1000 status=progress 2>&1 | tail -3

No errors on the first 1GB. Let me spot-check the end of the drive too.

sudo dd if=/dev/sdd of=/dev/null bs=1M skip=900000 count=1000 status=progress 2>&1 | tail -3

Okay, reads are succeeding for now. The pending sectors might be in specific regions. The rebuild will stress-test every sector, but the current SMART numbers suggest the drive is in early failure, not imminent total failure.

Mental Model: The RAID Rebuild Risk Window

A RAID rebuild is the most dangerous time for an array. It reads every block on every surviving drive, which is when latent errors on "healthy" drives surface. The rebuild of a multi-TB array can take 12-24 hours, and during that time, if another drive fails, you lose data. This is why RAID-5 is considered dead for large drives — the probability of a second failure during rebuild is too high. RAID-6 survives two failures, but a degraded RAID-6 with a suspect drive is RAID-5 risk.
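A back-of-envelope number shows why this window is scary. With a consumer-class URE spec of one error per 1e14 bits read, reading five ~1TB survivors end to end has a substantial chance of hitting at least one unreadable sector. A sketch; the drive sizes and URE rate here are illustrative assumptions, not this array's datasheet values:

```shell
# Back-of-envelope: probability of at least one unrecoverable read
# error (URE) during a rebuild. Assumes a consumer-class spec of
# 1 error per 1e14 bits read; numbers are illustrative.
awk 'BEGIN {
    tb_read  = 5                 # surviving drives x ~1 TB each
    bits     = tb_read * 8e12    # bits read during the rebuild
    ure_rate = 1e-14             # errors per bit read (datasheet spec)
    p = 1 - exp(-bits * ure_rate)   # Poisson approximation
    printf "P(>=1 URE during rebuild) ~ %.0f%%\n", p * 100
}'
```

Enterprise drives are typically specced an order of magnitude better (1e-15), which is part of why they're worth the money in arrays this size.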

Here's my plan:

  1. Take a backup NOW, before doing anything else
  2. Replace the failed drive (sdc)
  3. Start the rebuild
  4. If the rebuild completes, hot-swap sdd
  5. If sdd fails during rebuild, we still have the backup

sudo mdadm --detail /dev/md0 | grep -E "Array Size|Used Dev Size"

Array size: 4TB. I need a backup target with at least 4TB free. Let me check the backup server.

ssh backupserver df -h /backups/

6TB free. Good. Let me start a full backup. This will take a while but it's the safety net.

sudo rsync -avP /mnt/data/ backupserver:/backups/dbserver-emergency-$(date +%Y%m%d)/ &

Backup is running in the background. While that's going, let me prepare the replacement drive. The new drive should be hot-swappable in this server.

# Remove the failed member from the array before pulling the drive
sudo mdadm /dev/md0 --remove /dev/sdc
# After physically swapping in the new drive in the sdc slot
lsblk | grep sd

New drive shows up as sdc. Per mdstat the array members are whole disks, not partitions, so I can add the raw device directly with no partitioning needed. (If the array were built on partitions like sda1, I'd first replicate the table from a healthy member with sgdisk /dev/sda -R /dev/sdc, randomize its GUIDs with sgdisk -G /dev/sdc, and then add /dev/sdc1 instead.)

sudo mdadm /dev/md0 --add /dev/sdc

Rebuild started. Let me monitor it.

watch -n 30 'cat /proc/mdstat; echo; sudo smartctl -a /dev/sdd | grep -E "Reallocated|Pending"'

The rebuild is progressing... 2% done, estimated 18 hours. Meanwhile, sdd's SMART numbers haven't changed. I'll keep monitoring.
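The watch loop works while I'm at the keyboard, but an 18-hour rebuild runs overnight, so something should alert on its own if sdd's pending count climbs. A cron-able sketch; the state-file path and the alert address are placeholders for illustration:

```shell
#!/bin/sh
# Sketch for a cron job: compare sdd's Current_Pending_Sector raw value
# against the previous run and alert on any increase. The state-file
# path and alert address are illustrative placeholders.
STATE=/var/tmp/sdd_pending.last
cur=$(sudo smartctl -A /dev/sdd | awk '/Current_Pending_Sector/ { print $10 }')
prev=$(cat "$STATE" 2>/dev/null || echo "$cur")
if [ "${cur:-0}" -gt "${prev:-0}" ]; then
    echo "sdd pending sectors rose from $prev to $cur during rebuild" \
        | mail -s "ALERT: sdd degrading" oncall@example.com
fi
echo "$cur" > "$STATE"
```

Run every 15-30 minutes from cron; the first run seeds the state file and stays silent.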

While the rebuild runs, let me set the rebuild speed to something that doesn't kill the database performance.

cat /proc/sys/dev/raid/speed_limit_max
sudo sysctl dev.raid.speed_limit_max=50000

Capped the rebuild at 50MB/s (the sysctl value is in KB/s). The database needs I/O bandwidth too. This will extend the rebuild time but keep the service usable.

One more thing — I need to order a replacement for sdd now. Don't wait until it fails. If it fails during the rebuild, we'll be in RAID-6 with two failed drives, which is technically survivable but any third failure is total loss.

What Made This Senior-Level

| Junior Would... | Senior Does... | Why |
| --- | --- | --- |
| Replace the failed drive and start the rebuild immediately | Back up the array FIRST, then start the rebuild | If the rebuild fails (second drive dies), you need that backup |
| Not check SMART counters on surviving drives | Assess all surviving drives' health before starting the rebuild | A rebuild stress-tests every drive; a drive in early failure might not survive it |
| Run badblocks on a live RAID member | Know that any write-based testing on a RAID member would corrupt the array | Some diagnostic tools are safe on standalone drives but dangerous on arrays |
| Let the rebuild run at full speed | Throttle the rebuild speed to preserve application I/O bandwidth | The rebuild is important but the service still needs to function |

Key Heuristics Used

  1. SMART Triage: Non-zero Reallocated, Pending, or Uncorrectable sectors on a RAID member = imminent replacement needed. During degraded state, this is an emergency.
  2. Backup Before Rebuild: A RAID rebuild is the highest-risk period. Always have a current backup before starting one, especially with suspect surviving drives.
  3. Rebuild Speed vs Service Impact: Throttle rebuild I/O to maintain application performance. A 24-hour rebuild that doesn't impact users is better than a 12-hour rebuild that causes timeouts.

Cross-References

  • Primer — RAID levels, SMART attributes, and storage fundamentals
  • Street Ops — mdadm commands, SMART monitoring, and drive replacement procedures
  • Footguns — Running badblocks on active RAID members and not checking surviving drives during degraded state