Datacenter RAID

15 cards — 🟢 4 easy | 🟡 6 medium | 🔴 5 hard

🟢 Easy (4)

1. What does RAID stand for, and what is it NOT?

RAID stands for Redundant Array of Independent Disks. It combines multiple block devices to improve redundancy, capacity, or performance. RAID is NOT backup -- it does not protect against deletion, corruption, ransomware, operator error, or site loss.

2. What is RAID 0 and what happens when one disk fails?

RAID 0 uses striping with no redundancy. Data is distributed across disks for maximum performance and capacity. If any single disk dies, the entire array is lost because there is no parity or mirror copy.

3. What is RAID 1 and what is its primary tradeoff?

RAID 1 mirrors data across two or more disks, providing full redundancy. Reads can be served from any copy, so read performance can improve. The primary tradeoff is storage efficiency -- a two-disk mirror yields only 50% of raw capacity, and an n-way mirror yields 1/n.

4. How do you check the status of a Linux software RAID array using mdadm?

Run "cat /proc/mdstat" for a quick overview of all arrays including sync/rebuild progress. Use "mdadm --detail /dev/md0" for detailed info on a specific array including state, member disks, and rebuild status. Also check dmesg and journalctl -k for RAID-related kernel messages.
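
As a rough illustration of what to look for in that overview, the sketch below parses /proc/mdstat-style text and reports arrays with missing members. The sample text, the device names, and the helper name find_degraded are made up for this example; on a real host you would read the file itself, and the layout can vary between kernel versions.

```python
import re

# Illustrative /proc/mdstat contents (invented for this sketch, not from a real host).
SAMPLE_MDSTAT = """\
Personalities : [raid1] [raid6] [raid5] [raid4]
md0 : active raid1 sdb1[1] sda1[0]
      1048512 blocks super 1.2 [2/2] [UU]

md1 : active raid5 sde1[3](F) sdd1[1] sdc1[0]
      2096128 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [UU_]

unused devices: <none>
"""

ARRAY_RE = re.compile(r"^(md\d+) : (\S+) (\S+)")
# "[3/2] [UU_]" means 3 members expected, 2 active; "_" marks a missing member.
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\] \[([U_]+)\]")


def find_degraded(mdstat_text):
    """Return {array_name: (active, expected)} for arrays missing members."""
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            current = header.group(1)
            continue
        status = STATUS_RE.search(line)
        if status and current:
            expected, active = int(status.group(1)), int(status.group(2))
            if active < expected or "_" in status.group(3):
                degraded[current] = (active, expected)
            current = None
    return degraded


if __name__ == "__main__":
    print(find_degraded(SAMPLE_MDSTAT))
```

In this sample only md1 is reported, because one of its three members is marked failed.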

🟡 Medium (6)

1. How does RAID 5 distribute parity and how many disk failures can it survive?

RAID 5 uses striping with distributed parity spread across all member disks (minimum 3). It can survive exactly one disk failure. Parity writes incur overhead because the controller must read old data/parity, compute new parity, then write both.
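
To see why a single failure is survivable, here is a minimal sketch of XOR parity on one stripe. The block contents and the helper name xor_blocks are hypothetical, and a real array also rotates the parity position from stripe to stripe, which this toy example does not model.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks of one stripe (hypothetical contents) plus their parity.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Simulate losing the disk that held data[1], then rebuild it from the survivors.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == data[1])
```

With two blocks missing from the same stripe there is one equation and two unknowns, which is why single parity cannot recover from a second failure.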

2. How does RAID 6 differ from RAID 5 and when should you prefer it?

RAID 6 uses dual distributed parity, requires a minimum of 4 disks, and tolerates up to 2 simultaneous disk failures. Prefer RAID 6 over RAID 5 for large arrays where rebuild times are long (8TB+ disks can take 12-24 hours): the second parity block keeps the array alive if another disk fails during the rebuild.

3. What is RAID 10 and why is it preferred for write-heavy workloads like databases?

RAID 10 combines striping and mirroring (striped mirrors). It requires a minimum of 4 disks and can tolerate one failure per mirror pair. It is preferred for databases because mirror writes are simpler than parity computation, providing better write performance with strong redundancy.
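
The capacity cost of each level covered so far can be compared with a small calculator. This is a sketch under stated assumptions: all disks are the same size, RAID 10 uses two-way mirrors unless told otherwise, and the function name usable_capacity and the 2-disk minimum for RAID 0 are choices made for this example.

```python
def usable_capacity(level, disks, disk_tb, copies=2):
    """Usable capacity in TB for equal-size disks.

    `copies` is the number of mirror copies and only applies to RAID 10.
    """
    minimum = {0: 2, 1: 2, 5: 3, 6: 4, 10: 4}
    if level not in minimum:
        raise ValueError(f"unsupported RAID level: {level}")
    if disks < minimum[level]:
        raise ValueError(f"RAID {level} needs at least {minimum[level]} disks")
    if level == 0:
        data_disks = disks          # no redundancy, all capacity usable
    elif level == 1:
        data_disks = 1              # every disk holds the same data
    elif level == 5:
        data_disks = disks - 1      # one disk's worth of parity
    elif level == 6:
        data_disks = disks - 2      # two disks' worth of parity
    else:                           # RAID 10: stripes across mirror sets
        if disks % copies:
            raise ValueError("RAID 10 disk count must be a multiple of copies")
        data_disks = disks // copies
    return data_disks * disk_tb


if __name__ == "__main__":
    for level in (0, 5, 6, 10):
        print(f"RAID {level}: {usable_capacity(level, 8, 4)} TB usable from 8 x 4 TB")
```

For eight 4 TB disks this gives 32, 28, 24, and 16 TB respectively, which makes the capacity price of RAID 10's write performance explicit.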

4. What is the difference between write-back and write-through cache on a RAID controller?

Write-back cache acknowledges writes as soon as data hits the controller's RAM cache, giving much better performance. Write-through cache waits until data is written to disk before acknowledging, which is slower but safer. Write-back should be paired with a Battery Backup Unit (BBU) or a flash-backed cache module to protect cached data during power loss; without one, a power failure can lose writes that were already acknowledged.

5. What is the difference between a hot spare and a cold spare in a RAID array?

A hot spare is a disk installed in the system and pre-assigned to the RAID controller; it automatically begins rebuilding when a member disk fails, minimizing the degraded window. A cold spare is a replacement disk kept on the shelf that requires manual intervention to install and initiate the rebuild.

6. What are common RAID failure patterns that operators should watch for?

Key failure patterns include: replacing the wrong disk (classic mistake), array degraded unnoticed for too long before a second failure, running rebuilds under heavy I/O load, assuming parity protects against all corruption, ignoring SMART errors on underlying drives, and having no backup despite RAID redundancy.

🔴 Hard (5)

1. Why is a RAID rebuild a high-risk period, especially on large-capacity disks?

During rebuild, redundancy margin is reduced, remaining disks are stressed harder with additional I/O, and performance degrades. On large disks (8TB+), rebuilds can take 12-24 hours. On RAID 5, a second disk failure during this window causes complete data loss. On RAID 6, a second failure removes the remaining redundancy, and a third failure before the rebuild completes loses the array.
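
The rebuild window can be estimated with simple arithmetic: disk capacity divided by the sustained rebuild rate. The rates below are assumptions chosen for illustration, not measurements, and the result is a lower bound because production I/O competes with the rebuild.

```python
def rebuild_hours(disk_tb, rebuild_mb_per_s):
    """Lower-bound rebuild time in hours for one disk at a sustained rate."""
    disk_bytes = disk_tb * 1e12                      # decimal terabytes
    seconds = disk_bytes / (rebuild_mb_per_s * 1e6)  # decimal megabytes per second
    return seconds / 3600


if __name__ == "__main__":
    # Assumed sustained rates: idle array, moderate load, heavy load.
    for rate in (200, 100, 50):
        print(f"8 TB at {rate} MB/s: {rebuild_hours(8, rate):.1f} h")
```

At these assumed rates an 8 TB disk takes roughly 11, 22, and 44 hours, so halving the rebuild rate under load doubles the time the array spends exposed.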

2. What is the RAID 5 write hole and why does it matter?

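The RAID 5 write hole occurs when a crash or power loss interrupts a parity update: some data/parity blocks are written but others are not, leaving parity inconsistent. This can cause silent data corruption during degraded operation or rebuild, including in blocks that were not being written at the time. Mitigations include battery-backed or flash-backed controller cache, a write journal, and partial parity logs (PPL). A write-intent bitmap shortens the resync after an unclean shutdown but does not by itself close the write hole.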
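
The effect is easy to reproduce on a toy stripe. In this sketch (two data blocks plus parity, with hypothetical byte values and helper name) a data write lands, the matching parity write does not, and a later failure of the other data disk reconstructs garbage.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# A consistent stripe: two data blocks and their parity.
d0 = b"\x0f\x0f"
d1 = b"\xf0\xf0"
parity = xor_blocks([d0, d1])

# A write replaces d0 on disk, but power is lost before parity is updated.
d0 = b"\xaa\xaa"   # new data reached the disk
# parity still reflects the OLD d0, so the stripe is now inconsistent.

# Later the disk holding d1 fails; d1 is rebuilt from d0 and the stale parity.
reconstructed_d1 = xor_blocks([d0, parity])

print(reconstructed_d1 == d1)
```

The comparison is false: d1 was never touched by the interrupted write, yet its reconstruction is wrong, and nothing in the array signals the error.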

3. What are perccli and storcli, and when would you use them instead of mdadm?

perccli (Dell PERC) and storcli (LSI MegaRAID/Broadcom) are CLI tools for managing hardware RAID controllers. Use them instead of mdadm when the server has a hardware RAID controller rather than Linux software RAID. They manage virtual drives, check physical disk status, configure hot spares, and monitor rebuild progress.

4. What SMART attributes indicate an imminent disk failure in a RAID array?

Critical SMART indicators include: Reallocated Sector Count (bad sectors remapped -- a rising count signals failure), Current Pending Sector (sectors awaiting remap), and elevated temperature over sustained periods. Monitor with "smartctl -a /dev/sdX" and track the values over time, since the trend matters more than a single reading. A rising reallocated sector count is one of the strongest predictors of impending drive failure.
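
A monitoring script might pull those two counters out of the attribute table. The sketch below parses sample text laid out like the ATA attribute table that smartctl prints; the sample values, the WATCHED mapping, and the function name are invented for this example, attribute names vary by vendor, and SAS or NVMe drives report health in a different format.

```python
# Illustrative excerpt in the layout of the smartctl ATA attribute table
# (values invented for this sketch).
SAMPLE_ATTRIBUTES = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       24
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       41230
194 Temperature_Celsius     0x0022   118   098   000    Old_age   Always       -       34
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
"""

# Attribute IDs to watch: reallocated and pending sectors.
WATCHED = {5, 197}


def nonzero_watched(table_text):
    """Return {attribute_name: raw_value} for watched attributes with raw > 0."""
    found = {}
    for line in table_text.splitlines():
        fields = line.split()
        # Data rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        if int(fields[0]) in WATCHED:
            raw = int(fields[9])   # first token of the RAW_VALUE column
            if raw > 0:
                found[fields[1]] = raw
    return found


if __name__ == "__main__":
    print(nonzero_watched(SAMPLE_ATTRIBUTES))
```

A single snapshot only shows that the counters are nonzero; storing the result on each run and alerting on increases captures the rising trend.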

5. Why are small random writes particularly expensive on parity RAID (5/6)?

Each small random write on RAID 5/6 triggers a read-modify-write cycle: the controller must read the old data block and old parity, compute new parity, then write both the new data and new parity. On RAID 5 the parity is a simple XOR, and each logical write generates 4 I/O operations (2 reads, 2 writes). On RAID 6 there are two parity blocks to read and rewrite, so each logical write generates 6 I/O operations. This makes parity RAID a poor choice for random write-heavy workloads like OLTP databases.
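
The cycle can be traced on a toy single-parity (RAID 5 style) stripe. This is a minimal in-memory sketch: CountingDisk and small_write are hypothetical names, each "disk" holds one block, and controller caching is ignored.

```python
class CountingDisk:
    """In-memory stand-in for one member disk that counts its I/O operations."""

    def __init__(self, block):
        self.block = block
        self.reads = 0
        self.writes = 0

    def read(self):
        self.reads += 1
        return self.block

    def write(self, block):
        self.writes += 1
        self.block = block


def xor(a, b):
    """XOR two equal-length byte blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def small_write(data_disk, parity_disk, new_data):
    """Read-modify-write: update one data block and the stripe's parity."""
    old_data = data_disk.read()        # I/O 1: read old data
    old_parity = parity_disk.read()    # I/O 2: read old parity
    # Remove the old data's contribution from parity, then add the new data's.
    new_parity = xor(xor(old_parity, old_data), new_data)
    data_disk.write(new_data)          # I/O 3: write new data
    parity_disk.write(new_parity)      # I/O 4: write new parity


# One stripe: three data blocks and their XOR parity (0x01 ^ 0x02 ^ 0x04 = 0x07).
disks = [CountingDisk(b"\x01"), CountingDisk(b"\x02"), CountingDisk(b"\x04")]
parity = CountingDisk(b"\x07")

small_write(disks[1], parity, b"\x0a")
total_io = sum(d.reads + d.writes for d in disks + [parity])

print(total_io, parity.block.hex())
```

One logical write costs four physical operations, and the incrementally updated parity matches a full recompute over the stripe without the other data disks ever being read.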