Storage¶
33 cards • 🟢 8 easy | 🟡 17 medium | 🔴 8 hard
🟢 Easy (8)¶
1. Tell me about your storage background.
Show answer
I've managed enterprise SAN/NAS systems, NFS infrastructure, iSCSI, and distributed storage like Portworx and MinIO. I've handled provisioning, performance tuning, failover, and debugging throughput/latency issues. I also have deep experience with Linux filesystems like XFS, EXT4, and BTRFS.
Interview tip: Lead with the most relevant storage experience for the role. Mention specific technologies AND scale (number of hosts, capacity managed).
Remember: "Block = raw disk (SAN/EBS), File = shared filesystem (NFS/EFS), Object = HTTP blobs (S3/GCS)."
2. What is the difference between RAID 1 and RAID 5?
Show answer
RAID 1 mirrors data across two disks (50% capacity, tolerates one disk failure, fast reads). RAID 5 stripes data with parity across three or more disks (loses one disk worth of capacity to parity, tolerates one disk failure, but write performance suffers due to parity calculation). RAID 1 is simpler to rebuild; RAID 5 rebuild times grow with disk size and stress surviving disks.
Remember: "RAID 1 = mirror (2 disks, 50% capacity). RAID 5 = striping + parity (3+ disks, N-1 capacity)." RAID is not backup.
Gotcha: RAID 5 rebuild times grow with disk size. A 10TB disk rebuild can take 24+ hours, during which a second failure = total data loss.
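The usable-capacity arithmetic for common RAID levels can be sketched with shell arithmetic. The disk count and size below are hypothetical values chosen for illustration:

```shell
# Usable capacity for common RAID levels, assuming equal-size disks (sketch).
n=4       # number of disks (assumption)
d=10      # size of each disk in TB (assumption)
echo "RAID 0:  $(( n * d )) TB"        # striping, no redundancy
echo "RAID 1:  $(( d )) TB"            # mirror: one disk's worth, regardless of copies
echo "RAID 5:  $(( (n - 1) * d )) TB"  # one disk's worth lost to parity
echo "RAID 6:  $(( (n - 2) * d )) TB"  # two disks' worth lost to dual parity
echo "RAID 10: $(( n * d / 2 )) TB"    # mirrored pairs, then striped: 50% usable
```

With four 10 TB disks this prints 40, 10, 30, 20, and 20 TB respectively, which makes the RAID 5 vs RAID 10 capacity tradeoff concrete.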
3. When would you choose RAID 10 over RAID 5?
Show answer
RAID 10 is preferred for write-heavy workloads (databases, transactional systems) because it avoids the parity write penalty. It also rebuilds faster since only a mirror copy needs to be re-synced rather than recalculating parity across all disks. The tradeoff is 50% usable capacity vs roughly 75-80% for RAID 5.
Remember: "RAID 10 = mirror first, then stripe. RAID 01 = stripe first, then mirror." RAID 10 is preferred because it survives more failure combinations.
4. What are the key differences between ext4 and XFS?
Show answer
ext4 supports volumes up to 1 EB and files up to 16 TB; it is mature, well-tested, and supports online resize (grow only). XFS excels at large file sequential I/O and parallel writes, supports online defragmentation, and scales better on high-core-count systems. ext4 is generally better for small-file workloads; XFS is preferred for large data volumes and enterprise use (default in RHEL).
Fun fact: XFS was created by Silicon Graphics (SGI) in 1993 for IRIX. It became the default filesystem in RHEL 7 (2014), replacing ext4.
Remember: "XFS for large files, ext4 for general purpose." XFS cannot be shrunk; ext4 can (offline only).
5. A disk shows 100% full but du reports less usage than df. What could explain the discrepancy?
Show answer
Deleted files still held open by running processes. The kernel does not free the disk blocks until the file descriptor is closed. Use "lsof +L1" to find processes holding deleted files. Restarting or signaling those processes releases the space. Another cause: reserved blocks (typically 5% on ext4) visible to df but not to du.
Debug clue: `lsof +L1` finds processes holding deleted files. Common culprit: rotated log files still open by the application.
Remember: "Deleted but held open = space not freed." The kernel waits until all file descriptors close.
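The effect is easy to reproduce. A minimal Linux-only sketch (the fd number 3 is arbitrary, and /proc is used instead of lsof so it works without extra tools):

```shell
# Demonstrate "deleted but held open": blocks stay allocated until the fd closes.
tmp=$(mktemp)
exec 3>"$tmp"            # hold the file open on file descriptor 3
echo "some data" >&3
rm "$tmp"                # unlink the name -- du no longer counts it...
ls -l "/proc/$$/fd/3"    # ...but the kernel shows the target as '(deleted)'
exec 3>&-                # closing the fd finally frees the blocks
```

Until the final `exec 3>&-`, df still charges the blocks to the filesystem even though the file has no name, which is exactly the du/df discrepancy described above.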
6. How does iotop help with storage troubleshooting?
Show answer
iotop shows per-process I/O usage in real time, similar to top for CPU. Run "iotop -oP" to see only processes actively doing I/O. It reveals which process is generating excessive reads or writes, helping you identify runaway log writers, backup jobs, or misbehaving applications consuming all disk bandwidth. Requires root or CAP_NET_ADMIN.
Gotcha: iotop requires the CONFIG_TASK_IO_ACCOUNTING kernel option. If it shows no data, check kernel config or use `pidstat -d 1` as an alternative.
7. What are the critical NFS export options and their security implications?
Show answer
"no_root_squash" allows remote root to act as root on the server -- dangerous, avoid unless required. "root_squash" (default) maps remote root to nobody. "sync" ensures writes are committed to disk before acknowledging (safe but slower). "async" acknowledges before disk commit (faster but risks data loss on server crash). "no_all_squash" preserves UID/GID mapping. Always restrict exports to specific subnets, not "*".
Gotcha: `no_root_squash` on an NFS export means any client with root can read and write ANY file on the export. Use only when absolutely necessary.
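A hardened /etc/exports line reflecting these options might look like the sketch below (the subnet and export path are hypothetical):

```
# /etc/exports -- restrict to one subnet, keep root_squash, commit writes synchronously
/srv/share  10.0.20.0/24(rw,sync,root_squash,no_subtree_check)
```

After editing, `exportfs -ra` re-reads the file; `exportfs -v` shows the effective options per export.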
8. A monitoring dashboard alerts on a disk with Temperature = 42C, Power-On Hours = 38000, and Start/Stop Count = 150. Should you be concerned?
Show answer
No. These attributes are poor predictors of failure. The Backblaze and Google disk failure studies show temperature (unless extreme >60C), power-on hours, and start/stop count do not reliably correlate with imminent failure. Old drives are not inherently more likely to fail. Instead, check the attributes that matter: IDs 5, 187, 197, and 198.
Remember: "Temperature, power-on hours, and start/stop count are noise." The Google and Backblaze studies both confirmed these have weak correlation with failure.
🟡 Medium (17)¶
1. How do you troubleshoot storage latency?
Show answer
Check I/O wait, iostat, dmesg for disk errors, network paths for NFS/iSCSI, and confirm the backend array is healthy. Then isolate: system-level? network-level? storage-level?
Remember: "Block = raw disk (SAN/EBS), File = shared filesystem (NFS/EFS), Object = HTTP blobs (S3/GCS)."
Example: Database needs block storage (IOPS). Web assets need object storage (scalability). Shared configs need file storage (NFS).
Debug clue: Start with `iostat -xz 1` to identify which device is slow, then check `await` (I/O latency) and `%util` (saturation).
2. Why is RAID 5 considered risky with large modern disks?
Show answer
With multi-TB disks, RAID 5 rebuild times can exceed 24 hours. During rebuild, the array is degraded with zero redundancy. A second disk failure or an unrecoverable read error (URE) during rebuild causes total data loss. RAID 6 (dual parity) or RAID 10 are safer alternatives for large disks.
Number anchor: With 10TB disks at 10^14 URE (Unrecoverable Read Error) rate, there's a ~5% chance of hitting an unrecoverable error during rebuild.
3. What are the advantages and risks of btrfs?
Show answer
Btrfs offers copy-on-write snapshots, built-in checksumming (detects silent corruption), transparent compression, subvolumes, and send/receive for incremental backups. Risks include RAID 5/6 implementation being historically unreliable (the "write hole"), performance degradation with heavy random writes and snapshots, and less maturity than ext4/XFS for production database workloads.
Remember: "IOPS = random I/O speed, throughput = sequential I/O speed, latency = time per operation." All three matter but for different workloads.
Fun fact: Btrfs ("butter-FS" or "better-FS") was started by Oracle in 2007. Facebook was its largest production user before switching to XFS for database workloads.
4. A disk shows 0% space used but you cannot create files. What is the likely cause?
Show answer
The filesystem has run out of inodes. Each file (including tiny files, symlinks, sockets) consumes one inode. Check with "df -i". Common causes: millions of small files (mail queues, session files, cache dirs). Fix by cleaning up files, or reformat with a higher inode count (mke2fs -N). XFS dynamically allocates inodes so this is rarer on XFS.
Debug clue: `df -i` shows inode usage. If inodes are at 100% but space is available, the fix is to delete many small files or reformat.
Remember: "Each file = one inode." Millions of tiny cache or session files exhaust inodes before space.
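A rough awk check for near-exhausted inodes can be piped from `df -i`. The sketch below runs against an invented capture so it is self-contained; the column layout matches `df -i`:

```shell
# Flag filesystems at or above 90% inode usage (sample `df -i` output below).
df_i_sample='Filesystem      Inodes   IUsed   IFree IUse% Mounted on
/dev/sda1      6553600 6553600       0  100% /var
/dev/sdb1      6553600  120000 6433600    2% /data'

echo "$df_i_sample" |
  awk 'NR > 1 { gsub(/%/, "", $5); if ($5 + 0 >= 90) print $1, "inode usage", $5 "%" }'
```

Here only /dev/sda1 is flagged; in real use, replace the sample variable with `df -i` itself.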
5. You add an entry to /etc/fstab and the system fails to boot. How do you recover?
Show answer
Boot into single-user mode or a rescue/live environment. The failed mount with default options causes boot to hang or drop to emergency shell. Edit /etc/fstab to fix or comment out the bad entry. Use "mount -a" to test before rebooting. Best practice: always use "nofail" or "x-systemd.device-timeout" mount options for non-critical mounts to prevent boot failures.
Remember: "nofail in fstab = don't block boot." Always add `nofail` for non-critical mounts. Test with `mount -a` before rebooting.
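A defensive fstab entry for a non-critical mount might look like this (the UUID and mount point are placeholders):

```
# /etc/fstab -- nofail keeps boot going; the timeout caps how long systemd waits
UUID=0a1b2c3d-0000-0000-0000-000000000000  /data  ext4  defaults,nofail,x-systemd.device-timeout=10s  0  2
```

With these options a missing or broken device logs an error instead of dropping the system to the emergency shell.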
6. An NFS mount hangs and makes the entire system unresponsive. How do you diagnose and fix this?
Show answer
NFS default is "hard" mount, meaning processes block indefinitely waiting for the server. Use "mount -o soft,timeo=30,retrans=3" for non-critical mounts so operations fail instead of hanging. For diagnosis: check NFS server availability, network path, firewall rules (ports 2049, 111), and "nfsstat -c" for RPC errors. Use "umount -f" or "umount -l" (lazy) to remove a stuck mount.
Remember: "Hard mount = hang forever. Soft mount = return error." Use `soft,timeo=30` for non-critical NFS mounts.
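The same options in fstab form might look like this sketch (server name, export path, and mount point are hypothetical):

```
# /etc/fstab -- soft mount: I/O errors out after retrans retries instead of hanging
nfsserver:/export/data  /mnt/data  nfs  soft,timeo=30,retrans=3,nofail  0  0
```

Adding `nofail` here also keeps an unreachable NFS server from blocking boot, tying this card back to the fstab recovery card above.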
7. How do you validate that a backup can actually be restored?
Show answer
Schedule regular restore drills to an isolated environment. Verify: file counts and sizes match, checksums of critical files match originals, databases start and pass consistency checks, application health checks pass against restored data. Automate this as a CI/CD job or cron task. An untested backup is not a backup -- it is a hope.
Remember: "An untested backup is Schrödinger's backup -- simultaneously valid and corrupt until you try to restore."
8. Explain the difference between storage latency and throughput and when each matters.
Show answer
Latency is the time for a single I/O operation to complete (measured in ms or us). Throughput is the total data transferred per unit time (MB/s). Databases and OLTP workloads are latency-sensitive (many small random reads/writes). Streaming, video, and big-data analytics are throughput-sensitive (large sequential reads). Optimizing for one can hurt the other -- e.g., large block sizes improve throughput but can increase latency for small I/O.
Analogy: Latency is like how fast a single car goes. Throughput is like how many cars pass per hour. A highway can have high throughput but high latency (traffic jams).
9. How do you use iostat to identify a storage bottleneck?
Show answer
Run "iostat -xz 1" to see per-device stats. Key columns: await (average I/O wait time in ms -- high means slow device), %util (device utilization -- sustained 100% means saturated), r/s and w/s (IOPS), and avgqu-sz (queue depth). High await with high %util indicates the device cannot keep up. High await with low %util may indicate upstream queuing or a firmware issue.
Remember: "await > 10ms = slow device. %util > 80% sustained = saturated." These two columns tell you 80% of storage problems.
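The two-column rule of thumb can be automated with awk. The sketch below runs against an invented, simplified `iostat -x` capture so it is self-contained (real output has many more columns, and the column order varies by sysstat version):

```shell
# Flag devices with await > 10ms or %util > 80 (simplified sample output below).
iostat_sample='Device   r/s   w/s  await  %util
sda      120   300   45.2   98.5
sdb       10     5    1.2   12.0'

echo "$iostat_sample" |
  awk 'NR > 1 && ($4 > 10 || $5 > 80) { print $1, "await=" $4 "ms", "util=" $5 "%" }'
```

Only sda is flagged here; sdb is well inside both thresholds.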
10. What is the operational benefit of LVM and what is a common pitfall?
Show answer
LVM (Logical Volume Manager) provides flexible volume resizing, snapshots, and volume spanning across physical disks without downtime. Common pitfall: LVM snapshots degrade performance because every write to the original volume triggers a copy-on-write to the snapshot's COW area. A full snapshot COW area causes the snapshot to become invalid. Always monitor snapshot usage with "lvs" and remove snapshots promptly after use.
Remember: "LVM snapshots are temporary -- remove within hours." Long-lived snapshots cause severe I/O degradation.
11. Walk through extending a logical volume and filesystem online.
Show answer
1) Ensure the VG has free space ("vgs"). 2) Extend the LV: "lvextend -L +10G /dev/vg0/lv_data" (or -l +100%FREE for all remaining). 3) Resize the filesystem: for ext4 use "resize2fs /dev/vg0/lv_data", for XFS use "xfs_growfs /mountpoint". Both support online (mounted) grow. Shrinking ext4 requires unmounting first; XFS cannot be shrunk at all. Always verify with "df -h" after.
Gotcha: `resize2fs` without a size argument grows to fill the LV. XFS uses `xfs_growfs` instead. XFS cannot shrink -- plan capacity carefully.
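The ext4-vs-XFS branch in step 3 can be captured in a small dry-run helper. This is a sketch: the function only prints the command it would run, and the LV path and mount point are hypothetical:

```shell
# Print (not run) the correct online-grow command for a filesystem type.
grow_cmd() {
  fstype=$1; lv=$2; mountpoint=$3
  case "$fstype" in
    ext4) echo "resize2fs $lv" ;;           # ext4 grows via the block device
    xfs)  echo "xfs_growfs $mountpoint" ;;  # XFS grows via the mount point
    *)    echo "unsupported fstype: $fstype" >&2; return 1 ;;
  esac
}

grow_cmd ext4 /dev/vg0/lv_data /data   # -> resize2fs /dev/vg0/lv_data
grow_cmd xfs  /dev/vg0/lv_data /data   # -> xfs_growfs /data
```

The dry-run form makes the key asymmetry memorable: resize2fs takes the device, xfs_growfs takes the mount point.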
12. How does dm-multipath work and how do you troubleshoot path failures?
Show answer
dm-multipath aggregates multiple physical paths to a storage device into a single logical device, providing failover and optional load balancing. Use "multipath -ll" to list paths and their status (active, failed, ghost). Check "multipathd show paths" for individual path health. Common issues: path flapping (check cables, switch ports, HBA firmware), asymmetric access (ALUA) misconfiguration, and missing multipath.conf entries causing devices to not be claimed.
Debug clue: `multipath -ll` shows path status. A path in `faulty` state with `checker msg` reveals the specific failure reason.
13. How do you use dmesg and kernel logs to diagnose disk hardware issues?
Show answer
Check "dmesg -T | grep -i 'error\|fail\|reset\|ata'" for disk errors. Key patterns: "I/O error, dev sdX" indicates read/write failures, "ata bus reset" suggests cable or controller issues, "medium error" means unreadable sectors, "DRDY error" points to drive firmware or hardware failure. Correlate timestamps with application errors. Repeated errors on the same device warrant immediate SMART check and proactive disk replacement.
Remember: "ata bus reset = cable or controller. medium error = bad sectors. DRDY error = drive dying." Each pattern points to a different fix.
14. Which SMART attribute is the strongest single predictor of disk failure, and what does it measure?
Show answer
Reallocated Sector Count (ID 5). It tracks the number of bad sectors that have been remapped to the drive's spare area. A rising count means the drive is encountering read errors in its media and running through its reserve of spare sectors. Per Backblaze data, any non-zero value in this attribute warrants investigation and likely proactive replacement.
Number anchor: Backblaze data shows drives with non-zero Reallocated Sector Count are 4-10x more likely to fail within the next year.
15. What is the Backblaze rule of thumb for SMART-based disk replacement?
Show answer
Any non-zero value in SMART attributes 5 (Reallocated Sector Count), 187 (Reported Uncorrectable Errors), 197 (Current Pending Sector), or 198 (Offline Uncorrectable) warrants investigation and likely proactive replacement. This simple four-attribute check catches the vast majority of predictable failures. Most healthy drives show zeros in all four for their entire lifespan.
Remember: "5, 187, 197, 198 -- the four horsemen of disk failure." Check these four SMART attributes and act on any non-zero value.
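The four-attribute check can be scripted with awk. The sketch below runs against an invented, simplified `smartctl -A` capture so it is self-contained (real output has more columns between the attribute name and the raw value):

```shell
# Alert if any of SMART IDs 5, 187, 197, 198 has a non-zero raw value.
smart_sample='ID  ATTRIBUTE_NAME            RAW_VALUE
  5 Reallocated_Sector_Ct     12
  9 Power_On_Hours            38000
187 Reported_Uncorrect        0
197 Current_Pending_Sector    3
198 Offline_Uncorrectable     0'

echo "$smart_sample" |
  awk 'NR > 1 && ($1 == 5 || $1 == 187 || $1 == 197 || $1 == 198) && $3 > 0 {
         print "REPLACE-CANDIDATE:", $2, "raw=" $3
       }'
```

Note how ID 9 (Power_On_Hours) is deliberately ignored even though its raw value is large: per the cards above, it is noise, while the two flagged attributes are actionable.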
16. What is the difference between Current Pending Sector (197) and Reallocated Sector Count (5)?
Show answer
Current Pending Sector (197) tracks sectors the drive cannot currently read and is waiting to remap on the next write to that sector -- these represent an active problem with potentially unrecoverable data. Reallocated Sector Count (5) tracks sectors that have already been remapped to the spare area -- the remap succeeded but the original location was bad. Pending sectors are more urgent because data may still be at risk.
Debug clue: A drive with high Current Pending (197) but low Reallocated (5) means it's encountering errors but hasn't successfully remapped them -- data may be at risk.
17. How does NVMe health monitoring differ from SATA SMART, and what are the key fields to watch?
Show answer
NVMe drives do not use numbered SMART attribute IDs. Instead, they expose a standardized health log page with fields like: Critical Warning (bitmask -- any non-zero is urgent), Media and Data Integrity Errors (any non-zero = corruption risk), Available Spare (replacement due when below threshold), and Percentage Used (drive endurance consumed). Use `smartctl -a /dev/nvme0` or `nvme smart-log /dev/nvme0n1`. The fields are more standardized across vendors than SATA SMART.
Gotcha: NVMe drives report Percentage Used > 100% when they exceed their rated endurance. The drive still works but is operating beyond warranty.
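A sketch of checking the urgent fields against captured `nvme smart-log`-style output (the sample values are invented; field names follow nvme-cli's key-colon-value format):

```shell
# Check the urgent NVMe health fields from captured `nvme smart-log` output.
nvme_sample='critical_warning    : 0
media_errors        : 7
percentage_used     : 104%
available_spare     : 100%'

echo "$nvme_sample" | awk -F'[: %]+' '
  $1 == "critical_warning" && $2 != 0  { print "URGENT: critical warning set" }
  $1 == "media_errors"     && $2 > 0   { print "corruption risk: media errors =", $2 }
  $1 == "percentage_used"  && $2 > 100 { print "past rated endurance:", $2 "%" }'
```

This sample drive has no critical warning but does show media errors and >100% endurance used, matching the gotcha above.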
🔴 Hard (8)¶
1. How does ZFS protect against silent data corruption?
Show answer
ZFS checksums every block (data and metadata) using SHA-256 or fletcher4 in a Merkle tree structure. On read, it verifies the checksum and, if using a redundant vdev (mirror or raidz), automatically repairs corrupted blocks from a good copy (self-healing). This catches bit rot, firmware bugs, and phantom writes that traditional RAID cannot detect.
Fun fact: ZFS was created by Sun Microsystems (Jeff Bonwick and team) in 2005. The name originally stood for "Zettabyte File System."
Remember: "ZFS: checksums + redundancy = self-healing storage." It detects AND repairs corruption automatically.
2. Why are filesystem snapshots not a substitute for backups?
Show answer
Snapshots reside on the same physical storage as the original data, so a disk or controller failure destroys both. Snapshots also share underlying blocks (copy-on-write); filesystem corruption can affect both snapshot and live data. Snapshots do not protect against accidental pool/volume deletion. True backups must be on separate media, ideally offsite, and should be tested with regular restore validation.
Remember: "3-2-1 backup rule: 3 copies, 2 different media, 1 offsite." Snapshots are copy #2 on the same media -- not a real backup.
3. How does SMART monitoring help predict disk failures?
Show answer
SMART (Self-Monitoring, Analysis and Reporting Technology) tracks internal disk health counters. Key attributes: Reallocated_Sector_Ct (bad sectors remapped -- rising count is a strong failure predictor), Current_Pending_Sector (sectors waiting to be remapped), UDMA_CRC_Error_Count (cable/interface issues), and Spin_Retry_Count. Use "smartctl -a /dev/sdX" to inspect. Configure smartd to alert on threshold crossings. SMART is not perfect but catches many gradual degradation failures.
Remember: "Backblaze 4 attributes: IDs 5, 187, 197, 198." Any non-zero = investigate. This simple check catches most predictable failures.
4. When and how do you run fsck, and what are the risks?
Show answer
Run fsck only on unmounted or read-only filesystems -- running on a mounted read-write filesystem causes corruption. For root filesystem, boot into single-user mode or use a live environment. "fsck -n" does a read-only check. "fsck -y" auto-fixes (risky -- may delete data to fix structure). For XFS use "xfs_repair" (never "fsck.xfs" which is a no-op). Always have backups before running destructive repair.
War story: Running fsck on a mounted filesystem corrupted an ext4 journal, causing data loss. Always unmount or boot to rescue mode first.
5. What are common iSCSI failure modes and how do you troubleshoot them?
Show answer
Common failures: initiator cannot discover targets (check iscsid service, target portal IP/port, firewall rules on port 3260), session drops under load (check network MTU mismatches, switch errors, multipath configuration), and timeout-induced I/O errors (tune node.session.timeo.replacement_timeout). Use "iscsiadm -m session -P 3" for detailed session info. Multipath (dm-multipath) is essential for redundancy -- a single path failure should not cause I/O errors.
Remember: "Multipath = mandatory for iSCSI in production." A single path failure without multipath causes all I/O to error out.
6. What is the RAID write hole and how do modern systems mitigate it?
Show answer
The RAID write hole occurs when a power failure interrupts a stripe write: data blocks are updated but parity is not (or vice versa). On restart, parity is inconsistent, and a subsequent disk failure causes silent data loss because parity-based reconstruction produces wrong data. Mitigations: battery-backed write cache (hardware RAID), ZFS (no traditional parity write hole due to COW), MD RAID with write-intent bitmap, or journal-based parity (btrfs plans, dm-integrity).
Under the hood: The write hole occurs when data and parity are out of sync after a power failure. ZFS avoids this entirely with copy-on-write -- old data is preserved until the new write is complete.
7. What are the risks of thin provisioning and how do you monitor it safely?
Show answer
Thin provisioning allocates storage on demand rather than upfront, allowing overcommitment. The risk: if actual usage exceeds physical capacity, I/O fails catastrophically -- applications get write errors, databases corrupt, VMs pause. Monitor the thin pool usage (lvs for LVM thin pools, storage array dashboards) and set alerts at 80% pool usage. Always maintain a reserve and have automated alerts. Thick provisioning is safer for critical workloads where write failures are unacceptable.
War story: A thin-provisioned LVM pool hit 100% usage overnight when a backup job ran, causing all VMs to pause with I/O errors. Alert at 80%, expand at 85%.
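The 80% alert can be sketched as an awk threshold check over `lvs` output. The sample below is invented and keeps only the LV, VG, and Data% columns (real `lvs` output shows more fields; pool and VG names are hypothetical):

```shell
# Alert when a thin pool's data usage crosses 80% (simplified `lvs` sample below).
lvs_sample='LV       VG   Data%
thinpool vg0  86.21
lv_logs  vg0  41.00'

echo "$lvs_sample" |
  awk 'NR > 1 && $3 + 0 >= 80 { print "EXPAND:", $2 "/" $1, "data at", $3 "%" }'
```

In real use you would feed it `lvs --noheadings -o lv_name,vg_name,data_percent` (or similar) and wire the output into your alerting.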
8. How do you configure smartd for automated disk health monitoring with alerting?
Show answer
Edit /etc/smartd.conf with one line per drive (or DEVICESCAN for all). Key options: -a (monitor all attributes), -o on (enable offline testing), -s (S/../.././02|L/../../6/03) (schedule short tests daily at 2am, long tests Saturdays at 3am), -m admin@example.com (email on failure), -M exec /path/to/script (run custom alert script for Slack/PagerDuty integration). Enable with `systemctl enable --now smartd`. The daemon watches for attribute threshold crossings and test failures.
Example: A minimal smartd.conf line: `DEVICESCAN -a -o on -s (S/../.././02|L/../../6/03) -m admin@example.com -M exec /usr/local/bin/smartd-alert.sh`