Disk & Storage Ops Footguns¶
Mistakes that destroy data, brick servers, or create outages that are entirely preventable.
1. Partitioning the wrong disk¶
You run fdisk /dev/sdb but the new disk is actually /dev/sdc. You just wiped the partition table on a disk with production data. In a VM where device names shift after reboot, or when multiple disks of the same size are attached, this is terrifyingly easy.
# BEFORE any destructive disk operation, verify you have the right target
lsblk -o NAME,SIZE,MODEL,SERIAL,MOUNTPOINT
# Cross-reference: size, model, serial number, and whether anything is mounted
# If a disk has mountpoints, it is NOT the blank new disk
# Double-check with lsblk by-id
ls -la /dev/disk/by-id/ | grep sdb
# For cloud VMs, verify the volume attachment in the cloud console
# AWS: aws ec2 describe-volumes (check the attachment device name)
Fix: Always run lsblk before touching any disk. Verify by size, model, serial, and mount status. Never trust that /dev/sdb is the same disk it was yesterday.
2. Missing nofail in fstab¶
You add a new disk to /etc/fstab without the nofail option. The disk fails or is detached. On next reboot, systemd waits for the disk, times out, and drops to emergency mode. Your server is unbootable and you need console access to fix it.
# WRONG — server will not boot if this disk is missing
UUID=abc123 /data ext4 defaults 0 2
# RIGHT — nofail allows boot to continue if disk is absent
UUID=abc123 /data ext4 defaults,noatime,nofail 0 2
# For network filesystems, also add _netdev
192.168.1.10:/share /mnt/nfs nfs _netdev,nofail,rw 0 0
Fix: Always add nofail for non-root filesystems. Always add _netdev for network mounts. Always run sudo mount -a after editing fstab, before rebooting.
3. ext4 vs XFS resize differences¶
You need to shrink a filesystem. If it is XFS, you cannot. XFS supports online grow only — there is no shrink operation at all. If you assumed you could resize both directions like ext4, you are stuck with a migration instead of a resize.
# ext4: can grow online, can shrink offline
sudo resize2fs /dev/vg0/lv_data 50G # shrink to 50G (unmount and run e2fsck -f first)
sudo resize2fs /dev/vg0/lv_data # grow to fill LV (online)
# XFS: can grow online only — NO shrink
sudo xfs_growfs /mount/point # grow (online)
# There is no xfs_shrinkfs. It does not exist.
Fix: If you might need to shrink later, use ext4. If you chose XFS and need to shrink, you must create a new smaller volume, copy data over with rsync -a, and swap mounts.
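The shrink path has more steps than the one-liner above suggests, and the ordering matters. A minimal sketch, assuming the same /dev/vg0/lv_data volume from the example, mounted at /data:

```shell
# Offline ext4 shrink, end to end (device and mountpoint from the example above)
sudo umount /data
sudo e2fsck -f /dev/vg0/lv_data        # resize2fs refuses to shrink without a fresh fsck
sudo resize2fs /dev/vg0/lv_data 50G    # shrink the filesystem first...
sudo lvreduce -L 50G /dev/vg0/lv_data  # ...then the LV; reversing the order destroys data
sudo mount /data
```

Newer lvm2 can do both steps in one command with lvreduce --resizefs, which removes the ordering risk entirely.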
4. Filling disk to 100% and locking yourself out¶
Root filesystem hits 100%. Now you cannot write logs, create temp files, or even edit files to fix the problem. Shells may not start because they cannot write history. Applications crash because they cannot write anything.
# ext4 reserves 5% for root by default — this is your safety margin
sudo tune2fs -l /dev/sda1 | grep "Reserved block count"
# Reduce reservation if disk is data-only (not root filesystem)
sudo tune2fs -m 1 /dev/sdb1 # 1% reserved instead of 5%
# Emergency: if root fs is full, these commands still work as root
# because of the reserved blocks
> /var/log/large-log-file.log # truncate a log
journalctl --vacuum-size=50M # trim journal
apt clean # clear package cache
find /tmp -type f -atime +7 -delete # remove old temp files
Fix: Monitor disk usage and alert at 80%. Keep the 5% reservation on root filesystems. For data volumes, even 1% reservation provides a safety buffer.
5. RAID 5 write hole¶
RAID 5 computes parity across stripes. If power is lost during a write, the stripe may be partially updated: data blocks and parity block are inconsistent. After power returns, the array does not know which block is correct. Data is silently corrupted.
This is not theoretical — it is a known, documented failure mode of RAID 5 (and RAID 6 to a lesser extent).
Fix: Use a battery-backed write cache (BBU/BBWC) on hardware RAID controllers. For software RAID, use a write-intent bitmap (mdadm --grow /dev/md0 --bitmap=internal). Better yet, use RAID 10 for write-heavy workloads, or use ZFS/Btrfs, whose checksums detect this.
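For mdadm arrays, enabling the write-intent bitmap mentioned in the fix is a one-liner. A sketch, assuming an existing array at /dev/md0:

```shell
# Add an internal write-intent bitmap to an existing array
sudo mdadm --grow /dev/md0 --bitmap=internal
# Verify it took effect
sudo mdadm --detail /dev/md0 | grep -i bitmap
# After an unclean shutdown, only regions marked dirty in the bitmap
# are resynced, instead of re-checking the entire array
```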
6. rm vs truncate for log cleanup¶
A 40GB log file is filling the disk. You rm it. df still shows the disk full. The process that was writing the log still has the file descriptor open. The kernel will not free the space until that fd is closed (process restart or exit).
# WRONG — space not freed if a process has the file open
rm /var/log/huge-app.log
df -h # still full!
# Check: is anything holding the deleted file open?
lsof +L1 | grep deleted
# java 12345 app 15w REG 253,0 42949672960 0 /var/log/huge-app.log (deleted)
# RIGHT — truncate zeroes the file in place, space freed immediately
truncate -s 0 /var/log/huge-app.log
# Or:
> /var/log/huge-app.log
# If you already deleted it, your options are:
# 1. Restart the process (closes the fd, frees space)
# 2. Truncate via /proc: > /proc/12345/fd/15
Fix: Use truncate -s 0 or > filename instead of rm for active log files. If you must delete, verify with lsof +L1 that no process holds the file open.
7. Not testing fstab with mount -a before reboot¶
You edit /etc/fstab, add a typo in the UUID, and reboot. The system drops to emergency mode because it cannot mount a required filesystem. Now you need physical or out-of-band console access.
# After EVERY fstab edit:
sudo mount -a
echo $?
# 0 = all entries mount successfully — safe to reboot
# Non-zero = fix the error BEFORE rebooting
# Also validate fstab syntax
findmnt --verify --tab-file /etc/fstab
Fix: Run sudo mount -a after every fstab edit. No exceptions. If it fails, fix it before rebooting. Consider using nofail on all non-root entries as defense in depth.
8. fdisk vs parted for >2TB disks¶
You use fdisk with MBR on a 4TB disk. It creates a partition table that can only address 2TB. The remaining 2TB is invisible and unusable. Depending on fdisk version, it may not even warn you.
# Check disk size first
lsblk /dev/sdb
# If >2TB, you MUST use GPT
# Use parted or gdisk for GPT
sudo parted /dev/sdb mklabel gpt
sudo parted /dev/sdb mkpart data ext4 0% 100% # on GPT the first argument is a partition name, not a type
# Modern fdisk can do GPT too (g command), but parted is explicit about it
Fix: For any disk 2TB or larger, always use GPT (via parted or gdisk). For new systems of any size, default to GPT — there is no reason to use MBR on modern hardware.
9. mkfs on the wrong partition¶
You run mkfs.ext4 /dev/sdb1 but /dev/sdb1 is your existing data partition. The filesystem is gone. The data is gone. mkfs does not ask "are you sure?" by default on most distributions. It just does it.
# mkfs will happily destroy a mounted filesystem:
sudo mkfs.ext4 /dev/sdb1
# mke2fs 1.46.5 (30-Dec-2021)
# /dev/sdb1 contains a ext4 file system last mounted on /data
# Proceed anyway? (y,N) <-- some versions ask, many do not
# VERIFY before running mkfs:
lsblk -f /dev/sdb1
# NAME FSTYPE LABEL UUID MOUNTPOINTS
# sdb1 ext4 data a1b2c3d4-... /data
# ^^^ If it has a filesystem and mountpoint, DO NOT mkfs it
Fix: Always run lsblk -f before mkfs. If the partition has an existing filesystem or mountpoint, you are about to destroy data. mkfs should only be run on genuinely empty or newly-created partitions.
10. RAID rebuild performance impact¶
A disk fails in a RAID 5 array during peak hours. The rebuild starts and saturates disk I/O. Application latency spikes 10x. Users are affected for the entire 6-hour rebuild window because the rebuild is reading every block on every surviving disk.
# Check current rebuild speed limits
cat /proc/sys/dev/raid/speed_limit_min # minimum rebuild speed
cat /proc/sys/dev/raid/speed_limit_max # maximum rebuild speed
# Slow down rebuild during business hours
echo 50000 | sudo tee /proc/sys/dev/raid/speed_limit_max
# Speed up rebuild during off-peak
echo 500000 | sudo tee /proc/sys/dev/raid/speed_limit_max
Fix: Monitor rebuild speed and throttle during peak hours. Use hot spares so rebuilds start immediately (reducing the degraded window). For large arrays, RAID 6 or RAID 10 is more resilient to the double-failure risk during long rebuilds.
11. LVM snapshot filling up¶
You create an LVM snapshot for a backup. The snapshot has 10GB allocated. The source volume receives 12GB of writes during the backup. The snapshot overflows, becomes invalid, and is automatically dropped. Your backup is incomplete.
# Check snapshot usage
sudo lvs -o lv_name,data_percent,snap_percent
# lv_snap 85.23 <-- 85% full, about to overflow
# The snapshot is copy-on-write: every write to the origin is copied here
# The more writes to the origin, the faster the snapshot fills up
# If snapshot is close to full, extend it
sudo lvextend -L +20G /dev/vg0/lv_snap
Fix: Allocate snapshots generously — at least 20% of the origin volume size, more for write-heavy workloads. Monitor snapshot usage during backup operations. Set the snapshot to auto-extend in lvm.conf (snapshot_autoextend_threshold and snapshot_autoextend_percent). Remove snapshots immediately after backup completes.
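The auto-extend settings named in the fix live in the activation section of /etc/lvm/lvm.conf; a sketch with illustrative thresholds:

```
# /etc/lvm/lvm.conf
activation {
    # When a snapshot passes 70% full, lvm2's dmeventd extends it automatically
    snapshot_autoextend_threshold = 70
    # ...by 20% of its current size each time
    snapshot_autoextend_percent = 20
}
```

This only works while the dmeventd monitoring daemon is running, so verify monitoring is enabled (lvs -o lv_name,lv_monitor).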
12. Not monitoring SMART — surprise disk failure¶
You have no SMART monitoring. A disk has been accumulating reallocated sectors for months. One day it fails completely. If you had been watching SMART, you would have had weeks of warning to replace the disk proactively.
# Minimum viable SMART monitoring
# Add to cron or monitoring system:
sudo smartctl -H /dev/sda | grep -q PASSED || echo "DISK FAILING: /dev/sda"
# Check critical attributes
sudo smartctl -A /dev/sda | awk '
$2 == "Reallocated_Sector_Ct" && $10 > 0 { print "WARN: reallocated sectors:", $10 }
$2 == "Current_Pending_Sector" && $10 > 0 { print "WARN: pending sectors:", $10 }
'
# Enable smartd daemon for automated monitoring
sudo systemctl enable --now smartd
Fix: Enable smartd on every server. Configure email alerts for SMART failures. Check SMART health as part of regular infrastructure monitoring. In cloud environments, monitor the provider's disk health metrics (AWS: VolumeQueueLength, BurstBalance; GCP: disk/throttled_read_ops_count).
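smartd's alerting is configured in /etc/smartd.conf; a minimal sketch, with the mail address as a placeholder:

```
# /etc/smartd.conf
# Scan all disks, monitor all attributes, enable automatic offline tests,
# run a short self-test every day at 02:00, and mail on any failure
DEVICESCAN -a -o on -s (S/../.././02) -m ops@example.com
```

Restart the daemon after editing (sudo systemctl restart smartd) and confirm it picked up your disks in the journal.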
RAID-Specific Footguns¶
13. Running RAID 5 on large disks without understanding rebuild risk¶
You build RAID 5 with 8TB drives. A disk fails. The rebuild takes 18 hours and reads every sector of every surviving disk. The probability of hitting a URE (unrecoverable read error) on a consumer SATA drive during a full-disk read is non-trivial. A single URE during rebuild kills the array.
Fix: Use RAID 6 or RAID 10 for drives larger than 2TB. Use enterprise drives with lower URE rates.
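The rebuild risk is easy to put rough numbers on. A back-of-envelope sketch using the common 1-in-1e14-bit spec-sheet URE rate for consumer SATA drives, assuming an 8-drive array (7 surviving disks read in full); spec sheets say "less than" this rate, so treat the result as a worst case:

```shell
# Worst-case odds of at least one URE while reading 7 surviving 8TB disks in full
awk 'BEGIN {
    bits     = 7 * 8 * 1e12 * 8          # 7 disks x 8TB, terabytes -> bits
    expected = bits * 1e-14              # expected UREs at 1 per 1e14 bits
    p        = 1 - exp(-expected)        # Poisson approximation: P(>=1 URE)
    printf "P(>=1 URE during rebuild) ~ %.0f%%\n", p * 100
}'
# Prints roughly 99%
```

Real drives usually beat their spec, so the true figure is lower, but the shape of the math is why the fix recommends RAID 6 or RAID 10 for large drives.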
14. Hot-swapping the wrong disk during replacement¶
The controller says slot 3 failed. You pull what you think is slot 3, but it was actually slot 4, a healthy disk. The RAID 5 array just lost two disks. The array is dead.
Fix: Before pulling any disk, blink the fault LED: storcli /c0/e252/s3 start locate. Physically confirm the blinking drive before pulling. Never pull without LED confirmation.
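A sketch of the LED confirmation workflow. The storcli path is from the fix above; ledctl (from the ledmon package) is an assumed alternative for software RAID on compatible enclosures:

```shell
# Hardware RAID: blink the locate LED on controller 0, enclosure 252, slot 3
sudo storcli /c0/e252/s3 start locate
# ...walk to the rack, confirm the blinking drive, pull it, then:
sudo storcli /c0/e252/s3 stop locate

# Software RAID / JBOD with a compatible backplane: ledctl works per block device
sudo ledctl locate=/dev/sdc
sudo ledctl locate_off=/dev/sdc
```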
15. Not having a hot spare configured¶
A disk fails at 2 AM. The array is degraded. You create a ticket for the next datacenter visit. Three days pass. A second disk fails. Without a hot spare, rebuild never started. RAID 5 with two failures = total data loss.
Fix: Always configure at least one global hot spare per RAID controller.
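For mdadm, adding a spare is a one-liner; a sketch with /dev/sdf as a placeholder for the spare disk:

```shell
# A disk added to a healthy array becomes a spare and is pulled in
# automatically the moment a member fails
sudo mdadm --add /dev/md0 /dev/sdf
# Confirm it is listed as a spare
sudo mdadm --detail /dev/md0 | grep -i spare
# One spare can cover several arrays if they share a spare-group in /etc/mdadm/mdadm.conf
```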
16. Rebuilding under full production I/O load¶
A disk fails, rebuild starts automatically. But the rebuild competes with production I/O. Rebuild takes 3x longer than expected. The longer the rebuild window, the higher the risk of a second failure.
Fix: Throttle non-critical workloads during rebuild. Tune speed_limit_min/speed_limit_max for mdadm. For hardware RAID, set rebuild priority to high.
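For hardware RAID, the counterpart to the mdadm speed limits shown in footgun 10 is the controller's rebuild rate; a sketch for Broadcom's storcli:

```shell
# Show and set the rebuild rate (a percentage of controller bandwidth)
sudo storcli /c0 show rebuildrate
sudo storcli /c0 set rebuildrate=60   # higher = faster rebuild, more impact on prod I/O
# For mdadm, watch rebuild progress and ETA while tuning the speed limits
cat /proc/mdstat
```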