Storage Operations Footguns¶
Mistakes that cause data loss, corruption, or catastrophic storage outages.
1. Extending the LV but forgetting the filesystem¶
You run lvextend -L +50G /dev/vg/lv. The LV is bigger, but df still shows the old size. The filesystem doesn't know about the extra space. You think it's full when it's not.
Fix: Always follow lvextend with xfs_growfs (XFS) or resize2fs (ext4). Or use lvextend -r which does both in one step.
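A minimal sketch of both forms, assuming an XFS filesystem mounted at /mnt/data (the LV and mount point names are illustrative):

```shell
# One step: extend the LV and grow the filesystem together
lvextend -r -L +50G /dev/vg/lv

# Two steps, if you prefer to grow the filesystem explicitly
lvextend -L +50G /dev/vg/lv
xfs_growfs /mnt/data          # XFS grows via the mount point
# resize2fs /dev/vg/lv        # ext4 grows via the device

# Confirm df now agrees with lvs
df -h /mnt/data
```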
2. Running fsck on a mounted filesystem¶
You notice corruption warnings. You run fsck /dev/sda1 while it's mounted. fsck modifies filesystem structures while the kernel is also modifying them. You just turned minor corruption into catastrophic corruption.
Fix: Always unmount first, or boot from rescue media. fsck does warn before touching a mounted filesystem, but -y (answer yes to every prompt) sails straight past that warning.
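A safer sequence, using the /dev/sda1 device from the example above; the read-only dry run is the key habit:

```shell
# Unmount first; if it is busy, find out who holds it open
umount /dev/sda1 || fuser -vm /dev/sda1

# ext4: dry run first (-n changes nothing), then conservative auto-repair
e2fsck -n /dev/sda1
e2fsck -p /dev/sda1    # -p fixes only safe problems; avoid blanket -y

# XFS has no fsck; use xfs_repair, dry run with -n
xfs_repair -n /dev/sda1
```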
3. NFS hard mount with a dead server¶
You use hard mount (the default). The NFS server goes down. Every process that touches the mount point hangs indefinitely. Your application freezes. You can't even ls the mount point because it blocks forever.
Fix: Mount with hard plus explicit timeo and retrans so retries are bounded rather than infinite. (The old intr option is a no-op on modern kernels; a process stuck on a hard mount can still be killed with SIGKILL.) For non-critical data, consider soft mounts, accepting that interrupted writes surface as I/O errors. Monitor NFS server health proactively.
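For example, a bounded hard mount (server and export names are hypothetical; timeo is in tenths of a second):

```shell
# /etc/fstab entry: hard mount, but each retry cycle is bounded by timeo/retrans
# nfsserver:/export  /mnt/nfs  nfs  hard,timeo=150,retrans=3,_netdev  0  0

# Ad-hoc soft mount for non-critical data: returns I/O errors instead of hanging
mount -t nfs -o soft,timeo=100,retrans=2 nfsserver:/export /mnt/nfs
```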
4. RAID rebuild during peak I/O¶
A disk fails in your RAID 5 array. You add the replacement disk and rebuild starts automatically. The rebuild saturates I/O bandwidth. Your database performance drops 80%. Users complain.
Fix: Schedule rebuilds during low-I/O windows if possible. Tune rebuild speed:
```shell
# Values are KB/s per device; defaults are min=1000, max=200000
echo 10000 > /proc/sys/dev/raid/speed_limit_max   # lower ceiling = less impact on production I/O
echo 1000 > /proc/sys/dev/raid/speed_limit_min    # floor the kernel tries to maintain regardless of load
```
5. RAID 5 with large drives and no hot spare¶
You run RAID 5 with 8 TB drives. A drive fails. Rebuild takes 12+ hours. During rebuild, a second drive fails — the array is under stress and reading every sector of every remaining drive. You lose everything.
Fix: Use RAID 6 (tolerates 2 failures) for large drives. Always configure a hot spare so rebuild starts immediately. Monitor SMART proactively and replace drives showing pre-failure signs.
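A sketch with mdadm, using example device names:

```shell
# RAID 6 over four data disks plus one hot spare
mdadm --create /dev/md0 --level=6 --raid-devices=4 --spare-devices=1 \
      /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf

# Watch rebuild state and spare assignment
cat /proc/mdstat
mdadm --detail /dev/md0

# SMART pre-failure indicators worth alerting on
smartctl -A /dev/sdb | grep -Ei 'reallocated|pending|uncorrect'
```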
6. Deleting LVM snapshot while it's mounted¶
You lvremove a snapshot that's still mounted. Depending on the kernel version, you get I/O errors, data corruption on the origin volume, or a kernel panic.
Fix: Always umount before lvremove. Check mount | grep snap before removing.
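A defensive removal sequence (volume names are illustrative):

```shell
# findmnt prints the mountpoint and exits 0 if the device is mounted
if findmnt -S /dev/vg/lv_snap >/dev/null; then
    umount /dev/vg/lv_snap
fi
lvremove vg/lv_snap
```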
7. Filling /var/log and losing the OS¶
/var/log is on the root partition. A chatty application fills the disk. The OS can't write to /var, syslog fails, cron fails, logins fail. The server is effectively dead.
Fix: Put /var/log on its own partition or LV. Set log rotation and max size. Monitor disk usage and alert at 80%.
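Rotation is the cheap half of the fix; a sketch of a logrotate policy for a hypothetical myapp (path, retention, and size cap are examples):

```shell
cat > /etc/logrotate.d/myapp <<'EOF'
/var/log/myapp/*.log {
    daily
    rotate 14
    maxsize 100M
    compress
    missingok
    notifempty
}
EOF
```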
8. iSCSI multipath not configured¶
Your server connects to an iSCSI SAN through two network paths for redundancy. But multipath isn't configured. The OS sees the same LUN twice — as /dev/sdb and /dev/sdc. You format both. You've just corrupted the SAN LUN.
Fix: Always configure multipath-tools (dm-multipath) before using iSCSI with multiple paths. Verify with multipath -ll.
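Before formatting anything, prove the two devices are the same LUN, then let multipathd hand you a single mapped device (the mpathconf helper shown here is the RHEL-family way to enable it):

```shell
# Identical WWIDs => same LUN seen over two paths
/lib/udev/scsi_id -g -u /dev/sdb
/lib/udev/scsi_id -g -u /dev/sdc

# Enable multipathing and use /dev/mapper/... from now on
mpathconf --enable              # Debian/Ubuntu: apt install multipath-tools
systemctl enable --now multipathd
multipath -ll                   # expect one map with both paths active
```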
9. ZFS pool without ECC RAM¶
You run ZFS for data integrity (checksums, scrubs, self-healing). But your server has non-ECC RAM. A memory bit-flip corrupts data in the ARC cache. ZFS writes the corrupt data to disk and updates the checksum. The corruption is now permanent and "valid."
Fix: Always use ECC RAM with ZFS (or any storage that promises data integrity). Non-ECC undermines the entire checksumming guarantee.
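Two quick ways to check what you actually have (both need root; the sysfs counters depend on EDAC support in your kernel):

```shell
# DIMM-level report from the DMI tables
dmidecode -t memory | grep -i 'error correction'

# Corrected-error counters exposed by the EDAC subsystem, if present
grep . /sys/devices/system/edac/mc/mc*/ce_count 2>/dev/null
```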
10. No capacity planning¶
Your storage grows 5% per month. You don't monitor the trend. Eight months later, production storage fills up on a Friday night. The application can't write. The database crashes. You emergency-purchase and ship disks over the weekend.
Fix: Track storage growth trends. Alert at 70% with a projected date of exhaustion. Plan capacity quarterly. Build lead time for procurement into your forecasts.
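The projection itself is simple arithmetic; a sketch with illustrative numbers you would replace with figures from your monitoring history:

```shell
used_pct=62       # current usage, percent (example value)
growth_pct=5      # observed growth, percentage points per month (example value)
alert_at=70

# Linear projection: months until the alert threshold and until 100%
months_to_alert=$(awk -v u="$used_pct" -v g="$growth_pct" -v a="$alert_at" \
    'BEGIN { printf "%.1f", (a - u) / g }')
months_to_full=$(awk -v u="$used_pct" -v g="$growth_pct" \
    'BEGIN { printf "%.1f", (100 - u) / g }')

echo "Hits ${alert_at}% in ${months_to_alert} months, 100% in ${months_to_full} months"
```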