Linux Storage Footguns

Mistakes that destroy filesystems, lose data, or cause silent corruption in production storage operations.


1. Using /dev/sdX names in fstab instead of UUIDs

You add /dev/sdb1 to fstab. Next reboot, a new disk is detected first. What was /dev/sdb is now /dev/sdc. The system mounts the wrong partition -- or fails to boot because the expected device does not exist. A swap entry that now points at a data partition can let the kernel overwrite that data.

Why people do it: /dev/sdb1 is what lsblk shows. It is short and readable. UUIDs are ugly 36-character strings.

Fix: Always use UUID= in fstab. Get UUIDs with blkid. Alternatively, use filesystem labels with LABEL=, but labels are not guaranteed unique. Never use /dev/sdX in any persistent configuration.

Under the hood: Linux assigns /dev/sdX names based on detection order during boot, which depends on driver load timing and bus enumeration. Adding a USB drive, a SAN LUN, or even a kernel update can change the order. UUIDs are embedded in the filesystem superblock and are stable regardless of device enumeration order. /dev/disk/by-uuid/ symlinks show the mapping.
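Concretely, the stable form looks like this (device, mountpoint, and UUID placeholder are illustrative; the real UUID comes from blkid):

```
# Look up the filesystem UUID:
#   blkid /dev/sdb1
# Then reference it in /etc/fstab instead of the device name:
UUID=<uuid-from-blkid>  /data  ext4  defaults  0  2
```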


2. Running mkfs on an active partition

You intend to format /dev/sdc1. You type /dev/sdb1. That is your mounted data volume. Modern mkfs variants usually warn or refuse when the target looks mounted or already carries a filesystem, but a -f or -F typed on reflex bypasses the check. The filesystem is destroyed, the data is gone, and the stale mount throws errors on the next access -- possibly taking the kernel down with it.

Why people do it: Muscle memory. Tab completion. Similar device names. Safety prompts answered on autopilot.

Fix: Before running mkfs, verify the device is not mounted: mount | grep <device> and lsblk to confirm the target. Better: unmount everything on that device first, then format. Running wipefs -a on the target as a separate step also forces you to look at the device name twice before anything destructive happens.
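A minimal sketch of that pre-flight check: the function below (the name is made up) refuses to continue when the target device appears in the mounts table. It reads the table from a parameter so the logic can be exercised against test data; in real use you would rely on the /proc/mounts default.

```shell
# Hypothetical pre-format guard: bail out if the device is mounted.
# Pass a mounts file for testing; defaults to the live /proc/mounts.
check_unmounted() {
    dev="$1"
    mounts="${2:-/proc/mounts}"
    if grep -q "^$dev " "$mounts"; then
        echo "REFUSING: $dev is mounted" >&2
        return 1
    fi
    echo "$dev is not mounted; proceed deliberately"
}
```

Calling it before every mkfs (e.g. `check_unmounted /dev/sdb1 && mkfs.ext4 /dev/sdb1`) turns the fatal typo into a loud refusal.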

War story: The Pixar "Toy Story 2" incident (1998) is the canonical example. An errant rm -rf command deleted 90% of the film's assets from the server. Backups existed but hadn't been verified and were also corrupted. The film was saved only because a technical director had a full copy on her home machine. Always verify which device you're targeting before destructive operations.


3. Extending a filesystem without extending the underlying volume

You extend an LVM logical volume: lvextend -L +10G /dev/vg0/data. The LV is now larger. You forget to resize the filesystem. df still shows the old size. You think the resize failed and try again. Or worse, you start writing data assuming 10GB of new space exists -- the filesystem fills at the old boundary.

Why people do it: LVM and the filesystem are separate layers. Extending one does not automatically extend the other. The operation "feels complete" after lvextend.

Fix: Always resize the filesystem after lvextend. For ext4: resize2fs /dev/vg0/data. For XFS: xfs_growfs /mountpoint. Or use lvextend -r which does both in one step. Verify with df -h after.
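The grow step differs by filesystem, which is easy to get wrong under pressure: ext4 is resized via the block device, XFS via the mountpoint. A sketch (helper name and paths are illustrative) that maps filesystem type to the right command:

```shell
# Hypothetical helper: print the correct grow command for a filesystem.
grow_cmd() {
    fstype="$1"; lv="$2"; mnt="$3"
    case "$fstype" in
        ext4) echo "resize2fs $lv" ;;    # takes the device
        xfs)  echo "xfs_growfs $mnt" ;;  # takes the mountpoint
        *)    echo "unhandled fstype: $fstype" >&2; return 1 ;;
    esac
}
```

In practice, lvextend -r sidesteps the question entirely by invoking the filesystem resize for you.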


4. Reducing an XFS filesystem

You need to shrink a logical volume. You try xfs_growfs with a smaller size. It fails. XFS cannot be shrunk -- it is a fundamental limitation of the filesystem design. You search for workarounds and find none. The only option is backup, recreate smaller, restore.

Why people do it: ext4 supports shrinking (offline). People assume all filesystems support it. The LVM layer can shrink, so it feels like the filesystem should follow.

Fix: Know your filesystem capabilities before choosing. If you might need to shrink volumes, use ext4. If you are on XFS and need to reclaim space, the path is: backup data, delete LV, create smaller LV, create new XFS, restore data.

Gotcha: RHEL/CentOS/Rocky/Alma default to XFS for all partitions since RHEL 7. If you're using these distributions and provisioned volumes with the installer defaults, you cannot shrink them. This matters most for root volumes that were over-provisioned during initial setup.
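A quick reference for which common filesystems can shrink at all -- a sketch, with the XFS escape path noted in the comments:

```shell
# Shrink support by filesystem type. ext4 shrinks offline
# (umount, e2fsck -f, resize2fs with a smaller size); btrfs can
# shrink online. XFS cannot shrink: the only path is dump,
# recreate smaller, restore (e.g. xfsdump / mkfs.xfs / xfsrestore).
can_shrink() {
    case "$1" in
        ext4|ext3|btrfs) echo yes ;;
        xfs)             echo no ;;
        *)               echo unknown ;;
    esac
}
```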


5. Not checking dmesg for I/O errors before blaming the application

The application reports slow writes and occasional timeouts. You profile the application, tune the database, increase timeouts. The disk has been throwing I/O errors for a week. dmesg shows blk_update_request: I/O error, EXT4-fs error, and SCSI sense key errors. The disk is dying.

Why people do it: Application-layer debugging is familiar. dmesg is not in the standard debugging workflow. Storage errors are invisible to application metrics until they cause catastrophic failure.

Fix: When you see I/O-related symptoms (slow writes, timeouts, corruption), check dmesg | grep -i error and journalctl -k --since '1 hour ago' first. Check smartctl -a /dev/sdX for SMART health. Storage problems masquerade as application problems.

Debug clue: smartctl -a /dev/sdX key fields: Reallocated_Sector_Ct (bad sectors remapped — any non-zero value is concerning), Current_Pending_Sector (sectors waiting for reallocation), Offline_Uncorrectable (sectors that can't be read). If any of these are increasing, the drive is failing. Replace it before it takes your data with it.
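Those three attributes can be checked mechanically. A hedged sketch (the function name is made up) that scans smartctl attribute-table output for nonzero raw values, reading from a file so it can be tested offline:

```shell
# Hypothetical SMART triage: flag the sector-health attributes when
# their raw value (column 10 of `smartctl -A` output) is nonzero.
# Real use would be something like:
#   smartctl -A /dev/sda > /tmp/smart.out && check_smart /tmp/smart.out
check_smart() {
    awk '/Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable/ {
             if ($10 > 0) { print "WARN: " $2 " raw value " $10; bad = 1 }
         } END { exit bad }' "$1"
}
```

A nonzero exit means at least one attribute is flagged and the drive deserves immediate suspicion.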


6. Filling a filesystem to 100%

The root filesystem fills up. Now you cannot write logs, create temp files, or even log in via SSH (PAM needs to write to wtmp). Services crash because they cannot write PID files. The system is effectively dead even though the kernel is running fine.

Why people do it: Nobody watches disk space proactively. Log rotation is misconfigured. A core dump or temp file explosion fills the disk overnight.

Fix: Monitor disk usage and alert at 80%. Keep the filesystem's reserved blocks (ext4 reserves 5% for the root user by default, so root can still log in and clean up): tune2fs -m 5 /dev/sdX. Put /var/log, /tmp, and application data on separate partitions so a log explosion does not kill the root filesystem. Use logrotate with maxsize and dateext.
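The 80% alert can be a few lines of awk over `df -P` output. A sketch (the helper name and threshold are arbitrary; it reads from a file so the logic is testable):

```shell
# Hypothetical usage check: alert on any filesystem at or above a
# percentage threshold. Real use would be something like:
#   df -P | tail -n +2 > /tmp/df.out && check_usage 80 /tmp/df.out
check_usage() {
    limit="$1"; file="$2"
    awk -v limit="$limit" '{
            pct = $5; sub(/%/, "", pct)
            if (pct + 0 >= limit) { print "ALERT: " $6 " at " pct "%"; hit = 1 }
        } END { exit hit }' "$file"
}
```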


7. Forgetting mount -a after editing fstab

You add a new entry to fstab. You test it with mount /new/path and it works. You do not run mount -a to validate the full fstab. Three months later, the server reboots. The fstab has a syntax error on line 6 (from your edit). The system drops into emergency mode because a required filesystem failed to mount.

Why people do it: The manual mount worked, so fstab "must be fine." mount -a is an extra step that feels redundant.

Fix: After every fstab edit, run mount -a and check the exit code, and run findmnt --verify to catch syntax issues mount -a can miss. On systemd systems, follow with systemctl daemon-reload so the mount units generated from fstab pick up the change. Mark non-critical filesystems nofail so a failed mount does not block boot. Better: test by rebooting in a maintenance window.
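findmnt --verify is the real validator, but the field-count class of typo can also be caught with a coarse check like this sketch (the helper name is made up):

```shell
# Hypothetical coarse fstab check: valid entries have four to six
# whitespace-separated fields (dump and pass default to 0). This only
# catches gross typos; findmnt --verify does the real validation.
check_fstab() {
    awk 'NF && $1 !~ /^#/ && (NF < 4 || NF > 6) {
             print "line " NR ": " NF " fields"; bad = 1
         } END { exit bad }' "$1"
}
```

Running `check_fstab /etc/fstab` in a pre-reboot checklist costs nothing and flags the mangled line before emergency mode does.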


8. Using dd without double-checking source and destination

dd if=/dev/sda of=/dev/sdb -- but you swapped source and destination. You just overwrote your production disk with the blank spare. dd has no confirmation prompt, no safety check, and no undo. There is a reason it is called "disk destroyer."

Why people do it: dd syntax is archaic. if= and of= look similar. Fatigue and muscle memory at 3 AM.

Fix: Before running dd, echo the command and verify. Use lsblk to confirm which device is which. For disk imaging, prefer ddrescue, which tracks progress and handles read errors sanely, or pipe through pv to at least see what is happening. Wrap dd in a function that prints the full command and pauses before executing, so there is a last window to abort.
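A hedged sketch of such a wrapper -- the name, the environment variable, and the five-second pause are arbitrary choices, not a standard tool:

```shell
# Hypothetical dd wrapper: show the full command, pause, then run it.
# DDSAFE_PAUSE overrides the delay (useful for testing).
ddsafe() {
    echo "About to run: dd $*"
    echo "Ctrl-C now to abort."
    sleep "${DDSAFE_PAUSE:-5}"
    dd "$@"
}
```

Usage is identical to dd itself: `ddsafe if=/dev/sda of=/dev/sdb bs=4M`, but with five seconds to notice a swapped if=/of= before anything is written.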


9. Running LVM operations without free PE in the volume group

You try lvextend -L +20G /dev/vg0/data. It fails: "Insufficient free space." You do not have 20GB of free physical extents in the VG. So you remove a different LV to free space -- but you remove the wrong one. Or you add a new PV from a disk that was in use elsewhere.

Why people do it: LVM abstracts storage so well that people forget the physical layer. "The server has a 500GB disk, so the VG must have 500GB." It does not -- partitions, metadata, and other LVs consume space.

Fix: Always check free space first: vgs shows each volume group's VFree (vgdisplay shows the free PE count). If you need to add space, verify the physical disk is truly unused with pvs, lsblk, and blkid. Never remove an LV without confirming it is unmounted and knowing what data is on it.
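The free-space check is just a comparison once you have the numbers. A sketch that keeps the arithmetic separate from LVM so it is testable; the real query (illustrative) would be `vgs --noheadings -o vg_free --units g --nosuffix vg0`:

```shell
# Hypothetical guard: succeed only if the VG has room for the request.
# Both values are plain numbers in the same unit (e.g. GiB).
vg_has_room() {
    need="$1"; free="$2"
    awk -v need="$need" -v free="$free" 'BEGIN { exit !(free >= need) }'
}
```

Something like `vg_has_room 20 "$(vgs --noheadings -o vg_free --units g --nosuffix vg0)" || echo "not enough free PE"` fails loudly before lvextend does.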


10. Ignoring mount options and defaulting to async

You mount a critical data volume without specifying sync or data=journal. The default (async, data=ordered for ext4) is fine for most workloads. But for a database or write-ahead log that requires durability guarantees, async writes mean data that the application thinks is on disk is still in the page cache. A power failure loses those writes.

Why people do it: Defaults work for 90% of cases. Mount options are arcane. Performance with sync is significantly worse, so nobody wants to enable it.

Fix: Understand your workload's durability requirements. For databases, use data=journal (ext4) or let the database handle its own fsync. For XFS, consider wsync for synchronous directory operations. Never assume the kernel's write ordering matches your application's expectations. When in doubt, check with mount | grep <device> to verify active mount options.
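For an ext4 data volume that needs full data journaling, the fstab entry would look like this (device, mountpoint, and UUID placeholder are illustrative; note that data=journal trades significant write throughput for the stronger guarantee):

```
UUID=<uuid-from-blkid>  /var/lib/db  ext4  defaults,data=journal  0  2
```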