Linux Data Hoarding Footguns¶
Mistakes that cause data loss, wasted recovery hours, or silent corruption in Linux data hoarding setups.
1. btrfs RAID5/RAID6 Write Hole — Data Loss Risk¶
btrfs RAID5 and RAID6 have an unfixed write hole bug. If power is lost during a write, parity can become inconsistent with data. On the next scrub or repair, btrfs may "fix" good data with bad parity — making the corruption worse. This has caused real data loss for real users.
As of 2025, the btrfs developers still classify RAID5/6 profiles as unstable. The kernel will now print a warning when creating RAID5/6 profiles.
Fix: Never use btrfs RAID5 or RAID6 for data you care about. Use btrfs RAID1 (mirroring) or btrfs RAID10 instead. Or use ZFS RAID-Z, which does not have this bug. For JBOD setups, use ext4/XFS per drive with SnapRAID for parity — it avoids the write hole entirely because parity is computed offline.
2. The ZFS + ECC RAM Debate¶
A persistent myth claims that ZFS "requires" ECC RAM and will destroy your data without it. The reality is more nuanced:
- All filesystems benefit from ECC RAM. This is not ZFS-specific.
- ZFS checksums data, which means it can detect corruption from bad RAM. Other filesystems silently propagate it.
- The feared "scrub of death" scenario (ZFS "correcting" good data to match RAM-corrupted checksums) is largely theoretical — memory management does not work the way the worst-case scenario assumes.
- However, ZFS has no fsck equivalent. If a pool becomes unimportable, there is no recovery tool. Corruption that is survivable on ext4 (run fsck) can be terminal on ZFS.
Fix: Use ECC RAM if your budget and platform support it — it is good practice for any server. But do not avoid ZFS solely because you lack ECC. The checksums still provide more protection than any non-checksumming filesystem. Always maintain backups regardless.
3. Untested Restores — "Backups You've Never Restored Aren't Backups"¶
The most common data hoarding disaster is discovering that your backup is unrestorable only when you need it. Common causes:
- borg/restic repo corruption (disk errors on the backup target)
- Encryption key lost or never exported
- Backup software version incompatibility after upgrade
- Backup target ran out of space months ago and pruning silently stopped working
- Backed up symlinks instead of the actual data
Fix: Schedule quarterly restore tests. At minimum:
# borg: list archives, then dry-run an extract of one of them
borg list /mnt/backup/repo
# borg has no "latest" alias -- substitute a real archive name from the list
borg extract --dry-run /mnt/backup/repo::archive-name
# restic: verify integrity and test restore
restic -r /mnt/backup/restic check
restic -r /mnt/backup/restic restore latest --target /tmp/restore-test/ --include /some/path
Automate this. A cron job that runs borg check weekly costs nothing and catches corruption early.
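A sketch of that automation as a cron fragment (paths and schedule are illustrative, not from any particular setup):

```
# /etc/cron.d/backup-checks -- example paths and times
# Weekly borg repository consistency check (Sunday 04:00)
0 4 * * 0  root  borg check /mnt/backup/repo
# Weekly restic integrity check, also reading a 5% sample of pack data
0 5 * * 0  root  restic -r /mnt/backup/restic check --read-data-subset=5%
```

The restic `--read-data-subset` option rotates through the repository over successive runs, so every pack eventually gets read back from disk without the cost of a full `--read-data` each week.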
4. Mixing mergerfs + ZFS (Choose One Paradigm)¶
ZFS provides its own pooling (vdevs), redundancy (RAID-Z), and checksumming. mergerfs provides pooling for JBOD setups where each drive is independent. Using both together creates confusion:
- ZFS pools already present a single mount point — mergerfs adds nothing.
- ZFS expects to manage its own drives — having mergerfs proxy writes adds FUSE overhead for no benefit.
- SnapRAID does not understand ZFS datasets — it sees files, not ZFS blocks.
Fix: Choose a paradigm:
- JBOD route: ext4/XFS per drive + mergerfs + SnapRAID
- ZFS route: ZFS pool with RAID-Z + ZFS snapshots + zfs send for backup
Do not mix them. The only valid exception is if ZFS manages your boot/VM pool and mergerfs manages a separate JBOD media pool — completely separate drive sets.
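For reference, the JBOD route needs only a single fstab line to pool the independent drives (branch glob and options are illustrative; tune them per the mergerfs docs):

```
# Pool all /mnt/disk* branches into /mnt/pool via mergerfs.
# category.create=mfs places new files on the branch with most free space.
/mnt/disk*  /mnt/pool  fuse.mergerfs  cache.files=off,category.create=mfs,minfreespace=50G,fsname=pool  0 0
```

Note that each underlying /mnt/diskN stays a plain ext4/XFS filesystem, which is exactly what SnapRAID expects to see.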
5. SnapRAID with Drives Mounted Out of Order¶
SnapRAID's parity is computed against the data d1, data d2, data d3 entries in snapraid.conf, each of which is just a path. If a drive fails to mount (loose cable, fstab error), that path is an empty directory — and SnapRAID will happily compute parity against the empty directory (or against whatever else got mounted there).
Example: /mnt/disk2 fails to mount. You run snapraid sync. SnapRAID writes parity as if disk2 is empty. Your parity for disk2's files is now destroyed.
Fix:
- Use nofail in fstab so the system boots even if a drive is missing (but the mount point stays empty).
- Use the deletethreshold in snapraid-runner — it aborts if too many files appear "deleted" (which is what happens when a drive doesn't mount).
- Before running snapraid sync, always check: df -h /mnt/disk* — verify all drives are mounted.
- Consider a pre-sync check script:
#!/bin/bash
# Abort sync if any data drive is not mounted
set -eu
for disk in /mnt/disk{1..4}; do
    if ! mountpoint -q "$disk"; then
        echo "ABORT: $disk is not mounted" >&2
        exit 1
    fi
done
snapraid sync
6. Backup on the Same Physical Drive as Source¶
This sounds obvious but it happens constantly:
- borg repo on /mnt/disk1/backups backing up /mnt/disk1/data
- restic repo on the same mergerfs pool as the source files (parity protects against disk failure, but not against rm -rf or ransomware affecting the whole pool)
- "Offsite" backup to a USB drive that sits next to the server permanently
Fix: The 3-2-1 rule exists for a reason. At minimum:
1. Backup to a different physical drive (local backup)
2. Backup to a different location (offsite — rclone to B2/S3, or a USB drive stored elsewhere)
3. Parity (SnapRAID) is not backup — it protects against hardware failure, not deletion or ransomware
7. Trusting SMART to Predict All Failures¶
Google's study of 100,000+ drives found that 36% of drives that failed showed zero SMART warnings beforehand. SMART catches gradual degradation (sector reallocation, increasing error counts) but misses:
- Sudden head crashes
- PCB (controller board) failures
- Firmware bugs
- Power surge damage
Backblaze's ongoing data (290,000+ drives, publicly published quarterly) confirms: SMART is a useful early warning system, not a crystal ball.
Fix: Use SMART monitoring (smartd) but do not rely on it as your only defense. The stack should be:
1. SMART monitoring → early warning
2. SnapRAID scrub → detect silent corruption
3. SnapRAID parity → recover from disk failure
4. Backups (borg/restic) → recover from everything else
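A sketch of a per-drive smartd.conf directive for layer 1 (the device path, schedule, and mail target are examples):

```
# /etc/smartd.conf -- monitor all attributes (-a), enable offline data
# collection (-o) and attribute autosave (-S), run a short self-test
# nightly at 02:00 and a long self-test Saturdays at 03:00, mail on trouble
/dev/disk/by-id/ata-DRIVE-SERIAL -a -o on -S on -s (S/../.././02|L/../../6/03) -m root
```

Using /dev/disk/by-id/ here, as in fstab, keeps the monitoring attached to the physical drive even if kernel device ordering (/dev/sdX) changes between boots.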
8. Encryption Key Lost — No Recovery¶
Full-disk encryption (LUKS), borg encryption, restic encryption, and rclone crypt all depend on keys or passphrases. Lose the key, lose the data. Common scenarios:
- Reinstalled the OS without backing up LUKS headers
- borg repokey stored only in the repo itself (single point of failure)
- rclone crypt password stored in rclone.conf on the encrypted drive
- Passphrase in a password manager — password manager backup on the encrypted drive (circular dependency)
Fix:
- Export and backup LUKS headers: cryptsetup luksHeaderBackup /dev/sdX --header-backup-file luks-header-sdX.bak
- Export borg keys: borg key export /path/to/repo /safe/location/borg-key.txt
- Store encryption keys in at least two locations: password manager + printed paper in a safe, or password manager + separate encrypted USB
- Test the recovery procedure: can you actually decrypt your backups using only the key material stored offsite?
9. Not Running SnapRAID Sync After Adding Files¶
Between snapraid sync runs, newly added files have zero parity protection. If a disk fails before the next sync, those files are gone. This is the fundamental tradeoff of snapshot parity vs real-time RAID.
The window of vulnerability is from file creation to next sync. If you sync daily at 3am and add 500GB of data at 9am, those files are unprotected for 18 hours.
Fix:
- Run snapraid sync daily (at minimum)
- After large ingestion operations (copying a whole media library), run sync immediately: snapraid sync
- For truly irreplaceable data, do not rely on SnapRAID alone — copy to a backup immediately
10. Running scrub Too Aggressively on Spinning Drives¶
snapraid scrub -p 100 reads every block on every data drive. On a large array (e.g., 8 drives x 12TB), this means reading 96TB of data. This takes days and keeps all drives spinning continuously, generating heat and wear.
Fix: Use the default scrub plan — snapraid scrub (without -p 100) verifies ~8% of data per run, covering all data roughly once every 12 runs (3 months if running weekly). This is sufficient for most setups. Run a full scrub (-p 100) only when:
- You suspect silent corruption
- After a firmware update
- After a drive replacement
- Quarterly, scheduled during low-usage periods
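The schedule above can be expressed as two cron entries (times are examples; `-o 30` restricts the weekly pass to blocks not scrubbed in the last 30 days):

```
# Weekly: scrub ~8% of the array, oldest-checked blocks first (Monday 02:00)
0 2 * * 1  root  snapraid scrub -p 8 -o 30
# Quarterly: full scrub on the 1st of Jan/Apr/Jul/Oct at 01:00
0 1 1 1,4,7,10 *  root  snapraid scrub -p full
```

The weekly pass keeps every block's "last checked" age bounded at roughly three months, and the quarterly full pass catches anything the sampling rotation has not reached yet.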
11. Forgetting noatime and nofail in fstab¶
Two mount options that belong on every data drive in a hoarding setup:
- noatime: Without it, file reads still trigger access-time metadata writes (relatime, the modern default, reduces this to roughly one write per file per day, but does not eliminate it). On a media server that is almost entirely read traffic, these writes are pure overhead.
- nofail: Without it, if a data drive fails or disconnects, the system hangs at boot waiting for the mount. With nofail, boot continues and the mount point stays empty — you get a degraded array instead of an unbootable server.
# Bad
/dev/disk/by-id/ata-DRIVE-SERIAL /mnt/disk1 ext4 defaults 0 2
# Good
/dev/disk/by-id/ata-DRIVE-SERIAL /mnt/disk1 ext4 defaults,noatime,nofail 0 2
Fix: Always use defaults,noatime,nofail for data drives. Add x-systemd.device-timeout=10 if you want a faster boot when a drive is missing.