Linux Data Hoarding - Primer

Why This Matters

Data hoarding is the practice of collecting, organizing, and preserving large volumes of digital data — media libraries, archives, datasets, personal backups — on self-managed storage. Unlike enterprise SAN/NAS environments with six-figure budgets, data hoarding builds reliable multi-terabyte storage from commodity hardware and open-source software. The Linux ecosystem dominates this space because every critical tool — SnapRAID, mergerfs, rclone, borg, smartmontools — runs natively and composes through standard UNIX patterns (mount points, cron, pipes).

If you manage a homelab, run a media server, archive research data, or just refuse to trust a cloud provider with your only copy, this is your toolkit. The patterns here also translate directly to production: the same filesystem decisions, backup strategies, and integrity checks apply at any scale.

Core Concepts

1. The JBOD Philosophy

Traditional RAID stripes data across identical disks in lockstep. Data hoarding takes a different path: JBOD (Just a Bunch of Disks) — each drive is an independent filesystem, combined into a single namespace by a union filesystem (mergerfs) and protected by snapshot parity (SnapRAID).

Traditional RAID                  JBOD + mergerfs + SnapRAID
──────────────                    ──────────────────────────
All disks same size               Mix any sizes (4TB + 8TB + 12TB)
One filesystem spans all disks    Each disk is standalone ext4/xfs
Lose the array = lose everything  Lose one disk = lose that disk's files
Rebuild = rewrite entire array    Fix = reconstruct only lost files
Can't read disks individually     Pull any disk, mount it, and read it

Why this wins for media and archives:

  • Incremental growth. Add one drive at a time — no rebuilding arrays.
  • Mixed sizes. A 4TB drive from 2018 sits next to a 20TB drive from 2025.
  • Individual disk readability. If your server dies, plug any data drive into another machine and read it directly.
  • Lower risk. A failed drive only loses the files physically on that drive, and SnapRAID can reconstruct them from parity.

Mnemonic: "JBOD = Just Buy One Drive." You scale by buying one drive at a time, not matching sets. This is the opposite of traditional RAID, where all disks must be present and identical.

2. The "Perfect Media Server" Stack

The canonical data hoarding architecture, popularized by Alex Kretzschmar (ironicbadger) on perfectmediaserver.com, combines:

Layer               Tool               Role
─────               ────               ────
Union filesystem    mergerfs           Combines multiple drives into one mount point
Parity protection   SnapRAID           Snapshot parity — reconstructs files after disk failure
Individual drives   ext4 or XFS        Each data drive formatted independently
Cloud backup        rclone             Encrypted sync to Backblaze B2, S3, Google Drive
Local backup        borg / restic      Deduplicated, encrypted backups of critical data
Containers          Docker + compose   Plex, Jellyfin, Sonarr, Radarr, *arr stack
Monitoring          smartd + scripts   SMART health checks, disk usage alerts

This stack is not theoretical — it powers thousands of home media servers and small-archive setups. The key insight is that each layer is independently replaceable. Swap mergerfs for a ZFS pool. Swap borg for restic. The architecture stays the same.
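The union layer of that table can be a single fstab line. The sketch below is illustrative: the mount points, the minfreespace value, and the create policy are choices, not requirements.

```
# /etc/fstab — pool every /mnt/disk* branch into one /mnt/storage namespace
/mnt/disk*  /mnt/storage  fuse.mergerfs  defaults,allow_other,category.create=mfs,minfreespace=50G,fsname=mergerfs  0  0
```

category.create=mfs writes each new file to the branch with the most free space, which naturally spreads data across mixed-size drives.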

See also: The mergerfs topic covers union filesystem policies in depth. The homelab topic covers hardware selection and virtualization.

3. SnapRAID In Depth

SnapRAID is a snapshot parity tool created by Andrea Mazzoleni (first released 2011, GPLv3). Unlike real-time RAID, SnapRAID calculates parity on a schedule (typically daily via cron). This makes it ideal for write-once-read-many workloads like media libraries, where files rarely change after initial ingest.

Etymology: "Snap" = snapshot. Parity is calculated at a point in time, not continuously. This is the fundamental difference from mdraid or hardware RAID.

How Snapshot Parity Works

1. Files land on data drives throughout the day
2. Nightly cron runs `snapraid sync`
3. SnapRAID reads all data drives, computes parity blocks
4. Parity blocks are written to dedicated parity drive(s)
5. Between syncs, newly added files have NO parity protection

Timeline:
  ──────────────────────────────────────────────────
  |  files added  |  sync runs  |  files protected  |
  |  (vulnerable) |  (compute)  |  (parity valid)   |
  ──────────────────────────────────────────────────

snapraid.conf Format

The configuration file (typically /etc/snapraid.conf) uses a simple line-based format:

# Parity file — stored on a dedicated drive
# Parity drive must be >= largest data drive
parity /mnt/parity1/snapraid.parity

# Optional: additional parity for multi-disk fault tolerance
# 2-parity through 6-parity supported (up to 6 simultaneous failures)
#2-parity /mnt/parity2/snapraid.2-parity

# Content files — the checksum database
# Store at least 2 copies on DIFFERENT drives
content /var/snapraid.content
content /mnt/disk1/snapraid.content
content /mnt/disk2/snapraid.content

# Data drives — order matters for parity calculation
# Changing order after initial sync corrupts parity
data d1 /mnt/disk1/
data d2 /mnt/disk2/
data d3 /mnt/disk3/
data d4 /mnt/disk4/

# Exclude patterns
exclude *.unrecoverable
exclude /tmp/
exclude /lost+found/
exclude *.part

Key rules:

  • Parity drive size. The parity drive must be at least as large as the largest data drive. If your biggest data drive is 12TB, parity must be >= 12TB.
  • Content files are critical. Lose all copies and you lose the ability to recover. Store at least two copies on different drives.
  • Data drive order is sacred. Renaming d1/d2 or reordering after sync invalidates parity. Use stable labels.
  • Split parity (v11.0+). A single parity level can span multiple smaller drives with comma-separated paths.

SnapRAID Operations

Command                          What It Does                                                               When to Run
───────                          ────────────                                                               ───────────
snapraid sync                    Compute/update parity from current file state                              Daily (cron)
snapraid scrub                   Verify data integrity against stored checksums (~8% per run by default)    Weekly (cron)
snapraid scrub -p 100            Full scrub — verify everything                                             Monthly or after hardware concerns
snapraid status                  Show array health, fragmentation, error count                              On demand
snapraid diff                    Show files changed since last sync                                         Before sync (safety check)
snapraid fix -d d1               Reconstruct all files on disk d1 from parity                               After disk failure
snapraid fix -f /path/to/file    Reconstruct a specific file                                                After accidental deletion
snapraid check                   Simulate recovery without writing (dry-run fix)                            Periodic validation
snapraid smart                   Show SMART health report with failure probability                          On demand
snapraid dup                     Find duplicate files by hash                                               Space reclamation
snapraid touch                   Set sub-second timestamps for move detection                               Before sync if files were moved

Multiple Parity Levels

SnapRAID supports up to 6 parity levels — meaning it can survive up to 6 simultaneous disk failures. Each parity level requires a dedicated drive (or split across drives):

1-parity:  Survives 1 disk failure  (like RAID5)
2-parity:  Survives 2 disk failures (like RAID6)
3-parity:  Survives 3 disk failures (like RAID-Z3)
4–6-parity: For very large arrays (20+ disks)

Rule of thumb: For home use, 1 parity disk per 4 data disks. For irreplaceable data, use 2-parity.

snapraid-runner (Automation)

snapraid-runner is a Python wrapper (by Chronial) that automates SnapRAID with safety checks:

  1. Runs snapraid diff first
  2. Aborts if deletions exceed a threshold (protects against accidental mass deletion)
  3. Runs snapraid sync
  4. Optionally runs snapraid scrub
  5. Sends email notification with results

This is the standard automation approach — bare snapraid sync in cron has no safety net against a misconfigured mount wiping your parity.
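The core of that safety net is small enough to sketch in shell. The 50-deletion threshold below is an assumption, and the real snapraid-runner adds logging, scrubbing, and email on top:

```shell
#!/bin/sh
# Sketch of the snapraid-runner safety pattern.
# DEL_THRESHOLD is an assumed value; tune it to your array's normal churn.
DEL_THRESHOLD=50

# `snapraid diff` prints one "remove <path>" line per deleted file.
count_deletions() {
    grep -c '^remove ' "$1" || true
}

run_guarded_sync() {
    diff_log=$(mktemp)
    snapraid diff > "$diff_log" 2>&1
    deleted=$(count_deletions "$diff_log")
    rm -f "$diff_log"
    if [ "$deleted" -gt "$DEL_THRESHOLD" ]; then
        echo "ABORT: $deleted deletions exceed threshold of $DEL_THRESHOLD" >&2
        return 1
    fi
    snapraid sync
}
```

Run run_guarded_sync from cron instead of a bare snapraid sync; an unmounted data drive then shows up as a mass deletion and aborts the sync instead of rewriting parity.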

When SnapRAID Is Wrong

SnapRAID is not suitable for:

  • Databases (high write churn — parity never catches up)
  • Virtual machine images (constant writes)
  • Anything requiring real-time redundancy (between syncs, new files are unprotected)
  • Small file workloads (parity overhead is per-block, not per-file)

4. Filesystem Choices for Data Drives

Each data drive in a JBOD array needs its own filesystem. The choice matters for performance, features, and failure modes.

Decision Matrix

Filesystem   Best For                             Strengths                                                           Weaknesses
──────────   ────────                             ─────────                                                           ──────────
ext4         Default choice, general data         Boring and reliable, mature fsck, universal tooling                 No checksums, no snapshots, max 1 EiB volume
XFS          Large media files (4K video, ISOs)   Excellent large-file I/O, reflink copy (cp --reflink), 8 EiB max    Cannot shrink; historically fragile on power loss (improved with v5 format)
btrfs        Snapshots + checksums needed         CoW, snapshots, online defrag, checksums, send/receive, RAID1       RAID5/6 write hole (data loss risk), complex repair tools
ZFS          Self-contained redundant pools       CoW, checksums, send/recv, ARC cache, RAID-Z, proven track record   Memory hungry (1GB per TB rule of thumb), can't easily add single drives, kernel module not in mainline

Mnemonic: "ext4 = Toyota Corolla, XFS = pickup truck, btrfs = Tesla (exciting but recalls), ZFS = tank (indestructible but expensive)."

Practical advice for data hoarding:

  • Default to ext4 for SnapRAID data drives. It just works, recovery tools are mature, and every Linux distro supports it.
  • Use XFS if storing predominantly large files (video editing, ISO archives) — its allocator handles large sequential I/O better.
  • Use btrfs only for RAID1 mirrors or single-drive use where you want snapshots. Never use btrfs RAID5/6 — the write hole bug remains unfixed as of 2025.
  • Use ZFS when you want a fully self-contained solution (ZFS pools replace mergerfs+SnapRAID). But understand you are choosing a different paradigm — ZFS pools are not JBOD.

See also: The mounts-filesystems topic covers mount options, fstab syntax, and VFS internals. The disk-and-storage-ops topic covers partitioning, LVM, and block device management.

5. Backup Tools

Parity (SnapRAID) is not backup. Parity protects against disk failure. Backup protects against accidental deletion, ransomware, fire, and theft. You need both.

borg (BorgBackup)

  • Language: Python + C (Cython for performance)
  • First release: 2015 (fork of Attic, which started 2010)
  • License: BSD-3-Clause
  • Deduplication: Content-defined chunking (variable-length blocks)
  • Compression: lz4, zstd, zlib, lzma (configurable per-archive)
  • Encryption: AES-256-CTR + HMAC-SHA256 (authenticated encryption)
  • Key feature: Append-only mode for remote repos (ransomware protection)
# Initialize an encrypted repository
borg init --encryption=repokey /mnt/backup/borg-repo

# Create a backup with compression
borg create --compression zstd,3 /mnt/backup/borg-repo::daily-{now:%Y-%m-%d} /mnt/data

# Prune old backups (keep 7 daily, 4 weekly, 6 monthly)
borg prune --keep-daily=7 --keep-weekly=4 --keep-monthly=6 /mnt/backup/borg-repo

# Verify backup integrity
borg check /mnt/backup/borg-repo

# List archives
borg list /mnt/backup/borg-repo

restic

  • Language: Go (single static binary)
  • First release: 2015
  • License: BSD-2-Clause
  • Deduplication: Content-defined chunking (Rabin fingerprinting)
  • Encryption: AES-256-CTR + Poly1305 (always on, can't be disabled)
  • Key feature: Native multi-backend support — local, SFTP, S3, B2, Azure, GCS, rclone
# Initialize a repo on Backblaze B2
restic -r b2:mybucket:/backups init

# Backup with tags
restic -r b2:mybucket:/backups backup /mnt/data --tag media

# Forget + prune old snapshots
restic -r b2:mybucket:/backups forget --keep-daily 7 --keep-weekly 4 --prune

# Check integrity
restic -r b2:mybucket:/backups check

borg vs restic decision:

  • Choose borg when: backing up to local disk or SFTP, you want maximum compression, or you need append-only repos.
  • Choose restic when: backing up to S3/B2/cloud, you want a single binary with no dependencies, or you want lock-free concurrent backups.

rclone

  • Language: Go
  • Creator: Nick Craig-Wood (first release 2014)
  • Backends: 70+ cloud storage providers (S3, B2, Google Drive, Dropbox, OneDrive, SFTP, FTP, and many more)
  • Key feature: The crypt overlay — client-side encryption as a transparent layer on any backend
# Configure a remote (interactive)
rclone config

# Sync local directory to encrypted B2 bucket
rclone sync /mnt/data/important remote-crypt:backups/ --progress

# Check for differences without transferring
rclone check /mnt/data/important remote-crypt:backups/

# Mount cloud storage as local filesystem (FUSE)
rclone mount remote:media /mnt/cloud-media --vfs-cache-mode writes

rclone is not a backup tool — it is a sync/transfer tool. Use it as the transport layer for offsite copies. Pair with borg or restic for versioning and deduplication.

rsync

The baseline. Every Linux system has it. No deduplication, no encryption (use SSH), no versioning (use --link-dest for poor-man's snapshots). Still the fastest way to copy large trees between local disks or over SSH.

# Mirror a directory (archive mode, delete removed files)
rsync -avh --delete /mnt/disk1/media/ /mnt/disk2/media-mirror/

# Remote sync over SSH
rsync -avhz /mnt/data/ user@offsite:/backup/data/

# Bandwidth-limited transfer
rsync -avh --bwlimit=10M /mnt/data/ /mnt/backup/

See also: The backup-restore topic covers the 3-2-1 rule, RPO/RTO, and enterprise backup strategy in detail.

6. Data Integrity

par2 (Parchive)

Created by Tobias Rieper and Stefan Wehlus (v1 spec, October 2001); the v2 spec (January 2002) was written by Michael Nahas following a proposal by Howard Fukada. Uses Reed-Solomon error correction to create recovery blocks — originally designed for Usenet file transfers, now used for archive integrity.

# Create 10% redundancy for a directory of files
par2 create -r10 archive.par2 /mnt/archive/*.tar.gz

# Verify integrity
par2 verify archive.par2

# Repair corrupted files (up to 10% damage)
par2 repair archive.par2

Best used for: cold storage archives, files being transferred to untrusted media, long-term preservation.

SMART Monitoring

Self-Monitoring, Analysis, and Reporting Technology — built into every modern HDD and SSD. Monitored via smartctl (from smartmontools).

# Quick health check
smartctl -H /dev/sda

# Full attribute dump
smartctl -A /dev/sda

# Key attributes to watch:
#   5   Reallocated_Sector_Ct   — bad sectors remapped (>0 = concern)
#   187 Reported_Uncorrect      — uncorrectable errors
#   188 Command_Timeout         — lost communication with controller
#   197 Current_Pending_Sector  — unstable sectors awaiting reallocation
#   198 Offline_Uncorrectable   — sectors that failed offline testing

# Enable automatic monitoring daemon
systemctl enable --now smartd

Critical reality check: Google's 2007 study of 100,000+ drives found that 36% of failed drives showed zero SMART warnings beforehand. SMART catches obvious degradation but misses sudden failures (head crashes, PCB failures, firmware bugs). Never rely on SMART alone — always have parity + backups.

Backblaze publishes quarterly drive stats (publicly available). Their 2024 data shows an overall AFR (Annualized Failure Rate) of 1.57% across ~290,000 drives, dropping to 1.36% in 2025. Key finding: failure rates vary dramatically by model and age.
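The attribute watch-list above is easy to script. Below is a sketch that scans saved smartctl -A output and warns on nonzero raw values; the attribute set mirrors the comment block above, and the zero threshold is a deliberately paranoid assumption.

```shell
#!/bin/sh
# Flag critical SMART attributes with nonzero raw values.
# Input: a file containing `smartctl -A /dev/sdX` output.
# In the standard attribute table, field 2 is the attribute name and
# field 10 is the raw value.
check_smart_attrs() {
    awk '$2 ~ /^(Reallocated_Sector_Ct|Reported_Uncorrect|Command_Timeout|Current_Pending_Sector|Offline_Uncorrectable)$/ &&
         $10 + 0 > 0 { print "WARN:", $2, "raw =", $10; bad = 1 }
         END { exit bad }' "$1"
}
```

Exit status is nonzero when anything is flagged, so the function slots directly into a cron job or alerting script, e.g. `smartctl -A /dev/sda > /tmp/sda.txt && check_smart_attrs /tmp/sda.txt || mail ...`.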

Drive Burn-In

Before trusting a new drive with irreplaceable data, test it:

# 1. Run a SMART extended self-test (takes 8-24 hours on large drives)
smartctl -t long /dev/sdX

# 2. Check test result
smartctl -a /dev/sdX | grep -A2 "Self-test"

# 3. Write + read test with badblocks (destructive — erases all data)
#    Use ONLY on new, empty drives
badblocks -wsv -b 4096 /dev/sdX

# 4. Check SMART again after burn-in
smartctl -A /dev/sdX

A burn-in catches infant mortality failures (drives that fail within the first few weeks). Drives that pass burn-in are statistically more reliable.

Scrubbing

Periodic integrity verification catches silent data corruption (bit rot):

Tool       Command                        What It Checks
────       ───────                        ──────────────
SnapRAID   snapraid scrub                 File checksums against stored hashes
btrfs      btrfs scrub start /mnt/pool    Block checksums (data and metadata)
ZFS        zpool scrub tank               Block checksums + automatic repair from mirrors/parity

7. Duplicate Detection

When you hoard data long enough, duplicates accumulate. Three tools dominate on Linux:

Tool     Language   Speed                             Key Feature
────     ────────   ─────                             ───────────
jdupes   C          Fastest (7x faster than fdupes)   Hardlink/softlink/delete modes, hash-based
fdupes   C          Baseline                          The original (Adrian Lopez, 1999), simpler interface
rdfind   C++        Fast                              Ranking-based dedup, O(N log N)
# jdupes: find duplicates and hardlink them (saves space, no data loss)
jdupes -rL /mnt/data/

# jdupes: report only, don't change anything
jdupes -r /mnt/data/

# fdupes: interactive deletion of duplicates
fdupes -r /mnt/data/

# rdfind: find duplicates and replace with hardlinks
rdfind -makehardlinks true /mnt/data/

Gotcha: jdupes matches only 100% identical files (byte-for-byte). It is not a fuzzy/similarity matcher. It is also NOT a drop-in replacement for fdupes — option flags differ.

8. Drive Management

Key Commands

Command    Purpose                                Example
───────    ───────                                ───────
lsblk      List block devices with hierarchy      lsblk -o NAME,SIZE,FSTYPE,MOUNTPOINT,SERIAL
blkid      Show filesystem UUIDs and types        blkid /dev/sda1
smartctl   SMART health and attributes            smartctl -a /dev/sda
hdparm     Get/set drive parameters, benchmarks   hdparm -Tt /dev/sda (read speed test)
hd-idle    Spin down idle drives                  hd-idle -i 600 -a /dev/sdb (10-min timeout)
lsscsi     List SCSI/SATA devices                 lsscsi --size

UDEV Rules for Consistent Naming

Drive letters (/dev/sda, /dev/sdb) can change between reboots. Use UDEV rules or /dev/disk/by-id/ symlinks for stable references:

# Use /dev/disk/by-id/ in fstab (includes drive serial number)
ls -la /dev/disk/by-id/

# Example fstab entry using disk ID
/dev/disk/by-id/ata-WDC_WD120EMFZ-11A6JA0_SERIAL-part1  /mnt/disk1  ext4  defaults,noatime  0  2

# Custom UDEV rule to create friendly names
# /etc/udev/rules.d/99-data-drives.rules
# SUBSYSTEM=="block", ENV{ID_SERIAL_SHORT}=="SERIAL123", SYMLINK+="data/disk1"

Bay-based naming: In a drive enclosure, label drives by physical bay position. If bay 3 fails, you know exactly which drive to pull — no guessing at serial numbers.

Drive Spindown

For drives that are accessed infrequently (archive/cold storage), spinning them down saves power and reduces wear:

# hdparm: set standby timeout
#   values 1-240  = multiples of 5 seconds (up to 20 min)
#   values 241-251 = (value - 240) * 30 minutes; 0 = disable
# 242 = (242 - 240) * 30 min = 1 hour
hdparm -S 242 /dev/sdb

# hd-idle: more reliable for USB and some SATA controllers
# -i 600 = 600 seconds idle before spindown
hd-idle -i 600 -a /dev/sdb
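The -S encoding trips people up, so a small helper that converts minutes into an hdparm -S value can be handy. This sketch follows the encoding described in hdparm(8); values in the upper range round down to the nearest 30 minutes.

```shell
#!/bin/sh
# Convert an idle timeout in minutes to an `hdparm -S` argument.
# 1-240 encode multiples of 5 seconds (up to 20 minutes);
# 241-251 encode (value - 240) * 30 minutes (up to 5.5 hours).
spindown_value() {
    mins=$1
    if [ "$mins" -le 20 ]; then
        echo $(( mins * 60 / 5 ))      # 5-second units
    else
        echo $(( 240 + mins / 30 ))    # 30-minute units, rounds down
    fi
}
```

For example, spindown_value 60 yields 242, matching the one-hour example above.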

9. Media Tools (Brief Pointers)

Data hoarding often serves a media library. These tools sit on top of the storage stack:

Tool       Purpose                                       Note
────       ───────                                       ────
Plex       Media server with transcoding                 Closed-source, freemium, proprietary metadata
Jellyfin   Open-source media server (fork of Emby)       GPLv2, no tracking, community-driven
Emby       Media server (Jellyfin forked from this)      Partially closed-source since 2018
Sonarr     TV show management + download automation      Monitors RSS, renames, organizes
Radarr     Movie management (Sonarr fork)                Same pattern, movies instead of TV
Lidarr     Music management                              Same architecture
yt-dlp     Video archival from YouTube and 1000+ sites   Fork of youtube-dl, actively maintained

All run as Docker containers pointed at your mergerfs mount. The *arr stack handles media lifecycle; the storage stack handles durability.
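To make "containers pointed at your mergerfs mount" concrete, a minimal compose file for Jellyfin might look like the following sketch. The host paths are illustrative; the image name and port 8096 are Jellyfin's defaults.

```
# docker-compose.yml — minimal Jellyfin sketch
services:
  jellyfin:
    image: jellyfin/jellyfin
    ports:
      - "8096:8096"                      # Jellyfin's default web UI port
    volumes:
      - ./config:/config                 # server config + metadata
      - /mnt/storage/media:/media:ro     # the mergerfs pool, read-only
    restart: unless-stopped
```

Mounting the pool read-only keeps the media server from ever modifying the library; the *arr stack gets a separate read-write mount.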

Quick Reference

Build a data hoarding stack:
1. Format data drives:       mkfs.ext4 -L disk1 /dev/sdX
2. Mount individually:       /mnt/disk1, /mnt/disk2, /mnt/disk3, ...
3. Pool with mergerfs:       /mnt/disk*  /mnt/storage (union mount)
4. Protect with SnapRAID:    /mnt/parity1 (parity drive >= largest data)
5. Backup with borg/restic:  Critical data  local backup + cloud
6. Sync with rclone:         Encrypted offsite copy to B2/S3
7. Monitor with smartd:      SMART alerts + weekly scrub
8. Automate with cron:       snapraid-runner daily, scrub weekly

Key files:
  /etc/snapraid.conf           parity configuration
  /etc/fstab                   drive mounts (use by-id or UUID)
  /etc/smartd.conf             SMART monitoring daemon
  /etc/cron.d/snapraid         automation schedule

Wiki Navigation

Prerequisites

  • Linux Data Hoarding Flashcards (CLI)