- linux
- l2
- topic-pack
- linux-data-hoarding

Portal | Level: L2: Operations | Topics: Linux Data Hoarding | Domain: Linux
Linux Data Hoarding - Primer¶
Why This Matters¶
Data hoarding is the practice of collecting, organizing, and preserving large volumes of digital data — media libraries, archives, datasets, personal backups — on self-managed storage. Unlike enterprise SAN/NAS environments with six-figure budgets, data hoarding builds reliable multi-terabyte storage from commodity hardware and open-source software. The Linux ecosystem dominates this space because every critical tool — SnapRAID, mergerfs, rclone, borg, smartmontools — runs natively and composes through standard UNIX patterns (mount points, cron, pipes).
If you manage a homelab, run a media server, archive research data, or just refuse to trust a cloud provider with your only copy, this is your toolkit. The patterns here also translate directly to production: the same filesystem decisions, backup strategies, and integrity checks apply at any scale.
Core Concepts¶
1. The JBOD Philosophy¶
Traditional RAID stripes data across identical disks in lockstep. Data hoarding takes a different path: JBOD (Just a Bunch of Disks) — each drive is an independent filesystem, combined into a single namespace by a union filesystem (mergerfs) and protected by snapshot parity (SnapRAID).
| Traditional RAID | JBOD + mergerfs + SnapRAID |
|---|---|
| All disks same size | Mix any sizes (4TB + 8TB + 12TB) |
| One filesystem spans all disks | Each disk is standalone ext4/xfs |
| Lose the array = lose everything | Lose one disk = lose that disk's files |
| Rebuild = rewrite entire array | Fix = reconstruct only lost files |
| Can't read disks individually | Pull any disk → mount and read it |
Why this wins for media and archives:
- Incremental growth. Add one drive at a time — no rebuilding arrays.
- Mixed sizes. A 4TB drive from 2018 sits next to a 20TB drive from 2025.
- Individual disk readability. If your server dies, plug any data drive into another machine and read it directly.
- Lower risk. A failed drive only loses the files physically on that drive, and SnapRAID can reconstruct them from parity.
Mnemonic: "JBOD = Just Buy One Drive." You scale by buying one drive at a time, not matching sets. This is the opposite of traditional RAID, where all disks must be present and identical.
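The individual-disk-readability point can be demonstrated directly; a minimal sketch, assuming the pulled drive shows up as /dev/sdb1 on the rescue machine (device name and mount point are illustrative):

```shell
# A data drive pulled from a dead JBOD server is just a plain ext4/xfs filesystem.
# On any other Linux machine:
sudo mkdir -p /mnt/rescue
sudo mount -o ro /dev/sdb1 /mnt/rescue   # mount read-only while rescuing data
ls /mnt/rescue                           # that disk's files are directly readable
```

No array metadata, controller, or sibling disks are needed — this is the core failure-mode advantage over striped RAID.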
2. The "Perfect Media Server" Stack¶
The canonical data hoarding architecture, popularized by Alex Kretzschmar (ironicbadger) on perfectmediaserver.com, combines:
| Layer | Tool | Role |
|---|---|---|
| Union filesystem | mergerfs | Combines multiple drives into one mount point |
| Parity protection | SnapRAID | Snapshot parity — reconstructs files after disk failure |
| Individual drives | ext4 or XFS | Each data drive formatted independently |
| Cloud backup | rclone | Encrypted sync to Backblaze B2, S3, Google Drive |
| Local backup | borg / restic | Deduplicated, encrypted backups of critical data |
| Containers | Docker + compose | Plex, Jellyfin, Sonarr, Radarr, *arr stack |
| Monitoring | smartd + scripts | SMART health checks, disk usage alerts |
This stack is not theoretical — it powers thousands of home media servers and small-archive setups. The key insight is that each layer is independently replaceable. Swap mergerfs for a ZFS pool. Swap borg for restic. The architecture stays the same.
See also: The mergerfs topic covers union filesystem policies in depth. The homelab topic covers hardware selection and virtualization.
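As a sketch of how the union layer is wired up — paths and the specific mergerfs options here are common choices, not the only valid ones:

```shell
# Individual drives mounted at /mnt/disk1, /mnt/disk2, ..., pooled at /mnt/storage.
# Example /etc/fstab entry (glob pattern picks up every data drive):
#   /mnt/disk* /mnt/storage fuse.mergerfs cache.files=off,category.create=mfs,moveonenospc=true,minfreespace=20G 0 0
sudo mkdir -p /mnt/storage
sudo mount /mnt/storage     # mounts the pool as defined in fstab
df -h /mnt/storage          # reports the combined free space of all member drives
```

`category.create=mfs` (most free space) spreads new files across drives; `minfreespace` keeps each drive from filling completely.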
3. SnapRAID In Depth¶
SnapRAID is a snapshot parity tool created by Andrea Mazzoleni (first released 2011, GPLv3). Unlike real-time RAID, SnapRAID calculates parity on a schedule (typically daily via cron). This makes it ideal for write-once-read-many workloads like media libraries, where files rarely change after initial ingest.
Etymology: "Snap" = snapshot. Parity is calculated at a point in time, not continuously. This is the fundamental difference from mdraid or hardware RAID.
How Snapshot Parity Works¶
1. Files land on data drives throughout the day
2. Nightly cron runs `snapraid sync`
3. SnapRAID reads all data drives, computes parity blocks
4. Parity blocks are written to dedicated parity drive(s)
5. Between syncs, newly added files have NO parity protection
Timeline:
──────────────────────────────────────────────────
| files added | sync runs | files protected |
| (vulnerable) | (compute) | (parity valid) |
──────────────────────────────────────────────────
snapraid.conf Format¶
The configuration file (typically /etc/snapraid.conf) uses a simple line-based format:
# Parity file — stored on a dedicated drive
# Parity drive must be >= largest data drive
parity /mnt/parity1/snapraid.parity
# Optional: additional parity for multi-disk fault tolerance
# 2-parity through 6-parity supported (up to 6 simultaneous failures)
#2-parity /mnt/parity2/snapraid.2-parity
# Content files — the checksum database
# Store at least 2 copies on DIFFERENT drives
content /var/snapraid.content
content /mnt/disk1/snapraid.content
content /mnt/disk2/snapraid.content
# Data drives — order matters for parity calculation
# Changing order after initial sync corrupts parity
data d1 /mnt/disk1/
data d2 /mnt/disk2/
data d3 /mnt/disk3/
data d4 /mnt/disk4/
# Exclude patterns
exclude *.unrecoverable
exclude /tmp/
exclude /lost+found/
exclude *.part
Key rules:
- Parity drive must be at least as large as the largest data drive. If your biggest data drive is 12TB, parity must be >= 12TB.
- Content files are critical. Lose all copies and you lose the ability to recover. Store on multiple drives.
- Data drive order is sacred. Renaming d1/d2 or reordering after sync invalidates parity. Use stable labels.
- Split parity (v11.0+): A single parity level can span multiple smaller drives with comma-separated paths.
SnapRAID Operations¶
| Command | What It Does | When to Run |
|---|---|---|
| `snapraid sync` | Compute/update parity from current file state | Daily (cron) |
| `snapraid scrub` | Verify data integrity against stored checksums (~8% per run by default) | Weekly (cron) |
| `snapraid scrub -p 100` | Full scrub — verify everything | Monthly or after hardware concerns |
| `snapraid status` | Show array health, fragmentation, error count | On demand |
| `snapraid diff` | Show files changed since last sync | Before sync (safety check) |
| `snapraid fix -d d1` | Reconstruct all files on disk d1 from parity | After disk failure |
| `snapraid fix -f /path/to/file` | Reconstruct a specific file | After accidental deletion |
| `snapraid check` | Simulate recovery without writing (dry-run fix) | Periodic validation |
| `snapraid smart` | Show SMART health report with failure probability | On demand |
| `snapraid dup` | Find duplicate files by hash | Space reclamation |
| `snapraid touch` | Set sub-second timestamps for move detection | Before sync if files were moved |
Multiple Parity Levels¶
SnapRAID supports up to 6 parity levels — meaning it can survive up to 6 simultaneous disk failures. Each parity level requires a dedicated drive (or split across drives):
1-parity: Survives 1 disk failure (like RAID5)
2-parity: Survives 2 disk failures (like RAID6)
3-parity: Survives 3 disk failures (like RAID-Z3)
4–6-parity: For very large arrays (20+ disks)
Rule of thumb: For home use, 1 parity disk per 4 data disks. For irreplaceable data, use 2-parity.
snapraid-runner (Automation)¶
snapraid-runner is a Python wrapper (by Chronial) that automates SnapRAID with safety checks:
- Runs `snapraid diff` first
- Aborts if deletions exceed a threshold (protects against accidental mass deletion)
- Runs `snapraid sync`
- Optionally runs `snapraid scrub`
- Sends an email notification with results
This is the standard automation approach — bare snapraid sync in cron has no safety net against a misconfigured mount wiping your parity.
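A typical wiring, assuming snapraid-runner is cloned to /opt/snapraid-runner and configured at /etc/snapraid-runner.conf (both paths are assumptions):

```shell
# /etc/cron.d/snapraid — nightly sync via snapraid-runner, with its safety checks
# (delete-threshold abort, email report) instead of a bare `snapraid sync`
0 3 * * * root /usr/bin/python3 /opt/snapraid-runner/snapraid-runner.py -c /etc/snapraid-runner.conf >> /var/log/snapraid-runner.log 2>&1
```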
When SnapRAID Is Wrong¶
SnapRAID is not suitable for:

- Databases (high write churn — parity never catches up)
- Virtual machine images (constant writes)
- Anything requiring real-time redundancy (between syncs, new files are unprotected)
- Small-file workloads (parity overhead is per-block, not per-file)
4. Filesystem Choices for Data Drives¶
Each data drive in a JBOD array needs its own filesystem. The choice matters for performance, features, and failure modes.
Decision Matrix¶
| Filesystem | Best For | Strengths | Weaknesses |
|---|---|---|---|
| ext4 | Default choice, general data | Boring and reliable, mature fsck, universal tooling | No checksums, no snapshots, max 1 EiB volume |
| XFS | Large media files (4K video, ISOs) | Excellent large-file I/O, reflink copy (cp --reflink), scales to 8 EiB | Cannot shrink, historically fragile on power loss (improved with v5 format) |
| btrfs | Snapshots + checksums needed | CoW, snapshots, online defrag, checksums, send/receive, RAID1 | RAID5/6 write hole (data loss risk), complex repair tools |
| ZFS | Self-contained redundant pools | CoW, checksums, send/recv, ARC cache, RAID-Z levels, proven track record | Memory hungry (1GB per TB rule of thumb), can't easily add single drives, kernel module not in mainline |
Mnemonic: "ext4 = Toyota Corolla, XFS = pickup truck, btrfs = Tesla (exciting but recalls), ZFS = tank (indestructible but expensive)."
Practical advice for data hoarding:

- Default to ext4 for SnapRAID data drives. It just works, recovery tools are mature, and every Linux distro supports it.
- Use XFS if storing predominantly large files (video editing, ISO archives) — its allocator handles large sequential I/O better.
- Use btrfs only for RAID1 mirrors or single-drive use where you want snapshots. Never use btrfs RAID5/6 — the write hole bug remains unfixed as of 2025.
- Use ZFS when you want a fully self-contained solution (ZFS pools replace mergerfs+SnapRAID). But understand you are choosing a different paradigm — ZFS pools are not JBOD.
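The corresponding format commands, as a sketch — device names are placeholders, and every one of these destroys existing data on its target:

```shell
# ext4 with a label for stable mounting (the usual SnapRAID data-drive choice);
# -m 0 drops the root-reserved blocks, which a pure data drive doesn't need
mkfs.ext4 -L disk1 -m 0 /dev/sdX1
# XFS for large-file workloads
mkfs.xfs -L disk2 /dev/sdY1
# btrfs on a single drive (snapshots + checksums; no RAID5/6)
mkfs.btrfs -L disk3 /dev/sdZ1
```

Labels let /etc/fstab reference `/dev/disk/by-label/disk1` instead of unstable `/dev/sdX` names.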
See also: The mounts-filesystems topic covers mount options, fstab syntax, and VFS internals. The disk-and-storage-ops topic covers partitioning, LVM, and block device management.
5. Backup Tools¶
Parity (SnapRAID) is not backup. Parity protects against disk failure. Backup protects against accidental deletion, ransomware, fire, and theft. You need both.
borg (BorgBackup)¶
- Language: Python + C (Cython for performance)
- First release: 2015 (fork of Attic, which started 2010)
- License: BSD-3-Clause
- Deduplication: Content-defined chunking (variable-length blocks)
- Compression: lz4, zstd, zlib, lzma (configurable per-archive)
- Encryption: AES-256-CTR + HMAC-SHA256 (authenticated encryption)
- Key feature: Append-only mode for remote repos (ransomware protection)
# Initialize an encrypted repository
borg init --encryption=repokey /mnt/backup/borg-repo
# Create a backup with compression
borg create --compression zstd,3 /mnt/backup/borg-repo::daily-{now:%Y-%m-%d} /mnt/data
# Prune old backups (keep 7 daily, 4 weekly, 6 monthly)
borg prune --keep-daily=7 --keep-weekly=4 --keep-monthly=6 /mnt/backup/borg-repo
# Verify backup integrity
borg check /mnt/backup/borg-repo
# List archives
borg list /mnt/backup/borg-repo
restic¶
- Language: Go (single static binary)
- First release: 2015
- License: BSD-2-Clause
- Deduplication: Content-defined chunking (Rabin fingerprinting)
- Encryption: AES-256-CTR + Poly1305 (always on, can't be disabled)
- Key feature: Native multi-backend support — local, SFTP, S3, B2, Azure, GCS, rclone
# Initialize a repo on Backblaze B2
restic -r b2:mybucket:/backups init
# Backup with tags
restic -r b2:mybucket:/backups backup /mnt/data --tag media
# Forget + prune old snapshots
restic -r b2:mybucket:/backups forget --keep-daily 7 --keep-weekly 4 --prune
# Check integrity
restic -r b2:mybucket:/backups check
borg vs restic decision:

- Choose borg when: backing up to local disk or SFTP, you want maximum compression, or you need append-only repos
- Choose restic when: backing up to S3/B2/cloud, you want a single binary with no dependencies, or you want lock-free concurrent backups
rclone¶
- Language: Go
- Creator: Nick Craig-Wood (first release 2014)
- Backends: 70+ cloud storage providers (S3, B2, Google Drive, Dropbox, OneDrive, SFTP, FTP, and many more)
- Key feature: The `crypt` overlay — client-side encryption as a transparent layer on any backend
# Configure a remote (interactive)
rclone config
# Sync local directory to encrypted B2 bucket
rclone sync /mnt/data/important remote-crypt:backups/ --progress
# Check for differences without transferring
rclone check /mnt/data/important remote-crypt:backups/
# Mount cloud storage as local filesystem (FUSE)
rclone mount remote:media /mnt/cloud-media --vfs-cache-mode writes
rclone is not a backup tool — it is a sync/transfer tool. Use it as the transport layer for offsite copies. Pair with borg or restic for versioning and deduplication.
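One common way to pair them: borg provides the versioning and deduplication locally, rclone ships the repository offsite. A sketch — the remote name `b2-backup` and all paths are assumptions:

```shell
# 1. Back up into a local borg repository (versioned, deduplicated, encrypted)
borg create --compression zstd,3 /mnt/backup/borg-repo::daily-{now:%Y-%m-%d} /mnt/data/important
# 2. Mirror the entire repo to cloud storage
#    ("b2-backup" is an rclone remote previously set up via `rclone config`)
rclone sync /mnt/backup/borg-repo b2-backup:borg-repo --progress
```

Because borg chunks are encrypted at rest, the cloud copy is opaque to the provider even without rclone's `crypt` layer.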
rsync¶
The baseline. Every Linux system has it. No deduplication, no encryption (use SSH), no versioning (use --link-dest for poor-man's snapshots). Still the fastest way to copy large trees between local disks or over SSH.
# Mirror a directory (archive mode, delete removed files)
rsync -avh --delete /mnt/disk1/media/ /mnt/disk2/media-mirror/
# Remote sync over SSH
rsync -avhz /mnt/data/ user@offsite:/backup/data/
# Bandwidth-limited transfer
rsync -avh --bwlimit=10M /mnt/data/ /mnt/backup/
See also: The backup-restore topic covers the 3-2-1 rule, RPO/RTO, and enterprise backup strategy in detail.
6. Data Integrity¶
par2 (Parchive)¶
Created by Tobias Rieper and Stefan Wehlus (v1 spec, October 2001), then redesigned by Howard Fukada (v2 spec, January 2002). Uses Reed-Solomon error correction to create recovery blocks — originally designed for Usenet file transfers, now used for archive integrity.
# Create 10% redundancy for a directory of files
par2 create -r10 archive.par2 /mnt/archive/*.tar.gz
# Verify integrity
par2 verify archive.par2
# Repair corrupted files (up to 10% damage)
par2 repair archive.par2
Best used for: cold storage archives, files being transferred to untrusted media, long-term preservation.
SMART Monitoring¶
Self-Monitoring, Analysis, and Reporting Technology — built into every modern HDD and SSD. Monitored via smartctl (from smartmontools).
# Quick health check
smartctl -H /dev/sda
# Full attribute dump
smartctl -A /dev/sda
# Key attributes to watch:
# 5 Reallocated_Sector_Ct — bad sectors remapped (>0 = concern)
# 187 Reported_Uncorrect — uncorrectable errors
# 188 Command_Timeout — lost communication with controller
# 197 Current_Pending_Sector — unstable sectors awaiting reallocation
# 198 Offline_Uncorrectable — sectors that failed offline testing
# Enable automatic monitoring daemon
systemctl enable --now smartd
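A minimal /etc/smartd.conf sketch — the email address and test schedule are assumptions; see `man smartd.conf` for the full directive syntax:

```shell
# /etc/smartd.conf — monitor all drives, run self-tests on a schedule
# -a          : monitor all SMART attributes
# -o on -S on : enable automatic offline testing and attribute autosave
# -s (...)    : short self-test daily at 02:00, long self-test Saturdays at 03:00
# -m          : email alerts on failure
DEVICESCAN -a -o on -S on -s (S/../.././02|L/../../6/03) -m admin@example.com
```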
Critical reality check: Google's 2007 study of 100,000+ drives found that 36% of failed drives showed zero SMART warnings beforehand. SMART catches obvious degradation but misses sudden failures (head crashes, PCB failures, firmware bugs). Never rely on SMART alone — always have parity + backups.
Backblaze publishes quarterly drive stats (publicly available). Their 2024 data shows an overall AFR (Annualized Failure Rate) of 1.57% across ~290,000 drives, dropping to 1.36% in 2025. Key finding: failure rates vary dramatically by model and age.
Drive Burn-In¶
Before trusting a new drive with irreplaceable data, test it:
# 1. Run a SMART extended self-test (takes 8-24 hours on large drives)
smartctl -t long /dev/sdX
# 2. Check test result
smartctl -a /dev/sdX | grep -A2 "Self-test"
# 3. Write + read test with badblocks (destructive — erases all data)
# Use ONLY on new, empty drives
badblocks -wsv -b 4096 /dev/sdX
# 4. Check SMART again after burn-in
smartctl -A /dev/sdX
A burn-in catches infant mortality failures (drives that fail within the first few weeks). Drives that pass burn-in are statistically more reliable.
Scrubbing¶
Periodic integrity verification catches silent data corruption (bit rot):
| Tool | Command | What It Checks |
|---|---|---|
| SnapRAID | `snapraid scrub` | File checksums against stored hashes |
| btrfs | `btrfs scrub start /mnt/pool` | Block checksums (data and metadata) |
| ZFS | `zpool scrub tank` | Block checksums + automatic repair from mirrors/parity |
7. Duplicate Detection¶
When you hoard data long enough, duplicates accumulate. Three tools dominate on Linux:
| Tool | Language | Speed | Key Feature |
|---|---|---|---|
| jdupes | C | Fastest (7x faster than fdupes) | Hardlink/softlink/delete modes, hash-based |
| fdupes | C | Baseline | The original (Adrian Lopez, 1999), simpler interface |
| rdfind | C++ | Fast | Ranking-based dedup, O(N log N) |
# jdupes: find duplicates and hardlink them (saves space, no data loss)
jdupes -rL /mnt/data/
# jdupes: report only, don't change anything
jdupes -r /mnt/data/
# fdupes: interactive deletion of duplicates
fdupes -r /mnt/data/
# rdfind: find duplicates and replace with hardlinks
rdfind -makehardlinks true /mnt/data/
Gotcha: jdupes matches only 100% identical files (byte-for-byte). It is not a fuzzy/similarity matcher. It is also NOT a drop-in replacement for fdupes — option flags differ.
8. Drive Management¶
Key Commands¶
| Command | Purpose | Example |
|---|---|---|
| `lsblk` | List block devices with hierarchy | `lsblk -o NAME,SIZE,FSTYPE,MOUNTPOINT,SERIAL` |
| `blkid` | Show filesystem UUIDs and types | `blkid /dev/sda1` |
| `smartctl` | SMART health and attributes | `smartctl -a /dev/sda` |
| `hdparm` | Get/set drive parameters, benchmarks | `hdparm -Tt /dev/sda` (read speed test) |
| `hd-idle` | Spin down idle drives | `hd-idle -i 600 /dev/sdb` (10-min timeout) |
| `lsscsi` | List SCSI/SATA devices | `lsscsi --size` |
UDEV Rules for Consistent Naming¶
Drive letters (/dev/sda, /dev/sdb) can change between reboots. Use UDEV rules or /dev/disk/by-id/ symlinks for stable references:
# Use /dev/disk/by-id/ in fstab (includes drive serial number)
ls -la /dev/disk/by-id/
# Example fstab entry using disk ID
/dev/disk/by-id/ata-WDC_WD120EMFZ-11A6JA0_SERIAL-part1 /mnt/disk1 ext4 defaults,noatime 0 2
# Custom UDEV rule to create friendly names
# /etc/udev/rules.d/99-data-drives.rules
# SUBSYSTEM=="block", ENV{ID_SERIAL_SHORT}=="SERIAL123", SYMLINK+="data/disk1"
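After adding a custom rule, reload udev so the symlink appears without a reboot; a sketch (the `/dev/data/` path matches the example rule above):

```shell
# Reload udev rules and re-trigger block device events
sudo udevadm control --reload-rules
sudo udevadm trigger --subsystem-match=block
# Verify the friendly symlink was created
ls -la /dev/data/
```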
Bay-based naming: In a drive enclosure, label drives by physical bay position. If bay 3 fails, you know exactly which drive to pull — no guessing at serial numbers.
Drive Spindown¶
For drives that are accessed infrequently (archive/cold storage), spinning them down saves power and reduces wear:
# hdparm: set standby timeout (value * 5 seconds, 0=disable)
# 242 = 1 hour
hdparm -S 242 /dev/sdb
# hd-idle: more reliable for USB and some SATA controllers
# -i 600 = 600 seconds idle before spindown
hd-idle -i 600 -a /dev/sdb
9. Media Tools (Brief Pointers)¶
Data hoarding often serves a media library. These tools sit on top of the storage stack:
| Tool | Purpose | Note |
|---|---|---|
| Plex | Media server with transcoding | Closed-source, freemium, proprietary metadata |
| Jellyfin | Open-source media server (fork of Emby) | GPLv2, no tracking, community-driven |
| Emby | Media server (Jellyfin forked from this) | Partially closed-source since 2018 |
| Sonarr | TV show management + download automation | Monitors RSS, renames, organizes |
| Radarr | Movie management (Sonarr fork) | Same pattern, movies instead of TV |
| Lidarr | Music management | Same architecture |
| yt-dlp | Video archival from YouTube and 1000+ sites | Fork of youtube-dl, actively maintained |
All run as Docker containers pointed at your mergerfs mount. The *arr stack handles media lifecycle; the storage stack handles durability.
Quick Reference¶
Build a data hoarding stack:
1. Format data drives: mkfs.ext4 -L disk1 /dev/sdX
2. Mount individually: /mnt/disk1, /mnt/disk2, /mnt/disk3, ...
3. Pool with mergerfs: /mnt/disk* → /mnt/storage (union mount)
4. Protect with SnapRAID: /mnt/parity1 (parity drive >= largest data)
5. Backup with borg/restic: Critical data → local backup + cloud
6. Sync with rclone: Encrypted offsite copy to B2/S3
7. Monitor with smartd: SMART alerts + weekly scrub
8. Automate with cron: snapraid-runner daily, scrub weekly
Key files:
/etc/snapraid.conf — parity configuration
/etc/fstab — drive mounts (use by-id or UUID)
/etc/smartd.conf — SMART monitoring daemon
/etc/cron.d/snapraid — automation schedule
Wiki Navigation¶
Prerequisites¶
- Linux Ops (Topic Pack, L0)
Related Content¶
- Linux Data Hoarding Flashcards (CLI) (flashcard_deck, L1) — Linux Data Hoarding