Linux Data Hoarding - Primer

Why This Matters

Data hoarding is the practice of collecting, organizing, and preserving large volumes of digital data — media libraries, archives, datasets, personal backups — on self-managed storage. Unlike enterprise SAN/NAS environments with six-figure budgets, data hoarding builds reliable multi-terabyte storage from commodity hardware and open-source software. The Linux ecosystem dominates this space because every critical tool — SnapRAID, mergerfs, rclone, borg, smartmontools — runs natively and composes through standard UNIX patterns (mount points, cron, pipes).

If you manage a homelab, run a media server, archive research data, or just refuse to trust a cloud provider with your only copy, this is your toolkit. The patterns here also translate directly to production: the same filesystem decisions, backup strategies, and integrity checks apply at any scale.

Core Concepts

1. The JBOD Philosophy

Traditional RAID stripes data across identical disks in lockstep. Data hoarding takes a different path: JBOD (Just a Bunch of Disks) — each drive is an independent filesystem, combined into a single namespace by a union filesystem (mergerfs) and protected by snapshot parity (SnapRAID).

Traditional RAID                  JBOD + mergerfs + SnapRAID
──────────────                    ──────────────────────────
All disks same size               Mix any sizes (4TB + 8TB + 12TB)
One filesystem spans all disks    Each disk is standalone ext4/xfs
Lose the array = lose everything  Lose one disk = lose that disk's files
Rebuild = rewrite entire array    Fix = reconstruct only lost files
Can't read disks individually     Pull any disk, mount it, and read it

Why this wins for media and archives:

  • Incremental growth. Add one drive at a time — no rebuilding arrays.
  • Mixed sizes. A 4TB drive from 2018 sits next to a 20TB drive from 2025.
  • Individual disk readability. If your server dies, plug any data drive into another machine and read it directly.
  • Lower risk. A failed drive only loses the files physically on that drive, and SnapRAID can reconstruct them from parity.

Mnemonic: "JBOD = Just Buy One Drive." You scale by buying one drive at a time, not matching sets. This is the opposite of traditional RAID, where all disks must be present and identical.

2. The "Perfect Media Server" Stack

The canonical data hoarding architecture, popularized by Alex Kretzschmar (ironicbadger) on perfectmediaserver.com, combines:

Layer               Tool               Role
─────               ────               ────
Union filesystem    mergerfs           Combines multiple drives into one mount point
Parity protection   SnapRAID           Snapshot parity — reconstructs files after disk failure
Individual drives   ext4 or XFS        Each data drive formatted independently
Cloud backup        rclone             Encrypted sync to Backblaze B2, S3, Google Drive
Local backup        borg / restic      Deduplicated, encrypted backups of critical data
Containers          Docker + compose   Plex, Jellyfin, Sonarr, Radarr, *arr stack
Monitoring          smartd + scripts   SMART health checks, disk usage alerts

This stack is not theoretical — it powers thousands of home media servers and small-archive setups. The key insight is that each layer is independently replaceable. Swap mergerfs for a ZFS pool. Swap borg for restic. The architecture stays the same.
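The union layer of that table can be a single fstab line. The sketch below is illustrative: the mount points, the minfreespace value, and the create policy are choices, not requirements.

```
# /etc/fstab — pool every /mnt/disk* branch into one /mnt/storage namespace
/mnt/disk*  /mnt/storage  fuse.mergerfs  defaults,allow_other,category.create=mfs,minfreespace=50G,fsname=mergerfs  0  0
```

category.create=mfs writes each new file to the branch with the most free space, which naturally spreads data across mixed-size drives.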

See also: The mergerfs topic covers union filesystem policies in depth. The homelab topic covers hardware selection and virtualization.

3. SnapRAID In Depth

SnapRAID is a snapshot parity tool created by Andrea Mazzoleni (first released 2011, GPLv3). Unlike real-time RAID, SnapRAID calculates parity on a schedule (typically daily via cron). This makes it ideal for write-once-read-many workloads like media libraries, where files rarely change after initial ingest.

Etymology: "Snap" = snapshot. Parity is calculated at a point in time, not continuously. This is the fundamental difference from mdraid or hardware RAID.

How Snapshot Parity Works

1. Files land on data drives throughout the day
2. Nightly cron runs `snapraid sync`
3. SnapRAID reads all data drives, computes parity blocks
4. Parity blocks are written to dedicated parity drive(s)
5. Between syncs, newly added files have NO parity protection

Timeline:
  ──────────────────────────────────────────────────
  |  files added  |  sync runs  |  files protected  |
  |  (vulnerable) |  (compute)  |  (parity valid)   |
  ──────────────────────────────────────────────────

snapraid.conf Format

The configuration file (typically /etc/snapraid.conf) uses a simple line-based format:

# Parity file — stored on a dedicated drive
# Parity drive must be >= largest data drive
parity /mnt/parity1/snapraid.parity

# Optional: additional parity for multi-disk fault tolerance
# 2-parity through 6-parity supported (up to 6 simultaneous failures)
#2-parity /mnt/parity2/snapraid.2-parity

# Content files — the checksum database
# Store at least 2 copies on DIFFERENT drives
content /var/snapraid.content
content /mnt/disk1/snapraid.content
content /mnt/disk2/snapraid.content

# Data drives — order matters for parity calculation
# Changing order after initial sync corrupts parity
data d1 /mnt/disk1/
data d2 /mnt/disk2/
data d3 /mnt/disk3/
data d4 /mnt/disk4/

# Exclude patterns
exclude *.unrecoverable
exclude /tmp/
exclude /lost+found/
exclude *.part

Key rules:

  • Parity drive size. The parity drive must be at least as large as the largest data drive. If your biggest data drive is 12TB, parity must be >= 12TB.
  • Content files are critical. Lose all copies and you lose the ability to recover. Store at least two copies on different drives.
  • Data drive order is sacred. Renaming d1/d2 or reordering after sync invalidates parity. Use stable labels.
  • Split parity (v11.0+). A single parity level can span multiple smaller drives with comma-separated paths.

SnapRAID Operations

Command                          What It Does                                                               When to Run
───────                          ────────────                                                               ───────────
snapraid sync                    Compute/update parity from current file state                              Daily (cron)
snapraid scrub                   Verify data integrity against stored checksums (~8% per run by default)    Weekly (cron)
snapraid scrub -p 100            Full scrub — verify everything                                             Monthly or after hardware concerns
snapraid status                  Show array health, fragmentation, error count                              On demand
snapraid diff                    Show files changed since last sync                                         Before sync (safety check)
snapraid fix -d d1               Reconstruct all files on disk d1 from parity                               After disk failure
snapraid fix -f /path/to/file    Reconstruct a specific file                                                After accidental deletion
snapraid check                   Simulate recovery without writing (dry-run fix)                            Periodic validation
snapraid smart                   Show SMART health report with failure probability                          On demand
snapraid dup                     Find duplicate files by hash                                               Space reclamation
snapraid touch                   Set sub-second timestamps for move detection                               Before sync if files were moved

Multiple Parity Levels

SnapRAID supports up to 6 parity levels — meaning it can survive up to 6 simultaneous disk failures. Each parity level requires a dedicated drive (or split across drives):

1-parity:  Survives 1 disk failure  (like RAID5)
2-parity:  Survives 2 disk failures (like RAID6)
3-parity:  Survives 3 disk failures (like RAID-Z3)
4–6-parity: For very large arrays (20+ disks)

Rule of thumb: For home use, 1 parity disk per 4 data disks. For irreplaceable data, use 2-parity.

snapraid-runner (Automation)

snapraid-runner is a Python wrapper (by Chronial) that automates SnapRAID with safety checks:

  1. Runs snapraid diff first
  2. Aborts if deletions exceed a threshold (protects against accidental mass deletion)
  3. Runs snapraid sync
  4. Optionally runs snapraid scrub
  5. Sends email notification with results

This is the standard automation approach — bare snapraid sync in cron has no safety net against a misconfigured mount wiping your parity.
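The core of that safety net is small enough to sketch in shell. The 50-deletion threshold below is an assumption, and the real snapraid-runner adds logging, scrubbing, and email on top:

```shell
#!/bin/sh
# Sketch of the snapraid-runner safety pattern.
# DEL_THRESHOLD is an assumed value; tune it to your array's normal churn.
DEL_THRESHOLD=50

# `snapraid diff` prints one "remove <path>" line per deleted file.
count_deletions() {
    grep -c '^remove ' "$1" || true
}

run_guarded_sync() {
    diff_log=$(mktemp)
    snapraid diff > "$diff_log" 2>&1
    deleted=$(count_deletions "$diff_log")
    rm -f "$diff_log"
    if [ "$deleted" -gt "$DEL_THRESHOLD" ]; then
        echo "ABORT: $deleted deletions exceed threshold of $DEL_THRESHOLD" >&2
        return 1
    fi
    snapraid sync
}
```

Run run_guarded_sync from cron instead of a bare snapraid sync; an unmounted data drive then shows up as a mass deletion and aborts the sync instead of rewriting parity.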

When SnapRAID Is Wrong

SnapRAID is not suitable for:

  • Databases (high write churn — parity never catches up)
  • Virtual machine images (constant writes)
  • Anything requiring real-time redundancy (between syncs, new files are unprotected)
  • Small file workloads (parity overhead is per-block, not per-file)

4. Filesystem Choices for Data Drives

Each data drive in a JBOD array needs its own filesystem. The choice matters for performance, features, and failure modes.

Decision Matrix

Filesystem   Best For                             Strengths                                                           Weaknesses
──────────   ────────                             ─────────                                                           ──────────
ext4         Default choice, general data         Boring and reliable, mature fsck, universal tooling                 No checksums, no snapshots, max 1 EiB volume
XFS          Large media files (4K video, ISOs)   Excellent large-file I/O, reflink copy (cp --reflink), 8 EiB max    Cannot shrink; historically fragile on power loss (improved with v5 format)
btrfs        Snapshots + checksums needed         CoW, snapshots, online defrag, checksums, send/receive, RAID1       RAID5/6 write hole (data loss risk), complex repair tools
ZFS          Self-contained redundant pools       CoW, checksums, send/recv, ARC cache, RAID-Z, proven track record   Memory hungry (1GB per TB rule of thumb), can't easily add single drives, kernel module not in mainline

Mnemonic: "ext4 = Toyota Corolla, XFS = pickup truck, btrfs = Tesla (exciting but recalls), ZFS = tank (indestructible but expensive)."

Practical advice for data hoarding:

  • Default to ext4 for SnapRAID data drives. It just works, recovery tools are mature, and every Linux distro supports it.
  • Use XFS if storing predominantly large files (video editing, ISO archives) — its allocator handles large sequential I/O better.
  • Use btrfs only for RAID1 mirrors or single-drive use where you want snapshots. Never use btrfs RAID5/6 — the write hole bug remains unfixed as of 2025.
  • Use ZFS when you want a fully self-contained solution (ZFS pools replace mergerfs+SnapRAID). But understand you are choosing a different paradigm — ZFS pools are not JBOD.

See also: The mounts-filesystems topic covers mount options, fstab syntax, and VFS internals. The disk-and-storage-ops topic covers partitioning, LVM, and block device management.

5. Backup Tools

Parity (SnapRAID) is not backup. Parity protects against disk failure. Backup protects against accidental deletion, ransomware, fire, and theft. You need both.

borg (BorgBackup)

  • Language: Python + C (Cython for performance)
  • First release: 2015 (fork of Attic, which started 2010)
  • License: BSD-3-Clause
  • Deduplication: Content-defined chunking (variable-length blocks)
  • Compression: lz4, zstd, zlib, lzma (configurable per-archive)
  • Encryption: AES-256-CTR + HMAC-SHA256 (authenticated encryption)
  • Key feature: Append-only mode for remote repos (ransomware protection)
# Initialize an encrypted repository
borg init --encryption=repokey /mnt/backup/borg-repo

# Create a backup with compression
borg create --compression zstd,3 /mnt/backup/borg-repo::daily-{now:%Y-%m-%d} /mnt/data

# Prune old backups (keep 7 daily, 4 weekly, 6 monthly)
borg prune --keep-daily=7 --keep-weekly=4 --keep-monthly=6 /mnt/backup/borg-repo

# Verify backup integrity
borg check /mnt/backup/borg-repo

# List archives
borg list /mnt/backup/borg-repo

restic

  • Language: Go (single static binary)
  • First release: 2015
  • License: BSD-2-Clause
  • Deduplication: Content-defined chunking (Rabin fingerprinting)
  • Encryption: AES-256-CTR + Poly1305 (always on, can't be disabled)
  • Key feature: Native multi-backend support — local, SFTP, S3, B2, Azure, GCS, rclone
# Initialize a repo on Backblaze B2
restic -r b2:mybucket:/backups init

# Backup with tags
restic -r b2:mybucket:/backups backup /mnt/data --tag media

# Forget + prune old snapshots
restic -r b2:mybucket:/backups forget --keep-daily 7 --keep-weekly 4 --prune

# Check integrity
restic -r b2:mybucket:/backups check

borg vs restic decision:

  • Choose borg when: backing up to local disk or SFTP, you want maximum compression, or you need append-only repos.
  • Choose restic when: backing up to S3/B2/cloud, you want a single binary with no dependencies, or you want lock-free concurrent backups.

rclone

  • Language: Go
  • Creator: Nick Craig-Wood (first release 2014)
  • Backends: 70+ cloud storage providers (S3, B2, Google Drive, Dropbox, OneDrive, SFTP, FTP, and many more)
  • Key feature: The crypt overlay — client-side encryption as a transparent layer on any backend
# Configure a remote (interactive)
rclone config

# Sync local directory to encrypted B2 bucket
rclone sync /mnt/data/important remote-crypt:backups/ --progress

# Check for differences without transferring
rclone check /mnt/data/important remote-crypt:backups/

# Mount cloud storage as local filesystem (FUSE)
rclone mount remote:media /mnt/cloud-media --vfs-cache-mode writes

rclone is not a backup tool — it is a sync/transfer tool. Use it as the transport layer for offsite copies. Pair with borg or restic for versioning and deduplication.

rsync

The baseline. Every Linux system has it. No deduplication, no encryption (use SSH), no versioning (use --link-dest for poor-man's snapshots). Still the fastest way to copy large trees between local disks or over SSH.

# Mirror a directory (archive mode, delete removed files)
rsync -avh --delete /mnt/disk1/media/ /mnt/disk2/media-mirror/

# Remote sync over SSH
rsync -avhz /mnt/data/ user@offsite:/backup/data/

# Bandwidth-limited transfer
rsync -avh --bwlimit=10M /mnt/data/ /mnt/backup/

See also: The backup-restore topic covers the 3-2-1 rule, RPO/RTO, and enterprise backup strategy in detail.

6. Data Integrity

par2 (Parchive)

Created by Tobias Rieper and Stefan Wehlus (v1 spec, October 2001); the v2 spec (January 2002) was written by Michael Nahas following a proposal by Howard Fukada. Uses Reed-Solomon error correction to create recovery blocks — originally designed for Usenet file transfers, now used for archive integrity.

# Create 10% redundancy for a directory of files
par2 create -r10 archive.par2 /mnt/archive/*.tar.gz

# Verify integrity
par2 verify archive.par2

# Repair corrupted files (up to 10% damage)
par2 repair archive.par2

Best used for: cold storage archives, files being transferred to untrusted media, long-term preservation.

SMART Monitoring

Self-Monitoring, Analysis, and Reporting Technology — built into every modern HDD and SSD. Monitored via smartctl (from smartmontools).

# Quick health check
smartctl -H /dev/sda

# Full attribute dump
smartctl -A /dev/sda

# Key attributes to watch:
#   5   Reallocated_Sector_Ct   — bad sectors remapped (>0 = concern)
#   187 Reported_Uncorrect      — uncorrectable errors
#   188 Command_Timeout         — lost communication with controller
#   197 Current_Pending_Sector  — unstable sectors awaiting reallocation
#   198 Offline_Uncorrectable   — sectors that failed offline testing

# Enable automatic monitoring daemon
systemctl enable --now smartd

Critical reality check: Google's 2007 study of 100,000+ drives found that 36% of failed drives showed zero SMART warnings beforehand. SMART catches obvious degradation but misses sudden failures (head crashes, PCB failures, firmware bugs). Never rely on SMART alone — always have parity + backups.

Backblaze publishes quarterly drive stats (publicly available). Their 2024 data shows an overall AFR (Annualized Failure Rate) of 1.57% across ~290,000 drives, dropping to 1.36% in 2025. Key finding: failure rates vary dramatically by model and age.
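The attribute watch-list above is easy to script. Below is a sketch that scans saved smartctl -A output and warns on nonzero raw values; the attribute set mirrors the comment block above, and the zero threshold is a deliberately paranoid assumption.

```shell
#!/bin/sh
# Flag critical SMART attributes with nonzero raw values.
# Input: a file containing `smartctl -A /dev/sdX` output.
# In the standard attribute table, field 2 is the attribute name and
# field 10 is the raw value.
check_smart_attrs() {
    awk '$2 ~ /^(Reallocated_Sector_Ct|Reported_Uncorrect|Command_Timeout|Current_Pending_Sector|Offline_Uncorrectable)$/ &&
         $10 + 0 > 0 { print "WARN:", $2, "raw =", $10; bad = 1 }
         END { exit bad }' "$1"
}
```

Exit status is nonzero when anything is flagged, so the function slots directly into a cron job or alerting script, e.g. `smartctl -A /dev/sda > /tmp/sda.txt && check_smart_attrs /tmp/sda.txt || mail ...`.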

Drive Burn-In

Before trusting a new drive with irreplaceable data, test it:

# 1. Run a SMART extended self-test (takes 8-24 hours on large drives)
smartctl -t long /dev/sdX

# 2. Check test result
smartctl -a /dev/sdX | grep -A2 "Self-test"

# 3. Write + read test with badblocks (destructive — erases all data)
#    Use ONLY on new, empty drives
badblocks -wsv -b 4096 /dev/sdX

# 4. Check SMART again after burn-in
smartctl -A /dev/sdX

A burn-in catches infant mortality failures (drives that fail within the first few weeks). Drives that pass burn-in are statistically more reliable.

Scrubbing

Periodic integrity verification catches silent data corruption (bit rot):

Tool       Command                        What It Checks
────       ───────                        ──────────────
SnapRAID   snapraid scrub                 File checksums against stored hashes
btrfs      btrfs scrub start /mnt/pool    Block checksums (data and metadata)
ZFS        zpool scrub tank               Block checksums + automatic repair from mirrors/parity

7. Duplicate Detection

When you hoard data long enough, duplicates accumulate. Three tools dominate on Linux:

Tool     Language   Speed                             Key Feature
────     ────────   ─────                             ───────────
jdupes   C          Fastest (7x faster than fdupes)   Hardlink/softlink/delete modes, hash-based
fdupes   C          Baseline                          The original (Adrian Lopez, 1999), simpler interface
rdfind   C++        Fast                              Ranking-based dedup, O(N log N)
# jdupes: find duplicates and hardlink them (saves space, no data loss)
jdupes -rL /mnt/data/

# jdupes: report only, don't change anything
jdupes -r /mnt/data/

# fdupes: interactive deletion of duplicates
fdupes -r /mnt/data/

# rdfind: find duplicates and replace with hardlinks
rdfind -makehardlinks true /mnt/data/

Gotcha: jdupes matches only 100% identical files (byte-for-byte). It is not a fuzzy/similarity matcher. It is also NOT a drop-in replacement for fdupes — option flags differ.

8. Drive Management

Key Commands

Command    Purpose                                Example
───────    ───────                                ───────
lsblk      List block devices with hierarchy      lsblk -o NAME,SIZE,FSTYPE,MOUNTPOINT,SERIAL
blkid      Show filesystem UUIDs and types        blkid /dev/sda1
smartctl   SMART health and attributes            smartctl -a /dev/sda
hdparm     Get/set drive parameters, benchmarks   hdparm -Tt /dev/sda (read speed test)
hd-idle    Spin down idle drives                  hd-idle -i 600 -a /dev/sdb (10-min timeout)
lsscsi     List SCSI/SATA devices                 lsscsi --size

UDEV Rules for Consistent Naming

Drive letters (/dev/sda, /dev/sdb) can change between reboots. Use UDEV rules or /dev/disk/by-id/ symlinks for stable references:

# Use /dev/disk/by-id/ in fstab (includes drive serial number)
ls -la /dev/disk/by-id/

# Example fstab entry using disk ID
/dev/disk/by-id/ata-WDC_WD120EMFZ-11A6JA0_SERIAL-part1  /mnt/disk1  ext4  defaults,noatime  0  2

# Custom UDEV rule to create friendly names
# /etc/udev/rules.d/99-data-drives.rules
# SUBSYSTEM=="block", ENV{ID_SERIAL_SHORT}=="SERIAL123", SYMLINK+="data/disk1"

Bay-based naming: In a drive enclosure, label drives by physical bay position. If bay 3 fails, you know exactly which drive to pull — no guessing at serial numbers.

Drive Spindown

For drives that are accessed infrequently (archive/cold storage), spinning them down saves power and reduces wear:

# hdparm: set standby timeout
#   values 1-240  = multiples of 5 seconds (up to 20 min)
#   values 241-251 = (value - 240) * 30 minutes; 0 = disable
# 242 = (242 - 240) * 30 min = 1 hour
hdparm -S 242 /dev/sdb

# hd-idle: more reliable for USB and some SATA controllers
# -i 600 = 600 seconds idle before spindown
hd-idle -i 600 -a /dev/sdb
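The -S encoding trips people up, so a small helper that converts minutes into an hdparm -S value can be handy. This sketch follows the encoding described in hdparm(8); values in the upper range round down to the nearest 30 minutes.

```shell
#!/bin/sh
# Convert an idle timeout in minutes to an `hdparm -S` argument.
# 1-240 encode multiples of 5 seconds (up to 20 minutes);
# 241-251 encode (value - 240) * 30 minutes (up to 5.5 hours).
spindown_value() {
    mins=$1
    if [ "$mins" -le 20 ]; then
        echo $(( mins * 60 / 5 ))      # 5-second units
    else
        echo $(( 240 + mins / 30 ))    # 30-minute units, rounds down
    fi
}
```

For example, spindown_value 60 yields 242, matching the one-hour example above.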

9. Media Tools (Brief Pointers)

Data hoarding often serves a media library. These tools sit on top of the storage stack:

Tool       Purpose                                       Note
────       ───────                                       ────
Plex       Media server with transcoding                 Closed-source, freemium, proprietary metadata
Jellyfin   Open-source media server (fork of Emby)       GPLv2, no tracking, community-driven
Emby       Media server (Jellyfin forked from this)      Partially closed-source since 2018
Sonarr     TV show management + download automation      Monitors RSS, renames, organizes
Radarr     Movie management (Sonarr fork)                Same pattern, movies instead of TV
Lidarr     Music management                              Same architecture
yt-dlp     Video archival from YouTube and 1000+ sites   Fork of youtube-dl, actively maintained

All run as Docker containers pointed at your mergerfs mount. The *arr stack handles media lifecycle; the storage stack handles durability.
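To make "containers pointed at your mergerfs mount" concrete, a minimal compose file for Jellyfin might look like the following sketch. The host paths are illustrative; the image name and port 8096 are Jellyfin's defaults.

```
# docker-compose.yml — minimal Jellyfin sketch
services:
  jellyfin:
    image: jellyfin/jellyfin
    ports:
      - "8096:8096"                      # Jellyfin's default web UI port
    volumes:
      - ./config:/config                 # server config + metadata
      - /mnt/storage/media:/media:ro     # the mergerfs pool, read-only
    restart: unless-stopped
```

Mounting the pool read-only keeps the media server from ever modifying the library; the *arr stack gets a separate read-write mount.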

Quick Reference

Build a data hoarding stack:
1. Format data drives:       mkfs.ext4 -L disk1 /dev/sdX
2. Mount individually:       /mnt/disk1, /mnt/disk2, /mnt/disk3, ...
3. Pool with mergerfs:       /mnt/disk*  /mnt/storage (union mount)
4. Protect with SnapRAID:    /mnt/parity1 (parity drive >= largest data)
5. Backup with borg/restic:  Critical data  local backup + cloud
6. Sync with rclone:         Encrypted offsite copy to B2/S3
7. Monitor with smartd:      SMART alerts + weekly scrub
8. Automate with cron:       snapraid-runner daily, scrub weekly

Key files:
  /etc/snapraid.conf           parity configuration
  /etc/fstab                   drive mounts (use by-id or UUID)
  /etc/smartd.conf             SMART monitoring daemon
  /etc/cron.d/snapraid         automation schedule

Wiki Navigation

Prerequisites

  • Linux Data Hoarding Flashcards (CLI)