Linux Data Hoarding - Street-Level Ops¶
Real-world patterns and debugging techniques for building and maintaining a Linux data hoarding stack in production.
Quick Diagnosis Commands¶
# 1. Check overall pool health (mergerfs + individual drives)
df -h /mnt/disk* /mnt/storage
# 2. SnapRAID array status
snapraid status
# 3. SMART health across all drives
for d in /dev/sd?; do echo "=== $d ==="; smartctl -H "$d"; done
# 4. Check for drives running hot
for d in /dev/sd?; do echo "$d: $(smartctl -A "$d" | grep Temperature_Celsius | awk '{print $10}')C"; done
# 5. Find largest files consuming space
find /mnt/storage -type f -printf '%s %p\n' | sort -rn | head -20
Common Scenarios¶
Scenario 1: New Drive Setup Workflow¶
A new 16TB drive arrives. Here is the full workflow from unboxing to production.
# Step 1: Identify the new drive
lsblk -o NAME,SIZE,MODEL,SERIAL
# Note: new drive appears as /dev/sdX (e.g., /dev/sde)
# Step 2: SMART check — reject DOA drives immediately
smartctl -H /dev/sde
smartctl -a /dev/sde | head -30
# Step 3: Burn-in test (a full badblocks pass on a 16TB drive can take several days)
# DESTRUCTIVE: only run on new, empty drives
badblocks -wsv -b 4096 /dev/sde # Write+verify every sector (4 patterns)
smartctl -t long /dev/sde # SMART extended test; run after badblocks, not alongside it
# Step 4: Check SMART after burn-in
smartctl -a /dev/sde | grep -E "Reallocated|Pending|Uncorrect"
# If any non-zero: RMA the drive
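The "any non-zero" check above can be scripted so a bad drive fails loudly. A small sketch (the function name is ours, not a smartctl feature) that reads `smartctl -A` output on stdin:

```shell
# Fail if Reallocated/Pending/Uncorrectable raw values are non-zero.
# Column 10 is the raw value in smartctl -A output.
check_smart_attrs() {
    awk '/Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable/ {
        if ($10 + 0 > 0) { bad = 1; print "FAIL: " $2 " = " $10 }
    } END { exit bad }'
}
# Usage: smartctl -A /dev/sde | check_smart_attrs || echo "RMA candidate"
```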
# Step 5: Partition (optional; many data hoarders format the whole disk)
# Whole-disk formatting is simpler; a GPT partition makes the disk self-describing to other tools
# If you want a partition table:
parted /dev/sde mklabel gpt
parted /dev/sde mkpart primary ext4 0% 100%
# Step 6: Format with a label
mkfs.ext4 -L disk5 -m 1 /dev/sde # or /dev/sde1 if partitioned; -m 1 trims the 5% root reserve
# For large files (media), consider XFS:
# mkfs.xfs -L disk5 /dev/sde
# Step 7: Get the drive's stable identifier
ls -la /dev/disk/by-id/ | grep "$(smartctl -i /dev/sde | grep Serial | awk '{print $NF}')"
# Step 8: Create mount point and add to fstab
mkdir -p /mnt/disk5
# Use by-id for stable naming:
echo '/dev/disk/by-id/ata-WDC_WD161KFGX-68AFPN0_SERIALHERE /mnt/disk5 ext4 defaults,noatime,nofail 0 2' >> /etc/fstab
mount /mnt/disk5
# Step 9: Add to mergerfs pool (if using fstab, edit the mergerfs line)
# Before: /mnt/disk1:/mnt/disk2:/mnt/disk3:/mnt/disk4
# After: /mnt/disk1:/mnt/disk2:/mnt/disk3:/mnt/disk4:/mnt/disk5
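For reference, a complete mergerfs line in fstab (option values here are illustrative, not gospel; tune policies to your workload):

```
/mnt/disk1:/mnt/disk2:/mnt/disk3:/mnt/disk4:/mnt/disk5 /mnt/storage fuse.mergerfs cache.files=off,dropcacheonclose=true,category.create=mfs,moveonenospc=true,minfreespace=50G,fsname=mergerfs 0 0
```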
# Step 10: Add to snapraid.conf
echo 'data d5 /mnt/disk5/' >> /etc/snapraid.conf
# Add a content file on the new drive
echo 'content /mnt/disk5/snapraid.content' >> /etc/snapraid.conf
# Step 11: Run snapraid sync to incorporate new drive
snapraid sync
# Step 12: Verify
snapraid status
df -h /mnt/disk5 /mnt/storage
Total time: several days (mostly unattended burn-in), ~15 minutes of active work.
Scenario 2: snapraid.conf for a Typical 4+1 Array¶
A complete, production-ready configuration:
# /etc/snapraid.conf
# 4 data drives + 1 parity drive
# Parity drive MUST be >= largest data drive (16TB here)
# Parity — dedicated 16TB drive
parity /mnt/parity1/snapraid.parity
# Content files — at least 2 copies on different drives
content /var/snapraid.content
content /mnt/disk1/snapraid.content
content /mnt/disk3/snapraid.content
# Data drives — order is SACRED, never reorder after first sync
data d1 /mnt/disk1/
data d2 /mnt/disk2/
data d3 /mnt/disk3/
data d4 /mnt/disk4/
# Exclude patterns
exclude *.unrecoverable
exclude /tmp/
exclude /lost+found/
exclude *.!sync
exclude *.part
exclude /downloads/incomplete/
exclude .Trash-*/
exclude *.nfo
# Optional: exclude large transient files
exclude /torrents/
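The size rule in the comment above (parity must be at least as large as the biggest data drive) is easy to verify mechanically. A sketch (helper name invented here) that reads `df -B1 --output=size,target` lines on stdin:

```shell
# Warn if any data drive is larger than the parity drive
check_parity_size() {
    awk '$2 ~ /parity/ { p = $1 + 0 }
         $2 ~ /disk/   { if ($1 + 0 > s) { s = $1 + 0; big = $2 } }
         END { if (s > p) { print "WARNING: " big " is larger than parity"; exit 1 } }'
}
# Usage: df -B1 --output=size,target /mnt/disk* /mnt/parity1 | check_parity_size
```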
Scenario 3: Automated SnapRAID with Cron¶
Option A: snapraid-runner (recommended)¶
# Install snapraid-runner
cd /opt
git clone https://github.com/Chronial/snapraid-runner.git
cp snapraid-runner/snapraid-runner.conf.example snapraid-runner/snapraid-runner.conf
Edit /opt/snapraid-runner/snapraid-runner.conf:
[snapraid]
executable = /usr/bin/snapraid
config = /etc/snapraid.conf
deletethreshold = 40
touch = false
[logging]
file = /var/log/snapraid-runner.log
maxsize = 5000
[email]
sendon = error
from = snapraid@myserver.lan
to = admin@example.com
smtp = localhost
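snapraid-runner can also drive the scrub itself instead of a separate cron line. An optional section (key names have changed across releases; older copies of the example config use `percentage` where newer ones use `plan`, so match whatever your copy ships with):

```
[scrub]
enabled = true
plan = 12
older-than = 30
```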
Cron entry:
# /etc/cron.d/snapraid
# Run sync daily at 3am, scrub weekly on Sunday at 5am
0 3 * * * root python3 /opt/snapraid-runner/snapraid-runner.py -c /opt/snapraid-runner/snapraid-runner.conf >/dev/null 2>&1
0 5 * * 0 root /usr/bin/snapraid scrub -p 12 -o 30 >>/var/log/snapraid-scrub.log 2>&1
The deletethreshold = 40 is critical: if snapraid diff reports more than 40 deleted files, the sync is aborted. This protects against accidental mass deletion (e.g., a drive failed to mount and SnapRAID sees all files as "deleted").
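A belt-and-braces guard against that exact failure mode is to refuse to run when any data path is not actually a mounted filesystem. A sketch (function name is ours; relies on util-linux `mountpoint`):

```shell
# Abort unless every argument is a real mount point (an unmounted
# drive would otherwise look like a mass deletion to snapraid diff)
require_mounted() {
    for m in "$@"; do
        mountpoint -q "$m" || { echo "ABORT: $m is not mounted" >&2; return 1; }
    done
}
# Usage: require_mounted /mnt/disk1 /mnt/disk2 /mnt/disk3 /mnt/disk4 && snapraid sync
```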
Option B: Simple bash script¶
#!/bin/bash
# /usr/local/bin/snapraid-sync.sh
set -euo pipefail
LOG="/var/log/snapraid-sync-$(date +%Y%m%d).log"
THRESHOLD=40
echo "=== SnapRAID sync $(date) ===" | tee "$LOG"
# Safety check: count deletions
DELETED=$(snapraid diff 2>&1 | grep -c "^remove" || true)
echo "Files pending deletion: $DELETED" | tee -a "$LOG"
if [ "$DELETED" -gt "$THRESHOLD" ]; then
    echo "ABORT: $DELETED deletions exceeds threshold of $THRESHOLD" | tee -a "$LOG"
    # Send alert
    echo "SnapRAID sync aborted: $DELETED deletions" | mail -s "SnapRAID ALERT" admin@example.com
    exit 1
fi
snapraid sync 2>&1 | tee -a "$LOG"
echo "=== Sync complete $(date) ===" | tee -a "$LOG"
Scenario 4: rclone Cloud Backup¶
Encrypt and sync critical data to Backblaze B2:
# 1. Configure B2 remote
rclone config
# Choose: New remote → name: b2 → type: b2 → enter account/key
# 2. Configure encryption layer on top
rclone config
# Choose: New remote → name: b2-crypt → type: crypt
# Remote to encrypt: b2:mybucket/encrypted
# Filename encryption: standard
# Directory name encryption: true
# 3. Sync critical data (not the entire pool — just irreplaceable files)
rclone sync /mnt/storage/documents b2-crypt:documents/ --progress --transfers 8
rclone sync /mnt/storage/photos b2-crypt:photos/ --progress --transfers 8
# 4. Verify the sync
rclone check /mnt/storage/documents b2-crypt:documents/
# 5. Automate via cron
# 0 4 * * * /usr/bin/rclone sync /mnt/storage/documents b2-crypt:documents/ --log-file /var/log/rclone-backup.log --log-level INFO
Cost estimate: Backblaze B2 charges $0.006/GB/month. 1TB offsite = ~$6/month.
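The arithmetic generalizes. A one-liner for any size (the rate is hard-coded at the $0.006/GB figure above; feed it the gigabyte count from `rclone size` if you want the real number):

```shell
# Monthly B2 storage cost for a given number of gigabytes
b2_cost() {
    awk -v gb="$1" 'BEGIN { printf "$%.2f/month\n", gb * 0.006 }'
}
b2_cost 1000   # 1TB; prints $6.00/month
```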
Scenario 5: Monitoring Setup¶
smartd Configuration¶
# /etc/smartd.conf
# Monitor all SATA drives, email on issues, short self-test daily, long self-test weekly
DEVICESCAN -a -o on -S on -n standby,q -s (S/../.././02|L/../../6/03) -W 4,45,50 -m admin@example.com
Breakdown:
- -a: Enable the default monitoring set (health status, attribute changes, error and self-test logs)
- -o on: Enable offline data collection
- -S on: Enable attribute autosave
- -n standby,q: Skip if drive is in standby (quiet mode)
- -s (S/../.././02|L/../../6/03): Short test daily at 2am, long test Saturday at 3am
- -W 4,45,50: Warn if temp diff >4C, info at 45C, critical at 50C
- -m: Email recipient
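DEVICESCAN silently skips a drive that has vanished; if you would rather smartd fail loudly at startup, list devices explicitly by stable id (serial below is a placeholder) with the same directives:

```
# /etc/smartd.conf: explicit per-device entries
/dev/disk/by-id/ata-WDC_WD161KFGX-68AFPN0_SERIALHERE -a -o on -S on -n standby,q -s (S/../.././02|L/../../6/03) -W 4,45,50 -m admin@example.com
```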
Disk Usage Monitoring Script¶
#!/bin/bash
# /usr/local/bin/disk-usage-check.sh
THRESHOLD=90
for mount in /mnt/disk*; do
    USAGE=$(df --output=pcent "$mount" | tail -1 | tr -d ' %')
    if [ "$USAGE" -gt "$THRESHOLD" ]; then
        echo "WARNING: $mount is ${USAGE}% full" | \
            mail -s "Disk space warning: $mount" admin@example.com
    fi
done
# Also check mergerfs pool
POOL_USAGE=$(df --output=pcent /mnt/storage | tail -1 | tr -d ' %')
echo "Pool usage: ${POOL_USAGE}%"
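Wired into cron (path taken from the script header above; hourly is a reasonable cadence):

```
# /etc/cron.d/disk-usage
0 * * * * root /usr/local/bin/disk-usage-check.sh
```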
Operational Patterns¶
Recovery Workflow: Drive Failure¶
1. IDENTIFY failed drive
smartctl -H /dev/sdX → FAILED
dmesg | grep -i error
2. STOP services using the pool
systemctl stop plex jellyfin sonarr radarr
3. CHECK what was on the failed drive
snapraid status
snapraid list -d d3 > /tmp/d3-files.txt # if d3 failed
4. REPLACE the physical drive
- Power down (or hot-swap if enclosure supports it)
- Install new drive in same bay
- Partition + format (same as new-drive workflow above)
- Mount at the SAME mount point (/mnt/disk3)
5. REBUILD from parity
snapraid fix -d d3 -l /var/log/snapraid-fix.log
# This reconstructs all files that were on d3
6. VERIFY the rebuild
snapraid check -d d3
snapraid scrub -p 100 -d d3
7. UPDATE parity for the rebuilt drive
snapraid sync
8. RESTART services
systemctl start plex jellyfin sonarr radarr
9. VERIFY everything
snapraid status
# Spot-check a few files
Expected time: formatting ~1 minute; snapraid fix 4-12 hours for a full drive, depending on size and file count; final sync 2-6 hours.
Scaling: Adding Drives Without Downtime¶
# 1. Install and burn-in new drive (see Scenario 1)
# 2. Mount new drive at /mnt/diskN
# 3. Add to mergerfs pool:
# Option A: Live add (no remount) via a runtime xattr on the control file
# (attribute is user.mergerfs.branches on current releases; old versions used srcmounts)
xattr -w user.mergerfs.branches '+>/mnt/disk6' /mnt/storage/.mergerfs
# Option B: Edit fstab and remount
# Edit source paths in fstab, then:
umount /mnt/storage
mount /mnt/storage
# WARNING: Stop services first if they have open files
# 4. Add to snapraid.conf
echo 'data d6 /mnt/disk6/' >> /etc/snapraid.conf
# 5. Sync parity
snapraid sync
With the mfs (most-free-space) create policy, mergerfs begins placing new files on the new drive immediately, since an empty drive has the most free space. With ff (first-found), the new drive is only used once the branches listed before it fill up. Existing files stay where they are; no rebalancing is needed.
Drive Replacement (Proactive)¶
When SMART shows degradation but the drive still works:
# 1. Check which files are on the degrading drive
snapraid list -d d2 | wc -l # count files
# 2. Add a new replacement drive to the pool first
# 3. Move files from old drive to new drive
rsync -avh --remove-source-files /mnt/disk2/ /mnt/disk_new/
# 4. Update snapraid.conf: change d2's path to the new mount
# data d2 /mnt/disk_new/
# 5. Remove old drive from mergerfs and fstab
# 6. Run snapraid sync (parity updates for moved files)
snapraid sync
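Note that --remove-source-files deletes files but leaves the directory skeleton behind. A small cleanup sketch (function name is ours) to confirm the old drive is file-free and prune the empty tree:

```shell
# Verify the source drive is empty of files, then remove leftover directories
cleanup_moved_drive() {
    src="$1"
    # Should print 0 once the move is complete
    find "$src" -type f | wc -l
    # Remove the directory skeleton rsync leaves behind
    find "$src" -mindepth 1 -type d -empty -delete
}
# Usage: cleanup_moved_drive /mnt/disk2
```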