Linux Data Hoarding - Street-Level Ops¶
Real-world patterns and debugging techniques for building and maintaining a Linux data hoarding stack in production.
Quick Diagnosis Commands¶
# 1. Check overall pool health (mergerfs + individual drives)
df -h /mnt/disk* /mnt/storage
# 2. SnapRAID array status
snapraid status
# 3. SMART health across all drives
for d in /dev/sd?; do echo "=== $d ==="; smartctl -H "$d"; done
# 4. Check for drives running hot
for d in /dev/sd?; do echo "$d: $(smartctl -A "$d" | grep Temperature_Celsius | awk '{print $10}')C"; done
# 5. Find largest files consuming space
find /mnt/storage -type f -printf '%s %p\n' | sort -rn | head -20
Common Scenarios¶
Scenario 1: New Drive Setup Workflow¶
A new 16TB drive arrives. Here is the full workflow from unboxing to production.
# Step 1: Identify the new drive
lsblk -o NAME,SIZE,MODEL,SERIAL
# Note: new drive appears as /dev/sdX (e.g., /dev/sde)
# Step 2: SMART check — reject DOA drives immediately
smartctl -H /dev/sde
smartctl -a /dev/sde | head -30
# Step 3: Burn-in test (a full badblocks pass on a 16TB drive can take several days)
# DESTRUCTIVE: only run on new, empty drives
badblocks -wsv -b 4096 /dev/sde # Write+verify every sector (4 patterns)
smartctl -t long /dev/sde # SMART extended test; run after badblocks, not alongside it
# Step 4: Check SMART after burn-in
smartctl -a /dev/sde | grep -E "Reallocated|Pending|Uncorrect"
# If any non-zero: RMA the drive
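The "any non-zero" check above can be scripted so a bad drive fails loudly. A small sketch (the function name is ours, not a smartctl feature) that reads `smartctl -A` output on stdin:

```shell
# Fail if Reallocated/Pending/Uncorrectable raw values are non-zero.
# Column 10 is the raw value in smartctl -A output.
check_smart_attrs() {
    awk '/Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable/ {
        if ($10 + 0 > 0) { bad = 1; print "FAIL: " $2 " = " $10 }
    } END { exit bad }'
}
# Usage: smartctl -A /dev/sde | check_smart_attrs || echo "RMA candidate"
```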
# Step 5: Partition (optional; many data hoarders format the whole disk)
# Whole-disk formatting is simpler; a GPT partition makes the disk self-describing to other tools
# If you want a partition table:
parted /dev/sde mklabel gpt
parted /dev/sde mkpart primary ext4 0% 100%
# Step 6: Format with a label
mkfs.ext4 -L disk5 -m 1 /dev/sde # or /dev/sde1 if partitioned; -m 1 trims the 5% root reserve
# For large files (media), consider XFS:
# mkfs.xfs -L disk5 /dev/sde
# Step 7: Get the drive's stable identifier
ls -la /dev/disk/by-id/ | grep "$(smartctl -i /dev/sde | grep Serial | awk '{print $NF}')"
# Step 8: Create mount point and add to fstab
mkdir -p /mnt/disk5
# Use by-id for stable naming:
echo '/dev/disk/by-id/ata-WDC_WD161KFGX-68AFPN0_SERIALHERE /mnt/disk5 ext4 defaults,noatime,nofail 0 2' >> /etc/fstab
mount /mnt/disk5
# Step 9: Add to mergerfs pool (if using fstab, edit the mergerfs line)
# Before: /mnt/disk1:/mnt/disk2:/mnt/disk3:/mnt/disk4
# After: /mnt/disk1:/mnt/disk2:/mnt/disk3:/mnt/disk4:/mnt/disk5
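For reference, a complete mergerfs line in fstab (option values here are illustrative, not gospel; tune policies to your workload):

```
/mnt/disk1:/mnt/disk2:/mnt/disk3:/mnt/disk4:/mnt/disk5 /mnt/storage fuse.mergerfs cache.files=off,dropcacheonclose=true,category.create=mfs,moveonenospc=true,minfreespace=50G,fsname=mergerfs 0 0
```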
# Step 10: Add to snapraid.conf
echo 'data d5 /mnt/disk5/' >> /etc/snapraid.conf
# Add a content file on the new drive
echo 'content /mnt/disk5/snapraid.content' >> /etc/snapraid.conf
# Step 11: Run snapraid sync to incorporate new drive
snapraid sync
# Step 12: Verify
snapraid status
df -h /mnt/disk5 /mnt/storage
Total time: several days (mostly unattended burn-in), ~15 minutes of active work.
Scenario 2: snapraid.conf for a Typical 4+1 Array¶
A complete, production-ready configuration:
# /etc/snapraid.conf
# 4 data drives + 1 parity drive
# Parity drive MUST be >= largest data drive (16TB here)
# Parity — dedicated 16TB drive
parity /mnt/parity1/snapraid.parity
# Content files — at least 2 copies on different drives
content /var/snapraid.content
content /mnt/disk1/snapraid.content
content /mnt/disk3/snapraid.content
# Data drives — order is SACRED, never reorder after first sync
data d1 /mnt/disk1/
data d2 /mnt/disk2/
data d3 /mnt/disk3/
data d4 /mnt/disk4/
# Exclude patterns
exclude *.unrecoverable
exclude /tmp/
exclude /lost+found/
exclude *.!sync
exclude *.part
exclude /downloads/incomplete/
exclude .Trash-*/
exclude *.nfo
# Optional: exclude large transient files
exclude /torrents/
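The size rule in the comment above (parity must be at least as large as the biggest data drive) is easy to verify mechanically. A sketch (helper name invented here) that reads `df -B1 --output=size,target` lines on stdin:

```shell
# Warn if any data drive is larger than the parity drive
check_parity_size() {
    awk '$2 ~ /parity/ { p = $1 + 0 }
         $2 ~ /disk/   { if ($1 + 0 > s) { s = $1 + 0; big = $2 } }
         END { if (s > p) { print "WARNING: " big " is larger than parity"; exit 1 } }'
}
# Usage: df -B1 --output=size,target /mnt/disk* /mnt/parity1 | check_parity_size
```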
Scenario 3: Automated SnapRAID with Cron¶
Option A: snapraid-runner (recommended)¶
# Install snapraid-runner
cd /opt
git clone https://github.com/Chronial/snapraid-runner.git
cp snapraid-runner/snapraid-runner.conf.example snapraid-runner/snapraid-runner.conf
Edit /opt/snapraid-runner/snapraid-runner.conf:
[snapraid]
executable = /usr/bin/snapraid
config = /etc/snapraid.conf
deletethreshold = 40
touch = false
[logging]
file = /var/log/snapraid-runner.log
maxsize = 5000
[email]
sendon = error
from = snapraid@myserver.lan
to = admin@example.com
smtp = localhost
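snapraid-runner can also drive the scrub itself instead of a separate cron line. An optional section (key names have changed across releases; older copies of the example config use `percentage` where newer ones use `plan`, so match whatever your copy ships with):

```
[scrub]
enabled = true
plan = 12
older-than = 30
```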
Cron entry:
# /etc/cron.d/snapraid
# Run sync daily at 3am, scrub weekly on Sunday at 5am
0 3 * * * root python3 /opt/snapraid-runner/snapraid-runner.py -c /opt/snapraid-runner/snapraid-runner.conf >/dev/null 2>&1
0 5 * * 0 root /usr/bin/snapraid scrub -p 12 -o 30 >>/var/log/snapraid-scrub.log 2>&1
The deletethreshold = 40 is critical: if snapraid diff reports more than 40 deleted files, the sync is aborted. This protects against accidental mass deletion (e.g., a drive failed to mount and SnapRAID sees all files as "deleted").
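A belt-and-braces guard against that exact failure mode is to refuse to run when any data path is not actually a mounted filesystem. A sketch (function name is ours; relies on util-linux `mountpoint`):

```shell
# Abort unless every argument is a real mount point (an unmounted
# drive would otherwise look like a mass deletion to snapraid diff)
require_mounted() {
    for m in "$@"; do
        mountpoint -q "$m" || { echo "ABORT: $m is not mounted" >&2; return 1; }
    done
}
# Usage: require_mounted /mnt/disk1 /mnt/disk2 /mnt/disk3 /mnt/disk4 && snapraid sync
```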
Option B: Simple bash script¶
#!/bin/bash
# /usr/local/bin/snapraid-sync.sh
set -euo pipefail
LOG="/var/log/snapraid-sync-$(date +%Y%m%d).log"
THRESHOLD=40
echo "=== SnapRAID sync $(date) ===" | tee "$LOG"
# Safety check: count deletions
DELETED=$(snapraid diff 2>&1 | grep -c "^remove" || true)
echo "Files pending deletion: $DELETED" | tee -a "$LOG"
if [ "$DELETED" -gt "$THRESHOLD" ]; then
    echo "ABORT: $DELETED deletions exceeds threshold of $THRESHOLD" | tee -a "$LOG"
    # Send alert
    echo "SnapRAID sync aborted: $DELETED deletions" | mail -s "SnapRAID ALERT" admin@example.com
    exit 1
fi
snapraid sync 2>&1 | tee -a "$LOG"
echo "=== Sync complete $(date) ===" | tee -a "$LOG"
Scenario 4: rclone Cloud Backup¶
Encrypt and sync critical data to Backblaze B2:
# 1. Configure B2 remote
rclone config
# Choose: New remote → name: b2 → type: b2 → enter account/key
# 2. Configure encryption layer on top
rclone config
# Choose: New remote → name: b2-crypt → type: crypt
# Remote to encrypt: b2:mybucket/encrypted
# Filename encryption: standard
# Directory name encryption: true
# 3. Sync critical data (not the entire pool — just irreplaceable files)
rclone sync /mnt/storage/documents b2-crypt:documents/ --progress --transfers 8
rclone sync /mnt/storage/photos b2-crypt:photos/ --progress --transfers 8
# 4. Verify the sync
rclone check /mnt/storage/documents b2-crypt:documents/
# 5. Automate via cron
# 0 4 * * * /usr/bin/rclone sync /mnt/storage/documents b2-crypt:documents/ --log-file /var/log/rclone-backup.log --log-level INFO
Cost estimate: Backblaze B2 charges $0.006/GB/month. 1TB offsite = ~$6/month.
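The arithmetic generalizes. A one-liner for any size (the rate is hard-coded at the $0.006/GB figure above; feed it the gigabyte count from `rclone size` if you want the real number):

```shell
# Monthly B2 storage cost for a given number of gigabytes
b2_cost() {
    awk -v gb="$1" 'BEGIN { printf "$%.2f/month\n", gb * 0.006 }'
}
b2_cost 1000   # 1TB; prints $6.00/month
```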
Scenario 5: Monitoring Setup¶
smartd Configuration¶
# /etc/smartd.conf
# Monitor all SATA drives, email on issues, short self-test daily, long self-test weekly
DEVICESCAN -a -o on -S on -n standby,q -s (S/../.././02|L/../../6/03) -W 4,45,50 -m admin@example.com
Breakdown:
- -a: Enable the default monitoring set (health status, attribute changes, error and self-test logs)
- -o on: Enable offline data collection
- -S on: Enable attribute autosave
- -n standby,q: Skip if drive is in standby (quiet mode)
- -s (S/../.././02|L/../../6/03): Short test daily at 2am, long test Saturday at 3am
- -W 4,45,50: Warn if temp diff >4C, info at 45C, critical at 50C
- -m: Email recipient
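DEVICESCAN silently skips a drive that has vanished; if you would rather smartd fail loudly at startup, list devices explicitly by stable id (serial below is a placeholder) with the same directives:

```
# /etc/smartd.conf: explicit per-device entries
/dev/disk/by-id/ata-WDC_WD161KFGX-68AFPN0_SERIALHERE -a -o on -S on -n standby,q -s (S/../.././02|L/../../6/03) -W 4,45,50 -m admin@example.com
```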
Disk Usage Monitoring Script¶
#!/bin/bash
# /usr/local/bin/disk-usage-check.sh
THRESHOLD=90
for mount in /mnt/disk*; do
    USAGE=$(df --output=pcent "$mount" | tail -1 | tr -d ' %')
    if [ "$USAGE" -gt "$THRESHOLD" ]; then
        echo "WARNING: $mount is ${USAGE}% full" | \
            mail -s "Disk space warning: $mount" admin@example.com
    fi
done
# Also check mergerfs pool
POOL_USAGE=$(df --output=pcent /mnt/storage | tail -1 | tr -d ' %')
echo "Pool usage: ${POOL_USAGE}%"
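Wired into cron (path taken from the script header above; hourly is a reasonable cadence):

```
# /etc/cron.d/disk-usage
0 * * * * root /usr/local/bin/disk-usage-check.sh
```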
Operational Patterns¶
Recovery Workflow: Drive Failure¶
1. IDENTIFY failed drive
smartctl -H /dev/sdX → FAILED
dmesg | grep -i error
2. STOP services using the pool
systemctl stop plex jellyfin sonarr radarr
3. CHECK what was on the failed drive
snapraid status
snapraid list -d d3 > /tmp/d3-files.txt # if d3 failed
4. REPLACE the physical drive
- Power down (or hot-swap if enclosure supports it)
- Install new drive in same bay
- Partition + format (same as new-drive workflow above)
- Mount at the SAME mount point (/mnt/disk3)
5. REBUILD from parity
snapraid fix -d d3 -l /var/log/snapraid-fix.log
# This reconstructs all files that were on d3
6. VERIFY the rebuild
snapraid check -d d3
snapraid scrub -p 100 -d d3
7. UPDATE parity for the rebuilt drive
snapraid sync
8. RESTART services
systemctl start plex jellyfin sonarr radarr
9. VERIFY everything
snapraid status
# Spot-check a few files
Expected time: formatting ~1 minute; snapraid fix 4-12 hours for a full drive, depending on size and file count; final sync 2-6 hours.
Scaling: Adding Drives Without Downtime¶
# 1. Install and burn-in new drive (see Scenario 1)
# 2. Mount new drive at /mnt/diskN
# 3. Add to mergerfs pool:
# Option A: Live add (no remount) via a runtime xattr on the control file
# (attribute is user.mergerfs.branches on current releases; old versions used srcmounts)
xattr -w user.mergerfs.branches '+>/mnt/disk6' /mnt/storage/.mergerfs
# Option B: Edit fstab and remount
# Edit source paths in fstab, then:
umount /mnt/storage
mount /mnt/storage
# WARNING: Stop services first if they have open files
# 4. Add to snapraid.conf
echo 'data d6 /mnt/disk6/' >> /etc/snapraid.conf
# 5. Sync parity
snapraid sync
With the mfs (most-free-space) create policy, mergerfs begins placing new files on the new drive immediately, since an empty drive has the most free space. With ff (first-found), the new drive is only used once the branches listed before it fill up. Existing files stay where they are; no rebalancing is needed.
Drive Replacement (Proactive)¶
When SMART shows degradation but the drive still works:
# 1. Check which files are on the degrading drive
snapraid list -d d2 | wc -l # count files
# 2. Add a new replacement drive to the pool first
# 3. Move files from old drive to new drive
rsync -avh --remove-source-files /mnt/disk2/ /mnt/disk_new/
# 4. Update snapraid.conf: change d2's path to the new mount
# data d2 /mnt/disk_new/
# 5. Remove old drive from mergerfs and fstab
# 6. Run snapraid sync (parity updates for moved files)
snapraid sync
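Note that --remove-source-files deletes files but leaves the directory skeleton behind. A small cleanup sketch (function name is ours) to confirm the old drive is file-free and prune the empty tree:

```shell
# Verify the source drive is empty of files, then remove leftover directories
cleanup_moved_drive() {
    src="$1"
    # Should print 0 once the move is complete
    find "$src" -type f | wc -l
    # Remove the directory skeleton rsync leaves behind
    find "$src" -mindepth 1 -type d -empty -delete
}
# Usage: cleanup_moved_drive /mnt/disk2
```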