Storage Operations - Street-Level Ops

Real-world patterns and gotchas from managing production storage systems.

Quick Diagnosis Commands

# What's using all the disk space?
df -hT                                    # Filesystem usage with types
du -xsh /* 2>/dev/null | sort -rh | head  # Top-level directories
find / -xdev -type f -size +500M -printf '%s %p\n' 2>/dev/null | sort -rn | head

> **Debug clue:** If `df -h` shows space available but you get "No space left on device," check `df -i` for inode exhaustion. Millions of tiny files (e.g., a mail queue gone wrong, a cache directory with one file per request) can exhaust inodes while barely touching disk capacity.

# Inode exhaustion (disk has space but can't create files)
df -i                                     # Inode usage
find / -xdev -printf '%h\n' | sort | uniq -c | sort -rn | head  # Dirs with most files

# I/O bottleneck detection
iostat -xz 1 3                            # await > 20ms = slow
iotop -oPa                                # Which process is hammering I/O

# Check for disk errors
dmesg | grep -iE 'error|fail|i/o|sector|medium|reset'
smartctl -a /dev/sda | grep -E 'Reallocated|Pending|Uncorrectable|Error'

# LVM status at a glance
pvs && vgs && lvs

# RAID health
cat /proc/mdstat                          # Software RAID
storcli64 /c0/v0 show 2>/dev/null        # Hardware RAID (MegaRAID)

Gotcha: Disk Full but No Large Files

The disk is 100% full but du says you've only used 60%. Ghost space:

Common causes:
1. Deleted files still held open — a process has a file descriptor to a deleted file
2. Reserved blocks — ext4 reserves 5% for root by default

# Find deleted-but-open files
lsof +L1 | grep deleted
# Fix: restart the process holding the file, or truncate in place:
#   : > /proc/<PID>/fd/<FD>    # ':' truncates to zero bytes; echo '' would leave a newline

# Check reserved blocks (ext4)
tune2fs -l /dev/sda1 | grep 'Reserved block count'
# Reduce reserved blocks (careful — root needs some):
tune2fs -m 1 /dev/sda1    # Set to 1% instead of 5%
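To see how much ghost space deleted-but-open files actually hold, the lsof output can be totalled per command. A sketch, assuming standard lsof column order (field 7, SIZE/OFF, is the file size in bytes for regular files):

```shell
# Total the space still pinned by deleted-but-open files, per command.
# Reads `lsof +L1 -nP` output on stdin so it can be tested offline.
sum_deleted() {
    awk '/\(deleted\)/ { bytes[$1] += $7 }
         END { for (cmd in bytes) printf "%s %d\n", cmd, bytes[cmd] }' |
        sort -k2 -rn
}

# Usage: lsof +L1 -nP 2>/dev/null | sum_deleted
```

A large total next to one command tells you which restart will give the space back.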

Gotcha: XFS Can't Shrink

Remember: ext4 can grow online and shrink offline. XFS can only grow. Mnemonic: XFS is a one-way street — you can drive forward (grow) but never back up (shrink).

You allocated too much space to an XFS logical volume. You want to shrink it. You can't. XFS does not support online or offline shrinking.

Fix: Create a new, smaller LV, copy the data, swap mounts. Or accept the overallocation. This is why you should start small and grow as needed.

# The only way to "shrink" XFS:
lvcreate -L 50G -n app_data_new data_vg
mkfs.xfs /dev/data_vg/app_data_new
mount /dev/data_vg/app_data_new /mnt/new
rsync -avHAX /data/ /mnt/new/    # stop services writing to /data first
umount /data
umount /mnt/new
lvremove /dev/data_vg/app_data
lvrename data_vg app_data_new app_data
mount /dev/data_vg/app_data /data

Gotcha: NFS Stale File Handle

NFS server restarted or export changed. Clients show Stale file handle errors. Applications hang. ls /mnt/shared blocks forever.

Fix:

# Lazy unmount (allows cleanup without blocking)
umount -l /mnt/shared

# Remount
mount -t nfs nfs-server:/data/shared /mnt/shared

# If mount itself hangs, check if using 'hard' mount (default)
# Hard mounts retry forever. Use 'soft' only for non-critical data
mount -o soft,timeo=30 nfs-server:/data/shared /mnt/shared   # timeo is in deciseconds: 30 = 3s
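For permanent NFS mounts the same hard-vs-soft trade-off belongs in /etc/fstab. A sketch with a hypothetical server and commonly used options (tune timeo/retrans to your network):

```
# /etc/fstab — hard mount for data that must not be silently truncated;
# _netdev delays mounting until the network is up
nfs-server:/data/shared  /mnt/shared  nfs  hard,timeo=600,retrans=2,_netdev  0 0
```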

Gotcha: LVM Snapshot Filling Up

LVM snapshots use copy-on-write. If the origin volume changes a lot while the snapshot exists, the snapshot fills up and becomes invalid — silently. Your backup from the snapshot is now corrupt.

Fix: Monitor snapshot usage and size snapshots generously:

# Check snapshot usage
lvs -o lv_name,lv_size,snap_percent,origin
# If snap_percent > 80%, extend or remove the snapshot

# Extend a snapshot (if VG has space)
lvextend -L +10G /dev/data_vg/app_snap
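The snap_percent check is easy to automate. A minimal sketch of a cron-able guard using the 80% rule of thumb above; the function reads `lvs --noheadings -o lv_name,snap_percent` on stdin (non-snapshot LVs print an empty snap_percent, so the NF test filters them out):

```shell
# Alert on LVM snapshots past a fill threshold (80% here).
check_snaps() {
    awk -v limit=80 'NF == 2 && $2 + 0 > limit {
        printf "ALERT: snapshot %s at %s%%\n", $1, $2
    }'
}

# Usage: lvs --noheadings -o lv_name,snap_percent | check_snaps
```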

Pattern: Automated Disk Cleanup

#!/usr/bin/env bash
set -euo pipefail

THRESHOLD=85  # Percent

check_and_clean() {
    local mount=$1
    local usage
    usage=$(df -P "${mount}" | tail -1 | awk '{print $5}' | tr -d '%')  # -P prevents line wrap on long device names

    if (( usage < THRESHOLD )); then
        return 0
    fi

    echo "WARN: ${mount} at ${usage}% — cleaning up"

    # Safe cleanups (least to most aggressive). These act on fixed system
    # paths, not on "${mount}"; the per-mount call only decides whether
    # cleanup runs at all. `|| true` keeps set -e from aborting when a
    # tool is missing or a path is unreadable.
    # 1. Old logs
    find /var/log -name '*.gz' -mtime +30 -delete 2>/dev/null || true
    find /var/log -name '*.old' -delete 2>/dev/null || true

    # 2. Package manager cache
    yum clean all >/dev/null 2>&1 || apt-get clean 2>/dev/null || true

    # 3. Old journal entries
    journalctl --vacuum-time=7d >/dev/null 2>&1 || true

    # 4. Temp files
    find /tmp -type f -atime +7 -delete 2>/dev/null || true

    local new_usage
    new_usage=$(df -P "${mount}" | tail -1 | awk '{print $5}' | tr -d '%')
    echo "After cleanup: ${mount} at ${new_usage}%"
}

check_and_clean /
check_and_clean /var

Pattern: LVM Extend Workflow

#!/usr/bin/env bash
# Safe LV extension with pre-checks
set -euo pipefail

if [[ $# -ne 3 ]]; then
    echo "Usage: $0 <lv_path> <add_size> <mount_point>" >&2
    exit 1
fi

LV_PATH=$1        # e.g., /dev/data_vg/app_data
ADD_SIZE=$2       # e.g., 50G
MOUNT_POINT=$3    # e.g., /data

# Pre-checks
vg_name=$(lvs --noheadings -o vg_name "${LV_PATH}" | tr -d ' ')
vg_free=$(vgs --noheadings -o vg_free --units g "${vg_name}" | tr -d ' ')
echo "VG free space: ${vg_free}"

fs_type=$(findmnt -n -o FSTYPE "${MOUNT_POINT}")
echo "Filesystem type: ${fs_type}"

# Extend
lvextend -L "+${ADD_SIZE}" "${LV_PATH}"

# Grow filesystem
case "${fs_type}" in
    xfs)   xfs_growfs "${MOUNT_POINT}" ;;
    ext4)  resize2fs "${LV_PATH}" ;;
    *)     echo "Unknown filesystem: ${fs_type}"; exit 1 ;;
esac

# Verify
df -h "${MOUNT_POINT}"

Pattern: Disk Health Monitoring

#!/usr/bin/env bash
# SMART monitoring for all disks — run via cron daily
set -euo pipefail

for disk in $(lsblk -dno NAME | grep -E '^sd|^nvme'); do
    device="/dev/${disk}"

    # Skip if SMART not supported
    if ! smartctl -i "${device}" 2>/dev/null | grep -q 'SMART support is: Enabled'; then
        continue
    fi

    # smartctl exits nonzero whenever errors are logged, so guard the
    # pipelines against set -e/pipefail
    reallocated=$(smartctl -A "${device}" | awk '/Reallocated_Sector/{print $NF}' || true)
    pending=$(smartctl -A "${device}" | awk '/Current_Pending/{print $NF}' || true)
    uncorrectable=$(smartctl -A "${device}" | awk '/Offline_Uncorrectable/{print $NF}' || true)

    if [[ "${reallocated:-0}" -gt 0 || "${pending:-0}" -gt 0 || "${uncorrectable:-0}" -gt 0 ]]; then
        echo "ALERT: ${device} — reallocated=${reallocated} pending=${pending} uncorrectable=${uncorrectable}"
        logger -t disk-monitor "DEGRADED: ${device} reallocated=${reallocated} pending=${pending}"
    fi
done
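To actually run this daily as the header suggests, a cron.d sketch — the install path and schedule are assumptions, not part of the script:

```
# /etc/cron.d/disk-monitor — cron mails any ALERT output to root
15 6 * * * root /usr/local/sbin/disk-health-check.sh
```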

Emergency: RAID Array Degraded

# 1. Identify the problem
cat /proc/mdstat                              # Software RAID
storcli64 /c0/v0 show 2>/dev/null            # Hardware RAID

# 2. Find the failed disk
mdadm --detail /dev/md0 | grep -E 'State|dev'
storcli64 /c0/e0/s0-7 show | grep -E 'State|Size'

# 3. Check SMART on the suspect disk
smartctl -a /dev/sdc

# 4. For software RAID — mark failed, remove, replace
mdadm /dev/md0 --fail /dev/sdc
mdadm /dev/md0 --remove /dev/sdc
# Physically replace the disk, then:
mdadm /dev/md0 --add /dev/sdd

# 5. Monitor rebuild
watch -n 5 cat /proc/mdstat
# CRITICAL: Do NOT reboot during rebuild
# CRITICAL: Reduce I/O during rebuild to prevent double failure

War story: A RAID5 array lost a disk. During the rebuild (which takes hours on large arrays), a second disk failed from the increased I/O load — the rebuild stresses every surviving disk. The array was lost. This is why RAID6 (dual parity) or RAID10 exists for production. RAID5 with large disks is a coin flip during rebuild.

Scale note: RAID rebuild time scales with disk size, not data size. A 12TB disk in a RAID5 takes 12+ hours to rebuild even if only 1TB is used. During that window, one more disk failure means total data loss. For disks over 4TB, always use RAID6 or RAID10.
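A back-of-envelope check on that claim: a rebuild must write every sector of the replacement disk, so a lower bound is raw capacity divided by sustained write rate. Assuming ~150 MB/s, which is optimistic under foreground I/O load:

```shell
# Lower bound on rebuild time — scales with raw capacity, not used space.
disk_tb=12
rate_mbps=150   # assumed sustained rate; real rates drop under load
secs=$(( disk_tb * 1000 * 1000 / rate_mbps ))
echo "~$(( secs / 3600 ))h minimum for a ${disk_tb}TB disk at ${rate_mbps}MB/s"
```

At those numbers the floor is about 22 hours of degraded operation, which is the window the war story above fell into.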

Emergency: Filesystem Corruption

# 1. Remount read-only immediately to prevent further damage
# (fails with EBUSY if files are open for writing — stop the affected services first)
mount -o remount,ro /data

# 2. Check dmesg for clues
dmesg | tail -50 | grep -iE 'error|corrupt|i/o|ext4|xfs'

# 3. Unmount and repair
umount /data
# For ext4 (-y auto-answers repair prompts):
fsck.ext4 -y /dev/data_vg/app_data
# For XFS — if it reports a dirty log, mount and unmount once to replay it;
# xfs_repair -L (zero the log) is a last resort and discards recent metadata:
xfs_repair /dev/data_vg/app_data

# 4. Mount and verify
mount /data
ls -la /data/
# Check for lost+found — recovered fragments go here
ls /data/lost+found/