Disk & Storage Ops - Street-Level Ops¶
Real-world workflows for managing disks, storage, and I/O on production Linux systems.
Disk Space Emergency¶
The disk is full. Services are crashing. Pages are firing. Here is the sequence.
# Step 1: See what is full
df -h
# Output:
# Filesystem Size Used Avail Use% Mounted on
# /dev/sda1 50G 50G 0 100% /
# /dev/vg0/lv_data 200G 180G 20G 90% /data
# Step 2: Find the biggest directories under the full mount
du -xsh /* 2>/dev/null | sort -rh | head -20   # -x: stay on the full filesystem
# Drill deeper
du -sh /var/* | sort -rh | head -10
du -sh /var/log/* | sort -rh | head -10
# Step 3: Find large files (>100MB)
find / -xdev -type f -size +100M -exec ls -lh {} \; 2>/dev/null | sort -k5 -rh | head -20
# Step 4: Check for deleted-but-open files (space not freed because process holds fd)
lsof +L1 | grep deleted
# Output:
# java 12345 app 15w REG 253,0 42949672960 0 /var/log/app.log (deleted)
# ^^^ This 40GB file was deleted but the process still holds it open.
# Space will not be freed until the process closes the file or is restarted.
# Step 5: Truncate the open file descriptor to free space WITHOUT restarting
# Find the fd number from lsof output, then:
> /proc/12345/fd/15
# Or truncate a file that still exists on disk (frees space immediately)
truncate -s 0 /var/log/huge-application.log
# Step 6: Find and clean package manager cache
sudo apt clean # Debian/Ubuntu
sudo yum clean all # RHEL/CentOS
sudo journalctl --vacuum-size=100M # Trim systemd journal
Critical distinction: rm on an open file removes the directory entry but does not free space until the process closes the file descriptor. truncate -s 0 zeroes the file in place and frees space immediately, even if the file is open.
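The distinction is easy to demonstrate safely with a throwaway file and the shell's own file descriptor, no production process required (a sketch; all paths are temporary):

```shell
#!/bin/sh
# Demonstrate: rm leaves the blocks allocated while an fd is open;
# truncating through /proc/<pid>/fd frees them immediately.
tmp=$(mktemp)
dd if=/dev/zero of="$tmp" bs=1M count=10 2>/dev/null
exec 3<>"$tmp"                      # hold the file open on fd 3
rm "$tmp"                           # directory entry gone, space still used
before=$(stat -Lc %s "/proc/$$/fd/3")
: > "/proc/$$/fd/3"                 # truncate through the fd symlink
after=$(stat -Lc %s "/proc/$$/fd/3")
echo "before=$before after=$after"  # before=10485760 after=0
exec 3>&-                           # close the fd; the inode is reclaimed
```

This is exactly what `> /proc/PID/fd/N` does in the emergency procedure above: opening the proc fd symlink with `O_TRUNC` zeroes the deleted file in place.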
Adding a New Disk¶
A new disk has been attached (physical or cloud volume). Make it usable.
# Step 1: Detect the new disk
# Rescan SCSI bus (for hot-added disks on VMs)
echo "- - -" | sudo tee /sys/class/scsi_host/host*/scan
# Verify it appears
lsblk
# NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
# sda 8:0 0 100G 0 disk
# ├─sda1 8:1 0 99G 0 part /
# └─sda2 8:2 0 1G 0 part [SWAP]
# sdb 8:16 0 500G 0 disk <--- new disk, no partitions
# Step 2: Partition (GPT, single partition using full disk)
sudo parted /dev/sdb mklabel gpt
sudo parted /dev/sdb mkpart primary ext4 0% 100%
# Verify
lsblk /dev/sdb
# NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
# sdb 8:16 0 500G 0 disk
# └─sdb1 8:17 0 500G 0 part
# Step 3: Create filesystem
sudo mkfs.ext4 -L appdata /dev/sdb1
# Step 4: Create mount point and mount
sudo mkdir -p /data/appdata
sudo mount /dev/sdb1 /data/appdata
# Step 5: Add to fstab using UUID (survives device renaming)
UUID=$(sudo blkid -s UUID -o value /dev/sdb1)
echo "UUID=$UUID /data/appdata ext4 defaults,noatime,nofail 0 2" \
| sudo tee -a /etc/fstab
# Step 6: Verify fstab is correct
sudo umount /data/appdata
sudo mount -a
df -h /data/appdata
LVM Extend Online¶
The most common LVM operation in production: extending a volume without downtime.
# Scenario: /data is 100G on LVM and running low. A new 200G disk was added.
# Step 1: Initialize the new disk as a PV
sudo pvcreate /dev/sdc
# Step 2: Add it to the existing volume group
sudo vgextend vg_data /dev/sdc
# Verify free space
sudo vgs
# VG #PV #LV #SN Attr VSize VFree
# vg_data 2 1 0 wz--n- 299.99g 200.00g
# Step 3: Extend the logical volume
sudo lvextend -L +200G /dev/vg_data/lv_appdata
# Or use all free space:
sudo lvextend -l +100%FREE /dev/vg_data/lv_appdata
# Step 4: Resize the filesystem (online, no unmount needed)
# For ext4:
sudo resize2fs /dev/vg_data/lv_appdata
# For XFS:
sudo xfs_growfs /data
# Or do it all in one step:
sudo lvextend -L +200G --resizefs /dev/vg_data/lv_appdata
# Verify
df -h /data
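Before extending, it is worth guarding the operation on actual free space rather than eyeballing `vgs`. A minimal sketch (the `enough_free` helper and the `vg_data` names are illustrative):

```shell
#!/bin/sh
# enough_free FREE NEED -> succeeds when FREE GiB covers NEED GiB
# (floating-point compare via awk, since vgs reports values like "200.00")
enough_free() { awk -v f="$1" -v n="$2" 'BEGIN { exit !(f + 0 >= n + 0) }'; }

# Intended use (requires root and a real VG):
#   free=$(sudo vgs --noheadings --units g --nosuffix -o vg_free vg_data | tr -d ' ')
#   enough_free "$free" 200 && sudo lvextend -L +200G --resizefs /dev/vg_data/lv_appdata
enough_free 200.00 200 && echo "extend is safe"
enough_free 150.00 200 || echo "not enough free space"
```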
RAID Rebuild Monitoring¶
A disk failed and the array is rebuilding. Monitor it.
# Check current RAID status
cat /proc/mdstat
# Personalities : [raid1]
# md0 : active raid1 sda1[0] sdb1[2]
# 488253440 blocks super 1.2 [2/1] [U_]
# [=>...................] recovery = 8.3% (40625408/488253440) finish=62.4min speed=119536K/sec
# Detailed status
sudo mdadm --detail /dev/md0
# State : clean, degraded, recovering
# Rebuild Status : 8% complete
# Watch rebuild progress in real time
watch -n 5 cat /proc/mdstat
# Check rebuild speed limits
cat /proc/sys/dev/raid/speed_limit_min # default: 1000 KB/s
cat /proc/sys/dev/raid/speed_limit_max # default: 200000 KB/s
# Speed up rebuild (at the cost of I/O performance for applications)
echo 500000 | sudo tee /proc/sys/dev/raid/speed_limit_min
# After rebuild completes, verify with a check
sudo mdadm --action=check /dev/md0
# (equivalently: echo check | sudo tee /sys/block/md0/md/sync_action)
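For alerting, the recovery percentage can be scraped straight out of /proc/mdstat. A sketch (the helper name is invented):

```shell
#!/bin/sh
# rebuild_pct: print the recovery percentage from mdstat-format input on stdin
rebuild_pct() { sed -n 's/.*recovery =[ ]*\([0-9.]*\)%.*/\1/p'; }

# Example against the sample output above:
echo '  [=>...................]  recovery =  8.3% (40625408/488253440) finish=62.4min' \
  | rebuild_pct          # prints: 8.3
# Live use: rebuild_pct < /proc/mdstat  (prints nothing when no rebuild is running)
```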
Disk Replacement in RAID¶
A disk is showing SMART errors. Replace it proactively in a RAID 1 array.
# Step 1: Confirm which disk is failing
sudo smartctl -H /dev/sdb
# SMART overall-health self-assessment test result: FAILED
# Confirm it is part of the array
sudo mdadm --detail /dev/md0
# Step 2: Mark the disk as failed
sudo mdadm --manage /dev/md0 --fail /dev/sdb1
# Step 3: Remove from array
sudo mdadm --manage /dev/md0 --remove /dev/sdb1
# Step 4: Physically replace the disk (or detach the virtual disk)
# The new disk will appear as /dev/sdb (or similar)
# Step 5: Partition the new disk to match the original
sudo sfdisk -d /dev/sda | sudo sfdisk /dev/sdb
# Step 6: Add to array — rebuild starts automatically
sudo mdadm --manage /dev/md0 --add /dev/sdb1
# Step 7: Monitor rebuild
watch cat /proc/mdstat
# Step 8: Update mdadm config
sudo mdadm --detail --scan | sudo tee /etc/mdadm/mdadm.conf
sudo update-initramfs -u
NFS Mount Troubleshooting¶
NFS mount is failing or hanging. Systematic diagnosis.
# Step 1: Check if NFS services are running on the server
rpcinfo -p 192.168.1.10
# If this times out, firewall or rpcbind is down on the server
# Step 2: Check what the server is exporting
showmount -e 192.168.1.10
# If this fails: exportfs not applied, or NFS server not running
# Step 3: Check NFS statistics and errors
nfsstat -c # client-side stats
nfsstat -s # server-side stats (run on server)
# Step 4: Test basic connectivity
# NFSv4 uses only port 2049/tcp
nc -zv 192.168.1.10 2049
# NFSv3 also needs rpcbind (111) and dynamic mountd port
nc -zv 192.168.1.10 111
rpcinfo -p 192.168.1.10 | grep mountd
# Step 5: Try mounting with verbose debug
sudo mount -t nfs -o vers=4 192.168.1.10:/exports/shared /mnt/nfs -v
# Step 6: If mount hangs (stale NFS handle)
# Check for stale mounts
mount | grep nfs
# Force unmount a stale NFS mount
sudo umount -f /mnt/nfs
# If that fails:
sudo umount -l /mnt/nfs # lazy unmount
# Step 7: Common fixes
# Server side: re-export
sudo exportfs -ra
# Client side: clear stale mount and remount
sudo umount -l /mnt/nfs
sudo mount -a
# Step 8: Check for permission issues
# On server /etc/exports:
# /exports/shared 10.0.0.0/24(rw,sync,no_subtree_check)
# Make sure no space between client spec and options:
# WRONG: 10.0.0.0/24 (rw) <-- the space means: 10.0.0.0/24 gets default (ro),
# and everyone gets (rw)
# RIGHT: 10.0.0.0/24(rw)
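That gotcha is mechanical enough to lint for. A sketch (helper name invented) that flags any export line with whitespace between the client spec and the option parentheses:

```shell
#!/bin/sh
# lint_exports FILE: print (with line numbers) export lines where options are
# separated from the client spec by whitespace -- the "world gets rw" trap
lint_exports() { grep -nE '^[^#].*[^[:space:]][[:space:]]+\(' "$1"; }

# Example:
cat > /tmp/exports.demo <<'EOF'
/exports/shared 10.0.0.0/24 (rw,sync,no_subtree_check)
/exports/other  10.0.0.0/24(rw,sync,no_subtree_check)
EOF
lint_exports /tmp/exports.demo    # flags line 1 only
```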
Diagnosing I/O Bottleneck¶
Application is slow. You suspect disk I/O. Prove it and find the cause.
# Step 1: Check overall I/O wait
top
# Look at %wa (I/O wait) in the CPU line
# wa > 20% sustained = I/O bottleneck
# Or from vmstat
vmstat 2 5
# Look at the 'wa' column — percentage of CPU time waiting for I/O
# Step 2: Identify which devices are saturated
iostat -x 2 5
# Key columns to watch:
# %util — >80% sustained means the device is saturated
# await — average total I/O time in ms (queue + service)
# r_await — average read time
# w_await — average write time
# aqu-sz (avgqu-sz in older sysstat) — average queue length (>1 means I/O is queuing)
# Example output showing saturated disk:
# Device r/s w/s rMB/s wMB/s await %util
# sda 0.50 850 0.01 45.2 12.3 99.8 <--- saturated
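The %util column is the one worth scripting against. A sketch that flags devices over 80% from `iostat -x`-style device rows (assumes %util is the last column, as in the default layout; feed it only the Device table, not the avg-cpu lines):

```shell
#!/bin/sh
# saturated: print "device util" for rows whose last column exceeds 80.
# Header rows are skipped automatically: "%util" + 0 evaluates to 0 in awk.
saturated() { awk 'NF >= 2 && $NF + 0 > 80 { print $1, $NF }'; }

# Example against the sample above:
printf '%s\n' \
  'Device  r/s   w/s  rMB/s wMB/s await %util' \
  'sda     0.50  850  0.01  45.2  12.3  99.8' \
  'sdb     1.20  15   0.05  0.8   1.1   3.2' \
  | saturated            # prints: sda 99.8
```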
# Step 3: Find which processes are doing the I/O
sudo iotop -o -b -n 3
# Output:
# Total DISK READ: 0.00 B/s | Total DISK WRITE: 45.23 M/s
# TID PRIO USER DISK READ DISK WRITE COMMAND
# 12345 be/4 postgres 0.00 B/s 40.12 M/s postgres: wal writer
# 6789 be/4 app 0.00 B/s 5.11 M/s java -jar app.jar
# Step 4: Deep dive with blktrace (for advanced analysis)
# Trace I/O on a specific device for 30 seconds
sudo blktrace -d /dev/sda -o - | blkparse -i - > /tmp/trace.txt &
sleep 30
kill %1
# Analyze the trace (it is already parsed text from blkparse above)
head -50 /tmp/trace.txt
# Step 5: Check filesystem for errors that might cause slowness
# ext4
sudo dumpe2fs /dev/sda1 2>/dev/null | grep -i "mount count"
sudo dumpe2fs /dev/sda1 2>/dev/null | grep -i "error"
# Step 6: Check for I/O scheduler issues
cat /sys/block/sda/queue/scheduler
# [mq-deadline] none <-- mq-deadline is good for HDDs
# [none] mq-deadline <-- none is good for NVMe/SSDs
# Change scheduler (temporary)
echo mq-deadline | sudo tee /sys/block/sda/queue/scheduler
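The active scheduler is the bracketed entry in that sysfs file; a one-line parser makes it scriptable (helper name invented):

```shell
#!/bin/sh
# current_sched: print the bracketed (active) scheduler from stdin
current_sched() { sed -n 's/.*\[\(.*\)\].*/\1/p'; }

echo '[mq-deadline] none' | current_sched    # prints: mq-deadline
# Live, for every block device:
#   for q in /sys/block/*/queue/scheduler; do
#     printf '%s: %s\n' "$q" "$(current_sched < "$q")"
#   done
```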
SMART Health Checking¶
Proactive disk health monitoring.
# Quick health check across all disks
for disk in /dev/sd?; do
echo "=== $disk ==="
sudo smartctl -H "$disk" 2>/dev/null | grep -E "result|SMART"
done
# Full SMART report for a specific disk
sudo smartctl -a /dev/sda
# Check for pre-failure indicators
sudo smartctl -A /dev/sda | awk '
/Reallocated_Sector_Ct/ && $10 > 0 { print "WARNING: " $0 }
/Current_Pending_Sector/ && $10 > 0 { print "WARNING: " $0 }
/Offline_Uncorrectable/ && $10 > 0 { print "WARNING: " $0 }
/Reported_Uncorrect/ && $10 > 0 { print "WARNING: " $0 }
'
# Run a short self-test (takes ~2 minutes)
sudo smartctl -t short /dev/sda
# Check test result after it completes
sudo smartctl -l selftest /dev/sda
# For NVMe drives
sudo smartctl -a /dev/nvme0n1
# Key NVMe health fields:
# Percentage Used: 3% (replacement at 100%)
# Available Spare: 100% (bad when low)
# Media and Data Integrity Errors: 0 (any non-zero is bad)
# Set up smartd for continuous monitoring
# Edit /etc/smartd.conf:
# /dev/sda -a -o on -S on -s (S/../.././02|L/../../6/03) -m admin@example.com
# -a = monitor all attributes
# -S on = enable auto-save
# -s = schedule: short test daily at 2am, long test Saturdays at 3am
# -m = email on failure
sudo systemctl enable --now smartd
Deep Dive: SMART Attributes That Actually Predict Failure¶
Not all SMART attributes are created equal. The Backblaze hard drive studies (100K+ drives, multi-year data) and the Google disk failure research identified which attributes actually correlate with imminent drive failure and which are noise. Knowing the difference means you replace drives before data loss instead of chasing false alarms.
High-Correlation Attributes — The Ones That Matter¶
| Attribute | ID | Why It Matters |
|---|---|---|
| Reallocated Sector Count | 5 | Sectors remapped to spare area due to read errors. The #1 predictor of failure. Any non-zero and rising value warrants investigation. A high count means the drive is running out of spare sectors. |
| Current Pending Sector Count | 197 | Sectors the drive cannot read and is waiting to remap on next write. Active problem — these are sectors with data that may already be unrecoverable. |
| Uncorrectable Sector Count (Offline_Uncorrectable) | 198 | Sectors that failed remapping entirely. Data loss is occurring. The drive could not recover these sectors even during offline scan. |
| Reallocated Event Count | 196 | Number of times a remap operation was performed. The trend matters — a rising count means the drive is actively degrading, even if the total sector count looks small. |
| UDMA CRC Error Count | 199 | CRC errors on the data cable interface. May indicate a bad SATA cable, backplane issue, or controller port problem. Fix the cable first — if errors persist after cable swap, the drive is suspect. |
| Spin Retry Count | 10 | Motor failed to spin up to speed on first attempt and had to retry. Mechanical failure signal (HDD only). Any non-zero value on a drive that previously showed zero is a red flag. |
| Reported Uncorrectable Errors | 187 | ECC (Error-Correcting Code) failures that the drive's internal error correction could not fix. These are read errors that made it past the drive's own defenses. |
Low-Value Attributes — Commonly Monitored But Poor Predictors¶
These attributes are frequently included in monitoring dashboards but the failure studies show they are unreliable predictors:
- Temperature (194): Unless extreme (>60°C sustained for HDDs), temperature correlates weakly with failure. Drives fail at comfortable temperatures all the time.
- Power-On Hours (9): Age alone is a weak predictor in the Backblaze data. Infant mortality (first 18 months) and manufacturing defects dominate. A drive with 50K hours is not necessarily closer to failure than one with 10K hours.
- Start/Stop Count (4): Number of power cycles. Irrelevant for always-on server drives. Marginally relevant for drives that are frequently powered down.
The Backblaze Rule of Thumb¶
Any non-zero value in attributes 5 (Reallocated Sector Count), 187 (Reported Uncorrectable), 197 (Current Pending Sector), or 198 (Offline Uncorrectable) warrants investigation and likely proactive replacement.
This simple rule catches the vast majority of predictable failures. Most healthy drives show zeros in all four for their entire lifespan.
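The rule is mechanical enough to script directly against `smartctl -A` output. A sketch (RAW_VALUE is column 10 of the standard attribute table; the helper name is invented, and real raw values can carry vendor suffixes this simple version ignores):

```shell
#!/bin/sh
# backblaze_check: read smartctl -A output on stdin; warn on any non-zero raw
# value in the four high-signal attributes; exit non-zero if any were found
backblaze_check() {
  awk '($1 == 5 || $1 == 187 || $1 == 197 || $1 == 198) && $10 + 0 > 0 {
         print "REPLACE-CANDIDATE: id=" $1 " " $2 " raw=" $10; bad = 1
       }
       END { exit bad }'
}

# Example with one bad attribute:
printf '%s\n' \
  '  5 Reallocated_Sector_Ct   0x0033 100 100 010 Pre-fail Always - 12' \
  '197 Current_Pending_Sector  0x0012 100 100 000 Old_age  Always - 0' \
  | backblaze_check || echo "drive needs attention"
# Live use: sudo smartctl -A /dev/sda | backblaze_check
```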
Practical: Reading smartctl -a Output¶
The output has several sections. Here is what to focus on:
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
# ^^^ FAILED here = replace immediately
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
# ^^^ RAW_VALUE (column 10) is the actual count. 0 = healthy.
# VALUE is normalized (100=best). When VALUE drops to THRESH, drive flags FAILED.
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
What to look at:
1. Overall health at the top. FAILED = stop reading, start replacing.
2. RAW_VALUE column for IDs 5, 187, 197, 198. Any non-zero = investigate.
3. TYPE column: "Pre-fail" attributes trigger the FAILED health status when VALUE hits THRESH. "Old_age" attributes are informational — the drive won't self-report failure based on these, so you must monitor them yourself.
Practical: Setting Up smartd for Automated Monitoring¶
# /etc/smartd.conf — one line per drive
# Monitor all SMART attributes, run short test daily at 2am,
# long test on Saturdays at 3am, email on any failure
/dev/sda -a -o on -S on -s (S/../.././02|L/../../6/03) -m admin@example.com -M exec /usr/share/smartmontools/smartd_warning.sh
# For all drives (wildcard):
DEVICESCAN -a -o on -S on -s (S/../.././02|L/../../6/03) -m admin@example.com
# Enable and start
sudo systemctl enable --now smartd
The -M exec option lets you run a custom script on failure — useful for sending alerts to Slack, PagerDuty, or a monitoring system instead of email.
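A minimal sketch of such a hook, assuming smartd's documented environment variables (`SMARTD_DEVICE`, `SMARTD_MESSAGE`, per smartd.conf(5)) and a webhook URL of your own:

```shell
#!/bin/sh
# Hypothetical smartd -M exec hook: format the failure as a JSON payload.
# smartd exports SMARTD_DEVICE / SMARTD_MESSAGE to the script's environment.
payload() { printf '{"text": "SMART failure on %s: %s"}' "$1" "$2"; }

payload "${SMARTD_DEVICE:-/dev/sda}" "${SMARTD_MESSAGE:-test message}"
# Ship it with e.g.:
#   curl -fsS -X POST -H 'Content-Type: application/json' \
#        -d "$(payload "$SMARTD_DEVICE" "$SMARTD_MESSAGE")" "$WEBHOOK_URL"
```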
NVMe: Different Attributes, Same Principles¶
NVMe drives use a completely different health reporting structure. There are no SMART attribute IDs — instead, NVMe exposes a health log page.
Key NVMe health fields:
| Field | What It Means | Action Threshold |
|---|---|---|
| Critical Warning | Bitmask of active warnings (spare low, temperature, reliability degraded, read-only, volatile backup failed) | Any non-zero bit = investigate immediately |
| Media and Data Integrity Errors | Unrecovered data integrity errors | Any non-zero = data corruption risk, plan replacement |
| Available Spare | Remaining spare capacity as percentage | Below Available Spare Threshold = replacement due |
| Percentage Used | Estimated drive life consumed based on actual writes vs rated endurance | Approaching 100% = endurance limit, plan replacement |
| Error Information Log Entries | Count of error log entries | Rising count without corresponding host-side I/O errors may indicate firmware issues |
NVMe drives are more straightforward to monitor because the fields are standardized and self-explanatory, unlike SATA SMART where attribute names and meanings vary by vendor.
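Because the fields are standardized, extracting them is straightforward. A sketch that pulls Percentage Used out of `smartctl -a` NVMe output (the helper name is invented):

```shell
#!/bin/sh
# pct_used: print the numeric Percentage Used value from smartctl NVMe output
pct_used() { sed -n 's/^Percentage Used:[[:space:]]*\([0-9][0-9]*\)%.*/\1/p'; }

echo 'Percentage Used:                    3%' | pct_used    # prints: 3
# Alerting use: [ "$(sudo smartctl -a /dev/nvme0n1 | pct_used)" -ge 90 ] && alert
```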
Recovering from Read-Only Filesystem¶
The filesystem went read-only — usually due to I/O errors or filesystem corruption.
# Step 1: Confirm the filesystem is read-only
mount | grep '(ro[,)]'   # match the leading ro option, not errors=remount-ro
touch /data/testfile
# touch: cannot touch '/data/testfile': Read-only file system
# Step 2: Check dmesg for the reason
dmesg | grep -i "error\|readonly\|read-only\|ext4\|xfs" | tail -20
# Common causes:
# EXT4-fs error: remounting filesystem read-only
# Buffer I/O error on dev sdb1, logical block 12345
# sd 2:0:0:0: [sdb] Sense Key : Medium Error
# Step 3: If caused by disk errors, check SMART
sudo smartctl -H /dev/sdb
sudo smartctl -A /dev/sdb | grep -i reallocated
# Step 4: Try remounting read-write
sudo mount -o remount,rw /data
# If that fails, the filesystem needs repair
# Step 5: Unmount and repair
sudo umount /data
# If umount fails (device busy):
sudo fuser -mv /data
# Kill processes or:
sudo umount -l /data
# For ext4:
sudo e2fsck -f /dev/sdb1
# Answer 'y' to fix errors, or use -y for auto-yes
# For XFS:
sudo xfs_repair /dev/sdb1
# Step 6: Remount
sudo mount /data
# Step 7: If the underlying disk is failing, replace it
# (see "Disk Replacement in RAID" or plan migration to new disk)
Partition Table Repair¶
The partition table is damaged but data may still be intact.
# Step 1: Do not write anything to the disk. Assess the damage.
sudo fdisk -l /dev/sdb
# If you see "doesn't contain a valid partition table" — table is gone
# Step 2: Use testdisk to scan and recover partitions
sudo testdisk /dev/sdb
# Select disk → Analyse → Quick Search
# testdisk will find lost partitions by scanning for filesystem signatures
# Review the detected partitions and write the recovered table
# Step 3: For GPT disks, the backup table may save you
sudo gdisk /dev/sdb
# If the primary GPT is damaged, gdisk will offer to use the backup GPT
# Commands: v (verify), p (print), w (write recovered table)
# Step 4: After recovery, immediately back up the partition table
# MBR:
sudo sfdisk -d /dev/sdb > /root/sdb-partition-backup.txt
# GPT:
sudo sgdisk --backup=/root/sdb-gpt-backup.bin /dev/sdb
# To restore a backed-up GPT table:
sudo sgdisk --load-backup=/root/sdb-gpt-backup.bin /dev/sdb
Cloud Disk Operations¶
Common operations on cloud-attached volumes (AWS EBS, GCP PD, Azure Disk).
# AWS: Extend an EBS volume (API side)
aws ec2 modify-volume --volume-id vol-0123456789 --size 500
# Wait for modification to complete
aws ec2 describe-volumes-modifications --volume-id vol-0123456789
# Then on the instance — grow the partition and filesystem
# If partition exists:
sudo growpart /dev/xvdf 1
# Resize filesystem
sudo resize2fs /dev/xvdf1 # ext4
sudo xfs_growfs /mount/point # XFS
# GCP: Resize a persistent disk
gcloud compute disks resize my-disk --size=500GB --zone=us-central1-a
# Then on the instance:
sudo growpart /dev/sda 1
sudo resize2fs /dev/sda1
# Verify
df -h
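The resize step differs by filesystem (device path for ext4, mount point for XFS), which is easy to get wrong under pressure. A dispatch sketch (helper name invented):

```shell
#!/bin/sh
# grow_cmd FSTYPE DEVICE MOUNTPOINT: print the right filesystem-grow command.
# ext4 resizes by device node; XFS resizes by mount point.
grow_cmd() {
  case "$1" in
    ext2|ext3|ext4) echo "resize2fs $2" ;;
    xfs)            echo "xfs_growfs $3" ;;
    *)              echo "unsupported fs: $1" >&2; return 1 ;;
  esac
}

grow_cmd ext4 /dev/sda1 /data    # prints: resize2fs /dev/sda1
grow_cmd xfs  /dev/sda1 /data    # prints: xfs_growfs /data
# Detect the type first: grow_cmd "$(findmnt -no FSTYPE /data)" /dev/sda1 /data
```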
Disk Performance Benchmarking¶
Quick tests to establish baseline disk performance.
# Sequential write test (1GB)
dd if=/dev/zero of=/data/testfile bs=1M count=1024 conv=fdatasync
# 1073741824 bytes (1.1 GB) copied, 2.45 s, 438 MB/s
# Sequential read test (drop caches first)
echo 3 | sudo tee /proc/sys/vm/drop_caches
dd if=/data/testfile of=/dev/null bs=1M
# 1073741824 bytes (1.1 GB) copied, 0.31 s, 3.5 GB/s
# Random I/O test with fio (install: apt install fio)
# Random read IOPS
fio --name=randread --ioengine=libaio --direct=1 --bs=4k \
--iodepth=32 --size=1G --rw=randread --filename=/data/testfile
# Random write IOPS
fio --name=randwrite --ioengine=libaio --direct=1 --bs=4k \
--iodepth=32 --size=1G --rw=randwrite --filename=/data/testfile
# Mixed 70/30 read-write (simulates database workload)
fio --name=mixed --ioengine=libaio --direct=1 --bs=4k \
--iodepth=32 --size=1G --rw=randrw --rwmixread=70 \
--filename=/data/testfile
# Clean up
rm /data/testfile
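dd reports decimal MB/s; recomputing the figure from bytes and seconds is handy when comparing runs or parsing dd output programmatically. A small sketch:

```shell
#!/bin/sh
# mbps BYTES SECONDS: throughput in decimal MB/s, the unit dd reports
mbps() { awk -v b="$1" -v s="$2" 'BEGIN { printf "%.0f MB/s\n", b / s / 1000000 }'; }

mbps 1073741824 2.45    # prints: 438 MB/s  (matches the dd run above)
```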
Hardware RAID Operations (MegaCLI/storcli)¶
storcli Basics (Broadcom/LSI/Avago MegaRAID)¶
storcli /c0 show # controller summary
storcli /c0/vall show # all virtual drives
storcli /c0/eALL/sALL show # all physical drives
storcli /c0/eALL/sALL show rebuild # rebuild progress
MegaCLI (Older Syntax)¶
MegaCli -AdpAllInfo -aALL # adapter info
MegaCli -LDInfo -Lall -aALL # logical drives
MegaCli -PDList -aALL # physical drives
MegaCli -PDRbld -ShowProg -PhysDrv [E:S] -aALL # rebuild progress
HP Smart Array (ssacli)¶
ssacli ctrl all show status # controller, cache, battery health
ssacli ctrl all show config # arrays and drives at a glance
ssacli ctrl slot=0 pd all show status # physical drives
ssacli ctrl slot=0 ld all show # logical drives
Drive States¶
- Online: Healthy, in array
- Unconfigured Good: Healthy, available as spare
- Predictive Failure: SMART predicts failure -- replace proactively
- Failed: Array is degraded
- Hot Spare: Auto-replacement for failed drive
RAID Degraded Response Workflow¶
1. Alert fires: array degraded
2. Identify which disk failed:
- storcli /c0/eALL/sALL show
- cat /proc/mdstat
- dmesg | grep -i "error\|fault"
3. Verify hot spare status — did auto-rebuild start?
4. If no auto-rebuild: identify replacement disk
5. Verify correct slot (use LED identify):
storcli /c0/e32/s3 start locate
6. Hot-swap the disk
7. Monitor rebuild:
watch cat /proc/mdstat
storcli /c0/eALL/sALL show rebuild
8. Do NOT reboot during rebuild unless critical
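For software RAID, the degraded state in step 2 shows up as an underscore inside the [UU] status string in /proc/mdstat, which makes for a cheap alert check. A sketch (helper name invented):

```shell
#!/bin/sh
# md_degraded: succeed (and print the line) when mdstat input shows a missing
# member -- an "_" inside the [UU...] status brackets
md_degraded() { grep -E '\[[U_]*_[U_]*\]'; }

printf '%s\n' 'md0 : active raid1 sda1[0] sdb1[2]' \
              '      488253440 blocks super 1.2 [2/1] [U_]' \
  | md_degraded && echo "ALERT: array degraded"
# Healthy arrays ([UU]) produce no match:
echo '      488253440 blocks super 1.2 [2/2] [UU]' | md_degraded || echo "healthy"
# Live use: md_degraded < /proc/mdstat
```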