---
title: Linux Storage — LVM, Filesystems, and Beyond
tags:
  - lesson
  - lvm
  - ext4
  - xfs
  - zfs
  - btrfs
  - nfs
  - iscsi
  - smart
  - i/o-performance
  - kubernetes-storage
  - device-mapper
---

# Linux Storage — LVM, Filesystems, and Beyond

Topics: LVM, ext4, XFS, ZFS, Btrfs, NFS, iSCSI, SMART, I/O performance, Kubernetes storage, device mapper
Level: L1-L2 (Foundations to Operations)
Time: 90-120 minutes
Prerequisites: None (everything explained from scratch)
## The Mission
It is Thursday afternoon. Monitoring fires a warning: /data on db-prod-03 is at 95%
utilization. This is the PostgreSQL data volume — a roughly 1TB logical volume on LVM,
formatted XFS, backed by two physical disks in a volume group. The database writes roughly
2GB/day, so the ~50GB of remaining headroom buys you three to four weeks — less if a bulk
load or log spike lands. The clock is ticking.
Your job: extend the volume without any downtime. Along the way, you will trace the entire Linux storage stack from block devices to Kubernetes PVCs, because this one operation touches all of it.
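Before touching anything, it is worth sanity-checking that estimate. A minimal sketch, using the 998G size that `lvs` reports later in this lesson and the quoted 2GB/day write rate (both assumptions — substitute your own `df` and monitoring figures):

```shell
#!/bin/sh
# Days of headroom = free space / daily write rate.
size_gb=998        # LV size (from lvs)
used_pct=95        # from df / monitoring
daily_write_gb=2   # observed write rate

free_gb=$(( size_gb * (100 - used_pct) / 100 ))
days=$(( free_gb / daily_write_gb ))
echo "free: ${free_gb}GB, days until full: ~${days}"
```

Integer arithmetic is deliberately pessimistic here; it rounds the headroom down, which is the direction you want to err in.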
## Part 1: The Storage Stack — What Sits Under Your Data
Before touching anything, understand what you are looking at. Run this on any Linux box and you will see the layers:
```shell
lsblk
# NAME              MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
# sda                 8:0    0  500G  0 disk
# |-sda1              8:1    0    1G  0 part /boot
# |-sda2              8:2    0  499G  0 part
#   |-vg_data-lv_pg 253:0    0  998G  0 lvm  /data
# sdb                 8:16   0  500G  0 disk
# |-sdb1              8:17   0  500G  0 part
#   |-vg_data-lv_pg 253:0    0  998G  0 lvm  /data
```
That output tells a story. Here is the full stack, from application down to hardware:
```
Application (PostgreSQL)
  writes to /data/pgdata/
        |
XFS filesystem
  translates file writes to block I/O
        |
Logical Volume (lv_pg)
  spans across two PVs via device mapper
        |
Volume Group (vg_data)
  pools two physical volumes
        |
Physical Volumes (sda2, sdb1)
  LVM metadata on raw partitions
        |
Block Devices (sda, sdb)
  SCSI disks exposed by the kernel
        |
Physical Disks
  SATA/SAS drives in slots 0 and 1
```
Every layer adds capability. LVM lets you resize without unmounting. The filesystem translates byte offsets to block addresses. The device mapper stitches multiple disks into one logical device. Understanding where you are in this stack is the difference between a 30-second fix and a 3-hour outage.
Name Origin: The `sd` in `/dev/sda` stands for SCSI Disk. The letter suffix (a, b, c) is assigned in discovery order. NVMe drives use `/dev/nvme0n1` — controller 0, namespace 1. The namespace concept exists because NVMe supports multiple virtual drives per controller, though most consumer/server SSDs expose just one.
## Part 2: LVM Deep Dive — The Mission Begins
### Step 1: Assess the Situation
Before extending anything, see exactly what you have:
```shell
# Physical volumes — what raw disks feed LVM?
sudo pvs
# PV         VG      Fmt  Attr PSize   PFree
# /dev/sda2  vg_data lvm2 a--  499.00g    0
# /dev/sdb1  vg_data lvm2 a--  499.00g    0

# Volume group — what's the pool look like?
sudo vgs
# VG      #PV #LV #SN Attr   VSize   VFree
# vg_data   2   1   0 wz--n- 998.00g    0

# Logical volumes — what slices exist?
sudo lvs
# LV    VG      Attr       LSize   Pool Origin Data% Meta%
# lv_pg vg_data -wi-ao---- 998.00g
```
Zero free space in the VG. The logical volume consumes everything. You cannot extend lv_pg
until you add more physical storage to vg_data.
Remember: The LVM stack is PVG — Physical volumes pour into Volume Groups, which you carve into Logical volumes. Think of it like a swimming pool: PVs are the water sources, the VG is the pool, and LVs are the lanes.
### Step 2: Add a New Disk
A new 1TB disk has been attached (hot-added in the RAID controller, or a new EBS volume in the cloud). First, verify it appeared:
```shell
# Rescan the SCSI bus (for hot-added disks on VMs or bare metal)
echo "- - -" | sudo tee /sys/class/scsi_host/host*/scan

# Verify
lsblk
# ...
# sdc   8:32   0 1000G  0 disk     <-- new disk, no partitions
```
Confirm you have the right disk. This matters more than you think:
```shell
# Cross-reference size, model, serial, and mount status
lsblk -o NAME,SIZE,MODEL,SERIAL,MOUNTPOINT
# sdc   1000G  SAMSUNG_MZ7LH1T0  S4EWNX0T123456   (no mountpoint)
```
Gotcha: Device names shift. The disk that was `/dev/sdc` yesterday might be `/dev/sdd` today if a disk was added or removed. Always verify by serial number or `/dev/disk/by-id/` before running destructive operations. Partitioning the wrong disk is one of the most common ways to destroy production data.
### Step 3: Initialize and Extend
```shell
# Create a physical volume on the new disk (no partition needed for LVM)
sudo pvcreate /dev/sdc

# Add it to the existing volume group
sudo vgextend vg_data /dev/sdc

# Verify — the VG now has free space
sudo vgs
# VG      #PV #LV #SN Attr   VSize    VFree
# vg_data   3   1   0 wz--n- 1998.00g 1000.00g
```
### Step 4: Extend the Logical Volume
```shell
# Add 500GB to the logical volume
sudo lvextend -L +500G /dev/vg_data/lv_pg

# Or use all free space
sudo lvextend -l +100%FREE /dev/vg_data/lv_pg
```
### Step 5: Grow the Filesystem
This is where people get tripped up. The LV is bigger, but the filesystem does not know yet — `df -h` still shows the old size.
```shell
# XFS — grow by mount point (online, no unmount)
sudo xfs_growfs /data

# Verify
df -h /data
# Filesystem                 Size  Used Avail Use% Mounted on
# /dev/mapper/vg_data-lv_pg  1.5T  948G  550G  64% /data
```
Done. No downtime. PostgreSQL never noticed.
Under the Hood: `xfs_growfs` tells the XFS filesystem to re-read the device size and extend its allocation group structure to cover the new space. The filesystem is online the entire time — writes continue during the grow. This works because XFS journals metadata changes and the grow operation is itself journaled.
The one-liner shortcut for when you are in a hurry:
```shell
# Extend LV and resize filesystem in a single command
sudo lvextend -L +500G --resizefs /dev/vg_data/lv_pg
```
The --resizefs flag (or -r) detects the filesystem type and calls the right resize tool
automatically. This is the command you will use 90% of the time in production.
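The whole mission fits in a small runbook. This is a sketch, not a drop-in script — it echoes each command instead of executing it (dry-run), and the device, VG, and LV names are this lesson's examples, not universals:

```shell
#!/bin/sh
# Dry-run LVM extension runbook. Prints the commands it would run.
# DISK, VG, and LV are this lesson's example names — substitute your own.
DISK=/dev/sdc
VG=vg_data
LV=lv_pg

run() { echo "WOULD RUN: $*"; }   # swap the echo for "$@" (plus sudo) to go live

run pvcreate "$DISK"                                  # 1. initialize the new disk
run vgextend "$VG" "$DISK"                            # 2. add it to the pool
run lvextend -l +100%FREE --resizefs "/dev/$VG/$LV"   # 3. grow LV + filesystem
```

Keeping destructive operations behind a dry-run flag is a cheap habit that pairs well with the serial-number check from Step 2.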
### Flashcard Check: LVM Basics
| Question | Answer |
|---|---|
| What are the three layers of LVM? | PV (Physical Volume) -> VG (Volume Group) -> LV (Logical Volume) |
| How do you add a disk to an existing VG? | `pvcreate /dev/sdX` then `vgextend vg_name /dev/sdX` |
| What command extends an LV and resizes the filesystem in one step? | `lvextend -L +SIZE --resizefs /dev/vg/lv` |
| What does `pvs && vgs && lvs` show you? | A quick snapshot of all PVs, VGs, and LVs on the system |
| Why run `lsblk -o NAME,SIZE,SERIAL` before `pvcreate`? | To verify you are operating on the correct disk |
## Part 3: Filesystems — Choosing Your Weapon
You extended the XFS volume. But why XFS? What if it had been ext4, or ZFS, or Btrfs? Each filesystem is a different set of tradeoffs. Here is when to reach for each one.
### The Comparison Table
| Feature | ext4 | XFS | ZFS | Btrfs |
|---|---|---|---|---|
| Max volume | 1 EiB | 8 EiB | 256 ZiB | 16 EiB |
| Online grow | Yes | Yes | Yes | Yes |
| Online shrink | No (offline only) | No. Ever. | No | Yes |
| Checksums | Metadata only | Metadata only | Data + metadata | Data + metadata |
| Built-in snapshots | No (use LVM) | No (use LVM) | Yes | Yes |
| Compression | No | No | Yes (lz4, zstd) | Yes (zstd, lzo) |
| Best for | General purpose | Large files, RHEL | Data integrity | Flexible storage |
### ext4: The Reliable Default
ext4 has been the Linux default since 2008. It evolved from ext3 (2001), ext2 (1993), and the original ext filesystem (1992, by Rémy Card). When in doubt, use ext4.
Resize operations:
```shell
# Grow (online, mounted)
sudo resize2fs /dev/vg_data/lv_app

# Shrink (offline only — must unmount first!)
sudo umount /app
sudo e2fsck -f /dev/vg_data/lv_app       # required before shrink
sudo resize2fs /dev/vg_data/lv_app 50G
sudo lvreduce -L 50G /dev/vg_data/lv_app
sudo mount /app
# Safer variant: shrink the fs slightly below the target, lvreduce,
# then run resize2fs once more to grow the fs to fill the LV exactly.
```
ext4 reserves 5% of space for root by default. On a 1TB data volume, that is 50GB of wasted space. Reduce it:
```shell
# Check current reservation
sudo tune2fs -l /dev/sdb1 | grep "Reserved block count"

# Set to 1% on data volumes (not root filesystem)
sudo tune2fs -m 1 /dev/sdb1
```
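The savings are plain arithmetic. A quick check using the 1TB example above (sizes are illustrative):

```shell
#!/bin/sh
# Reserved space at the 5% default vs a 1% setting on a 1000GB volume.
size_gb=1000
echo "5% reserve: $(( size_gb * 5 / 100 ))GB"
echo "1% reserve: $(( size_gb * 1 / 100 ))GB"
echo "reclaimed:  $(( size_gb * 4 / 100 ))GB"
```

On the 1TB volume that is 40GB handed back to the application for one `tune2fs` call.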
### XFS: The Enterprise Workhorse
XFS was created by Silicon Graphics (SGI) in 1993 for IRIX, their Unix workstation OS. It was ported to Linux in 2001 and became the default in RHEL 7 (2014). Red Hat chose it for its superior performance with large files and parallel I/O.
Gotcha: XFS cannot shrink. Not online, not offline, not ever. There is no `xfs_shrinkfs` command — it does not exist. If you allocated too much space to an XFS volume and need it back, your only option is: create a new smaller volume, copy the data with `rsync -avHAX`, and swap the mounts. This is why you should start small and grow as needed.
Key XFS tools:
```shell
# Filesystem info
sudo xfs_info /data
# meta-data=/dev/mapper/vg_data-lv_pg isize=512  agcount=32, agsize=8192000 blks
#          =                          sectsz=512 attr=2, projid32bit=1
# data     =                          bsize=4096 blocks=262144000, imaxpct=5
# naming   =version 2                 bsize=4096 ascii-ci=0, ftype=1
# log      =internal log              bsize=4096 blocks=128000, version=2
# realtime =none                      extsz=4096 blocks=0, rtextents=0

# Repair (must unmount first)
sudo umount /data
sudo xfs_repair /dev/vg_data/lv_pg

# Defragment (online)
sudo xfs_fsr /data

# Grow (online)
sudo xfs_growfs /data
```
Trivia: The `ag` in XFS output stands for "allocation group." XFS splits the filesystem into allocation groups that can be managed independently, which is why it handles parallel I/O so well: multiple threads writing to different parts of the filesystem do not contend with each other. This design was revolutionary in 1993 and remains one of XFS's key advantages for database and media workloads.
### ZFS: The Data Integrity Champion
ZFS was created by Sun Microsystems (Jeff Bonwick and team) in 2005. The name originally stood for "Zettabyte File System." It is not in the mainline Linux kernel due to license incompatibility (CDDL vs GPL), so you install it via OpenZFS.
ZFS combines the volume manager and filesystem into one layer. There is no separate LVM — the pool is the volume manager.
```shell
# Create a mirrored pool (RAID1 equivalent)
sudo zpool create datapool mirror /dev/sdb /dev/sdc

# Create a RAIDZ pool (RAID5 equivalent, single parity)
sudo zpool create datapool raidz /dev/sdb /dev/sdc /dev/sdd

# Create a dataset (like a subvolume — no fixed size, shares pool space)
sudo zfs create datapool/postgres
sudo zfs create datapool/backups

# Enable compression (lz4 is fast, zstd gives better ratio)
sudo zfs set compression=lz4 datapool/postgres

# Snapshot — instant, zero-cost at creation
sudo zfs snapshot datapool/postgres@before-migration

# Rollback — instant return to snapshot state
sudo zfs rollback datapool/postgres@before-migration

# Send/receive — incremental replication to another host
sudo zfs send datapool/postgres@snap1 | ssh backup-host zfs receive backuppool/postgres

# Pool health
sudo zpool status
#   pool: datapool
#  state: ONLINE
#   scan: scrub repaired 0B in 01:23:45 with 0 errors
# config:
#         NAME        STATE  READ WRITE CKSUM
#         datapool    ONLINE    0     0     0
#           mirror-0  ONLINE    0     0     0
#             sdb     ONLINE    0     0     0
#             sdc     ONLINE    0     0     0
```
Under the Hood: ZFS checksums every block (data and metadata) using a Merkle tree. On every read, it verifies the checksum. If the checksum fails and the pool has redundancy (mirror or raidz), ZFS automatically repairs the corrupted block from a good copy. This catches bit rot, firmware bugs, and phantom writes — corruption modes that traditional RAID cannot detect. CERN's 2007 study found silent data corruption at roughly 1 bit flip per 10TB per year. Jeff Bonwick called bit rot "the silent killer of data."
ZFS vs LVM — when to use which:
| Scenario | Use ZFS | Use LVM |
|---|---|---|
| Data integrity is paramount | Yes — end-to-end checksums | No checksums on data |
| You need built-in compression | Yes — lz4 or zstd | Need filesystem-level (Btrfs) |
| RHEL/CentOS environment | Harder — not in mainline kernel | Native, well-supported |
| Kubernetes storage (Portworx, etc.) | Possible but uncommon | Common backing store |
| Simple volume extension | Pool-based, different workflow | lvextend --resizefs |
| Snapshots for backups | Instant, incremental send/receive | COW snapshots, but degrade I/O |
| You need to shrink a volume | Cannot shrink pools | ext4 can shrink offline |
Gotcha: ZFS uses the ARC (Adaptive Replacement Cache), which lives in RAM. On a system with 128GB RAM, ZFS might consume 80GB+ for ARC. This is by design and the memory is reclaimable, but it can surprise monitoring tools and trigger false OOM alerts. Tune with `zfs_arc_max` in `/etc/modprobe.d/zfs.conf`.
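`zfs_arc_max` takes a value in bytes. A sketch of capping ARC at 8GiB (the cap itself is an arbitrary illustration — size it for your workload):

```shell
#!/bin/sh
# Compute 8 GiB in bytes and emit the modprobe option line for zfs.conf.
arc_bytes=$(( 8 * 1024 * 1024 * 1024 ))
echo "options zfs zfs_arc_max=${arc_bytes}"
# Write that line to /etc/modprobe.d/zfs.conf and reboot (or rebuild the
# initramfs); it can also be changed live via
# /sys/module/zfs/parameters/zfs_arc_max.
```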
### Btrfs: The Middle Ground
Btrfs (pronounced "butter-FS" or "better-FS") was started by Oracle in 2007. Meta (Facebook) runs it at massive scale in production, and it is the default on openSUSE and Fedora Workstation.
Key advantage: it is the only mainstream Linux filesystem that can shrink online. Key risk: its RAID 5/6 implementation has a documented write hole and is not recommended for production.
```shell
# Create a Btrfs filesystem
sudo mkfs.btrfs /dev/sdb1

# Mount with compression
sudo mount -o compress=zstd /dev/sdb1 /data

# Take a snapshot
sudo btrfs subvolume snapshot /data /data/.snapshots/2026-03-23

# Check integrity (online scrub)
sudo btrfs scrub start /data
sudo btrfs scrub status /data

# Filesystem usage (more accurate than df for CoW filesystems)
sudo btrfs filesystem usage /data
```
### Flashcard Check: Filesystem Showdown
| Question | Answer |
|---|---|
| Can you shrink an XFS filesystem? | No. Not online, not offline. Create a new volume and migrate. |
| Which filesystem checksums both data and metadata? | ZFS and Btrfs. ext4 and XFS only checksum metadata. |
| What was ext4's predecessor chain? | ext (1992) -> ext2 (1993) -> ext3 (2001) -> ext4 (2008) |
| Why did RHEL choose XFS as default? | Superior large-file performance and parallel I/O from its allocation group design |
| What is ZFS ARC? | Adaptive Replacement Cache — ZFS's in-RAM read cache. Can consume most of available memory. |
## Part 4: NFS and iSCSI — When Storage Lives on the Network
Your PostgreSQL server has local disks. But the application servers mount their shared config and uploads directory from an NFS server. And the SAN-backed volumes come over iSCSI. Both have their own failure modes.
### NFS: Shared Files Over the Network
NFS lets multiple clients mount the same directory simultaneously. It operates at the file level — clients see files and directories, not raw blocks.
Server setup (quick version):
```shell
# Install NFS server
sudo dnf install nfs-utils            # RHEL/Rocky
sudo apt install nfs-kernel-server    # Debian/Ubuntu

# Export a directory
echo '/exports/shared 10.0.1.0/24(rw,sync,no_subtree_check)' | sudo tee -a /etc/exports

# Apply and start
sudo exportfs -ra
sudo systemctl enable --now nfs-server
```
Gotcha: Watch the space. In `/etc/exports`, `10.0.1.0/24(rw)` means the 10.0.1.0/24 network gets read-write access. But `10.0.1.0/24 (rw)` (with a space before the parenthesis) means the 10.0.1.0/24 network gets default (read-only) access, and everyone else gets read-write. One space. Completely different security posture. This has bitten countless admins.
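You can see why mechanically: exports lines are whitespace-separated fields, so the spaced version parses as two client specs. A quick illustration — pure text processing, no NFS server required (the `/srv` path is just an example):

```shell
#!/bin/sh
# Field-count the two export lines. "host(options)" is ONE field;
# "host (options)" is TWO — and the bare "(options)" becomes a world export.
printf '%s\n' '/srv 10.0.1.0/24(rw)' '/srv 10.0.1.0/24 (rw)' |
while read -r line; do
  set -- $line   # unquoted on purpose: split on whitespace like exportfs does
  echo "$line -> $(( $# - 1 )) client spec(s)"
done
```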
Client mount with sane options:
```shell
# Manual mount
sudo mount -t nfs -o hard,timeo=300,retrans=3 nfs-srv:/exports/shared /mnt/shared

# fstab entry — note _netdev and nofail
nfs-srv:/exports/shared  /mnt/shared  nfs  hard,timeo=300,retrans=3,_netdev,nofail  0 0
```
NFSv3 vs NFSv4:
| Feature | NFSv3 | NFSv4 |
|---|---|---|
| Protocol | Stateless, UDP or TCP | Stateful, TCP only |
| Ports | 2049 + rpcbind (111) + dynamic mountd | 2049 only |
| Security | IP-based, basic AUTH_SYS | Kerberos (RPCSEC_GSS) |
| Firewall | Painful (multiple ports) | Simple (one port) |
| Performance | Proven | Better with compound operations, pNFS |
NFS performance tuning:
```shell
# Set read/write block size explicitly (defaults and maximums vary by kernel)
mount -o rsize=1048576,wsize=1048576 nfs-srv:/share /mnt/share

# Check NFS statistics for retransmissions and timeouts
nfsstat -c
# Look for high retrans values — indicates network or server issues

# Use NFSv4.1 for parallel NFS (pNFS) on high-throughput workloads
mount -t nfs -o vers=4.1 nfs-srv:/share /mnt/share
```
### iSCSI: Block Devices Over the Network
iSCSI exposes raw block devices over TCP/IP. The client (initiator) sees a local disk. Unlike NFS, only one client at a time can safely write to an iSCSI LUN (unless using a clustered filesystem).
Key vocabulary:
| Term | Meaning |
|---|---|
| Target | The server exposing storage (SAN array, Linux targetcli) |
| Initiator | The client consuming storage (your server) |
| LUN | Logical Unit Number — a specific volume on the target |
| IQN | iSCSI Qualified Name — unique identifier (e.g., iqn.2024-01.com.example:storage) |
| Portal | IP:port pair where the target listens (default port 3260) |
Initiator workflow:
```shell
# Discover targets on a portal
sudo iscsiadm -m discovery -t sendtargets -p 10.0.1.100

# Log in to a target
sudo iscsiadm -m node -T iqn.2024-01.com.example:storage -p 10.0.1.100 --login

# The LUN appears as a new block device
lsblk
# sdd    8:48   0   50G  0 disk     <-- this is the iSCSI LUN

# Partition, format, mount as normal
sudo mkfs.xfs /dev/sdd
sudo mount /dev/sdd /mnt/iscsi

# Make login persistent across reboots
sudo iscsiadm -m node -T iqn.2024-01.com.example:storage -p 10.0.1.100 \
  --op update -n node.startup -v automatic
```
### Multipath: Why It Matters
War Story: A storage admin configured an iSCSI target with two network paths for redundancy — 10.0.1.100 and 10.0.2.100. The initiator discovered both paths, and the OS presented the same LUN as two separate block devices: `/dev/sdd` and `/dev/sde`. Without multipath, the admin did not realize these were the same physical disk. They formatted `/dev/sdd` with XFS and mounted it. A junior admin later saw the "unused" `/dev/sde` and formatted that with ext4. The conflicting writes corrupted the SAN LUN. Recovery took 14 hours, and the database restore lost 6 hours of transactions.

The fix was simple: install and configure `multipath-tools` before connecting iSCSI paths. Multipath combines the two device paths into a single `/dev/mapper/mpath0` device and handles failover automatically.
```shell
# Install multipath tools
sudo dnf install device-mapper-multipath

# Generate default config
sudo mpathconf --enable

# View multipath topology
sudo multipath -ll
# mpath0 (360000000000000001) dm-3 ATA,VBOX HARDDISK
# size=50G features='0' hwhandler='0' wp=rw
# |-+- policy='service-time 0' prio=1 status=active
# | `- 3:0:0:1 sdd 8:48 active ready running
# `-+- policy='service-time 0' prio=1 status=enabled
#   `- 4:0:0:1 sde 8:64 active ready running

# Use /dev/mapper/mpath0 — never the raw sd devices
sudo mkfs.xfs /dev/mapper/mpath0
```
Mental Model: NFS = file-level, many readers/writers, like a shared Google Doc. iSCSI = block-level, single writer, like plugging in a USB drive over the network. Choose NFS when multiple servers need the same files. Choose iSCSI when you need raw block performance (databases, VMs) and single-host access.
## Part 5: SMART Monitoring — Predicting Disk Failure
The volume extension is done. But what about the disks themselves? Disks fail. The question is whether you find out before or after data is lost.
### Reading smartctl Output
The output of `sudo smartctl -A /dev/sdX` has dozens of attributes. Most are noise. Focus on four:
```
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033 100   100   010    Pre-fail Always  0
187 Reported_Uncorrect      0x0032 100   100   000    Old_age  Always  0
197 Current_Pending_Sector  0x0012 100   100   000    Old_age  Always  0
198 Offline_Uncorrectable   0x0010 100   100   000    Old_age  Offline 0
```
Remember: The Backblaze rule of thumb (from 250,000+ drives): Any non-zero value in attributes 5, 187, 197, or 198 warrants investigation and likely proactive replacement. Mnemonic: 5, 187, 197, 198 — the four horsemen of disk failure. Most healthy drives show zeros in all four for their entire lifespan.
What each one means:
| Attribute | ID | What It Means | Non-Zero Action |
|---|---|---|---|
| Reallocated Sector Count | 5 | Bad sectors remapped to spare area | Rising = disk degrading. Plan replacement. |
| Reported Uncorrectable | 187 | ECC failures the drive could not fix | Read errors getting past internal defenses |
| Current Pending Sector | 197 | Sectors that failed reads, awaiting remap | Active problem — data at risk |
| Offline Uncorrectable | 198 | Sectors that failed remap entirely | Data loss occurring |
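The four-horsemen check is easy to script. A sketch that filters `smartctl -A`-style output for attributes 5, 187, 197, and 198 with a non-zero raw value — fed here from a hard-coded sample rather than a real drive (column layout assumed to match smartctl's attribute table, raw value last):

```shell
#!/bin/sh
# Flag SMART attributes 5, 187, 197, 198 when RAW_VALUE (last column) != 0.
# The heredoc stands in for: sudo smartctl -A /dev/sdX
cat <<'EOF' |
  5 Reallocated_Sector_Ct   0x0033 100 100 010 Pre-fail Always - 12
187 Reported_Uncorrect      0x0032 100 100 000 Old_age  Always - 0
197 Current_Pending_Sector  0x0012 100 100 000 Old_age  Always - 3
198 Offline_Uncorrectable   0x0010 100 100 000 Old_age  Offline - 0
EOF
awk '$1 ~ /^(5|187|197|198)$/ && $NF+0 != 0 {
  print "ALERT: attribute " $1 " (" $2 ") raw=" $NF
}'
```

For the sample above it flags attributes 5 and 197 — exactly the "investigate and likely replace" case from the Backblaze rule.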
What about temperature and power-on hours? The Google disk failure study (2007, 100K+ drives) and Backblaze data both confirmed: temperature (unless extreme >60C), power-on hours, and start/stop count are poor predictors. Old drives are not more likely to fail. Do not waste alert capacity on these.
### NVMe Health Monitoring
NVMe drives use different reporting. No attribute IDs — instead, a standardized health log:
| Field | Meaning | Action |
|---|---|---|
| Critical Warning | Bitmask of active warnings | Any non-zero bit = investigate now |
| Available Spare | Remaining spare capacity (%) | Below threshold = replacement due |
| Percentage Used | Endurance consumed | Approaching 100% = plan replacement |
| Media and Data Integrity Errors | Unrecovered data errors | Any non-zero = corruption risk |
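With `nvme-cli` installed, `sudo nvme smart-log /dev/nvme0` prints these fields. A sketch of an automated check, run here against a canned sample of that output (the exact field names and spacing are assumptions based on typical nvme-cli formatting — verify against your version):

```shell
#!/bin/sh
# Parse "field : value" lines and flag trouble. The heredoc simulates
# `nvme smart-log` output — replace it with the real command in production.
cat <<'EOF' |
critical_warning  : 0
available_spare   : 100%
percentage_used   : 3%
media_errors      : 0
EOF
awk -F':' '
  { gsub(/[ %]/, "", $1); gsub(/[ %]/, "", $2) }
  $1 == "critical_warning" && $2+0 != 0  { print "ALERT: critical warning set" }
  $1 == "percentage_used"  && $2+0 >= 90 { print "ALERT: endurance nearly spent" }
  $1 == "media_errors"     && $2+0 != 0  { print "ALERT: media errors" }
  END { print "check complete" }
'
```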
### Automated Monitoring with smartd
```shell
# /etc/smartd.conf — one line per drive (or DEVICESCAN for all)
DEVICESCAN -a -o on -S on -s (S/../.././02|L/../../6/03) -m admin@example.com
# -a = monitor all attributes
# -s = short test daily at 2am, long test Saturdays at 3am
# -m = email on failure

sudo systemctl enable --now smartd
```
### The Disk Replacement Workflow
SMART says a disk is failing. Here is the workflow:
1. Confirm: `smartctl -H /dev/sdX` (FAILED = replace immediately)
2. Verify it's in RAID/LVM (don't pull a standalone disk)
3. Identify the physical slot:
   - `storcli /c0/e32/s3 start locate` (blink the LED)
   - Or: `lsblk -o NAME,SERIAL` and match to the chassis label
4. Fail the disk in software:
   - mdadm: `mdadm /dev/md0 --fail /dev/sdX`
   - LVM: `pvmove /dev/sdX` (migrate data off first)
5. Hot-swap the physical disk
6. Partition the new disk to match: `sfdisk -d /dev/sda | sfdisk /dev/sdX`
7. Add to array/VG: `mdadm /dev/md0 --add /dev/sdX`
8. Monitor rebuild: `watch cat /proc/mdstat`
## Part 6: I/O Performance — Proving It Is (or Isn't) the Disk
Your volume is extended and your disks are healthy. But users report the database is slow. Is it the disk? Prove it.
### iostat: The First Tool You Reach For
```shell
iostat -x 2 5
# Device  r/s   w/s  rMB/s  wMB/s  rrqm/s  wrqm/s  await  %util
# sda     12.0  850  0.05   45.2   0.00    120     12.3   99.8
# sdb     0.5   2.1  0.01   0.1    0.00    0       0.5    1.2
```
Reading the output:

| Column | What It Means | Alarm Threshold |
|---|---|---|
| `await` | Average I/O latency in ms (queue + service) | >10ms for SSD, >20ms for HDD |
| `%util` | Percentage of time device is busy | >80% sustained = saturated |
| `r/s`, `w/s` | IOPS (reads and writes per second) | Compare to device spec |
| `avgqu-sz` | Average queue depth | >1 sustained means I/O is queuing |
In the output above, sda is at 99.8% utilization with 12.3ms await. It is saturated.
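That eyeball check automates well. A sketch applying the table's SSD thresholds to trimmed `iostat`-style lines — the sample data is hard-coded, and the column positions (await and %util last) match the trimmed output above, not stock iostat:

```shell
#!/bin/sh
# Flag devices with await > 10ms or %util > 80% (SSD thresholds from the table).
# Columns follow the trimmed sample above: device ... await %util (last two).
cat <<'EOF' |
sda 12.0 850 0.05 45.2 12.3 99.8
sdb  0.5 2.1 0.01  0.1  0.5  1.2
EOF
awk '{
  await = $(NF-1); util = $NF
  if (await > 10 || util > 80)
    print $1 ": SATURATED (await=" await "ms, util=" util "%)"
}'
```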
### iotop: Which Process Is Hammering the Disk?
```shell
sudo iotop -o -b -n 3
# Total DISK READ: 0.00 B/s | Total DISK WRITE: 45.23 M/s
#   TID  PRIO  USER      DISK READ  DISK WRITE  COMMAND
# 12345  be/4  postgres  0.00 B/s   40.12 M/s   postgres: wal writer
#  6789  be/4  app       0.00 B/s    5.11 M/s   java -jar app.jar
```
PostgreSQL's WAL writer is responsible for 88% of the write load. Now you know where to look.
### fio: Benchmarking for Real
`dd` is not a benchmark. Use `fio` to establish actual disk capabilities:
```shell
# Random read IOPS (simulates database reads)
fio --name=randread --ioengine=libaio --direct=1 --bs=4k \
    --iodepth=32 --size=1G --rw=randread --filename=/data/fiotest

# Random write IOPS (simulates database writes)
fio --name=randwrite --ioengine=libaio --direct=1 --bs=4k \
    --iodepth=32 --size=1G --rw=randwrite --filename=/data/fiotest

# Mixed 70/30 read-write (simulates OLTP workload)
fio --name=oltp --ioengine=libaio --direct=1 --bs=4k \
    --iodepth=32 --size=1G --rw=randrw --rwmixread=70 \
    --filename=/data/fiotest

# Sequential write (simulates backup or bulk load)
fio --name=seqwrite --ioengine=libaio --direct=1 --bs=1M \
    --iodepth=4 --size=4G --rw=write --filename=/data/fiotest
```
Interpreting fio output:

```
randread: (groupid=0, jobs=1): err= 0
  read: IOPS=45.2k, BW=176MiB/s (185MB/s)
    slat (usec): min=1, max=234, avg= 3.12
    clat (usec): min=78, max=12456, avg=704.23
     lat (usec): min=80, max=12458, avg=707.35
```

- `IOPS=45.2k` — 45,200 random reads per second. Excellent for NVMe, impossible for HDD.
- `clat avg=704.23` usec — average completion latency of 0.7ms. Good for SSD.
- Compare these numbers to your device spec. A modern NVMe should deliver 100K+ random IOPS. An HDD tops out at 100-200.
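The IOPS and bandwidth figures should agree with each other — bandwidth = IOPS × block size. A quick cross-check of the numbers above:

```shell
#!/bin/sh
# BW (MiB/s) = IOPS * block_size / 1 MiB. fio reported IOPS=45.2k at bs=4k.
iops=45200
bs=4096
awk -v iops="$iops" -v bs="$bs" \
  'BEGIN { printf "expected BW: %.2f MiB/s\n", iops * bs / 1048576 }'
```

That works out to ~176.6 MiB/s, consistent with fio's reported `BW=176MiB/s`. If the two ever disagree wildly, suspect a misread of the output (e.g., mixing up KiB and KB) rather than the device.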
### I/O Schedulers
The kernel I/O scheduler determines how requests are ordered before reaching the device:
```shell
# Check current scheduler
cat /sys/block/sda/queue/scheduler
# [mq-deadline] none bfq

# Change scheduler (temporary, until reboot)
echo none | sudo tee /sys/block/nvme0n1/queue/scheduler
```
| Scheduler | Best For | Why |
|---|---|---|
| `none` | NVMe SSDs | NVMe has its own internal scheduling; a kernel scheduler adds latency |
| `mq-deadline` | SATA SSDs, HDDs | Prevents starvation, good latency guarantees |
| `bfq` | Desktop HDDs | Fair bandwidth distribution, good for interactive workloads |
Trivia: The old `cfq` (Completely Fair Queuing) scheduler was the Linux default for over a decade. It was replaced by `mq-deadline` and `bfq` in kernel 5.0 (2019) because `cfq` was designed for single-queue rotational disks and performed poorly on multi-queue NVMe devices.
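The `echo ... | tee` above lasts only until reboot. The usual way to persist a scheduler choice is a udev rule — a sketch, where the file path and rule names are conventions rather than requirements:

```
# /etc/udev/rules.d/60-ioscheduler.rules — applied when devices appear
ACTION=="add|change", KERNEL=="nvme[0-9]n[0-9]", ATTR{queue/scheduler}="none"
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="bfq"
```

Reload with `sudo udevadm control --reload && sudo udevadm trigger` to apply without rebooting.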
## Part 7: Kubernetes Storage — From LVM to PVCs
Everything so far has been on bare metal or VMs. But if you run Kubernetes, the storage abstraction adds another layer. Kubernetes does not manage storage directly — it provides a framework (PV, PVC, StorageClass) that delegates to storage backends.
### The Kubernetes Storage Stack
```
Pod (container writes to /var/lib/postgresql/data)
        |
Volume Mount (volumeMounts in pod spec)
        |
PVC (PersistentVolumeClaim — "I need 100Gi of fast-ssd storage")
        |
PV (PersistentVolume — "Here's a 100Gi volume backed by EBS/Portworx/NFS")
        |
StorageClass (provisioner: ebs.csi.aws.com, type: gp3)
        |
CSI Driver (creates actual volume in the backend)
        |
Actual Storage (EBS volume, Portworx volume, NFS export, iSCSI LUN)
```
### StorageClass: The PV Factory
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: kubernetes.io/portworx-volume
parameters:
  repl: "3"
  io_profile: "db"
  priority_io: "high"
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
```
WaitForFirstConsumer delays volume creation until a pod actually needs it. This ensures the
volume lands in the same availability zone or on the same node as the pod — critical for
zone-local storage like EBS or Portworx.
### Portworx in Practice
Portworx is a software-defined storage layer that runs on your Kubernetes nodes and presents their local disks as a distributed storage pool.
```shell
# Cluster health
pxctl status
pxctl cluster list

# Volume operations
pxctl volume list
pxctl volume inspect <vol-id>
pxctl volume create pg-data --size 100 --repl 3 --io_profile db

# Alert review
pxctl alerts show
```
Why Portworx over raw local disks? Replication (your volume exists on 3 nodes), rack awareness (replicas spread across failure domains), and automatic reattach (if a node dies, the volume is accessible from another node that has a replica). For databases on Kubernetes, this is the layer that makes StatefulSets actually survivable.
### MinIO: Object Storage for Kubernetes
MinIO provides S3-compatible object storage. Where Portworx gives you block storage (RWO volumes for databases), MinIO gives you object storage (HTTP API for blobs, backups, artifacts).
```shell
# Check cluster health
mc admin info myminio

# Create a bucket and upload
mc mb myminio/db-backups
mc cp /tmp/pg_dump_2026-03-23.sql.gz myminio/db-backups/

# List objects
mc ls myminio/db-backups/
```
Mental Model: Think of Kubernetes storage in three tiers:

- Block (RWO): Portworx, EBS, local disks. One pod writes. For databases.
- File (RWX): NFS, EFS, CephFS. Many pods read/write. For shared configs, uploads.
- Object (S3 API): MinIO, S3. HTTP access. For backups, artifacts, large blobs.
Picking the wrong tier is a common mistake. Do not use a PVC for storing backup archives (use object storage). Do not use NFS for a high-IOPS database (use block storage).
### Expanding a PVC
Just like extending an LV on bare metal, you can expand a PVC in Kubernetes — if the
StorageClass has allowVolumeExpansion: true:
```shell
# Expand the PVC to 200Gi
kubectl patch pvc data-postgres-0 -n production \
  -p '{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}'

# Check progress
kubectl describe pvc data-postgres-0 -n production
# Conditions:
#   Type                      Status
#   FileSystemResizePending   True    <-- CSI driver expanded the volume,
#                                         waiting for pod to trigger fs resize
```
Most CSI drivers handle filesystem resize automatically when the pod restarts. Some do it online. Check your driver's documentation.
Gotcha: PVC expansion is one-way. You cannot shrink a PVC. Just like XFS cannot shrink, Kubernetes PVCs cannot shrink. The parallel is not a coincidence — many PVCs are backed by XFS volumes.
## Part 8: LVM Thin Provisioning — The Double-Edged Sword
Standard LVM allocates all space upfront. Thin provisioning allocates on demand, letting you overcommit — promise more space than physically exists.
```shell
# Create a thin pool (200GB physical, can allocate more virtually)
sudo lvcreate -L 200G --thinpool thin_pool vg_data

# Create thin volumes (total 500GB from a 200GB pool)
sudo lvcreate -V 200G --thin -n app_data vg_data/thin_pool
sudo lvcreate -V 200G --thin -n db_data vg_data/thin_pool
sudo lvcreate -V 100G --thin -n logs vg_data/thin_pool

# Monitor pool usage — this is critical
sudo lvs -o lv_name,lv_size,data_percent,pool_lv vg_data
# LV        LSize   Data%  Pool
# thin_pool 200.00g 45.23
# app_data  200.00g        thin_pool
# db_data   200.00g        thin_pool
# logs      100.00g        thin_pool
```
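It is worth alerting on the overcommit ratio itself, not just pool usage. A sketch computing it from the virtual sizes above (values hard-coded; in practice you would feed it `lvs --units g` output):

```shell
#!/bin/sh
# Overcommit ratio = sum of thin volume virtual sizes / physical pool size.
pool_gb=200
awk -v pool="$pool_gb" 'BEGIN {
  virtual = 200 + 200 + 100   # app_data + db_data + logs, in GB
  printf "virtual %dG on %dG pool: %.1f:1 overcommit\n", virtual, pool, virtual/pool
}'
```

For this pool that is 2.5:1 — exactly the ratio that blew up in the war story below, which is a hint about where to set your own ceiling.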
War Story: A thin-provisioned LVM pool hit 100% overnight when a backup job ran alongside normal database writes. Every thin volume in the pool received I/O errors simultaneously. Three VMs paused, two databases corrupted, and the backup itself was incomplete. Total overcommit was 2.5:1. The monitoring alert was set at 90% — too late, because the backup wrote 15% of pool capacity in 20 minutes. After that incident, the team moved the threshold to 80% and added a rate-of-change alert.
Auto-extend thin pools:
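The relevant knobs live in `/etc/lvm/lvm.conf` — these two parameter names are standard LVM settings, with values chosen to match the behavior described below:

```
# /etc/lvm/lvm.conf
activation {
    thin_pool_autoextend_threshold = 80
    thin_pool_autoextend_percent = 20
}
```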
This tells LVM to automatically extend the thin pool by 20% when it reaches 80% usage — but only if the VG has free space. It is a safety net, not a substitute for capacity planning.
## Part 9: LVM Snapshots — Consistent Backups Without Downtime
LVM snapshots create a point-in-time copy using copy-on-write. Every write to the original volume copies the old block to the snapshot first.
```shell
# Create a 20GB snapshot of the PostgreSQL volume
sudo lvcreate -L 20G -s -n pg_snap /dev/vg_data/lv_pg

# Mount it read-only for backup
sudo mount -o ro /dev/vg_data/pg_snap /mnt/snap
tar czf /backup/pg_$(date +%Y%m%d).tar.gz -C /mnt/snap .

# Monitor snapshot usage — if it fills, it's invalidated
sudo lvs -o lv_name,data_percent,snap_percent

# Clean up immediately after backup
sudo umount /mnt/snap
sudo lvremove -f /dev/vg_data/pg_snap
```
Gotcha: LVM snapshots degrade I/O performance because every write to the origin triggers a copy-on-write. And if the snapshot fills up, it becomes invalid silently — your backup is corrupt. Size snapshots generously (at least 20% of origin for short-lived operations) and remove them immediately after use. Long-lived LVM snapshots are a performance and reliability hazard.
## Exercises
### Exercise 1: Quick Win — Read the Storage Stack (2 minutes)
Run `lsblk`, `sudo pvs && sudo vgs && sudo lvs`, and `df -hT` on any Linux system. No changes needed, read-only.

What to look for:

- `lsblk` shows the hierarchy: disks -> partitions -> LVM -> mount points
- `pvs`/`vgs`/`lvs` shows the LVM layout and free space
- `df -hT` shows used/available space with filesystem types

If there is no LVM on the system, `pvs` will output nothing. That is normal for systems using plain partitions.

### Exercise 2: Simulate a Volume Extension (10 minutes)
On a test system (VM or lab), create a small LVM setup and extend it:
# Create two 1GB files as loop devices (simulates disks)
dd if=/dev/zero of=/tmp/disk1.img bs=1M count=1024
dd if=/dev/zero of=/tmp/disk2.img bs=1M count=1024
LOOP1=$(sudo losetup --find --show /tmp/disk1.img)
LOOP2=$(sudo losetup --find --show /tmp/disk2.img)
# Build the LVM stack
sudo pvcreate $LOOP1
sudo vgcreate test_vg $LOOP1
sudo lvcreate -L 500M -n test_lv test_vg
sudo mkfs.ext4 /dev/test_vg/test_lv
sudo mkdir -p /mnt/test
sudo mount /dev/test_vg/test_lv /mnt/test
df -h /mnt/test
# Now extend it with the second disk
sudo pvcreate $LOOP2
sudo vgextend test_vg $LOOP2
sudo lvextend -l +100%FREE --resizefs /dev/test_vg/test_lv
df -h /mnt/test
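When you are done, tear everything down in reverse order of creation. Each step is guarded so the script is safe to rerun even if some steps were already done:

```shell
# Reverse of the setup: unmount, drop LVM objects, detach loops, delete images
sudo umount /mnt/test 2>/dev/null || true
sudo lvremove -f test_vg/test_lv 2>/dev/null || true
sudo vgremove -f test_vg 2>/dev/null || true
sudo losetup -d "${LOOP1:-}" "${LOOP2:-}" 2>/dev/null || true
rm -f /tmp/disk1.img /tmp/disk2.img
```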
Expected result
The filesystem should grow from ~500MB to roughly 2GB, the combined capacity of both loop devices. The `--resizefs` flag handles the `resize2fs` call automatically. If you used XFS instead of ext4, `xfs_growfs` would be called under the hood.

Clean up by reversing the steps: unmount the filesystem, remove the LV and VG, detach the loop devices with `losetup -d`, and delete the image files.

Exercise 3: fio Benchmark Comparison (15 minutes)¶
Benchmark your system's storage and compare HDD vs SSD vs NVMe:
# Run random 4K read test
fio --name=randread --ioengine=libaio --direct=1 --bs=4k \
--iodepth=32 --size=256M --rw=randread --filename=/tmp/fiotest
# Note the IOPS and latency
# Then compare: HDD should give ~150 IOPS, SSD ~50K, NVMe ~200K+
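When comparing runs across machines, fio's JSON output (`--output-format=json`) is easier to process than the human-readable summary. A small extraction sketch; the result file here is a hypothetical, heavily trimmed sample, and in practice `jq '.jobs[0].read.iops'` is the more robust tool if jq is installed:

```shell
# Hypothetical, trimmed fio JSON result (real runs emit far more fields)
cat > /tmp/fio_result.json <<'EOF'
{"jobs": [{"jobname": "randread", "read": {"iops": 51234.7, "bw": 204938}}]}
EOF

# Pull the first "iops" value without needing jq
IOPS=$(grep -o '"iops" *: *[0-9.]*' /tmp/fio_result.json | head -n 1 | grep -o '[0-9.]*$')
echo "random read IOPS: $IOPS"
```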
Interpreting results
The key numbers in fio output:

- **IOPS**: higher is better. Database workloads need high random IOPS.
- **clat avg**: completion latency. Lower is better. <1ms for SSD, <0.1ms for NVMe.
- **BW (bandwidth)**: matters for sequential workloads (backups, video).

If your NVMe shows only 10K IOPS, check: is fio using `--direct=1`? Without it, results include page cache, not actual device performance.

Exercise 4: Judgment Call — Filesystem Selection¶
A team asks you to set up storage for three different workloads. Which filesystem for each?
- PostgreSQL database on RHEL 9, 2TB volume, needs online expansion
- Media processing pipeline, 50TB of large video files
- Home NAS with snapshots, compression, and data integrity checks
Recommended answers
1. **XFS** — RHEL default, excellent large-file and parallel I/O performance. Online grow supported. Cannot shrink, but databases rarely need to shrink volumes.
2. **XFS** — again, designed for exactly this. SGI created it for their media workstations. Allocation groups handle parallel I/O from multiple processing threads.
3. **ZFS** — built-in snapshots (`zfs snapshot`), built-in compression (`zfs set compression=lz4`), end-to-end checksums for data integrity. Or Btrfs if you want to stay in-kernel without OpenZFS.

Cheat Sheet¶
LVM Quick Reference¶
| Task | Command |
|---|---|
| List PVs, VGs, LVs | pvs && vgs && lvs |
| Initialize a disk for LVM | pvcreate /dev/sdX |
| Add disk to VG | vgextend vg_name /dev/sdX |
| Extend LV + filesystem | lvextend -L +SIZE --resizefs /dev/vg/lv |
| Use all free space | lvextend -l +100%FREE --resizefs /dev/vg/lv |
| Create snapshot | lvcreate -L SIZE -s -n snap_name /dev/vg/lv_origin |
| Check snapshot usage | lvs -o lv_name,data_percent,snap_percent |
| Remove snapshot | lvremove /dev/vg/snap_name |
Filesystem Operations¶
| Task | ext4 | XFS |
|---|---|---|
| Create | mkfs.ext4 /dev/X | mkfs.xfs /dev/X |
| Grow (online) | resize2fs /dev/X | xfs_growfs /mountpoint |
| Shrink | resize2fs /dev/X SIZE (offline) | Not possible |
| Repair | e2fsck -f /dev/X (unmount) | xfs_repair /dev/X (unmount) |
| Check info | tune2fs -l /dev/X | xfs_info /mountpoint |
I/O Diagnosis¶
| Task | Command |
|---|---|
| Device saturation | iostat -x 2 (watch await and %util) |
| Per-process I/O | iotop -o |
| I/O scheduler | cat /sys/block/sdX/queue/scheduler |
| Benchmark IOPS | fio --name=t --ioengine=libaio --direct=1 --bs=4k --iodepth=32 --size=1G --rw=randread |
SMART Monitoring¶
| Task | Command |
|---|---|
| Health check | smartctl -H /dev/sdX |
| Full report | smartctl -a /dev/sdX |
| Key attributes | Check IDs 5, 187, 197, 198 — any non-zero = investigate |
| Self-test | smartctl -t short /dev/sdX (~2 min) |
| NVMe health | nvme smart-log /dev/nvme0n1 |
Kubernetes Storage¶
| Task | Command |
|---|---|
| List storage classes | kubectl get sc |
| List PVs | kubectl get pv --sort-by=.spec.capacity.storage |
| List PVCs | kubectl get pvc -A |
| Debug pending PVC | kubectl describe pvc NAME -n NAMESPACE |
| Expand PVC | kubectl patch pvc NAME -p '{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}' |
| Portworx status | pxctl status |
Takeaways¶
- The LVM extend workflow is the single most important storage operation you will perform in production: `pvcreate -> vgextend -> lvextend --resizefs`. Memorize it.
- XFS cannot shrink. Start volumes small and grow. If you need shrink capability, use ext4 or Btrfs. This single constraint drives many design decisions.
- SMART attributes 5, 187, 197, and 198 are the four horsemen of disk failure. Non-zero in any of them means the disk is dying. Everything else is noise.
- Multipath is mandatory for iSCSI in production. Without it, the OS sees duplicate block devices for the same LUN. Someone will format the "spare" device and corrupt your data.
- LVM snapshots are temporary. Remove them within hours. They degrade I/O and silently corrupt if they fill up. For long-lived snapshots, use ZFS or Btrfs.
- Kubernetes storage is just LVM/NFS/iSCSI with an API layer on top. PVC expansion works the same way as `lvextend --resizefs` — the CSI driver does the same steps under the hood.
Related Lessons¶
- The Disk That Filled Up — emergency response when root fills to 100%
- RAID: Why Your Disks Will Fail — RAID levels, rebuild risk, hardware RAID ops
- Server Hardware: When the Blinky Lights Matter — physical disk identification, LED blinking, slot mapping
- Kubernetes Debugging: When Pods Won't Behave — PVC stuck pending, mount errors
- Strace: Reading the Matrix — tracing I/O syscalls when disk behavior is unexplained