
Linux Storage: LVM, Filesystems, and Beyond


Topics: LVM, ext4, XFS, ZFS, Btrfs, NFS, iSCSI, SMART, I/O performance, Kubernetes storage, device mapper
Level: L1-L2 (Foundations to Operations)
Time: 90-120 minutes
Prerequisites: None (everything explained from scratch)


The Mission

It is Thursday afternoon. Monitoring fires a warning: /data on db-prod-03 is at 95% utilization. This is the PostgreSQL data volume — a 1TB logical volume on LVM, formatted XFS, backed by two physical disks in a volume group. The database grows roughly 20GB/day, and about 50GB remains. You have roughly two days before writes start failing.

Your job: extend the volume without any downtime. Along the way, you will trace the entire Linux storage stack from block devices to Kubernetes PVCs, because this one operation touches all of it.


Part 1: The Storage Stack — What Sits Under Your Data

Before touching anything, understand what you are looking at. Run this on any Linux box and you will see the layers:

lsblk
# NAME              MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
# sda                 8:0    0   500G  0 disk
# |-sda1              8:1    0     1G  0 part /boot
# |-sda2              8:2    0   499G  0 part
#   |-vg_data-lv_pg 253:0    0   998G  0 lvm  /data
# sdb                 8:16   0   500G  0 disk
# |-sdb1              8:17   0   500G  0 part
#   |-vg_data-lv_pg 253:0    0   998G  0 lvm  /data

That output tells a story. Read it top-down, from application to hardware:

Application (PostgreSQL)
  writes to /data/pgdata/
    |
XFS filesystem
  translates file writes to block I/O
    |
Logical Volume (lv_pg)
  spans across two PVs via device mapper
    |
Volume Group (vg_data)
  pools two physical volumes
    |
Physical Volumes (sda2, sdb1)
  LVM metadata on raw partitions
    |
Block Devices (sda, sdb)
  SCSI disks exposed by the kernel
    |
Physical Disks
  SATA/SAS drives in slots 0 and 1

Every layer adds capability. LVM lets you resize without unmounting. The filesystem translates byte offsets to block addresses. The device mapper stitches multiple disks into one logical device. Understanding where you are in this stack is the difference between a 30-second fix and a 3-hour outage.

Name Origin: The sd in /dev/sda stands for SCSI Disk. The letter suffix (a, b, c) is assigned in discovery order. NVMe drives use /dev/nvme0n1 — controller 0, namespace 1. The namespace concept exists because NVMe supports multiple virtual drives per controller, though most consumer/server SSDs expose just one.


Part 2: LVM Deep Dive — The Mission Begins

Step 1: Assess the Situation

Before extending anything, see exactly what you have:

# Physical volumes — what raw disks feed LVM?
sudo pvs
#   PV         VG      Fmt  Attr PSize   PFree
#   /dev/sda2  vg_data lvm2 a--  499.00g     0
#   /dev/sdb1  vg_data lvm2 a--  499.00g     0

# Volume group — what's the pool look like?
sudo vgs
#   VG      #PV #LV #SN Attr   VSize   VFree
#   vg_data   2   1   0 wz--n- 998.00g     0

# Logical volumes — what slices exist?
sudo lvs
#   LV    VG      Attr       LSize   Pool Origin Data%  Meta%
#   lv_pg vg_data -wi-ao---- 998.00g

Zero free space in the VG. The logical volume consumes everything. You cannot extend lv_pg until you add more physical storage to vg_data.

Remember: The LVM stack is PV -> VG -> LV: Physical Volumes pour into Volume Groups, which you carve into Logical Volumes. Think of it like a swimming pool: PVs are the water sources, the VG is the pool, and LVs are the lanes.

Step 2: Add a New Disk

A new 1TB disk has been attached (hot-added in the RAID controller, or a new EBS volume in the cloud). First, verify it appeared:

# Rescan the SCSI bus (for hot-added disks on VMs or bare metal)
echo "- - -" | sudo tee /sys/class/scsi_host/host*/scan

# Verify
lsblk
# ...
# sdc               8:32   0  1000G  0 disk   <-- new disk, no partitions

Confirm you have the right disk. This matters more than you think:

# Cross-reference size, model, serial, and mount status
lsblk -o NAME,SIZE,MODEL,SERIAL,MOUNTPOINT
# sdc  1000G  SAMSUNG_MZ7LH1T0  S4EWNX0T123456  (no mountpoint)

Gotcha: Device names shift. The disk that was /dev/sdc yesterday might be /dev/sdd today if a disk was added or removed. Always verify by serial number or /dev/disk/by-id/ before running destructive operations. Partitioning the wrong disk is one of the most common ways to destroy production data.
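One way to make that check mechanical is to resolve the device name from its serial before touching it. A small sketch (the serials below are hypothetical stand-ins for live `lsblk -o NAME,SERIAL` output):

```shell
# Resolve a device name by serial number before running anything destructive.
# find_by_serial reads "NAME SERIAL" pairs on stdin and prints matching names.
find_by_serial() {
  awk -v want="$1" '$2 == want { print $1 }'
}

# Stand-in for live `lsblk -o NAME,SERIAL` output (serials are hypothetical)
sample='sda S4EWNX0T000001
sdb S4EWNX0T000002
sdc S4EWNX0T123456'

device=$(printf '%s\n' "$sample" | find_by_serial S4EWNX0T123456)
echo "$device"   # sdc — this is the name safe to hand to pvcreate
```

On a live system you would pipe real `lsblk -o NAME,SERIAL` output in instead of the sample, or skip the lookup entirely and use the stable `/dev/disk/by-id/` paths directly.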

Step 3: Initialize and Extend

# Create a physical volume on the new disk (no partition needed for LVM)
sudo pvcreate /dev/sdc

# Add it to the existing volume group
sudo vgextend vg_data /dev/sdc

# Verify — the VG now has free space
sudo vgs
#   VG      #PV #LV #SN Attr   VSize   VFree
#   vg_data   3   1   0 wz--n- 1998.0g 1000.0g

Step 4: Extend the Logical Volume

# Add 500GB to the logical volume
sudo lvextend -L +500G /dev/vg_data/lv_pg

# Or use all free space
sudo lvextend -l +100%FREE /dev/vg_data/lv_pg

Step 5: Grow the Filesystem

This is where people get tripped up. The LV is bigger, but the filesystem does not know yet. df -h still shows the old size.

# XFS — grow by mount point (online, no unmount)
sudo xfs_growfs /data

# Verify
df -h /data
# Filesystem               Size  Used Avail Use% Mounted on
# /dev/mapper/vg_data-lv_pg 1.5T  948G  551G  64% /data

Done. No downtime. PostgreSQL never noticed.

Under the Hood: xfs_growfs tells the XFS filesystem to re-read the device size and extend its allocation group structure to cover the new space. The filesystem is online the entire time — writes continue during the grow. This works because XFS journals metadata changes and the grow operation is itself journaled.

The one-liner shortcut for when you are in a hurry:

# Extend LV and resize filesystem in a single command
sudo lvextend -L +500G --resizefs /dev/vg_data/lv_pg

The --resizefs flag (or -r) detects the filesystem type and calls the right resize tool automatically. This is the command you will use 90% of the time in production.
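Before reaching for that one-liner in a script, it is worth guarding that the VG actually has the space. A minimal sketch (the live `vgs` call is shown commented out so the check itself is plain arithmetic on a stand-in value):

```shell
# Guard: only extend when the VG reports enough free space.
# can_extend FREE_GIB NEED_GIB -> exit 0 when FREE >= NEED
can_extend() {
  # strip any decimal part so the comparison is integer-safe
  [ "${1%.*}" -ge "${2%.*}" ]
}

# On a live system you would capture the real number, e.g.:
#   free=$(sudo vgs --noheadings --nosuffix --units g -o vg_free vg_data)
free=1000.00   # stand-in: the 1TB just added to vg_data
need=500

if can_extend "$free" "$need"; then
  echo "extending by ${need}G"
  # sudo lvextend -L +${need}G --resizefs /dev/vg_data/lv_pg
else
  echo "ABORT: only ${free}G free in the VG" >&2
fi
```

The guard matters because `lvextend` fails cleanly on insufficient space, but a wrapper script that assumed success and moved on (restarting services, rotating data) would not.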

Flashcard Check: LVM Basics

Question Answer
What are the three layers of LVM? PV (Physical Volume) -> VG (Volume Group) -> LV (Logical Volume)
How do you add a disk to an existing VG? pvcreate /dev/sdX then vgextend vg_name /dev/sdX
What command extends an LV and resizes the filesystem in one step? lvextend -L +SIZE --resizefs /dev/vg/lv
What does pvs && vgs && lvs show you? A quick snapshot of all PVs, VGs, and LVs on the system
Why run lsblk -o NAME,SIZE,SERIAL before pvcreate? To verify you are operating on the correct disk

Part 3: Filesystems — Choosing Your Weapon

You extended the XFS volume. But why XFS? What if it had been ext4, or ZFS, or Btrfs? Each filesystem is a different set of tradeoffs. Here is when to reach for each one.

The Comparison Table

Feature ext4 XFS ZFS Btrfs
Max volume 1 EiB 8 EiB 256 ZiB 16 EiB
Online grow Yes Yes Yes Yes
Online shrink No (offline only) No. Ever. No Yes
Checksums Metadata only Metadata only Data + metadata Data + metadata
Built-in snapshots No (use LVM) No (use LVM) Yes Yes
Compression No No Yes (lz4, zstd) Yes (zstd, lzo)
Best for General purpose Large files, RHEL Data integrity Flexible storage

ext4: The Reliable Default

ext4 has been the Linux default since 2008. It evolved from ext3 (2001), ext2 (1993), and the original ext filesystem (1992, by Rémy Card). When in doubt, use ext4.

Resize operations:

# Grow (online, mounted)
sudo resize2fs /dev/vg_data/lv_app

# Shrink (offline only — must unmount first!)
sudo umount /app
sudo e2fsck -f /dev/vg_data/lv_app    # required before shrink
sudo resize2fs /dev/vg_data/lv_app 50G
sudo lvreduce -L 50G /dev/vg_data/lv_app
sudo mount /app

ext4 reserves 5% of space for root by default. On a 1TB data volume, that is 50GB of wasted space. Reduce it:

# Check current reservation
sudo tune2fs -l /dev/sdb1 | grep "Reserved block count"

# Set to 1% on data volumes (not root filesystem)
sudo tune2fs -m 1 /dev/sdb1

XFS: The Enterprise Workhorse

XFS was created by Silicon Graphics (SGI) in 1993 for IRIX, their Unix workstation OS. It was ported to Linux in 2001 and became the default in RHEL 7 (2014). Red Hat chose it for its superior performance with large files and parallel I/O.

Gotcha: XFS cannot shrink. Not online, not offline, not ever. There is no xfs_shrinkfs command. It does not exist. If you allocated too much space to an XFS volume and need it back, your only option is: create a new smaller volume, copy the data with rsync -avHAX, swap the mounts. This is why you should start small and grow as needed.

Key XFS tools:

# Filesystem info
sudo xfs_info /data
# meta-data=/dev/mapper/vg_data-lv_pg isize=512  agcount=32, agsize=8192000 blks
#          =                          sectsz=512  attr=2, projid32bit=1
# data     =                          bsize=4096  blocks=262144000, imaxpct=5
# naming   =version 2                bsize=4096  ascii-ci=0, ftype=1
# log      =internal log             bsize=4096  blocks=128000, version=2
# realtime =none                     extsz=4096  blocks=0, rtextents=0

# Repair (must unmount first)
sudo umount /data
sudo xfs_repair /dev/vg_data/lv_pg

# Defragment (online)
sudo xfs_fsr /data

# Grow (online)
sudo xfs_growfs /data

Trivia: The ag in XFS output stands for "allocation group." XFS splits the filesystem into allocation groups that can be managed independently, which is why it handles parallel I/O so well. Multiple threads writing to different parts of the filesystem do not contend with each other. This design was revolutionary in 1993 and remains one of XFS's key advantages for database and media workloads.
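The geometry is easy to verify from the sample xfs_info output above: agcount × agsize must equal the data blocks count, and blocks × bsize gives the filesystem size. A quick arithmetic check:

```shell
# Check the XFS geometry invariants using the sample xfs_info numbers:
# agcount * agsize = total data blocks; blocks * bsize = filesystem bytes.
awk 'BEGIN {
  agcount = 32; agsize = 8192000; blocks = 262144000; bsize = 4096
  print ((agcount * agsize == blocks) ? "consistent" : "inconsistent")
  printf "%.0f GiB\n", blocks * bsize / 1024 ^ 3
}'
# consistent
# 1000 GiB
```

The 32 allocation groups here mean up to 32 largely independent regions for parallel allocation — one reason a busy PostgreSQL instance with many backends benefits from XFS.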

ZFS: The Data Integrity Champion

ZFS was created by Sun Microsystems (Jeff Bonwick and team) in 2005. The name originally stood for "Zettabyte File System." It is not in the mainline Linux kernel due to license incompatibility (CDDL vs GPL), so you install it via OpenZFS.

ZFS combines the volume manager and filesystem into one layer. There is no separate LVM — the pool is the volume manager.

# Create a mirrored pool (RAID1 equivalent)
sudo zpool create datapool mirror /dev/sdb /dev/sdc

# Create a RAIDZ pool (RAID5 equivalent, single parity)
sudo zpool create datapool raidz /dev/sdb /dev/sdc /dev/sdd

# Create a dataset (like a subvolume — no fixed size, shares pool space)
sudo zfs create datapool/postgres
sudo zfs create datapool/backups

# Enable compression (lz4 is fast, zstd gives better ratio)
sudo zfs set compression=lz4 datapool/postgres

# Snapshot — instant, zero-cost at creation
sudo zfs snapshot datapool/postgres@before-migration

# Rollback — instant return to snapshot state
sudo zfs rollback datapool/postgres@before-migration

# Send/receive — incremental replication to another host
sudo zfs send datapool/postgres@snap1 | ssh backup-host zfs receive backuppool/postgres

# Pool health
sudo zpool status
#   pool: datapool
#  state: ONLINE
#   scan: scrub repaired 0B in 01:23:45 with 0 errors
# config:
#     NAME        STATE     READ WRITE CKSUM
#     datapool    ONLINE       0     0     0
#       mirror-0  ONLINE       0     0     0
#         sdb     ONLINE       0     0     0
#         sdc     ONLINE       0     0     0

Under the Hood: ZFS checksums every block (data and metadata) using a Merkle tree. On every read, it verifies the checksum. If the checksum fails and the pool has redundancy (mirror or raidz), ZFS automatically repairs the corrupted block from a good copy. This catches bit rot, firmware bugs, and phantom writes — corruption modes that traditional RAID cannot detect. CERN's 2007 study found silent data corruption at roughly 1 bit flip per 10TB per year. Jeff Bonwick called bit rot "the silent killer of data."

ZFS vs LVM — when to use which:

Scenario Use ZFS Use LVM
Data integrity is paramount Yes — end-to-end checksums No checksums on data
You need built-in compression Yes — lz4 or zstd Need filesystem-level (Btrfs)
RHEL/CentOS environment Harder — not in mainline kernel Native, well-supported
Kubernetes storage (Portworx, etc.) Possible but uncommon Common backing store
Simple volume extension Pool-based, different workflow lvextend --resizefs
Snapshots for backups Instant, incremental send/receive COW snapshots, but degrade I/O
You need to shrink a volume Cannot shrink pools ext4 can shrink offline

Gotcha: ZFS uses the ARC (Adaptive Replacement Cache) which lives in RAM. On a system with 128GB RAM, ZFS might consume 80GB+ for ARC. This is by design and the memory is reclaimable, but it can surprise monitoring tools and trigger false OOM alerts. Tune with zfs_arc_max in /etc/modprobe.d/zfs.conf.
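A sketch of that tuning (the 32 GiB cap is an illustrative number for a 128 GiB host, not a recommendation):

```
# /etc/modprobe.d/zfs.conf — cap ARC at 32 GiB (value is in bytes)
# 32 * 1024^3 = 34359738368
options zfs zfs_arc_max=34359738368
```

The module reads this at load time; on most OpenZFS builds you can also apply it live by writing the value to /sys/module/zfs/parameters/zfs_arc_max.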

Btrfs: The Middle Ground

Btrfs (pronounced "butter-FS" or "better-FS") was started by Oracle in 2007. Facebook (Meta) has run it at massive scale in production, though it kept XFS for its core database workloads. It is the default on openSUSE and Fedora Workstation.

Key advantage: it is the only mainstream Linux filesystem that can shrink online. Key risk: its RAID 5/6 implementation has a documented write hole and is not recommended for production.

# Create a Btrfs filesystem
sudo mkfs.btrfs /dev/sdb1

# Mount with compression
sudo mount -o compress=zstd /dev/sdb1 /data

# Take a snapshot
sudo btrfs subvolume snapshot /data /data/.snapshots/2026-03-23

# Check integrity (online scrub)
sudo btrfs scrub start /data
sudo btrfs scrub status /data

# Filesystem usage (more accurate than df for CoW filesystems)
sudo btrfs filesystem usage /data

Flashcard Check: Filesystem Showdown

Question Answer
Can you shrink an XFS filesystem? No. Not online, not offline. Create a new volume and migrate.
Which filesystem checksums both data and metadata? ZFS and Btrfs. ext4 and XFS only checksum metadata.
What was ext4's predecessor chain? ext (1992) -> ext2 (1993) -> ext3 (2001) -> ext4 (2008)
Why did RHEL choose XFS as default? Superior large-file performance and parallel I/O from its allocation group design
What is ZFS ARC? Adaptive Replacement Cache — ZFS's in-RAM read cache. Can consume most of available memory.

Part 4: NFS and iSCSI — When Storage Lives on the Network

Your PostgreSQL server has local disks. But the application servers mount their shared config and uploads directory from an NFS server. And the SAN-backed volumes come over iSCSI. Both have their own failure modes.

NFS: Shared Files Over the Network

NFS lets multiple clients mount the same directory simultaneously. It operates at the file level — clients see files and directories, not raw blocks.

Server setup (quick version):

# Install NFS server
sudo dnf install nfs-utils   # RHEL/Rocky
sudo apt install nfs-kernel-server   # Debian/Ubuntu

# Export a directory
echo '/exports/shared 10.0.1.0/24(rw,sync,no_subtree_check)' | sudo tee -a /etc/exports

# Apply and start
sudo exportfs -ra
sudo systemctl enable --now nfs-server

Gotcha: Watch the space. In /etc/exports, 10.0.1.0/24(rw) means the 10.0.1.0/24 network gets read-write access. But 10.0.1.0/24 (rw) (with a space before the parenthesis) means the 10.0.1.0/24 network gets default (read-only) access, and everyone else gets read-write. One space. Completely different security posture. This has bitten countless admins.
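A linting sketch for exactly that mistake — flag any exports line where whitespace separates the host from its option list (the sample lines are illustrative):

```shell
# Flag /etc/exports lines with a space before "(" — the accidental
# read-write-for-everyone pattern described above.
check_exports() {
  grep -n ' (' || true
}

sample='/exports/ok  10.0.1.0/24(rw,sync,no_subtree_check)
/exports/bad 10.0.1.0/24 (rw)'

printf '%s\n' "$sample" | check_exports   # 2:/exports/bad 10.0.1.0/24 (rw)
```

On a real server you would run `check_exports < /etc/exports` before `exportfs -ra`; `exportfs -v` afterwards is the authoritative view of what actually got exported, and to whom.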

Client mount with sane options:

# Manual mount
sudo mount -t nfs -o hard,timeo=300,retrans=3 nfs-srv:/exports/shared /mnt/shared

# fstab entry — note _netdev and nofail
nfs-srv:/exports/shared  /mnt/shared  nfs  hard,timeo=300,retrans=3,_netdev,nofail  0  0

NFSv3 vs NFSv4:

Feature NFSv3 NFSv4
Protocol Stateless, UDP or TCP Stateful, TCP only
Ports 2049 + rpcbind (111) + dynamic mountd 2049 only
Security IP-based, basic AUTH_SYS Kerberos (RPCSEC_GSS)
Firewall Painful (multiple ports) Simple (one port)
Performance Proven Better with compound operations, pNFS

NFS performance tuning:

# Pin the read/write block size to 1MB (modern clients negotiate up to
# 1MB with the server automatically; older defaults were much smaller)
mount -o rsize=1048576,wsize=1048576 nfs-srv:/share /mnt/share

# Check NFS statistics for retransmissions and timeouts
nfsstat -c
# Look for high retrans values — indicates network or server issues

# Use NFSv4.1 for parallel NFS (pNFS) on high-throughput workloads
mount -t nfs -o vers=4.1 nfs-srv:/share /mnt/share

iSCSI: Block Devices Over the Network

iSCSI exposes raw block devices over TCP/IP. The client (initiator) sees a local disk. Unlike NFS, only one client at a time can safely write to an iSCSI LUN (unless using a clustered filesystem).

Key vocabulary:

Term Meaning
Target The server exposing storage (SAN array, Linux targetcli)
Initiator The client consuming storage (your server)
LUN Logical Unit Number — a specific volume on the target
IQN iSCSI Qualified Name — unique identifier (e.g., iqn.2024.com.example:storage)
Portal IP:port pair where the target listens (default port 3260)

Initiator workflow:

# Discover targets on a portal
sudo iscsiadm -m discovery -t sendtargets -p 10.0.1.100

# Login to a target
sudo iscsiadm -m node -T iqn.2024.com.example:storage -p 10.0.1.100 --login

# The LUN appears as a new block device
lsblk
# sdd  0  50G  0 disk   <-- this is the iSCSI LUN

# Partition, format, mount as normal
sudo mkfs.xfs /dev/sdd
sudo mount /dev/sdd /mnt/iscsi

# Make login persistent across reboots
sudo iscsiadm -m node -T iqn.2024.com.example:storage -p 10.0.1.100 \
    --op update -n node.startup -v automatic

Multipath: Why It Matters

War Story: A storage admin configured an iSCSI target with two network paths for redundancy — 10.0.1.100 and 10.0.2.100. The initiator discovered both paths and the OS presented the same LUN as two separate block devices: /dev/sdd and /dev/sde. Without multipath, the admin did not realize these were the same physical disk. They formatted /dev/sdd with XFS and mounted it. A junior admin later saw the "unused" /dev/sde and formatted that with ext4. The conflicting writes corrupted the SAN LUN. Recovery took 14 hours and the database restore lost 6 hours of transactions.

The fix was simple: install and configure multipath-tools before connecting iSCSI paths. Multipath combines the two device paths into a single /dev/mapper/mpath0 device and handles failover automatically.

# Install multipath tools
sudo dnf install device-mapper-multipath

# Generate default config
sudo mpathconf --enable

# View multipath topology
sudo multipath -ll
# mpath0 (360000000000000001) dm-3 ATA,VBOX HARDDISK
# size=50G features='0' hwhandler='0' wp=rw
# |-+- policy='service-time 0' prio=1 status=active
# | `- 3:0:0:1 sdd 8:48 active ready running
# `-+- policy='service-time 0' prio=1 status=enabled
#   `- 4:0:0:1 sde 8:64 active ready running

# Use /dev/mapper/mpath0 — never the raw sd devices
sudo mkfs.xfs /dev/mapper/mpath0

Mental Model: NFS = file-level, many readers/writers, like a shared Google Doc. iSCSI = block-level, single writer, like plugging in a USB drive over the network. Choose NFS when multiple servers need the same files. Choose iSCSI when you need raw block performance (databases, VMs) and single-host access.


Part 5: SMART Monitoring — Predicting Disk Failure

The volume extension is done. But what about the disks themselves? Disks fail. The question is whether you find out before or after data is lost.

Reading smartctl Output

sudo smartctl -a /dev/sda

The output has dozens of attributes. Most are noise. Focus on four:

ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033 100   100   010    Pre-fail Always       0
187 Reported_Uncorrect       0x0032 100   100   000    Old_age  Always       0
197 Current_Pending_Sector   0x0012 100   100   000    Old_age  Always       0
198 Offline_Uncorrectable    0x0010 100   100   000    Old_age  Offline      0

Remember: The Backblaze rule of thumb (from 250,000+ drives): Any non-zero value in attributes 5, 187, 197, or 198 warrants investigation and likely proactive replacement. Mnemonic: 5, 187, 197, 198 — the four horsemen of disk failure. Most healthy drives show zeros in all four for their entire lifespan.

What each one means:

Attribute ID What It Means Non-Zero Action
Reallocated Sector Count 5 Bad sectors remapped to spare area Rising = disk degrading. Plan replacement.
Reported Uncorrectable 187 ECC failures the drive could not fix Read errors getting past internal defenses
Current Pending Sector 197 Sectors that failed reads, awaiting remap Active problem — data at risk
Offline Uncorrectable 198 Sectors that failed remap entirely Data loss occurring
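The rule is simple enough to script: scan `smartctl -A`-style output and report any of the four attributes with a non-zero raw value. A parsing sketch (the sample rows are fabricated for illustration):

```shell
# Flag SMART attributes 5, 187, 197, 198 whose RAW_VALUE (last field)
# is non-zero — the Backblaze "four horsemen" check.
flag_horsemen() {
  awk '$1 ~ /^(5|187|197|198)$/ && $NF + 0 > 0 { print $1, $2, $NF }'
}

sample='  5 Reallocated_Sector_Ct   0x0033 100 100 010 Pre-fail Always - 0
187 Reported_Uncorrect      0x0032 100 100 000 Old_age  Always  - 0
197 Current_Pending_Sector  0x0012 100 100 000 Old_age  Always  - 8
198 Offline_Uncorrectable   0x0010 100 100 000 Old_age  Offline - 0'

printf '%s\n' "$sample" | flag_horsemen   # 197 Current_Pending_Sector 8
```

On a live system you would feed it `sudo smartctl -A /dev/sdX`; any output at all is your cue to start the replacement workflow below.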

What about temperature and power-on hours? The Google disk failure study (2007, 100K+ drives) and Backblaze data both found these to be weak predictors: temperature (short of extremes above ~60C), power-on hours, and start/stop count correlate poorly with failure on their own. Do not waste alert capacity on them.

NVMe Health Monitoring

NVMe drives use different reporting. No attribute IDs — instead, a standardized health log:

sudo smartctl -a /dev/nvme0n1
# Or:
sudo nvme smart-log /dev/nvme0n1

Field Meaning Action
Critical Warning Bitmask of active warnings Any non-zero bit = investigate now
Available Spare Remaining spare capacity (%) Below threshold = replacement due
Percentage Used Endurance consumed Approaching 100% = plan replacement
Media and Data Integrity Errors Unrecovered data errors Any non-zero = corruption risk

Automated Monitoring with smartd

# /etc/smartd.conf — one line per drive (or DEVICESCAN for all)
DEVICESCAN -a -o on -S on -s (S/../.././02|L/../../6/03) -m admin@example.com

# -a = monitor all attributes
# -s = short test daily at 2am, long test Saturdays at 3am
# -m = email on failure

sudo systemctl enable --now smartd

The Disk Replacement Workflow

SMART says a disk is failing. Here is the workflow:

1. Confirm: smartctl -H /dev/sdX   (FAILED = replace immediately)
2. Verify it's in RAID/LVM (don't pull a standalone disk)
3. Identify the physical slot:
   - storcli /c0/e32/s3 start locate   (blink the LED)
   - Or: lsblk -o NAME,SERIAL and match to the chassis label
4. Fail the disk in software:
   - mdadm: mdadm /dev/md0 --fail /dev/sdX
   - LVM: pvmove /dev/sdX (migrate data off first)
5. Hot-swap the physical disk
6. Partition the new disk to match: sfdisk -d /dev/sda | sfdisk /dev/sdX
7. Add to array/VG: mdadm /dev/md0 --add /dev/sdX
8. Monitor rebuild: watch cat /proc/mdstat

Part 6: I/O Performance — Proving It Is (or Isn't) the Disk

Your volume is extended and your disks are healthy. But users report the database is slow. Is it the disk? Prove it.

iostat: The First Tool You Reach For

iostat -x 2 5
# Device   r/s    w/s   rMB/s  wMB/s  rrqm/s  wrqm/s  await  %util
# sda      12.0   850    0.05   45.2    0.00    120     12.3   99.8
# sdb       0.5    2.1   0.01    0.1    0.00      0      0.5    1.2

Reading the output:

Column What It Means Alarm Threshold
await Average I/O latency in ms (queue + service) >10ms for SSD, >20ms for HDD
%util Percentage of time device is busy >80% sustained = saturated
r/s, w/s IOPS (reads and writes per second) Compare to device spec
avgqu-sz (aqu-sz in newer iostat) Average queue depth >1 sustained means I/O is queuing

In the output above, sda is at 99.8% utilization with 12.3ms await. It is saturated.
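That judgment can be scripted. A sketch that applies the thresholds from the table to `iostat -x`-style rows (column positions match the sample output above — real iostat column layouts vary by version, so check your headers first):

```shell
# Flag devices where await > 10 ms or %util > 80, using the column
# layout from the sample above (await = field 8, %util = field 9).
flag_saturated() {
  awk '$8 + 0 > 10 || $9 + 0 > 80 { print $1 }'
}

sample='sda 12.0 850 0.05 45.2 0.00 120 12.3 99.8
sdb  0.5 2.1 0.01  0.1 0.00   0  0.5  1.2'

printf '%s\n' "$sample" | flag_saturated   # sda
```

The `+ 0` coercions make awk treat the fields numerically even if a header line slips through, since non-numeric text coerces to 0 and fails both thresholds.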

iotop: Which Process Is Hammering the Disk?

sudo iotop -o -b -n 3
# Total DISK READ:   0.00 B/s | Total DISK WRITE:  45.23 M/s
#   TID  PRIO  USER     DISK READ  DISK WRITE  COMMAND
# 12345 be/4  postgres   0.00 B/s  40.12 M/s   postgres: wal writer
#  6789 be/4  app        0.00 B/s   5.11 M/s   java -jar app.jar

PostgreSQL's WAL writer is responsible for 88% of the write load. Now you know where to look.

fio: Benchmarking for Real

dd is not a benchmark. Use fio to establish actual disk capabilities:

# Random read IOPS (simulates database reads)
fio --name=randread --ioengine=libaio --direct=1 --bs=4k \
    --iodepth=32 --size=1G --rw=randread --filename=/data/fiotest

# Random write IOPS (simulates database writes)
fio --name=randwrite --ioengine=libaio --direct=1 --bs=4k \
    --iodepth=32 --size=1G --rw=randwrite --filename=/data/fiotest

# Mixed 70/30 read-write (simulates OLTP workload)
fio --name=oltp --ioengine=libaio --direct=1 --bs=4k \
    --iodepth=32 --size=1G --rw=randrw --rwmixread=70 \
    --filename=/data/fiotest

# Sequential write (simulates backup or bulk load)
fio --name=seqwrite --ioengine=libaio --direct=1 --bs=1M \
    --iodepth=4 --size=4G --rw=write --filename=/data/fiotest

Interpreting fio output:

randread: (groupid=0, jobs=1): err= 0
  read: IOPS=45.2k, BW=176MiB/s (185MB/s)
    slat (usec): min=1, max=234, avg= 3.12
    clat (usec): min=78, max=12456, avg=704.23
     lat (usec): min=80, max=12458, avg=707.35
  • IOPS=45.2k — 45,200 random reads per second. Excellent for NVMe, impossible for HDD.
  • clat avg=704.23 usec — average completion latency of 0.7ms. Good for SSD.
  • Compare these numbers to your device spec. A modern NVMe should deliver 100K+ random IOPS. An HDD tops out at 100-200.
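A quick sanity check on fio results is Little's Law: average concurrency = IOPS × average latency. With the sample numbers:

```shell
# Little's Law: in-flight I/Os = IOPS * avg completion latency.
# 45,200 IOPS at 707.35 us should roughly equal the configured iodepth.
awk 'BEGIN { printf "%.1f\n", 45200 * 707.35e-6 }'   # 32.0
```

It lands on the `--iodepth=32` setting, which means the device was kept fully busy for the whole run. A result well below the configured iodepth would instead point at a submission-side bottleneck (CPU, the ioengine, or fio itself) rather than the disk.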

I/O Schedulers

The kernel I/O scheduler determines how requests are ordered before reaching the device:

# Check current scheduler
cat /sys/block/sda/queue/scheduler
# [mq-deadline] none bfq

# Change scheduler (temporary, until reboot)
echo none | sudo tee /sys/block/nvme0n1/queue/scheduler

Scheduler Best For Why
none NVMe SSDs NVMe has its own internal scheduling; kernel scheduler adds latency
mq-deadline SATA SSDs, HDDs Prevents starvation, good latency guarantees
bfq Desktop HDDs Fair bandwidth distribution, good for interactive workloads

Trivia: The old cfq (Completely Fair Queuing) scheduler was the Linux default for over a decade. It was replaced by mq-deadline and bfq in kernel 5.0 (2019) because cfq was designed for single-queue rotational disks and performed poorly on multi-queue NVMe devices.
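To make the scheduler choice survive reboots, a common approach is a udev rule. A sketch (the match patterns are illustrative — adjust them for your device naming):

```
# /etc/udev/rules.d/60-iosched.rules
# NVMe: let the device schedule itself
ACTION=="add|change", KERNEL=="nvme[0-9]n[0-9]", ATTR{queue/scheduler}="none"
# Rotational disks: bfq for fair interactive behavior
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="bfq"
```

Rules apply on device add/change events, so a hot-added disk picks up the right scheduler automatically instead of relying on a boot script.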


Part 7: Kubernetes Storage — From LVM to PVCs

Everything so far has been on bare metal or VMs. But if you run Kubernetes, the storage abstraction adds another layer. Kubernetes does not manage storage directly — it provides a framework (PV, PVC, StorageClass) that delegates to storage backends.

The Kubernetes Storage Stack

Pod (container writes to /var/lib/postgresql/data)
  |
Volume Mount (volumeMounts in pod spec)
  |
PVC (PersistentVolumeClaim — "I need 100Gi of fast-ssd storage")
  |
PV (PersistentVolume — "Here's a 100Gi volume backed by EBS/Portworx/NFS")
  |
StorageClass (provisioner: ebs.csi.aws.com, type: gp3)
  |
CSI Driver (creates actual volume in the backend)
  |
Actual Storage (EBS volume, Portworx volume, NFS export, iSCSI LUN)

StorageClass: The PV Factory

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: kubernetes.io/portworx-volume
parameters:
  repl: "3"
  io_profile: "db"
  priority_io: "high"
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true

WaitForFirstConsumer delays volume creation until a pod actually needs it. This ensures the volume lands in the same availability zone or on the same node as the pod — critical for zone-local storage like EBS or Portworx.
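For reference, a claim against that class might look like this (names and sizes are illustrative):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-postgres-0
  namespace: production
spec:
  accessModes:
    - ReadWriteOnce          # block storage: one node writes
  storageClassName: fast-ssd # the StorageClass defined above
  resources:
    requests:
      storage: 100Gi
```

Because of WaitForFirstConsumer, applying this PVC alone leaves it Pending; the backing volume is provisioned only when a pod that references the claim is scheduled.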

Portworx in Practice

Portworx is a software-defined storage layer that runs on your Kubernetes nodes and presents their local disks as a distributed storage pool.

# Cluster health
pxctl status
pxctl cluster list

# Volume operations
pxctl volume list
pxctl volume inspect <vol-id>
pxctl volume create pg-data --size 100 --repl 3 --io_profile db

# Alert review
pxctl alerts show

Why Portworx over raw local disks? Replication (your volume exists on 3 nodes), rack awareness (replicas spread across failure domains), and automatic reattach (if a node dies, the volume is accessible from another node that has a replica). For databases on Kubernetes, this is the layer that makes StatefulSets actually survivable.

MinIO: Object Storage for Kubernetes

MinIO provides S3-compatible object storage. Where Portworx gives you block storage (RWO volumes for databases), MinIO gives you object storage (HTTP API for blobs, backups, artifacts).

# Check cluster health
mc admin info myminio

# Create a bucket and upload
mc mb myminio/db-backups
mc cp /tmp/pg_dump_2026-03-23.sql.gz myminio/db-backups/

# List objects
mc ls myminio/db-backups/

Mental Model: Think of Kubernetes storage in three tiers: - Block (RWO): Portworx, EBS, local disks. One pod writes. For databases. - File (RWX): NFS, EFS, CephFS. Many pods read/write. For shared configs, uploads. - Object (S3 API): MinIO, S3. HTTP access. For backups, artifacts, large blobs.

Picking the wrong tier is a common mistake. Do not use a PVC for storing backup archives (use object storage). Do not use NFS for a high-IOPS database (use block storage).

Expanding a PVC

Just like extending an LV on bare metal, you can expand a PVC in Kubernetes — if the StorageClass has allowVolumeExpansion: true:

# Expand the PVC to 200Gi
kubectl patch pvc data-postgres-0 -n production \
    -p '{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}'

# Check progress
kubectl describe pvc data-postgres-0 -n production
# Conditions:
#   Type                      Status
#   FileSystemResizePending   True    <-- CSI driver expanded the volume,
#                                         waiting for pod to trigger fs resize

Most CSI drivers handle filesystem resize automatically when the pod restarts. Some do it online. Check your driver's documentation.

Gotcha: PVC expansion is one-way. You cannot shrink a PVC — the Kubernetes resize API is grow-only, and many of the backing filesystems (XFS among them) could not shrink even if it were not.


Part 8: LVM Thin Provisioning — The Double-Edged Sword

Standard LVM allocates all space upfront. Thin provisioning allocates on demand, letting you overcommit — promise more space than physically exists.

# Create a thin pool (200GB physical, can allocate more virtually)
sudo lvcreate -L 200G --thinpool thin_pool vg_data

# Create thin volumes (total 500GB from a 200GB pool)
sudo lvcreate -V 200G --thin -n app_data vg_data/thin_pool
sudo lvcreate -V 200G --thin -n db_data vg_data/thin_pool
sudo lvcreate -V 100G --thin -n logs vg_data/thin_pool

# Monitor pool usage — this is critical
sudo lvs -o lv_name,lv_size,data_percent,pool_lv vg_data
#   LV        LSize   Data%  Pool
#   thin_pool 200.00g  45.23
#   app_data  200.00g         thin_pool
#   db_data   200.00g         thin_pool
#   logs      100.00g         thin_pool

War Story: A thin-provisioned LVM pool hit 100% overnight when a backup job ran alongside normal database writes. Every thin volume in the pool received I/O errors simultaneously. Three VMs paused, two databases corrupted, and the backup itself was incomplete. Total overcommit was 2.5:1. The monitoring alert was set at 90% — too late, because the backup wrote 15% of pool capacity in 20 minutes. After that incident, the team moved the threshold to 80% and added a rate-of-change alert.

Auto-extend thin pools:

# In /etc/lvm/lvm.conf:
# thin_pool_autoextend_threshold = 80
# thin_pool_autoextend_percent = 20

This tells LVM to automatically extend the thin pool by 20% when it reaches 80% usage — but only if the VG has free space. It is a safety net, not a substitute for capacity planning.
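It helps to know your overcommit ratio explicitly rather than discovering it during an incident. With the example pool above (200G physical, 500G of thin volumes):

```shell
# Overcommit ratio = sum of virtual thin-volume sizes / physical pool size.
# Sizes from the thin-pool example above, in GiB.
awk 'BEGIN { printf "%.1f:1\n", (200 + 200 + 100) / 200 }'   # 2.5:1
```

That 2.5:1 happens to match the war story's ratio — a reminder that overcommit is only safe when paired with usage and rate-of-change alerting, not as a set-and-forget trick.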


Part 9: LVM Snapshots — Consistent Backups Without Downtime

LVM snapshots create a point-in-time copy using copy-on-write. Every write to the original volume copies the old block to the snapshot first.

# Create a 20GB snapshot of the PostgreSQL volume
sudo lvcreate -L 20G -s -n pg_snap /dev/vg_data/lv_pg

# Mount it read-only for backup
sudo mount -o ro /dev/vg_data/pg_snap /mnt/snap
tar czf /backup/pg_$(date +%Y%m%d).tar.gz -C /mnt/snap .

# Monitor snapshot usage — if it fills, it's invalidated
sudo lvs -o lv_name,data_percent,snap_percent

# Clean up immediately after backup
sudo umount /mnt/snap
sudo lvremove -f /dev/vg_data/pg_snap

Gotcha: LVM snapshots degrade I/O performance because every write to the origin triggers a copy-on-write. And if the snapshot fills up, it becomes invalid silently — your backup is corrupt. Size snapshots generously (at least 20% of origin for short-lived operations) and remove them immediately after use. Long-lived LVM snapshots are a performance and reliability hazard.


Exercises

Exercise 1: Quick Win — Read the Storage Stack (2 minutes)

Run these commands on any Linux system. No changes needed, read-only.

lsblk -o NAME,SIZE,TYPE,FSTYPE,MOUNTPOINT
sudo pvs && sudo vgs && sudo lvs
df -hT
What to look for:

- `lsblk` shows the hierarchy: disks -> partitions -> LVM -> mount points
- `pvs/vgs/lvs` shows the LVM layout and free space
- `df -hT` shows used/available space with filesystem types

If there is no LVM on the system, `pvs` will output nothing. That is normal for systems using plain partitions.

Exercise 2: Simulate a Volume Extension (10 minutes)

On a test system (VM or lab), create a small LVM setup and extend it:

# Create two 1GB files as loop devices (simulates disks)
dd if=/dev/zero of=/tmp/disk1.img bs=1M count=1024
dd if=/dev/zero of=/tmp/disk2.img bs=1M count=1024
LOOP1=$(sudo losetup --find --show /tmp/disk1.img)
LOOP2=$(sudo losetup --find --show /tmp/disk2.img)

# Build the LVM stack
sudo pvcreate $LOOP1
sudo vgcreate test_vg $LOOP1
sudo lvcreate -L 500M -n test_lv test_vg
sudo mkfs.ext4 /dev/test_vg/test_lv
sudo mkdir -p /mnt/test
sudo mount /dev/test_vg/test_lv /mnt/test
df -h /mnt/test

# Now extend it with the second disk
sudo pvcreate $LOOP2
sudo vgextend test_vg $LOOP2
sudo lvextend -l +100%FREE --resizefs /dev/test_vg/test_lv
df -h /mnt/test
Expected result: the filesystem should grow from ~500MB to nearly 2GB, the full capacity of both PVs minus a small amount of LVM metadata. The `--resizefs` flag handles the `resize2fs` call automatically. If you used XFS instead of ext4, `xfs_growfs` would be called under the hood.

Cleanup:
sudo umount /mnt/test
sudo lvremove -f /dev/test_vg/test_lv
sudo vgremove test_vg
sudo pvremove $LOOP1 $LOOP2
sudo losetup -d $LOOP1 $LOOP2
rm /tmp/disk1.img /tmp/disk2.img

Exercise 3: fio Benchmark Comparison (15 minutes)

Benchmark your system's storage and compare HDD vs SSD vs NVMe:

# Run random 4K read test
fio --name=randread --ioengine=libaio --direct=1 --bs=4k \
    --iodepth=32 --size=256M --rw=randread --filename=/tmp/fiotest

# Note the IOPS and latency
# Then compare: HDD should give ~150 IOPS, SSD ~50K, NVMe ~200K+
Interpreting results. The key numbers in fio output:

- **IOPS**: higher is better. Database workloads need high random IOPS.
- **clat avg**: completion latency. Lower is better. <1ms for SSD, <0.1ms for NVMe.
- **BW (bandwidth)**: matters for sequential workloads (backups, video).

If your NVMe shows only 10K IOPS, check: is fio using `--direct=1`? Without it, results include page cache, not actual device performance.
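When you benchmark many devices, parsing fio's human-readable output gets tedious. fio's `--minimal` flag emits one semicolon-separated line per job; in terse format v3 the read IOPS sit in field 8 (field positions vary across fio versions, so verify against `man fio` before relying on this). A sketch:

```shell
# Sketch: pull read IOPS out of fio's terse (--minimal) output.
# Field 8 = read IOPS in terse format v3; position is version-dependent.
fio_read_iops() {
  awk -F';' '{ print $8 }'
}

# Demo with a trimmed sample line (real terse lines have 100+ fields)
echo "3;fio-3.35;randread;0;0;1048576;262144;65536;4000" | fio_read_iops
# 65536
```

On a live run you would pipe `fio --minimal ... | fio_read_iops` and compare the number against the rough expectations above (~150 for HDD, ~50K for SSD, ~200K+ for NVMe).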

Exercise 4: Judgment Call — Filesystem Selection

A team asks you to set up storage for three different workloads. Which filesystem for each?

  1. PostgreSQL database on RHEL 9, 2TB volume, needs online expansion
  2. Media processing pipeline, 50TB of large video files
  3. Home NAS with snapshots, compression, and data integrity checks
Recommended answers:

1. **XFS** — RHEL default, excellent large-file and parallel I/O performance. Online grow supported. Cannot shrink, but databases rarely need to shrink volumes.
2. **XFS** — again, designed for exactly this. SGI created it for their media workstations. Allocation groups handle parallel I/O from multiple processing threads.
3. **ZFS** — built-in snapshots (`zfs snapshot`), built-in compression (`zfs set compression=lz4`), end-to-end checksums for data integrity. Or Btrfs if you want to stay in-kernel without OpenZFS.

Cheat Sheet

LVM Quick Reference

| Task | Command |
| --- | --- |
| List PVs, VGs, LVs | `pvs && vgs && lvs` |
| Initialize a disk for LVM | `pvcreate /dev/sdX` |
| Add disk to VG | `vgextend vg_name /dev/sdX` |
| Extend LV + filesystem | `lvextend -L +SIZE --resizefs /dev/vg/lv` |
| Use all free space | `lvextend -l +100%FREE --resizefs /dev/vg/lv` |
| Create snapshot | `lvcreate -L SIZE -s -n snap_name /dev/vg/lv_origin` |
| Check snapshot usage | `lvs -o lv_name,data_percent,snap_percent` |
| Remove snapshot | `lvremove /dev/vg/snap_name` |

Filesystem Operations

| Task | ext4 | XFS |
| --- | --- | --- |
| Create | `mkfs.ext4 /dev/X` | `mkfs.xfs /dev/X` |
| Grow (online) | `resize2fs /dev/X` | `xfs_growfs /mountpoint` |
| Shrink | `resize2fs /dev/X SIZE` (unmounted) | Not possible |
| Repair | `e2fsck -f /dev/X` (unmounted) | `xfs_repair /dev/X` (unmounted) |
| Check info | `tune2fs -l /dev/X` | `xfs_info /mountpoint` |

I/O Diagnosis

| Task | Command |
| --- | --- |
| Device saturation | `iostat -x 2` (watch `await` and `%util`) |
| Per-process I/O | `iotop -o` |
| I/O scheduler | `cat /sys/block/sdX/queue/scheduler` |
| Benchmark IOPS | `fio --name=t --ioengine=libaio --direct=1 --bs=4k --iodepth=32 --size=1G --rw=randread` |

SMART Monitoring

| Task | Command |
| --- | --- |
| Health check | `smartctl -H /dev/sdX` |
| Full report | `smartctl -a /dev/sdX` |
| Key attributes | Check IDs 5, 187, 197, 198 — any non-zero = investigate |
| Self-test | `smartctl -t short /dev/sdX` (~2 min) |
| NVMe health | `nvme smart-log /dev/nvme0n1` |

Kubernetes Storage

| Task | Command |
| --- | --- |
| List storage classes | `kubectl get sc` |
| List PVs | `kubectl get pv --sort-by=.spec.capacity.storage` |
| List PVCs | `kubectl get pvc -A` |
| Debug pending PVC | `kubectl describe pvc NAME -n NAMESPACE` |
| Expand PVC | `kubectl patch pvc NAME -p '{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}'` |
| Portworx status | `pxctl status` |

Takeaways

  1. The LVM extend workflow is the single most important storage operation you will perform in production. pvcreate -> vgextend -> lvextend --resizefs. Memorize it.

  2. XFS cannot shrink. Start volumes small and grow. If you need shrink capability, use ext4 or Btrfs. This single constraint drives many design decisions.

  3. SMART attributes 5, 187, 197, and 198 are the four horsemen of disk failure. Non-zero in any of them means the disk is dying. Everything else is noise.

  4. Multipath is mandatory for iSCSI in production. Without it, the OS sees duplicate block devices for the same LUN. Someone will format the "spare" device and corrupt your data.

  5. LVM snapshots are temporary. Remove them within hours. They degrade I/O and silently corrupt if they fill up. For long-lived snapshots, use ZFS or Btrfs.

  6. Kubernetes storage is just LVM/NFS/iSCSI with an API layer on top. PVC expansion works the same way as lvextend --resizefs — the CSI driver does the same steps under the hood.