
Thinking Out Loud: Linux Performance

A senior SRE's internal monologue while working through a real Linux performance issue. This isn't a tutorial — it's a window into how experienced engineers actually think.

The Situation

The database team reports that their PostgreSQL server (bare metal, 64 cores, 256GB RAM) has degraded query performance. The p99 latency for reads went from 5ms to 80ms over the past 6 hours. No schema changes, no traffic increase. They've checked query plans and everything looks the same. They're asking infra to look at the host.

The Monologue

Database query latency increased 16x with no query plan changes. If the queries are the same and the plans are the same, the host is the problem. Let me start with the 60-second performance checklist.

uptime && dmesg -T | tail -5 && vmstat 1 3 && mpstat -P ALL 1 1 && iostat -xz 1 1

Load average: 18 on 64 cores — not CPU-bound. But mpstat shows cores averaging only 3% utilization, and Linux load counts processes in uninterruptible (D-state) sleep, so that gap smells like I/O. No kernel errors in dmesg. Vmstat shows minimal swap and low I/O wait. And iostat — there it is. sda at 98% utilization with an average wait (await) of 45ms. The disk is pegged.

But wait — this is a database server. It should be on NVMe, not SATA. Let me check what sda actually is.

lsblk -d -o NAME,ROTA,TYPE,SIZE,MODEL,TRAN

sda is a 2TB spinning disk (ROTA=1). The NVMe drives are nvme0n1 and nvme1n1. The database data directory should be on the NVMe RAID. Let me check.

df -h /var/lib/postgresql/data

It's on /dev/md0 which is... let me check.

cat /proc/mdstat

md0 is a RAID-1 of nvme0n1p1 and nvme1n1p1. Good — the database is on the NVMe RAID. So why is sda at 98% utilization and affecting performance?

Mental Model: Indirect I/O Contention

On Linux, I/O contention isn't always direct. Heavy I/O on one device can create pressure in shared kernel subsystems (page cache, writeback threads, interrupt handling) that degrades performance on other devices. Also check whether anything on the saturated device shares a filesystem, uses the same swap, or triggers kernel writeback pressure.

Let me figure out what's hammering sda.

sudo iotop -b -o -n 1 | head -20

A process called rsync is doing 180MB/s writes to sda. And below it, pg_dump is feeding it. Someone is running a full database backup to the spinning disk. The I/O is consuming all the disk bandwidth on sda, but why would that affect the NVMe drives?

Let me check for dirty page pressure.

grep -i dirty /proc/meminfo

Dirty: 4.2GB, Writeback: 2.1GB. The dirty_background_ratio is probably the default 10%, which on a 256GB machine is 25GB. We're only at 4.2GB dirty, so that's not the trigger. But let me check something else — the writeback is happening on sda, and if the I/O scheduler is using lots of kernel threads...
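Rather than assume the defaults, the actual thresholds can be read straight from /proc:

```shell
# Writeback thresholds, as a percent of reclaimable memory.
# dirty_background_ratio: background flusher threads start writing back.
# dirty_ratio: writing processes are throttled and forced to write back themselves.
cat /proc/sys/vm/dirty_background_ratio /proc/sys/vm/dirty_ratio
```

(If an admin has set the *_bytes variants instead, these ratios read as 0 and /proc/sys/vm/dirty_background_bytes is the one that matters.)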

Actually, let me look at this differently. What if the backup process is reading from the NVMe (PostgreSQL data) and writing to the SATA disk? The read side might be evicting page cache that PostgreSQL relies on.

sudo perf stat -e cache-misses -p $(pgrep postgres | head -1) sleep 5

Hmm, let me try a different approach. Let me check if the page cache dropped.

grep -E "^(Cached|Buffers|MemAvailable)" /proc/meminfo

Cached: 42GB. On a 256GB machine running a database, I'd expect most of the RAM to be page cache. 42GB is low. Where's the rest?

free -h

Used: 180GB. Cached: 42GB. The database shared buffers plus the backup process must be consuming the rest. Let me check PostgreSQL shared_buffers.

sudo -u postgres psql -c "SHOW shared_buffers;"

shared_buffers = 64GB. Plus the pg_dump process is reading the entire database sequentially, which causes the kernel to cache those pages and evict the "hot" pages that PostgreSQL was relying on for index lookups. This is the classic "sequential scan evicts cache" problem.

Mental Model: Page Cache Pollution

Sequential I/O (like backups, ETL reads, file copies) fills the page cache with data that will only be read once, evicting pages from random-access workloads (like database index lookups) that need to stay cached. On Linux, use fadvise or ionice to limit the cache pollution impact of sequential operations.
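As a concrete sketch of the fadvise approach: GNU dd exposes posix_fadvise(POSIX_FADV_DONTNEED) through its nocache flag, so a bulk copy can tell the kernel to drop its pages as it goes (paths below are hypothetical):

```shell
# Copy a dump to the slow disk without polluting the page cache.
# iflag=nocache / oflag=nocache make GNU dd advise the kernel to discard
# cached pages for the byte ranges it has already processed.
dd if=/nvme/backups/db.dump of=/mnt/sata/db.dump bs=1M iflag=nocache oflag=nocache
```

The data still flows through the cache momentarily, but it doesn't accumulate and evict the database's hot pages.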

The fix is to make the pg_dump/rsync process polite about I/O. But first, let me reduce the immediate impact.

sudo ionice -c 3 -p $(pgrep pg_dump)
sudo ionice -c 3 -p $(pgrep rsync)

That sets both processes to "idle" I/O class — they only get disk bandwidth when nothing else needs it. But the cache damage is already done. The PostgreSQL hot pages have been evicted and need to be read back from NVMe.
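Worth a quick check that the change stuck (a sketch; assumes the pg_dump and rsync just re-prioritized are the oldest matching processes):

```shell
# Confirm the backup processes are now in the idle class; util-linux's
# ionice prints the scheduling class name (e.g. "idle") for a given PID.
# pgrep -o picks the oldest matching PID.
ionice -p "$(pgrep -o pg_dump)"
ionice -p "$(pgrep -o rsync)"
```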

Let me also drop the caches so the kernel repopulates from actual access patterns instead of the backup's sequential pattern.

sync && echo 3 | sudo tee /proc/sys/vm/drop_caches

That's aggressive, but on a database server, the hot working set will re-warm quickly from the NVMe. Better to start fresh than to leave the polluted cache.
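To watch the recovery, a crude sample of the Cached line over time is enough (interval and sample count are arbitrary):

```shell
# Sample the page cache size a few times while the working set re-warms;
# the Cached value should climb back as hot pages are re-read from NVMe.
for i in 1 2 3; do
    grep '^Cached:' /proc/meminfo
    sleep 5
done
```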

sudo -u postgres psql -c "SELECT pg_prewarm('critical_table_pkey');" 2>/dev/null || echo "pg_prewarm extension not available"

If pg_prewarm is available, I can explicitly warm the critical indexes. Let me check...

Extension not available. That's fine — the database will warm itself through normal query traffic. The p99 will be elevated for a few minutes as indexes get re-cached, then settle back to normal.

Now, for the long-term fix: the backup script needs to change. I'll add ionice -c 3, switch from raw pg_dump to pgBackRest or pg_basebackup with rate limiting, and schedule the job during off-peak hours.

cat /etc/cron.d/postgres-backup

Ah — the backup runs at midnight via cron with no I/O limiting. And midnight isn't off-peak for this workload (it's a global service). Let me check when the low-traffic window is... that's a question for the team, but for now I'll add ionice to the cron entry and leave a note.
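A sketch of what the amended entry could look like (the 03:00 schedule, destination path, and 50M cap are assumptions to confirm with the team; pg_basebackup has supported --max-rate since 9.4, and % must be escaped in crontabs):

```shell
# /etc/cron.d/postgres-backup (hypothetical amended entry)
# ionice -c 3: idle I/O class; --max-rate caps read bandwidth at the source;
# --checkpoint=spread avoids an I/O spike when the backup starts.
0 3 * * * postgres ionice -c 3 pg_basebackup -D /mnt/sata/backups/$(date +\%F) --max-rate=50M --checkpoint=spread
```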

What Made This Senior-Level

  • Junior: sees low CPU and says "host looks fine." Senior: checks all resources systematically (USE method) and finds the saturated disk. Why: performance issues are rarely where you first look.
  • Junior: doesn't connect backup I/O to database latency. Senior: recognizes that sequential backup I/O pollutes the page cache the database relies on. Why: understanding the Linux page cache model reveals non-obvious interactions.
  • Junior: kills the backup process. Senior: uses ionice to deprioritize it while letting it finish. Why: killing a backup mid-stream can leave partial files that force a re-run.
  • Junior: doesn't think about cache warming after the fix. Senior: drops caches to clear the pollution and considers pg_prewarm for critical indexes. Why: the cache doesn't fix itself instantly; actively managing re-warming speeds recovery.

Key Heuristics Used

  1. 60-Second Performance Checklist: uptime, dmesg, vmstat, mpstat, iostat in one pass — this covers CPU, memory, disk, and kernel errors in under a minute.
  2. Page Cache Pollution: Sequential I/O evicts random-access hot pages. Always run backups and ETL with ionice, and consider posix_fadvise(POSIX_FADV_DONTNEED) to limit cache pollution.
  3. Indirect Resource Contention: A saturated resource (SATA disk) can degrade performance on an unrelated resource (NVMe) through shared kernel subsystems like page cache and writeback.

Cross-References

  • Primer — Linux memory management, I/O subsystem, and process scheduling fundamentals
  • Street Ops — The 60-second performance checklist and iotop/perf debugging workflows
  • Footguns — Backup jobs without ionice, default I/O scheduling on database servers, and page cache eviction