# Thinking Out Loud: Linux Ops
A senior SRE's internal monologue while working through a real Linux operations task. This isn't a tutorial — it's a window into how experienced engineers actually think.
## The Situation
A production web server running Ubuntu 22.04 has become increasingly sluggish over the past 48 hours. Response times are 5x normal. No deploys, no config changes, no traffic increase. The monitoring shows CPU is at 35% — so it's not CPU-bound. Something else is going on.
## The Monologue
Slow server, CPU is fine, no obvious change. Let me get the lay of the land. When I hear "slow but CPU is fine," my first thought is always I/O or memory.
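A first pass might look like this (all read-only; `/proc/loadavg` is the ground truth behind `uptime`, and `vmstat` is guarded in case it isn't installed):

```shell
# First look: load, memory pressure, and I/O wait in one pass (read-only)
cat /proc/loadavg                                   # 1/5/15-min load, runnable/total tasks
nproc                                               # core count to judge the load against
command -v vmstat >/dev/null && vmstat 1 3 || true  # watch the wa, si, so columns
```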
load average: 12.47, 11.89, 10.22. Load average of 12 on a 4-core machine. That's 3x the core count, which means processes are spending a lot of time waiting. But CPU is only 35%, so they're not waiting for CPU — they're waiting for something else. Disk I/O or memory pressure causing swapping.
Look at that wa column — I/O wait is at 45%. And si/so (swap in/swap out) are both non-zero. 2400 pages/sec swap out. We're actively swapping. That explains everything.
### Mental Model: The USE Method (Utilization, Saturation, Errors)
For every resource (CPU, memory, disk, network), check three things: Utilization (how busy is it?), Saturation (is there a queue?), Errors (any failures?). High load average with low CPU utilization almost always means I/O saturation — processes are stuck waiting for disk or network, not for CPU time.
Let me check memory.
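The commands for this step would be something like (a read-only sketch; exact `ps` columns are a matter of taste):

```shell
# Memory picture plus the biggest resident-memory consumers (read-only)
grep -E 'MemTotal|MemAvailable|SwapTotal|SwapFree' /proc/meminfo
ps -eo pid,rss,etimes,comm --sort=-rss | head -6    # RSS is in KiB
```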
Total: 16G. Used: 15.2G. Available: 380M. Swap: 4G used out of 4G. We're fully into swap. The server has almost no free memory. What's consuming it?
The top consumer is java with 8.2GB RSS. That's our monitoring agent. Wait — 8.2GB for a monitoring agent? That's not right. Let me check when this started.
Running for 47 days, 8.4GB RSS. This is a classic Java memory leak. The JVM heap is growing without bound. Let me check the JVM flags.
No -Xmx flag. The JVM is using the default maximum heap, which on a 16GB machine is 25% of RAM (4GB). But the RSS is 8.4GB — that means the JVM is using way more than its heap. The off-heap (native memory, thread stacks, JIT compilation cache, memory-mapped files) is larger than the heap.
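You can sanity-check that 25% default against the machine itself (a sketch; the `java` command is shown as a comment since it assumes a JVM on PATH):

```shell
# With no -Xmx, the JVM defaults MaxHeapSize to roughly 1/4 of physical RAM
mem_kb=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
echo "default max heap here would be about $((mem_kb / 4 / 1024)) MiB"
# To see what a running JVM actually resolved (needs java on PATH):
#   java -XX:+PrintFlagsFinal -version | grep -w MaxHeapSize
```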
I've seen this before with Elasticsearch and monitoring agents. Native memory leaks in JNI libraries or excessive memory-mapped file usage. But I need to fix the symptom now and root-cause later.
Option 1: Restart the Java process. Fast, fixes the immediate problem, but it'll leak again.
Option 2: Set proper -Xmx and restart. Better, but doesn't fix the native memory leak.
Option 3: Restart with proper flags AND set a systemd memory limit as a safety net.
Option 3. Let me do it properly.
It's a systemd service. Let me add a memory limit via a drop-in override.
```ini
# /etc/systemd/system/monitoring-agent.service.d/override.conf
[Service]
Environment="JAVA_OPTS=-Xmx2g -Xms512m -XX:MaxMetaspaceSize=256m"
MemoryMax=4G
MemoryHigh=3G
```
This sets the JVM heap max to 2GB and puts a hard systemd cgroup limit of 4GB. If the process exceeds 4GB, systemd will OOM-kill it and restart it automatically. The MemoryHigh at 3GB will trigger throttling before the hard kill.
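Applying the drop-in would look roughly like this (guarded so it is a no-op on a host without systemd or without this particular service name):

```shell
# Apply the drop-in and restart, then confirm the cgroup limits landed.
# Guarded: does nothing where systemd or the service is absent.
if command -v systemctl >/dev/null 2>&1 \
   && systemctl cat monitoring-agent.service >/dev/null 2>&1; then
  systemctl daemon-reload
  systemctl restart monitoring-agent.service
  systemctl show monitoring-agent.service -p MemoryMax -p MemoryHigh
else
  echo "monitoring-agent.service not present here; nothing to apply"
fi
```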
### Mental Model: Defense in Depth for Memory
Never rely on application-level memory limits alone. Set limits at multiple layers: JVM heap (-Xmx), cgroup limits (systemd MemoryMax), and OS-level controls (swap priority, oom_score_adj). If the application leaks past its own limit, the cgroup catches it. If the cgroup doesn't catch it, the OOM killer handles it — but you want to be the one choosing which process gets killed, not the kernel.
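One concrete instance of the OS-level layer: the same drop-in can steer the OOM killer's choice. `OOMScoreAdjust` is a real systemd directive; the value here is an illustrative assumption, not from the incident:

```ini
# /etc/systemd/system/monitoring-agent.service.d/override.conf (extra key, sketch)
[Service]
# Positive values make this process a more likely OOM-kill victim than,
# say, the web server, so the kernel's choice matches ours
OOMScoreAdjust=500
```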
Let me watch the memory recover.
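A quick way to watch it, sampled a few times rather than with an open-ended `watch` (read-only):

```shell
# Sample the recovery: MemAvailable and SwapFree should both climb
# as pages are reclaimed after the restart
for i in 1 2 3; do
  grep -E 'MemAvailable|SwapFree' /proc/meminfo
  sleep 1
done
```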
Swap usage is dropping as pages get reclaimed. Available memory is climbing. Give it a minute...
Okay, available memory is back up to 8GB and swap usage keeps falling. Let me check whether the web server response times are recovering.
200ms. Down from the 2-3 seconds it was probably taking during the swap thrashing. Let me check the web server logs for request timing.
Response times are back to normal range (50-150ms). Good. But I want to make sure the swap is fully drained. Swap usage can linger even after memory is freed, because swapped-out pages stay on disk until they're next touched — and each first touch is a slow page-in that shows up as a latency spike later.
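The drain itself is the classic swapoff/swapon cycle, safe only when available memory comfortably exceeds current swap usage. A sketch, with a root guard so it is inert elsewhere:

```shell
# Force every swapped-out page back into RAM, then re-enable swap.
# Only run this when MemAvailable comfortably exceeds swap used.
if [ "$(id -u)" -eq 0 ]; then
  swapoff -a && swapon -a || echo "swap drain not permitted in this environment"
fi
cat /proc/swaps     # Used should be at or near 0 after the drain
```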
That forces all swap pages back into RAM. Aggressive, but we have plenty of free memory now. Let me verify.
Swap: 0B used. Available memory: 6.2GB. We're healthy.
One more thing — I want to set vm.swappiness lower so the kernel is less eager to swap in the future. The default is 60, which is too aggressive for a server.
```shell
echo 'vm.swappiness=10' | sudo tee -a /etc/sysctl.d/99-performance.conf
sudo sysctl -p /etc/sysctl.d/99-performance.conf
```
### Mental Model: Swappiness as a Policy Knob
vm.swappiness controls how aggressively the kernel moves pages to swap vs dropping file caches. On servers where response latency matters, set it low (10-20). On desktops, the default (60) is fine. Setting it to 0 doesn't disable swap — it just makes the kernel try harder to keep things in RAM until pressure is critical.
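Worth confirming the effective runtime value, since the file under /etc/sysctl.d only matters once sysctl re-reads it:

```shell
# The kernel's live value, with or without the sysctl binary installed
cat /proc/sys/vm/swappiness
command -v sysctl >/dev/null && sysctl vm.swappiness || true
```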
Let me also check if there are other memory hogs that could cause this again.
The monitoring agent is now at 520MB. The web server is at 2.1GB. Everything else is small. We're in good shape. I'll set up a memory alert at 85% utilization so we catch this before it impacts performance next time.
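That alert can start as something as simple as a cron-able check. The 85% threshold matches the number above; the message and any pager wiring are assumptions:

```shell
# Alert when (MemTotal - MemAvailable) / MemTotal crosses 85%
used_pct=$(awk '/MemTotal/ {t=$2} /MemAvailable/ {a=$2} END {print int((t-a)*100/t)}' /proc/meminfo)
if [ "$used_pct" -ge 85 ]; then
  echo "ALERT: memory utilization at ${used_pct}%"
else
  echo "ok: memory utilization at ${used_pct}%"
fi
```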
## What Made This Senior-Level
| Junior Would... | Senior Does... | Why |
|---|---|---|
| See 35% CPU and say "the server is fine" | Recognize that load average 12 on 4 cores with low CPU means I/O wait or memory pressure | CPU utilization alone is misleading — you need to check all resources |
| Just restart the Java process | Restart with proper memory limits AND add a cgroup safety net | Fixing the symptom without adding guardrails means it will happen again |
| Not think about swap behavior after freeing memory | swapoff/swapon to force swap drain AND lower vm.swappiness | Swap pages persist until accessed, and the default swappiness is too aggressive for servers |
| Not know that RSS can exceed -Xmx in Java | Understand JVM native memory, off-heap allocations, and how they escape heap limits | Java memory management has multiple layers — the heap is only one of them |
## Key Heuristics Used
- The USE Method: For each resource, check Utilization, Saturation, and Errors. High load + low CPU = I/O or memory saturation.
- Defense in Depth for Memory: Set limits at application level (-Xmx), cgroup level (systemd MemoryMax), and OS level (swappiness, OOM score).
- Fix Then Prevent: Restart the process to fix the symptom, add resource limits to prevent recurrence, add monitoring to detect early.
## Cross-References
- Primer — Linux filesystem, process management, and systemd basics
- Street Ops — Performance triage commands and memory debugging workflows
- Footguns — Java RSS exceeding Xmx, default swappiness on servers, and unmonitored memory growth