
The Mysterious Latency Spike


Topics: latency diagnosis, CPU, disk I/O, GC, networking, throttling, noisy neighbors
Level: L2 (Operations)
Time: 60–75 minutes
Prerequisites: Basic Linux command line


The Mission

Your dashboard shows a latency spike: p99 response time jumped from 50ms to 2 seconds at 2:47 PM and returned to normal at 2:52 PM. No deploys. No config changes. No alerts fired except the latency alert itself.

Five minutes of mystery. Five minutes that, if they happen during a demo or a peak traffic period, cost real money. You need to find the cause — not just "it fixed itself" but what actually happened and whether it will happen again.

Latency spikes have at least seven completely different root causes, and the fix for each is different. This lesson teaches you to diagnose them systematically.


The Diagnostic Framework: USE + The Four Resources

For every resource, check three things (Brendan Gregg's USE Method):

  • Utilization — how busy is it?
  • Saturation — is there a queue (more work than capacity)?
  • Errors — is anything failing?

Apply USE to the four resources that cause latency:

CPU      → Is the process CPU-bound? Is it being throttled?
Memory   → Is the system swapping? Is GC pausing the app?
Disk I/O → Is the disk slow? Is something else competing for I/O?
Network  → Is there packet loss? Is a connection timing out?

Suspect 1: CPU Throttling (cgroups / Kubernetes)

The most common cause of mysterious latency spikes in containerized environments. Your app burns through its CPU quota partway into the scheduling period, then is frozen until the next period begins. From the app's perspective, it froze for no reason.

# Check CPU throttling (Kubernetes pod)
cat /sys/fs/cgroup/cpu.stat
# → nr_throttled 45231        ← how many times throttled
# → throttled_usec 892340000  ← total microseconds throttled

# Check CPU limits
cat /sys/fs/cgroup/cpu.max
# → 100000 100000             ← 100ms quota per 100ms period = 1 CPU
# → 50000 100000              ← 50ms per 100ms = 0.5 CPU

If nr_throttled is high, your app is being rate-limited by its cgroup CPU quota. The fix isn't always "increase the limit" — it might be:

  • A burst of CPU (GC, JIT compilation) exceeding the quota
  • An inefficient code path that occasionally uses more CPU
  • The limit is set too tight for the workload's variance

# Kubernetes: check CPU limit vs actual usage
kubectl top pod myapp
# → NAME    CPU    MEMORY
# → myapp   450m   256Mi    ← using 450m with 500m limit = tight

# Check throttling events
kubectl exec myapp -- cat /sys/fs/cgroup/cpu.stat | grep throttled

Gotcha: CPU limits in Kubernetes use CFS (Completely Fair Scheduler) bandwidth control. A pod with limits.cpu: "1" gets 100ms of CPU time per 100ms period, shared across all of its threads. Two busy threads can burn that quota in 50ms of wall-clock time; the pod is then frozen for the remaining 50ms of the period, and any leftover work waits for the next period. Repeated across periods, this looks like random slowness with no obvious cause.
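
To turn those raw counters into something actionable, compute how often the cgroup is throttled and how long each stall lasts. A minimal sketch, reusing the nr_throttled and throttled_usec values from the cpu.stat sample above plus a hypothetical nr_periods:

```shell
# Counters as they'd appear in /sys/fs/cgroup/cpu.stat;
# nr_periods here is a made-up value for illustration
cpu_stat="nr_periods 100000
nr_throttled 45231
throttled_usec 892340000"

echo "$cpu_stat" | awk '
  $1 == "nr_periods"     { periods   = $2 }
  $1 == "nr_throttled"   { throttled = $2 }
  $1 == "throttled_usec" { usec      = $2 }
  END {
    printf "throttled in %.0f%% of periods\n", 100 * throttled / periods
    printf "avg stall per throttled period: %.1f ms\n", usec / throttled / 1000
  }'
# → throttled in 45% of periods
# → avg stall per throttled period: 19.7 ms
```

Throttling in 45% of periods with ~20ms stalls is exactly the shape that makes p99 far worse than the median while averages look fine.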

Mental Model: CPU throttling is like a car with a speed limiter. It goes fast, then hits the limit and coasts until the next allowance period. From the passenger's perspective, the car inexplicably hesitated. The driver's dashboard shows nothing wrong.


Suspect 2: Garbage Collection Pauses

JVM, Go, Python, .NET — any garbage-collected runtime will occasionally pause your application to clean up memory. Major GC pauses can take hundreds of milliseconds.

# Java: check GC logs
# Add to JVM args: -Xlog:gc*:file=/tmp/gc.log:time,uptime,level,tags
grep "Pause" /tmp/gc.log
# → [2026-03-22T14:47:12] GC(42) Pause Full (Ergonomics) 1.2s
#                                                          ↑ 1.2 seconds!

# Go: enable GC tracing
GODEBUG=gctrace=1 ./myapp 2>&1 | grep gc
# → gc 42 @180.050s 2%: 0.021+150+0.003 ms clock, ...
#                       ↑ the first and last numbers (0.021 and 0.003 ms) are
#                         the stop-the-world pauses; the middle 150ms is
#                         concurrent mark, during which the app keeps running

GC pauses correlate with memory pressure — when the heap fills up, the GC has to work harder. The spike at 2:47 PM might be a traffic burst that allocated more objects, triggering a full GC.

# Check memory usage pattern
kubectl top pod myapp --containers
# If memory is near the limit when spikes happen → GC suspect

# For JVM: reduce heap pressure
# -XX:MaxGCPauseMillis=200  (target max pause time)
# -XX:+UseG1GC              (better for latency-sensitive apps)
# -XX:+UseZGC               (sub-millisecond pauses, Java 15+)
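
To correlate GC with a spike, you want the biggest pauses and their timestamps in one view. A minimal sketch over two hypothetical -Xlog:gc* style lines (the real log format varies by JVM version and flags):

```shell
# Two made-up GC log lines in the shape produced by -Xlog:gc*:file=...:time
cat <<'EOF' > /tmp/gc-sample.log
[2026-03-22T14:45:01] GC(40) Pause Young (Normal) 12.5ms
[2026-03-22T14:47:12] GC(42) Pause Full (Ergonomics) 1200.0ms
EOF

# Longest pauses first — correlate the timestamps with the spike window
grep "Pause" /tmp/gc-sample.log | awk '{ print $NF, $1 }' | sort -rn
# → 1200.0ms [2026-03-22T14:47:12]
# → 12.5ms [2026-03-22T14:45:01]
```

Here the 1.2s full GC lands at 14:47 — the same minute the latency spike started.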

Suspect 3: Disk I/O Contention

Something is hammering the disk — a backup, a log rotation, another container, or a database vacuum. Your app's reads/writes queue behind the I/O storm.

# Check disk I/O latency
iostat -xz 1 5
# → Device   r/s   w/s   rkB/s   wkB/s   await  %util
# → sda      5.0   850   20      340000  45.2    98%
#                    ↑ 850 writes/sec       ↑ 45ms avg latency  ↑ 98% utilized

# Who's doing all the I/O?
iotop -o
# → PID    DISK WRITE    COMMAND
# → 5678   340 MB/s      pg_dump
#                          ↑ found it — a backup

Metric    Meaning                    Trouble threshold
%util     How busy the device is     >80% sustained
await     Average I/O latency (ms)   >20ms for SSDs, >50ms for HDDs
avgqu-sz  Average queue depth        >4 = queuing

# Check if your process is affected
strace -e trace=read,write -T -p $(pgrep myapp) 2>&1 | head -20
# → read(3, "...", 4096)  = 4096 <0.000023>   ← 23µs, fine
# → write(7, "...", 512)  = 512  <0.045123>   ← 45ms! Disk is slow

Gotcha: iowait in top is NOT a measure of disk I/O performance. It means the CPU was idle AND waiting for I/O. A system with lots of CPU headroom can show high iowait without any actual problem. Use iostat -xz for real disk latency.
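
If iostat isn't installed, roughly the same await figure can be derived from /proc/diskstats, which is what iostat reads. A sketch over two hypothetical snapshots of one device (real snapshots would be two reads of the file, about a second apart):

```shell
# Two made-up /proc/diskstats lines for sda, taken ~1s apart
# (fields: 4 = reads completed, 7 = ms reading, 8 = writes completed,
#  11 = ms writing, per the kernel's diskstats documentation)
snap1="8 0 sda 1000 0 0 2000 500 0 0 9000"
snap2="8 0 sda 1100 0 0 2150 600 0 0 13490"

# await ≈ Δ(time spent on I/O) / Δ(completed I/Os)
printf '%s\n%s\n' "$snap1" "$snap2" | awk '
  { ops[NR] = $4 + $8; ms[NR] = $7 + $11 }
  END { printf "await ≈ %.1f ms\n", (ms[2] - ms[1]) / (ops[2] - ops[1]) }'
# → await ≈ 23.2 ms
```

23ms average latency on an SSD is over the trouble threshold in the table above.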


Suspect 4: Network Issues

A backend dependency (database, cache, external API) is slow or dropping packets. Your app waits for the response, and latency spikes.

# Check for retransmissions (sign of packet loss)
ss -ti | grep -E "retrans|rto"
# → retrans:0/5     ← 5 total retransmissions (bad)

# Check connection to specific backend
mtr -n --report database-host
# Look for: packet loss at any hop, high latency at specific hops

# DNS resolution slow?
time dig database-host
# → real 0m2.100s   ← 2 seconds for DNS! Normal is <50ms

# Check for conntrack table full (silent packet drops)
dmesg | grep conntrack
# → nf_conntrack: table full, dropping packet

Gotcha: DNS timeouts are invisible at the application level. Your app calls connect("database-host", 5432) and it takes 5 seconds because the DNS resolver is slow. The connection itself is fast once the IP is resolved. Add DNS lookup time to your latency breakdown: curl -w "DNS: %{time_namelookup}s\n" ...
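
The resolver latency can be pulled straight out of dig's summary footer. A sketch over a hypothetical output fragment (database-host is a placeholder, and the 50ms threshold is the rule of thumb from above):

```shell
# Made-up fragment of the footer that `dig database-host` prints
dig_out=";; Query time: 2100 msec
;; SERVER: 10.0.0.2#53(10.0.0.2)"

# Flag resolver latency above ~50ms
echo "$dig_out" | awk '/Query time/ { print ($4 > 50 ? "DNS slow: " : "DNS ok: ") $4 " ms" }'
# → DNS slow: 2100 ms
```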


Suspect 5: Noisy Neighbor

On shared infrastructure (cloud VMs, multi-tenant Kubernetes), another workload is consuming resources that your app needs.

# CPU steal time (virtualized environments)
top
# → %Cpu(s):  20 us,  5 sy,  0 ni, 55 id,  0 wa,  0 hi,  0 si, 20 st
#                                                                  ↑ 20% steal!

# st > 5% = hypervisor is taking CPU for other tenants
# Your app is losing 20% of its CPU to neighbors
# Fix: upgrade to dedicated instance type or reduce contention

In Kubernetes, noisy neighbor comes from pods without resource limits:

# Find pods without limits (these can consume unlimited resources)
kubectl get pods -A -o json | jq -r '
    .items[] | select(.spec.containers[].resources.limits == null)
    | .metadata.namespace + "/" + .metadata.name'

Gotcha: CPU steal time is invisible to application profiling. perf and strace don't show hypervisor time — from their perspective, your function just took longer. Only top and /proc/stat show steal time. If you're profiling on a cloud VM and can't explain where time goes, check st first.
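
Steal also appears as the 8th value after "cpu" in /proc/stat, so you can check it without top. A sketch over a hypothetical snapshot — note a single snapshot only gives the average since boot; a real check diffs two samples a few seconds apart:

```shell
# Made-up /proc/stat cpu line
# (value order: user nice system idle iowait irq softirq steal guest guest_nice)
cpu_line="cpu 2000 0 500 5500 0 0 0 2000 0 0"

echo "$cpu_line" | awk '{
  total = 0
  for (i = 2; i <= NF; i++) total += $i
  printf "steal: %.0f%% of CPU time since boot\n", 100 * $9 / total
}'
# → steal: 20% of CPU time since boot
```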


Suspect 6: Transparent Huge Pages (THP)

Linux's Transparent Huge Pages feature can cause latency spikes when it defragments memory to create 2MB pages. This happens in the background but can stall allocations for 10-100ms.

# Check if THP is enabled
cat /sys/kernel/mm/transparent_hugepage/enabled
# → [always] madvise never

# Check defragmentation
cat /sys/kernel/mm/transparent_hugepage/defrag
# → [always] defer defer+madvise madvise never

If always is set, disable THP for latency-sensitive workloads (especially databases):

echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

Gotcha: Redis, MongoDB, and Oracle explicitly recommend disabling THP. The database community discovered this the hard way — THP defragmentation caused periodic 100ms stalls that looked like disk I/O problems but were actually memory management.


Suspect 7: Application-Level Issues

Sometimes the latency spike is in your code — a slow query, a lock contention, an unbounded loop, or an external API timeout.

# Check for lock contention (Java)
jstack $(pgrep java) | grep -A5 "BLOCKED"
# → "http-handler-42" BLOCKED on java.util.concurrent.ConcurrentHashMap

# Check for slow queries (PostgreSQL)
psql -c "SELECT pid, now() - query_start AS duration, query
         FROM pg_stat_activity
         WHERE state = 'active' AND query_start < now() - interval '1 second'
         ORDER BY duration DESC;"

# Check for external API timeouts
# Add timing to your HTTP client
curl -w "Connect: %{time_connect}s\nTTFB: %{time_starttransfer}s\nTotal: %{time_total}s\n" \
     -o /dev/null -s https://external-api.com/endpoint

The Diagnostic Ladder

Latency spike occurred
├── When did it happen exactly?
│   Check Grafana/Prometheus for the spike timestamp
├── CPU throttling?
│   cat /sys/fs/cgroup/cpu.stat → nr_throttled high?
│   └── Yes → Increase CPU limit or optimize CPU usage
├── GC pause?
│   Check GC logs → Pause > 100ms?
│   └── Yes → Tune GC, increase heap, use ZGC
├── Disk I/O?
│   iostat -xz → %util > 80%? await high?
│   └── Yes → iotop to find culprit → reschedule or throttle
├── Network?
│   ss -ti → retransmissions? DNS slow? conntrack full?
│   └── Yes → Fix DNS, increase conntrack, fix packet loss
├── Noisy neighbor?
│   top → steal time > 5%?
│   └── Yes → Dedicated instance or reduce contention
├── THP defrag?
│   /sys/kernel/mm/transparent_hugepage/enabled → always?
│   └── Yes → Set to never for latency-sensitive workloads
└── Application?
    Slow queries? Lock contention? External API timeout?
    └── Profile with perf, jstack, pg_stat_activity

Flashcard Check

Q1: Pod CPU usage is 450m with a 500m limit. Why does latency spike?

CPU throttling. CFS gives the pod 50ms of CPU per 100ms period (500m = 0.5 CPU). Bursts that exceed the quota get throttled, and each throttled period can stall the pod for up to 50ms until the next quota refill.

Q2: iowait in top is high. Is the disk slow?

Not necessarily. iowait means CPU was idle while waiting for I/O. Use iostat -xz (await and %util) for actual disk latency.

Q3: top shows 20% steal time. What does that mean?

The hypervisor is giving 20% of your CPU time to other tenants. Your app gets 20% less CPU. Invisible to application profiling tools.

Q4: Redis has periodic 100ms stalls. Where would you look first?

Transparent Huge Pages. Redis explicitly recommends disabling THP. Check /sys/kernel/mm/transparent_hugepage/enabled.

Q5: How do you check if a latency spike was GC-related?

Check GC logs. For JVM: -Xlog:gc* and grep for "Pause". For Go: GODEBUG=gctrace=1. Correlate pause timestamps with the latency spike time.


Exercises

Exercise 1: Build a latency diagnostic dashboard (think)

Design a monitoring dashboard with panels that cover all 7 suspects. What metrics would you display?

One approach:

  1. Request latency (p50, p95, p99) — the symptom
  2. CPU usage + throttle count — container_cpu_cfs_throttled_seconds_total
  3. GC pause duration — JVM/Go runtime metrics
  4. Disk I/O await — node_disk_io_time_seconds_total
  5. Network retransmissions — node_netstat_Tcp_RetransSegs
  6. DNS lookup time — application metric
  7. CPU steal time — node_cpu_seconds_total{mode="steal"}
  8. Memory pressure — /proc/pressure/memory or container_memory_working_set_bytes

Exercise 2: Check your system right now (hands-on)

# CPU throttling (containers)
cat /sys/fs/cgroup/cpu.stat 2>/dev/null | grep throttled

# Disk latency
iostat -xz 1 3 2>/dev/null || echo "iostat not installed (sysstat package)"

# Network retransmissions
ss -ti 2>/dev/null | grep retrans | head -5

# Steal time (the 8th value after "cpu" — /proc/stat has no "steal" label)
awk '/^cpu / { print "steal jiffies since boot:", $9 }' /proc/stat 2>/dev/null

# THP status
cat /sys/kernel/mm/transparent_hugepage/enabled 2>/dev/null

Cheat Sheet

Quick Latency Diagnosis

Suspect        Check command                           What to look for
CPU throttle   cat /sys/fs/cgroup/cpu.stat             nr_throttled high
GC pause       GC logs / runtime metrics               Pause > 100ms
Disk I/O       iostat -xz 1                            await > 20ms, %util > 80%
I/O source     iotop -o                                Which process is writing
Network        ss -ti                                  Retransmissions > 0
DNS            time dig hostname                       > 50ms
Conntrack      dmesg | grep conntrack                  "table full"
Steal time     top (st column)                         > 5%
THP            cat /.../transparent_hugepage/enabled   [always]
Slow queries   pg_stat_activity                        Duration > 1s

Takeaways

  1. CPU throttling is the #1 invisible cause. Containers with tight CPU limits get throttled silently. nr_throttled is the metric nobody checks.

  2. Check steal time on cloud VMs. It's invisible to profilers. If st > 5%, your code isn't the problem — the hypervisor is taking your CPU.

  3. iowait is not disk latency. Use iostat -xz for actual disk performance. iotop tells you who's responsible.

  4. THP causes latency in databases. Redis, MongoDB, and Oracle all recommend disabling it. Memory defragmentation stalls allocations for 10-100ms.

  5. Diagnose by elimination, not intuition. The ladder matters. CPU → GC → Disk → Network → Noisy neighbor → THP → Application. Don't jump to "it must be the database" without checking the infrastructure first.


Related Lessons

  • Out of Memory — when memory pressure causes latency (then kills the process)
  • Connection Refused — when network issues cause failures instead of slowness
  • The Disk That Filled Up — when I/O contention is caused by a full disk