# Mental Model: USE Method
Category: Debugging & Diagnosis
Origin: Brendan Gregg (performance engineer), formalized ~2012
One-liner: For every resource in the system, check Utilization, Saturation, and Errors — in that order.
## The Model
The USE Method gives you a structured checklist to investigate performance problems. Without a framework, engineers gravitate toward familiar tools or gut instinct — running top, eyeballing CPU, and concluding "seems fine." The USE Method forces you to be systematic: enumerate every resource first, then apply the same three-question test to each one.
The three dimensions: Utilization is the percentage of time a resource is busy doing work (as opposed to idle). A CPU at 90% utilization is busy 90% of the time. Saturation means the resource has more demand than it can immediately serve — work is queuing. A resource can be fully utilized without being saturated (steady throughput) or saturated at surprisingly low utilization if the queue is pathological. Errors are discrete failure events the resource is generating: hardware errors, dropped packets, disk I/O errors, CPU machine-check exceptions.
The key insight is that these three signals each tell you something different. Utilization alone misses saturation — a disk at 60% utilization with a 500ms I/O queue is your bottleneck. Saturation alone misses errors — you might be saturated because you're retrying failed operations. Checking all three, for every resource, ensures you don't skip the real cause because it lives in an unfamiliar dimension.
The method's boundary condition: it applies to resource-based bottlenecks. It does not directly help you debug logic errors, misconfigured applications, or latency caused by external dependencies that are not resources you can instrument. For service-layer problems — where the resource pool is healthy but requests are still slow — pivot to the RED Method.
Applied to a Linux system, your resource checklist typically includes: CPUs (per-core), memory, storage devices (per disk, then filesystem), network interfaces (per NIC, per direction), hardware interrupt controllers, and bus bandwidth (PCIe, memory bus). In Kubernetes, extend this to: node-level resources, container CPU/memory limits vs requests, and pod-level scheduling saturation (Pending pods, eviction events).
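The enumeration step can be scripted rather than done from memory. A minimal Linux-only sketch, reading the inventory straight from sysfs and procfs (device names and counts will differ per host):

```shell
# Enumerate the resource inventory before any U/S/E checks (Linux paths).
nproc                                             # CPU count; drill per-core later
ls /sys/block                                     # block devices; drill per-disk
ls /sys/class/net                                 # NICs; drill per-NIC, per-direction
awk '/^MemTotal/ {print $2, "kB"}' /proc/meminfo  # physical memory
```

Writing this list down first is the point: every item it prints gets the same three questions.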
## Visual
```
Resource Inventory
For EACH resource:

  CPU ──────────┬──── Utilization? (% busy)
  Memory ───────┼──── Saturation?  (queue depth / wait time)
  Disk(s) ──────┼──── Errors?      (discrete failures)
  Network NIC ──┤
  PCIe Bus ─────┘

Signal matrix:
┌──────────────┬──────────┬────────────┬────────────────────┐
│ Resource     │ Util %   │ Saturation │ Errors             │
├──────────────┼──────────┼────────────┼────────────────────┤
│ CPU          │ 92% ⚠    │ load > nCPU│ mce: 0             │
│ Memory       │ 78%      │ 0 swap     │ 0                  │
│ sda (disk)   │ 15%      │ await 180ms│ I/O errors: 0      │
│ eth0         │ 3%       │ 0 drops    │ rx_errors: 0       │
└──────────────┴──────────┴────────────┴────────────────────┘

Reading: CPU util high AND load > nCPU = CPU-bound process.
         Disk await 180ms = I/O saturation (investigate next)
```
## Linux Tool Reference
Building a USE table for a Linux host requires knowing which command surfaces each signal. This reference maps resource to tool to metric name:
| Resource | Utilization | Saturation | Errors |
|---|---|---|---|
| CPU | `mpstat -P ALL 1` → `%idle` inverse | `vmstat 1` → `r` (run queue) | `dmesg \| grep mce` → machine-check exceptions |
| Memory | `free -m` → used/total | `vmstat 1` → `si`, `so` (swap in/out) | `dmesg \| grep -iE "oom\|edac"` |
| Disk (per device) | `iostat -xz 1` → `%util` | `iostat -xz 1` → `await`, `aqu-sz` | `smartctl -a /dev/sdX` → reallocated sectors |
| Network NIC | `sar -n DEV 1` → `rxkB/s`, `txkB/s` vs link speed | `ip -s link` → TX/RX dropped | `ethtool -S ethX \| grep err` |
| PCIe / bus | `perf stat -e cache-misses` → memory bus pressure | implicit from memory latency | `lspci -vv` → error counters |
In Kubernetes environments, add these to the checklist:
| Resource | Utilization | Saturation | Errors |
|---|---|---|---|
| Node CPU | `kubectl top node` | `kubectl describe node` → Conditions field | `kubectl get events --field-selector reason=OOMKilling` |
| Node Memory | `kubectl top node` | `kubectl describe node` → MemoryPressure condition | `kubectl get events --field-selector reason=Evicted` |
| Pod CPU | `kubectl top pod -A` | throttling counters in the container's cgroup `cpu.stat` | n/a (CPU limits throttle rather than error) |
| Pod Memory | `kubectl top pod -A` | cgroup memory limit vs RSS | `kubectl describe pod` → `OOMKilled` in `lastState` |
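The Pod CPU saturation signal lives in the container cgroup's `cpu.stat` counters rather than in `kubectl` output. A hedged sketch of turning those counters into a throttle percentage (the cgroup path varies by runtime and is an assumption here):

```shell
# Sketch: % of CFS periods in which a cgroup was throttled (cgroup v2 cpu.stat).
throttle_pct() {
  awk '/^nr_periods/   {p = $2}
       /^nr_throttled/ {t = $2}
       END { if (p > 0) printf "%.1f\n", 100 * t / p; else print "0.0" }'
}

# Example against a sample payload; on a node you would instead run, e.g.:
#   throttle_pct < /sys/fs/cgroup/kubepods.slice/<pod>/cpu.stat   (path assumed)
printf 'usage_usec 500000\nnr_periods 1000\nnr_throttled 250\nthrottled_usec 90000\n' | throttle_pct
# → 25.0
```

A sustained nonzero percentage means the container is CPU-saturated at its limit even when the node itself has idle cores.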
In Prometheus, the USE signals map to these metric families:
- Utilization: node_cpu_seconds_total, node_memory_MemAvailable_bytes, node_disk_io_time_seconds_total
- Saturation: node_load1 and node_load15, node_vmstat_pswpin / node_vmstat_pswpout (swap activity), node_disk_io_time_weighted_seconds_total
- Errors: node_network_receive_errs_total, node_network_transmit_errs_total, plus SMART or IPMI exporter metrics for disk and hardware errors
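These metric families combine into PromQL expressions such as the following (a sketch assuming default node_exporter label sets; scrape configs can add or relabel labels):

```promql
# Utilization: fraction of CPU time not idle, per instance, over 5m
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

# Saturation: 1-min load relative to core count (>1 means work is queuing)
node_load1 / count without (cpu, mode) (node_cpu_seconds_total{mode="idle"})

# Saturation: average disk I/O queue depth
rate(node_disk_io_time_weighted_seconds_total[5m])

# Errors: NIC receive errors per second
rate(node_network_receive_errs_total[5m])
```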
## When to Reach for This
- A node, VM, or container is slow and you don't know why — start here before anything else
- Early in an incident to rapidly rule out resource exhaustion as the cause
- Performance regression: something was fast last week, slow this week — walk the resource list
- Capacity planning review: you want to know which resources are approaching saturation before users feel it
- After a hardware change, kernel upgrade, or cloud instance type migration
- When an application team reports "the server is slow" and you need to triage infrastructure vs application
## When NOT to Use This
- When the problem is clearly in application logic (wrong query plan, infinite retry loop, bad algorithm) — USE tells you resources are fine, not why the application misbehaves
- Debugging network connectivity failures (packet loss between two hosts, firewall drops) — USE describes a NIC resource, not a network path; use traceroute, tcpdump, conntrack
- When you're investigating a security incident — USE looks for performance resource signals, not anomalous access patterns or unauthorized commands
- As a replacement for tracing or profiling: USE narrows to the resource, but profiling tells you which code path is consuming it
## Applied Examples
### Example 1: Node under mysterious load — Kubernetes worker node
A Kubernetes worker node has NotReady events and pod evictions spiking. The on-call engineer opens a dashboard.
Utilization check: CPU: 18%. Memory: 94%. Network: 6%. Disk: 41%. Memory is in the high range but not obviously catastrophic.
Saturation check: CPU: load average 2.1 on an 8-core node — fine. Memory: vmstat shows si/so (swap in/swap out) both nonzero and climbing. Disk: iostat -x shows await at 320ms on the root device — far above healthy (<20ms for SSD). The swap activity is saturating the disk.
Error check: dmesg | grep -i "oom\|memory" reveals OOM killer events targeting application pods. smartctl shows 0 reallocated sectors.
Conclusion: Memory saturation → kernel using swap → disk I/O saturation → pod response times degrade → kubelet health checks fail → evictions. The fix is increasing node memory or lowering pod memory limits. The CPU number was a red herring entirely.
### Example 2: Network performance degradation — bare metal NIC
A database replication job that normally completes in 4 minutes is now taking 22 minutes. The DBA has already ruled out query-level causes.
Utilization: sar -n DEV 1 shows eth0 throughput at 940 Mbps on a 1 Gbps NIC — 94% utilized. This is high but not itself a smoking gun.
Saturation: ip -s link show eth0 reveals TX errors: 0, dropped: 14882, overruns: 0. The TX ring buffer is dropping frames because the NIC cannot drain fast enough — saturation confirmed.
Errors: ethtool -S eth0 | grep -i err shows tx_timeout events climbing. These are distinct from drops: timeouts mean the driver waited too long for a descriptor to free.
Conclusion: NIC is saturated at 94% utilization with a TX ring buffer overflow. Resolution: tune ethtool -G eth0 tx 4096 to increase ring buffer size, then revisit whether the replication job needs bandwidth throttling or a dedicated 10GbE interface.
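The saturation check above rests on the drop counter climbing, not just being nonzero. A small sketch of confirming that by sampling the counter twice (the counter values here are illustrative; on a host they come from the TX "dropped" column of `ip -s link show eth0`):

```shell
# Sketch: confirm TX saturation from the delta of a cumulative drop counter.
tx_drop_delta() {          # args: drops at t0, drops at t1
  awk -v a="$1" -v b="$2" 'BEGIN { print b - a }'
}
tx_drop_delta 14000 14882  # → 882 drops during the sample interval
```

A large positive delta during the slow replication window ties the drops to the workload rather than to some past event.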
## Adapting USE to Cloud Environments
On AWS, GCP, or Azure, the physical resource layer is abstracted, but the USE Method still applies — the resources are just virtualized:
| Resource | Cloud equivalent | Where to observe |
|---|---|---|
| CPU utilization | vCPU % (instance metrics) | CloudWatch `CPUUtilization`; GCP Cloud Monitoring `compute.googleapis.com/instance/cpu/utilization` |
| CPU saturation | CPU steal time (virtualization overhead) | `sar -u ALL` → `%steal` column; high steal = hypervisor contention |
| Memory utilization | Instance RAM % | node_exporter → Prometheus, or the cloud agent's memory metrics |
| Memory saturation | Swap, OOM events | CloudWatch `SwapUsage` (via the CloudWatch agent), instance OOM kill events |
| Network utilization | Instance network bandwidth vs. limit | CloudWatch `NetworkIn`/`NetworkOut` vs. the instance type's limit |
| Network saturation | Packet drops, throttling | ENI/ENA-level counters (e.g., allowance-exceeded metrics) |
| Disk utilization | EBS / persistent disk IOPS % | CloudWatch `VolumeReadOps` + `VolumeWriteOps` vs. provisioned IOPS |
| Disk saturation | I/O queue building | CloudWatch `VolumeQueueLength` sustained > 1 |
| Disk errors | Volume health anomalies | EBS volume status checks; `VolumeIdleTime` / `VolumeReadBytes` anomalies |
Instance type limits matter: cloud instances have network bandwidth and IOPS limits that are per instance type, not just per NIC. A c5.large, for example, is specced at up to 10 Gbps network and up to 4,750 Mbps EBS bandwidth; both are burst figures, with lower sustained baselines. Exceeding these limits causes throttling (saturation) that does not appear as dropped packets — it appears as latency. The USE table for cloud instances must account for these limits, which means knowing the instance type's spec sheet.
Burstable instances (T-series on AWS): T-class instances have CPU credits. When credits are exhausted, the instance is throttled to its baseline CPU (which may be as low as 5-20% of one vCPU). A t3.micro at 100% CPU utilization burns credits rapidly; once exhausted, it will appear as a suddenly low-throughput host. The USE signal: CPU utilization drops suddenly (because the instance can no longer burst), while workload demand has not changed. Monitor CPUCreditBalance as part of the USE matrix for burstable instances.
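The credit arithmetic can be made explicit. A hedged sketch, assuming one credit equals one vCPU-minute at 100% utilization; the earn rate comes from the instance type's spec sheet, and the 0.1 credits/minute used below is purely illustrative:

```shell
# Sketch: minutes until a burstable instance exhausts its CPU credits.
minutes_to_exhaustion() {  # args: CPUCreditBalance, vCPU utilization (0-1), earn rate/min
  awk -v bal="$1" -v use="$2" -v earn="$3" 'BEGIN {
    burn = use - earn                 # net credits consumed per minute
    if (burn <= 0) { print "never"; exit }
    printf "%.0f\n", bal / burn
  }'
}
minutes_to_exhaustion 144 1.0 0.1   # → 160
```

When the result is small, either cap the workload or move off the burstable class before the throttle hits.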
## The Junior vs Senior Gap
| Junior | Senior |
|---|---|
| Runs `top`, sees CPU at 30%, concludes "CPU is fine, must be the application" | Runs through all resources before drawing any conclusion |
| Checks the first metric that looks interesting and digs deep immediately | Checks all resources shallowly first to build a complete picture |
| Conflates utilization and saturation — a "not busy" disk can still be the bottleneck | Knows a disk at 30% utilization with 400ms `await` is a saturated disk |
| Looks at aggregate system metrics; misses that one specific device (e.g., `sdb` vs `sda`) is the culprit | Enumerates per-device, per-NIC, per-core metrics |
| Stops when one suspicious metric is found | Uses errors as a third confirming signal before committing to a hypothesis |
| Reports "the server seems slow" | Reports "`eth0` is at 94% utilization with TX ring buffer saturation — this is the bottleneck" |
## USE Method in Practice: A Worked Checklist
For an unknown performance problem on a Linux system, the following sequence builds the complete USE table efficiently without redundant tool invocations:
```shell
# 1. CPU — utilization and saturation together
mpstat -P ALL 1 3        # per-core %idle (invert for utilization)
vmstat 1 5               # r column = run-queue depth (saturation)
cat /proc/loadavg        # load average vs nCPU (saturation summary)

# 2. Memory — utilization and saturation together
free -m                  # used/total + swap usage
vmstat 1 5               # si/so columns = swap activity (saturation)
cat /proc/meminfo        # MemAvailable, Dirty, Writeback detail

# 3. Disk — per device, all three signals
iostat -xz 1 5           # %util, await, aqu-sz per device
dmesg | tail -50 | grep -iE "error|failed|io err"   # errors

# 4. Network — per NIC, all three signals
sar -n DEV 1 5           # rxkB/s, txkB/s vs link speed
ip -s link               # TX/RX dropped frames (saturation signal)
ethtool -S eth0 2>/dev/null | grep -i err           # NIC-level errors

# 5. Hardware errors (cross-cutting)
dmesg | grep -iE "mce|edac|hardware error|corrected"
```
This sequence runs in under 2 minutes and produces a complete USE table. From the table, the resource with the highest saturation or any nonzero errors is the first investigation target.
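On hosts that lack the sysstat tools, a naive script of the checklist aborts midway. A small guard keeps the sweep going, logging what was skipped (a sketch; extend the `run` list with the rest of the checklist as needed):

```shell
# Sketch: wrap each checklist command so missing tools don't abort the sweep.
run() {
  if command -v "$1" >/dev/null 2>&1; then
    "$@" || echo "warn: $1 exited nonzero"
  else
    echo "skip: $1 not installed"
  fi
}
run mpstat -P ALL 1 1
run iostat -xz 1 1
run sar -n DEV 1 1
```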
## Common Pitfalls and Anti-Patterns
Skipping the resource inventory step. The most common failure mode is jumping straight to checking "the obvious suspects" (CPU, then disk) without first listing every resource. A NIC error counter or a hardware interrupt rate that's never checked is where the real bottleneck hides. Write the inventory list on paper or in a scratchpad before starting.
Treating utilization as the only signal. A host at 20% CPU utilization with a load average greater than its core count still has work piling up: processes are waiting to run despite low utilization. Note that Linux load average also counts tasks blocked in uninterruptible I/O, so confirm CPU saturation with the vmstat r column (the run queue), which tells you what utilization doesn't.
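The load-vs-cores check is mechanical enough to script. A minimal sketch using Linux-specific paths:

```shell
# Sketch: flag CPU saturation when 1-min load exceeds the core count.
cpu_saturated() {   # args: 1-min load average, core count
  awk -v l="$1" -v c="$2" 'BEGIN { if (l + 0 > c + 0) print "saturated"; else print "ok" }'
}
cpu_saturated "$(cut -d" " -f1 /proc/loadavg)" "$(nproc)"
```

A "saturated" verdict here still needs the run-queue cross-check described above before you blame the CPU.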
Ignoring per-device granularity. iostat for "disk" shows aggregate stats. /dev/sda and /dev/sdb may have radically different saturation levels. A RAID array may show low utilization at the RAID level while one member disk is saturated. Always drill per-device, per-NIC, per-core.
Confusing saturation metrics. await in iostat is total I/O latency including queue wait time. svctm (service time; deprecated and dropped from newer sysstat releases) is only the device service time. High await with low svctm = queueing saturation. High values for both = the device itself is slow. The distinction matters for whether you need fewer I/O operations or a faster device.
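The decomposition is just a subtraction. A sketch with illustrative numbers (treat svctm as approximate where it is still reported):

```shell
# Sketch: split await into time spent queued vs time being serviced.
queue_wait_ms() {   # args: await_ms svctm_ms
  awk -v a="$1" -v s="$2" 'BEGIN { printf "%.1f\n", a - s }'
}
queue_wait_ms 180 4   # → 176.0 (almost all latency is queueing, not the device)
```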
Stopping at "nothing looks bad." If all resources show normal utilization, saturation, and errors, the problem is not resource-based — it is in the application layer, an external dependency, or a correctness bug. USE has given you a negative result, which is valuable: don't waste more time on infrastructure. Pivot to RED, distributed tracing, or application profiling.
## USE at Different System Layers
The USE Method can be applied at multiple levels of abstraction within the same system. Applying it at each layer gives you a complete picture of where constraints exist:
Hardware layer: Physical CPUs (cores, sockets, NUMA topology), physical memory (DIMMs, channels), physical NICs (ports, speeds), physical disks (platters, SSDs, NVMe). These are the ultimate limits — everything above them is constrained by what hardware can provide.
Kernel layer: CPU scheduler (per-cgroup, per-process), memory management (virtual memory, page cache, huge pages, cgroup memory limits), block I/O scheduler (per-queue, merged vs. direct), network stack (socket buffers, conntrack table, iptables chain processing). Kernel-layer saturation often manifests as high system CPU time rather than user CPU time.
Container/cgroup layer: In Kubernetes, each pod has CPU requests/limits and memory requests/limits. CPU limits are enforced via CFS bandwidth throttling — a container "at limit" will have periods where its processes are forcibly paused. Memory limits are enforced via OOM kill. These are independent saturation points from the underlying node.
Application layer: Connection pools, thread pools, request queues within the application are also resources with utilization, saturation, and error signals — but they require application-level instrumentation to observe. A database connection pool at 100% utilization (all connections in use) is saturated even if the database host has headroom.
The insight is that a bottleneck at any layer propagates upward as latency. A saturated kernel CFS scheduler shows up as application latency. A saturated database connection pool shows up as service p99 latency. USE applied only at the host level misses bottlenecks in the layers above the kernel.
## Connections
- Complements: RED Method (USE for infrastructure resources, RED for service/request metrics — run both simultaneously during an incident for full coverage)
- Complements: Differential Diagnosis (USE generates the candidate list of hypotheses; Differential Diagnosis is the framework for eliminating them systematically)
- Tensions: Five Whys (USE stops at identifying which resource; Five Whys pushes further to why that resource is exhausted — use USE first, then Five Whys to trace root cause)
- Topic Packs: observability, prometheus, linux-performance
- Case Studies: node-pressure-evictions (USE reveals memory saturation as the eviction driver), oom-killer-events (USE matrix shows memory errors and saturation before OOM kill)