Mental Model: Queueing Theory

Category: System Behavior
Origin: Agner Krarup Erlang, 1909 (telephone network traffic analysis); formalized by Leonard Kleinrock in the 1960s
One-liner: As utilization approaches 100%, queue length and response time grow non-linearly toward infinity — systems become dangerously unstable well before they are "full."

The Model

Queueing Theory studies systems where requests arrive, wait if a server is busy, get served, and depart. The fundamental insight is that queue length and response time are non-linear functions of utilization — not linear. A server running at 50% utilization behaves very differently from one at 90%, even though it has twice the spare capacity. This non-linearity is the model's key contribution to operational intuition.

For the simplest model (M/M/1 queue: Poisson arrivals, exponential service times, one server), the mean response time is R = S / (1 - ρ), where S is the mean service time and ρ (rho) is utilization (arrival rate λ divided by service rate μ). At ρ = 0.5, R = 2S — twice the service time. At ρ = 0.8, R = 5S. At ρ = 0.9, R = 10S. At ρ = 0.95, R = 20S. The curve is a hyperbola: gradual through 70%, then steep, then nearly vertical as ρ → 1.
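The numbers above can be reproduced directly from the formula; a minimal sketch (the function name is illustrative):

```python
# Mean response time for an M/M/1 queue: R = S / (1 - rho),
# where S is mean service time and rho is utilization (lambda / mu).

def mm1_response_time(service_time: float, utilization: float) -> float:
    """Mean response time (wait + service) for an M/M/1 queue."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1); the queue is unstable at rho >= 1")
    return service_time / (1 - utilization)

S = 1.0  # express R in multiples of the service time
for rho in (0.5, 0.8, 0.9, 0.95):
    print(f"rho = {rho:.2f} -> R = {mm1_response_time(S, rho):.1f} x S")
```

Running it prints 2.0, 5.0, 10.0, and 20.0 multiples of S — the hyperbola described above.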

The "knee of the curve" — where response time begins its steep climb — is typically around 70-80% utilization. This is the basis for the common SRE rule that you should alert at 70% CPU or memory and plan capacity at 80%. Below the knee, adding more load has modest impact; above it, small additions in load produce large increases in response time. This is why systems that "seem fine" at 85% utilization can collapse to unusability when traffic increases by just 20%.

Variance matters. The M/M/1 formula assumes exponential service times (high variance). Real systems with more predictable service times (M/D/1 — deterministic service) have a less steep curve and can operate at higher utilization safely. Systems with high service time variance (long tail queries, GC pauses, network retries) hit the knee earlier than the formula predicts. When your workload has high variance, treat the 70% threshold as 60%.

Boundary conditions: the M/M/1 model assumes a single server, infinite queue, and a stable arrival process. Real systems have multiple servers (M/M/c), finite queues (requests are dropped at capacity), and non-Poisson arrivals (bursty, correlated). These extensions change the exact numbers but not the core insight: utilization above ~80% produces strongly nonlinear queue growth. The operational lessons transfer even when the exact formula does not.
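One way to see that the knee is not an artifact of the closed-form algebra is to simulate the queue directly and compare against S/(1 - ρ). This is a sketch, not part of the original model; the simulator, its parameters, and the sample size are illustrative, and results drift run to run unless the seed is fixed:

```python
# Discrete-event simulation of an M/M/1 queue: Poisson arrivals,
# exponential service, one server, FIFO. Compares the measured mean
# response time with the closed-form S / (1 - rho).
import random

def simulate_mm1(rho: float, service_time: float = 1.0,
                 n: int = 200_000, seed: int = 1) -> float:
    rng = random.Random(seed)
    arrival_rate = rho / service_time  # lambda = rho * mu
    t = 0.0          # arrival clock
    free_at = 0.0    # time the server next becomes free
    total_response = 0.0
    for _ in range(n):
        t += rng.expovariate(arrival_rate)                    # Poisson arrivals
        start = max(t, free_at)                               # wait if server is busy
        free_at = start + rng.expovariate(1 / service_time)   # exponential service
        total_response += free_at - t                         # wait + service
    return total_response / n

for rho in (0.5, 0.8, 0.9):
    print(f"rho = {rho}: simulated R = {simulate_mm1(rho):.2f}, "
          f"formula = {1 / (1 - rho):.2f}")
```

The simulated means track the formula closely at moderate ρ and climb steeply as ρ approaches 1 — the same knee, obtained without the algebra.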

Visual

Mean Response Time  R = S / (1 - ρ)

  ρ (utilization) → Response Time (multiples of service time S)
  ──────────────────────────────────────────────────────────────
  ρ = 0.10  →  1.11× S     (10% util, nearly zero wait)
  ρ = 0.50  →  2.00× S     (50% util, 1 service time of wait)
  ρ = 0.70  →  3.33× S     (70% util — knee begins here)
  ρ = 0.80  →  5.00× S     (80% util — 4× wait vs. 0 load)
  ρ = 0.90  → 10.00× S     (90% util — 3× the response time at 70%)
  ρ = 0.95  → 20.00× S
  ρ = 0.99  → 100.00× S    (nearly saturated)
  ρ = 1.00  →      ∞       (queue grows without bound)

Response Time (× service time)
  20 │                                               *
  18 │
  15 │                                          *
  12 │
  10 │                                     *
   8 │
   6 │                               *
   4 │                         *
   2 │                  *
   1 │        *
     └──────────────────────────────────────────── Utilization ρ
       10%  30%  50%  70%  80%  90%  95%  99%

           ↑ knee (~70-80%) — alert threshold here

Queue length  L_q = ρ² / (1 - ρ)

  ρ = 0.50: L_q = 0.25 / 0.50 = 0.5 items waiting
  ρ = 0.80: L_q = 0.64 / 0.20 = 3.2 items waiting
  ρ = 0.90: L_q = 0.81 / 0.10 = 8.1 items waiting
  ρ = 0.95: L_q = 0.9025 / 0.05 ≈ 18 items waiting
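The L_q values above follow from the formula, and Little's Law (L_q = λ · W_q) gives an independent cross-check: the waiting time is W_q = R − S = S·ρ/(1 − ρ), and multiplying by the arrival rate λ = ρ/S recovers ρ²/(1 − ρ). A sketch, assuming a hypothetical 10 ms service time:

```python
# Mean queue length for an M/M/1 queue: L_q = rho^2 / (1 - rho),
# cross-checked against Little's Law (L_q = lambda * W_q).

def mm1_queue_length(rho: float) -> float:
    return rho**2 / (1 - rho)

S = 0.010  # assumed 10 ms mean service time (illustrative)
for rho in (0.5, 0.8, 0.9, 0.95):
    lam = rho / S                  # arrival rate implied by this utilization
    w_q = S * rho / (1 - rho)      # mean time spent waiting (R - S)
    assert abs(lam * w_q - mm1_queue_length(rho)) < 1e-9  # Little's Law holds
    print(f"rho = {rho:.2f}: L_q = {mm1_queue_length(rho):.2f} items waiting")
```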

When to Reach for This

  • When setting CPU, memory, or request-rate alert thresholds — the 70% rule is not arbitrary conservatism, it is the knee of the response time curve
  • When explaining to product management why running at 90% CPU is dangerous even when "nothing is broken yet"
  • When designing autoscaling targets: scale-out should trigger at 60-70% utilization, not 90%, because scale-out takes time and above the knee the queue builds rapidly during that window
  • When diagnosing "mysterious" latency spikes: if utilization was above 80% when the spike began, saturation is the leading hypothesis and should be ruled out first
  • When sizing worker fleets, database connection pools, or Kafka consumer groups — every processing resource is a queueing system

When NOT to Use This

  • For bursty, short-duration traffic where the burst ends before the queue builds significantly — the M/M/1 formula assumes steady state; transient bursts may not fully saturate even at high instantaneous utilization
  • As the only framework when service time variance is the primary problem: a GC pause that occasionally extends service time from 10ms to 2s creates queue buildup not from high average utilization but from extreme variance — the fix is GC tuning, not capacity
  • For systems with effective load shedding: a system that actively drops requests when the queue exceeds a threshold has a bounded queue; the formula's "queue grows to infinity" behavior does not apply if you are deliberately refusing requests at the door

Applied Examples

Example 1: Kubernetes Node Autoscaling Target

A Kubernetes cluster runs a CPU-bound Python API. Each request takes approximately 50ms of CPU time. The service rate per pod is 20 RPS (1000ms / 50ms). With 10 pods, total service rate is 200 RPS.

Current traffic is 160 RPS. ρ = 160/200 = 0.80. Mean response time: S/(1-ρ) = 50ms / 0.20 = 250ms. Latency SLO is 500ms.

At 200 RPS (ρ = 1.0) the system collapses. But the SLO breaks before full saturation: at what ρ does R exceed 500ms? 500 = 50/(1-ρ) → 1-ρ = 0.1 → ρ = 0.90. The SLO breaks at 90% utilization (180 RPS), not at 100%.

The HPA target should be set to 70% utilization (140 RPS per 200 capacity), leaving a buffer. At 70%, R = 50/0.30 = 167ms — well within SLO and with room for traffic spikes to absorb before autoscaling completes (which takes 1-3 minutes for new pods to become ready). Setting HPA target at 80% is too tight: a 30-second traffic spike could breach the SLO before new pods are ready.
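The arithmetic in this example can be written out as a small helper that solves SLO = S/(1 − ρ) for ρ; the function name is illustrative:

```python
# Example 1's numbers: find the utilization at which mean response time
# breaches the SLO, then compare candidate HPA targets against it.

def slo_breaking_utilization(service_time_ms: float, slo_ms: float) -> float:
    """Solve slo = S / (1 - rho) for rho."""
    return 1 - service_time_ms / slo_ms

S_ms, slo_ms, capacity_rps = 50, 500, 200   # values from the example above
rho_break = slo_breaking_utilization(S_ms, slo_ms)
print(f"SLO breaks at rho = {rho_break:.2f} ({rho_break * capacity_rps:.0f} RPS)")

for target in (0.70, 0.80):
    r = S_ms / (1 - target)                  # mean response time at this target
    headroom = (rho_break - target) * capacity_rps
    print(f"HPA target {target:.0%}: R = {r:.0f} ms, headroom = {headroom:.0f} RPS")
```

This makes the buffer explicit: a 70% target leaves 40 RPS of headroom before the SLO-breaking point, while an 80% target leaves only 20 RPS.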

Example 2: Database Connection Pool and Saturation

A PostgreSQL database processes queries at an average of 10ms each (service rate: 100 queries/second). A connection pool of 80 connections means maximum concurrency of 80. If queries arrive at 70 RPS, ρ = 0.70, and mean query response time from the connection pool perspective is 10/(1-0.7) = 33ms total (including wait in pool). P99 will be substantially higher.

Traffic grows to 85 RPS. ρ = 0.85. Mean response time: 10/0.15 = 67ms — doubled from 33ms with a modest utilization increase. The application's database calls now take twice as long. This slows request handling, which — via Little's Law — increases concurrency (threads waiting on slow DB calls), which exhausts the application thread pool. The incident looks like an application crash, but the root cause is database utilization crossing 80%.

The diagnostic signature: p99 database latency spikes simultaneously with application thread pool exhaustion, and traffic increased only modestly just before the incident. The fix is either to add read replicas (increase μ), optimize slow queries (reduce S), or add a read-through cache (reduce λ to the database).
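The thread-pool cascade follows directly from Little's Law (L = λW): as database response time doubles, so does the average number of in-flight calls holding application threads. A sketch using the example's numbers (the helper name is an assumption):

```python
# How many application threads sit blocked on the database, on average,
# via Little's Law: L = lambda * W. DB modeled as M/M/1 at 100 queries/s.

def in_flight(arrival_rps: float, service_s: float,
              capacity_rps: float = 100) -> float:
    rho = arrival_rps / capacity_rps
    response_s = service_s / (1 - rho)   # M/M/1 mean response time
    return arrival_rps * response_s      # Little's Law: concurrency = rate * time

for rps in (70, 85):
    print(f"{rps} RPS -> {in_flight(rps, 0.010):.1f} threads blocked on the DB on average")
```

At 70 RPS about 2.3 threads are tied up on average; at 85 RPS it is about 5.7 — and since these are averages over a high-variance distribution, transient bursts pin far more, which is what exhausts the pool.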

The Junior vs Senior Gap

Junior: Sets autoscaling target at 80-90% CPU "to be efficient"
Senior: Sets autoscaling target at 60-70% CPU, understanding that scaling takes time and the queue grows superlinearly above the knee

Junior: Treats high utilization as "healthy resource usage"
Senior: Treats utilization above 80% as an early warning indicator, regardless of whether latency has degraded yet

Junior: Adds capacity reactively after SLO breaches
Senior: Identifies the ρ at which the SLO would be breached and builds alerts to trigger before that point

Junior: Confused by "we only increased traffic 10% and latency doubled"
Senior: Immediately checks utilization — a 10% traffic increase takes ρ from 0.80 to 0.88, moving mean response time from 5× to 8.3× the service time, roughly a 1.7× jump, per the formula

Connections

  • Complements: Little's Law — Little's Law (L = λW) describes the number of items in the system; Queueing Theory explains why W grows non-linearly as ρ increases; the two are used together to size pools and predict saturation
  • Complements: Amdahl's Law — if a queue is building at a specific resource (database, lock, single-threaded component), Amdahl's Law explains why adding replicas upstream of that resource does not fix the queue
  • Tensions: Graceful Degradation — graceful degradation (shedding load) is what a well-designed system does when ρ approaches 1; it converts "queue → infinity" behavior into "controlled rejection," trading some availability for system stability
  • Topic Packs: load-testing, kubernetes
  • Case Studies: node-pressure-evictions (node memory utilization past the knee triggers evictions, which cause cascading pod restarts — a queueing saturation failure at the node resource level)