Load Testing: Finding the Breaking Point
- lesson
- load-testing
- performance
- capacity-planning
- kubernetes
- observability
- linux-internals

Topics: load testing, performance, capacity planning, Kubernetes, observability, Linux internals
Level: L1-L2 (Foundations to Operations)
Time: 75-90 minutes
Prerequisites: None (everything is explained from scratch)
The Mission¶
It's October. Black Friday is five weeks away. Your VP of Engineering just forwarded an email from the business side: "Marketing is projecting 10x normal traffic on Black Friday. Can our systems handle it?" The VP's reply to you is one line: "Can they?"
You don't know. Nobody knows. The system has never seen 10x traffic. Staging is half the size
of production. The last "load test" was someone running curl in a for loop eight months
ago.
Your job over the next five weeks: find out if the system survives 10x traffic, and if it doesn't, figure out exactly where it breaks so the team can fix it before the turkey lands.
By the end of this lesson you'll understand:
- The difference between load, stress, soak, and spike testing (and when each one matters)
- How to write a realistic k6 load test (not a hello-world)
- Little's Law -- the one equation that ties concurrency, throughput, and latency together
- Why most benchmarks lie (the coordinated omission problem)
- How to read load test results and know what the numbers actually mean
- Where systems actually break: CPU, memory, connection pools, database locks, network
- How to run load tests in CI/CD and against production safely
Part 1: What Kind of Test Do You Need?¶
Not all load tests answer the same question. Picking the wrong type is like checking your oil when the tire is flat.
| Test Type | Traffic Shape | Duration | The Question It Answers |
|---|---|---|---|
| Load | Constant or ramp to target | 10-30 min | "Does it work at expected traffic?" |
| Stress | Ramp past expected limits | 30-60 min | "Where does it break?" |
| Soak | Sustained expected load | 4-24 hours | "Does it leak memory or connections?" |
| Spike | Instant jump to 10x | 5-10 min | "Does autoscaling actually work?" |
| Breakpoint | Slow ramp until failure | 60-120 min | "What's the exact capacity ceiling?" |
For Black Friday, you need all of them -- in order:
1. Load test first: does the system behave at current traffic? (Baseline.)
2. Stress test: ramp to 10x. Where does it crack?
3. Fix whatever broke. Repeat.
4. Spike test: simulate the midnight rush. Does autoscaling kick in fast enough?
5. Soak test: run at 3x for 8 hours. Any memory leaks hiding?
Gotcha: A 10-minute stress test won't catch a connection pool leak that only manifests after 2 hours of sustained load. The soak test exists specifically for this -- it's the test that finds the slow bleed.
Part 2: Little's Law -- The One Equation You Need¶
Before you touch a load testing tool, you need one mental model. Everything else builds on it.
Mental Model: Little's Law states that in a stable system:
L = lambda x W
Or in human terms: Concurrency = Throughput x Latency
- L (concurrency): number of requests in flight at any moment
- lambda (throughput): requests completed per second
- W (latency): average time to complete one request
This is not an approximation. It's a mathematical law proven by John Little in 1961. It holds for any stable system -- a web server, a grocery store checkout, a highway.
Worked Example 1: Black Friday Capacity¶
Your system currently handles 200 requests per second with 100ms average latency. By Little's Law, that's 200 x 0.1s = 20 requests in flight at any time. Your connection pool is set to 50. Plenty of headroom.
Now Black Friday hits. Traffic goes to 2,000 RPS. Even if latency stays at 100ms (optimistic), concurrency becomes 2,000 x 0.1 = 200 requests in flight.
Your connection pool of 50 just became the bottleneck. Requests queue. Latency climbs. And when latency climbs, things get worse.
Worked Example 2: The Death Spiral¶
At 2,000 RPS, the connection pool saturates and latency climbs to 500ms. Little's Law: 2,000 x 0.5 = 1,000.
One thousand requests in flight. Your application threads are exhausted. The request queue overflows. Timeouts cascade. Latency jumps to 5 seconds -- and now 2,000 x 5 = 10,000 requests are in flight.
This is how systems die. Little's Law predicts the death spiral: higher latency means more concurrency, which means more queueing, which means higher latency. Once you're past the tipping point, the system cannot recover without shedding load.
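The arithmetic of the spiral can be sketched in a few lines of plain JavaScript (runnable with Node; the helper name `littlesLaw` is ours, not a library function):

```javascript
// Little's Law: concurrency (L) = throughput (lambda) x latency (W)
const littlesLaw = (rps, latencySeconds) => rps * latencySeconds;

// Numbers from the worked examples above:
console.log(littlesLaw(200, 0.1));  // baseline: 20 requests in flight
console.log(littlesLaw(2000, 0.1)); // 10x traffic, same latency: 200 in flight
console.log(littlesLaw(2000, 0.5)); // pool saturates, 500ms latency: 1000 in flight
console.log(littlesLaw(2000, 5));   // timeouts cascade at 5s: 10000 in flight
```

Each step up in latency multiplies the concurrency the system must hold, which is exactly the feedback loop described above.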
Trivia: John D.C. Little proved his law in 1961 at MIT. The elegant thing about it: he proved it requires no assumptions about the arrival distribution, service distribution, or order of service. It works for M/M/1 queues, web servers, drive-throughs, and emergency rooms. It's one of the most general results in queueing theory.
Interview Bridge: "Explain how a system can be fine at 1,000 RPS but completely collapse at 1,200 RPS." Little's Law + connection pool exhaustion is the textbook answer. The transition from "working" to "dead" is non-linear because of the feedback loop between latency and concurrency.
Flashcard Check #1¶
Q1: Your service handles 500 RPS at 200ms latency. How many concurrent requests are in flight?
500 x 0.2 = 100 concurrent requests.
Q2: That same service has a max thread pool of 80. What happens when traffic increases?
At 400 RPS (80 / 0.2s), the thread pool saturates. Additional requests queue, latency increases, and the death spiral begins unless you shed load (rate limiting, circuit breaker).
Q3: Little's Law says Concurrency = Throughput x Latency. If you double throughput and latency stays flat, what happens to concurrency?
It doubles. You need twice the connection pool, twice the threads, twice the database connections. This is why "just add more traffic" breaks things that seemed fine.
Part 3: Your First Real Load Test (k6 Walkthrough)¶
Enough theory. Let's write a test that actually models your e-commerce system under Black Friday conditions.
Name Origin: k6 was created by Load Impact, a Swedish company founded in 2010. They open-sourced k6 in 2017, deliberately choosing JavaScript (ES6) scripts over JMeter's XML-heavy GUI approach. Grafana Labs acquired them in 2021, recognizing that load testing and observability are two halves of the same problem. As for the name itself, the origin is murky -- the team has said "k6" simply sounded good.
The Script¶
This isn't a hello-world. This models a real user flow: browse products, add to cart, checkout. Different actions happen at different rates, just like real traffic.
import http from 'k6/http';
import { check, sleep, group } from 'k6';
import { SharedArray } from 'k6/data';
import { Counter, Rate, Trend } from 'k6/metrics';
// --- Custom metrics (beyond k6 defaults) ---
const checkoutErrors = new Counter('checkout_errors');
const checkoutSuccess = new Rate('checkout_success_rate');
const checkoutDuration = new Trend('checkout_duration');
// --- Test data: loaded once, shared across all VUs ---
const products = new SharedArray('products', function () {
return JSON.parse(open('./data/products.json'));
// e.g., [{"id": "prod-001", "name": "Widget"}, ...]
});
const users = new SharedArray('users', function () {
return JSON.parse(open('./data/users.json'));
// e.g., [{"id": "user-042", "token": "eyJhb..."}, ...]
});
// --- Configuration ---
const BASE_URL = __ENV.BASE_URL || 'https://staging.shop.example.com';
export const options = {
scenarios: {
// 90% of users just browse
browse: {
executor: 'ramping-arrival-rate',
startRate: 50,
timeUnit: '1s',
preAllocatedVUs: 100,
maxVUs: 500,
stages: [
{ duration: '2m', target: 50 }, // warm up
{ duration: '5m', target: 500 }, // ramp to 10x
{ duration: '10m', target: 500 }, // hold at 10x
{ duration: '2m', target: 0 }, // cool down
],
exec: 'browseFlow',
},
// 10% of users check out
checkout: {
executor: 'ramping-arrival-rate',
startRate: 5,
timeUnit: '1s',
preAllocatedVUs: 50,
maxVUs: 200,
stages: [
{ duration: '2m', target: 5 },
{ duration: '5m', target: 50 },
{ duration: '10m', target: 50 },
{ duration: '2m', target: 0 },
],
exec: 'checkoutFlow',
},
},
thresholds: {
http_req_duration: ['p(95)<500', 'p(99)<2000'],
http_req_failed: ['rate<0.01'],
checkout_success_rate: ['rate>0.99'],
checkout_duration: ['p(95)<3000'],
},
};
Let's break down the key decisions:
| Decision | Why |
|---|---|
| `ramping-arrival-rate` executor | Models real traffic: users don't slow down when your server does (avoids coordinated omission) |
| Two scenarios (browse + checkout) | Real traffic isn't uniform -- 90% browse, 10% buy. Different endpoints, different load profiles |
| `SharedArray` for test data | Loaded once, shared across VUs. Without it, 500 VUs each load the file = OOM on the test runner |
| `maxVUs: 500` | Hard safety cap. Even if you misconfigure the rate, you can't accidentally DDoS |
| Thresholds on p(95) and p(99) | Averages lie. p95 tells you what 1 in 20 users experiences. p99 tells you what 1 in 100 sees |
Now the user flows:
export function browseFlow() {
const product = products[Math.floor(Math.random() * products.length)];
group('browse', () => {
const listRes = http.get(`${BASE_URL}/api/products?page=1&limit=20`);
check(listRes, {
'product list 200': (r) => r.status === 200,
'has products': (r) => r.json('items').length > 0,
});
const detailRes = http.get(`${BASE_URL}/api/products/${product.id}`);
check(detailRes, { 'product detail 200': (r) => r.status === 200 });
});
sleep(Math.random() * 2 + 1); // 1-3s think time
}
export function checkoutFlow() {
const user = users[__VU % users.length];
const product = products[Math.floor(Math.random() * products.length)];
const params = {
headers: {
'Content-Type': 'application/json',
'Authorization': `Bearer ${user.token}`,
},
timeout: '10s', // fail fast -- don't let VUs hang
};
group('checkout', () => {
const cartRes = http.post(
`${BASE_URL}/api/cart`,
JSON.stringify({ product_id: product.id, quantity: 1 }),
params
);
check(cartRes, { 'add to cart': (r) => r.status === 201 });
if (cartRes.status !== 201) { checkoutErrors.add(1); return; }
const start = Date.now();
const orderRes = http.post(
`${BASE_URL}/api/orders`,
JSON.stringify({ cart_id: cartRes.json('cart_id') }),
params
);
checkoutDuration.add(Date.now() - start);
checkoutSuccess.add(orderRes.status === 201);
check(orderRes, {
'order created': (r) => r.status === 201,
'has order id': (r) => r.json('order_id') !== undefined,
});
if (orderRes.status !== 201) { checkoutErrors.add(1); }
});
sleep(Math.random() * 3 + 2); // 2-5s think time
}
Gotcha: Notice the `timeout: '10s'` in the checkout params. Without it, k6 waits indefinitely for a response. When the server starts queuing under load, your VUs get stuck waiting instead of measuring the failure. Always set timeouts that match your SLOs.
Run it:
# Against staging
k6 run -e BASE_URL=https://staging.shop.example.com load-test.js
# With real-time Grafana output
k6 run --out influxdb=http://localhost:8086/k6 load-test.js
# With JSON output for later analysis
k6 run --out json=results.json load-test.js
Part 4: Why Most Benchmarks Lie (Coordinated Omission)¶
You run your beautiful load test. The results look great: p99 latency is 800ms, error rate is 0.2%. You report to the VP: "We can handle 10x." Black Friday arrives. The system collapses at 4x. What happened?
Coordinated omission happened.
Mental Model: Imagine you're timing how long it takes to get coffee at a cafe. You walk in, order, time the wait, get your coffee: 3 minutes. You do this 100 times. Average: 3 minutes. Sounds accurate.
But here's what you missed: every time the cafe was slow, you waited in line before ordering. You didn't time the line -- only the order-to-delivery. A real customer who walked in during the rush waited 3 minutes in line + 3 minutes for the order = 6 minutes total. Your benchmark said 3 minutes. Reality was 6.
That's coordinated omission.
How It Works in Load Testing¶
With VU-based execution (the default in most tools), each VU does request-sleep-request in a loop. When the server slows to 3 seconds per request, each VU sends one-third as many requests. You think you're testing at 100 RPS. You're actually testing at 33 RPS. The load test automatically backed off at exactly the moment things got interesting.
Your reported latencies look great because you measured fewer requests during the slow period. You measured 33 responses at 3s instead of the 100 that real users would have sent.
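The backoff is easy to model with a back-of-envelope calculation (plain JavaScript; `effectiveRps` is an illustrative helper, not a k6 API):

```javascript
// Closed-loop (VU-based) model: each VU loops request -> think time -> request.
// Effective throughput = VUs / (latency + think time), so load falls as latency rises.
const effectiveRps = (vus, latencySeconds, thinkSeconds = 0) =>
  vus / (latencySeconds + thinkSeconds);

console.log(effectiveRps(100, 1)); // healthy server, 1s per iteration: 100 RPS
console.log(effectiveRps(100, 3)); // server degrades to 3s: ~33 RPS -- the test backs off
```

An open-loop (arrival-rate) generator has no such feedback: it keeps firing at the target rate no matter what the server does.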
Trivia: Gil Tene, CTO of Azul Systems, coined the term "coordinated omission" in a 2013 presentation. He demonstrated that most benchmarking tools systematically underreported latency by 10-100x. The word "coordinated" refers to the fact that the load generator and the system under test are inadvertently cooperating to hide the problem -- when the server slows down, the generator slows down too.
The Fix: Arrival Rate Executors¶
// BAD: VU-based -- backs off when server slows
export const options = {
vus: 100,
duration: '5m',
};
// GOOD: Arrival rate -- maintains target rate regardless
export const options = {
scenarios: {
load: {
executor: 'constant-arrival-rate',
rate: 100, // 100 iterations/second, period
timeUnit: '1s',
duration: '10m',
preAllocatedVUs: 100,
maxVUs: 500, // k6 spins up more VUs as requests back up
},
},
};
With constant-arrival-rate, k6 sends exactly 100 requests per second. If the server slows
down, k6 spawns more VUs to maintain the rate. If it hits maxVUs and still can't keep up,
it reports dropped iterations -- which is exactly what happens to real users (they get
timeouts or errors).
Remember: If you only take one thing from this lesson: always use arrival-rate executors for load tests that model real traffic. VU-based tests are fine for modeling "50 people logged in simultaneously" -- but not for "what happens at 2,000 RPS."
Part 5: Reading the Results -- What the Numbers Actually Mean¶
Your test ran. k6 printed a wall of stats. Here's how to read them.
data_received..................: 847 MB 4.5 MB/s
data_sent......................: 126 MB 668 kB/s
http_req_blocked...............: avg=1.2ms p(95)=3.4ms
http_req_connecting............: avg=0.8ms p(95)=2.1ms
http_req_duration..............: avg=127ms p(50)=89ms p(95)=340ms p(99)=1.2s
http_req_failed................: 0.34% ✓ 1,203 ✗ 351,447
http_req_receiving.............: avg=2.1ms p(95)=8.4ms
http_req_sending...............: avg=0.3ms p(95)=0.8ms
http_req_tls_handshaking.......: avg=5.2ms p(95)=12.3ms
http_req_waiting...............: avg=119ms p(50)=82ms p(95)=325ms p(99)=1.1s
http_reqs......................: 352,650 1,869/s
iteration_duration.............: avg=2.1s p(95)=3.8s
iterations.....................: 352,650 1,869/s
vus............................: 312 min=12 max=487
vus_max........................: 500
The Metrics That Matter¶
| Metric | What It Tells You | What to Worry About |
|---|---|---|
| `http_req_duration` p(50) | Median latency. What a typical user sees | Baseline. Compare across runs |
| `http_req_duration` p(95) | What 1 in 20 users sees | This is your SLO target |
| `http_req_duration` p(99) | What 1 in 100 users sees | Tail latency. Connection pools, GC pauses |
| `http_req_failed` | Error rate | Should be < 0.1% for a healthy service |
| `http_reqs` (rate) | Actual throughput (RPS) | Did you hit your target rate? |
| `vus` max | Peak concurrent VUs | If it hit `maxVUs`, you had dropped iterations |
| `http_req_blocked` | Time waiting for a free TCP connection | High = connection pool exhaustion on test runner |
| `http_req_connecting` | Time establishing TCP connection | High = DNS or network issues |
| `http_req_tls_handshaking` | TLS negotiation time | High = CPU-bound TLS or cert chain too long |
| `http_req_waiting` | Server processing time (TTFB) | The "real" latency minus network overhead |
Reading the Gap Between Percentiles¶
The gap between p50 and p99 tells you a story. Take the run above: p50 = 89ms, p99 = 1.2s -- a 13x gap.
That gap means most requests are fast (89ms) but some hit a completely different code path. Common causes:
- Cache miss (cache hit = 89ms, cache miss = 1.2s with a database round trip)
- GC pause (most requests dodge GC, unlucky ones wait for a full GC cycle)
- Connection pool exhaustion (most requests get a connection, some wait in the queue)
- Cold starts (first request to a scaled-up pod or Lambda function)
A narrow gap -- say p50 = 89ms, p99 = 180ms -- is a healthy system under load: everyone gets roughly the same experience.
Gotcha: A service where 95% of requests take 100ms but 5% take 5,000ms is broken for 1 in 20 users. Average latency would report 345ms -- technically "fast." p95 reports 5,000ms -- the truth. Averages lie, percentiles don't.
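You can verify that arithmetic with a quick sketch (plain JavaScript; `pct` is a simplified nearest-rank percentile, good enough for illustration):

```javascript
// 95 requests at 100ms, 5 at 5,000ms -- the distribution from the Gotcha above.
const latencies = [...Array(95).fill(100), ...Array(5).fill(5000)];

const avg = latencies.reduce((a, b) => a + b, 0) / latencies.length;

// Simplified nearest-rank percentile over the sorted samples.
const sorted = [...latencies].sort((a, b) => a - b);
const pct = (q) => sorted[Math.min(sorted.length - 1, Math.floor(q * sorted.length))];

console.log(avg);       // 345 -- "technically fast"
console.log(pct(0.5));  // 100 -- the typical user
console.log(pct(0.95)); // 5000 -- the truth for 1 in 20 users
```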
Part 6: Where Systems Actually Break¶
You've found the breaking point. Now you need to figure out what broke. This is where load testing crosses into Linux performance diagnosis.
The Bottleneck Hierarchy¶
When a system degrades under load, work your way down:
| Layer | Common Bottlenecks |
|---|---|
| Application | Thread pools, connection pools, memory leaks |
| Database | Lock contention, slow queries, connection limits |
| OS / Kernel | CPU saturation, memory pressure, file descriptors, socket limits |
| Infrastructure | Disk I/O, network bandwidth, load balancer limits |
The Diagnostic Checklist (Run While Your Load Test Is Running)¶
# CPU: any core at 100% = single-threaded bottleneck
mpstat -P ALL 1 3
# Memory: si/so > 0 = swapping (bad). 'b' column > 0 = blocked on I/O
vmstat 1 5
# Disk: %util near 100% or await > 10ms (SSD) = storage bottleneck
iostat -xz 1 3
# Network: high timewait = connection churn
ss -s
# File descriptors: used near max = ulimit problem
cat /proc/sys/fs/file-nr
# Database connections: active near max_connections = pool bottleneck
psql -c "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"
# Database slow queries
psql -c "SELECT query, mean_exec_time, calls
FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10;"
Under the Hood: On Linux, every network connection is a file descriptor. The default per-process limit (`ulimit -n`) is often 1024. A web server handling 2,000 concurrent connections needs at least 2,000 file descriptors -- plus some for log files, database connections, and other I/O. The system-wide limit (`/proc/sys/fs/file-max`) is separate and usually much higher. Both must be sufficient.
The Connection Pool War Story¶
War Story: A team ran a stress test against their API service. At 500 RPS, everything was fine. At 800 RPS, latency gradually climbed from 100ms to 30 seconds over 10 minutes, then errors cascaded. CPU was at 40%. Memory was fine. Disk was idle.
The culprit: a database connection pool sized at 20. At 500 RPS with 40ms database latency, Little's Law gives 500 x 0.04 = 20 concurrent database connections -- exactly the pool size. At 800 RPS: 800 x 0.04 = 32 connections needed, but only 20 available. Requests queued for a connection. Queue wait added to latency. Higher latency meant more requests in flight (Little's Law again). The pool could never catch up.
The fix was a one-line config change: pool size from 20 to 50. But the team only found it because they correlated the load test timeline with database connection metrics in Grafana. Without that dashboard, they would have blamed "the network" and wasted days.
Remember: Little's Law predicts connection pool exhaustion. Required connections = throughput x per-request database time. If that number exceeds your pool size, requests queue and the death spiral begins. Always calculate this before your load test.
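That pre-test calculation translates directly into two helper formulas -- a sketch using the war story's numbers (helper names are ours):

```javascript
// Little's Law applied to the connection pool:
//   required connections = RPS x per-request DB time
//   max RPS before exhaustion = pool size / per-request DB time
const requiredConnections = (rps, dbSeconds) => rps * dbSeconds;
const maxRpsForPool = (poolSize, dbSeconds) => poolSize / dbSeconds;

// The war story: pool of 20, 40ms of DB time per request.
console.log(requiredConnections(500, 0.04)); // 20 -- exactly at the limit
console.log(requiredConnections(800, 0.04)); // 32 -- 12 more than the pool has
console.log(maxRpsForPool(20, 0.04));        // 500 -- the ceiling the stress test found
```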
Flashcard Check #2¶
Q4: CPU is at 40% during a load test but latency is climbing. What should you check next?
Connection pools (database, HTTP client), thread pools, and downstream service latency. Low CPU with high latency means the application is waiting, not computing.
Q5: iostat -xz 1 shows %util at 95% on your database server's disk. What does that mean?
The disk is nearly saturated. If `await` is also high (>10ms for SSD), queries are waiting for I/O. This is your bottleneck.
Q6: During a soak test, pg_stat_activity shows connection count growing by 1 every minute and never shrinking. What's happening?
Connection leak. The application is opening connections but not returning them to the pool on some code paths (typically error paths). Eventually `max_connections` is reached and all new requests fail.
Part 7: Testing Kubernetes Services¶
Black Friday isn't just about your application code. If you're running on Kubernetes, the infrastructure has its own set of bottlenecks.
What Breaks First in Kubernetes Under Load¶
| Component | Failure Mode | How You'll See It |
|---|---|---|
| HPA (Horizontal Pod Autoscaler) | Scales too slowly | Latency spike for 2-3 minutes, then recovery |
| Node autoscaler | New nodes take 3-5 min to join | Pods stuck in Pending state |
| Service/kube-proxy | iptables rules lag behind new pods | Requests routed to not-yet-ready pods (5xx) |
| Ingress controller | Connection limits per pod | 502/503 errors at the ingress |
| DNS (CoreDNS) | Lookup latency under load | Intermittent timeouts, especially first requests |
| etcd | Write latency > 100ms | API server slowness, deploy failures |
Before your load test, check: kubectl get hpa -A (are you near maxReplicas?),
kubectl top nodes (CPU/memory headroom), and kubectl get pods -n kube-system -l
k8s-app=kube-dns (enough CoreDNS pods?).
Gotcha: If your application does DNS lookups for every database connection (common in service mesh setups), CoreDNS becomes a hidden bottleneck under load. Two CoreDNS pods can handle typical traffic, but at 10x they may fall behind.
Shadow Traffic: Testing Production Without Risk¶
The most honest load test runs against production. Staging lies -- different data, different scale, different network topology. But you can't risk breaking production.
Shadow traffic (also called traffic mirroring) copies real production requests to a test endpoint without affecting the response to the real user.
With Istio, add a mirror block to your VirtualService pointing at a canary subset with
a mirrorPercentage of 10-20%. With Envoy, use the request_mirror_policies route config.
The mirrored traffic is fire-and-forget: responses from the canary are discarded. Your
production users see no difference. But your canary gets real production traffic patterns
-- the exact bursty, unpredictable distribution that synthetic tests can never replicate.
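A minimal sketch of what the Istio side might look like (hypothetical service and subset names; the `mirror` and `mirrorPercentage` fields are from Istio's networking v1beta1 API):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: shop-api            # hypothetical service name
spec:
  hosts:
    - shop-api
  http:
    - route:
        - destination:
            host: shop-api
            subset: stable   # real users keep hitting stable
      mirror:
        host: shop-api
        subset: canary       # the test deployment receiving the copies
      mirrorPercentage:
        value: 10.0          # mirror 10% of live requests, fire-and-forget
```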
Trivia: Netflix's "replay traffic testing" takes this further -- capturing real production traffic and replaying it against new service versions. Far more realistic than synthetic tests, but requires careful handling to avoid side effects like duplicate emails.
Part 8: Load Testing in CI/CD¶
You found the breaking point. You fixed the bottleneck. But how do you make sure it doesn't regress? Put a baseline test in your pipeline.
The key: baseline, not stress. In CI, you're catching regressions, not finding the breaking point. A 2-minute constant-rate test with strict thresholds does the job.
// tests/load/baseline.js -- CI performance gate
export const options = {
scenarios: {
baseline: {
executor: 'constant-arrival-rate',
rate: 50,
timeUnit: '1s',
duration: '2m',
preAllocatedVUs: 50,
maxVUs: 100,
},
},
thresholds: {
http_req_duration: [
{ threshold: 'p(95)<300', abortOnFail: true, delayAbortEval: '30s' },
],
http_req_failed: [
{ threshold: 'rate<0.01', abortOnFail: true },
],
},
};
In GitHub Actions, use grafana/k6-action@v0.3.1. Spin up your app and database as
service containers, run the baseline test, and fail the PR if thresholds breach.
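As a sketch, such a workflow might look like this (hypothetical image name, port, and paths; `filename` and `flags` are inputs we assume grafana/k6-action accepts -- check its README for your version):

```yaml
# .github/workflows/perf-gate.yml -- illustrative only
name: perf-gate
on: pull_request
jobs:
  baseline:
    runs-on: ubuntu-latest
    services:
      app:
        image: ghcr.io/example/shop-api:latest  # assumed app image
        ports:
          - 8080:8080
    steps:
      - uses: actions/checkout@v4
      - name: Run k6 baseline test
        uses: grafana/k6-action@v0.3.1
        with:
          filename: tests/load/baseline.js
          flags: -e BASE_URL=http://localhost:8080
```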
Gotcha: CI runners have limited resources. Your test might fail because the runner can't generate enough load, not because of a real regression. Calibrate thresholds against what the CI environment can actually deliver, and use
delayAbortEvalto ignore warm-up.
Part 9: The Tool Landscape¶
k6 isn't the only option. Here's when you'd reach for something else.
| Tool | Language | Best For | Watch Out |
|---|---|---|---|
| k6 | JavaScript (Go engine) | Most teams. CI integration. Grafana ecosystem | No browser rendering. WebSocket support is basic |
| Locust | Python | Complex flows needing Python logic. Data science teams | Single-threaded per worker. Need distributed mode for high RPS |
| Gatling | Scala/Java | JVM shops. Beautiful HTML reports | Steep learning curve. JVM memory overhead |
| wrk | Lua | Raw HTTP benchmarking. Maximum RPS from one machine | No complex scenarios. No built-in reporting |
| hey | Go | Quick one-liner tests ("is this endpoint fast?") | Very basic. No scenarios, no scripting |
| Apache Bench (ab) | C | Installed everywhere. Quick sanity check | Single-threaded. No modern features (no HTTP/2) |
| Vegeta | Go | Constant-rate HTTP attacks. Unix pipeline friendly | Limited scenario support |
# hey: instant one-liner latency check
hey -n 1000 -c 50 https://api.example.com/health
# wrk: saturate an endpoint for 30 seconds
wrk -t10 -c200 -d30s https://api.example.com/products
# vegeta: constant rate, Unix-pipe style
echo "GET https://api.example.com/health" | \
vegeta attack -duration=60s -rate=100/s | \
vegeta report
Trivia: ApacheBench (
ab) has been bundled with the Apache HTTP Server since 1996 -- nearly 30 years old and still the first tool many reach for, because it's already installed on almost every Linux server. Locust's name references a swarm; Gatling is named after the Gatling gun. The naming convention in load testing trends toward violent metaphors: attack, blast, siege, wrk (pronounced "work," but suspiciously close to "wreck").
Exercises¶
Exercise 1: Calculate Before You Test (2 minutes)¶
Your API has these production metrics:
- 300 RPS average, 450 RPS peak
- 150ms average latency
- Database connection pool: 30
- Database query time: 25ms per request

Using Little's Law, answer:
1. How many concurrent requests at average load?
2. How many database connections needed at peak?
3. Will the connection pool hold at 3x peak traffic?
Answers
1. Concurrency at average: 300 x 0.15 = 45 concurrent requests
2. DB connections at peak: 450 x 0.025 = 11.25 -- pool of 30 is fine
3. At 3x peak (1,350 RPS): 1,350 x 0.025 = 33.75 DB connections needed. Pool of 30 is not enough. You'll see request queuing starting at 1,200 RPS (30 / 0.025).

Exercise 2: Spot the Coordinated Omission (5 minutes)¶
A colleague shows you this k6 test and says "we can handle 500 RPS, p99 is only 200ms":
import http from 'k6/http';
import { sleep } from 'k6';

export const options = { vus: 500, duration: '5m' };
export default function () {
http.get('https://api.example.com/products');
sleep(1);
}
What's wrong with their conclusion? Write a corrected version.
Answer
Three problems:

1. **Coordinated omission**: VU-based executor with `sleep(1)`. At best this sends 500 / (response_time + 1s) RPS. If response time is 200ms, that's ~417 RPS, not 500. If the server slows to 2s, throughput drops to ~167 RPS. The test backs off under load.
2. **Hardcoded data**: Every VU hits the same endpoint with no variation. Cache hits everywhere.
3. **No thresholds**: Checks aren't failing the test. A 100% error rate still exits 0.

Corrected:
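One possible corrected version (a sketch, run under k6 rather than Node; the rate and threshold values are illustrative):

```javascript
import http from 'k6/http';
import { check } from 'k6';

export const options = {
  scenarios: {
    load: {
      executor: 'constant-arrival-rate', // holds 500 RPS even if the server slows
      rate: 500,
      timeUnit: '1s',
      duration: '5m',
      preAllocatedVUs: 200,
      maxVUs: 1000, // dropped iterations here = the server can't keep up
    },
  },
  thresholds: {
    http_req_duration: ['p(99)<200'], // the claim being tested
    http_req_failed: ['rate<0.01'],   // errors now fail the run
  },
};

export default function () {
  const res = http.get('https://api.example.com/products');
  check(res, { 'status 200': (r) => r.status === 200 });
}
```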
During a stress test at 2,000 RPS you see: p50=80ms, p99=12,000ms, CPU 35%, memory 60%, DB CPU 25%, DB connections 48/50 active. What's the bottleneck? What do you fix?
Answer
**Bottleneck: Database connection pool.** 48/50 active connections while CPU is healthy everywhere. The p50/p99 gap means most requests get a connection immediately (80ms), but some wait in the queue (12s). Little's Law: 2,000 x 0.025s = 50 connections needed -- right at the edge with zero headroom. Increase to 75-100, but first verify the database can handle more concurrent queries.

Cheat Sheet¶
Little's Law Quick Reference¶
Concurrency = Throughput x Latency
Throughput = Concurrency / Latency
Latency = Concurrency / Throughput
Required connection pool = RPS x per-request-db-time
Max RPS before pool exhaustion = pool_size / per-request-db-time
k6 Executors¶
| Executor | Use When | Coordinated Omission? |
|---|---|---|
| `constant-vus` / `ramping-vus` | N concurrent sessions | Yes |
| `constant-arrival-rate` | Fixed RPS, real traffic | No |
| `ramping-arrival-rate` | Ramping RPS, real traffic | No |
Quick Commands¶
| What | Command |
|---|---|
| k6 basic run | k6 run script.js |
| k6 to Grafana | k6 run --out influxdb=http://localhost:8086/k6 script.js |
| k6 env vars | k6 run -e BASE_URL=https://staging.example.com script.js |
| hey quick check | hey -n 1000 -c 50 https://api.example.com/health |
| wrk saturate | wrk -t10 -c200 -d30s https://api.example.com/endpoint |
| Watch connections | watch -n 1 "ss -tn state established \| wc -l" |
| DB connections | psql -c "SELECT count(*), state FROM pg_stat_activity GROUP BY state;" |
Load Test Result Red Flags¶
| Signal | Likely Cause |
|---|---|
| p99 >> p50 (10x+ gap) | Connection pool exhaustion, GC pauses, cache misses |
| Throughput plateaus then drops | Server saturation -- check CPU, threads, connections |
| Error rate climbs gradually | Resource leak (connections, memory, file descriptors) |
| Latency spikes every N seconds | Garbage collection or background job interference |
| VUs hit maxVUs | Arrival rate can't be maintained -- server is too slow |
| `http_req_blocked` is high | Test runner connection pool exhaustion (not server issue) |
Takeaways¶
- **Little's Law is your capacity calculator.** Concurrency = Throughput x Latency. If you know any two, you know the third. Calculate required connection pools and thread counts before you test, not after the system crashes.
- **Always use arrival-rate executors for traffic simulation.** VU-based tests hide problems by backing off when the server slows down. This is coordinated omission, and it makes your results look 10-100x better than reality.
- **The percentile gap tells the story.** A small gap between p50 and p99 means consistent performance. A large gap means some users are having a terrible time while most are fine. Always report p50, p95, and p99 -- never just the average.
- **Correlate load test results with server metrics.** A load test without server-side dashboards is like running a stress test blindfolded. You'll know that something broke but not what. CPU, memory, connection pools, and database metrics must be visible during every test run.
- **Soak tests catch what stress tests miss.** A 10-minute burst won't reveal memory leaks, connection pool exhaustion, or slow resource accumulation. Run at expected load for 4-8 hours before any major launch.
- **Put a baseline test in CI.** You don't need a full stress test on every PR. A 2-minute constant-rate test with strict thresholds catches regressions before they reach production.
Related Lessons¶
- The Cascading Timeout -- what happens when the death spiral from Part 2 actually hits production
- Prometheus and the Art of Not Alerting -- setting up the dashboards you need during load tests
- The Mysterious Latency Spike -- diagnosing production latency issues after the load test is over
- Deploy a Web App From Nothing -- building the service that eventually needs load testing
- Kubernetes Services: How Traffic Finds Your Pod -- understanding the K8s networking layer that load tests exercise