Load Testing: Finding the Breaking Point
- lesson
- load-testing
- performance
- capacity-planning
- kubernetes
- observability
- linux-internals

Topics: load testing, performance, capacity planning, Kubernetes, observability, Linux internals
Level: L1-L2 (Foundations to Operations)
Time: 75-90 minutes
Prerequisites: None (everything is explained from scratch)
The Mission¶
It's October. Black Friday is five weeks away. Your VP of Engineering just forwarded an email from the business side: "Marketing is projecting 10x normal traffic on Black Friday. Can our systems handle it?" The VP's reply to you is one line: "Can they?"
You don't know. Nobody knows. The system has never seen 10x traffic. Staging is half the size
of production. The last "load test" was someone running curl in a for loop eight months
ago.
Your job over the next five weeks: find out if the system survives 10x traffic, and if it doesn't, figure out exactly where it breaks so the team can fix it before the turkey lands.
By the end of this lesson you'll understand:
- The difference between load, stress, soak, and spike testing (and when each one matters)
- How to write a realistic k6 load test (not a hello-world)
- Little's Law -- the one equation that ties concurrency, throughput, and latency together
- Why most benchmarks lie (the coordinated omission problem)
- How to read load test results and know what the numbers actually mean
- Where systems actually break: CPU, memory, connection pools, database locks, network
- How to run load tests in CI/CD and against production safely
Part 1: What Kind of Test Do You Need?¶
Not all load tests answer the same question. Picking the wrong type is like checking your oil when the tire is flat.
| Test Type | Traffic Shape | Duration | The Question It Answers |
|---|---|---|---|
| Load | Constant or ramp to target | 10-30 min | "Does it work at expected traffic?" |
| Stress | Ramp past expected limits | 30-60 min | "Where does it break?" |
| Soak | Sustained expected load | 4-24 hours | "Does it leak memory or connections?" |
| Spike | Instant jump to 10x | 5-10 min | "Does autoscaling actually work?" |
| Breakpoint | Slow ramp until failure | 60-120 min | "What's the exact capacity ceiling?" |
For Black Friday, you need all of them -- in order:
1. Load test first: does the system behave at current traffic? (Baseline.)
2. Stress test: ramp to 10x. Where does it crack?
3. Fix whatever broke. Repeat.
4. Spike test: simulate the midnight rush. Does autoscaling kick in fast enough?
5. Soak test: run at 3x for 8 hours. Any memory leaks hiding?
Gotcha: A 10-minute stress test won't catch a connection pool leak that only manifests after 2 hours of sustained load. The soak test exists specifically for this -- it's the test that finds the slow bleed.
Part 2: Little's Law -- The One Equation You Need¶
Before you touch a load testing tool, you need one mental model. Everything else builds on it.
Mental Model: Little's Law states that in a stable system:
L = lambda x W
Or in human terms: Concurrency = Throughput x Latency
- L (concurrency): number of requests in flight at any moment
- lambda (throughput): requests completed per second
- W (latency): average time to complete one request
This is not an approximation. It's a mathematical law proven by John Little in 1961. It holds for any stable system -- a web server, a grocery store checkout, a highway.
Worked Example 1: Black Friday Capacity¶
Your system currently handles 200 requests per second with 100ms average latency. By Little's Law, that's 200 x 0.1s = 20 requests in flight at any time. Your connection pool is set to 50. Plenty of headroom.
Now Black Friday hits. Traffic goes to 2,000 RPS. Even if latency stays at 100ms (optimistic), concurrency becomes 2,000 x 0.1 = 200 requests in flight.
Your connection pool of 50 just became the bottleneck. Requests queue. Latency climbs. And when latency climbs, things get worse.
Worked Example 2: The Death Spiral¶
At 2,000 RPS, the connection pool saturates and latency climbs to 500ms. Little's Law: 2,000 x 0.5 = 1,000.
One thousand requests in flight. Your application threads are exhausted. The request queue overflows. Timeouts cascade. Latency jumps to 5 seconds -- and now 2,000 x 5 = 10,000 requests are in flight.
This is how systems die. Little's Law predicts the death spiral: higher latency means more concurrency, which means more queueing, which means higher latency. Once you're past the tipping point, the system cannot recover without shedding load.
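The arithmetic of the spiral can be sketched in a few lines of plain JavaScript (runnable with Node; the helper name `littlesLaw` is ours, not a library function):

```javascript
// Little's Law: concurrency (L) = throughput (lambda) x latency (W)
const littlesLaw = (rps, latencySeconds) => rps * latencySeconds;

// Numbers from the worked examples above:
console.log(littlesLaw(200, 0.1));  // baseline: 20 requests in flight
console.log(littlesLaw(2000, 0.1)); // 10x traffic, same latency: 200 in flight
console.log(littlesLaw(2000, 0.5)); // pool saturates, 500ms latency: 1000 in flight
console.log(littlesLaw(2000, 5));   // timeouts cascade at 5s: 10000 in flight
```

Each step up in latency multiplies the concurrency the system must hold, which is exactly the feedback loop described above.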
Trivia: John D.C. Little proved his law in 1961 at MIT. The elegant thing about it: he proved it requires no assumptions about the arrival distribution, service distribution, or order of service. It works for M/M/1 queues, web servers, drive-throughs, and emergency rooms. It's one of the most general results in queueing theory.
Interview Bridge: "Explain how a system can be fine at 1,000 RPS but completely collapse at 1,200 RPS." Little's Law + connection pool exhaustion is the textbook answer. The transition from "working" to "dead" is non-linear because of the feedback loop between latency and concurrency.
Flashcard Check #1¶
Q1: Your service handles 500 RPS at 200ms latency. How many concurrent requests are in flight?
500 x 0.2 = 100 concurrent requests.
Q2: That same service has a max thread pool of 80. What happens when traffic increases?
At 400 RPS (80 / 0.2s), the thread pool saturates. Additional requests queue, latency increases, and the death spiral begins unless you shed load (rate limiting, circuit breaker).
Q3: Little's Law says Concurrency = Throughput x Latency. If you double throughput and latency stays flat, what happens to concurrency?
It doubles. You need twice the connection pool, twice the threads, twice the database connections. This is why "just add more traffic" breaks things that seemed fine.
Part 3: Your First Real Load Test (k6 Walkthrough)¶
Enough theory. Let's write a test that actually models your e-commerce system under Black Friday conditions.
Name Origin: k6 was created by Load Impact, a Swedish company founded in 2010. They open-sourced k6 in 2017, deliberately choosing JavaScript (ES6) scripts over JMeter's XML-heavy GUI approach. Grafana Labs acquired them in 2021, recognizing that load testing and observability are two halves of the same problem. As for the name itself, the origin is murky -- the team has said "k6" simply sounded good.
The Script¶
This isn't a hello-world. This models a real user flow: browse products, add to cart, checkout. Different actions happen at different rates, just like real traffic.
import http from 'k6/http';
import { check, sleep, group } from 'k6';
import { SharedArray } from 'k6/data';
import { Counter, Rate, Trend } from 'k6/metrics';
// --- Custom metrics (beyond k6 defaults) ---
const checkoutErrors = new Counter('checkout_errors');
const checkoutSuccess = new Rate('checkout_success_rate');
const checkoutDuration = new Trend('checkout_duration');
// --- Test data: loaded once, shared across all VUs ---
const products = new SharedArray('products', function () {
return JSON.parse(open('./data/products.json'));
// e.g., [{"id": "prod-001", "name": "Widget"}, ...]
});
const users = new SharedArray('users', function () {
return JSON.parse(open('./data/users.json'));
// e.g., [{"id": "user-042", "token": "eyJhb..."}, ...]
});
// --- Configuration ---
const BASE_URL = __ENV.BASE_URL || 'https://staging.shop.example.com';
export const options = {
scenarios: {
// 90% of users just browse
browse: {
executor: 'ramping-arrival-rate',
startRate: 50,
timeUnit: '1s',
preAllocatedVUs: 100,
maxVUs: 500,
stages: [
{ duration: '2m', target: 50 }, // warm up
{ duration: '5m', target: 500 }, // ramp to 10x
{ duration: '10m', target: 500 }, // hold at 10x
{ duration: '2m', target: 0 }, // cool down
],
exec: 'browseFlow',
},
// 10% of users check out
checkout: {
executor: 'ramping-arrival-rate',
startRate: 5,
timeUnit: '1s',
preAllocatedVUs: 50,
maxVUs: 200,
stages: [
{ duration: '2m', target: 5 },
{ duration: '5m', target: 50 },
{ duration: '10m', target: 50 },
{ duration: '2m', target: 0 },
],
exec: 'checkoutFlow',
},
},
thresholds: {
http_req_duration: ['p(95)<500', 'p(99)<2000'],
http_req_failed: ['rate<0.01'],
checkout_success_rate: ['rate>0.99'],
checkout_duration: ['p(95)<3000'],
},
};
Let's break down the key decisions:
| Decision | Why |
|---|---|
| `ramping-arrival-rate` executor | Models real traffic: users don't slow down when your server does (avoids coordinated omission) |
| Two scenarios (browse + checkout) | Real traffic isn't uniform -- 90% browse, 10% buy. Different endpoints, different load profiles |
| `SharedArray` for test data | Loaded once, shared across VUs. Without it, 500 VUs each load the file = OOM on the test runner |
| `maxVUs: 500` | Hard safety cap. Even if you misconfigure the rate, you can't accidentally DDoS |
| Thresholds on p(95) and p(99) | Averages lie. p95 tells you what 1 in 20 users experiences. p99 tells you what 1 in 100 sees |
Now the user flows:
export function browseFlow() {
const product = products[Math.floor(Math.random() * products.length)];
group('browse', () => {
const listRes = http.get(`${BASE_URL}/api/products?page=1&limit=20`);
check(listRes, {
'product list 200': (r) => r.status === 200,
'has products': (r) => r.json('items').length > 0,
});
const detailRes = http.get(`${BASE_URL}/api/products/${product.id}`);
check(detailRes, { 'product detail 200': (r) => r.status === 200 });
});
sleep(Math.random() * 2 + 1); // 1-3s think time
}
export function checkoutFlow() {
const user = users[__VU % users.length];
const product = products[Math.floor(Math.random() * products.length)];
const params = {
headers: {
'Content-Type': 'application/json',
'Authorization': `Bearer ${user.token}`,
},
timeout: '10s', // fail fast -- don't let VUs hang
};
group('checkout', () => {
const cartRes = http.post(
`${BASE_URL}/api/cart`,
JSON.stringify({ product_id: product.id, quantity: 1 }),
params
);
check(cartRes, { 'add to cart': (r) => r.status === 201 });
if (cartRes.status !== 201) { checkoutErrors.add(1); return; }
const start = Date.now();
const orderRes = http.post(
`${BASE_URL}/api/orders`,
JSON.stringify({ cart_id: cartRes.json('cart_id') }),
params
);
checkoutDuration.add(Date.now() - start);
checkoutSuccess.add(orderRes.status === 201);
check(orderRes, {
'order created': (r) => r.status === 201,
'has order id': (r) => r.json('order_id') !== undefined,
});
if (orderRes.status !== 201) { checkoutErrors.add(1); }
});
sleep(Math.random() * 3 + 2); // 2-5s think time
}
Gotcha: Notice the `timeout: '10s'` in the checkout params. Without it, k6 waits indefinitely for a response. When the server starts queuing under load, your VUs get stuck waiting instead of measuring the failure. Always set timeouts that match your SLOs.
Run it:
# Against staging
k6 run -e BASE_URL=https://staging.shop.example.com load-test.js
# With real-time Grafana output
k6 run --out influxdb=http://localhost:8086/k6 load-test.js
# With JSON output for later analysis
k6 run --out json=results.json load-test.js
Part 4: Why Most Benchmarks Lie (Coordinated Omission)¶
You run your beautiful load test. The results look great: p99 latency is 800ms, error rate is 0.2%. You report to the VP: "We can handle 10x." Black Friday arrives. The system collapses at 4x. What happened?
Coordinated omission happened.
Mental Model: Imagine you're timing how long it takes to get coffee at a cafe. You walk in, order, time the wait, get your coffee: 3 minutes. You do this 100 times. Average: 3 minutes. Sounds accurate.
But here's what you missed: every time the cafe was slow, you waited in line before ordering. You didn't time the line -- only the order-to-delivery. A real customer who walked in during the rush waited 3 minutes in line + 3 minutes for the order = 6 minutes total. Your benchmark said 3 minutes. Reality was 6.
That's coordinated omission.
How It Works in Load Testing¶
With VU-based execution (the default in most tools), each VU does request-sleep-request in a loop. When the server slows to 3 seconds per request, each VU sends one-third as many requests. You think you're testing at 100 RPS. You're actually testing at 33 RPS. The load test automatically backed off at exactly the moment things got interesting.
Your reported latencies look great because you measured fewer requests during the slow period. You measured 33 responses at 3s instead of the 100 that real users would have sent.
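The backoff is easy to model with a back-of-envelope calculation (plain JavaScript; `effectiveRps` is an illustrative helper, not a k6 API):

```javascript
// Closed-loop (VU-based) model: each VU loops request -> think time -> request.
// Effective throughput = VUs / (latency + think time), so load falls as latency rises.
const effectiveRps = (vus, latencySeconds, thinkSeconds = 0) =>
  vus / (latencySeconds + thinkSeconds);

console.log(effectiveRps(100, 1)); // healthy server, 1s per iteration: 100 RPS
console.log(effectiveRps(100, 3)); // server degrades to 3s: ~33 RPS -- the test backs off
```

An open-loop (arrival-rate) generator has no such feedback: it keeps firing at the target rate no matter what the server does.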
Trivia: Gil Tene, CTO of Azul Systems, coined the term "coordinated omission" in a 2013 presentation. He demonstrated that most benchmarking tools systematically underreported latency by 10-100x. The word "coordinated" refers to the fact that the load generator and the system under test are inadvertently cooperating to hide the problem -- when the server slows down, the generator slows down too.
The Fix: Arrival Rate Executors¶
// BAD: VU-based -- backs off when server slows
export const options = {
vus: 100,
duration: '5m',
};
// GOOD: Arrival rate -- maintains target rate regardless
export const options = {
scenarios: {
load: {
executor: 'constant-arrival-rate',
rate: 100, // 100 iterations/second, period
timeUnit: '1s',
duration: '10m',
preAllocatedVUs: 100,
maxVUs: 500, // k6 spins up more VUs as requests back up
},
},
};
With constant-arrival-rate, k6 sends exactly 100 requests per second. If the server slows
down, k6 spawns more VUs to maintain the rate. If it hits maxVUs and still can't keep up,
it reports dropped iterations -- which is exactly what happens to real users (they get
timeouts or errors).
Remember: If you only take one thing from this lesson: always use arrival-rate executors for load tests that model real traffic. VU-based tests are fine for modeling "50 people logged in simultaneously" -- but not for "what happens at 2,000 RPS."
Part 5: Reading the Results -- What the Numbers Actually Mean¶
Your test ran. k6 printed a wall of stats. Here's how to read them.
data_received..................: 847 MB 4.5 MB/s
data_sent......................: 126 MB 668 kB/s
http_req_blocked...............: avg=1.2ms p(95)=3.4ms
http_req_connecting............: avg=0.8ms p(95)=2.1ms
http_req_duration..............: avg=127ms p(50)=89ms p(95)=340ms p(99)=1.2s
http_req_failed................: 0.34% ✓ 1,203 ✗ 351,447
http_req_receiving.............: avg=2.1ms p(95)=8.4ms
http_req_sending...............: avg=0.3ms p(95)=0.8ms
http_req_tls_handshaking.......: avg=5.2ms p(95)=12.3ms
http_req_waiting...............: avg=119ms p(50)=82ms p(95)=325ms p(99)=1.1s
http_reqs......................: 352,650 1,869/s
iteration_duration.............: avg=2.1s p(95)=3.8s
iterations.....................: 352,650 1,869/s
vus............................: 312 min=12 max=487
vus_max........................: 500
The Metrics That Matter¶
| Metric | What It Tells You | What to Worry About |
|---|---|---|
| `http_req_duration` p(50) | Median latency. What a typical user sees | Baseline. Compare across runs |
| `http_req_duration` p(95) | What 1 in 20 users sees | This is your SLO target |
| `http_req_duration` p(99) | What 1 in 100 users sees | Tail latency. Connection pools, GC pauses |
| `http_req_failed` | Error rate | Should be < 0.1% for a healthy service |
| `http_reqs` (rate) | Actual throughput (RPS) | Did you hit your target rate? |
| `vus` max | Peak concurrent VUs | If it hit `maxVUs`, you had dropped iterations |
| `http_req_blocked` | Time waiting for a free TCP connection | High = connection pool exhaustion on test runner |
| `http_req_connecting` | Time establishing TCP connection | High = DNS or network issues |
| `http_req_tls_handshaking` | TLS negotiation time | High = CPU-bound TLS or cert chain too long |
| `http_req_waiting` | Server processing time (TTFB) | The "real" latency minus network overhead |
Reading the Gap Between Percentiles¶
The gap between p50 and p99 tells you a story. Take the run above: p50 = 89ms, p99 = 1.2s -- a 13x gap.
That gap means most requests are fast (89ms) but some hit a completely different code path. Common causes:
- Cache miss (cache hit = 89ms, cache miss = 1.2s with a database round trip)
- GC pause (most requests dodge GC, unlucky ones wait for a full GC cycle)
- Connection pool exhaustion (most requests get a connection, some wait in the queue)
- Cold starts (first request to a scaled-up pod or Lambda function)
A narrow gap -- say p50 = 89ms, p99 = 180ms -- is a healthy system under load: everyone gets roughly the same experience.
Gotcha: A service where 95% of requests take 100ms but 5% take 5,000ms is broken for 1 in 20 users. Average latency would report 345ms -- technically "fast." p95 reports 5,000ms -- the truth. Averages lie, percentiles don't.
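You can verify that arithmetic with a quick sketch (plain JavaScript; `pct` is a simplified nearest-rank percentile, good enough for illustration):

```javascript
// 95 requests at 100ms, 5 at 5,000ms -- the distribution from the Gotcha above.
const latencies = [...Array(95).fill(100), ...Array(5).fill(5000)];

const avg = latencies.reduce((a, b) => a + b, 0) / latencies.length;

// Simplified nearest-rank percentile over the sorted samples.
const sorted = [...latencies].sort((a, b) => a - b);
const pct = (q) => sorted[Math.min(sorted.length - 1, Math.floor(q * sorted.length))];

console.log(avg);       // 345 -- "technically fast"
console.log(pct(0.5));  // 100 -- the typical user
console.log(pct(0.95)); // 5000 -- the truth for 1 in 20 users
```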
Part 6: Where Systems Actually Break¶
You've found the breaking point. Now you need to figure out what broke. This is where load testing crosses into Linux performance diagnosis.
The Bottleneck Hierarchy¶
When a system degrades under load, work your way down:
| Layer | Common Bottlenecks |
|---|---|
| Application | Thread pools, connection pools, memory leaks |
| Database | Lock contention, slow queries, connection limits |
| OS / Kernel | CPU saturation, memory pressure, file descriptors, socket limits |
| Infrastructure | Disk I/O, network bandwidth, load balancer limits |
The Diagnostic Checklist (Run While Your Load Test Is Running)¶
# CPU: any core at 100% = single-threaded bottleneck
mpstat -P ALL 1 3
# Memory: si/so > 0 = swapping (bad). 'b' column > 0 = blocked on I/O
vmstat 1 5
# Disk: %util near 100% or await > 10ms (SSD) = storage bottleneck
iostat -xz 1 3
# Network: high timewait = connection churn
ss -s
# File descriptors: used near max = ulimit problem
cat /proc/sys/fs/file-nr
# Database connections: active near max_connections = pool bottleneck
psql -c "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"
# Database slow queries
psql -c "SELECT query, mean_exec_time, calls
FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10;"
Under the Hood: On Linux, every network connection is a file descriptor. The default per-process limit (`ulimit -n`) is often 1024. A web server handling 2,000 concurrent connections needs at least 2,000 file descriptors -- plus some for log files, database connections, and other I/O. The system-wide limit (`/proc/sys/fs/file-max`) is separate and usually much higher. Both must be sufficient.
The Connection Pool War Story¶
War Story: A team ran a stress test against their API service. At 500 RPS, everything was fine. At 800 RPS, latency gradually climbed from 100ms to 30 seconds over 10 minutes, then errors cascaded. CPU was at 40%. Memory was fine. Disk was idle.
The culprit: a database connection pool sized at 20. At 500 RPS with 40ms database latency, Little's Law gives 500 x 0.04 = 20 concurrent database connections -- exactly the pool size. At 800 RPS: 800 x 0.04 = 32 connections needed, but only 20 available. Requests queued for a connection. Queue wait added to latency. Higher latency meant more requests in flight (Little's Law again). The pool could never catch up.
The fix was a one-line config change: pool size from 20 to 50. But the team only found it because they correlated the load test timeline with database connection metrics in Grafana. Without that dashboard, they would have blamed "the network" and wasted days.
Remember: Little's Law predicts connection pool exhaustion. Required connections = throughput x per-request database time. If that number exceeds your pool size, requests queue and the death spiral begins. Always calculate this before your load test.
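That pre-test calculation translates directly into two helper formulas -- a sketch using the war story's numbers (helper names are ours):

```javascript
// Little's Law applied to the connection pool:
//   required connections = RPS x per-request DB time
//   max RPS before exhaustion = pool size / per-request DB time
const requiredConnections = (rps, dbSeconds) => rps * dbSeconds;
const maxRpsForPool = (poolSize, dbSeconds) => poolSize / dbSeconds;

// The war story: pool of 20, 40ms of DB time per request.
console.log(requiredConnections(500, 0.04)); // 20 -- exactly at the limit
console.log(requiredConnections(800, 0.04)); // 32 -- 12 more than the pool has
console.log(maxRpsForPool(20, 0.04));        // 500 -- the ceiling the stress test found
```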
Flashcard Check #2¶
Q4: CPU is at 40% during a load test but latency is climbing. What should you check next?
Connection pools (database, HTTP client), thread pools, and downstream service latency. Low CPU with high latency means the application is waiting, not computing.
Q5: iostat -xz 1 shows %util at 95% on your database server's disk. What does that mean?
The disk is nearly saturated. If `await` is also high (>10ms for SSD), queries are waiting for I/O. This is your bottleneck.
Q6: During a soak test, pg_stat_activity shows connection count growing by 1 every minute and never shrinking. What's happening?
Connection leak. The application is opening connections but not returning them to the pool on some code paths (typically error paths). Eventually `max_connections` is reached and all new requests fail.
Part 7: Testing Kubernetes Services¶
Black Friday isn't just about your application code. If you're running on Kubernetes, the infrastructure has its own set of bottlenecks.
What Breaks First in Kubernetes Under Load¶
| Component | Failure Mode | How You'll See It |
|---|---|---|
| HPA (Horizontal Pod Autoscaler) | Scales too slowly | Latency spike for 2-3 minutes, then recovery |
| Node autoscaler | New nodes take 3-5 min to join | Pods stuck in Pending state |
| Service/kube-proxy | iptables rules lag behind new pods | Requests routed to not-yet-ready pods (5xx) |
| Ingress controller | Connection limits per pod | 502/503 errors at the ingress |
| DNS (CoreDNS) | Lookup latency under load | Intermittent timeouts, especially first requests |
| etcd | Write latency > 100ms | API server slowness, deploy failures |
Before your load test, check: kubectl get hpa -A (are you near maxReplicas?),
kubectl top nodes (CPU/memory headroom), and kubectl get pods -n kube-system -l
k8s-app=kube-dns (enough CoreDNS pods?).
Gotcha: If your application does DNS lookups for every database connection (common in service mesh setups), CoreDNS becomes a hidden bottleneck under load. Two CoreDNS pods can handle typical traffic, but at 10x they may fall behind.
Shadow Traffic: Testing Production Without Risk¶
The most honest load test runs against production. Staging lies -- different data, different scale, different network topology. But you can't risk breaking production.
Shadow traffic (also called traffic mirroring) copies real production requests to a test endpoint without affecting the response to the real user.
With Istio, add a mirror block to your VirtualService pointing at a canary subset with
a mirrorPercentage of 10-20%. With Envoy, use the request_mirror_policies route config.
The mirrored traffic is fire-and-forget: responses from the canary are discarded. Your
production users see no difference. But your canary gets real production traffic patterns
-- the exact bursty, unpredictable distribution that synthetic tests can never replicate.
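A minimal sketch of what the Istio side might look like (hypothetical service and subset names; the `mirror` and `mirrorPercentage` fields are from Istio's networking v1beta1 API):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: shop-api            # hypothetical service name
spec:
  hosts:
    - shop-api
  http:
    - route:
        - destination:
            host: shop-api
            subset: stable   # real users keep hitting stable
      mirror:
        host: shop-api
        subset: canary       # the test deployment receiving the copies
      mirrorPercentage:
        value: 10.0          # mirror 10% of live requests, fire-and-forget
```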
Trivia: Netflix's "replay traffic testing" takes this further -- capturing real production traffic and replaying it against new service versions. Far more realistic than synthetic tests, but requires careful handling to avoid side effects like duplicate emails.
Part 8: Load Testing in CI/CD¶
You found the breaking point. You fixed the bottleneck. But how do you make sure it doesn't regress? Put a baseline test in your pipeline.
The key: baseline, not stress. In CI, you're catching regressions, not finding the breaking point. A 2-minute constant-rate test with strict thresholds does the job.
// tests/load/baseline.js -- CI performance gate
export const options = {
scenarios: {
baseline: {
executor: 'constant-arrival-rate',
rate: 50,
timeUnit: '1s',
duration: '2m',
preAllocatedVUs: 50,
maxVUs: 100,
},
},
thresholds: {
http_req_duration: [
{ threshold: 'p(95)<300', abortOnFail: true, delayAbortEval: '30s' },
],
http_req_failed: [
{ threshold: 'rate<0.01', abortOnFail: true },
],
},
};
In GitHub Actions, use grafana/k6-action@v0.3.1. Spin up your app and database as
service containers, run the baseline test, and fail the PR if thresholds breach.
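As a sketch, such a workflow might look like this (hypothetical image name, port, and paths; `filename` and `flags` are inputs we assume grafana/k6-action accepts -- check its README for your version):

```yaml
# .github/workflows/perf-gate.yml -- illustrative only
name: perf-gate
on: pull_request
jobs:
  baseline:
    runs-on: ubuntu-latest
    services:
      app:
        image: ghcr.io/example/shop-api:latest  # assumed app image
        ports:
          - 8080:8080
    steps:
      - uses: actions/checkout@v4
      - name: Run k6 baseline test
        uses: grafana/k6-action@v0.3.1
        with:
          filename: tests/load/baseline.js
          flags: -e BASE_URL=http://localhost:8080
```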
Gotcha: CI runners have limited resources. Your test might fail because the runner can't generate enough load, not because of a real regression. Calibrate thresholds against what the CI environment can actually deliver, and use
delayAbortEvalto ignore warm-up.
Part 9: The Tool Landscape¶
k6 isn't the only option. Here's when you'd reach for something else.
| Tool | Language | Best For | Watch Out |
|---|---|---|---|
| k6 | JavaScript (Go engine) | Most teams. CI integration. Grafana ecosystem | No browser rendering. WebSocket support is basic |
| Locust | Python | Complex flows needing Python logic. Data science teams | Single-threaded per worker. Need distributed mode for high RPS |
| Gatling | Scala/Java | JVM shops. Beautiful HTML reports | Steep learning curve. JVM memory overhead |
| wrk | Lua | Raw HTTP benchmarking. Maximum RPS from one machine | No complex scenarios. No built-in reporting |
| hey | Go | Quick one-liner tests ("is this endpoint fast?") | Very basic. No scenarios, no scripting |
| Apache Bench (ab) | C | Installed everywhere. Quick sanity check | Single-threaded. No modern features (no HTTP/2) |
| Vegeta | Go | Constant-rate HTTP attacks. Unix pipeline friendly | Limited scenario support |
# hey: instant one-liner latency check
hey -n 1000 -c 50 https://api.example.com/health
# wrk: saturate an endpoint for 30 seconds
wrk -t10 -c200 -d30s https://api.example.com/products
# vegeta: constant rate, Unix-pipe style
echo "GET https://api.example.com/health" | \
vegeta attack -duration=60s -rate=100/s | \
vegeta report
Trivia: ApacheBench (
ab) has been bundled with the Apache HTTP Server since 1996 -- nearly 30 years old and still the first tool many reach for, because it's already installed on almost every Linux server. Locust's name references a swarm; Gatling is named after the Gatling gun. The naming convention in load testing trends toward violent metaphors: attack, blast, siege, wrk (pronounced "work," but suspiciously close to "wreck").
Exercises¶
Exercise 1: Calculate Before You Test (2 minutes)¶
Your API has these production metrics:
- 300 RPS average, 450 RPS peak
- 150ms average latency
- Database connection pool: 30
- Database query time: 25ms per request

Using Little's Law, answer:
1. How many concurrent requests at average load?
2. How many database connections needed at peak?
3. Will the connection pool hold at 3x peak traffic?
Answers
1. Concurrency at average: 300 x 0.15 = 45 concurrent requests
2. DB connections at peak: 450 x 0.025 = 11.25 -- pool of 30 is fine
3. At 3x peak (1,350 RPS): 1,350 x 0.025 = 33.75 DB connections needed. Pool of 30 is not enough. You'll see request queuing starting at 1,200 RPS (30 / 0.025).

Exercise 2: Spot the Coordinated Omission (5 minutes)¶
A colleague shows you this k6 test and says "we can handle 500 RPS, p99 is only 200ms":
import http from 'k6/http';
import { sleep } from 'k6';

export const options = { vus: 500, duration: '5m' };
export default function () {
http.get('https://api.example.com/products');
sleep(1);
}
What's wrong with their conclusion? Write a corrected version.
Answer
Three problems:

1. **Coordinated omission**: VU-based executor with `sleep(1)`. At best this sends 500 / (response_time + 1s) RPS. If response time is 200ms, that's ~417 RPS, not 500. If the server slows to 2s, throughput drops to ~167 RPS. The test backs off under load.
2. **Hardcoded data**: Every VU hits the same endpoint with no variation. Cache hits everywhere.
3. **No thresholds**: Checks aren't failing the test. A 100% error rate still exits 0.

Corrected:
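One possible corrected version (a sketch, run under k6 rather than Node; the rate and threshold values are illustrative):

```javascript
import http from 'k6/http';
import { check } from 'k6';

export const options = {
  scenarios: {
    load: {
      executor: 'constant-arrival-rate', // holds 500 RPS even if the server slows
      rate: 500,
      timeUnit: '1s',
      duration: '5m',
      preAllocatedVUs: 200,
      maxVUs: 1000, // dropped iterations here = the server can't keep up
    },
  },
  thresholds: {
    http_req_duration: ['p(99)<200'], // the claim being tested
    http_req_failed: ['rate<0.01'],   // errors now fail the run
  },
};

export default function () {
  const res = http.get('https://api.example.com/products');
  check(res, { 'status 200': (r) => r.status === 200 });
}
```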
During a stress test at 2,000 RPS you see: p50=80ms, p99=12,000ms, CPU 35%, memory 60%, DB CPU 25%, DB connections 48/50 active. What's the bottleneck? What do you fix?
Answer
**Bottleneck: Database connection pool.** 48/50 active connections while CPU is healthy everywhere. The p50/p99 gap means most requests get a connection immediately (80ms), but some wait in the queue (12s). Little's Law: 2,000 x 0.025s = 50 connections needed -- right at the edge with zero headroom. Increase to 75-100, but first verify the database can handle more concurrent queries.

Cheat Sheet¶
Little's Law Quick Reference¶
Concurrency = Throughput x Latency
Throughput = Concurrency / Latency
Latency = Concurrency / Throughput
Required connection pool = RPS x per-request-db-time
Max RPS before pool exhaustion = pool_size / per-request-db-time
k6 Executors¶
| Executor | Use When | Coordinated Omission? |
|---|---|---|
| `constant-vus` / `ramping-vus` | N concurrent sessions | Yes |
| `constant-arrival-rate` | Fixed RPS, real traffic | No |
| `ramping-arrival-rate` | Ramping RPS, real traffic | No |
Quick Commands¶
| What | Command |
|---|---|
| k6 basic run | k6 run script.js |
| k6 to Grafana | k6 run --out influxdb=http://localhost:8086/k6 script.js |
| k6 env vars | k6 run -e BASE_URL=https://staging.example.com script.js |
| hey quick check | hey -n 1000 -c 50 https://api.example.com/health |
| wrk saturate | wrk -t10 -c200 -d30s https://api.example.com/endpoint |
| Watch connections | watch -n 1 "ss -tn state established \| wc -l" |
| DB connections | psql -c "SELECT count(*), state FROM pg_stat_activity GROUP BY state;" |
Load Test Result Red Flags¶
| Signal | Likely Cause |
|---|---|
| p99 >> p50 (10x+ gap) | Connection pool exhaustion, GC pauses, cache misses |
| Throughput plateaus then drops | Server saturation -- check CPU, threads, connections |
| Error rate climbs gradually | Resource leak (connections, memory, file descriptors) |
| Latency spikes every N seconds | Garbage collection or background job interference |
| VUs hit maxVUs | Arrival rate can't be maintained -- server is too slow |
| `http_req_blocked` is high | Test runner connection pool exhaustion (not server issue) |
Takeaways¶
- **Little's Law is your capacity calculator.** Concurrency = Throughput x Latency. If you know any two, you know the third. Calculate required connection pools and thread counts before you test, not after the system crashes.
- **Always use arrival-rate executors for traffic simulation.** VU-based tests hide problems by backing off when the server slows down. This is coordinated omission, and it makes your results look 10-100x better than reality.
- **The percentile gap tells the story.** A small gap between p50 and p99 means consistent performance. A large gap means some users are having a terrible time while most are fine. Always report p50, p95, and p99 -- never just the average.
- **Correlate load test results with server metrics.** A load test without server-side dashboards is like running a stress test blindfolded. You'll know that something broke but not what. CPU, memory, connection pools, and database metrics must be visible during every test run.
- **Soak tests catch what stress tests miss.** A 10-minute burst won't reveal memory leaks, connection pool exhaustion, or slow resource accumulation. Run at expected load for 4-8 hours before any major launch.
- **Put a baseline test in CI.** You don't need a full stress test on every PR. A 2-minute constant-rate test with strict thresholds catches regressions before they reach production.
Related Lessons¶
- The Cascading Timeout -- what happens when the death spiral from Part 2 actually hits production
- Prometheus and the Art of Not Alerting -- setting up the dashboards you need during load tests
- The Mysterious Latency Spike -- diagnosing production latency issues after the load test is over
- Deploy a Web App From Nothing -- building the service that eventually needs load testing
- Kubernetes Services: How Traffic Finds Your Pod -- understanding the K8s networking layer that load tests exercise