
Mental Model: RED Method

Category: Debugging & Diagnosis
Origin: Tom Wilkie (Grafana Labs, formerly Weaveworks), ~2015
One-liner: For every service, monitor Rate (requests/sec), Errors (failed requests), and Duration (latency distribution) — these three signals expose nearly all service-level failures.

The Model

The RED Method is a service-oriented counterpart to Brendan Gregg's USE Method. Where USE interrogates physical and virtual resources (CPU, memory, disk, NIC), RED interrogates request-processing behavior. A service can be failing its users even when every underlying resource is healthy — RED is how you catch that.

The three signals:

Rate is the volume of requests the service is processing per unit time. Abnormal rate — either a dramatic spike or a sudden collapse — is almost always meaningful. A rate collapse often means upstream services are failing to reach you, or your own service crashed and clients are backing off. A rate spike may be the cause of the other two signals climbing.

Errors is the count (or rate) of failed requests: explicit HTTP 5xx responses, gRPC non-OK status codes, database query errors, and application-level errors that return HTTP 200 with an error payload — the last of which you must instrument explicitly.

Duration is the latency distribution of requests, typically expressed as percentiles: p50, p95, p99, p99.9. The p99 and p99.9 matter as much as p50 because high-percentile latency is what users experience under load, and it often reveals the worst-case behavior that causes cascading failures.
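As a concrete sketch, all three signals can be computed from one window of request records. This is illustrative plain Python (the Request shape and function name are invented for the example), not a production metrics pipeline:

```python
from dataclasses import dataclass
from statistics import quantiles

@dataclass
class Request:
    ok: bool            # did the request succeed?
    duration_ms: float  # time taken to serve it

def red_signals(window: list[Request], window_seconds: float) -> dict:
    """Compute Rate, Errors, and Duration percentiles for one window."""
    total = len(window)
    failed = sum(1 for r in window if not r.ok)
    durations = [r.duration_ms for r in window]
    # quantiles(n=100) yields 99 cut points; index 49 ~ p50, 94 ~ p95, 98 ~ p99
    q = quantiles(durations, n=100)
    return {
        "rate_rps": total / window_seconds,
        "error_ratio": failed / total,
        "p50_ms": q[49],
        "p95_ms": q[94],
        "p99_ms": q[98],
    }
```

In a real service this aggregation lives in the metrics client (a Prometheus counter plus histogram), not in application code; the sketch only shows what the three numbers mean.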

The critical insight of RED is its emphasis on the full distribution for Duration. A service with p50 latency of 20ms but p99 latency of 4 seconds is not a healthy service — the average hides the tail. SLOs are typically written against percentiles, and SLOs are what connect your monitoring to user impact. Always look at the distribution, not the average.
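A short demonstration of why the average misleads (plain Python, standard library only):

```python
from statistics import mean, quantiles

# 99 fast requests plus one pathological 4-second outlier
latencies_ms = [20.0] * 99 + [4000.0]

print(round(mean(latencies_ms), 1))   # 59.8 -- the average "looks fine"
cuts = quantiles(latencies_ms, n=100) # 99 percentile cut points
print(cuts[49])                       # p50 = 20.0 -- most users are fine
print(round(cuts[98], 1))             # p99 = 3960.2 -- the tail is catastrophic
```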

RED applies cleanly to any system that processes discrete requests: HTTP microservices, gRPC services, message queue consumers (rate = messages/sec consumed, errors = failed processing, duration = processing time per message), Kubernetes controllers, and database query handlers. It does not apply to batch jobs with no concept of a discrete "request" — for those, adapt to throughput, failure rate, and job duration.

The method's boundary condition: RED tells you that something is wrong at the service layer and approximately where, but not why at the infrastructure level. If RED shows high Duration, you then pivot to USE to check whether any underlying resource is the cause, or use distributed tracing to identify which downstream call is contributing the latency.

Visual

                        RED Method Signal Hierarchy
┌──────────────────────────────────────────────────────────────────┐
│                                                                  │
│  Rate ───────── requests/second arriving at the service         │
│      Normal: stable or gradual change                           │
│      Alert:  sudden drop (upstream failure, crash)              │
│              sudden spike (traffic anomaly, retry storm)         │
│                                                                  │
│  Errors ─────── failed requests / total requests (%)            │
│      Normal: near 0% (application-dependent)                    │
│      Alert:  any sustained rise above SLO error budget          │
│                                                                  │
│  Duration ───── latency distribution (percentiles)              │
│      ┌────────────────────────────────────────────────────────┐  │
│      │  p50  ██░░░░░░░░░░░  20ms  (most users see this)       │  │
│      │  p95  ████████░░░░░  180ms (1 in 20 requests)          │  │
│      │  p99  ██████████░░░  950ms (1 in 100 requests)  ⚠     │  │
│      │  p999 ████████████░  4200ms (1 in 1000)         ⚠⚠    │  │
│      └────────────────────────────────────────────────────────┘  │
│      Alert: p99/p999 diverges from p50 = tail latency problem   │
│                                                                  │
│  Triage flow:                                                    │
│  Rate collapsed? → Is the service alive? Did upstreams fail?    │
│  Errors high?    → What error type? Which endpoints?            │
│  Duration high?  → Which percentile? Which downstream call?     │
└──────────────────────────────────────────────────────────────────┘

Prometheus Implementation Reference

The RED Method maps directly to Prometheus metric patterns. A correctly instrumented service should expose the following:

Rate — typically derived from a counter:

rate(http_requests_total[5m])
# or for gRPC:
rate(grpc_server_started_total[5m])

Errors — ratio of failed to total requests:

rate(http_requests_total{status=~"5.."}[5m])
  /
rate(http_requests_total[5m])
# Note: always use ratio, not raw count, for alerting

Duration — from a histogram, use percentile functions:

histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))

A Grafana dashboard panel showing all three RED signals for a service is the minimum viable observability. If a service exposes only a health check endpoint and no RED metrics, it is unobservable in production.

Alert thresholds:

  • Rate: alert on rate < N (service appears to have stopped) or rate > 3x baseline (traffic anomaly)
  • Errors: alert when error ratio exceeds SLO error budget burn rate
  • Duration: alert when p99 exceeds SLO latency target; also alert when p99/p50 ratio exceeds ~10x (tail latency divergence)
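The thresholds above can be expressed as one evaluation function. This is a sketch with invented names; the floor N, the 3x multiplier, and the ~10x ratio are illustrative defaults that should come from your own baselines and SLOs:

```python
def red_alerts(rate_rps, baseline_rps, error_ratio, slo_error_ratio,
               p50_ms, p99_ms, slo_p99_ms, min_rate_rps=1.0):
    """Evaluate the three RED alert conditions for one service."""
    alerts = []
    if rate_rps < min_rate_rps:
        alerts.append("rate-collapse")       # service appears stopped
    elif rate_rps > 3 * baseline_rps:
        alerts.append("rate-spike")          # traffic anomaly / retry storm
    if error_ratio > slo_error_ratio:        # burn rate > 1
        alerts.append("error-budget-burn")
    if p99_ms > slo_p99_ms:
        alerts.append("latency-slo-breach")
    if p50_ms > 0 and p99_ms / p50_ms > 10:
        alerts.append("tail-divergence")     # p99 diverging from p50
    return alerts
```

In practice these conditions live in Prometheus alerting rules, not application code; the function just makes the logic explicit.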

Interpreting Signal Combinations

Different combinations of the three RED signals point to different root causes:

Rate     Errors   Duration   Likely diagnosis
Normal   High     Normal     Bug in specific code path; error is fast, not slow
Normal   Low      High       Downstream dependency slow; upstream circuit breaker not yet tripped
High     High     High       Overload — service is receiving more traffic than it can handle
Low      High     Any        Upstream is failing to reach the service; or service is crashing fast
High     Normal   Normal     Traffic spike absorbed without degradation — capacity is sufficient
Normal   Normal   High       Single slow downstream call or database query in the critical path

These patterns are guides, not rules. Always check the time series — the order in which signals changed matters as much as their current values.
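The table encodes naturally as a lookup with a wildcard row. A sketch (function and key names invented); the fallback branch enforces the "guides, not rules" caveat:

```python
# Keys are (Rate, Errors, Duration) states; "any" is a wildcard for Duration.
PATTERNS = {
    ("normal", "high",   "normal"): "overload? no: bug in a specific code path; errors are fast",
    ("normal", "low",    "high"):   "downstream dependency slow; circuit breaker not yet tripped",
    ("high",   "high",   "high"):   "overload: more traffic than the service can handle",
    ("low",    "high",   "any"):    "upstream failing to reach the service, or fast crash loop",
    ("high",   "normal", "normal"): "traffic spike absorbed; capacity is sufficient",
    ("normal", "normal", "high"):   "single slow downstream call or query in the critical path",
}

def diagnose(rate: str, errors: str, duration: str) -> str:
    """First-pass diagnosis; fall back to manual time-series inspection."""
    return (PATTERNS.get((rate, errors, duration))
            or PATTERNS.get((rate, errors, "any"))
            or "no canned pattern: inspect the time series directly")
```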

When to Reach for This

  • A service is degraded and you need to immediately characterize how it is failing before digging into root cause
  • Building dashboards and alerting for a new microservice from scratch — RED gives you the minimum viable observability set
  • SLO burn rate is elevated and you need to correlate which signal is consuming the error budget
  • During an incident involving multiple services — apply RED to each service in the call graph to identify where the degradation originates
  • When reviewing whether a service is ready to receive production traffic after a deployment
  • Performance regression testing: baseline the RED metrics before a deploy and compare after

When NOT to Use This

  • For infrastructure-layer problems (node pressure, disk saturation, memory exhaustion) — use USE Method instead; RED will show symptoms but won't identify the resource culprit
  • For batch jobs, cron tasks, or streaming pipelines with no concept of discrete requests — the metrics exist but the interpretation differs; adapt carefully
  • As a substitute for distributed tracing: RED identifies which service has the duration problem, but tracing tells you which span within it is slow
  • When debugging data correctness issues (wrong results returned to clients) — a request that returns an incorrect result counts as a success to RED; you need semantic health checks for this

Applied Examples

Example 1: Kubernetes DNS resolution degrading — CoreDNS service

An alert fires: pod-to-pod communication latency is elevated across the cluster. Multiple teams report their services are slow. Where do you start?

Apply RED to CoreDNS, which underpins all service discovery in Kubernetes.

Rate: coredns_dns_requests_total rate shows 14,000 req/s — a 3x spike from the usual 4,500. Something is generating a DNS query storm.

Errors: coredns_dns_responses_total{rcode="SERVFAIL"} is at 8% of requests. Normal is <0.1%. CoreDNS is failing to resolve a significant fraction of queries.

Duration: coredns_dns_request_duration_seconds p99 has climbed from 2ms to 890ms. The p50 is 40ms (normal is <1ms).

Interpretation: All three RED signals are elevated simultaneously. The rate spike arrived first (visible in the time series), then errors and duration followed. The query storm is exhausting CoreDNS workers. Root cause investigation: which workloads are generating the extra queries? (kubectl top pods across namespaces, then application logs). Resolution: identify the misconfigured application generating excessive DNS lookups, patch it, and potentially add CoreDNS autoscaling.

Example 2: Payment service latency spike — gRPC microservice

An e-commerce checkout flow is showing elevated cart abandonment. The payment service is on the critical path.

Apply RED to the payment gRPC service.

Rate: 420 RPC/s, consistent with traffic patterns. Rate is normal — no traffic anomaly.

Errors: grpc_server_handled_total{grpc_code="Unavailable"} is 0.3%. Slightly elevated from baseline 0.05%, but within SLO bounds.

Duration: p50 is 95ms (normal ~40ms), p95 is 2.1 seconds (normal ~120ms), p99 is 8.4 seconds. The tail latency is catastrophic while the median is only mildly elevated.

Interpretation: Normal rate, near-normal errors, severe tail latency. The error rate being in-SLO is misleading — the p99 duration will cause client timeouts that eventually manifest as errors at the caller. The tail latency pattern (p50 mildly elevated, p99 severely elevated) suggests a downstream dependency with intermittent slowness: a database connection pool exhausted for a subset of requests, or a third-party payment gateway with slow responses on some transaction types. Next step: distributed traces on the slowest 1% of requests to identify the slow span.

The Junior vs Senior Gap

Junior: Checks "is the service up?" (binary health check)
Senior: Immediately pulls Rate, Error rate, and Duration percentiles as a unit

Junior: Looks at average latency and declares "latency looks fine"
Senior: Looks at p99 and p99.9 — knows averages mask tail behavior

Junior: Treats HTTP 200 as a success without verifying the response body
Senior: Instruments application-level errors separately from HTTP status errors

Junior: Investigates the service that generated the alert without checking its upstream and downstream neighbors
Senior: Applies RED to the entire call graph to find where degradation originates

Junior: Misses a Rate collapse as a signal (assumes "quiet is good")
Senior: Recognizes a rate drop as an alert-worthy signal indicating upstream failure or service crash

Junior: Builds dashboards with one latency graph (average or p50)
Senior: Builds dashboards with p50, p95, p99, and p99.9 on the same panel

Applying RED Across the Call Graph

In a microservices architecture, a single user-facing request touches many services. When a user-facing SLO breaches, the root cause is rarely the first service in the call chain. Apply RED to each service in the dependency graph to isolate where the degradation originates.

The technique:

1. Start at the user-facing service (highest in the call graph). Check its RED metrics.
2. If Duration is high, check the downstream services it calls.
3. If those show normal Duration, the bottleneck is in the calling service itself. Drill into traces.
4. If a downstream service shows high Duration, recurse: check its downstream dependencies.
5. If a service shows Rate collapse while its upstream shows normal Rate, the collapsed service has crashed or is unreachable — the upstream's error rate will confirm this.

This traversal is efficient because each step either confirms the location or eliminates a layer. In a system with 6 services in the call chain, you can isolate the root service in 3 steps on average.
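The traversal can be sketched as a recursive walk over a toy call-graph structure. The dict shape here is invented for illustration; in practice each node's p99 comes from that service's Duration metric:

```python
def find_bottleneck(service: dict, p99_slo_ms: float):
    """Follow high Duration down the call graph.

    `service` is assumed to look like:
      {"name": "checkout", "p99_ms": 900, "downstream": [ ...same shape... ]}
    Returns the name of the deepest service breaching the latency SLO,
    or None if this subtree is healthy.
    """
    if service["p99_ms"] <= p99_slo_ms:
        return None  # healthy: nothing below here explains the slowness
    # If any downstream is also slow, the problem is deeper: recurse.
    for dep in service.get("downstream", []):
        culprit = find_bottleneck(dep, p99_slo_ms)
        if culprit is not None:
            return culprit
    # All downstreams are healthy, so the latency originates here.
    return service["name"]
```

Each recursive step either confirms the location or eliminates a layer, which is exactly why the manual version of this traversal converges quickly.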

Service mesh integration: If you run Istio, Linkerd, or similar, the service mesh automatically instruments all service-to-service calls with RED metrics at the proxy layer — you get RED signals for free for every service without application instrumentation. The tradeoff is that application-level errors (HTTP 200 with error body) are invisible to the mesh; application instrumentation is still needed to catch those.

RED as an SLO Foundation

Service Level Objectives are almost always written in terms of RED metrics:

  • Availability SLO: "99.9% of requests succeed" → Error rate SLO: errors/rate < 0.001
  • Latency SLO: "95% of requests complete in under 200ms" → Duration SLO: p95 < 200ms
  • Throughput SLO: "Service handles at least 500 req/s" → Rate floor: rate ≥ 500

Because SLOs are defined in RED terms, RED dashboards and alerts directly measure whether the SLO is being met — there is no translation layer. When a RED alert fires, it is by definition an SLO violation or a burn-rate warning.

Error budget burn rate is a derived metric from the Error signal: if your monthly error budget is 0.1% (a 99.9% availability SLO; 0.1% of a 720-hour month is 43.2 minutes of allowed downtime), and your current error rate is 2%, you are burning your budget 20x faster than the SLO allows. The burn rate formula:

burn_rate = current_error_rate / SLO_error_rate_threshold

When burn rate > 1, you are consuming error budget faster than it replenishes. A sustained burn rate of 14.4 exhausts the monthly budget in 50 hours (720 / 14.4) and consumes 2% of it every hour; this is a common threshold for a high-urgency page.
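The arithmetic, spelled out in Python with the numbers from the example above:

```python
SLO_ERROR_BUDGET = 0.001      # 99.9% availability SLO -> 0.1% error budget
HOURS_IN_MONTH = 30 * 24      # 720 hours

current_error_rate = 0.02     # 2% of requests failing right now
burn_rate = current_error_rate / SLO_ERROR_BUDGET

# At burn rate B, the monthly budget lasts 720 / B hours,
# and B / 720 of the budget is consumed every hour.
hours_to_exhaustion = HOURS_IN_MONTH / burn_rate
print(round(burn_rate))                         # 20
print(round(hours_to_exhaustion))               # 36 hours at this rate
print(round(14.4 / HOURS_IN_MONTH, 3))          # 0.02 -> 2% of budget per hour
```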

Common Pitfalls

Alerting on Rate alone without context. A 50% drop in request rate could be a catastrophic failure or normal overnight traffic decline. Rate alerts must be conditioned on time-of-day baselines or compared against a rolling window.

Treating all errors equally. HTTP 400 errors from clients sending bad requests are different from HTTP 500 errors caused by service failures. Define your error SLO clearly — typically exclude client errors from the service error budget, or handle them separately with their own alert.

Missing application-level errors. A service that returns {"status": "error", "code": "PAYMENT_DECLINED"} with HTTP 200 is invisible to RED if you only instrument HTTP status codes. Wrap application outcomes in a counter with success/failure labels.
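A sketch of that instrumentation, with a plain Counter standing in for a real metrics client such as prometheus_client (all names here are invented for the example):

```python
from collections import Counter

# Outcome counter labelled by result: this, not the HTTP status code
# alone, is what should feed the RED "Errors" signal.
request_outcomes = Counter()

def record_outcome(http_status: int, body: dict) -> None:
    """Classify a response; HTTP 200 with an error payload is a failure."""
    if http_status >= 500:
        request_outcomes["failure"] += 1
    elif body.get("status") == "error":
        request_outcomes["failure"] += 1  # invisible to status-code-only metrics
    else:
        request_outcomes["success"] += 1
```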

Ignoring tail behavior under load. p50 latency during off-peak hours tells you nothing about behavior at peak. Verify p99 under realistic peak traffic — a service with p99 = 50ms at 10% load and p99 = 4 seconds at peak is failing its SLO when it matters most.

Connections

  • Complements: USE Method (RED for service-layer health, USE for infrastructure-layer health — during incidents, run both; RED pinpoints which service is failing, USE identifies which resource is the cause)
  • Complements: Differential Diagnosis (RED narrows the investigation to a service and a signal type; Differential Diagnosis structures the next step of eliminating hypotheses about why)
  • Tensions: Correlation vs Causation (a Rate spike preceding an Error rate rise looks causal, but both may be driven by a third factor — an upstream timeout storm causing retries; verify before assuming Rate caused Errors)
  • Topic Packs: observability, prometheus
  • Case Studies: coredns-timeout-pod-dns (RED reveals the DNS error rate and duration degradation that explains cluster-wide slowness), dns-resolution-slow (RED applied to the DNS resolver layer isolates slow upstream forwarder as duration culprit)