Logs vs Metrics vs Traces¶
Mental model¶
Three lenses on the same system:
- Logs = diary entries. "At 14:03:07 user X got error Y."
- Metrics = dashboard gauges. "Request rate is 450/s, error rate is 2%."
- Traces = flight recorder. "This request took 340ms total: 120ms in auth, 200ms in DB, 20ms in serialization."
Each answers a different question. None replaces the others.
What it looks like¶
Three categories of "monitoring data" that people treat interchangeably.
What it really is¶
Three fundamentally different data structures optimized for different queries:
| Pillar | Shape | Answers |
|---|---|---|
| Logs | Timestamped text/JSON | What happened? |
| Metrics | Numeric time series | How much / how often? |
| Traces | Tree of spans | Where did time go? |
- Logs: high cardinality, high volume, expensive to store. Preserve full context of individual events.
- Metrics: low cardinality, cheap to store, aggregate well. Great for dashboards and alerting thresholds.
- Traces: request-scoped. Show causal parent-child relationships between services. Each unit of work is a span (one service call, one DB query, one cache lookup).
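The span-tree shape can be made concrete with a minimal sketch (hypothetical class, not any particular tracing library's API): each span records its own duration, and subtracting child durations shows where time was actually spent.

```python
# Minimal sketch of a trace as a tree of spans (hypothetical names,
# not a real tracing library's data model).
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str
    duration_ms: int                      # total time, including children
    children: list["Span"] = field(default_factory=list)

    def self_time_ms(self) -> int:
        # Time spent in this span itself, excluding child spans.
        return self.duration_ms - sum(c.duration_ms for c in self.children)

# The request from the mental model: 340ms total, 120ms auth, 200ms DB.
root = Span("request", 340, [
    Span("auth", 120),
    Span("db-query", 200),
])
print(root.self_time_ms())  # 20 -> the serialization time left over
```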
Why it seems confusing¶
All three are "telemetry." Tools like Grafana display all three. You can derive metrics from logs (count ERROR lines per minute). You can attach logs to trace spans. The boundaries feel soft because modern platforms blur them.
But the underlying data models are different, the storage costs are different, and the questions they answer are different.
What actually matters¶
When to use which:

- Alert on metrics (they're cheap and fast to query).
- Debug with logs (they have the detail).
- Diagnose latency with traces (they show the call tree).
Two key methodologies:

- USE (utilization, saturation, errors) — for resources like CPU, disk, network.
- RED (rate, errors, duration) — for request-driven services like APIs.
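A rough sketch of what RED means in practice, computed over an observation window of request records (the field names and the crude percentile are illustrative, not a real metrics library):

```python
# Sketch: deriving RED (rate, errors, duration) from one window of requests.
requests = [
    {"status": 200, "duration_ms": 45},
    {"status": 500, "duration_ms": 120},
    {"status": 200, "duration_ms": 60},
    {"status": 200, "duration_ms": 30},
]
window_s = 1  # length of the observation window, in seconds

rate = len(requests) / window_s                                      # R: requests/s
error_ratio = sum(r["status"] >= 500 for r in requests) / len(requests)  # E
durations = sorted(r["duration_ms"] for r in requests)
p50 = durations[len(durations) // 2]                                 # D: crude median

print(rate, error_ratio, p50)  # 4.0 0.25 60
```

A real system would compute these continuously over sliding windows and export proper percentiles (p95/p99), but the three questions are the same.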
Structured logs (JSON with consistent fields) bridge the gap toward metrics. You can filter, aggregate, and correlate them far more easily than plain text.
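That bridge is easy to see in a sketch: with one JSON object per line, deriving a metric like "errors per minute" is a filter plus a counter (field names assume the log schema shown in the examples below):

```python
# Sketch: deriving a metric from structured logs — error count per minute.
# Assumes one JSON object per line with "ts" (ISO 8601) and "level" fields.
import json
from collections import Counter

lines = [
    '{"ts": "2026-03-28T14:03:07Z", "level": "error", "msg": "payment failed"}',
    '{"ts": "2026-03-28T14:03:41Z", "level": "info",  "msg": "payment ok"}',
    '{"ts": "2026-03-28T14:04:02Z", "level": "error", "msg": "payment failed"}',
]

errors_per_minute = Counter()
for line in lines:
    event = json.loads(line)
    if event["level"] == "error":
        minute = event["ts"][:16]  # truncate to YYYY-MM-DDTHH:MM
        errors_per_minute[minute] += 1

print(dict(errors_per_minute))
# {'2026-03-28T14:03': 1, '2026-03-28T14:04': 1}
```

Doing the same over free-form text would need fragile regexes for every field you want to aggregate on.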
```mermaid
flowchart LR
SVC["Service"] --> L["Logs\n(what happened?)"]
SVC --> M["Metrics\n(how much/how often?)"]
SVC --> T["Traces\n(where did time go?)"]
L --> |"debug"| INV["Investigation"]
M --> |"alert"| ALERT["Alerting"]
T --> |"diagnose"| LAT["Latency Analysis"]
style L fill:#36f,color:#fff
style M fill:#5a5,color:#fff
style T fill:#f80,color:#fff
```
Common mistakes¶
- Alerting on log patterns instead of metrics. Log-based alerts are slow and expensive.
- Collecting traces for every request in production. Sample. 100% trace capture kills performance and storage.
- Having metrics but no logs — you know something broke, but you can't tell why.
- Treating traces as optional. Without them, debugging latency across microservices is guesswork.
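One common fix for the "trace everything" mistake is head-based sampling keyed on the trace ID: hashing the ID to a bucket means every service makes the same keep/drop decision for the same request, so sampled traces stay complete across service boundaries. A minimal sketch (the rate and function name are illustrative):

```python
# Sketch: deterministic head-based trace sampling. Hashing the trace ID
# gives every service the same keep/drop answer for the same request.
import hashlib

SAMPLE_RATE = 0.1  # keep roughly 10% of traces

def should_sample(trace_id: str) -> bool:
    # Map the trace ID to a stable number in [0, 1) and compare to the rate.
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < SAMPLE_RATE

# The same trace_id always yields the same decision, on every host.
print(should_sample("abc123") == should_sample("abc123"))  # True
```

Real collectors (e.g. OpenTelemetry samplers) offer this and more sophisticated tail-based strategies, but the consistency property is the key idea.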
Small examples¶
A structured log entry:

```json
{
  "ts": "2026-03-28T14:03:07Z",
  "level": "error",
  "msg": "payment failed",
  "user_id": "u-9182",
  "trace_id": "abc123",
  "duration_ms": 340
}
```
A Jaeger trace (simplified):

```
Trace abc123
  span: api-gateway    120ms
  span: auth-service    40ms
  span: payment-svc    200ms
    span: db-query     180ms
```
One-line summary¶
Logs record events, metrics track numbers over time, traces map request flow — use all three together.