Logs vs Metrics vs Traces¶
Mental model¶
Three lenses on the same system:
- Logs = diary entries. "At 14:03:07 user X got error Y."
- Metrics = dashboard gauges. "Request rate is 450/s, error rate is 2%."
- Traces = flight recorder. "This request took 340ms total: 120ms in auth, 200ms in DB, 20ms in serialization."
Each answers a different question. None replaces the others.
What it looks like¶
Three categories of "monitoring data" that people treat interchangeably.
What it really is¶
Three fundamentally different data structures optimized for different queries:
| Pillar | Shape | Answers |
|---|---|---|
| Logs | Timestamped text/JSON | What happened? |
| Metrics | Numeric time series | How much / how often? |
| Traces | Tree of spans | Where did time go? |
- Logs: high cardinality, high volume, expensive to store. Preserve full context of individual events.
- Metrics: low cardinality, cheap to store, aggregate well. Great for dashboards and alerting thresholds.
- Traces: request-scoped. Show causal parent-child relationships between services. Each unit of work is a span (one service call, one DB query, one cache lookup).
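The span-tree shape can be made concrete with a minimal sketch (hypothetical class, not any particular tracing library's API): each span records its own duration, and subtracting child durations shows where time was actually spent.

```python
# Minimal sketch of a trace as a tree of spans (hypothetical names,
# not a real tracing library's data model).
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str
    duration_ms: int                      # total time, including children
    children: list["Span"] = field(default_factory=list)

    def self_time_ms(self) -> int:
        # Time spent in this span itself, excluding child spans.
        return self.duration_ms - sum(c.duration_ms for c in self.children)

# The request from the mental model: 340ms total, 120ms auth, 200ms DB.
root = Span("request", 340, [
    Span("auth", 120),
    Span("db-query", 200),
])
print(root.self_time_ms())  # 20 -> the serialization time left over
```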
Why it seems confusing¶
All three are "telemetry." Tools like Grafana display all three. You can derive metrics from logs (count ERROR lines per minute). You can attach logs to trace spans. The boundaries feel soft because modern platforms blur them.
But the underlying data models are different, the storage costs are different, and the questions they answer are different.
What actually matters¶
When to use which:

- Alert on metrics (they're cheap and fast to query).
- Debug with logs (they have the detail).
- Diagnose latency with traces (they show the call tree).
Two key methodologies:

- USE (utilization, saturation, errors) — for resources like CPU, disk, network.
- RED (rate, errors, duration) — for request-driven services like APIs.
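A rough sketch of what RED means in practice, computed over an observation window of request records (the field names and the crude percentile are illustrative, not a real metrics library):

```python
# Sketch: deriving RED (rate, errors, duration) from one window of requests.
requests = [
    {"status": 200, "duration_ms": 45},
    {"status": 500, "duration_ms": 120},
    {"status": 200, "duration_ms": 60},
    {"status": 200, "duration_ms": 30},
]
window_s = 1  # length of the observation window, in seconds

rate = len(requests) / window_s                                      # R: requests/s
error_ratio = sum(r["status"] >= 500 for r in requests) / len(requests)  # E
durations = sorted(r["duration_ms"] for r in requests)
p50 = durations[len(durations) // 2]                                 # D: crude median

print(rate, error_ratio, p50)  # 4.0 0.25 60
```

A real system would compute these continuously over sliding windows and export proper percentiles (p95/p99), but the three questions are the same.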
Structured logs (JSON with consistent fields) bridge the gap toward metrics. You can filter, aggregate, and correlate them far more easily than plain text.
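That bridge is easy to see in a sketch: with one JSON object per line, deriving a metric like "errors per minute" is a filter plus a counter (field names assume the log schema shown in the examples below):

```python
# Sketch: deriving a metric from structured logs — error count per minute.
# Assumes one JSON object per line with "ts" (ISO 8601) and "level" fields.
import json
from collections import Counter

lines = [
    '{"ts": "2026-03-28T14:03:07Z", "level": "error", "msg": "payment failed"}',
    '{"ts": "2026-03-28T14:03:41Z", "level": "info",  "msg": "payment ok"}',
    '{"ts": "2026-03-28T14:04:02Z", "level": "error", "msg": "payment failed"}',
]

errors_per_minute = Counter()
for line in lines:
    event = json.loads(line)
    if event["level"] == "error":
        minute = event["ts"][:16]  # truncate to YYYY-MM-DDTHH:MM
        errors_per_minute[minute] += 1

print(dict(errors_per_minute))
# {'2026-03-28T14:03': 1, '2026-03-28T14:04': 1}
```

Doing the same over free-form text would need fragile regexes for every field you want to aggregate on.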
```mermaid
flowchart LR
SVC["Service"] --> L["Logs\n(what happened?)"]
SVC --> M["Metrics\n(how much/how often?)"]
SVC --> T["Traces\n(where did time go?)"]
L --> |"debug"| INV["Investigation"]
M --> |"alert"| ALERT["Alerting"]
T --> |"diagnose"| LAT["Latency Analysis"]
style L fill:#36f,color:#fff
style M fill:#5a5,color:#fff
style T fill:#f80,color:#fff
```
Common mistakes¶
- Alerting on log patterns instead of metrics. Log-based alerts are slow and expensive.
- Collecting traces for every request in production. Sample. 100% trace capture kills performance and storage.
- Having metrics but no logs — you know something broke, but you can't tell why.
- Treating traces as optional. Without them, debugging latency across microservices is guesswork.
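One common fix for the "trace everything" mistake is head-based sampling keyed on the trace ID: hashing the ID to a bucket means every service makes the same keep/drop decision for the same request, so sampled traces stay complete across service boundaries. A minimal sketch (the rate and function name are illustrative):

```python
# Sketch: deterministic head-based trace sampling. Hashing the trace ID
# gives every service the same keep/drop answer for the same request.
import hashlib

SAMPLE_RATE = 0.1  # keep roughly 10% of traces

def should_sample(trace_id: str) -> bool:
    # Map the trace ID to a stable number in [0, 1) and compare to the rate.
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < SAMPLE_RATE

# The same trace_id always yields the same decision, on every host.
print(should_sample("abc123") == should_sample("abc123"))  # True
```

Real collectors (e.g. OpenTelemetry samplers) offer this and more sophisticated tail-based strategies, but the consistency property is the key idea.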
Small examples¶
A structured log entry:

```json
{
  "ts": "2026-03-28T14:03:07Z",
  "level": "error",
  "msg": "payment failed",
  "user_id": "u-9182",
  "trace_id": "abc123",
  "duration_ms": 340
}
```
A Jaeger trace (simplified):

```
Trace abc123
  span: api-gateway    120ms
  span: auth-service    40ms
  span: payment-svc    200ms
    span: db-query     180ms
```
One-line summary¶
Logs record events, metrics track numbers over time, traces map request flow — use all three together.