
Portal | Level: L2: Operations | Topics: Prometheus, Grafana, Loki, Tempo | Domain: Observability

Observability - Skill Check

Mental model (bottom-up)

Observability is three signals: metrics, logs, and traces. You use them to answer: what broke, where, and why now? Prometheus stores time series, and every unique label set creates a new series that costs RAM and disk.

Visual stack

[SLOs/Alerts      ]  actionable thresholds + runbooks
|
[Dashboards       ]  human view of trends and correlations
|
[Storage/query    ]  Prometheus/Loki/Tracing backend
|
[Collection       ]  scrape/agent/OTel collectors
|
[Instrumentation  ]  code emits metrics/logs/spans
|
[Reality          ]  actual system behavior

Glossary

  • SLI/SLO - measurement vs objective target
  • golden signals - latency, traffic, errors, saturation
  • cardinality - number of unique label values/series
  • counter/gauge/histogram - count vs level vs distribution
  • trace/span - request path / unit of work in trace
  • burn rate - speed of consuming error budget

Common failure modes

  • High-cardinality labels (user_id, request_id) explode time series.
  • Alerts without runbooks become noise.
  • Dashboards without decisions are vanity art.
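The cardinality explosion above is multiplicative: series per metric is roughly the product of unique values across its labels. A minimal sketch (the label counts are made-up examples):

```python
from math import prod

def series_count(label_cardinalities):
    """Approximate series for one metric: product of unique values per label."""
    return prod(label_cardinalities)

# method (5) x status (8) x path (50): bounded, manageable
bounded = series_count([5, 8, 50])            # 2,000 series

# add a user_id label with 100k users: unbounded blow-up
exploded = series_count([5, 8, 50, 100_000])  # 200,000,000 series
```

One unbounded label dominates everything else, which is why IDs belong in logs or traces, not metric labels.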

Roadmap core (10, easy -> hard)

  • Monitoring vs logging vs tracing?
  • Metrics for trends, logs for events, traces for request flow.
  • Golden signals?
  • Latency, traffic, errors, saturation.
  • What is a time series metric?
  • Value indexed by time + labels.
  • Counter vs gauge vs histogram?
  • Monotonic count vs current value vs distribution.
  • Why can labels hurt you?
  • High cardinality explodes storage/query cost.
  • Alerting goal?
  • Actionable signals, not noise.
  • SLI vs SLO?
  • Indicator measurement vs target objective.
  • What is "burn rate"?
  • Speed of consuming error budget; guides urgency.
  • Dashboards: what's the trap?
  • Vanity charts without decision value; no link to alerts/SLOs.
  • Root cause workflow (tight)?
  • Correlate alerts -> deploys -> metrics/logs/traces -> mitigate -> postmortem.
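The counter vs gauge vs histogram distinction above can be sketched with minimal stdlib classes (illustrative only, not the prometheus_client API):

```python
class Counter:
    """Monotonic: only increases; Prometheus derives throughput via rate()."""
    def __init__(self):
        self.value = 0.0
    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters never go down")
        self.value += amount

class Gauge:
    """Current level: can go up or down (queue depth, temperature)."""
    def __init__(self):
        self.value = 0.0
    def set(self, value):
        self.value = value

class Histogram:
    """Distribution: cumulative observation counts per upper bound."""
    def __init__(self, bounds):
        self.bounds = sorted(bounds)
        self.counts = [0] * len(self.bounds)
        self.total = 0
    def observe(self, value):
        self.total += 1
        for i, bound in enumerate(self.bounds):
            if value <= bound:
                self.counts[i] += 1  # cumulative: all buckets >= value count it
```

The cumulative buckets are what make Prometheus histograms cheap to aggregate and quantile-estimate later.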

Prometheus / metrics (easy -> hard)

  • Scrape (pull) model means?
  • Prometheus polls targets; targets expose /metrics.
  • Why does label cardinality matter?
  • Each unique label set = new time series; memory/storage explodes.
  • rate() is for what?
  • Per-second rate of counters over a window.
  • Histogram buckets used for?
  • Latency distributions and quantile estimates (via histogram_quantile()).
  • Common metrics anti-pattern?
  • Labels like user_id/request_id creating unbounded series.
  • How to debug missing metrics?
  • Check target up, scrape config, auth/TLS, endpoint health.
  • Recording rules benefit?
  • Precompute expensive queries; consistent dashboards/alerts.
  • Alert rules best practice?
  • Tie to symptoms + runbook; avoid noisy thresholds.
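How quantiles come out of histogram buckets can be sketched with the same linear interpolation PromQL's histogram_quantile() uses (cumulative buckets assumed; the latency data is made up):

```python
def estimate_quantile(q, buckets):
    """buckets: sorted (upper_bound, cumulative_count) pairs, like _bucket series.
    Estimates the q-quantile by interpolating inside the bucket it falls in."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if count == prev_count:  # empty bucket: nothing to interpolate
                return bound
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# 100 requests: 50 under 100ms, 90 under 500ms, all under 1s
latency = [(0.1, 50), (0.5, 90), (1.0, 100)]
p90 = estimate_quantile(0.9, latency)  # interpolated within the 0.1-0.5s bucket
```

This is why bucket boundaries matter: the estimate can never be more precise than the bucket the quantile lands in.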

Logging (easy -> hard)

  • Why structured logs?
  • Queryable fields; easier correlation than free text.
  • What is log sampling?
  • Reduce volume while keeping signal (esp. noisy paths).
  • What's the "secret in logs" failure mode?
  • Tokens/passwords printed; requires redaction policy.
  • Centralized logging benefits?
  • Single search place; retention; correlation across services.
  • "Logs are down" fallback?
  • Node/journal logs, sidecars, direct pod logs, backups.
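Structured logs and the secret-redaction policy above, as a minimal stdlib sketch (the field and key names are illustrative):

```python
import json

SENSITIVE_KEYS = {"password", "token", "secret", "authorization"}

def log_event(event, **fields):
    """Emit one JSON log line; redact known-sensitive fields before writing."""
    safe = {k: ("[REDACTED]" if k.lower() in SENSITIVE_KEYS else v)
            for k, v in fields.items()}
    line = json.dumps({"event": event, **safe}, sort_keys=True)
    print(line)
    return line

log_event("login_failed", user="alice", password="hunter2", attempts=3)
```

Because every line is JSON with stable keys, a backend like Loki can filter on `event` or `user` instead of grepping free text, and secrets never reach storage.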

Tracing (easy -> hard)

  • What is a trace?
  • End-to-end request path across services.
  • What is a span?
  • Timed unit of work within a trace (nested).
  • What is context propagation?
  • Passing trace IDs through headers across services.
  • Why does tracing help?
  • Pinpoints latency source and dependency bottlenecks.
  • Common tracing footgun?
  • Sampling too much or too little; missing key spans.
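Context propagation can be sketched with the W3C Trace Context `traceparent` header (version-traceid-spanid-flags), which OpenTelemetry propagates by default; the helper names are illustrative:

```python
import secrets

def start_trace():
    """Mint a root traceparent: 00-<128-bit trace id>-<64-bit span id>-01."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"

def child_headers(incoming):
    """Keep the caller's trace id; mint a fresh span id for this hop."""
    tp = incoming.get("traceparent")
    if tp is None:
        return {"traceparent": start_trace()}
    version, trace_id, _parent_span, flags = tp.split("-")
    return {"traceparent": f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"}

svc_a = {"traceparent": start_trace()}
svc_b = child_headers(svc_a)  # same trace id, new span id
```

The shared trace id is what lets a backend like Tempo stitch spans from different services into one request path.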

SLOs & incident response (easy -> hard)

  • Why SLOs beat "99.9 because it sounds nice"?
  • They align reliability with user impact and cost.
  • Error budget concept?
  • Allowed unreliability; spend guides feature velocity vs stability.
  • Burn rate alerting?
  • Alerts on budget consumption speed, not arbitrary metrics.
  • Incident command basics?
  • Single owner, clear comms, time-boxed mitigation first.
  • Postmortem that matters includes?
  • Root cause, contributing factors, and concrete prevention tasks.
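Burn rate and budget exhaustion from the questions above, as plain arithmetic (the 30-day window and 99.9% target are assumed examples):

```python
def burn_rate(error_ratio, slo_target):
    """1.0 = budget lasts exactly the window; 10.0 = gone in a tenth of it."""
    error_budget = 1.0 - slo_target
    return error_ratio / error_budget

def hours_until_exhausted(rate, window_hours=30 * 24):
    """How long the error budget survives at the current burn rate."""
    return window_hours / rate

# 1% of requests failing against a 99.9% SLO: burning ~10x too fast
rate = burn_rate(0.01, 0.999)        # ~10.0
hours = hours_until_exhausted(rate)  # ~72 hours, not 30 days
```

This is why burn-rate alerts encode urgency directly: a 10x burn needs paging now, while a 1.5x burn can wait for business hours.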

Key correctness notes

  • Each unique labelset creates a new time series with RAM/CPU/disk/network cost; keep label cardinality low.
  • Keep cardinality low by avoiding high-cardinality labels (user_id, request_id, IP); aim for fewer than 100 unique values per label dimension.

Sources

  • Prometheus official docs (label cardinality guidance), SRE literature (golden signals, SLOs).
  • https://prometheus.io/docs/practices/naming/
