Portal | Level: L2: Operations | Topics: Prometheus, Grafana, Loki, Tempo | Domain: Observability
Observability - Skill Check¶
Mental model (bottom-up)¶
Observability is signals: metrics, logs, traces. You use them to answer: What broke? Where? Why now? Prometheus stores time series: every unique labelset = new series (costs RAM/disk).
Visual stack¶
[SLOs/Alerts ] actionable thresholds + runbooks
|
[Dashboards ] human view of trends and correlations
|
[Storage/query ] Prometheus/Loki/Tracing backend
|
[Collection ] scrape/agent/OTel collectors
|
[Instrumentation ] code emits metrics/logs/spans
|
[Reality ] actual system behavior
Glossary¶
- SLI/SLO - measurement vs objective target
- golden signals - latency, traffic, errors, saturation
- cardinality - number of unique label values/series
- counter/gauge/histogram - count vs level vs distribution
- trace/span - request path / unit of work in trace
- burn rate - speed of consuming error budget
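The counter/gauge/histogram distinction in the glossary can be made concrete with a minimal stdlib sketch. This is conceptual only, not the real prometheus_client API: a counter is monotonic, a gauge moves in either direction, and a histogram records observations into cumulative buckets.

```python
# Conceptual sketch of the three metric types (NOT the prometheus_client
# API): counter is monotonic, gauge goes up and down, histogram counts
# observations into cumulative (<=) buckets.

class Counter:
    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters only go up")
        self.value += amount


class Gauge:
    def __init__(self):
        self.value = 0.0

    def set(self, value):  # gauges can move in either direction
        self.value = value


class Histogram:
    def __init__(self, buckets=(0.1, 0.5, 1.0, float("inf"))):
        self.buckets = buckets
        self.counts = [0] * len(buckets)  # cumulative bucket counts
        self.total = 0.0

    def observe(self, value):
        self.total += value
        for i, le in enumerate(self.buckets):
            if value <= le:
                self.counts[i] += 1  # lands in every bucket >= value


requests = Counter()
requests.inc()

latency = Histogram()
latency.observe(0.3)  # falls into the 0.5, 1.0 and +Inf buckets
```

The cumulative-bucket layout is what lets quantiles be approximated server-side from histogram math.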
Common failure modes¶
- High-cardinality labels (user_id, request_id) explode time series.
- Alerts without runbooks become noise.
- Dashboards without decisions are vanity art.
Roadmap core (10, easy -> hard)¶
- Monitoring vs logging vs tracing?
- Metrics for trends, logs for events, traces for request flow.
- Golden signals?
- Latency, traffic, errors, saturation.
- What is a time series metric?
- Value indexed by time + labels.
- Counter vs gauge vs histogram?
- Monotonic count vs current value vs distribution.
- Why can labels hurt you?
- High cardinality explodes storage/query cost.
- Alerting goal?
- Actionable signals, not noise.
- SLI vs SLO?
- Indicator measurement vs target objective.
- What is "burn rate"?
- Speed of consuming error budget; guides urgency.
- Dashboards: what's the trap?
- Vanity charts without decision value; no link to alerts/SLOs.
- Root cause workflow (tight)?
- Correlate alerts -> deploys -> metrics/logs/traces -> mitigate -> postmortem.
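The burn-rate answer above can be turned into arithmetic. A worked example with illustrative numbers (the 99.9% SLO and error rate are made up for the sketch):

```python
# Worked burn-rate example (illustrative numbers).
SLO = 0.999                    # 99.9% success objective
WINDOW_DAYS = 30
error_budget = 1 - SLO         # 0.1% of requests may fail over the window

observed_error_rate = 0.0144   # 1.44% of requests currently failing
burn_rate = observed_error_rate / error_budget  # ~14.4x budget pace

# At this pace the entire 30-day budget is gone in about 2 days,
# which is why high burn rate means "page now", not "ticket later".
days_to_exhaustion = WINDOW_DAYS / burn_rate
print(round(burn_rate, 1), round(days_to_exhaustion, 2))
```

Burn rate 1.0 means the budget lasts exactly the SLO window; anything well above 1.0 scales the urgency accordingly.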
Prometheus / metrics (easy -> hard)¶
- Scrape (pull) model means?
- Prometheus polls targets; targets expose /metrics.
- Why does label cardinality matter?
- Each unique label set = new time series; memory/storage explodes.
- rate() is for what?
- Per-second rate of counters over a window.
- Histogram buckets used for?
- Latency distributions and quantiles (via histogram math).
- Common metrics anti-pattern?
- Labels like user_id/request_id creating unbounded series.
- How to debug missing metrics?
- Check target up, scrape config, auth/TLS, endpoint health.
- Recording rules benefit?
- Precompute expensive queries; consistent dashboards/alerts.
- Alert rules best practice?
- Tie to symptoms + runbook; avoid noisy thresholds.
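What rate() does to a counter can be sketched in a few lines. This toy version computes the per-second increase over a window and handles a counter reset (the value dropping back toward zero); Prometheus's real implementation additionally extrapolates to the window edges, which this sketch skips.

```python
# Toy version of PromQL rate(): per-second increase of a counter over a
# window, tolerating counter resets (value dropping to a lower number).
# Prometheus also extrapolates to the window boundaries; omitted here.

def simple_rate(samples):
    """samples: list of (timestamp_seconds, counter_value) tuples."""
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        if cur >= prev:
            increase += cur - prev
        else:                  # reset: counter restarted from ~0
            increase += cur    # count everything since the restart
    duration = samples[-1][0] - samples[0][0]
    return increase / duration


# 60s window; the counter resets between t=30 and t=45
samples = [(0, 100), (15, 130), (30, 160), (45, 20), (60, 50)]
print(simple_rate(samples))  # (30 + 30 + 20 + 30) / 60 seconds
```

Reset handling is why rate() is only meaningful on counters: applied to a gauge, a legitimate decrease would be misread as a restart.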
Logging (easy -> hard)¶
- Why structured logs?
- Queryable fields; easier correlation than free text.
- What is log sampling?
- Reduce volume while keeping signal (esp. noisy paths).
- What's the "secret in logs" failure mode?
- Tokens/passwords printed; requires redaction policy.
- Centralized logging benefits?
- Single search place; retention; correlation across services.
- "Logs are down" fallback?
- Node/journal logs, sidecars, direct pod logs, backups.
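The "queryable fields" point about structured logs can be shown with a minimal JSON-lines formatter using only the stdlib. Field names like `trace_id` and `order_id` are illustrative, not a standard schema.

```python
# Minimal structured (JSON-lines) logger using only the stdlib.
# Field names ("trace_id", "order_id") are illustrative, not a schema.
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    def format(self, record):
        entry = {
            "level": record.levelname,
            "msg": record.getMessage(),
        }
        # structured extras attached via logger.info(..., extra=...)
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


logger = logging.getLogger("app")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# One line per event, every field queryable in Loki/Elasticsearch:
logger.info("payment failed",
            extra={"fields": {"order_id": "o-42", "trace_id": "abc123"}})
```

A centralized store can then filter on `order_id` or join on `trace_id` across services, which free-text grep cannot do reliably.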
Tracing (easy -> hard)¶
- What is a trace?
- End-to-end request path across services.
- What is a span?
- Timed unit of work within a trace (nested).
- What is context propagation?
- Passing trace IDs through headers across services.
- Why tracing helps?
- Pinpoints latency source and dependency bottlenecks.
- Common tracing footgun?
- Sample rate too low (interesting traces lost) or too high (cost blowup); missing key spans.
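Context propagation can be sketched as inject/extract over headers, roughly in the spirit of the W3C traceparent header. The header name and ID sizes here are simplified, not the real spec.

```python
# Toy context propagation: carry trace/span IDs between services via
# headers, loosely modeled on W3C Trace Context. The "x-trace" header
# name and ID lengths are simplified for the sketch, not the real spec.
import secrets


def start_trace():
    return {"trace_id": secrets.token_hex(8), "span_id": secrets.token_hex(4)}


def inject(ctx, headers):
    """Caller side: serialize trace context into outgoing headers."""
    headers["x-trace"] = f"{ctx['trace_id']}-{ctx['span_id']}"
    return headers


def extract(headers):
    """Callee side: recover the context and start a child span."""
    trace_id, parent_span = headers["x-trace"].split("-")
    return {
        "trace_id": trace_id,           # same trace end to end
        "parent_span_id": parent_span,  # links child span to caller
        "span_id": secrets.token_hex(4),
    }


ctx = start_trace()
child = extract(inject(ctx, {}))
assert child["trace_id"] == ctx["trace_id"]
```

Because every hop reuses the same trace_id and records its parent span, the backend can reassemble the full request path as a tree.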
SLOs & incident response (easy -> hard)¶
- Why SLOs beat "99.9 because it sounds nice"?
- They align reliability with user impact and cost.
- Error budget concept?
- Allowed unreliability; spend guides feature velocity vs stability.
- Burn rate alerting?
- Alerts on budget consumption speed, not arbitrary metrics.
- Incident command basics?
- Single owner, clear comms, time-boxed mitigation first.
- Postmortem that matters includes?
- Root cause, contributing factors, and concrete prevention tasks.
Key correctness notes¶
- Each unique labelset creates a new time series with RAM/CPU/disk/network cost.
- Keep cardinality low: avoid high-cardinality labels (user_id, request_id, IP); aim for fewer than 100 unique values per label dimension.
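The cost of a high-cardinality label is multiplicative, which a quick calculation makes vivid. The label counts below are illustrative, not measured from any real system.

```python
# Why high-cardinality labels explode series counts: the series count
# is (roughly) the product of unique values per label dimension.
# Numbers below are illustrative.
import math

bounded_labels = {"endpoint": 50, "method": 5, "status": 10}
series = math.prod(bounded_labels.values())
print(series)  # 2500 series: manageable

# Add a single user_id label with 100,000 unique values:
series_with_user_id = series * 100_000
print(series_with_user_id)  # 250 million series: a storage meltdown
```

Every one of those series carries its own index entry and sample stream, which is why one careless label can take down a Prometheus server.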
Sources¶
- Prometheus official docs (label cardinality guidance), SRE literature (golden signals, SLOs).
- https://prometheus.io/docs/practices/naming/
Wiki Navigation¶
Related Content¶
- Observability Architecture (Reference, L2) — Grafana, Loki, Prometheus
- Observability Deep Dive (Topic Pack, L2) — Grafana, Loki, Prometheus
- Track: Observability (Reference, L2) — Grafana, Loki, Prometheus
- Incident Simulator (18 scenarios) (CLI) (Exercise Set, L2) — Loki, Prometheus
- Lab: Prometheus Target Down (CLI) (Lab, L2) — Grafana, Prometheus
- Monitoring Fundamentals (Topic Pack, L1) — Grafana, Prometheus
- Monitoring Migration (Legacy to Modern) (Topic Pack, L2) — Grafana, Prometheus
- Observability Drills (Drill, L2) — Loki, Prometheus
- Adversarial Interview Gauntlet (30 sequences) (Scenario, L2) — Prometheus
- Alerting Rules (Topic Pack, L2) — Prometheus