
Portal | Level: L2: Operations | Topics: Prometheus, Grafana, Loki, Tempo | Domain: Observability

Observability - Skill Check

Mental model (bottom-up)

Observability is three signals: metrics, logs, and traces. You use them to answer: what broke, where, and why now? Prometheus stores time series, and every unique label set creates a new series that costs RAM and disk.

Visual stack

[SLOs/Alerts      ]  actionable thresholds + runbooks
|
[Dashboards       ]  human view of trends and correlations
|
[Storage/query    ]  Prometheus/Loki/Tracing backend
|
[Collection       ]  scrape/agent/OTel collectors
|
[Instrumentation  ]  code emits metrics/logs/spans
|
[Reality          ]  actual system behavior

Glossary

  • SLI/SLO - measurement vs objective target
  • golden signals - latency, traffic, errors, saturation
  • cardinality - number of unique label values/series
  • counter/gauge/histogram - count vs level vs distribution
  • trace/span - request path / unit of work in trace
  • burn rate - speed of consuming error budget

Common failure modes

  • High-cardinality labels (user_id, request_id) explode time series.
  • Alerts without runbooks become noise.
  • Dashboards without decisions are vanity art.
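The cardinality explosion above is multiplicative: series per metric is roughly the product of unique values across its labels. A minimal sketch (the label counts are made-up examples):

```python
from math import prod

def series_count(label_cardinalities):
    """Approximate series for one metric: product of unique values per label."""
    return prod(label_cardinalities)

# method (5) x status (8) x path (50): bounded, manageable
bounded = series_count([5, 8, 50])            # 2,000 series

# add a user_id label with 100k users: unbounded blow-up
exploded = series_count([5, 8, 50, 100_000])  # 200,000,000 series
```

One unbounded label dominates everything else, which is why IDs belong in logs or traces, not metric labels.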

Roadmap core (10, easy -> hard)

  • Monitoring vs logging vs tracing?
  • Metrics for trends, logs for events, traces for request flow.
  • Golden signals?
  • Latency, traffic, errors, saturation.
  • What is a time series metric?
  • Value indexed by time + labels.
  • Counter vs gauge vs histogram?
  • Monotonic count vs current value vs distribution.
  • Why can labels hurt you?
  • High cardinality explodes storage/query cost.
  • Alerting goal?
  • Actionable signals, not noise.
  • SLI vs SLO?
  • Indicator measurement vs target objective.
  • What is "burn rate"?
  • Speed of consuming error budget; guides urgency.
  • Dashboards: what's the trap?
  • Vanity charts without decision value; no link to alerts/SLOs.
  • Root cause workflow (tight)?
  • Correlate alerts -> deploys -> metrics/logs/traces -> mitigate -> postmortem.
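The counter vs gauge vs histogram distinction above can be sketched with minimal stdlib classes (illustrative only, not the prometheus_client API):

```python
class Counter:
    """Monotonic: only increases; Prometheus derives throughput via rate()."""
    def __init__(self):
        self.value = 0.0
    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters never go down")
        self.value += amount

class Gauge:
    """Current level: can go up or down (queue depth, temperature)."""
    def __init__(self):
        self.value = 0.0
    def set(self, value):
        self.value = value

class Histogram:
    """Distribution: cumulative observation counts per upper bound."""
    def __init__(self, bounds):
        self.bounds = sorted(bounds)
        self.counts = [0] * len(self.bounds)
        self.total = 0
    def observe(self, value):
        self.total += 1
        for i, bound in enumerate(self.bounds):
            if value <= bound:
                self.counts[i] += 1  # cumulative: all buckets >= value count it
```

The cumulative buckets are what make Prometheus histograms cheap to aggregate and quantile-estimate later.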

Prometheus / metrics (easy -> hard)

  • Scrape (pull) model means?
  • Prometheus polls targets; targets expose /metrics.
  • Why does label cardinality matter?
  • Each unique label set = new time series; memory/storage explodes.
  • rate() is for what?
  • Per-second rate of counters over a window.
  • Histogram buckets used for?
  • Latency distributions and quantile estimates (via histogram_quantile()).
  • Common metrics anti-pattern?
  • Labels like user_id/request_id creating unbounded series.
  • How to debug missing metrics?
  • Check target up, scrape config, auth/TLS, endpoint health.
  • Recording rules benefit?
  • Precompute expensive queries; consistent dashboards/alerts.
  • Alert rules best practice?
  • Tie to symptoms + runbook; avoid noisy thresholds.
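How quantiles come out of histogram buckets can be sketched with the same linear interpolation PromQL's histogram_quantile() uses (cumulative buckets assumed; the latency data is made up):

```python
def estimate_quantile(q, buckets):
    """buckets: sorted (upper_bound, cumulative_count) pairs, like _bucket series.
    Estimates the q-quantile by interpolating inside the bucket it falls in."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if count == prev_count:  # empty bucket: nothing to interpolate
                return bound
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# 100 requests: 50 under 100ms, 90 under 500ms, all under 1s
latency = [(0.1, 50), (0.5, 90), (1.0, 100)]
p90 = estimate_quantile(0.9, latency)  # interpolated within the 0.1-0.5s bucket
```

This is why bucket boundaries matter: the estimate can never be more precise than the bucket the quantile lands in.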

Logging (easy -> hard)

  • Why structured logs?
  • Queryable fields; easier correlation than free text.
  • What is log sampling?
  • Reduce volume while keeping signal (esp. noisy paths).
  • What's the "secret in logs" failure mode?
  • Tokens/passwords printed; requires redaction policy.
  • Centralized logging benefits?
  • Single search place; retention; correlation across services.
  • "Logs are down" fallback?
  • Node/journal logs, sidecars, direct pod logs, backups.
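Structured logs and the secret-redaction policy above, as a minimal stdlib sketch (the field and key names are illustrative):

```python
import json

SENSITIVE_KEYS = {"password", "token", "secret", "authorization"}

def log_event(event, **fields):
    """Emit one JSON log line; redact known-sensitive fields before writing."""
    safe = {k: ("[REDACTED]" if k.lower() in SENSITIVE_KEYS else v)
            for k, v in fields.items()}
    line = json.dumps({"event": event, **safe}, sort_keys=True)
    print(line)
    return line

log_event("login_failed", user="alice", password="hunter2", attempts=3)
```

Because every line is JSON with stable keys, a backend like Loki can filter on `event` or `user` instead of grepping free text, and secrets never reach storage.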

Tracing (easy -> hard)

  • What is a trace?
  • End-to-end request path across services.
  • What is a span?
  • Timed unit of work within a trace (nested).
  • What is context propagation?
  • Passing trace IDs through headers across services.
  • Why does tracing help?
  • Pinpoints latency source and dependency bottlenecks.
  • Common tracing footgun?
  • Sampling too much or too little; missing key spans.
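Context propagation can be sketched with the W3C Trace Context `traceparent` header (version-traceid-spanid-flags), which OpenTelemetry propagates by default; the helper names are illustrative:

```python
import secrets

def start_trace():
    """Mint a root traceparent: 00-<128-bit trace id>-<64-bit span id>-01."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"

def child_headers(incoming):
    """Keep the caller's trace id; mint a fresh span id for this hop."""
    tp = incoming.get("traceparent")
    if tp is None:
        return {"traceparent": start_trace()}
    version, trace_id, _parent_span, flags = tp.split("-")
    return {"traceparent": f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"}

svc_a = {"traceparent": start_trace()}
svc_b = child_headers(svc_a)  # same trace id, new span id
```

The shared trace id is what lets a backend like Tempo stitch spans from different services into one request path.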

SLOs & incident response (easy -> hard)

  • Why SLOs beat "99.9 because it sounds nice"?
  • They align reliability with user impact and cost.
  • Error budget concept?
  • Allowed unreliability; spend guides feature velocity vs stability.
  • Burn rate alerting?
  • Alerts on budget consumption speed, not arbitrary metrics.
  • Incident command basics?
  • Single owner, clear comms, time-boxed mitigation first.
  • Postmortem that matters includes?
  • Root cause, contributing factors, and concrete prevention tasks.
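Burn rate and budget exhaustion from the questions above, as plain arithmetic (the 30-day window and 99.9% target are assumed examples):

```python
def burn_rate(error_ratio, slo_target):
    """1.0 = budget lasts exactly the window; 10.0 = gone in a tenth of it."""
    error_budget = 1.0 - slo_target
    return error_ratio / error_budget

def hours_until_exhausted(rate, window_hours=30 * 24):
    """How long the error budget survives at the current burn rate."""
    return window_hours / rate

# 1% of requests failing against a 99.9% SLO: burning ~10x too fast
rate = burn_rate(0.01, 0.999)        # ~10.0
hours = hours_until_exhausted(rate)  # ~72 hours, not 30 days
```

This is why burn-rate alerts encode urgency directly: a 10x burn needs paging now, while a 1.5x burn can wait for business hours.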

Key correctness notes

  • Each unique labelset creates a new time series with RAM/CPU/disk/network cost; keep label cardinality low.
  • Keep cardinality low by avoiding high-cardinality labels (user_id, request_id, IP); aim for fewer than 100 unique values per label dimension.

Sources

  • Prometheus official docs (label cardinality guidance), SRE literature (golden signals, SLOs).
  • https://prometheus.io/docs/practices/naming/
