Portal | Level: L2: Operations | Topics: Prometheus, Grafana, Loki, Tempo | Domain: Observability
Observability Deep Dive - Primer¶
Why This Matters¶
Observability is how you understand what your systems are doing without having to SSH into every box. When a service is slow, users are getting errors, or costs are spiking, observability gives you the data to diagnose the problem. Without it, you're guessing. The three pillars - metrics, logs, and traces - each answer different questions about your systems.
Core Concepts¶
The Three Pillars¶
| Pillar | What it tells you | Example tools |
|---|---|---|
| Metrics | Numeric measurements over time | Prometheus, Datadog, CloudWatch |
| Logs | Discrete events with context | Loki, Elasticsearch, CloudWatch Logs |
| Traces | Request flow across services | Jaeger, Tempo, Zipkin |
Metrics answer: "What is happening?" - CPU is at 80%, request rate is 500/sec, error rate is 2%.
Logs answer: "Why is it happening?" - Stack traces, error messages, request details.
Traces answer: "Where in the system is it happening?" - This request spent 200ms in the API, 50ms in the cache, and 3000ms in the database.
Who made it: Prometheus was developed at SoundCloud starting in 2012 by Matt T. Proud and Julius Volz, inspired by Google's internal Borgmon monitoring system. It was publicly announced in January 2015 and became the second project accepted into the CNCF in May 2016 (after Kubernetes). Grafana, the visualization layer most commonly paired with Prometheus, was created by Torkel Odegaard and released in January 2014; he built it because Graphite's built-in dashboard UI was painful to work with.
Prometheus Architecture¶
Prometheus is the most common open-source metrics system:
```
                    ┌──────────────┐
Targets ──────────> │  Prometheus  │ ────> Alertmanager ──> PagerDuty/Slack
(exporters          │  (scrape &   │
 expose /metrics)   │   store)     │ ────> Grafana (visualization)
                    └──────────────┘
```
Key design choices:

- Pull-based: Prometheus scrapes targets at intervals (typically 15s-60s)
- Time-series database: stores `metric_name{labels} = value @ timestamp`
- Local storage: data lives on the Prometheus server's own disk
- Service discovery: finds targets via Kubernetes, Consul, DNS, files, and more
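The pull model is easy to demo: any process that serves the text exposition format on `/metrics` is a scrapeable target. Here is a minimal sketch using only Python's standard library; the port, metric name, and counter value are illustrative, not part of any real exporter.

```python
# Toy Prometheus exporter: serves the text exposition format on /metrics.
# Everything here is a sketch; a real service would use a client library.
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUEST_COUNT = 0  # in a real exporter this comes from instrumentation


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_response(404)
            self.end_headers()
            return
        # Prometheus text format: HELP/TYPE comments, then samples.
        body = (
            "# HELP http_requests_total Total HTTP requests handled.\n"
            "# TYPE http_requests_total counter\n"
            f'http_requests_total{{method="GET"}} {REQUEST_COUNT}\n'
        )
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body.encode())

    def log_message(self, *args):  # keep the demo quiet
        pass


def serve(port=9100):
    """Block forever, serving metrics on the given port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Point a `static_configs` target at this port and Prometheus will ingest `http_requests_total` on its next scrape.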
Scrape configuration:
```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alerts/*.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
```
Gotcha: Prometheus's pull-based scraping model means that if a short-lived process (like a batch job) starts, runs for 5 seconds, and exits before the next scrape interval, Prometheus never sees its metrics. Use the Pushgateway for batch/ephemeral jobs — they push metrics to the gateway, and Prometheus scrapes the gateway. But treat the Pushgateway as a buffer, not a general-purpose metric sink — it was not designed for high cardinality or high throughput.
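A batch job's push to the Pushgateway can be sketched with nothing but the standard library: the gateway accepts the text exposition format via an HTTP PUT to `/metrics/job/<job_name>`. The gateway address, job name, and metric below are assumptions for illustration.

```python
# Sketch: a batch job pushing its final metrics to a Pushgateway.
# The gateway host "pushgateway:9091" and metric name are hypothetical.
import urllib.request


def build_push_request(gateway: str, job: str, body: str) -> urllib.request.Request:
    """Build (but do not send) a PUT request in the Pushgateway's URL format."""
    return urllib.request.Request(
        f"http://{gateway}/metrics/job/{job}",
        data=body.encode(),
        headers={"Content-Type": "text/plain"},
        method="PUT",
    )


metrics = (
    "# TYPE batch_rows_processed_total counter\n"
    "batch_rows_processed_total 51234\n"
)
req = build_push_request("pushgateway:9091", "nightly-etl", metrics)
# urllib.request.urlopen(req)  # would send the push at the end of a real job
```

Prometheus then scrapes the gateway like any other target; the job's metrics survive even though the job itself has exited.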
Metric Types¶
| Type | Purpose | Example |
|---|---|---|
| Counter | Monotonically increasing value | http_requests_total |
| Gauge | Value that goes up and down | temperature_celsius |
| Histogram | Distribution of values (buckets) | http_request_duration_seconds |
| Summary | Similar to histogram, client-side quantiles | rpc_duration_seconds |
Naming convention: `<namespace>_<name>_<unit>`, with a `_total` suffix for counters.
Default trap: Histograms are powerful but expensive. Each histogram creates multiple time series: one per bucket plus `_sum` and `_count`. The default bucket set ({.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10}) has 11 boundaries plus an implicit +Inf bucket, so each label combination produces 12 bucket series plus `_sum` and `_count` — 14 series in total. A histogram with 3 labels, each with 10 values, generates 14 x 10 x 10 x 10 = 14,000 time series from a single metric definition. Always customize bucket boundaries to match your actual latency distribution, and keep label cardinality low.
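The series arithmetic is worth checking by hand. A small sketch — the bucket list is the common client-library default, and the label counts are hypothetical:

```python
# Back-of-envelope histogram cardinality: per label combination, a histogram
# emits one series per bucket (including the implicit +Inf bucket) plus
# _sum and _count.
DEFAULT_BUCKETS = [.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]


def histogram_series(buckets, label_value_counts):
    """Total time series emitted by one histogram metric."""
    per_combination = len(buckets) + 1 + 2  # +Inf bucket, _sum, _count
    combinations = 1
    for n in label_value_counts:
        combinations *= n
    return per_combination * combinations


print(histogram_series(DEFAULT_BUCKETS, []))           # 14 series, no labels
print(histogram_series(DEFAULT_BUCKETS, [10, 10, 10]))  # 14,000 series
```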
PromQL (Prometheus Query Language)¶
```promql
# Instant vector: current value
up{job="node"}

# Range vector: values over time
http_requests_total{method="GET"}[5m]

# Rate: per-second rate of counter increase
rate(http_requests_total[5m])

# Aggregation: sum across all instances
sum(rate(http_requests_total[5m])) by (method)

# Error rate as a percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
sum(rate(http_requests_total[5m]))
  * 100

# 95th percentile latency (from histogram)
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Predict disk full in 4 hours
predict_linear(node_filesystem_avail_bytes[1h], 4*3600) < 0

# Top 5 pods by memory
topk(5, container_memory_usage_bytes{namespace="production"})
```
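The most commonly misunderstood function here is rate(). A toy re-implementation of its core idea — per-second increase over the window, with any decrease treated as a counter reset — using made-up samples; the real function also extrapolates to the window boundaries, which this sketch skips.

```python
# Toy rate(): per-second increase of a counter over a window.
# A counter only goes up, so a decrease means the process restarted and
# the counter reset to zero; the post-reset value is all new increase.
def counter_increase(samples):
    """samples: time-ordered list of (timestamp_seconds, counter_value)."""
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        increase += curr if curr < prev else curr - prev
    return increase


def simple_rate(samples):
    window = samples[-1][0] - samples[0][0]
    return counter_increase(samples) / window


# Counter climbs 100 -> 160, resets (160 -> 20), climbs again, over 60s.
samples = [(0, 100), (15, 130), (30, 160), (45, 20), (60, 50)]
print(simple_rate(samples))  # (30 + 30 + 20 + 30) / 60 ≈ 1.83 per second
```

This is why you graph `rate(http_requests_total[5m])` rather than the raw counter: the raw value is meaningless on its own and drops to zero on every restart.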
Remember: The RED and USE methods are the two essential observability frameworks. RED (Rate, Errors, Duration) is for services: "How many requests? How many fail? How long do they take?" USE (Utilization, Saturation, Errors) is for resources: "How busy is it? How queued is it? How broken is it?" RED was coined by Tom Wilkie (now at Grafana Labs); USE was coined by Brendan Gregg. Together they cover both the service and infrastructure layers.
Grafana¶
Grafana visualizes data from Prometheus, Loki, and other sources:
- Dashboards: collections of panels showing metrics
- Panels: individual visualizations (graphs, gauges, tables, heatmaps)
- Variables: dropdowns that parameterize dashboards ($namespace, $pod)
- Alerts: Grafana can also evaluate alert rules (v8+)
Dashboard design principles:

1. Overview dashboards: high-level health (RED/USE method)
2. Service dashboards: per-service metrics
3. Debugging dashboards: detailed metrics for troubleshooting
4. Use variables so one dashboard works for all environments
Under the hood: Prometheus stores metrics as time-series data using a custom TSDB (time-series database) with a write-ahead log. Data is organized into 2-hour blocks that are compacted over time. Each time series is identified by a metric name plus label set. The critical operational constraint is cardinality: a metric with a label that has 1 million unique values (like `user_id`) creates 1 million time series, each consuming memory and disk. This is the #1 cause of Prometheus OOM crashes.
Loki and LogQL¶
Name origin: Loki is named after the Norse trickster god, fitting for a log aggregation system that is deceptively simple on the surface but powerful underneath. Tempo, Grafana's distributed tracing backend, borrows a musical term, in keeping with Grafana Labs' naming habits.
Loki is "Prometheus but for logs." It indexes labels, not log content, making it cheap to operate.
```logql
# Stream selector (filter by labels)
{app="nginx", namespace="production"}

# Filter by content
{app="nginx"} |= "error"
{app="nginx"} != "healthcheck"
{app="nginx"} |~ "status=(4|5).."

# Parse and filter
{app="nginx"} | json | status >= 400

# Metrics from logs
count_over_time({app="nginx"} |= "error" [5m])
rate({app="nginx"} |= "error" [5m])

# Top error paths
sum by (path) (count_over_time({app="nginx"} | json | status >= 500 [1h]))
```
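Loki's "index labels, not content" design can be sketched in a few lines: stream selection is a cheap index lookup, while content filters like `|=` are a scan over only the selected streams. The streams and log lines below are invented for illustration.

```python
# Toy model of Loki's storage: streams are keyed only by their label set;
# line content is never indexed, just scanned after stream selection.
streams = {
    (("app", "nginx"), ("namespace", "production")): [
        '{"status": 500, "path": "/checkout"}',
        '{"status": 200, "path": "/health"}',
    ],
    (("app", "api"), ("namespace", "production")): [
        '{"status": 200, "path": "/v1/users"}',
    ],
}


def select(selector):
    """Cheap index lookup: yield lines from streams whose labels match."""
    want = set(selector.items())
    for labels, lines in streams.items():
        if want <= set(labels):
            yield from lines


def line_filter(lines, needle):
    """Expensive content scan, like LogQL's |= operator."""
    return [line for line in lines if needle in line]


# Roughly: {app="nginx"} |= '"status": 500'
hits = line_filter(select({"app": "nginx"}), '"status": 500')
print(hits)
```

Keeping the index to labels only is what makes Loki cheap: the trade-off is that content queries must scan, so narrow label selectors matter.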
Alerting¶
Alerting rules in Prometheus:
```yaml
groups:
  - name: service-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
          > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% for 5 minutes"
          description: "Current error rate: {{ $value | humanizePercentage }}"

      - alert: DiskWillFillIn4Hours
        expr: predict_linear(node_filesystem_avail_bytes[1h], 4*3600) < 0
        for: 30m
        labels:
          severity: warning
```
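The `for:` clause is what keeps a brief spike from paging anyone: the alert sits in a pending state until its expression has been continuously true for the whole duration. A toy state machine, heavily simplified relative to Prometheus's real evaluator (timestamps in seconds, made-up evaluation data):

```python
# Toy model of Prometheus alert states: inactive -> pending -> firing.
# Any false evaluation resets the pending timer.
def alert_state(evaluations, for_seconds):
    """evaluations: time-ordered list of (timestamp, expr_is_true)."""
    pending_since = None
    state = "inactive"
    for ts, is_true in evaluations:
        if not is_true:
            pending_since, state = None, "inactive"
        elif pending_since is None:
            pending_since, state = ts, "pending"
        elif ts - pending_since >= for_seconds:
            state = "firing"
    return state


# True for only 4 minutes: still pending under for: 5m (300s).
print(alert_state([(0, True), (120, True), (240, True)], 300))  # pending
# True continuously for 6 minutes: fires.
print(alert_state([(0, True), (180, True), (360, True)], 300))  # firing
```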
Alertmanager handles routing, deduplication, grouping, and silencing:
```yaml
route:
  receiver: 'slack-default'
  group_by: ['alertname', 'namespace']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
    - match:
        severity: warning
      receiver: 'slack-warnings'
```
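The routing above is first-match-wins over the child routes, falling back to the root receiver. It can be sketched as a lookup over ordered matchers; this toy ignores nested routes, regex matchers, and Alertmanager's `continue` flag.

```python
# Toy Alertmanager router: first child route whose matchers all equal the
# alert's labels wins; otherwise the root receiver handles it.
ROOT_RECEIVER = "slack-default"
ROUTES = [
    ({"severity": "critical"}, "pagerduty"),
    ({"severity": "warning"}, "slack-warnings"),
]


def route(alert_labels):
    for matchers, receiver in ROUTES:
        if all(alert_labels.get(k) == v for k, v in matchers.items()):
            return receiver
    return ROOT_RECEIVER


print(route({"alertname": "HighErrorRate", "severity": "critical"}))  # pagerduty
print(route({"alertname": "PodRestarting"}))  # slack-default
```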
SLI / SLO / SLA¶
| Term | Definition | Example |
|---|---|---|
| SLI (Service Level Indicator) | A measurement of service behavior | 99.2% of requests return in < 500ms |
| SLO (Service Level Objective) | Target for an SLI | 99.5% of requests should be < 500ms |
| SLA (Service Level Agreement) | Contract with consequences | 99.9% uptime, or customer gets credits |
Error budget = 1 - SLO. If your SLO is 99.5%, your error budget is 0.5% (about 3.6 hours/month of allowed downtime).
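The downtime figure converts directly: budget fraction times the minutes in the period. A quick check, assuming a 30-day month:

```python
# Error budget as allowed downtime: (1 - SLO) * minutes in the period.
def error_budget_minutes(slo, days=30):
    return (1 - slo) * days * 24 * 60


budget = error_budget_minutes(0.995)
print(budget, "minutes =", budget / 60, "hours/month")  # 216 minutes = 3.6 hours
```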
What Experienced People Know¶
- Alerting on symptoms (error rate, latency) is better than alerting on causes (CPU, memory). Users don't care about your CPU; they care that the site is slow.
- Every alert should be actionable. If there's nothing a human should do when it fires, it shouldn't be an alert. Make it a dashboard instead.
- Cardinality matters. A metric with a `{user_id="..."}` label across millions of users will kill Prometheus. Keep label cardinality bounded.
- Dashboards rot. If nobody looks at a dashboard, delete it. If everyone has their own dashboard, consolidate.
- The RED method (Rate, Errors, Duration) for services and the USE method (Utilization, Saturation, Errors) for resources are the best starting frameworks.
See Also¶
- Guide: Observability Guide
- Cheatsheet: Observability
- Drills: Observability Drills, PromQL Drills, LogQL Drills
- Skillcheck: Observability
- Runbooks: Prometheus Target Down, Loki No Logs
Wiki Navigation¶
Prerequisites¶
- Kubernetes Exercises (Quest Ladder) (CLI) (Exercise Set, L1)
Next Steps¶
- Alerting Rules (Topic Pack, L2)
- Capacity Planning (Topic Pack, L2)
- Case Study: Disk Full — Runaway Logs, Fix Is Loki Retention (Case Study, L2)
- Case Study: Grafana Dashboard Empty — Prometheus Blocked by NetworkPolicy (Case Study, L2)
- Chaos Engineering & Fault Injection (Topic Pack, L2)
- Continuous Profiling (Topic Pack, L2)
- Load Testing (Topic Pack, L1)
- Log Pipelines (Topic Pack, L2)
Related Content¶
- Observability Architecture (Reference, L2) — Grafana, Loki, Prometheus
- Skillcheck: Observability (Assessment, L2) — Grafana, Loki, Prometheus
- Track: Observability (Reference, L2) — Grafana, Loki, Prometheus
- Incident Simulator (18 scenarios) (CLI) (Exercise Set, L2) — Loki, Prometheus
- Lab: Prometheus Target Down (CLI) (Lab, L2) — Grafana, Prometheus
- Monitoring Fundamentals (Topic Pack, L1) — Grafana, Prometheus
- Monitoring Migration (Legacy to Modern) (Topic Pack, L2) — Grafana, Prometheus
- Observability Drills (Drill, L2) — Loki, Prometheus
- Adversarial Interview Gauntlet (30 sequences) (Scenario, L2) — Prometheus
- Alerting Rules (Topic Pack, L2) — Prometheus
Pages that link here¶
- Anti-Primer: Observability Deep Dive
- Certification Prep: PCA — Prometheus Certified Associate
- Chaos Engineering & Fault Injection
- Comparison: Metrics Platforms
- Comparison: Tracing Platforms
- Continuous Profiling
- Kubernetes Ecosystem - Primer
- Load Testing
- Log Pipelines
- LogQL Drills
- Master Curriculum: 40 Weeks
- Monitoring Migration (Legacy to Modern)
- Observability
- Observability Architecture
- Observability Cheat Sheet