Portal | Level: L2: Operations | Topics: Prometheus, Grafana, Loki, Tempo | Domain: Observability

Observability Deep Dive - Primer

Why This Matters

Observability is how you understand what your systems are doing without having to SSH into every box. When a service is slow, users are getting errors, or costs are spiking, observability gives you the data to diagnose the problem. Without it, you're guessing. The three pillars - metrics, logs, and traces - each answer different questions about your systems.

Core Concepts

The Three Pillars

Pillar  | What it tells you              | Example tools
--------|--------------------------------|-------------------------------------
Metrics | Numeric measurements over time | Prometheus, Datadog, CloudWatch
Logs    | Discrete events with context   | Loki, Elasticsearch, CloudWatch Logs
Traces  | Request flow across services   | Jaeger, Tempo, Zipkin

Metrics answer: "What is happening?" - CPU is at 80%, request rate is 500/sec, error rate is 2%.

Logs answer: "Why is it happening?" - Stack traces, error messages, request details.

Traces answer: "Where in the system is it happening?" - This request spent 200ms in the API, 50ms in the cache, and 3000ms in the database.

Who made it: Prometheus was developed at SoundCloud starting in 2012 by Matt T. Proud and Julius Volz, inspired by Google's internal Borgmon monitoring system. It was publicly announced in January 2015 and became the second project accepted into the CNCF in May 2016 (after Kubernetes). Grafana, the visualization layer most commonly paired with Prometheus, was created by Torkel Odegaard and released in January 2014. He built it to improve Graphite's poor dashboard UI; the name reportedly began as a misspelling that stuck.

Prometheus Architecture

Prometheus is the most common open-source metrics system:

                  ┌──────────────┐
Targets ────────> │  Prometheus  │ ────> Alertmanager ──> PagerDuty/Slack
(exporters        │  (scrape &   │
 expose /metrics) │   store)     │ ────> Grafana (visualization)
                  └──────────────┘

Key design choices:

  • Pull-based: Prometheus scrapes targets at intervals (typically 15s-60s)
  • Time-series database: stores metric_name{labels} = value @ timestamp
  • Local storage: data is stored on the Prometheus server's disk
  • Service discovery: finds targets via Kubernetes, Consul, DNS, files, etc.
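What a target actually serves at /metrics is plain text in the Prometheus exposition format. A hand-rolled sketch (illustrative metric names and values; in a real service you would use an official client library such as prometheus_client rather than formatting this yourself):

```python
# Render metrics in the Prometheus text exposition format:
# a HELP line, a TYPE line, then one sample line per label combination.
def render_metric(name, help_text, mtype, samples):
    """samples: list of (labels_dict, value) pairs."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {mtype}"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines)

# Hypothetical counter samples for two label combinations.
exposition = render_metric(
    "http_requests_total", "Total HTTP requests.", "counter",
    [({"method": "GET", "status": "200"}, 1027),
     ({"method": "POST", "status": "500"}, 3)],
)
print(exposition)
```

Prometheus pulls this text on every scrape and stores each sample line as a point in its own time series.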

Scrape configuration:

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alerts/*.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true

Gotcha: Prometheus's pull-based scraping model means that if a short-lived process (like a batch job) starts, runs for 5 seconds, and exits before the next scrape interval, Prometheus never sees its metrics. Use the Pushgateway for batch/ephemeral jobs — they push metrics to the gateway, and Prometheus scrapes the gateway. But treat the Pushgateway as a buffer, not a general-purpose metric sink — it was not designed for high cardinality or high throughput.
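A sketch of what a batch job's push looks like, using only the standard library. The gateway URL and job name are placeholders; the Pushgateway listens on :9091 by default and accepts the exposition format at /metrics/job/<job_name>:

```python
import urllib.request

PUSHGATEWAY = "http://pushgateway:9091"  # assumption: default Pushgateway port
JOB = "nightly_backup"                   # hypothetical job name

# Metric body in the exposition format; value is a Unix timestamp.
body = (
    "# TYPE backup_last_success_timestamp_seconds gauge\n"
    "backup_last_success_timestamp_seconds 1.7e9\n"
)

req = urllib.request.Request(
    f"{PUSHGATEWAY}/metrics/job/{JOB}",
    data=body.encode(),
    method="PUT",  # PUT replaces all metrics for this job grouping
)
# urllib.request.urlopen(req)  # uncomment where a Pushgateway is reachable
print(req.full_url, req.method)
```

Prometheus then scrapes the gateway on its normal interval, so the batch job's final state survives the job's exit.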

Metric Types

Type      | Purpose                                     | Example
----------|---------------------------------------------|------------------------------
Counter   | Monotonically increasing value              | http_requests_total
Gauge     | Value that goes up and down                 | temperature_celsius
Histogram | Distribution of values (buckets)            | http_request_duration_seconds
Summary   | Like a histogram, but client-side quantiles | rpc_duration_seconds

Naming convention: <namespace>_<name>_<unit> with _total suffix for counters.

Default trap: Histograms are powerful but expensive. Each histogram creates multiple time series: one per bucket (including the implicit +Inf bucket) plus _sum and _count. The default bucket set ({.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10}) creates 14 series per label combination: 11 buckets, +Inf, _sum, and _count. A histogram with 3 labels, each with 10 values, generates 14 x 10 x 10 x 10 = 14,000 time series from a single metric definition. Always customize bucket boundaries to match your actual latency distribution, and keep label cardinality low.
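The back-of-the-envelope arithmetic, counting one series per bucket boundary, the implicit +Inf bucket, and the _sum and _count series, multiplied across all label-value combinations:

```python
# Default client-library bucket boundaries (11 values).
DEFAULT_BUCKETS = [.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]

def histogram_series(buckets, label_cardinalities):
    """Estimate the number of time series one histogram metric creates."""
    per_combo = len(buckets) + 1 + 2  # explicit buckets + +Inf + _sum + _count
    combos = 1
    for n in label_cardinalities:     # one series set per label combination
        combos *= n
    return per_combo * combos

print(histogram_series(DEFAULT_BUCKETS, []))            # → 14
print(histogram_series(DEFAULT_BUCKETS, [10, 10, 10]))  # → 14000
```

Running the same arithmetic before adding a label to a histogram is a cheap sanity check against cardinality explosions.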

PromQL (Prometheus Query Language)

# Instant vector: current value
up{job="node"}

# Range vector: values over time
http_requests_total{method="GET"}[5m]

# Rate: per-second rate of counter increase
rate(http_requests_total[5m])

# Aggregation: sum across all instances
sum(rate(http_requests_total[5m])) by (method)

# Error rate as percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
* 100

# 95th percentile latency (from histogram)
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Predict disk full in 4 hours
predict_linear(node_filesystem_avail_bytes[1h], 4*3600) < 0

# Top 5 pods by memory
topk(5, container_memory_usage_bytes{namespace="production"})
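How histogram_quantile arrives at a percentile from cumulative bucket counts can be sketched as: find the bucket the target rank falls into, then interpolate linearly inside it. This mirrors Prometheus's documented estimation approach in simplified form (it omits +Inf special-casing); the bucket bounds and counts below are invented:

```python
def histogram_quantile(q, buckets):
    """buckets: sorted list of (upper_bound, cumulative_count)."""
    total = buckets[-1][1]
    rank = q * total                     # the observation we want to locate
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if rank <= count:
            # Linear interpolation within the bucket that contains the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return buckets[-1][0]

# 100 requests: 60 under 0.1s, 90 under 0.5s, all under 1s.
cumulative = [(0.1, 60), (0.5, 90), (1.0, 100)]
print(histogram_quantile(0.95, cumulative))  # → 0.75 (inside the 0.5-1.0 bucket)
```

This is also why bucket boundaries matter: the estimate can never be more precise than the width of the bucket the quantile lands in.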

Remember: The RED and USE methods are the two essential observability frameworks. RED (Rate, Errors, Duration) is for services: "How many requests? How many fail? How long do they take?" USE (Utilization, Saturation, Errors) is for resources: "How busy is it? How queued is it? How broken is it?" RED was coined by Tom Wilkie; USE by Brendan Gregg. Together they cover both the service and infrastructure layers.

Grafana

Grafana visualizes data from Prometheus, Loki, and other sources:

  • Dashboards: collections of panels showing metrics
  • Panels: individual visualizations (graphs, gauges, tables, heatmaps)
  • Variables: dropdowns that parameterize dashboards ($namespace, $pod)
  • Alerts: Grafana can also evaluate alert rules (v8+)

Dashboard design principles:

  1. Overview dashboards: high-level health (RED/USE method)
  2. Service dashboards: per-service metrics
  3. Debugging dashboards: detailed metrics for troubleshooting
  4. Use variables so one dashboard works for all environments

Under the hood: Prometheus stores metrics as time-series data using a custom TSDB (Time Series Database) with a write-ahead log. Data is organized into 2-hour blocks that are compacted over time. Each time series is identified by a metric name + label set. The critical operational constraint: cardinality. A metric with a label that has 1 million unique values (like user_id) creates 1 million time series — each consuming memory and disk. This is the #1 cause of Prometheus OOM crashes.

Loki and LogQL

Name origin: Loki is named after the Norse trickster god, fitting for a log aggregation system that is deceptively simple on the surface but powerful underneath. Tempo, Grafana Labs' distributed tracing backend, takes its name from the musical term for pace.

Loki is "Prometheus but for logs." It indexes labels, not log content, making it cheap to operate.

# Stream selector (filter by labels)
{app="nginx", namespace="production"}

# Filter by content
{app="nginx"} |= "error"
{app="nginx"} != "healthcheck"
{app="nginx"} |~ "status=(4|5).."

# Parse and filter
{app="nginx"} | json | status >= 400

# Metrics from logs
count_over_time({app="nginx"} |= "error" [5m])
rate({app="nginx"} |= "error" [5m])

# Top error paths
sum by (path) (count_over_time({app="nginx"} | json | status >= 500 [1h]))
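What the `| json | status >= 400` stage does, sketched in Python: parse each log line as JSON, promote its fields, and keep only the lines whose status is 400 or above. The sample log lines are invented:

```python
import json

lines = [
    '{"path": "/api/items", "status": 200, "duration_ms": 12}',
    '{"path": "/api/items", "status": 502, "duration_ms": 3041}',
    '{"path": "/login", "status": 404, "duration_ms": 4}',
]

# Equivalent of: {app="nginx"} | json | status >= 400
errors = [rec for rec in map(json.loads, lines) if rec["status"] >= 400]
for rec in errors:
    print(rec["path"], rec["status"])
```

The key difference in Loki is that this parsing happens at query time, not ingest time, which is what keeps its index small and cheap.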

Alerting

Alerting rules in Prometheus:

groups:
  - name: service-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
          > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% for 5 minutes"
          description: "Current error rate: {{ $value | humanizePercentage }}"

      - alert: DiskWillFillIn4Hours
        expr: predict_linear(node_filesystem_avail_bytes[1h], 4*3600) < 0
        for: 30m
        labels:
          severity: warning

Alertmanager handles routing, deduplication, grouping, and silencing:

route:
  receiver: 'slack-default'
  group_by: ['alertname', 'namespace']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
    - match:
        severity: warning
      receiver: 'slack-warnings'

SLI / SLO / SLA

Term                          | Definition                        | Example
------------------------------|-----------------------------------|--------------------------------------------
SLI (Service Level Indicator) | A measurement of service behavior | 99.2% of requests return in < 500ms
SLO (Service Level Objective) | Target for an SLI                 | 99.5% of requests should be < 500ms
SLA (Service Level Agreement) | Contract with consequences        | 99.9% uptime, or the customer gets credits

Error budget = 1 - SLO. If your SLO is 99.5%, your error budget is 0.5% (about 3.6 hours/month of allowed downtime).
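The error-budget arithmetic (budget = 1 - SLO, expressed as allowed downtime over a 30-day month) in a few lines:

```python
def error_budget_hours(slo, days=30):
    """Allowed downtime in hours for a given SLO over the given window."""
    return (1 - slo) * days * 24

print(round(error_budget_hours(0.995), 2))       # → 3.6  hours/month
print(round(error_budget_hours(0.999) * 60, 1))  # → 43.2 minutes/month
```

Tightening an SLO by one nine cuts the budget by an order of magnitude, which is why "more nines" is an engineering decision, not a marketing one.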

What Experienced People Know

  • Alerting on symptoms (error rate, latency) is better than alerting on causes (CPU, memory). Users don't care about your CPU; they care that the site is slow.
  • Every alert should be actionable. If there's nothing a human should do when it fires, it shouldn't be an alert. Make it a dashboard instead.
  • Cardinality matters. A metric with labels {user_id="..."} for millions of users will kill Prometheus. Keep label cardinality bounded.
  • Dashboards rot. If nobody looks at a dashboard, delete it. If everyone has their own dashboard, consolidate.
  • The RED method (Rate, Errors, Duration) for services and the USE method (Utilization, Saturation, Errors) for resources are the best starting frameworks.

Wiki Navigation

Prerequisites

  • Kubernetes Exercises (Quest Ladder) (CLI) (Exercise Set, L1)
