
Observability Cheat Sheet

Remember: Two complementary monitoring frameworks: RED for services (Request rate, Error rate, Duration) and USE for infrastructure (Utilization, Saturation, Errors). RED answers "are my users happy?" while USE answers "are my resources healthy?" Use both: RED dashboards for service owners, USE dashboards for platform teams.

Three Pillars

Pillar    Tool           Query Language
-------   ------------   --------------
Metrics   Prometheus     PromQL
Logs      Loki           LogQL
Traces    Tempo/Jaeger   TraceQL

Gotcha: rate() only works on counters (monotonically increasing values). Using rate() on a gauge produces nonsense. For gauges, use deriv() (rate of change) or delta() (difference over window). Conversely, never graph a raw counter — it only goes up. Always wrap counters in rate() or increase() to get useful per-second or per-window values.
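A quick sketch of the split (metric names other than http_requests_total are illustrative node_exporter-style names):

```promql
# Counters: always wrap in rate() or increase()
rate(http_requests_total[5m])            # per-second rate
increase(http_requests_total[1h])        # total increase over 1h

# Gauges: use deriv() or delta(), never rate()
deriv(node_memory_MemAvailable_bytes[10m])   # bytes/sec trend
delta(node_memory_MemAvailable_bytes[1h])    # change over the window
```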

PromQL Quick Reference

# Instant vector
http_requests_total{job="api", status="200"}

# Range vector (last 5m)
http_requests_total[5m]

# Rate (per-second average over 5m)
rate(http_requests_total[5m])

# Aggregations
sum(rate(http_requests_total[5m])) by (service)
avg(container_memory_usage_bytes) by (pod)
topk(5, rate(http_requests_total[5m]))

# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# Histogram quantile (p99 latency): aggregate buckets by le first
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Predict disk full in 4 hours
predict_linear(node_filesystem_avail_bytes[1h], 4*3600) < 0

LogQL Quick Reference

# Basic
{namespace="prod", app="api"}

# Filter
{app="api"} |= "error"
{app="api"} !~ "health|readiness"
{app="api"} | json | status >= 500

# Metrics from logs
count_over_time({app="api"} |= "error" [5m])
rate({app="api"} | json | status >= 500 [5m])

# Aggregation
sum by (status) (count_over_time({app="api"} | json [5m]))

Prometheus Architecture

Targets → Prometheus (scrape) → TSDB
          Prometheus → Alertmanager → PagerDuty/Slack
          Grafana → Prometheus (query + dashboards)
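A minimal scrape configuration behind that diagram might look like this (job names and ports are illustrative):

```yaml
# prometheus.yml (sketch)
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: api
    static_configs:
      - targets: ["api:8080"]

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]
```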

Key Metrics to Monitor

Application (RED method):
  Rate:     requests per second
  Errors:   error rate (5xx / total)
  Duration: latency percentiles (p50, p95, p99)

Infrastructure (USE method):
  Utilization: % resource busy
  Saturation:  queue depth, throttling
  Errors:      error counts
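Assuming standard node_exporter metric names, the USE signals map to PromQL roughly as:

```promql
# Utilization: fraction of CPU time spent not idle
1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)

# Saturation: run-queue length relative to core count
node_load1 / on(instance) count(node_cpu_seconds_total{mode="idle"}) by (instance)

# Errors: network receive errors per second
rate(node_network_receive_errs_total[5m])
```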

Grafana Dashboard Patterns

Row 1: Golden signals (rate, errors, latency)
Row 2: Resource usage (CPU, memory, disk)
Row 3: Saturation (queue depth, connection pool)
Row 4: Business metrics (orders, signups)

Alert Rules

groups:
- name: app
  rules:
  - alert: HighErrorRate
    expr: |
      sum(rate(http_requests_total{status=~"5.."}[5m]))
      / sum(rate(http_requests_total[5m])) > 0.01
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Error rate > 1% for 5 minutes"
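A companion warning-level rule for latency could sit in the same rules list; the 500ms threshold is illustrative, not a recommendation:

```yaml
  - alert: HighLatencyP99
    expr: |
      histogram_quantile(0.99,
        sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 0.5
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "p99 latency > 500ms for 10 minutes"
```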

Alertmanager Routing

route:
  receiver: default
  routes:
  - match:
      severity: critical
    receiver: pagerduty
  - match:
      severity: warning
    receiver: slack
    repeat_interval: 4h

receivers:
- name: pagerduty
  pagerduty_configs:
  - routing_key: <key>
- name: slack
  slack_configs:
  - channel: '#alerts'

Debugging with Prometheus

# Check targets
curl localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'

# Check rules
curl localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | {name: .name, state: .state}'

# Check alerts
curl localhost:9090/api/v1/alerts | jq '.data.alerts[]'

# TSDB status
curl localhost:9090/api/v1/status/tsdb | jq '.data'

Under the hood: Prometheus uses a pull model: it scrapes targets at a configured interval (default 15s). If a target is down, Prometheus detects it at the next scrape (the scrape fails and the target's up metric drops to 0). Push-based systems (like StatsD) cannot distinguish "app crashed" from "app is healthy but idle." The pull model also means Prometheus controls the scrape rate, preventing metric floods from misbehaving applications.
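To illustrate the pull side, here is a minimal sketch of a /metrics endpoint using only the Python standard library. The metric reuses the cheat sheet's http_requests_total; a real service would use an official Prometheus client library rather than hand-rolling the text format:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counter for the sketch.
REQUEST_COUNT = 0

def render_metrics() -> str:
    # Prometheus text exposition format: HELP/TYPE comments, then samples.
    return (
        "# HELP http_requests_total Total HTTP requests handled.\n"
        "# TYPE http_requests_total counter\n"
        f'http_requests_total{{status="200"}} {REQUEST_COUNT}\n'
    )

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        global REQUEST_COUNT
        REQUEST_COUNT += 1  # counters only ever go up
        body = render_metrics().encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the sketch quiet

# To expose a scrape target on :9100, uncomment:
# HTTPServer(("", 9100), MetricsHandler).serve_forever()
```

Prometheus would then scrape this endpoint on its own schedule; the app never pushes anything.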

Service Monitoring Setup

# ServiceMonitor (Prometheus Operator)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api
spec:
  selector:
    matchLabels:
      app: api
  endpoints:
  - port: metrics
    interval: 15s
    path: /metrics
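The ServiceMonitor's selector matches labels on a Service (not on Pods), and the endpoint port refers to a named Service port. A companion Service might look like this (port numbers are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: api
  labels:
    app: api          # matched by the ServiceMonitor selector
spec:
  selector:
    app: api
  ports:
  - name: metrics     # must match the ServiceMonitor endpoint port name
    port: 9090
    targetPort: 9090
```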

Recording Rules (Performance)

groups:
- name: api_recordings
  rules:
  - record: job:http_requests:rate5m
    expr: sum(rate(http_requests_total[5m])) by (job)
  - record: job:http_errors:ratio5m
    expr: |
      sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
      / sum(rate(http_requests_total[5m])) by (job)
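Recorded series then keep alert expressions cheap; for example, the HighErrorRate rule above could be rewritten against the precomputed ratio:

```yaml
  - alert: HighErrorRate
    expr: job:http_errors:ratio5m > 0.01
    for: 5m
```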