# Observability Cheat Sheet
**Remember:** there are two complementary monitoring frameworks: RED for services (Request rate, Error rate, Duration) and USE for infrastructure (Utilization, Saturation, Errors). RED answers "are my users happy?"; USE answers "are my resources healthy?" Use both: RED dashboards for service owners, USE dashboards for platform teams.
## Three Pillars
| Pillar | Tool | Query Language |
|---|---|---|
| Metrics | Prometheus | PromQL |
| Logs | Loki | LogQL |
| Traces | Tempo/Jaeger | TraceQL |
**Gotcha:** `rate()` only works on counters (monotonically increasing values); using `rate()` on a gauge produces nonsense. For gauges, use `deriv()` (per-second rate of change) or `delta()` (difference over a window). Conversely, never graph a raw counter, since it only ever goes up: always wrap counters in `rate()` or `increase()` to get useful per-second or per-window values.
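To make the distinction concrete, a sketch (the gauge metric names are illustrative node_exporter-style examples):

```promql
# Counter: wrap in rate() for per-second values
rate(http_requests_total[5m])

# Counter: increase() for the total over a window
increase(http_requests_total[1h])

# Gauge: deriv() for per-second rate of change
deriv(node_memory_MemAvailable_bytes[10m])

# Gauge: delta() for the change over a window
delta(cpu_temp_celsius[1h])
```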
## PromQL Quick Reference

```promql
# Instant vector
http_requests_total{job="api", status="200"}

# Range vector (last 5m)
http_requests_total[5m]

# Rate (per-second average over 5m)
rate(http_requests_total[5m])

# Aggregations
sum(rate(http_requests_total[5m])) by (service)
avg(container_memory_usage_bytes) by (pod)
topk(5, rate(http_requests_total[5m]))

# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# Histogram quantile (p99 latency)
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

# Predict disk full in 4 hours
predict_linear(node_filesystem_avail_bytes[1h], 4*3600) < 0
```
## LogQL Quick Reference

```logql
# Basic
{namespace="prod", app="api"}

# Filter
{app="api"} |= "error"
{app="api"} !~ "health|readiness"
{app="api"} | json | status >= 500

# Metrics from logs
count_over_time({app="api"} |= "error" [5m])
rate({app="api"} | json | status >= 500 [5m])

# Aggregation
sum by (status) (count_over_time({app="api"} | json [5m]))
```
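Log lines can also be unwrapped into numeric samples for latency-style queries; a sketch assuming the JSON payload carries a `duration_ms` field:

```logql
# p99 request duration derived from logs (assumes a duration_ms JSON field)
quantile_over_time(0.99,
  {app="api"} | json | unwrap duration_ms [5m]) by (status)
```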
## Prometheus Architecture
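A minimal `prometheus.yml` showing how the core pieces fit together (scrape configs, rule files, Alertmanager wiring); job names and targets are placeholders:

```yaml
global:
  scrape_interval: 15s      # default pull interval
  evaluation_interval: 15s  # how often recording/alert rules run

scrape_configs:
  - job_name: api
    static_configs:
      - targets: ['api:8080']   # placeholder target

rule_files:
  - /etc/prometheus/rules/*.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
```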
## Key Metrics to Monitor

**Application (RED method):**

- Rate: requests per second
- Errors: error rate (5xx / total)
- Duration: latency percentiles (p50, p95, p99)

**Infrastructure (USE method):**

- Utilization: % of time the resource is busy
- Saturation: queue depth, throttling
- Errors: error counts
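As a sketch, the USE signals for CPU might map to node_exporter metrics like this (exact metric and label names depend on your exporter version):

```promql
# Utilization: fraction of CPU time not idle
1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)

# Saturation: 1m load average relative to core count
node_load1 / count(node_cpu_seconds_total{mode="idle"}) by (instance)

# Errors: e.g. network interface receive errors
rate(node_network_receive_errs_total[5m])
```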
## Grafana Dashboard Patterns

- Row 1: Golden signals (rate, errors, latency)
- Row 2: Resource usage (CPU, memory, disk)
- Row 3: Saturation (queue depth, connection pool)
- Row 4: Business metrics (orders, signups)
## Alert Rules

```yaml
groups:
  - name: app
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate > 1% for 5 minutes"
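The `predict_linear` expression from the PromQL section can be turned into an alert the same way; a sketch of a rule that would slot into the same `rules:` list:

```yaml
- alert: DiskWillFillIn4Hours
  expr: predict_linear(node_filesystem_avail_bytes[1h], 4*3600) < 0
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Disk predicted to fill within 4 hours"
```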
## Alertmanager Routing

```yaml
route:
  receiver: default
  routes:
    - match:
        severity: critical
      receiver: pagerduty
    - match:
        severity: warning
      receiver: slack
      repeat_interval: 4h

receivers:
  - name: pagerduty
    pagerduty_configs:
      - routing_key: <key>
  - name: slack
    slack_configs:
      - channel: '#alerts'
```
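Alertmanager can also suppress lower-severity duplicates of the same underlying issue; a sketch using `inhibit_rules` (the `equal` labels are an assumption about your labeling scheme):

```yaml
inhibit_rules:
  - source_matchers:
      - severity = critical
    target_matchers:
      - severity = warning
    equal: ['alertname', 'namespace']
```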
## Debugging with Prometheus

```bash
# Check targets
curl localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'

# Check rules
curl localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | {name: .name, state: .state}'

# Check alerts
curl localhost:9090/api/v1/alerts | jq '.data.alerts[]'

# TSDB status
curl localhost:9090/api/v1/status/tsdb | jq '.data'
```
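Config and rule files can also be validated offline with `promtool`, which ships with Prometheus (file paths here are placeholders):

```bash
# Validate the main config and any referenced rule files
promtool check config prometheus.yml

# Validate a rules file on its own
promtool check rules rules.yml

# Run an ad-hoc query against a running server
promtool query instant http://localhost:9090 'up'
```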
**Under the hood:** Prometheus uses a pull model: it scrapes targets at a configured interval (default 15s). If a target goes down, the failed scrape shows up within one interval as `up == 0`. Push-based systems (like StatsD) cannot distinguish "app crashed" from "app is healthy but idle." The pull model also means Prometheus controls the scrape rate, preventing metric floods from misbehaving applications.
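A pull-model target just serves plain text at `/metrics`. A minimal sketch using only the Python standard library (a real app would use an official client library; names here are illustrative):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Toy in-memory counters; a real app would use an official client library
COUNTERS = {"http_requests_total": 0}

def render_metrics(counters):
    """Render counters in the Prometheus text exposition format."""
    lines = []
    for name, value in counters.items():
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        COUNTERS["http_requests_total"] += 1  # count the scrape itself
        body = render_metrics(COUNTERS).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)

# To serve: HTTPServer(("", 8000), MetricsHandler).serve_forever()
```

Prometheus then pulls this endpoint on its own schedule; the app never decides when (or how often) metrics are shipped.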
## Service Monitoring Setup

```yaml
# ServiceMonitor (Prometheus Operator)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api
spec:
  selector:
    matchLabels:
      app: api
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics
```
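A ServiceMonitor selects Kubernetes Services (not Pods), so a Service with the matching label and a named `metrics` port must exist; a sketch (port numbers are placeholders):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: api
  labels:
    app: api          # matched by the ServiceMonitor selector
spec:
  selector:
    app: api
  ports:
    - name: metrics   # referenced by the ServiceMonitor endpoint port
      port: 9090
      targetPort: 9090
```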
## Recording Rules (Performance)

```yaml
groups:
  - name: api_recordings
    rules:
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)
      - record: job:http_errors:ratio5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
          / sum(rate(http_requests_total[5m])) by (job)
```
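The precomputed series can then replace the expensive inline expression in dashboards and alerts; a sketch of an alert rule reusing the ratio recorded above:

```yaml
- alert: HighErrorRate
  expr: job:http_errors:ratio5m > 0.01
  for: 5m
  labels:
    severity: critical
```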