
Observability Cheat Sheet

Remember: Two complementary monitoring frameworks: RED for services (Request rate, Error rate, Duration) and USE for infrastructure (Utilization, Saturation, Errors). RED answers "are my users happy?" while USE answers "are my resources healthy?" Use both: RED dashboards for service owners, USE dashboards for platform teams.

Three Pillars

Pillar    Tool           Query Language
-------   ------------   --------------
Metrics   Prometheus     PromQL
Logs      Loki           LogQL
Traces    Tempo/Jaeger   TraceQL

Gotcha: rate() only works on counters (monotonically increasing values). Using rate() on a gauge produces nonsense. For gauges, use deriv() (rate of change) or delta() (difference over window). Conversely, never graph a raw counter — it only goes up. Always wrap counters in rate() or increase() to get useful per-second or per-window values.
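A quick sketch of the split (metric names other than http_requests_total are illustrative node_exporter-style names):

```promql
# Counters: always wrap in rate() or increase()
rate(http_requests_total[5m])            # per-second rate
increase(http_requests_total[1h])        # total increase over 1h

# Gauges: use deriv() or delta(), never rate()
deriv(node_memory_MemAvailable_bytes[10m])   # bytes/sec trend
delta(node_memory_MemAvailable_bytes[1h])    # change over the window
```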

PromQL Quick Reference

# Instant vector
http_requests_total{job="api", status="200"}

# Range vector (last 5m)
http_requests_total[5m]

# Rate (per-second average over 5m)
rate(http_requests_total[5m])

# Aggregations
sum(rate(http_requests_total[5m])) by (service)
avg(container_memory_usage_bytes) by (pod)
topk(5, rate(http_requests_total[5m]))

# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# Histogram quantile (p99 latency): aggregate buckets by le first
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Predict disk full in 4 hours
predict_linear(node_filesystem_avail_bytes[1h], 4*3600) < 0

LogQL Quick Reference

# Basic
{namespace="prod", app="api"}

# Filter
{app="api"} |= "error"
{app="api"} !~ "health|readiness"
{app="api"} | json | status >= 500

# Metrics from logs
count_over_time({app="api"} |= "error" [5m])
rate({app="api"} | json | status >= 500 [5m])

# Aggregation
sum by (status) (count_over_time({app="api"} | json [5m]))

Prometheus Architecture

Targets → Prometheus (scrape) → TSDB
          Prometheus → Alertmanager → PagerDuty/Slack
          Grafana → Prometheus (query + dashboards)
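A minimal scrape configuration behind that diagram might look like this (job names and ports are illustrative):

```yaml
# prometheus.yml (sketch)
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: api
    static_configs:
      - targets: ["api:8080"]

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]
```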

Key Metrics to Monitor

Application (RED method):
  Rate:     requests per second
  Errors:   error rate (5xx / total)
  Duration: latency percentiles (p50, p95, p99)

Infrastructure (USE method):
  Utilization: % resource busy
  Saturation:  queue depth, throttling
  Errors:      error counts
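Assuming standard node_exporter metric names, the USE signals map to PromQL roughly as:

```promql
# Utilization: fraction of CPU time spent not idle
1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)

# Saturation: run-queue length relative to core count
node_load1 / on(instance) count(node_cpu_seconds_total{mode="idle"}) by (instance)

# Errors: network receive errors per second
rate(node_network_receive_errs_total[5m])
```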

Grafana Dashboard Patterns

Row 1: Golden signals (rate, errors, latency)
Row 2: Resource usage (CPU, memory, disk)
Row 3: Saturation (queue depth, connection pool)
Row 4: Business metrics (orders, signups)

Alert Rules

groups:
- name: app
  rules:
  - alert: HighErrorRate
    expr: |
      sum(rate(http_requests_total{status=~"5.."}[5m]))
      / sum(rate(http_requests_total[5m])) > 0.01
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Error rate > 1% for 5 minutes"
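A companion warning-level rule for latency could sit in the same rules list; the 500ms threshold is illustrative, not a recommendation:

```yaml
  - alert: HighLatencyP99
    expr: |
      histogram_quantile(0.99,
        sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 0.5
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "p99 latency > 500ms for 10 minutes"
```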

Alertmanager Routing

route:
  receiver: default
  routes:
  - match:
      severity: critical
    receiver: pagerduty
  - match:
      severity: warning
    receiver: slack
    repeat_interval: 4h

receivers:
- name: pagerduty
  pagerduty_configs:
  - routing_key: <key>
- name: slack
  slack_configs:
  - channel: '#alerts'

Debugging with Prometheus

# Check targets
curl localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'

# Check rules
curl localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | {name: .name, state: .state}'

# Check alerts
curl localhost:9090/api/v1/alerts | jq '.data.alerts[]'

# TSDB status
curl localhost:9090/api/v1/status/tsdb | jq '.data'

Under the hood: Prometheus uses a pull model: it scrapes targets at a configured interval (default 15s). If a target is down, Prometheus detects it at the next scrape (the scrape fails and the target's up metric drops to 0). Push-based systems (like StatsD) cannot distinguish "app crashed" from "app is healthy but idle." The pull model also means Prometheus controls the scrape rate, preventing metric floods from misbehaving applications.
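To illustrate the pull side, here is a minimal sketch of a /metrics endpoint using only the Python standard library. The metric reuses the cheat sheet's http_requests_total; a real service would use an official Prometheus client library rather than hand-rolling the text format:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counter for the sketch.
REQUEST_COUNT = 0

def render_metrics() -> str:
    # Prometheus text exposition format: HELP/TYPE comments, then samples.
    return (
        "# HELP http_requests_total Total HTTP requests handled.\n"
        "# TYPE http_requests_total counter\n"
        f'http_requests_total{{status="200"}} {REQUEST_COUNT}\n'
    )

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        global REQUEST_COUNT
        REQUEST_COUNT += 1  # counters only ever go up
        body = render_metrics().encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the sketch quiet

# To expose a scrape target on :9100, uncomment:
# HTTPServer(("", 9100), MetricsHandler).serve_forever()
```

Prometheus would then scrape this endpoint on its own schedule; the app never pushes anything.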

Service Monitoring Setup

# ServiceMonitor (Prometheus Operator)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api
spec:
  selector:
    matchLabels:
      app: api
  endpoints:
  - port: metrics
    interval: 15s
    path: /metrics
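The ServiceMonitor's selector matches labels on a Service (not on Pods), and the endpoint port refers to a named Service port. A companion Service might look like this (port numbers are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: api
  labels:
    app: api          # matched by the ServiceMonitor selector
spec:
  selector:
    app: api
  ports:
  - name: metrics     # must match the ServiceMonitor endpoint port name
    port: 9090
    targetPort: 9090
```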

Recording Rules (Performance)

groups:
- name: api_recordings
  rules:
  - record: job:http_requests:rate5m
    expr: sum(rate(http_requests_total[5m])) by (job)
  - record: job:http_errors:ratio5m
    expr: |
      sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
      / sum(rate(http_requests_total[5m])) by (job)
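Recorded series then keep alert expressions cheap; for example, the HighErrorRate rule above could be rewritten against the precomputed ratio:

```yaml
  - alert: HighErrorRate
    expr: job:http_errors:ratio5m > 0.01
    for: 5m
```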