Observability Deep Dive - Street Ops

What experienced SREs know about running observability systems in production.

Incident Runbooks

Cardinality Explosion

1. Symptoms:
   - Prometheus memory usage spiking
   - Queries getting slow or timing out
   - "too many time series" errors
   - Prometheus OOM restarts

2. Identify the culprit:
   # Top 10 metrics by cardinality
   topk(10, count by (__name__)({__name__=~".+"}))

   # Check a specific metric's label values
   count(http_requests_total) by (path)
   # If this returns thousands of unique paths, you found it

   # TSDB status API (Prometheus)
   curl localhost:9090/api/v1/status/tsdb | jq .

3. Common causes:
   - Unbounded labels: user IDs, session IDs, request IDs as label values
   - High-cardinality paths: /users/12345 instead of /users/{id}
   - Auto-generated metric names
   - Pod churn in Kubernetes (each new pod = new time series)

4. Fix:
   # Drop or relabel the offending metric at scrape time:
   scrape_configs:
     - job_name: 'myapp'
       metric_relabel_configs:
         - source_labels: [__name__]
           regex: 'expensive_metric_.*'
           action: drop
         - source_labels: [path]
           # relabel regexes are fully anchored; capture the rest of the path
           regex: '/users/[0-9]+(.*)'
           target_label: path
           replacement: '/users/{id}$1'

   # Or fix it in the application: normalize paths before recording
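   A minimal application-side sketch of that normalization (the regexes and the `{id}`/`{hash}` placeholder names are assumptions; adapt them to your routing scheme):

   ```python
   import re

   # Hypothetical normalizer: collapse ID-like path segments before the path
   # is used as a metric label, so /users/12345 and /users/67890 share one series.
   ID_SEGMENT = re.compile(r"/\d+(?=/|$)")            # numeric IDs
   HEX_SEGMENT = re.compile(r"/[0-9a-f]{8,}(?=/|$)")  # hashes, UUID-ish segments

   def normalize_path(path: str) -> str:
       path = ID_SEGMENT.sub("/{id}", path)
       path = HEX_SEGMENT.sub("/{hash}", path)
       return path
   ```

   Record http_requests_total with normalize_path(request.path) rather than the raw path, so the label set stays bounded.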

5. Prevention:
   - Review new metrics in PR: check for unbounded labels
   - Set per-scrape series limits in Prometheus
   - Monitor Prometheus TSDB cardinality as a metric itself
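   Per-scrape limits are set in the scrape config; the limit values below are illustrative, not recommendations:

   ```yaml
   scrape_configs:
     - job_name: 'myapp'
       sample_limit: 50000   # fail the scrape rather than ingest a cardinality bomb
       label_limit: 30       # cap labels per series (newer Prometheus versions)
   ```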

Alert Fatigue

1. Symptoms:
   - On-call engineers ignoring alerts
   - Hundreds of alerts firing simultaneously
   - Same alerts firing and resolving repeatedly (flapping)
   - Alerts that never have an actionable response

2. Audit current alerts:
   - List all alerts: how many fired in the last 30 days?
   - For each alert: was a human action taken?
   - If no action was taken > 50% of the time, the alert is noise
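   One way to get the firing counts is the synthetic ALERTS series Prometheus exposes (assuming your retention covers the window):

   ```
   # Alerts ranked by how often they were firing over 30 days:
   sort_desc(count by (alertname) (count_over_time(ALERTS{alertstate="firing"}[30d])))
   ```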

3. Fix alert quality:
   # Too sensitive (fires on brief spikes):
   # BAD:
   expr: cpu_usage > 80
   for: 1m

   # BETTER (sustained high CPU):
   expr: avg_over_time(cpu_usage[15m]) > 80
   for: 10m

   # Wrong threshold:
   # Don't alert on 80% disk. Alert on "disk will be full in 4 hours":
   expr: predict_linear(node_filesystem_avail_bytes[6h], 4*3600) < 0

   # Flapping:
   # Add hysteresis by increasing 'for' duration
   # Or use different thresholds for firing vs resolving
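   Recent Prometheus versions (2.42+) also support keep_firing_for, which holds an alert in the firing state through brief dips below the threshold; a sketch using the CPU alert from above:

   ```yaml
   - alert: HighCPU
     expr: avg_over_time(cpu_usage[15m]) > 80
     for: 10m
     keep_firing_for: 15m   # stay firing through short dips, damping flapping
   ```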

4. Alert classification:
   - Page (wake someone up): service is DOWN for users
   - Ticket (next business day): degraded but functional
   - Dashboard (informational): useful context, no action needed

5. Grouping and routing:
   # Alertmanager groups related alerts:
   group_by: ['alertname', 'cluster', 'namespace']
   # Instead of 50 individual pod alerts, you get 1 grouped alert
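   The grouping timings matter as much as the keys; a minimal Alertmanager route sketch (receiver name and timing values are illustrative):

   ```yaml
   route:
     receiver: oncall
     group_by: ['alertname', 'cluster', 'namespace']
     group_wait: 30s       # batch alerts that arrive together into one notification
     group_interval: 5m    # wait before notifying about new alerts in the group
     repeat_interval: 4h   # re-notify while the group is still firing
   ```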

6. Silence and inhibition:
   # During maintenance:
   amtool silence add --comment="maintenance window" --duration=2h alertname=~".+" cluster=staging
   # Inhibition: if the cluster is down, don't also alert on every pod
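   Inhibition is configured in Alertmanager; this sketch assumes a ClusterDown alert exists and that alerts share a cluster label:

   ```yaml
   inhibit_rules:
     - source_matchers: ['alertname = ClusterDown']
       target_matchers: ['severity = warning']
       equal: ['cluster']   # only suppress alerts from the same cluster
   ```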

Prometheus OOM / Performance Issues

1. Symptoms:
   - Prometheus restarts frequently
   - Queries time out
   - Dashboard load times > 10 seconds
   - WAL replay takes minutes after restart

2. Diagnose:
   # Check memory and series count
   process_resident_memory_bytes{job="prometheus"}
   prometheus_tsdb_head_series
   prometheus_tsdb_head_chunks

   # Check query performance
   prometheus_engine_query_duration_seconds

   # Check scrape performance
   prometheus_target_scrape_pool_exceeded_target_limit_total
   scrape_duration_seconds

3. Quick fixes:
   - Reduce retention: --storage.tsdb.retention.time=7d (default is 15d)
   - Increase the scrape interval for noisy targets (scrape every 30s instead of 15s)
   - Drop unused metrics with metric_relabel_configs
   - Add recording rules for expensive queries:

     groups:
       - name: precomputed
         rules:
           - record: job:http_requests:rate5m
             expr: sum(rate(http_requests_total[5m])) by (job)

4. Scale out:
   - Thanos / Cortex / Mimir for long-term storage and horizontal scaling
   - Shard Prometheus: each instance scrapes a subset of targets
   - Use remote_write to send data to long-term storage

5. Memory sizing rule of thumb:
   - ~2-3 KB per active time series in memory
   - 1 million series ≈ 2-3 GB RAM
   - Account for query spikes (can 2-3x memory during heavy queries)
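   The rule of thumb above as a quick back-of-the-envelope calculation (the constants are the rough figures from the list, not guarantees):

   ```python
   def prometheus_memory_estimate_gb(active_series: int,
                                     bytes_per_series: int = 3 * 1024,
                                     query_headroom: float = 2.0) -> float:
       """Rough RAM estimate: ~3 KB per active series, doubled for query spikes."""
       return active_series * bytes_per_series * query_headroom / 1e9

   # 1 million series with 2x query headroom comes to roughly 6.1 GB
   ```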

Dashboard Sprawl

1. Problem:
   - 200 dashboards, nobody knows which to use
   - Duplicate dashboards for the same service
   - Stale dashboards with broken queries
   - No consistent layout or naming

2. Organize:
   - Folder structure: Infrastructure / Applications / Business
   - Naming convention: [Team] Service - Purpose
     Example: [Platform] Kubernetes - Node Health
   - Star/favorite critical dashboards
   - Delete dashboards nobody viewed in 90 days

3. Standardize:
   - Use dashboard-as-code (Grafonnet, Terraform Grafana provider)
   - Template dashboards: one template for all microservices
   - Standard variable names: $namespace, $pod, $instance
   - Consistent color coding: green=good, yellow=warning, red=critical

4. Layout pattern (per service):
   Row 1: Key SLIs (error rate, latency p50/p95/p99, throughput)
   Row 2: Resource usage (CPU, memory, disk, network)
   Row 3: Dependencies (database, cache, external APIs)
   Row 4: Kubernetes (pod status, restarts, resource requests vs actual)

SLI/SLO Implementation

1. Define SLIs:
   # Availability SLI
   sum(rate(http_requests_total{status!~"5.."}[30d]))
   /
   sum(rate(http_requests_total[30d]))

   # Latency SLI
   sum(rate(http_request_duration_seconds_bucket{le="0.5"}[30d]))
   /
   sum(rate(http_request_duration_seconds_count[30d]))

2. Set SLOs:
   - Availability: 99.9% (error budget: 43 min/month)
   - Latency: 99% of requests < 500ms
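   The error-budget arithmetic behind those numbers: 99.9% over a 30-day window allows (1 - 0.999) × 30 × 24 × 60 ≈ 43.2 minutes of full outage.

   ```python
   def error_budget_minutes(slo: float, window_days: int = 30) -> float:
       # Allowed "all requests failing" time: (1 - SLO) of the window, in minutes.
       return (1 - slo) * window_days * 24 * 60

   # error_budget_minutes(0.999) is about 43.2 minutes per 30 days
   ```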

3. Track error budget:
   # Error budget remaining (as a ratio of 1):
   1 - (
     sum(rate(http_requests_total{status=~"5.."}[30d]))
     /
     sum(rate(http_requests_total[30d]))
   )
   /
   (1 - 0.999)

   # If this goes below 0, you've exhausted your error budget

4. Alert on error budget burn rate:
   # Fast burn (alert quickly for major incidents).
   # error_rate stands for a short-window error-ratio recording rule,
   # e.g. computed over 1h for fast burn:
   - alert: ErrorBudgetFastBurn
     expr: error_rate > 14.4 * (1 - 0.999)
     for: 5m

   # Slow burn (alert on sustained degradation):
   - alert: ErrorBudgetSlowBurn
     expr: error_rate > 1.0 * (1 - 0.999)
     for: 6h
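   The error_rate in those alerts can be backed by recording rules over the two burn windows; the rule names here are illustrative:

   ```yaml
   groups:
     - name: slo-burn
       rules:
         - record: job:slo_errors:ratio_rate1h
           expr: |
             sum(rate(http_requests_total{status=~"5.."}[1h])) by (job)
               /
             sum(rate(http_requests_total[1h])) by (job)
         - record: job:slo_errors:ratio_rate6h
           expr: |
             sum(rate(http_requests_total{status=~"5.."}[6h])) by (job)
               /
             sum(rate(http_requests_total[6h])) by (job)
   ```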

Gotchas & War Stories

The metric that disappeared

A team changed their application's metric names in a deploy. All dashboards and alerts broke instantly because they referenced the old names. Prevention: treat metric names as a public API. Use recording rules as an abstraction layer between raw metrics and dashboards/alerts.

The 3am alert that nobody could diagnose

Alert fired: "High latency on payment service." On-call checked the payment dashboard and it looked fine. It turned out the alert used a different time range than the dashboard. The alert was right; the dashboard was misleading. Prevention: link alerts directly to the relevant dashboards with the same time range.

Log volume cost explosion

A team moved to a new logging platform and got a $50K/month bill because every debug log from every service was being shipped. Prevention: set log levels appropriately (INFO in production, DEBUG only when troubleshooting). Sample verbose logs instead of collecting everything. Use Loki's label-based approach to control costs.

The query that killed Prometheus

Someone put rate(http_requests_total[5m]) without any label filters on a dashboard with auto-refresh. Every 10 seconds it queried across all services, all instances, all paths. Prevention: recording rules for expensive queries, query timeouts, and query limits in the Prometheus config.

Monitoring the monitoring

Prometheus goes down. Who alerts you? Set up a dead man's switch: an alert that fires continuously, paired with an external system that pages when the heartbeat stops arriving. Or use a separate, simple monitoring system (even a cron job that checks Prometheus health) as a meta-monitor.
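A dead man's switch in Prometheus terms is an always-true alert routed to an external heartbeat service; the receiver that turns a missing heartbeat into a page is outside Prometheus and is assumed here:

```yaml
groups:
  - name: meta-monitoring
    rules:
      - alert: DeadMansSwitch
        # vector(1) is always true, so this alert fires continuously;
        # the external service pages when its notifications *stop* arriving.
        expr: vector(1)
        labels:
          severity: none
```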

Essential PromQL Patterns

# Request rate
sum(rate(http_requests_total[5m])) by (service)

# Error rate percentage
100 * sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
/ sum(rate(http_requests_total[5m])) by (service)

# Latency percentiles (from histograms)
histogram_quantile(0.50, sum(rate(http_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.95, sum(rate(http_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.99, sum(rate(http_duration_seconds_bucket[5m])) by (le))

# Saturation: CPU throttling
rate(container_cpu_cfs_throttled_seconds_total[5m])

# Memory working set vs limit (the working set is what the OOM killer considers)
container_memory_working_set_bytes / container_spec_memory_limit_bytes

# Disk fill prediction
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 24*3600)

# Rate of change (useful for spotting anomalies)
deriv(process_resident_memory_bytes[1h])

# Absent (alert when a metric disappears)
absent(up{job="critical-service"})

Quick Reference