Log Analysis & Alerting Rules - Street-Level Ops

Quick PromQL Recipes

# Is my service up?
up{job="grokdevops"}

# Request rate (RPS)
sum(rate(http_requests_total{job="grokdevops"}[5m]))

# Error percentage
sum(rate(http_requests_total{job="grokdevops",status=~"5.."}[5m]))
/ sum(rate(http_requests_total{job="grokdevops"}[5m])) * 100

# p99 latency
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job="grokdevops"}[5m])) by (le))

# CPU usage by pod (cores)
sum(rate(container_cpu_usage_seconds_total{namespace="grokdevops"}[5m])) by (pod)

# Memory usage by pod (MB)
sum(container_memory_working_set_bytes{namespace="grokdevops"}) by (pod) / 1024 / 1024

# Pod restart count in last hour
increase(kube_pod_container_status_restarts_total{namespace="grokdevops"}[1h])

# Available disk on PVCs
kubelet_volume_stats_available_bytes{namespace="grokdevops"} / 1024 / 1024 / 1024

Gotcha: rate() needs at least two samples inside its range window. If Prometheus scrapes every 15s, rate(...[15s]) covers at most one sample and returns nothing. Use a window of at least 4x the scrape interval — rate(...[1m]) or wider — to get stable results.
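To see the difference, compare the two windows against a 15s scrape interval (sketch, using the metric names from above):

```promql
# 15s scrape interval: the window holds at most one sample -> empty result
rate(http_requests_total{job="grokdevops"}[15s])

# Works: a 1m window covers several samples
rate(http_requests_total{job="grokdevops"}[1m])
```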

Debug clue: If a PromQL query returns no data, check label names first: http_requests_total{job="grokdevops"} vs http_requests_total{job="grokdevops-api"}. Use {__name__=~"http_request.*"} to discover the actual metric name.

Remember: PromQL function mnemonic: Rate for Requests (counters), Deriv for Drift (gauges). Never use rate() on a gauge (like memory usage) — rate() treats any decrease as a counter reset and produces garbage. Use deriv() or delta() instead.
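Applied to a memory gauge, that looks like this (a sketch; metric names as used in the recipes above):

```promql
# WRONG: rate() on a gauge misreads every dip as a counter reset
rate(container_memory_working_set_bytes{namespace="grokdevops"}[15m])

# Per-second growth trend of a gauge (least-squares slope over the window)
deriv(container_memory_working_set_bytes{namespace="grokdevops"}[15m])

# Absolute change over the window
delta(container_memory_working_set_bytes{namespace="grokdevops"}[1h])
```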

Quick LogQL Recipes

# Recent errors
{namespace="grokdevops"} |= "error" | json | line_format "{{.ts}} {{.level}} {{.msg}}"

# Errors excluding health checks
{namespace="grokdevops"} |= "error" != "/health" != "healthcheck"

# Slow requests (>1s) from JSON logs
{namespace="grokdevops"} | json | duration > 1s

# Count log lines by level
sum by (level) (count_over_time({namespace="grokdevops"} | json [5m]))

# Log volume per pod (bytes/sec)
sum by (pod) (bytes_rate({namespace="grokdevops"}[5m]))

# Find unique error messages
{namespace="grokdevops"} | json | level="error" | line_format "{{.msg}}" | decolorize
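If request durations land in JSON logs, latency percentiles can also be computed straight from Loki, without a Prometheus histogram. A sketch, assuming a `duration` field that LogQL's `duration()` unwrap conversion can parse:

```logql
# p99 request duration per pod, derived from logs
quantile_over_time(0.99,
  {namespace="grokdevops"} | json | unwrap duration(duration) [5m]
) by (pod)
```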

Pattern: Multi-Window Alert (SLO-Based)

Instead of simple thresholds, alert when you're burning error budget too fast:

# Fast burn: high error rate in short window (page immediately)
- alert: ErrorBudgetFastBurn
  expr: |
    (
      sum(rate(http_requests_total{job="grokdevops",status=~"5.."}[5m]))
      / sum(rate(http_requests_total{job="grokdevops"}[5m]))
    ) > (14.4 * 0.001)  # 14.4x burn rate for 99.9% SLO
  for: 2m
  labels:
    severity: critical

> **Under the hood:** The 14.4x multiplier comes from Google's SRE workbook: at 14.4x the allowed burn rate, you consume 2% of a 30-day error budget every hour, and the whole budget is gone in about 50 hours. The 2-minute `for` duration means you page only after sustained fast burn, which catches real incidents while ignoring single-request spikes.
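A quick sanity check of the arithmetic (plain Python; the only assumption is the SRE workbook's 30-day SLO window):

```python
# Burn-rate math behind the 14.4x multiplier for a 99.9% SLO.
SLO = 0.999
error_budget = 1 - SLO           # 0.001 of requests may fail per 30 days
window_hours = 30 * 24           # 30-day SLO window = 720 hours

burn_rate = 14.4
# Fraction of the monthly budget consumed per hour at this burn rate:
budget_per_hour = burn_rate / window_hours
print(f"{budget_per_hour:.1%} of monthly budget per hour")

# Hours until the whole budget is gone at a sustained 14.4x burn:
hours_to_exhaust = window_hours / burn_rate
print(f"budget exhausted in {hours_to_exhaust:.0f} hours")
```

The same arithmetic explains the slow-burn rule below: 1.2x burn over a 1h window leaves days of headroom, so Slack is enough.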

# Slow burn: sustained moderate error rate (Slack)
- alert: ErrorBudgetSlowBurn
  expr: |
    (
      sum(rate(http_requests_total{job="grokdevops",status=~"5.."}[1h]))
      / sum(rate(http_requests_total{job="grokdevops"}[1h]))
    ) > (1.2 * 0.001)  # 1.2x burn rate
  for: 30m
  labels:
    severity: warning

Pattern: Alert Routing

# Alertmanager config
route:
  receiver: default-slack
  group_by: [alertname, namespace]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h    # How long before re-notifying if alert is still firing
  routes:
    - match:
        severity: critical
      receiver: pagerduty
      repeat_interval: 1h
    - match:
        severity: warning
      receiver: slack-warnings
    - match:
        severity: info
      receiver: slack-info

receivers:
  - name: pagerduty
    pagerduty_configs:
      - service_key: <key>
  - name: slack-warnings
    slack_configs:
      - channel: '#alerts-warning'
  - name: default-slack
    slack_configs:
      - channel: '#alerts'

Gotcha: No Data vs Zero

# This does NOT fire when there are zero requests (returns no data):
rate(http_requests_total{status=~"5.."}[5m]) > 0.1

# This fires even with zero traffic:
absent(up{job="grokdevops"})

# Safe error rate (handle zero traffic):
(
  sum(rate(http_requests_total{job="grokdevops",status=~"5.."}[5m]))
  / sum(rate(http_requests_total{job="grokdevops"}[5m]))
) > 0.05
and
sum(rate(http_requests_total{job="grokdevops"}[5m])) > 0.1  # minimum traffic

Gotcha: Alertmanager's group_wait delays the first notification by the specified duration (default 30s) to batch related alerts together. During an incident, this means your first page arrives 30 seconds late. For critical alerts, set group_wait: 0s on the critical route to page immediately.
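A sketch of that override on the critical route (same field names as the Alertmanager config above; child routes inherit the parent's settings unless overridden):

```yaml
routes:
  - match:
      severity: critical
    receiver: pagerduty
    group_wait: 0s        # first page goes out immediately, no batching delay
    repeat_interval: 1h
```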

Pattern: Recording Rules for Performance

Complex queries used in dashboards and alerts should be pre-computed:

groups:
  - name: grokdevops-recording
    rules:
      - record: grokdevops:http_requests:rate5m
        expr: sum(rate(http_requests_total{job="grokdevops"}[5m]))

      - record: grokdevops:http_errors:rate5m
        expr: sum(rate(http_requests_total{job="grokdevops",status=~"5.."}[5m]))

      - record: grokdevops:http_error_ratio:rate5m
        expr: |
          grokdevops:http_errors:rate5m / grokdevops:http_requests:rate5m

Then use the recording rule name in alerts: grokdevops:http_error_ratio:rate5m > 0.05
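For example (the alert name here is illustrative; the recording rule names come from the group above):

```yaml
- alert: HighErrorRatio
  expr: grokdevops:http_error_ratio:rate5m > 0.05   # cheap: reads a precomputed series
  for: 5m
  labels:
    severity: warning
```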

Scale note: At 100+ alert rules using the same rate() expression, Prometheus evaluates the same computation repeatedly. Recording rules compute once and store the result, cutting evaluation time from O(rules) to O(1). Name them level:metric:operations per the official Prometheus naming convention — e.g., job:http_requests:rate5m means "aggregated at job level, from http_requests counter, using 5-minute rate."

Checklist: Alert Review

Run monthly:

- [ ] Delete alerts that fired 0 times in 30 days (unless they're critical safety nets)
- [ ] Delete alerts that fired but nobody acted on
- [ ] Tune thresholds for alerts with >10% false positive rate
- [ ] Ensure every alert has a runbook_url
- [ ] Check that alert routing still matches team structure

War story: A team had 400 alerts, of which 350 were ignored daily. On-call engineers developed "alert blindness" and missed a real database failover because it was one notification among dozens. After pruning to 40 actionable alerts, mean time to acknowledge dropped from 45 minutes to 3 minutes.