Log Analysis & Alerting Rules - Street-Level Ops¶
Quick PromQL Recipes¶
```promql
# Is my service up?
up{job="grokdevops"}

# Request rate (RPS)
sum(rate(http_requests_total{job="grokdevops"}[5m]))

# Error percentage
sum(rate(http_requests_total{job="grokdevops",status=~"5.."}[5m]))
  / sum(rate(http_requests_total{job="grokdevops"}[5m])) * 100

# p99 latency
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job="grokdevops"}[5m])) by (le))

# CPU usage by pod (cores)
sum(rate(container_cpu_usage_seconds_total{namespace="grokdevops"}[5m])) by (pod)

# Memory usage by pod (MiB)
sum(container_memory_working_set_bytes{namespace="grokdevops"}) by (pod) / 1024 / 1024

# Pod restart count in the last hour
increase(kube_pod_container_status_restarts_total{namespace="grokdevops"}[1h])

# Available disk on PVCs (GiB)
kubelet_volume_stats_available_bytes{namespace="grokdevops"} / 1024 / 1024 / 1024
```
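To see what `histogram_quantile` does under the hood, here is a minimal Python sketch of Prometheus's bucket interpolation. It is simplified (no `+Inf` bucket handling, no edge extrapolation, no counter-reset logic), and the bucket data is hypothetical:

```python
def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative histogram buckets.

    buckets: list of (le_upper_bound, cumulative_count) sorted by bound,
    mirroring the `le` label on *_bucket series. Simplified sketch:
    real Prometheus also handles the +Inf bucket and edge cases.
    """
    total = buckets[-1][1]
    rank = q * total                      # observations below the quantile
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            # Linear interpolation inside the bucket containing the rank
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound  # rank beyond the last bucket

# 100 requests: 50 under 0.1s, 90 under 0.5s, all under 1.0s
print(round(histogram_quantile(0.99, [(0.1, 50), (0.5, 90), (1.0, 100)]), 4))  # → 0.95
```

This is why bucket layout matters: the p99 estimate is only as precise as the bucket boundaries around it.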
Gotcha: `rate()` requires a range vector spanning at least two scrape intervals. If Prometheus scrapes every 15s, `rate(...[15s])` returns nothing. Use at least `rate(...[1m])` to ensure enough data points.

Debug clue: If a PromQL query returns no data, check label values first: `http_requests_total{job="grokdevops"}` vs `http_requests_total{job="grokdevops-api"}`. Use `{__name__=~"http_request.*"}` to discover the actual metric name.

Remember: PromQL function mnemonic: Rate for Requests (counters), Deriv for Drift (gauges). Never use `rate()` on a gauge (like memory usage); it gives nonsensical results. Use `deriv()` or `delta()` instead.
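The two-sample requirement is easy to demonstrate. Here is a simplified Python model of `rate()` (real Prometheus also extrapolates to the window edges and handles counter resets, which this sketch skips):

```python
def simple_rate(samples):
    """Per-second rate over the samples that fell inside a range window.

    samples: list of (unix_timestamp, counter_value).
    Like Prometheus rate(), it needs at least 2 points to produce a result.
    Simplified: no counter-reset handling, no edge extrapolation.
    """
    if len(samples) < 2:
        return None  # a [15s] window at a 15s scrape interval hits this case
    (t0, v0), (tn, vn) = samples[0], samples[-1]
    return (vn - v0) / (tn - t0)

print(simple_rate([(0, 100)]))             # → None  (window too narrow)
print(simple_rate([(0, 100), (60, 160)]))  # → 1.0   (one request per second)
```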
Quick LogQL Recipes¶
```logql
# Recent errors
{namespace="grokdevops"} |= "error" | json | line_format "{{.ts}} {{.level}} {{.msg}}"

# Errors excluding health checks
{namespace="grokdevops"} |= "error" != "/health" != "healthcheck"

# Slow requests (>1s) from JSON logs
{namespace="grokdevops"} | json | duration > 1s

# Count errors by level
sum by (level) (count_over_time({namespace="grokdevops"} | json [5m]))

# Log volume per pod (bytes/sec)
sum by (pod) (bytes_rate({namespace="grokdevops"}[5m]))

# Error messages only (strip ANSI colors so duplicates are easy to spot)
{namespace="grokdevops"} | json | level="error" | line_format "{{.msg}}" | decolorize
```
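The `| json | duration > 1s` pipeline can be mimicked in plain Python, which is handy for testing filter logic against sample log lines before writing the LogQL. Field names here are hypothetical, and the sketch assumes `duration` is numeric seconds (LogQL also accepts duration strings like `1.5s`):

```python
import json

def slow_requests(lines, threshold_s=1.0):
    """Mimic `{...} | json | duration > 1s`: parse each line as JSON
    and keep records whose duration field exceeds the threshold."""
    slow = []
    for line in lines:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            # This sketch skips unparseable lines; LogQL instead tags
            # them with an __error__ label
            continue
        if rec.get("duration", 0) > threshold_s:
            slow.append(rec)
    return slow

logs = [
    '{"msg": "ok", "duration": 0.2}',
    '{"msg": "slow query", "duration": 2.5}',
    'not json at all',
]
print([r["msg"] for r in slow_requests(logs)])  # → ['slow query']
```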
Pattern: Multi-Window Alert (SLO-Based)¶
Instead of simple thresholds, alert when you're burning error budget too fast:
```yaml
# Fast burn: high error rate in a short window (page immediately)
- alert: ErrorBudgetFastBurn
  expr: |
    (
      sum(rate(http_requests_total{job="grokdevops",status=~"5.."}[5m]))
      / sum(rate(http_requests_total{job="grokdevops"}[5m]))
    ) > (14.4 * 0.001)  # 14.4x burn rate for a 99.9% SLO
  for: 2m
  labels:
    severity: critical
```
> **Under the hood:** The 14.4x multiplier comes from Google's SRE workbook: at 14.4x the sustainable burn rate, you consume 2% of a 30-day error budget in one hour, and would exhaust the entire budget in about 50 hours. The 2-minute `for` duration means you page only after sustained fast burn, which catches real incidents while ignoring single-request spikes.
```yaml
# Slow burn: sustained moderate error rate (notify Slack)
- alert: ErrorBudgetSlowBurn
  expr: |
    (
      sum(rate(http_requests_total{job="grokdevops",status=~"5.."}[1h]))
      / sum(rate(http_requests_total{job="grokdevops"}[1h]))
    ) > (1.2 * 0.001)  # 1.2x burn rate
  for: 30m
  labels:
    severity: warning
```
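The thresholds in both alerts fall out of simple arithmetic. A quick sketch of the burn-rate math, assuming a 30-day budget window as in the SRE workbook:

```python
def burn_rate_threshold(slo, burn_rate):
    """Error-rate threshold for a given SLO and burn-rate multiplier.
    slo=0.999 gives an error budget of 0.001 (0.1%)."""
    error_budget = 1.0 - slo
    return round(burn_rate * error_budget, 6)

def hours_to_exhaust_budget(burn_rate, window_days=30):
    """At burn_rate times the sustainable rate, the whole budget lasts this long."""
    return window_days * 24 / burn_rate

print(burn_rate_threshold(0.999, 14.4))          # → 0.0144 (the fast-burn threshold)
print(burn_rate_threshold(0.999, 1.2))           # → 0.0012 (the slow-burn threshold)
print(round(hours_to_exhaust_budget(14.4), 1))   # → 50.0 hours to burn it all
```

So the fast-burn alert fires at a 1.44% error rate, the slow-burn alert at 0.12%.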
Pattern: Alert Routing¶
```yaml
# Alertmanager config
route:
  receiver: default-slack
  group_by: [alertname, namespace]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h  # how long before re-notifying while an alert keeps firing
  routes:
    - match:
        severity: critical
      receiver: pagerduty
      repeat_interval: 1h
    - match:
        severity: warning
      receiver: slack-warnings
    - match:
        severity: info
      receiver: slack-info

receivers:
  - name: pagerduty
    pagerduty_configs:
      - service_key: <key>
  - name: slack-warnings
    slack_configs:
      - channel: '#alerts-warning'
  - name: slack-info
    slack_configs:
      - channel: '#alerts-info'
  - name: default-slack
    slack_configs:
      - channel: '#alerts'
```
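Route selection in the tree above can be modeled in a few lines: Alertmanager walks child routes top-down and the first match wins, falling back to the root receiver. (Real Alertmanager also supports `continue: true` and nested routes, which this sketch omits.)

```python
def pick_receiver(labels, child_routes, default_receiver):
    """First-match-wins routing, as in the Alertmanager config above.
    child_routes: list of {"match": {...}, "receiver": ...} dicts."""
    for route in child_routes:
        if all(labels.get(k) == v for k, v in route["match"].items()):
            return route["receiver"]
    return default_receiver

routes = [
    {"match": {"severity": "critical"}, "receiver": "pagerduty"},
    {"match": {"severity": "warning"}, "receiver": "slack-warnings"},
    {"match": {"severity": "info"}, "receiver": "slack-info"},
]
print(pick_receiver({"severity": "critical"}, routes, "default-slack"))  # → pagerduty
print(pick_receiver({"severity": "debug"}, routes, "default-slack"))     # → default-slack
```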
Gotcha: No Data vs Zero¶
```promql
# This does NOT fire when there are zero requests (returns no data):
rate(http_requests_total{status=~"5.."}[5m]) > 0.1

# This fires even with zero traffic:
absent(up{job="grokdevops"})

# Safe error rate (handles zero traffic):
(
  sum(rate(http_requests_total{job="grokdevops",status=~"5.."}[5m]))
  / sum(rate(http_requests_total{job="grokdevops"}[5m]))
) > 0.05
and
sum(rate(http_requests_total{job="grokdevops"}[5m])) > 0.1  # minimum traffic guard
```
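The `and` guard translates to: only judge the error ratio when there is meaningful traffic. In Python terms (function and parameter names are illustrative, not from any library):

```python
def should_fire(error_rps, total_rps, ratio_threshold=0.05, min_traffic_rps=0.1):
    """Mirror the guarded PromQL: fire only when the error ratio exceeds the
    threshold AND traffic is above the floor. With total_rps == 0 the ratio
    is undefined, just as the PromQL division yields no data."""
    if total_rps <= min_traffic_rps:
        return False  # too little traffic to judge; also avoids div-by-zero
    return error_rps / total_rps > ratio_threshold

print(should_fire(error_rps=0.5, total_rps=5.0))    # → True  (10% errors)
print(should_fire(error_rps=0.05, total_rps=0.05))  # → False (below traffic floor)
```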
Gotcha: Alertmanager's `group_wait` delays the first notification by the specified duration (default 30s) to batch related alerts together. During an incident, this means your first page arrives 30 seconds late. For critical alerts, set `group_wait: 0s` on the critical route to page immediately.
Gotcha: Recording Rules for Performance¶
Complex queries used in dashboards and alerts should be pre-computed:
```yaml
groups:
  - name: grokdevops-recording
    rules:
      - record: grokdevops:http_requests:rate5m
        expr: sum(rate(http_requests_total{job="grokdevops"}[5m]))
      - record: grokdevops:http_errors:rate5m
        expr: sum(rate(http_requests_total{job="grokdevops",status=~"5.."}[5m]))
      - record: grokdevops:http_error_ratio:rate5m
        expr: grokdevops:http_errors:rate5m / grokdevops:http_requests:rate5m
```
Then use the recording rule name in alerts: `grokdevops:http_error_ratio:rate5m > 0.05`
Scale note: At 100+ alert rules using the same `rate()` expression, Prometheus evaluates the same computation repeatedly. Recording rules compute it once and store the result, cutting evaluation cost from O(rules) to O(1). Name them `level:metric:operations` per the official Prometheus naming convention; e.g., `job:http_requests:rate5m` means "aggregated at job level, from the http_requests counter, using a 5-minute rate."
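The O(rules) to O(1) claim can be illustrated with a toy evaluator that counts how often the expensive expression is actually computed. Everything here is hypothetical; real Prometheus stores recorded series in its TSDB rather than a dict:

```python
class ToyPromEngine:
    def __init__(self):
        self.raw_evaluations = 0
        self.recorded = {}

    def eval_raw_rate(self):
        """Stands in for sum(rate(http_requests_total[5m])), the costly part."""
        self.raw_evaluations += 1
        return 123.4  # hypothetical RPS

    def record(self, name):
        """Recording rule: compute once per evaluation cycle, store the result."""
        self.recorded[name] = self.eval_raw_rate()

    def eval_alert(self, recorded_name=None):
        """Each alert either re-runs the raw query or reads the recorded series."""
        if recorded_name:
            return self.recorded[recorded_name]
        return self.eval_raw_rate()

engine = ToyPromEngine()
for _ in range(100):                      # 100 alerts on the raw expression
    engine.eval_alert()
print(engine.raw_evaluations)             # → 100

engine = ToyPromEngine()
engine.record("grokdevops:http_requests:rate5m")
for _ in range(100):                      # 100 alerts on the recorded series
    engine.eval_alert("grokdevops:http_requests:rate5m")
print(engine.raw_evaluations)             # → 1
```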
Checklist: Alert Review¶
Run monthly:

- [ ] Delete alerts that fired 0 times in 30 days (unless they're critical safety nets)
- [ ] Delete alerts that fired but nobody acted on
- [ ] Tune thresholds for alerts with a >10% false-positive rate
- [ ] Ensure every alert has a `runbook_url`
- [ ] Check that alert routing still matches team structure
War story: A team had 400 alerts, of which 350 were ignored daily. On-call engineers developed "alert blindness" and missed a real database failover because it was one notification among dozens. After pruning to 40 actionable alerts, mean time to acknowledge dropped from 45 minutes to 3 minutes.