Portal | Level: L2: Operations | Topics: Alerting Rules, Prometheus | Domain: Observability
Log Analysis & Alerting Rules (PromQL / LogQL) - Primer¶
Why This Matters¶
Setting up Prometheus and Loki is step one. Writing useful alerts and queries is where the real value lives. Bad alerts cause alert fatigue and missed incidents. Good alerts wake you up only when customers are affected. This pack covers writing PromQL and LogQL for real-world monitoring and alerting.
Name origin: PromQL (Prometheus Query Language) was designed by Julius Volz and Bjoern Rabenstein at SoundCloud, where Prometheus was created in 2012. The language was influenced by Borgmon, Google's internal monitoring system, which SoundCloud ex-Googlers had used. LogQL (Loki Query Language) was designed by Grafana Labs to feel like PromQL but for log streams — the syntax is deliberately similar so that PromQL users can transfer their skills.
PromQL Fundamentals¶
Data Types¶
| Type | What it is | Example |
|---|---|---|
| Instant vector | A set of time series, one sample each | `up{job="grokdevops"}` |
| Range vector | A set of time series, a range of samples each | `http_requests_total[5m]` |
| Scalar | A single numeric value | `42` |
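One way to internalize the difference: an instant vector maps each label set to a single sample, while a range vector maps each label set to a window of samples. A toy Python model, purely illustrative and not Prometheus's actual data model:

```python
# Toy model of PromQL result types (illustrative only).
instant_vector = {          # up{job="grokdevops"}
    ("job=grokdevops", "instance=a"): 1.0,
    ("job=grokdevops", "instance=b"): 0.0,
}
range_vector = {            # http_requests_total[5m]
    ("job=grokdevops", "instance=a"): [(0, 100.0), (60, 130.0), (120, 160.0)],
}
scalar = 42                 # a plain number, no labels attached

print(len(instant_vector))  # 2 series, one sample each
```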
Essential Functions¶
# rate(): per-second rate of a counter over a range
rate(http_requests_total[5m])
# increase(): total increase of a counter over a range
increase(http_requests_total[1h])
# sum(): aggregate across labels
sum(rate(http_requests_total[5m])) by (status_code)
# avg(): average across labels
avg(container_memory_working_set_bytes) by (pod)
# histogram_quantile(): percentile from histogram
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
# absent(): fires when a metric disappears
absent(up{job="grokdevops"})
# changes(): count of value changes
changes(kube_pod_status_phase{phase="Failed"}[1h])
Selectors and Operators¶
# Label matchers
http_requests_total{method="GET", status=~"2.."} # regex match
http_requests_total{status!="200"} # not equal
# Arithmetic
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
# Comparison (for alerting)
rate(http_requests_total{status=~"5.."}[5m]) > 0.1
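Note that PromQL comparison operators filter by default: series that fail the condition are dropped from the result rather than mapped to false (PromQL's `bool` modifier gives 0/1 values instead). A tiny Python analogy of that filtering behavior:

```python
# PromQL comparison operators filter: series failing the condition
# are dropped rather than returning false. Python analogy:
error_rates = {"api": 0.25, "web": 0.02, "batch": 0.11}  # per-series values

# rate(...) > 0.1  -- keeps only the offending series
firing = {series: v for series, v in error_rates.items() if v > 0.1}
print(firing)  # {'api': 0.25, 'batch': 0.11}
```

This is why an alert `expr` works at all: the alert fires for every series the filtered expression still returns.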
Remember: Mnemonic for the Four Golden Signals: LETS — Latency, Errors, Traffic (request rate), Saturation. These come from Chapter 6 of the Google SRE book. RED (Rate, Errors, Duration) is Tom Wilkie's microservice-focused variant. USE (Utilization, Saturation, Errors) is Brendan Gregg's method for infrastructure resources. LETS/RED for services, USE for infrastructure.
The Four Golden Signals¶
Every service should be monitored for these:
1. Latency¶
# p99 latency
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket{job="grokdevops"}[5m])) by (le)
)
# p50 (median) latency
histogram_quantile(0.50,
sum(rate(http_request_duration_seconds_bucket{job="grokdevops"}[5m])) by (le)
)
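`histogram_quantile()` estimates the percentile by linear interpolation inside the bucket that contains the target rank, which is why accuracy depends on your bucket layout. A simplified Python sketch for classic cumulative-bucket histograms (illustrative; Prometheus handles more edge cases):

```python
def histogram_quantile(q, buckets):
    """Approximate PromQL histogram_quantile() for a classic histogram.

    buckets: sorted list of (upper_bound, cumulative_count) pairs,
    ending with (float('inf'), total). Linearly interpolates within
    the bucket containing the q-th rank.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # quantile falls in the +Inf bucket
            # linear interpolation inside this bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count

# le="0.1": 50 obs, le="0.5": 90, le="1.0": 99, le="+Inf": 100
buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (float("inf"), 100)]
print(histogram_quantile(0.99, buckets))  # 1.0
```

With only four buckets, the p99 snaps to a bucket boundary; finer buckets around your SLO threshold give finer estimates.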
2. Traffic (Request Rate)¶
# Total RPS
sum(rate(http_requests_total{job="grokdevops"}[5m]))
# RPS by endpoint
sum(rate(http_requests_total{job="grokdevops"}[5m])) by (handler)
3. Errors¶
# Error rate (percentage)
sum(rate(http_requests_total{job="grokdevops",status=~"5.."}[5m]))
/
sum(rate(http_requests_total{job="grokdevops"}[5m]))
* 100
# Error rate by endpoint
sum(rate(http_requests_total{status=~"5.."}[5m])) by (handler)
/
sum(rate(http_requests_total[5m])) by (handler)
* 100
4. Saturation¶
# CPU utilization per pod
sum(rate(container_cpu_usage_seconds_total{namespace="grokdevops"}[5m])) by (pod)
/
sum(kube_pod_container_resource_limits{namespace="grokdevops",resource="cpu"}) by (pod)
* 100
# Memory utilization per pod
sum(container_memory_working_set_bytes{namespace="grokdevops"}) by (pod)
/
sum(kube_pod_container_resource_limits{namespace="grokdevops",resource="memory"}) by (pod)
* 100
Writing Alert Rules¶
Anatomy of an Alert¶
groups:
- name: grokdevops-alerts
rules:
- alert: HighErrorRate
expr: |
sum(rate(http_requests_total{job="grokdevops",status=~"5.."}[5m]))
/
sum(rate(http_requests_total{job="grokdevops"}[5m]))
> 0.05
for: 5m
labels:
severity: critical
team: platform
annotations:
summary: "High error rate on grokdevops"
description: "Error rate is {{ $value | humanizePercentage }} (>5%) for 5 minutes"
runbook_url: "https://wiki.example.com/runbooks/high-error-rate"
Key Fields¶
| Field | Purpose |
|---|---|
| `expr` | PromQL expression; the alert fires for every series the expression returns |
| `for` | How long the condition must hold before firing (debounce) |
| `labels.severity` | Routes the alert to different notification channels |
| `annotations` | Human-readable context for the on-call engineer |
| `runbook_url` | Link to the fix steps |
Gotcha: The `for` duration is critical and often set too low. Without `for`, the alert fires on a single evaluation cycle — a 30-second metric blip becomes a page. With `for: 5m`, the condition must stay true across consecutive evaluations for 5 full minutes. If your Prometheus evaluation interval is 1 minute and you set `for: 5m`, the alert won't fire until the 6th evaluation (5 full minutes after the first true one). Tune `for` based on how much detection latency you can tolerate vs. how many false positives you will accept.
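The `for` debounce can be modeled as a streak counter over evaluation cycles. This is an illustrative sketch only, not Prometheus's actual state machine (which tracks pending/firing states per series):

```python
def should_fire(evaluations, for_cycles):
    """Sketch of the `for` debounce: the alert fires only once the
    condition has been true for more than `for_cycles` consecutive
    evaluation cycles (i.e. the full `for` window has elapsed).

    evaluations: list of booleans, one per evaluation cycle.
    Returns the 0-based cycle index at which the alert fires, or None.
    """
    streak = 0
    for i, condition_true in enumerate(evaluations):
        streak = streak + 1 if condition_true else 0
        if streak > for_cycles:   # condition held for the full window
            return i
    return None

# 1-minute interval, for: 5m -> fires on the 6th consecutive true cycle
print(should_fire([True] * 10, 5))  # 5
```

A short blip never fires: any False resets the streak, which is exactly the false-positive suppression `for` buys you.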
Alert Severity Guidelines¶
| Severity | Meaning | Response | Examples |
|---|---|---|---|
| critical | Customer-facing impact NOW | Page immediately | >5% error rate, service down |
| warning | Will become critical soon | Slack notification | Disk >80%, cert expiring in 7d |
| info | Informational | Dashboard only | Deploy completed, scaling event |
Essential Kubernetes Alerts¶
groups:
- name: kubernetes-alerts
rules:
# Pod not ready for 5 minutes
- alert: PodNotReady
expr: |
kube_pod_status_ready{condition="true"} == 0
for: 5m
labels:
severity: warning
# Container restarting frequently
- alert: ContainerRestartLoop
expr: |
increase(kube_pod_container_status_restarts_total[1h]) > 5
for: 5m
labels:
severity: warning
# Node not ready
- alert: NodeNotReady
expr: |
kube_node_status_condition{condition="Ready",status="true"} == 0
for: 5m
labels:
severity: critical
# PVC almost full
- alert: PVCAlmostFull
expr: |
kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.85
for: 15m
labels:
severity: warning
# Deployment replica mismatch
- alert: DeploymentReplicaMismatch
expr: |
        kube_deployment_spec_replicas != kube_deployment_status_replicas_available
for: 10m
labels:
severity: warning
LogQL Fundamentals¶
LogQL is Loki's query language, similar to PromQL but for logs.
Log Stream Selectors¶
# Select by labels
{namespace="grokdevops", container="grokdevops"}
# Filter by content
{namespace="grokdevops"} |= "error" # contains
{namespace="grokdevops"} !~ "health" # not matching regex
{namespace="grokdevops"} |= "error" != "healthcheck" # chained
Parsers¶
# JSON parsing
{namespace="grokdevops"} | json | status >= 500
# Logfmt parsing
{namespace="grokdevops"} | logfmt | level="error"
# Pattern parsing
{namespace="grokdevops"} | pattern `<ip> - - [<timestamp>] "<method> <path> <_>" <status> <size>`
| status >= 500
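The `logfmt` parser extracts `key=value` pairs from each line into labels that later filters can use. A minimal Python sketch of the idea (using `shlex` to handle quoted values), not Loki's actual parser:

```python
import shlex

def parse_logfmt(line):
    """Minimal sketch of what LogQL's logfmt parser does: extract
    key=value pairs from a log line into a label map. shlex handles
    quoted values like msg="connection refused"."""
    fields = {}
    for token in shlex.split(line):
        if "=" in token:
            key, _, value = token.partition("=")
            fields[key] = value
    return fields

line = 'level=error msg="connection refused" status=502'
print(parse_logfmt(line))
# {'level': 'error', 'msg': 'connection refused', 'status': '502'}
```

Note all extracted values are strings; LogQL's label filters (like `status >= 500`) attempt numeric comparison on them.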
Metric Queries (Log-Based Metrics)¶
# Error rate from logs
sum(rate({namespace="grokdevops"} |= "error" [5m]))
# Count of 5xx responses per minute
sum by (status) (
count_over_time({namespace="grokdevops"} | json | status >= 500 [1m])
)
# p99 request duration from logs
quantile_over_time(0.99,
{namespace="grokdevops"} | json | unwrap duration [5m]
) by (path)
# Bytes processed per hour
sum(bytes_over_time({namespace="grokdevops"}[1h]))
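These log-based metric functions slide a lookback window over the matching log lines and aggregate per evaluation. A minimal Python sketch of `count_over_time` combined with a `|=` line filter (illustrative, not Loki's implementation):

```python
def count_over_time(log_lines, window_s, now_s, needle):
    """Sketch of LogQL count_over_time() with a |= filter: count
    lines containing `needle` whose timestamp falls inside the
    lookback window (now - window, now]."""
    return sum(
        1 for ts, line in log_lines
        if now_s - window_s < ts <= now_s and needle in line
    )

logs = [
    (15, 'level=error msg="boom"'),
    (50, 'level=info msg="ok"'),
    (65, 'level=error msg="boom again"'),
]
print(count_over_time(logs, 60, 70, "error"))  # 2 error lines in the last 60s
```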
LogQL Alert Rules (Loki Ruler)¶
groups:
- name: loki-alerts
rules:
- alert: HighLogErrorRate
expr: |
sum(rate({namespace="grokdevops"} |= "level=error" [5m])) > 1
for: 5m
labels:
severity: warning
annotations:
summary: "High error log rate in grokdevops"
- alert: PanicDetected
expr: |
count_over_time({namespace="grokdevops"} |= "panic" [5m]) > 0
for: 0m
labels:
severity: critical
annotations:
summary: "Panic detected in grokdevops logs"
Alert Anti-Patterns¶
| Anti-pattern | Why it's bad | Fix |
|---|---|---|
| Alert on symptoms AND causes | Duplicate notifications | Alert on customer-facing symptoms only |
| No `for` duration | Fires on transient spikes | Add `for: 5m` minimum |
| No `runbook_url` | On-call engineer doesn't know what to do | Link every alert to a runbook |
| Alert on everything | Alert fatigue, pages get ignored | Only page for customer impact |
| Threshold too sensitive | False positives | Use percentiles, not averages |
| No severity levels | Everything is treated equally | Use critical/warning/info |
Common Pitfalls¶
War story: Rob Ewaschuk's classic paper "My Philosophy on Alerting" (which informed Chapter 6 of the Google SRE book) established the principle: pages should be about imminent or current customer impact, not about potential future problems. This single rule eliminates most alert fatigue. A disk at 80% is a Slack notification (warning). A disk at 99% is a page (critical). Predicted to fill in 4 hours is a page; predicted to fill in 2 weeks is a ticket.
- Using `avg` for latency — Averages hide tail latency. Use percentiles (p95, p99).
- Missing `rate()` on counters — Raw counter values only ever increase (until a reset). Wrap them in `rate()`.
- Wrong range duration — `[1m]` is too noisy, `[1h]` is too slow. Start with `[5m]`.
- Alerting on low-traffic services — A single error on 10 requests = 10% error rate. Add a minimum request threshold.
- High-cardinality labels — Labels like `user_id` or `request_id` explode metric storage.
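To see why the first pitfall matters, here is a toy latency distribution where the average looks healthy while the tail is terrible:

```python
# Illustrative only: 98 fast requests and 2 very slow ones.
latencies_ms = [20] * 98 + [2000] * 2

avg = sum(latencies_ms) / len(latencies_ms)
p99 = sorted(latencies_ms)[int(0.99 * len(latencies_ms))]

print(f"avg = {avg:.0f} ms")   # ~60 ms: looks fine on a dashboard
print(f"p99 = {p99} ms")       # 2000 ms: 1 in 100 users waits 2 seconds
```

An alert on the average would never fire here; an alert on p99 catches the users who are actually hurting.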
Wiki Navigation¶
Prerequisites¶
- Observability Deep Dive (Topic Pack, L2)
Next Steps¶
- Alerting Rules Drills (Drill, L2)
- Skillcheck: Alerting Rules (Assessment, L2)
Related Content¶
- Alerting Rules Drills (Drill, L2) — Alerting Rules, Prometheus
- Runbook: Alert Storm (Flapping / Too Many Alerts) (Runbook, L2) — Alerting Rules, Prometheus
- Skillcheck: Alerting Rules (Assessment, L2) — Alerting Rules, Prometheus
- Adversarial Interview Gauntlet (30 sequences) (Scenario, L2) — Prometheus
- Alerting Flashcards (CLI) (flashcard_deck, L1) — Alerting Rules
- Capacity Planning (Topic Pack, L2) — Prometheus
- Case Study: Alert Storm — Flapping Health Checks (Case Study, L2) — Alerting Rules
- Case Study: Disk Full — Runaway Logs, Fix Is Loki Retention (Case Study, L2) — Prometheus
- Case Study: Grafana Dashboard Empty — Prometheus Blocked by NetworkPolicy (Case Study, L2) — Prometheus
- Datadog Flashcards (CLI) (flashcard_deck, L1) — Prometheus
Pages that link here¶
- Alerting Rules
- Alerting Rules - Skill Check
- Alerting Rules Drills
- Anti-Primer: Alerting Rules
- Capacity Planning
- Certification Prep: PCA — Prometheus Certified Associate
- Comparison: Alerting & Paging
- Level 7: SRE & Cloud Operations
- Master Curriculum: 40 Weeks
- Production Readiness Review: Answer Key
- Production Readiness Review: Study Plans
- Runbook: Alert Storm (Flapping / Too Many Alerts)
- Scenario: Prometheus Says Target Down
- Symptoms: Alert Storm, Caused by Flapping Health Checks, Fix Is Probe Tuning
- Symptoms: Grafana Dashboard Empty, Prometheus Scrape Blocked by NetworkPolicy