Portal | Level: L2: Operations | Topics: Alerting Rules, Prometheus | Domain: Observability

Log Analysis & Alerting Rules (PromQL / LogQL) - Primer

Why This Matters

Setting up Prometheus and Loki is step one. Writing useful alerts and queries is where the real value lives. Bad alerts cause alert fatigue and missed incidents. Good alerts wake you up only when customers are affected. This pack covers writing PromQL and LogQL for real-world monitoring and alerting.

Name origin: PromQL (Prometheus Query Language) was designed by Julius Volz and Bjoern Rabenstein at SoundCloud, where Prometheus was created in 2012. The language was influenced by Borgmon, Google's internal monitoring system, which SoundCloud ex-Googlers had used. LogQL (Loki Query Language) was designed by Grafana Labs to feel like PromQL but for log streams — the syntax is deliberately similar so that PromQL users can transfer their skills.

PromQL Fundamentals

Data Types

| Type | What it is | Example |
|---|---|---|
| Instant vector | Set of time series, one sample each | up{job="grokdevops"} |
| Range vector | Set of time series, range of samples | http_requests_total[5m] |
| Scalar | Single numeric value | 42 |

Essential Functions

# rate(): per-second rate of a counter over a range
rate(http_requests_total[5m])

# increase(): total increase of a counter over a range
increase(http_requests_total[1h])

# sum(): aggregate across labels
sum(rate(http_requests_total[5m])) by (status_code)

# avg(): average across labels
avg(container_memory_working_set_bytes) by (pod)

# histogram_quantile(): percentile from histogram
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

# absent(): fires when a metric disappears
absent(up{job="grokdevops"})

# changes(): count of value changes
changes(kube_pod_status_phase{phase="Failed"}[1h])
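As a mental model, rate() takes a counter's increase across the samples in the range window and divides by the window length; Prometheus additionally extrapolates to the window boundaries and compensates for counter resets. A simplified sketch in Python, using hypothetical sample data:

```python
def simple_rate(samples, window_seconds):
    """Simplified rate(): counter increase over the window / window length.

    samples: list of (timestamp, value) pairs inside the window, oldest first.
    Ignores Prometheus's boundary extrapolation; treats any decrease as a
    counter reset (the counter restarted from zero).
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        increase += curr - prev if curr >= prev else curr  # reset: count from 0
    return increase / window_seconds

# http_requests_total sampled every 15s over a 1-minute window,
# with a counter reset at t=45
samples = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
print(simple_rate(samples, 60))  # 100 requests over 60s ≈ 1.67 req/s
```

This is why raw counters are never graphed directly: the absolute value is meaningless across restarts, but the per-second rate survives them.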

Selectors and Operators

# Label matchers
http_requests_total{method="GET", status=~"2.."}  # regex match
http_requests_total{status!="200"}                  # not equal

# Arithmetic
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100

# Comparison (for alerting)
rate(http_requests_total{status=~"5.."}[5m]) > 0.1
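The arithmetic operators are plain element-wise math on matching series. With hypothetical node_exporter values, the memory expression above works out as:

```python
# Hypothetical node_exporter gauge values, in bytes
mem_total = 16 * 1024**3      # node_memory_MemTotal_bytes
mem_available = 4 * 1024**3   # node_memory_MemAvailable_bytes

# Same formula as the PromQL expression: percent of memory in use
used_pct = (mem_total - mem_available) / mem_total * 100
print(used_pct)  # 75.0
```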

Remember: Mnemonic for the Four Golden Signals: LETS, for Latency, Errors, Traffic (request rate), Saturation. These come from Chapter 6 of the Google SRE book. RED (Rate, Errors, Duration) is the microservice-focused subset; USE (Utilization, Saturation, Errors) is Brendan Gregg's method for infrastructure resources. In short: LETS/RED for services, USE for infrastructure.

The Four Golden Signals

Every service should be monitored for these:

1. Latency

# p99 latency
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket{job="grokdevops"}[5m])) by (le)
)

# p50 (median) latency
histogram_quantile(0.50,
  sum(rate(http_request_duration_seconds_bucket{job="grokdevops"}[5m])) by (le)
)
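histogram_quantile() finds the cumulative le bucket containing the target rank and linearly interpolates within it. A simplified sketch (hypothetical cumulative bucket counts, ignoring Prometheus's native-histogram variant):

```python
def histogram_quantile(q, buckets):
    """Simplified histogram_quantile().

    buckets: list of (upper_bound_le, cumulative_count) sorted by bound,
    ending with +Inf. Linearly interpolates inside the bucket containing
    the q-th rank, assuming each bucket's lower bound is the previous
    bucket's upper bound (0 for the first).
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                # rank falls in the +Inf bucket: return the highest
                # finite bound, as Prometheus does
                return prev_bound
            frac = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count

# Cumulative le buckets for http_request_duration_seconds (hypothetical)
buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (float("inf"), 100)]
print(histogram_quantile(0.99, buckets))  # 1.0 — p99 sits at the 1s bound
```

The interpolation is also why quantile accuracy depends entirely on bucket layout: a p99 of "somewhere between 0.5s and 1.0s" is the best this histogram can resolve.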

2. Traffic (Request Rate)

# Total RPS
sum(rate(http_requests_total{job="grokdevops"}[5m]))

# RPS by endpoint
sum(rate(http_requests_total{job="grokdevops"}[5m])) by (handler)

3. Errors

# Error rate (percentage)
sum(rate(http_requests_total{job="grokdevops",status=~"5.."}[5m]))
/
sum(rate(http_requests_total{job="grokdevops"}[5m]))
* 100

# Error rate by endpoint
sum(rate(http_requests_total{status=~"5.."}[5m])) by (handler)
/
sum(rate(http_requests_total[5m])) by (handler)
* 100
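The ratio is simply errors over total. A sketch that also guards against the low-traffic trap (one error on ten requests is a 10% "error rate"), using a hypothetical minimum-rate threshold:

```python
def error_rate_pct(error_rps, total_rps, min_rps=1.0):
    """Error percentage, or None when traffic is below the minimum rate
    worth alerting on (avoids noisy ratios on tiny denominators)."""
    if total_rps < min_rps:
        return None
    return error_rps / total_rps * 100

print(error_rate_pct(0.5, 10.0))  # 5.0
print(error_rate_pct(0.1, 0.2))   # None — too little traffic to judge
```

In PromQL the equivalent guard is usually an `and` clause on the total request rate, so the alert expression returns nothing when traffic is negligible.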

4. Saturation

# CPU utilization per pod
sum(rate(container_cpu_usage_seconds_total{namespace="grokdevops"}[5m])) by (pod)
/
sum(kube_pod_container_resource_limits{namespace="grokdevops",resource="cpu"}) by (pod)
* 100

# Memory utilization per pod
sum(container_memory_working_set_bytes{namespace="grokdevops"}) by (pod)
/
sum(kube_pod_container_resource_limits{namespace="grokdevops",resource="memory"}) by (pod)
* 100

Writing Alert Rules

Anatomy of an Alert

groups:
  - name: grokdevops-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{job="grokdevops",status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{job="grokdevops"}[5m]))
          > 0.05
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "High error rate on grokdevops"
          description: "Error rate is {{ $value | humanizePercentage }} (>5%) for 5 minutes"
          runbook_url: "https://wiki.example.com/runbooks/high-error-rate"

Key Fields

| Field | Purpose |
|---|---|
| expr | PromQL expression that returns true when the alert should fire |
| for | How long the condition must be true before firing (debounce) |
| labels.severity | Routes to different notification channels |
| annotations | Human-readable context for the on-call engineer |
| runbook_url | Link to the fix steps |

Gotcha: The for duration is critical and often set too low. Without for, the alert fires after a single evaluation cycle, so a 30-second metric blip becomes a page. With for: 5m, the condition must stay true for 5 minutes of consecutive evaluations. Note the latency cost: if your Prometheus evaluation interval is 1 minute, a for: 5m alert won't fire until the 6th evaluation (5 full minutes after the condition first turned true). Tune for based on how much detection latency you can tolerate versus how many false positives you will accept.
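The pending-to-firing transition can be simulated directly. A simplified sketch of the evaluation loop (ignoring keep_firing_for and resolved notifications), showing that a short blip never fires while a sustained breach fires on the 6th evaluation:

```python
def simulate_alert(values, threshold, for_seconds, interval_seconds):
    """Return the evaluation index at which the alert fires, or None.

    The alert goes 'pending' when the expression first exceeds the
    threshold and 'firing' once it has stayed true for for_seconds.
    """
    pending_since = None
    for i, v in enumerate(values):
        t = i * interval_seconds
        if v > threshold:
            if pending_since is None:
                pending_since = t  # enter pending state
            if t - pending_since >= for_seconds:
                return i           # enter firing state
        else:
            pending_since = None   # condition cleared: back to inactive
    return None

# 1-minute evaluation interval, for: 5m
print(simulate_alert([0.2, 0.2, 0.0], 0.1, 300, 60))  # None — 2-tick blip
print(simulate_alert([0.2] * 10, 0.1, 300, 60))       # 5 — 6th evaluation
```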

Alert Severity Guidelines

| Severity | Meaning | Response | Examples |
|---|---|---|---|
| critical | Customer-facing impact NOW | Page immediately | >5% error rate, service down |
| warning | Will become critical soon | Slack notification | Disk >80%, cert expiring in 7d |
| info | Informational | Dashboard only | Deploy completed, scaling event |

Essential Kubernetes Alerts

groups:
  - name: kubernetes-alerts
    rules:
      # Pod not ready for 5 minutes
      - alert: PodNotReady
        expr: |
          kube_pod_status_ready{condition="true"} == 0
        for: 5m
        labels:
          severity: warning

      # Container restarting frequently
      - alert: ContainerRestartLoop
        expr: |
          increase(kube_pod_container_status_restarts_total[1h]) > 5
        for: 5m
        labels:
          severity: warning

      # Node not ready
      - alert: NodeNotReady
        expr: |
          kube_node_status_condition{condition="Ready",status="true"} == 0
        for: 5m
        labels:
          severity: critical

      # PVC almost full
      - alert: PVCAlmostFull
        expr: |
          kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.85
        for: 15m
        labels:
          severity: warning

      # Deployment replica mismatch
      - alert: DeploymentReplicaMismatch
        expr: |
          kube_deployment_spec_replicas != kube_deployment_status_ready_replicas
        for: 10m
        labels:
          severity: warning

LogQL Fundamentals

LogQL is Loki's query language, similar to PromQL but for logs.

Log Stream Selectors

# Select by labels
{namespace="grokdevops", container="grokdevops"}

# Filter by content
{namespace="grokdevops"} |= "error"       # contains
{namespace="grokdevops"} !~ "health"       # not matching regex
{namespace="grokdevops"} |= "error" != "healthcheck"  # chained

Parsers

# JSON parsing
{namespace="grokdevops"} | json | status >= 500

# Logfmt parsing
{namespace="grokdevops"} | logfmt | level="error"

# Pattern parsing
{namespace="grokdevops"} | pattern `<ip> - - [<timestamp>] "<method> <path> <_>" <status> <size>`
  | status >= 500
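Loki's pattern parser is essentially anchored literal-plus-capture matching; an equivalent sketch using a regex over a hypothetical access-log line:

```python
import re

# Regex equivalent of the LogQL pattern
# `<ip> - - [<timestamp>] "<method> <path> <_>" <status> <size>`
# (<_> becomes an unnamed \S+ that is matched but not captured)
LINE_RE = re.compile(
    r'(?P<ip>\S+) - - \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d+) (?P<size>\d+)'
)

line = '203.0.113.7 - - [10/Oct/2024:13:55:36 +0000] "GET /api/users HTTP/1.1" 502 1042'
fields = LINE_RE.match(line).groupdict()
print(fields["status"], fields["path"])  # 502 /api/users
```

Once parsed, the captured names become labels on the log line, which is what the `| status >= 500` filter then compares against.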

Metric Queries (Log-Based Metrics)

# Error rate from logs
sum(rate({namespace="grokdevops"} |= "error" [5m]))

# Count of 5xx responses per minute
sum by (status) (
  count_over_time({namespace="grokdevops"} | json | status >= 500 [1m])
)

# p99 request duration from logs
quantile_over_time(0.99,
  {namespace="grokdevops"} | json | unwrap duration [5m]
) by (path)

# Bytes processed per hour
sum(bytes_over_time({namespace="grokdevops"}[1h]))
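These metric functions just aggregate each stream's entries inside the range window. count_over_time and the log-based rate can be sketched from a list of log timestamps (hypothetical data):

```python
def count_over_time(timestamps, now, window_seconds):
    """Number of log lines whose timestamp falls inside (now - window, now]."""
    return sum(1 for t in timestamps if now - window_seconds < t <= now)

def log_rate(timestamps, now, window_seconds):
    """LogQL rate() over log streams: entries per second in the window."""
    return count_over_time(timestamps, now, window_seconds) / window_seconds

# Hypothetical timestamps (unix seconds) of matching error lines
errors = [100, 150, 290, 310, 395, 400]
print(count_over_time(errors, now=400, window_seconds=300))  # 5
print(log_rate(errors, now=400, window_seconds=300))         # ~0.0167/s
```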

LogQL Alert Rules (Loki Ruler)

groups:
  - name: loki-alerts
    rules:
      - alert: HighLogErrorRate
        expr: |
          sum(rate({namespace="grokdevops"} |= "level=error" [5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error log rate in grokdevops"

      - alert: PanicDetected
        expr: |
          count_over_time({namespace="grokdevops"} |= "panic" [5m]) > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Panic detected in grokdevops logs"

Alert Anti-Patterns

| Anti-pattern | Why it's bad | Fix |
|---|---|---|
| Alert on symptoms AND causes | Duplicate notifications | Alert on customer-facing symptoms only |
| No for duration | Fires on transient spikes | Add for: 5m minimum |
| No runbook_url | On-call engineer doesn't know what to do | Link every alert to a runbook |
| Alert on everything | Alert fatigue, pages get ignored | Only page for customer impact |
| Threshold too sensitive | False positives | Use percentiles, not averages |
| No severity levels | Everything is treated equally | Use critical/warning/info |

Common Pitfalls

War story: Rob Ewaschuk's classic paper "My Philosophy on Alerting" (a public Google Doc whose ideas were adapted into the monitoring chapter of the Google SRE book) established the principle that pages should be about imminent or current customer impact, not about potential future problems. This single rule eliminates most alert fatigue. A disk at 80% is a Slack notification (warning). A disk at 99% is a page (critical). Predicted to fill in 4 hours is a page; predicted to fill in 2 weeks is a ticket.

  1. Using avg for latency — Averages hide tail latency. Use percentiles (p95, p99).
  2. Missing rate() on counters — Raw counter values always increase. Wrap in rate().
  3. Wrong range duration — [1m] is too noisy, [1h] is too slow. Start with [5m].
  4. Alerting on low-traffic services — A single error on 10 requests = 10% error rate. Add minimum request threshold.
  5. High-cardinality labels — Labels like user_id or request_id explode metric storage.
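Pitfall 5 is multiplicative: every unique label combination is its own time series, so per-label cardinalities multiply. A back-of-the-envelope check with hypothetical label counts:

```python
from math import prod

def series_count(label_cardinalities):
    """Worst-case number of time series for one metric name:
    the product of each label's distinct-value count."""
    return prod(label_cardinalities.values())

# A sane metric: method x status x handler
print(series_count({"method": 5, "status": 8, "handler": 30}))  # 1200

# The same metric with a user_id label (hypothetical 50k users):
# 60 million series — enough to take down most Prometheus servers
print(series_count({"method": 5, "status": 8, "handler": 30, "user_id": 50_000}))
```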

Wiki Navigation

Prerequisites

Next Steps