Portal | Level: L2: Operations | Topics: Alerting Rules, Prometheus | Domain: Observability

Log Analysis & Alerting Rules (PromQL / LogQL) - Primer

Why This Matters

Setting up Prometheus and Loki is step one. Writing useful alerts and queries is where the real value lives. Bad alerts cause alert fatigue and missed incidents. Good alerts wake you up only when customers are affected. This pack covers writing PromQL and LogQL for real-world monitoring and alerting.

Name origin: PromQL (Prometheus Query Language) was designed by Julius Volz and Bjoern Rabenstein at SoundCloud, where Prometheus was created in 2012. The language was influenced by Borgmon, Google's internal monitoring system, which SoundCloud ex-Googlers had used. LogQL (Loki Query Language) was designed by Grafana Labs to feel like PromQL but for log streams — the syntax is deliberately similar so that PromQL users can transfer their skills.

PromQL Fundamentals

Data Types

| Type | What it is | Example |
|---|---|---|
| Instant vector | Set of time series, one sample each | up{job="grokdevops"} |
| Range vector | Set of time series, range of samples | http_requests_total[5m] |
| Scalar | Single numeric value | 42 |

Essential Functions

# rate(): per-second rate of a counter over a range
rate(http_requests_total[5m])

# increase(): total increase of a counter over a range
increase(http_requests_total[1h])

# sum(): aggregate across labels
sum(rate(http_requests_total[5m])) by (status_code)

# avg(): average across labels
avg(container_memory_working_set_bytes) by (pod)

# histogram_quantile(): percentile from histogram
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

# absent(): fires when a metric disappears
absent(up{job="grokdevops"})

# changes(): count of value changes
changes(kube_pod_status_phase{phase="Failed"}[1h])
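As a mental model, rate() takes a counter's increase across the samples in the range window and divides by the window length; Prometheus additionally extrapolates to the window boundaries and compensates for counter resets. A simplified sketch in Python, using hypothetical sample data:

```python
def simple_rate(samples, window_seconds):
    """Simplified rate(): counter increase over the window / window length.

    samples: list of (timestamp, value) pairs inside the window, oldest first.
    Ignores Prometheus's boundary extrapolation; treats any decrease as a
    counter reset (the counter restarted from zero).
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        increase += curr - prev if curr >= prev else curr  # reset: count from 0
    return increase / window_seconds

# http_requests_total sampled every 15s over a 1-minute window,
# with a counter reset at t=45
samples = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
print(simple_rate(samples, 60))  # 100 requests over 60s ≈ 1.67 req/s
```

This is why raw counters are never graphed directly: the absolute value is meaningless across restarts, but the per-second rate survives them.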

Selectors and Operators

# Label matchers
http_requests_total{method="GET", status=~"2.."}  # regex match
http_requests_total{status!="200"}                  # not equal

# Arithmetic
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100

# Comparison (for alerting)
rate(http_requests_total{status=~"5.."}[5m]) > 0.1
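The arithmetic operators are plain element-wise math on matching series. With hypothetical node_exporter values, the memory expression above works out as:

```python
# Hypothetical node_exporter gauge values, in bytes
mem_total = 16 * 1024**3      # node_memory_MemTotal_bytes
mem_available = 4 * 1024**3   # node_memory_MemAvailable_bytes

# Same formula as the PromQL expression: percent of memory in use
used_pct = (mem_total - mem_available) / mem_total * 100
print(used_pct)  # 75.0
```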

Remember: Mnemonic for the Four Golden Signals: LETS, for Latency, Errors, Traffic (request rate), Saturation. These come from Chapter 6 of the Google SRE book. RED (Rate, Errors, Duration) is the microservice-focused subset; USE (Utilization, Saturation, Errors) is Brendan Gregg's method for infrastructure resources. In short: LETS/RED for services, USE for infrastructure.

The Four Golden Signals

Every service should be monitored for these:

1. Latency

# p99 latency
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket{job="grokdevops"}[5m])) by (le)
)

# p50 (median) latency
histogram_quantile(0.50,
  sum(rate(http_request_duration_seconds_bucket{job="grokdevops"}[5m])) by (le)
)
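histogram_quantile() finds the cumulative le bucket containing the target rank and linearly interpolates within it. A simplified sketch (hypothetical cumulative bucket counts, ignoring Prometheus's native-histogram variant):

```python
def histogram_quantile(q, buckets):
    """Simplified histogram_quantile().

    buckets: list of (upper_bound_le, cumulative_count) sorted by bound,
    ending with +Inf. Linearly interpolates inside the bucket containing
    the q-th rank, assuming each bucket's lower bound is the previous
    bucket's upper bound (0 for the first).
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                # rank falls in the +Inf bucket: return the highest
                # finite bound, as Prometheus does
                return prev_bound
            frac = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count

# Cumulative le buckets for http_request_duration_seconds (hypothetical)
buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (float("inf"), 100)]
print(histogram_quantile(0.99, buckets))  # 1.0 — p99 sits at the 1s bound
```

The interpolation is also why quantile accuracy depends entirely on bucket layout: a p99 of "somewhere between 0.5s and 1.0s" is the best this histogram can resolve.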

2. Traffic (Request Rate)

# Total RPS
sum(rate(http_requests_total{job="grokdevops"}[5m]))

# RPS by endpoint
sum(rate(http_requests_total{job="grokdevops"}[5m])) by (handler)

3. Errors

# Error rate (percentage)
sum(rate(http_requests_total{job="grokdevops",status=~"5.."}[5m]))
/
sum(rate(http_requests_total{job="grokdevops"}[5m]))
* 100

# Error rate by endpoint
sum(rate(http_requests_total{status=~"5.."}[5m])) by (handler)
/
sum(rate(http_requests_total[5m])) by (handler)
* 100
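The ratio is simply errors over total. A sketch that also guards against the low-traffic trap (one error on ten requests is a 10% "error rate"), using a hypothetical minimum-rate threshold:

```python
def error_rate_pct(error_rps, total_rps, min_rps=1.0):
    """Error percentage, or None when traffic is below the minimum rate
    worth alerting on (avoids noisy ratios on tiny denominators)."""
    if total_rps < min_rps:
        return None
    return error_rps / total_rps * 100

print(error_rate_pct(0.5, 10.0))  # 5.0
print(error_rate_pct(0.1, 0.2))   # None — too little traffic to judge
```

In PromQL the equivalent guard is usually an `and` clause on the total request rate, so the alert expression returns nothing when traffic is negligible.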

4. Saturation

# CPU utilization per pod
sum(rate(container_cpu_usage_seconds_total{namespace="grokdevops"}[5m])) by (pod)
/
sum(kube_pod_container_resource_limits{namespace="grokdevops",resource="cpu"}) by (pod)
* 100

# Memory utilization per pod
sum(container_memory_working_set_bytes{namespace="grokdevops"}) by (pod)
/
sum(kube_pod_container_resource_limits{namespace="grokdevops",resource="memory"}) by (pod)
* 100

Writing Alert Rules

Anatomy of an Alert

groups:
  - name: grokdevops-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{job="grokdevops",status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{job="grokdevops"}[5m]))
          > 0.05
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "High error rate on grokdevops"
          description: "Error rate is {{ $value | humanizePercentage }} (>5%) for 5 minutes"
          runbook_url: "https://wiki.example.com/runbooks/high-error-rate"

Key Fields

| Field | Purpose |
|---|---|
| expr | PromQL expression that returns true when the alert should fire |
| for | How long the condition must be true before firing (debounce) |
| labels.severity | Routes to different notification channels |
| annotations | Human-readable context for the on-call engineer |
| runbook_url | Link to the fix steps |

Gotcha: The for duration is critical and often set too low. Without for, the alert fires after a single evaluation cycle, so a 30-second metric blip becomes a page. With for: 5m, the condition must stay true for 5 minutes of consecutive evaluations. Note the latency cost: if your Prometheus evaluation interval is 1 minute, a for: 5m alert won't fire until the 6th evaluation (5 full minutes after the condition first turned true). Tune for based on how much detection latency you can tolerate versus how many false positives you will accept.
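The pending-to-firing transition can be simulated directly. A simplified sketch of the evaluation loop (ignoring keep_firing_for and resolved notifications), showing that a short blip never fires while a sustained breach fires on the 6th evaluation:

```python
def simulate_alert(values, threshold, for_seconds, interval_seconds):
    """Return the evaluation index at which the alert fires, or None.

    The alert goes 'pending' when the expression first exceeds the
    threshold and 'firing' once it has stayed true for for_seconds.
    """
    pending_since = None
    for i, v in enumerate(values):
        t = i * interval_seconds
        if v > threshold:
            if pending_since is None:
                pending_since = t  # enter pending state
            if t - pending_since >= for_seconds:
                return i           # enter firing state
        else:
            pending_since = None   # condition cleared: back to inactive
    return None

# 1-minute evaluation interval, for: 5m
print(simulate_alert([0.2, 0.2, 0.0], 0.1, 300, 60))  # None — 2-tick blip
print(simulate_alert([0.2] * 10, 0.1, 300, 60))       # 5 — 6th evaluation
```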

Alert Severity Guidelines

| Severity | Meaning | Response | Examples |
|---|---|---|---|
| critical | Customer-facing impact NOW | Page immediately | >5% error rate, service down |
| warning | Will become critical soon | Slack notification | Disk >80%, cert expiring in 7d |
| info | Informational | Dashboard only | Deploy completed, scaling event |

Essential Kubernetes Alerts

groups:
  - name: kubernetes-alerts
    rules:
      # Pod not ready for 5 minutes
      - alert: PodNotReady
        expr: |
          kube_pod_status_ready{condition="true"} == 0
        for: 5m
        labels:
          severity: warning

      # Container restarting frequently
      - alert: ContainerRestartLoop
        expr: |
          increase(kube_pod_container_status_restarts_total[1h]) > 5
        for: 5m
        labels:
          severity: warning

      # Node not ready
      - alert: NodeNotReady
        expr: |
          kube_node_status_condition{condition="Ready",status="true"} == 0
        for: 5m
        labels:
          severity: critical

      # PVC almost full
      - alert: PVCAlmostFull
        expr: |
          kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.85
        for: 15m
        labels:
          severity: warning

      # Deployment replica mismatch
      - alert: DeploymentReplicaMismatch
        expr: |
          kube_deployment_spec_replicas != kube_deployment_status_ready_replicas
        for: 10m
        labels:
          severity: warning

LogQL Fundamentals

LogQL is Loki's query language, similar to PromQL but for logs.

Log Stream Selectors

# Select by labels
{namespace="grokdevops", container="grokdevops"}

# Filter by content
{namespace="grokdevops"} |= "error"       # contains
{namespace="grokdevops"} !~ "health"       # not matching regex
{namespace="grokdevops"} |= "error" != "healthcheck"  # chained

Parsers

# JSON parsing
{namespace="grokdevops"} | json | status >= 500

# Logfmt parsing
{namespace="grokdevops"} | logfmt | level="error"

# Pattern parsing
{namespace="grokdevops"} | pattern `<ip> - - [<timestamp>] "<method> <path> <_>" <status> <size>`
  | status >= 500
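Loki's pattern parser is essentially anchored literal-plus-capture matching; an equivalent sketch using a regex over a hypothetical access-log line:

```python
import re

# Regex equivalent of the LogQL pattern
# `<ip> - - [<timestamp>] "<method> <path> <_>" <status> <size>`
# (<_> becomes an unnamed \S+ that is matched but not captured)
LINE_RE = re.compile(
    r'(?P<ip>\S+) - - \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d+) (?P<size>\d+)'
)

line = '203.0.113.7 - - [10/Oct/2024:13:55:36 +0000] "GET /api/users HTTP/1.1" 502 1042'
fields = LINE_RE.match(line).groupdict()
print(fields["status"], fields["path"])  # 502 /api/users
```

Once parsed, the captured names become labels on the log line, which is what the `| status >= 500` filter then compares against.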

Metric Queries (Log-Based Metrics)

# Error rate from logs
sum(rate({namespace="grokdevops"} |= "error" [5m]))

# Count of 5xx responses per minute
sum by (status) (
  count_over_time({namespace="grokdevops"} | json | status >= 500 [1m])
)

# p99 request duration from logs
quantile_over_time(0.99,
  {namespace="grokdevops"} | json | unwrap duration [5m]
) by (path)

# Bytes processed per hour
sum(bytes_over_time({namespace="grokdevops"}[1h]))
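These metric functions just aggregate each stream's entries inside the range window. count_over_time and the log-based rate can be sketched from a list of log timestamps (hypothetical data):

```python
def count_over_time(timestamps, now, window_seconds):
    """Number of log lines whose timestamp falls inside (now - window, now]."""
    return sum(1 for t in timestamps if now - window_seconds < t <= now)

def log_rate(timestamps, now, window_seconds):
    """LogQL rate() over log streams: entries per second in the window."""
    return count_over_time(timestamps, now, window_seconds) / window_seconds

# Hypothetical timestamps (unix seconds) of matching error lines
errors = [100, 150, 290, 310, 395, 400]
print(count_over_time(errors, now=400, window_seconds=300))  # 5
print(log_rate(errors, now=400, window_seconds=300))         # ~0.0167/s
```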

LogQL Alert Rules (Loki Ruler)

groups:
  - name: loki-alerts
    rules:
      - alert: HighLogErrorRate
        expr: |
          sum(rate({namespace="grokdevops"} |= "level=error" [5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error log rate in grokdevops"

      - alert: PanicDetected
        expr: |
          count_over_time({namespace="grokdevops"} |= "panic" [5m]) > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Panic detected in grokdevops logs"

Alert Anti-Patterns

| Anti-pattern | Why it's bad | Fix |
|---|---|---|
| Alert on symptoms AND causes | Duplicate notifications | Alert on customer-facing symptoms only |
| No for duration | Fires on transient spikes | Add for: 5m minimum |
| No runbook_url | On-call engineer doesn't know what to do | Link every alert to a runbook |
| Alert on everything | Alert fatigue, pages get ignored | Only page for customer impact |
| Threshold too sensitive | False positives | Use percentiles, not averages |
| No severity levels | Everything is treated equally | Use critical/warning/info |

Common Pitfalls

War story: Rob Ewaschuk's classic paper "My Philosophy on Alerting" (a public Google Doc whose ideas were adapted into the monitoring chapter of the Google SRE book) established the principle that pages should be about imminent or current customer impact, not about potential future problems. This single rule eliminates most alert fatigue. A disk at 80% is a Slack notification (warning). A disk at 99% is a page (critical). Predicted to fill in 4 hours is a page; predicted to fill in 2 weeks is a ticket.

  1. Using avg for latency — Averages hide tail latency. Use percentiles (p95, p99).
  2. Missing rate() on counters — Raw counter values always increase. Wrap in rate().
  3. Wrong range duration — [1m] is too noisy, [1h] is too slow. Start with [5m].
  4. Alerting on low-traffic services — A single error on 10 requests = 10% error rate. Add minimum request threshold.
  5. High-cardinality labels — Labels like user_id or request_id explode metric storage.
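Pitfall 5 is multiplicative: every unique label combination is its own time series, so per-label cardinalities multiply. A back-of-the-envelope check with hypothetical label counts:

```python
from math import prod

def series_count(label_cardinalities):
    """Worst-case number of time series for one metric name:
    the product of each label's distinct-value count."""
    return prod(label_cardinalities.values())

# A sane metric: method x status x handler
print(series_count({"method": 5, "status": 8, "handler": 30}))  # 1200

# The same metric with a user_id label (hypothetical 50k users):
# 60 million series — enough to take down most Prometheus servers
print(series_count({"method": 5, "status": 8, "handler": 30, "user_id": 50_000}))
```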

Wiki Navigation

Prerequisites

Next Steps