
Portal | Level: L2: Operations | Topics: Alerting Rules, Prometheus | Domain: Observability

Alerting Rules Drills

Remember: Good alerts have three properties: Actionable (someone can do something about it), Relevant (it signals real user impact), Contextualized (includes enough info to start debugging). If an alert fires and the on-call shrugs and silences it, the alert is noise, not signal. Mnemonic: "ARC" — every alert needs an Action, Relevance, and Context.

Gotcha: A `for: 5m` clause in a Prometheus alert means the expression must be continuously true for 5 minutes before firing. If the issue flaps (true, false, true), the timer resets each time it goes false. For flapping metrics, use `avg_over_time()` or `count_over_time()` to smooth the signal instead of relying on `for`.
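As a sketch of the smoothing approach (using a hypothetical `backend_errors_total` counter for illustration), alerting on a 10-minute average keeps the alert firing through brief dips that would reset a plain `for` timer:

```yaml
# Sketch: smooth a flapping error signal instead of relying on `for`.
# `backend_errors_total` is a hypothetical counter; the subquery syntax
# ([10m:1m]) requires Prometheus >= 2.7.
- alert: SustainedErrorRate
  expr: |
    avg_over_time(
      (sum(rate(backend_errors_total[1m])))[10m:1m]
    ) > 5
  labels:
    severity: warning
  annotations:
    summary: "Error rate averaged over 10m is {{ $value }}/s"
```

Because the average is computed over the window, a single false evaluation no longer resets anything; the signal has to genuinely recover to clear the alert.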

Default trap: Prometheus evaluates alert rules at the `evaluation_interval` (default 1m). With `for: 5m`, the time from issue onset to firing can be up to roughly `evaluation_interval + for` — the issue must first be detected at the next evaluation, and only then does the `for` timer start — and Alertmanager's `group_wait` adds its own delay on top. If you need faster detection, set a shorter evaluation interval for critical alert groups.
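A sketch of per-group interval override (the group-level `interval` field is part of the Prometheus rule file format; the `api` job label is an assumption for illustration):

```yaml
groups:
- name: critical-fast
  interval: 15s   # evaluate this group every 15s instead of the global evaluation_interval
  rules:
  - alert: ApiDown
    expr: up{job="api"} == 0   # hypothetical job label
    for: 1m
    labels:
      severity: critical
```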

Drill 1: Basic Rate Alert

Difficulty: Easy

Q: Write a Prometheus alert rule that fires when the HTTP 5xx error rate exceeds 5% for 5 minutes.

Answer
groups:
- name: http-alerts
  rules:
  - alert: HighErrorRate
    expr: |
      sum(rate(http_requests_total{code=~"5.."}[5m]))
      /
      sum(rate(http_requests_total[5m]))
      > 0.05
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "HTTP 5xx error rate is {{ $value | humanizePercentage }}"
The `for: 5m` clause means the expression must remain true at every evaluation for 5 minutes before the alert transitions from pending to firing. This prevents paging on brief spikes.

Drill 2: Absent Metric Alert

Difficulty: Easy

Q: Write an alert that fires when a target stops reporting metrics entirely (no data, not just zero).

Answer
- alert: TargetDown
  expr: absent(up{job="grokdevops"} == 1)
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Target grokdevops is not being scraped"
`absent(v)` returns 1 when no series matching `v` exists. Because the argument here is `up{job="grokdevops"} == 1`, the alert fires when no target for the job reports `up == 1`. This catches:

- Target is down (`up` is 0, so the `== 1` filter matches nothing)
- Service discovery lost the target (the `up` series disappears entirely)
- A network issue preventing scraping

Note: `absent()` returns an empty result — and therefore cannot fire — as long as any matching series exists.
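For completeness, a per-instance variant (same `grokdevops` job as above) catches a single target that is scraped but failing, which the `absent()` form only reports once every matching series is gone:

```yaml
- alert: InstanceDown
  expr: up{job="grokdevops"} == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Instance {{ $labels.instance }} of grokdevops is down"
```

The two alerts are complementary: `up == 0` pinpoints which instance is failing, while `absent()` covers the case where there is no `up` series left to compare.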

Drill 3: Disk Full Prediction

Difficulty: Medium

Q: Write an alert that predicts when a disk will be full within 4 hours based on the current fill rate.

Answer
- alert: DiskWillFillIn4Hours
  expr: |
    predict_linear(
      node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[6h],
      4 * 3600
    ) < 0
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "Disk {{ $labels.mountpoint }} on {{ $labels.instance }} will be full within 4 hours"
    description: "Current available: {{ $value | humanize1024 }}B"
`predict_linear(v, t)` uses linear regression over the range vector `v` to predict the value `t` seconds in the future. If predicted available bytes < 0, the disk will be full. The `for: 30m` prevents alerting on brief IO spikes.
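To cut noise from disks that are mostly empty but briefly trending down, the prediction can be combined with a current-usage guard. A sketch — the 20%-available threshold is an assumption, and `node_filesystem_size_bytes` is the standard node_exporter size metric:

```yaml
- alert: DiskWillFillIn4Hours
  expr: |
    predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[6h], 4 * 3600) < 0
    and
    (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
     / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) < 0.2
  for: 30m
  labels:
    severity: warning
```

The `and` keeps only series where both conditions hold (the label sets match one-to-one), so the alert fires only when the disk is both trending toward full and already below 20% available.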

Drill 4: Pod Restart Alert

Difficulty: Easy

Q: Alert when any pod has restarted more than 3 times in the last hour.

Answer
- alert: PodRestartingFrequently
  expr: |
    increase(kube_pod_container_status_restarts_total[1h]) > 3
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) restarted {{ $value }} times in 1h"
Use `increase()` when you want the total growth of a counter over a time range; `rate()` gives the per-second average. Both handle counter resets, and both extrapolate to the edges of the window, so `increase()` can return non-integer values even for integer counters.
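The relationship can be made concrete with two recording rules over the same counter — `increase(x[1h])` is effectively `rate(x[1h]) * 3600` (the rule names below are illustrative, not a standard):

```yaml
- record: pod:restarts:increase1h
  expr: increase(kube_pod_container_status_restarts_total[1h])  # total restarts in the window (may be fractional)
- record: pod:restarts:rate1h
  expr: rate(kube_pod_container_status_restarts_total[1h])      # restarts per second, averaged over 1h
```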

Drill 5: Latency P99 Alert

Difficulty: Medium

Q: Alert when p99 latency exceeds 1 second, using histogram metrics.

Answer
- alert: HighP99Latency
  expr: |
    histogram_quantile(0.99,
      sum by(le)(rate(http_request_duration_seconds_bucket[5m]))
    ) > 1
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "P99 latency is {{ $value }}s (threshold: 1s)"
`histogram_quantile(φ, sum by(le)(rate(bucket[range])))`:

- `φ` is the quantile (0.99 = 99th percentile)
- Apply `rate()` to the `_bucket` series, not the raw cumulative values
- Always keep `le` (the less-than-or-equal bucket boundaries) in the grouping; the function needs it
- Use `sum by(le)` to aggregate across instances
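To keep per-route visibility while still aggregating across instances, add the extra label to the `by` clause. A sketch, assuming the histogram carries a `handler` label (an assumption about your instrumentation):

```yaml
- alert: HighP99LatencyPerRoute
  expr: |
    histogram_quantile(0.99,
      sum by(le, handler)(rate(http_request_duration_seconds_bucket[5m]))
    ) > 1
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "P99 latency for {{ $labels.handler }} is {{ $value }}s"
```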

Drill 6: Recording Rules for Performance

Difficulty: Medium

Q: The alerting query `sum(rate(http_requests_total[5m]))` is expensive and is evaluated every 15s. How do you optimize it?

Answer
groups:
- name: http-recording-rules
  interval: 30s
  rules:
  # Recording rule: precompute the rate
  - record: http_requests:rate5m
    expr: sum(rate(http_requests_total[5m]))

  - record: http_requests:error_rate5m
    expr: |
      sum(rate(http_requests_total{code=~"5.."}[5m]))
      /
      sum(rate(http_requests_total[5m]))

  # Note: this equals the error ratio above; a true burn rate divides
  # the error ratio by the SLO's error budget.
  - record: http_requests:burnrate5m
    expr: |
      1 - (
        sum(rate(http_requests_total{code!~"5.."}[5m]))
        /
        sum(rate(http_requests_total[5m]))
      )
Then use the recording rules in alerts:
- alert: HighErrorRate
  expr: http_requests:error_rate5m > 0.05
  for: 5m
Benefits:

- Faster alert evaluation (pre-computed)
- Consistent values across dashboards and alerts
- Reduced Prometheus load

Naming convention: `metric:aggregation_window` or, more fully, `level:metric:operations`.

Drill 7: Silence vs Inhibit

Difficulty: Easy

Q: What's the difference between silencing and inhibiting alerts in Alertmanager?

Answer

**Silence**: Manually suppress a specific alert for a time window.
# "We know about this, don't page me during maintenance"
amtool silence add alertname=DiskFull instance=node-3 \
  --duration=2h --comment="Planned disk expansion"
**Inhibition**: Automatically suppress alerts when a related higher-severity alert is firing.
# alertmanager.yml
inhibit_rules:
- source_matchers:
  - severity = critical
  target_matchers:
  - severity = warning
  equal: [namespace, alertname]
This means: "If a critical alert fires for namespace X, suppress all warning alerts for the same namespace and alertname." Use cases:

- Don't page for slow responses when the service is already down
- Don't alert on pod restarts when the node is unreachable

Drill 8: Alertmanager Routing

Difficulty: Medium

Q: Configure Alertmanager to route critical alerts to PagerDuty and warnings to Slack, with team-based routing.

Answer
# alertmanager.yml
route:
  receiver: default-slack
  group_by: [alertname, namespace]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
  - match:
      severity: critical
    receiver: pagerduty-oncall
    repeat_interval: 5m
  - match:
      severity: warning
      team: platform
    receiver: slack-platform
  - match:
      severity: warning
      team: backend
    receiver: slack-backend

receivers:
- name: default-slack
  slack_configs:
  - channel: '#alerts-general'
    api_url: https://hooks.slack.com/services/xxx

- name: pagerduty-oncall
  pagerduty_configs:
  - service_key: xxx
    severity: critical

- name: slack-platform
  slack_configs:
  - channel: '#platform-alerts'
    api_url: https://hooks.slack.com/services/xxx

- name: slack-backend
  slack_configs:
  - channel: '#backend-alerts'
    api_url: https://hooks.slack.com/services/xxx
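
Note: `match` and `match_re` are deprecated in Alertmanager 0.22+ in favor of `matchers`. The critical route above would be written as:

```yaml
routes:
- matchers:
  - severity = critical
  receiver: pagerduty-oncall
  repeat_interval: 5m
```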

Drill 9: Node Pressure Alerts

Difficulty: Medium

Q: Write alerts for node memory pressure, disk pressure, and PID pressure.

Answer
- alert: NodeMemoryPressure
  expr: kube_node_status_condition{condition="MemoryPressure", status="true"} == 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Node {{ $labels.node }} has MemoryPressure"

- alert: NodeDiskPressure
  expr: kube_node_status_condition{condition="DiskPressure", status="true"} == 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Node {{ $labels.node }} has DiskPressure"

- alert: NodePIDPressure
  expr: kube_node_status_condition{condition="PIDPressure", status="true"} == 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Node {{ $labels.node }} has PIDPressure"

- alert: NodeNotReady
  expr: kube_node_status_condition{condition="Ready", status="true"} == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Node {{ $labels.node }} is not Ready"
These use `kube-state-metrics` which exposes Kubernetes object states as Prometheus metrics.

Drill 10: LogQL Alert

Difficulty: Medium

Q: Write a Loki/LogQL alert rule that fires when more than 10 error logs per minute are seen for any service.

Answer
# Loki ruler config
groups:
- name: log-alerts
  rules:
  - alert: HighErrorLogRate
    expr: |
      sum by(app)(
        count_over_time({namespace="production"} |= "error" | logfmt | level="error" [1m])
      ) > 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.app }} generating {{ $value }} error logs/min"

  - alert: OOMKilledDetected
    expr: |
      count_over_time({namespace="production"} |= "OOMKilled" [5m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: "OOMKilled event detected in production"
LogQL metric queries:

- `rate()`: log lines per second
- `count_over_time()`: total lines in the window
- `bytes_over_time()`: total bytes in the window

Note that `rate()` is per second, so for a "per minute" threshold like this drill's, `count_over_time(...[1m])` expresses the count directly.
