Alerting
20 cards — 🟢 5 easy | 🟡 9 medium | 🔴 6 hard
🟢 Easy (5)
1. What are the three standard alert severity levels and their expected response?
Show answer
Critical: customer-facing impact now, page immediately. Warning: will become critical soon, Slack notification. Info: informational, dashboard only.
Remember: CWI — Critical (page), Warning (notify), Info (dashboard). Think 'Can We Ignore?' — Critical no, Warning maybe, Info yes.
Gotcha: never add a fourth level like 'emergency' — it fragments routing and causes confusion about who gets paged.
Example: PagerDuty routes Critical to phone call, Warning to Slack #alerts, Info to dashboard-only widget.
2. What does the "for" field do in a Prometheus alert rule?
Show answer
It specifies how long the condition must be true before the alert fires, acting as a debounce to avoid alerting on brief transient spikes.
Example: for: 5m means 'only fire if error rate > 5% for 5 continuous minutes' — transient 30-second spikes are ignored.
Gotcha: setting for: 0s (or omitting it) fires on every brief spike, causing alert fatigue. Start with 2m minimum for critical, 5m for warning.
Remember: 'for' = 'for how long must this be broken before I wake someone up?'
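A minimal rule-file sketch of the card above. The metric names and threshold are illustrative assumptions, not part of the deck:

```yaml
groups:
  - name: example-alerts
    rules:
      - alert: HighErrorRate
        # Expression must stay true for 5 continuous minutes before firing.
        # A 30-second transient spike never becomes a page.
        expr: rate(http_errors_total[5m]) / rate(http_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: warning
```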
3. Why should you alert on symptoms rather than causes?
Show answer
Users care about service impact (errors, latency), not internal causes (CPU, memory). Symptom-based alerts ensure you only page when customers are affected.
Remember: USE for resources (Utilization, Saturation, Errors), RED for services (Rate, Errors, Duration). Alert on RED — those are symptoms.
Example: alert on 'HTTP 500 rate > 1%' (symptom), not 'CPU > 80%' (cause). CPU can spike without user impact.
4. What is alert suppression and when should you use it?
Show answer
Alert suppression temporarily mutes specific alerts by label match. Use it during planned maintenance windows (e.g., a database migration), for known incidents already being worked, or for noisy alerts pending a fix. Always set a time-bound expiry so suppressed alerts automatically resume.
Gotcha: unbounded suppressions are worse than no suppression — you forget about them and miss real incidents. Always set a TTL.
Remember: suppress = 'I know about this and am handling it.' Not the same as ignoring.
5. What are the top three practices to prevent alert fatigue in an on-call team?
Show answer
1. Every alert must have a runbook — if there is no documented action, delete the alert.
2. Review alert frequency monthly — alerts firing more than once a week need permanent fixes, not repeated manual response.
3. Tune for-durations and thresholds — a 10-second CPU spike is not worth waking someone at 3 AM. Set for: 5m minimum for warning, 2m for critical.
Remember: RAST — Runbooks, Appropriate severity, Silence stale alerts, Tune for-durations. If an alert has no runbook, delete it.
Fun fact: Google SRE found teams with >100 alerts/week had worse incident response than teams with <10 well-tuned alerts.
🟡 Medium (9)
1. How does Alertmanager route alerts to different receivers?
Show answer
Using a routing tree in its config. Routes match on alert labels (e.g., severity: critical goes to PagerDuty, severity: warning goes to Slack). The route tree supports nested matching with group_by, group_wait, and repeat_interval.
Remember: Alertmanager routing is a label-matching tree. More specific routes should be nested inside broader ones. Test routing with amtool config routes test.
Example: route: receiver=default, then child routes: match severity=critical -> PagerDuty, match team=platform -> platform-slack.
2. What is alert grouping in Alertmanager and why is it important?
Show answer
Grouping batches related alerts (e.g., by alertname and namespace) into a single notification using group_by. Without grouping, a cluster-wide issue could fire hundreds of individual alerts, overwhelming the on-call engineer.
Example: group_by: [alertname, namespace] batches all OOMKilled alerts in the payments namespace into one Slack message instead of 50.
Gotcha: over-grouping (group_by: [alertname] only) merges unrelated clusters. Under-grouping (too many labels) sends one message per pod.
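A sketch of the grouping behavior this card describes, as an Alertmanager route fragment; the receiver name is a hypothetical placeholder:

```yaml
route:
  receiver: slack-alerts        # hypothetical receiver name
  # One notification per (alertname, namespace) pair, not one per pod.
  group_by: [alertname, namespace]
  group_wait: 30s               # wait to batch the initial burst of alerts
  group_interval: 5m            # batch subsequent alerts joining an existing group
```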
3. What is alert inhibition in Alertmanager?
Show answer
Inhibition suppresses lower-priority alerts when a higher-priority alert is already firing. For example, if a NodeDown critical alert fires, all warning-level pod alerts on that node are suppressed to reduce noise.
Remember: 'Inhibit = upstream silences downstream.' NodeDown inhibits PodCrashLoop because pods cannot run on a dead node.
Example: source_matchers: [alertname=NodeDown, severity=critical], target_matchers: [severity=warning], equal: [node].
4. What is alert fatigue and how do you prevent it?
Show answer
Alert fatigue occurs when too many alerts fire, causing on-call engineers to ignore them. Prevent it by only alerting on customer-facing symptoms, using appropriate for durations, setting proper severity levels, and linking every alert to a runbook.
Remember: RAST — Runbooks, Appropriate severity, Silence stale alerts, Tune for-durations. If an alert has no runbook, delete it.
Fun fact: Google SRE found teams with >100 alerts/week had worse incident response than teams with <10 well-tuned alerts.
5. How should you define alert severity levels to drive consistent response?
Show answer
Critical (P1): customer-facing impact right now, page immediately, 5-min ack SLA.
Warning (P2): will degrade within hours, Slack notify, 30-min ack SLA.
Info (P3): anomaly worth noting, dashboard only, no notification.
Each level must have a clear escalation path, defined response time, and documented notification channel. Ambiguous severity leads to either over-paging or missed incidents.
Remember: severity levels must map to notification channels and SLAs. If Critical and Warning both go to Slack, your severity model is broken.
6. How does Alertmanager route alerts to different teams and escalation tiers?
Show answer
Routes match on labels: team=platform goes to the platform Slack channel, severity=critical routes to PagerDuty. Nested routes allow refinement (e.g., team=platform AND service=database goes to the DBA pager). Use group_wait (30s) to batch initial alerts, group_interval (5m) for subsequent batches, and repeat_interval (4h) to avoid re-notifying for the same alert.
Remember: Alertmanager routing is a label-matching tree. More specific routes should be nested inside broader ones. Test routing with amtool config routes test.
Example: route: receiver=default, then child routes: match severity=critical -> PagerDuty, match team=platform -> platform-slack.
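The routing tree in this card can be sketched as an Alertmanager config fragment. Receiver names are hypothetical; receivers here are declared without integrations for brevity:

```yaml
route:
  receiver: default-slack
  group_wait: 30s        # batch the initial burst before the first notification
  group_interval: 5m     # batch later alerts joining an existing group
  repeat_interval: 4h    # do not re-notify for a still-firing alert before 4h
  routes:
    - matchers: [severity = critical]
      receiver: pagerduty
      routes:
        # More specific route nested inside the broader critical route.
        - matchers: [team = platform, service = database]
          receiver: dba-pager
    - matchers: [team = platform]
      receiver: platform-slack

receivers:
  - name: default-slack
  - name: pagerduty
  - name: dba-pager
  - name: platform-slack
```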
7. What is alert correlation and how does it reduce noise during large-scale failures?
Show answer
Alert correlation groups related alerts that fire during the same incident — e.g., high latency + elevated errors + database connection pool exhausted are likely one root cause, not three separate incidents. Alertmanager's group_by achieves basic correlation. Advanced tools (PagerDuty, BigPanda) use ML and topology to merge related alerts into a single incident automatically.
Remember: correlation reduces N alerts to 1 incident. The formula: fewer pages = faster response = less burnout.
8. Why use Prometheus recording rules for alert queries instead of evaluating them directly?
Show answer
Complex alert expressions (multi-join, histogram_quantile, large label cardinality) are expensive to evaluate every cycle. Recording rules precompute and store the result as a new metric, making alert evaluation a simple threshold check. This also makes dashboards faster when reusing the same expression.
Example: record: job:http_error_ratio:rate5m precomputes the error ratio; the alert expression becomes a cheap job:http_error_ratio:rate5m > 0.05.
Gotcha: recording rules add one evaluation interval of lag before the alert sees fresh data — keep the rule group's interval short for alerting-critical rules.
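A sketch of the pattern this card describes, split into a recording group and an alerting group. Metric names follow the level:metric:operations convention but are otherwise illustrative:

```yaml
groups:
  - name: recording
    rules:
      # Precompute the expensive ratio once per evaluation cycle.
      - record: job:http_error_ratio:rate5m
        expr: >
          sum(rate(http_errors_total[5m])) by (job)
          /
          sum(rate(http_requests_total[5m])) by (job)

  - name: alerts
    rules:
      # The alert is now a cheap threshold check on the precomputed metric.
      - alert: HighErrorRatio
        expr: job:http_error_ratio:rate5m > 0.05
        for: 5m
        labels:
          severity: critical
```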
9. How can alert dependency trees reduce noise during cascading failures?
Show answer
Model upstream-downstream relationships so that if an upstream service (e.g., database) fires critical, downstream alerts (e.g., API latency) are automatically inhibited or annotated. Alertmanager's inhibit_rules match on labels: a source alert with severity=critical can suppress target alerts with the same cluster label.
Remember: 'Alert upstream first, silence downstream.' Model your dependency tree so database alerts suppress API latency alerts.
🔴 Hard (6)
1. What are Alertmanager silences and when should you use them?
Show answer
Silences temporarily mute specific alerts by matching on labels. Use them during planned maintenance windows or known issues being worked on, to avoid distracting the on-call team. They should be time-bounded and documented.
Gotcha: unbounded silences (no expiry) are the #1 cause of missed incidents. Always set an expiry and document the reason.
Example: amtool silence add alertname=HighMemory --duration=2h --comment='Known issue JIRA-1234, deploying fix at 3pm'.
2. Why is alerting on error rate dangerous for low-traffic services, and how do you fix it?
Show answer
A single error on 10 requests yields a 10% error rate, triggering a false alert. Fix by adding a minimum request threshold (e.g., rate > 0.05 AND sum(rate(requests[5m])) > 10) so the alert only fires when traffic volume is meaningful.
Remember: 'Volume before velocity' — always pair rate alerts with a minimum traffic threshold to avoid false positives on low-volume services.
Example: rate(http_errors[5m]) / rate(http_requests[5m]) > 0.05 AND rate(http_requests[5m]) > 0.1 (at least ~6 req/min).
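As a rule-file fragment, the guarded expression from this card looks like the sketch below; metric names are illustrative:

```yaml
- alert: HighErrorRate
  # The second clause is the traffic guard: without it, 1 error in
  # 10 requests reads as a 10% error rate and fires a false alert.
  expr: |
    rate(http_errors_total[5m]) / rate(http_requests_total[5m]) > 0.05
    and rate(http_requests_total[5m]) > 0.1
  for: 5m
  labels:
    severity: warning
```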
3. What should a well-designed alert rule include beyond the PromQL expression?
Show answer
A for duration to debounce, severity label for routing, annotations with summary and description (using template variables like {{ $value }}), and a runbook_url linking to documented fix steps for the on-call engineer.
Remember: FSAR — For duration, Severity label, Annotations (summary + description), Runbook URL. Every production alert needs all four.
Example: annotations: { summary: 'High error rate on {{ $labels.service }}', runbook_url: 'https://wiki.example.com/runbooks/high-error-rate' }.
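Putting all four FSAR elements together in one rule; the metric name and wiki URL are placeholders:

```yaml
- alert: HighErrorRate
  expr: job:http_error_ratio:rate5m > 0.05
  for: 5m                                      # F - For duration (debounce)
  labels:
    severity: critical                         # S - Severity label for routing
  annotations:                                 # A - Annotations
    summary: "High error rate on {{ $labels.service }}"
    description: "Error ratio is {{ $value | humanizePercentage }} over 5m."
    runbook_url: "https://wiki.example.com/runbooks/high-error-rate"  # R - Runbook
```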
4. How do you configure inhibition rules to reduce cascading alert noise?
Show answer
In Alertmanager config, inhibit_rules suppress target alerts when a source alert is firing. Example: when a NodeDown alert fires (source: alertname=NodeDown, severity=critical), suppress all pod alerts on that node (target: severity=warning, equal: [node]). The equal field ensures the source and target share the same node label. This prevents hundreds of pod alerts when the real issue is a dead node.
Remember: 'Inhibit = upstream silences downstream.' NodeDown inhibits PodCrashLoop because pods cannot run on a dead node.
Example: source_matchers: [alertname=NodeDown, severity=critical], target_matchers: [severity=warning], equal: [node].
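The matchers from this card as an Alertmanager config fragment:

```yaml
inhibit_rules:
  - source_matchers:          # the alert that does the silencing
      - alertname = NodeDown
      - severity = critical
    target_matchers:          # the alerts being suppressed
      - severity = warning
    # Suppress only targets whose node label matches the source's,
    # so a dead node does not silence warnings on healthy nodes.
    equal: [node]
```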
5. How do SLO-based burn-rate alerts work and why are they superior to threshold alerts?
Show answer
Instead of alerting on "error rate > 5%", burn-rate alerts measure how fast you are consuming your error budget. A 14.4x burn rate over 1 hour means you will exhaust your monthly error budget in 2 days. Multi-window alerts (fast burn over 1h AND slow burn over 6h) catch both sudden spikes and slow degradations while avoiding false positives from brief transients.
Remember: burn rate = actual error rate / SLO error budget rate. A 14.4x burn rate means 'at this pace, you exhaust your monthly budget in 2 days.'
6. How does a multi-window burn-rate alert reduce false positives for SLO monitoring?
Show answer
It uses two windows: a long window (e.g., 1h) to detect sustained error budget consumption and a short window (e.g., 5m) to confirm the issue is still active. Both must fire simultaneously. This avoids paging on brief spikes that the long window catches in retrospect.
Remember: burn rate = actual error rate / SLO error budget rate. A 14.4x burn rate means 'at this pace, you exhaust your monthly budget in 2 days.'
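A sketch of a multi-window rule for a hypothetical 99.9% SLO (error budget rate 0.001, so a 14.4x burn threshold is 0.0144); metric names are illustrative:

```yaml
- alert: ErrorBudgetFastBurn
  # Long window (1h) detects sustained budget consumption;
  # short window (5m) confirms the burn is still happening now.
  # Both clauses must be true for the alert to page.
  expr: |
    (sum(rate(http_errors_total[1h])) / sum(rate(http_requests_total[1h]))) > (14.4 * 0.001)
    and
    (sum(rate(http_errors_total[5m])) / sum(rate(http_requests_total[5m]))) > (14.4 * 0.001)
  labels:
    severity: critical
```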