Interview Gauntlet: Alerts Firing but System Seems Fine
Category: Incident Response · Difficulty: L2-L3 · Duration: 15-20 minutes · Domains: Monitoring, Metrics Pipeline
Round 1: The Opening
Interviewer: "Alerts are firing for high error rate and elevated latency across multiple services. But when you look at the dashboards, check user reports, and test the API manually, everything seems perfectly fine. What's going on?"
Strong Answer:
"There are two possibilities: either the system had a real problem that's already resolved but the alerts are stale, or the monitoring system itself is producing bad data. I'd check the alert definition first — is this a firing alert or a pending-to-firing transition? What's the for: duration? If the alert has a for: 5m clause and the condition was true 5 minutes ago but is now false, the alert might still be firing depending on the evaluation interval. Next, I'd check the actual metric values in Prometheus: rate(http_requests_total{code=~"5.."}[5m]) / rate(http_requests_total[5m]). If the current error rate is 0% but the alert is firing, the alert is stale. I'd check Alertmanager's state directly — amtool alert query — to see when the alert started and whether Prometheus is still sending it. If the metric shows a real spike that happened and resolved, the question becomes why the alert didn't auto-resolve. This is usually a mismatch between the alert evaluation interval and the resolution behavior — Prometheus needs to evaluate the rule and find the condition false to resolve the alert."
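The for: clause and error-rate expression described in this answer can be sketched as a Prometheus rule file. A minimal sketch; the group name, alert name, and 5% threshold are illustrative, not from the answer:

```yaml
# Hypothetical alerting rule illustrating the for: clause discussed above.
groups:
  - name: availability
    # How often the rules below are re-evaluated. The alert only resolves
    # after an evaluation finds the condition false.
    interval: 1m
    rules:
      - alert: HighErrorRate
        # Fraction of requests returning 5xx over the last 5 minutes.
        # Note: the rate window means the spike stays visible for up to
        # 5 minutes after the last error.
        expr: |
          rate(http_requests_total{code=~"5.."}[5m])
            / rate(http_requests_total[5m]) > 0.05
        # Condition must hold across 5 minutes of evaluations before the
        # alert transitions from pending to firing.
        for: 5m
        labels:
          severity: page
```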
Common Weak Answers:
- "The alerts are wrong, just silence them." — Silencing without understanding why the alert is firing is how you miss the next real incident.
- "Someone probably just spiked the system." — Could be true but doesn't investigate why the alert persists after the spike resolved.
- "Restart Alertmanager." — A brute-force approach that doesn't diagnose the issue and might cause worse problems if there are real alerts pending.
Round 2: The Probe
Interviewer: "You check and discover the alert was correct — there was a genuine error spike 3 hours ago. An auto-remediation system (a Kubernetes CronJob) detected the issue and restarted the affected pods, which fixed it. But the alert never cleared. Why?"
What the interviewer is testing: Understanding of the alert lifecycle — specifically how Prometheus resolves alerts and what can prevent resolution.
Strong Answer:
"Prometheus evaluates alerting rules at a fixed interval (the rule group's evaluation_interval, default 1 minute). For an alert to resolve, the condition needs to evaluate to false for at least one evaluation cycle. If the alert rule uses a rate function over a window — like rate(errors[5m]) — the error rate will show the spike for 5 minutes after the last error, even if no new errors are occurring. After 5 minutes, the rate drops to zero and the alert should resolve. If the alert hasn't resolved after 3 hours, something else is going on. I'd check: first, is Prometheus actually evaluating the rule? curl http://prometheus:9090/api/v1/rules shows the last evaluation time and result for each rule. If the rule hasn't been evaluated recently, the Prometheus server might be overloaded or the rule group might be lagging. Second, check the Alertmanager side: even if Prometheus resolves the alert, Alertmanager has a resolve_timeout (default 5 minutes). If Prometheus stops sending the alert entirely (not resolving it, but just stops sending), Alertmanager will auto-resolve it after the resolve_timeout. But if Prometheus is still actively sending the alert as firing, Alertmanager keeps it open. Third, check for inhibition or grouping rules in Alertmanager that might be preventing the resolution notification from reaching the notification channel."
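The resolve_timeout safety net mentioned above lives in the Alertmanager configuration. A hedged sketch of the relevant fragment; the receiver name and intervals are illustrative:

```yaml
# alertmanager.yml (fragment) — illustrates the safety-net timeout
# discussed above. Receiver and route names are hypothetical.
global:
  # If Prometheus simply stops re-sending a firing alert (e.g. it
  # restarted and lost alert state), Alertmanager marks the alert
  # resolved this long after the alert was last received. This does
  # NOT apply while Prometheus actively keeps sending it as firing.
  resolve_timeout: 5m
route:
  receiver: oncall
  # Re-send still-firing alerts periodically so a missed notification
  # doesn't silently strand an open incident.
  repeat_interval: 4h
receivers:
  - name: oncall
```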
Trap Alert:
If the candidate bluffs here: The interviewer will ask "What's the difference between Prometheus marking an alert as resolved vs Alertmanager's resolve_timeout?" Prometheus explicitly sends resolved notifications when the condition becomes false. Alertmanager's resolve_timeout is a safety net for when Prometheus stops sending the alert entirely (e.g., Prometheus restarts and forgets the alert state). Mixing these up suggests the candidate has read about alerting but hasn't debugged a stuck alert in production.
Round 3: The Constraint
Interviewer: "The root cause is that the metrics pipeline is lagged. Prometheus is scraping targets that are returning stale metrics — the targets show a scrape_duration_seconds of 0.5 seconds and the scrape is succeeding, but the application metrics haven't been updated in 3 hours. The application's /metrics endpoint is returning the same values it had during the incident. Why?"
Strong Answer:
"The application's metrics endpoint is serving stale data. This happens when the metrics aren't being updated by the application itself. Common causes: the application has a metrics cache or a metrics collection thread that crashed or hung. In a Python application using prometheus_client, the metrics are stored in a global registry that updates on each request or via callbacks. If the application process forked (common with gunicorn with --workers > 1), each worker has its own registry, and the /metrics endpoint might be served by a different worker than the one handling requests. The metrics from the worker that handled the error spike are frozen because that worker is no longer receiving traffic. Another possibility: the application uses a push gateway or an intermediate metrics collector that's caching and not refreshing. Or the application's metrics endpoint is behind a separate HTTP server that crashed while the main application is fine. I'd check: curl http://pod-ip:8080/metrics | grep http_requests_total directly from inside the cluster and compare the values over two scrapes 30 seconds apart. If the counter value doesn't change between scrapes despite active traffic, the metrics endpoint is stale. The fix depends on the cause — if it's a multi-process Python app, the solution is using the prometheus_client multiprocess mode with a shared directory."
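The two-scrape comparison described above can be sketched in Python. This is a simplified sketch, not a full Prometheus text-format parser: label matching is ignored, and the function names and sample payloads are hypothetical.

```python
def counter_value(scrape_text: str, metric: str) -> float:
    """Sum all samples of `metric` in a Prometheus text-format scrape.

    Simplified sketch: ignores labels and assumes well-formed lines.
    """
    total = 0.0
    for line in scrape_text.splitlines():
        # Skip comments (# HELP / # TYPE) and unrelated metrics.
        if line.startswith("#") or not line.startswith(metric):
            continue
        # The sample value is the last whitespace-separated field.
        total += float(line.split()[-1])
    return total


def is_stale(first_scrape: str, second_scrape: str, metric: str) -> bool:
    """A counter that fails to advance between two scrapes taken while
    the service is receiving traffic indicates a frozen /metrics endpoint."""
    return counter_value(second_scrape, metric) <= counter_value(first_scrape, metric)


# Hypothetical scrape payloads captured ~30 seconds apart.
scrape_1 = 'http_requests_total{code="200"} 1500\nhttp_requests_total{code="500"} 12\n'
scrape_2 = 'http_requests_total{code="200"} 1500\nhttp_requests_total{code="500"} 12\n'

# Identical counter values despite active traffic -> stale endpoint.
print(is_stale(scrape_1, scrape_2, "http_requests_total"))  # True
```

In production you would scrape the pod directly (bypassing any caching layer) rather than parse text by hand, but the comparison logic is the same.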
The Senior Signal:
What separates a senior answer: Knowing about the Python prometheus_client multiprocess pitfall. In gunicorn with multiple workers, each worker maintains its own metric registry, and the /metrics endpoint only shows metrics from the worker that happens to serve the scrape request. This is a well-known gotcha that catches teams when they scale up workers. The fix is prometheus_client.CollectorRegistry with multiprocess.MultiProcessCollector and a shared prometheus_multiproc_dir. Mentioning this specific scenario shows real debugging experience.
Round 4: The Curveball
Interviewer: "Let's say the metrics staleness went undetected for a week. During that week, your SLO dashboard showed 100% availability because the error counters weren't incrementing. But you actually had two small incidents. Your monthly SLO report goes to the VP of Engineering. Do you report 100% (what the data shows) or do you flag the data quality issue?"
Strong Answer:
"I flag the data quality issue. Reporting 100% availability when we know the measurement was broken would be misleading — and if it came out later, it would undermine trust in the entire SLO program. I'd present the report with a clear caveat: 'Our SLO metrics were unreliable for the period of [dates] due to a stale metrics endpoint. The reported 100% availability during this window cannot be trusted. Based on incident records, we had two incidents totaling approximately X minutes of user-facing impact, which would put our actual availability at approximately Y%.' I'd also include a remediation section: what caused the metrics gap, what we've done to fix it (monitoring the metrics pipeline itself, alerting on stale counters), and how we'll prevent it in the future. The VP wants to know: are we reliable, and do we know when we're not? Admitting that our observability had a blind spot is better than presenting false confidence. And honestly, this is a great argument for investing in observability improvements — we literally couldn't see our own incidents."
Trap Question Variant:
The right answer is clearly "flag the issue." But the trap is in how the candidate frames it. Saying "I'd just tell the VP the data is wrong" is blunt but not helpful. Saying "I'd bury a footnote" is weaselly. The senior approach is to lead with the known impact, quantify the uncertainty, present the fix, and use it as a lever for improvement. Candidates who treat this as purely a technical question miss the organizational trust dimension.
Round 5: The Synthesis
Interviewer: "You've now seen three failure modes: stale alerts, stale metrics, and stale SLO reports. They're all variations of the same problem. What's the underlying issue, and how do you build a monitoring system that monitors itself?"
Strong Answer:
"The underlying issue is that we treat the monitoring system as a source of truth without verifying its freshness. Every metric, alert, and dashboard has an implicit assumption: the data is current. When that assumption breaks, everything downstream becomes unreliable — from alert routing to executive reporting. To build self-monitoring: first, freshness checks. Every Prometheus scrape target should expose a signal that moves while the application is healthy — a custom last_updated_timestamp gauge the app refreshes, or a counter known to increment under normal traffic. (process_start_time_seconds only changes on restart, so it detects crashes, not frozen metrics.) If the signal doesn't change between consecutive scrapes despite active traffic, the target is stale. Alert on it: changes(my_app_last_updated_timestamp[5m]) == 0. Second, meta-alerts for the pipeline. Alert if Prometheus scrape failures exceed a threshold, if Alertmanager has been unreachable for more than 2 minutes, if the metrics cardinality is exploding (which causes Prometheus to slow down). Third, the dead man's switch pattern: a Watchdog alert that's always firing. If PagerDuty stops receiving the Watchdog heartbeat, it means the entire alert pipeline is broken, and PagerDuty auto-escalates. Fourth, external validation: a synthetic monitoring service that runs outside the monitoring stack and independently verifies that the API is working. This provides a second opinion that doesn't depend on Prometheus being healthy. The principle is defense in depth — no single monitoring layer should be trusted without verification from another layer."
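The self-monitoring patterns in this answer translate into a handful of rules. A hedged sketch; the metric and alert names (other than the conventional Watchdog) are hypothetical:

```yaml
# Hypothetical meta-monitoring rules for the patterns described above.
groups:
  - name: meta-monitoring
    rules:
      # Freshness check: fire if the app's heartbeat gauge stops moving.
      - alert: MetricsEndpointStale
        expr: changes(my_app_last_updated_timestamp[5m]) == 0
        for: 5m
      # Pipeline health: fire if any target's scrapes are failing.
      - alert: ScrapeFailing
        expr: up == 0
        for: 5m
      # Dead man's switch: this alert is ALWAYS firing. The paging
      # provider escalates when the heartbeat *stops* arriving, which
      # means the Prometheus -> Alertmanager -> pager path is broken.
      - alert: Watchdog
        expr: vector(1)
```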
What This Sequence Tested:
| Round | Skill Tested |
|---|---|
| 1 | Alert investigation methodology and stale alert awareness |
| 2 | Prometheus/Alertmanager alert lifecycle mechanics |
| 3 | Metrics pipeline debugging and application-level metric issues |
| 4 | Data integrity ethics and stakeholder communication |
| 5 | Meta-monitoring design and observability reliability |