# Interview Gauntlet: Monitoring Stack from Scratch

**Category:** System Design · **Difficulty:** L2-L3 · **Duration:** 15-20 minutes · **Domains:** Observability, SRE Practices
## Round 1: The Opening

**Interviewer:** "You're joining a startup with 15 microservices, no monitoring, and a history of finding outages from customer complaints. Design a monitoring stack from scratch."
### Strong Answer
"I'd start with the three pillars but prioritize ruthlessly. Week one: metrics with Prometheus and Grafana. Deploy Prometheus via the `kube-prometheus-stack` Helm chart, which gives you Prometheus, Alertmanager, Grafana, and a set of default recording rules and dashboards for Kubernetes, node, and container metrics out of the box. Each service needs to expose a `/metrics` endpoint; for most frameworks, that's a middleware or library. I'd define four golden signals per service: request rate, error rate, latency (p50/p95/p99), and saturation (CPU, memory). Week two: logging with Loki and Fluent Bit, with structured JSON logs from every service, queryable in Grafana. Week three: basic alerting. I'd start with just three alerts per service: error rate above 1% for 5 minutes, p99 latency above the SLO threshold, and pod restarts above 2 in 10 minutes. Ship alerts to Slack for now, upgrade to PagerDuty when the team is ready for on-call. Tracing comes last because it requires instrumentation effort; I'd plan it for month two using OpenTelemetry."
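The three starter alerts could be sketched as Prometheus alerting rules. This is a hedged sketch, not a drop-in config: the metric names (`http_requests_total`, `http_request_duration_seconds_bucket`, `kube_pod_container_status_restarts_total`) follow common client-library and kube-state-metrics conventions and may differ per deployment.

```yaml
groups:
  - name: starter-alerts
    rules:
      # Error rate above 1% for 5 minutes, per service.
      - alert: HighErrorRate
        expr: |
          sum by (service) (rate(http_requests_total{code=~"5.."}[5m]))
            / sum by (service) (rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: page
      # p99 latency above a 500ms SLO threshold (assumes a duration histogram).
      - alert: HighP99Latency
        expr: |
          histogram_quantile(0.99,
            sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))) > 0.5
        for: 10m
        labels:
          severity: warning
      # More than 2 container restarts in 10 minutes (kube-state-metrics).
      - alert: PodRestartingTooOften
        expr: increase(kube_pod_container_status_restarts_total[10m]) > 2
        labels:
          severity: warning
```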
### Common Weak Answers
- "I'd set up Datadog." — Not wrong, but doesn't show architectural thinking. The interviewer wants to see you reason about what to monitor and why, not pick a vendor.
- "We need tracing first because it's the most powerful." — Tracing requires the most instrumentation effort and the least immediate payoff for a team finding outages from customer complaints. Metrics first.
- Listing 20 things to monitor without prioritization — The context is a startup with no monitoring. Boiling the ocean is the wrong answer; the right answer is rapid time-to-first-alert.
## Round 2: The Probe

**Interviewer:** "Six months in, your on-call engineer is getting 40 alerts per shift and ignoring most of them. How did you get here, and how do you fix it?"
**What the interviewer is testing:** Alert fatigue is the number one failure mode of monitoring systems. This tests whether the candidate has operated alerts in production or just set them up.
### Strong Answer
"Alert fatigue almost always comes from one of three sources: threshold alerts that are too sensitive, alerts that fire on symptoms rather than user impact, and alerts that have no clear action. I'd start with an alert audit. Pull the last 30 days of alert history and categorize: how many were actionable (someone needed to do something), how many were self-resolving (the system recovered before a human intervened), and how many were noise (the alert fired but nothing was actually wrong). In my experience, 60-70% of alerts in an immature system are noise or self-resolving. For self-resolving alerts, I'd increase the `for` duration in Prometheus: if a condition needs to persist for 15 minutes instead of 5 to fire an alert, most transient spikes get filtered. For noise, I'd delete the alert entirely. For the remaining actionable alerts, I'd add a runbook link to every one; if you can't write a runbook for it, you can't alert on it. Then I'd tier the alerts: page-worthy alerts wake someone up (user-facing error rate, total service down), warning alerts go to Slack (elevated latency, disk usage trending), and informational alerts go to a dashboard only. The goal is fewer than 5 pages per on-call shift."
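The audit step above can be sketched in a few lines. This is a minimal illustration assuming a hypothetical export schema (`human_acted`, `auto_resolved`, `real_problem` flags per alert); a real audit would pull this from Alertmanager or the paging tool's API.

```python
from collections import Counter

def audit_alerts(history):
    """Categorize alert history into actionable / self-resolving / noise.

    `history` is a list of dicts with boolean fields `human_acted`,
    `auto_resolved`, and `real_problem` (a hypothetical export schema).
    """
    counts = Counter()
    for alert in history:
        if not alert["real_problem"]:
            counts["noise"] += 1           # fired, nothing was wrong: delete the rule
        elif alert["auto_resolved"] and not alert["human_acted"]:
            counts["self_resolving"] += 1  # recovered alone: lengthen the `for` duration
        else:
            counts["actionable"] += 1      # keep, and attach a runbook link
    return counts

history = [
    {"name": "HighCPU", "human_acted": False, "auto_resolved": True, "real_problem": True},
    {"name": "DiskFull", "human_acted": True, "auto_resolved": False, "real_problem": True},
    {"name": "FlappyCheck", "human_acted": False, "auto_resolved": True, "real_problem": False},
]
print(audit_alerts(history))
```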
### Trap Alert

**If the candidate bluffs here:** The interviewer will ask, "What's a healthy alert-to-incident ratio?" There's no universal number, but a reasonable answer is: "Every alert that pages should result in a human taking action at least 80% of the time. If your page-to-action ratio is below 50%, your alerts need pruning." Candidates who invent precise industry statistics are bluffing.
## Round 3: The Constraint

**Interviewer:** "The CEO wants to promise customers 99.95% uptime in the SLA. You're currently at roughly 99.5% but you're not measuring precisely. How do you build SLO-based monitoring to get there?"
### Strong Answer
"First, I need to define what 'uptime' means in precise terms: an SLI, a Service Level Indicator. For an API service, the SLI might be the proportion of HTTP requests that return a non-5xx response within 500ms, measured at the load balancer. This is better than synthetic uptime checks because it reflects actual user experience. The availability component is a Prometheus recording rule, `sum(rate(http_requests_total{code!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))`; the latency component would come from a duration histogram. Then set the SLO at 99.95%, which means an error budget of 0.05%: roughly 22 minutes of downtime per month, or about 260 minutes per year. I'd build an error budget burn-rate alert using the multi-window, multi-burn-rate approach from the Google SRE Workbook: if we're burning through our monthly error budget at 14.4x the sustainable rate over the last hour, page immediately; if we're burning at 6x over 6 hours, file a ticket. This replaces most of the threshold-based alerts with one unified signal: are we on track to meet our SLO or not? In Grafana, I'd build an error budget dashboard showing remaining budget, burn-rate trend, and time until budget exhaustion at the current rate. The business team gets a monthly SLO report."
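The error-budget arithmetic in the answer can be checked in a few lines. The 14.4x and 6x thresholds are the well-known values from the Google SRE Workbook; everything else below is plain arithmetic.

```python
def error_budget_minutes(slo: float, period_minutes: float) -> float:
    """Downtime allowed by an SLO over a period, in minutes."""
    return (1 - slo) * period_minutes

MONTH = 30 * 24 * 60   # 43,200 minutes
YEAR = 365 * 24 * 60   # 525,600 minutes

monthly = error_budget_minutes(0.9995, MONTH)  # ~21.6 min -> "roughly 22"
yearly = error_budget_minutes(0.9995, YEAR)    # ~262.8 min -> "about 260"

# Burn-rate intuition: at 14.4x the sustainable burn rate, a 1-hour window
# consumes 14.4 * (60 / 43200) = 2% of the monthly budget; at 6x over
# 6 hours, 6 * (360 / 43200) = 5%.
fast_burn_fraction = 14.4 * 60 / MONTH
slow_burn_fraction = 6 * 360 / MONTH
print(round(monthly, 1), round(yearly, 1), fast_burn_fraction, slow_burn_fraction)
```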
### The Senior Signal

**What separates a senior answer:** Understanding the difference between an SLA (contractual, with financial penalties) and an SLO (an internal target that drives engineering decisions). Knowing that the SLO should be stricter than the SLA: if the SLA is 99.95%, the internal SLO should be 99.97% or higher to provide a safety margin. And understanding that the error budget is a tool for balancing reliability investment against feature velocity: when the budget is healthy, ship faster; when it's depleted, focus on reliability.
## Round 4: The Curveball

**Interviewer:** "Your monitoring stack itself goes down. Prometheus crashes, Grafana is unreachable, Alertmanager is not sending alerts. How do you know? Who tells you?"
### Strong Answer
"This is the meta-monitoring problem: who watches the watchers? You need at least one external, independent signal that doesn't depend on your primary stack. I'd use an external synthetic monitoring service, something like UptimeRobot, Pingdom, or a simple AWS Lambda that hits our service endpoints every minute and pages via a completely separate channel (a direct PagerDuty API call or an SMS gateway, not our Alertmanager). This is cheap insurance, on the order of $10/month for basic synthetic checks. Within the stack, I'd run Alertmanager in HA mode (the `--cluster.peer` flag) and configure a dead man's switch: an alert that is always firing. If PagerDuty stops receiving the dead man's switch heartbeat, Alertmanager is down, and PagerDuty auto-escalates. The `Watchdog` alert in the kube-prometheus-stack does exactly this. For Prometheus itself, in a multi-Prometheus setup (such as Thanos or Prometheus with federation), each instance can monitor the other. But the absolute minimum is: one external check that pages through a channel completely independent of your monitoring infrastructure."
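The dead man's switch pattern can be sketched as an always-firing rule plus a dedicated Alertmanager route. This is a hedged sketch: the webhook URL is a placeholder for whatever external heartbeat service receives the signal, and the two YAML documents below live in separate files (Prometheus rules vs. Alertmanager config) in a real deployment.

```yaml
# Prometheus rule: fires unconditionally. Silence means the pipeline is broken.
groups:
  - name: meta-monitoring
    rules:
      - alert: DeadMansSwitch
        expr: vector(1)
        labels:
          severity: heartbeat
        annotations:
          summary: "Always firing; its absence means alerting is down."
---
# Alertmanager route: forward the heartbeat to an external dead-man's-switch
# service that pages through an independent channel if the signal stops.
route:
  routes:
    - match:
        alertname: DeadMansSwitch
      receiver: heartbeat
      repeat_interval: 1m
receivers:
  - name: heartbeat
    webhook_configs:
      - url: https://example.com/heartbeat   # placeholder endpoint
```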
### Trap Question Variant

The right answer is acknowledging the recursion. Candidates who say "Prometheus monitors itself" are missing the point: if Prometheus is down, its self-monitoring is also down. The insight is that you need an out-of-band signal, even if it's simple. At the other extreme, candidates who overengineer this with multiple layers of meta-monitoring are theorizing rather than operating. One external heartbeat check is enough for most organizations.
## Round 5: The Synthesis

**Interviewer:** "You've built this monitoring stack over a year. What would you do differently if you started over, knowing what you know now?"
### Strong Answer
"Three things. First, I'd start with SLOs from day one instead of adding them later. We spent months building threshold-based alerts that we later replaced with error budget alerts. If I'd started with 'what does the customer experience' instead of 'what technical metric is interesting,' we'd have had better signal sooner with less work. Second, I'd invest in structured logging standards before deploying the log pipeline. We spent weeks dealing with inconsistent log formats across services: some JSON, some plaintext, different field names for the same concept. A shared logging library with standard fields (`request_id`, `service_name`, `trace_id`) would have saved us weeks of parsing work. Third, I'd establish on-call norms before turning on paging. We turned on PagerDuty alerts before we had runbooks, escalation policies, or agreement on what constitutes a page-worthy event. That led to burnout and alert fatigue that took months to unwind. The monitoring tools are the easy part; the organizational practices around them are what make or break the system."
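The shared logging library mentioned in the second point could be as small as a JSON formatter that guarantees the standard fields are always present. A minimal sketch using Python's stdlib `logging`; the field set and the `checkout` service name are illustrative assumptions.

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line with a fixed set of standard fields."""

    def __init__(self, service_name: str):
        super().__init__()
        self.service_name = service_name

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": round(record.created, 3),
            "level": record.levelname,
            "service_name": self.service_name,
            "message": record.getMessage(),
            # Per-request context arrives via `extra=`; defaults keep the
            # keys present (and queryable) even when no context was attached.
            "request_id": getattr(record, "request_id", None),
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload)

def get_logger(service_name: str) -> logging.Logger:
    """Return a logger every service would import from the shared library."""
    logger = logging.getLogger(service_name)
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter(service_name))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger

log = get_logger("checkout")  # hypothetical service name
log.info("payment accepted", extra={"request_id": "req-123", "trace_id": "abc"})
```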
## What This Sequence Tested
| Round | Skill Tested |
|---|---|
| 1 | Monitoring architecture fundamentals and prioritization |
| 2 | Operational experience with alert fatigue and remediation |
| 3 | SLO/SLI/error budget implementation knowledge |
| 4 | Meta-monitoring awareness and practical out-of-band monitoring |
| 5 | Reflection, learning from mistakes, and organizational thinking |
## Prerequisite Topic Packs
- Monitoring Fundamentals
- Prometheus Deep Dive
- Observability Deep Dive
- SRE Practices
- Postmortem and SLO