Interview Gauntlet: Your Approach to On-Call¶
Category: Behavioral + Technical Hybrid | Difficulty: L2-L3 | Duration: 15-20 minutes | Domains: SRE Practices, On-Call
Round 1: The Opening¶
Interviewer: "Describe your approach to on-call. How do you prepare for a shift, and what does a typical on-call week look like for you?"
Strong Answer:¶
"My on-call preparation starts the day before the shift. I check three things: the deployment log for the past week (what changed in production?), the open incident backlog (are there any known issues I should be aware of?), and the alert history (what fired during the previous shift and was anything left unresolved?). I also verify my setup: laptop charged, VPN connected, PagerDuty notifications working on both phone and watch, and the runbook index bookmarked. During the shift, my goal is to keep response time under 10 minutes for pages. I organize my week to be interrupt-friendly: I schedule deep-work tasks (coding, design) for off-shift, and use on-call time for tasks that can be interrupted — reviewing PRs, updating runbooks, improving monitoring. When I get paged, my first 2 minutes are: check the alert, check the linked dashboard, check if this alert has a runbook. If it has a runbook, follow it. If it doesn't, I triage: is this user-facing? Is it degradation or total outage? How many users are affected? Then I either resolve it, escalate it, or acknowledge and monitor if it's self-resolving. At the end of my shift, I do a handoff: a Slack message to the next on-call with any open issues, any alerts that fired but might recur, and anything I noticed that should be fixed but wasn't urgent."
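The "first 2 minutes" decision flow in the answer above can be sketched as a small routine. This is a hedged illustration, not a real pager API: the `Alert` fields, thresholds, and action names are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Alert:
    name: str
    has_runbook: bool
    user_facing: bool
    total_outage: bool
    self_resolving: bool

def first_response(alert: Alert) -> str:
    """Return the next action for a fresh page, per the flow described above."""
    if alert.has_runbook:
        return "follow-runbook"           # documented response exists: use it
    if alert.self_resolving:
        return "acknowledge-and-monitor"  # watch for recurrence, don't thrash
    if alert.user_facing and alert.total_outage:
        return "escalate"                 # severe impact, no runbook: get help
    return "triage"                       # scope the impact, keep investigating

print(first_response(Alert("payment-errors", False, True, True, False)))  # → escalate
```

The point is not the code itself but that the branch order is fixed in advance: runbook first, then impact, so no decision is improvised at 3 AM.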
Common Weak Answers:¶
- "I keep my phone nearby and respond when I get paged." — This is the minimum. It shows no preparation, no systematic approach, and no proactive improvement.
- "On-call doesn't bother me, I just handle it." — Sounds tough but doesn't demonstrate a process. Every experienced on-call engineer has a system.
- No mention of handoffs — On-call without handoffs means knowledge is lost at every shift transition. This is an organizational failure signal.
Round 2: The Probe¶
Interviewer: "You get a page at 3 AM. The alert says 'high error rate on payment service.' Walk me through your first 5 minutes, minute by minute."
What the interviewer is testing: Whether the candidate has a practiced triage routine or makes it up as they go.
Strong Answer:¶
"Minute 0-1: I acknowledge the page in PagerDuty from my phone to stop the escalation timer. I open my laptop and pull up the payment service dashboard while it boots. Minute 1-2: I check the error rate graph. Is it 1%, 10%, or 100%? Is it spiking or sustained? Is it climbing or plateauing? I also check the request rate — a drop in requests alongside errors might mean the load balancer is health-checking pods out of rotation. Minute 2-3: I check for recent deployments: `kubectl rollout history deployment/payment -n production` or the deployment pipeline's last runs. If there was a deployment in the last 30 minutes, it's the prime suspect. Minute 3-4: I check the pod status — are pods running? restarting? OOMKilled? Then I check the error logs: `kubectl logs -l app=payment -n production --tail=100 --since=5m | grep -i error`. I'm looking for a specific error message that points me to the cause. Minute 4-5: Based on what I've found, I make a call. If it's a bad deployment and errors are above 5%, I start a rollback: `kubectl rollout undo deployment/payment -n production`. If it's not deployment-related, I post an initial finding to the incident channel, declare severity based on impact, and continue investigating. If I can't determine the cause in 5 minutes and the impact is severe, I escalate to the service owner — even at 3 AM, because a 100% payment failure is worth waking someone up."
Trap Alert:¶
If the candidate bluffs here: The interviewer will ask "How do you decide when to wake someone up vs handling it yourself?" The honest answer involves thresholds: "If it's a service I own and the impact is moderate, I handle it. If it's a service I don't deeply understand, or the impact is severe (revenue-affecting, data-loss risk), I escalate. I'd rather wake someone up for a false alarm than spend an hour on something they could fix in 5 minutes." Candidates who say "I never escalate" are either overconfident or haven't faced a serious enough incident.
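The escalation threshold in that honest answer is itself a two-input rule. A sketch, paraphrasing the answer rather than stating a universal policy:

```python
def should_escalate(own_service: bool, severe_impact: bool) -> bool:
    """Escalate when the service isn't deeply understood by the on-call
    engineer, or the impact is severe (revenue-affecting, data-loss risk)."""
    return (not own_service) or severe_impact

print(should_escalate(own_service=True, severe_impact=False))   # handle it: False
print(should_escalate(own_service=False, severe_impact=False))  # escalate: True
```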
Round 3: The Constraint¶
Interviewer: "Your on-call rotation has 4 engineers. One of them gets 80% of the pages because they're the most experienced and always volunteers to take the hard alerts. How do you fix the uneven distribution without burning out the experienced engineer or dropping reliability?"
Strong Answer:¶
"This is a common dysfunction — the hero anti-pattern. The experienced engineer absorbs all the pain, which means the other 3 engineers never develop on-call skills, and when the hero goes on vacation or quits, the team is in crisis. Three-part fix. First, redistribute the alert routing. Pages should go to whoever is on-call, not to the most experienced person. If alert routing has an override or escalation chain that always lands on one person, remove it. The on-call engineer handles the alert, with the experienced engineer available as an escalation (second-level, not first-level). Second, invest in runbooks. The reason the experienced engineer handles everything is probably that they have the context in their head. Extract that context into runbooks — for every recurring alert, document: what does this alert mean, what should I check first, what's the likely fix, and when should I escalate. The experienced engineer writes the first draft; others validate by following the runbook during live incidents. Third, pair on incidents. For the next month, the on-call engineer handles the incident with the experienced engineer shadowing on a Zoom call. The on-call person drives, the experienced person coaches. This builds confidence and transfers knowledge faster than any documentation. Measurement: track pages per person per month and resolution time by engineer. The goal is even page distribution with acceptable resolution time (it might increase slightly during the learning period — that's expected)."
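The "pages per person per month" measurement at the end of that answer is easy to sketch. The page log below is fabricated sample data; with a 4-person rotation, each engineer should see roughly 25% of pages, and anyone far above that is the hero:

```python
from collections import Counter

# Hypothetical one-month page log: who acknowledged each page.
pages = ["alice", "alice", "bob", "alice", "carol", "alice", "dave",
         "alice", "alice", "bob", "alice", "alice"]

counts = Counter(pages)
total = len(pages)
share = {eng: n / total for eng, n in counts.items()}
# Flag anyone handling more than half the pages in a 4-person rotation.
heroes = [eng for eng, s in share.items() if s > 0.5]

print(counts.most_common(1), heroes)  # → [('alice', 8)] ['alice']
```

Tracking this number monthly makes the anti-pattern visible before the hero burns out or quits.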
The Senior Signal:¶
What separates a senior answer: Identifying the hero anti-pattern by name and understanding that it's an organizational risk, not just a fairness issue. If one person holds all the on-call knowledge, the team has a single point of failure for incident response. The three-part fix (redistribute routing, write runbooks, pair on incidents) addresses the immediate problem and the underlying knowledge gap simultaneously. Also: accepting that resolution time might temporarily increase as the team levels up, and framing that as acceptable short-term cost for long-term resilience.
Round 4: The Curveball¶
Interviewer: "You've been on-call for 6 months and you're getting burned out. The 3 AM pages are affecting your sleep, your focus during the day, and your job satisfaction. How do you raise this without seeming like you can't handle the job?"
Strong Answer:¶
"I'd frame it as a system problem, not a personal problem, because it almost certainly is. If I'm burned out, others are or will be too. I'd come to my manager with data: 'Over the last 6 months, I've been paged an average of X times per on-call shift, with Y of those pages between midnight and 6 AM. The median time-to-resolve is Z minutes, meaning I lose an average of A hours of sleep per shift.' Then I'd propose specific improvements rather than just complaining. First, noise reduction: review the last 3 months of pages and categorize them as actionable (required human intervention), self-resolving (system recovered before I could act), and false positive (alert condition was met but nothing was actually wrong). In my experience, 50-60% of off-hours pages fall into the self-resolving or false positive categories. Fixing those — by increasing alert `for` durations (so the condition must persist before the alert fires), improving health checks, or adding auto-remediation — directly reduces 3 AM pages. Second, on-call compensation: if the company expects engineers to be available 24/7 one week per month, there should be compensation — either monetary (on-call stipend, per-page bonus) or time (a day off after a particularly brutal shift). Third, invest in reliability: the long-term fix for on-call burnout is reducing the number of incidents. Every postmortem action item that prevents a recurrence is one fewer 3 AM page. I'd propose dedicating 20% of engineering time to reliability improvements, justified by the on-call page count data."
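The noise-reduction review described above is a simple classification pass over the page history. A sketch with fabricated sample data; the category labels come from the answer, the records are hypothetical:

```python
# Hypothetical page records: hour of day and post-hoc classification.
pages = [
    {"hour": 3,  "category": "self-resolving"},
    {"hour": 4,  "category": "false-positive"},
    {"hour": 14, "category": "actionable"},
    {"hour": 2,  "category": "actionable"},
    {"hour": 5,  "category": "self-resolving"},
]

off_hours = [p for p in pages if p["hour"] < 6]  # midnight to 6 AM
avoidable = [p for p in off_hours
             if p["category"] in ("self-resolving", "false-positive")]

print(len(avoidable), "of", len(off_hours), "off-hours pages were avoidable")
```

Numbers like these turn "I'm burned out" into "3 of our 4 off-hours pages last month required no human action", which is a proposal a manager can act on.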
Trap Question Variant:¶
The right answer is raising it proactively and with data. Candidates who say "I just power through it" are normalizing unsustainable practices. Candidates who say "I'd look for another job" are skipping the step of trying to fix the problem. The senior answer: data-driven proposal, specific improvements, and framing it as an organizational concern rather than a personal complaint. "I can handle it" is not a badge of honor when the system is broken.
Round 5: The Synthesis¶
Interviewer: "If you were designing an on-call program from scratch for a 20-engineer team, what are the non-negotiable elements?"
Strong Answer:¶
"Six non-negotiables. First, every page has a runbook. If an alert fires and there's no documented response, either the alert shouldn't exist or the runbook is missing. We create the runbook before we create the alert. Second, on-call rotation of at least 5-6 people per team. Fewer than that means each person is on-call too frequently. With 20 engineers across 3-4 teams, this is achievable if the alert scope is right. Third, defined SLAs for response: acknowledge within 5 minutes, triage within 15 minutes, escalate or resolve within 60 minutes. These set expectations for both the on-call engineer and the people waiting for a response. Fourth, postmortem for every Sev-1 and Sev-2 incident, with at least one action item that reduces the probability of recurrence. The postmortem is the primary mechanism for reducing on-call burden over time. Fifth, on-call handoffs. A structured handoff at shift change: what's ongoing, what might recur, what changed in the environment. Even a 5-minute Slack summary is better than nothing. Sixth, regular alert review. Monthly, review all alerts that fired: which ones led to action, which ones were noise? Delete or tune the noise. Track the signal-to-noise ratio (actionable pages / total pages) and set a target above 80%. An on-call program where engineers are confident that a page means something real and have a runbook to guide them is sustainable. A program where every page might be nothing and there's no documentation is a burnout factory."
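The sixth non-negotiable gives a concrete metric: signal-to-noise ratio as actionable pages over total pages, with the answer's >80% target. A minimal sketch (the sample counts are illustrative):

```python
def signal_to_noise(actionable: int, total: int) -> float:
    """Fraction of pages that required human action; 1.0 if no pages fired."""
    return actionable / total if total else 1.0

ratio = signal_to_noise(actionable=42, total=60)
print(f"{ratio:.0%}", "OK" if ratio > 0.8 else "needs tuning")  # → 70% needs tuning
```

Reviewing this number monthly, and deleting or tuning the alerts that drag it down, is what keeps the program on the sustainable side of the line the answer draws.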
What This Sequence Tested:¶
| Round | Skill Tested |
|---|---|
| 1 | On-call preparation and systematic approach |
| 2 | Practiced triage routine under pressure |
| 3 | On-call team management and knowledge distribution |
| 4 | Self-awareness and burnout prevention advocacy |
| 5 | On-call program design and organizational thinking |