On Call¶
20 cards — 🟢 6 easy | 🟡 8 medium | 🔴 6 hard
🟢 Easy (6)¶
1. What is the Incident Commander (IC) role, and what should they NOT do during an incident?
Show answer
The IC coordinates the response: declares the incident and severity, opens the war room, assigns roles, sets investigation direction, decides on escalation, and declares resolution. The IC does NOT debug code, SSH into servers, write queries, or get tunnel-visioned on one theory.
Remember: Detect→Triage→Mitigate→Resolve→Postmortem. "DTMRP."
2. What are the four key incident response roles and their responsibilities?
Show answer
Incident Commander (coordinates response, makes decisions), Technical Lead (drives investigation and remediation), Communications Lead (updates statuspage, customers, stakeholders), and Scribe (records timeline, decisions, and actions in real time).
Remember: Detect→Triage→Mitigate→Resolve→Postmortem. "DTMRP."
3. How do you decide between SEV-1 and SEV-2 using a severity decision tree?
Show answer
If the service is completely down, it is SEV-1. If not completely down but more than 10% of users are affected without a workaround, it is SEV-2. If fewer users are affected, it is SEV-3. No customer impact is SEV-4.
Remember: SEV1=outage, SEV2=degraded, SEV3=limited, SEV4=cosmetic.
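The decision tree above can be sketched as a small function. This is an illustrative sketch, not a standard API: the function name, parameters, and the 10% threshold simply mirror the card.

```python
def classify_severity(completely_down: bool,
                      pct_users_affected: float,
                      workaround_exists: bool,
                      customer_impact: bool) -> str:
    """Walk the SEV decision tree top-down, most severe case first."""
    if completely_down:
        return "SEV-1"  # total outage
    if pct_users_affected > 10 and not workaround_exists:
        return "SEV-2"  # significant degradation, no workaround
    if customer_impact:
        return "SEV-3"  # limited impact
    return "SEV-4"      # cosmetic / no customer impact
```

Checking each branch in order (most severe first) is the point of the tree: it prevents under-classifying a total outage just because few users happen to be online.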
4. What are the key principles for designing a fair on-call rotation?
Show answer
1-week shifts (not 2 — burnout risk). Equal distribution across team members. Minimum 2 people per rotation (primary + secondary). Allow shift swaps with advance notice. Avoid scheduling during planned PTO. Ensure timezone coverage for global teams. Secondary becomes next week's primary for context continuity. Track on-call load per person quarterly and rebalance if uneven.
Remember: Good on-call = runbooks + escalation + blameless postmortems.
Gotcha: Every alert should be actionable. Can wait until morning? Not page-worthy.
5. What are the warning signs of on-call fatigue and how do you address them?
Show answer
Warning signs: ignoring non-critical alerts, delayed ack times trending up, engineers requesting removal from rotation, increased sick days during on-call weeks, and cynicism about alerting. Address by: reducing alert noise (fix or delete noisy alerts), ensuring no more than 2 pages per day shift, zero pages as the night-shift goal, compensating on-call time, and rotating fairly so no one is on-call more than 25% of the time.
Remember: Alert fatigue → missed alerts. Cure: delete noise, aggregate, proper thresholds.
6. What are common approaches to on-call compensation?
Show answer
Flat stipend per on-call shift (e.g., $200-500/week). Per-page bonus on top of the stipend. Time-off-in-lieu (comp day after a disruptive shift). Reduced sprint velocity during on-call weeks. Some companies combine stipend + per-page + comp time for severe incidents. The key principle: on-call is real work outside normal hours and must be compensated. Uncompensated on-call leads to burnout and attrition.
Remember: Good on-call = runbooks + escalation + blameless postmortems.
Gotcha: Every alert should be actionable. Can wait until morning? Not page-worthy.
🟡 Medium (8)¶
1. What is a typical PagerDuty escalation policy structure and why are escalation timeouts important?
Show answer
Level 1: Primary on-call (5-min ack window). Level 2: Secondary on-call (5-min ack). Level 3: Engineering manager (10-min ack). Level 4: VP Engineering (phone call). Escalation timeouts prevent a missed acknowledgment from blocking the entire incident response.
Remember: primary→secondary→manager→VP. Tools: PagerDuty, Opsgenie, VictorOps.
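The four-level structure and its timeouts can be modeled with a few lines of code. A minimal sketch, not any real PagerDuty API — the level names and timeout values come straight from the card, and `who_is_paged` is a hypothetical helper:

```python
from dataclasses import dataclass

@dataclass
class EscalationLevel:
    target: str
    ack_timeout_min: int   # minutes to wait before escalating past this level
    method: str = "page"

POLICY = [
    EscalationLevel("primary on-call", 5),
    EscalationLevel("secondary on-call", 5),
    EscalationLevel("engineering manager", 10),
    EscalationLevel("VP Engineering", 0, method="phone call"),  # final level
]

def who_is_paged(minutes_unacked: int) -> str:
    """Return who is being paged after N minutes with no acknowledgment."""
    elapsed = 0
    for level in POLICY:
        elapsed += level.ack_timeout_min
        if minutes_unacked < elapsed or level is POLICY[-1]:
            return level.target
    return POLICY[-1].target
```

For example, an alert unacked for 7 minutes has already escalated past the primary (5-min window) and is paging the secondary. This is exactly why timeouts matter: the policy advances on its own instead of waiting on a human who may be asleep.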
2. What must be included in an on-call handoff, and why is a written handoff non-negotiable?
Show answer
A handoff must include: active issues (with ticket references and runbooks), recent infrastructure changes, watch items (upcoming events), and known noisy alerts. Written handoffs are non-negotiable because an outgoing engineer who disappears without documenting context leaves the incoming engineer blind to ongoing issues.
Remember: Good on-call = runbooks + escalation + blameless postmortems.
Gotcha: Every alert should be actionable. Can wait until morning? Not page-worthy.
3. What should a runbook contain and why is it important for on-call?
Show answer
A runbook should contain: symptoms (what alerts fire, what users see), diagnosis steps (specific dashboards and commands), remediation steps (detailed with exact commands), and escalation paths (when and who to escalate to). Runbooks let any on-call engineer handle an issue at 3am without prior knowledge of the service.
Remember: Runbooks = step-by-step guides. Include: check, mitigate, escalate. Keep updated.
4. Why should incident updates be sent on a regular cadence even when there is no new information?
Show answer
Stakeholders who hear nothing for an hour assume the worst and may interfere with the response. Setting a timer for updates every 15-30 minutes, even if the update is "still investigating," maintains trust and keeps stakeholders informed without them needing to ask.
Remember: Detect→Triage→Mitigate→Resolve→Postmortem. "DTMRP."
5. What should an escalation policy define and what are common mistakes?
Show answer
Define: who is paged at each level, ack timeout before escalation (typically 5 min), maximum escalation depth (usually 4 levels), and after-hours behavior. Common mistakes: no secondary on-call (single point of failure), ack timeouts too long (15+ minutes delays response), escalating to managers who cannot fix technical issues, and not testing the escalation path regularly. Test by running a drill page monthly.
Remember: primary→secondary→manager→VP. Tools: PagerDuty, Opsgenie, VictorOps.
6. What are the steps for an effective on-call handoff?
Show answer
1) Outgoing engineer writes a handoff document: active incidents, recent deployments, watch items, known noisy alerts.
2) Overlap meeting (15-30 min) to walk through the document.
3) Verify incoming engineer has access to all tools (PagerDuty, dashboards, VPN).
4) Transfer the PagerDuty rotation at the agreed time.
5) Outgoing remains available for 1 hour post-handoff for questions.
Never rely on verbal-only handoffs — written documentation is required.
Remember: Good on-call = runbooks + escalation + blameless postmortems.
Gotcha: Every alert should be actionable. Can wait until morning? Not page-worthy.
7. How should you classify incident severity and what response does each level require?
Show answer
SEV-1 (Critical): total service outage or data loss, all-hands response, 5-min ack, external comms within 15 min.
SEV-2 (Major): significant degradation affecting >10% of users, dedicated IC + tech lead, 15-min ack.
SEV-3 (Minor): limited impact with workaround available, primary on-call handles, 30-min ack.
SEV-4 (Low): cosmetic or no customer impact, track as a ticket. When in doubt, declare higher severity — it is easier to downgrade than to escalate late.
Remember: Detect→Triage→Mitigate→Resolve→Postmortem. "DTMRP."
8. What is the purpose of an escalation policy in an on-call rotation?
Show answer
It defines when and how to escalate an unacknowledged or unresolved alert to the next responder or team, ensuring incidents don't stall.
Remember: primary→secondary→manager→VP. Tools: PagerDuty, Opsgenie, VictorOps.
🔴 Hard (6)¶
1. What on-call health metrics should you track monthly, and what are the red flags?
Show answer
Track: pages per shift (target <2 per day, 0 per night), time-to-ack (<5 min), time-to-resolve (<30 min for P1), sleep interruptions, and satisfaction score (target >3.5/5). Red flags: pages trending up, same alerts recurring weekly, satisfaction below 3.0, or an engineer requesting permanent removal from rotation.
Remember: Good on-call = runbooks + escalation + blameless postmortems.
Gotcha: Every alert should be actionable. Can wait until morning? Not page-worthy.
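The monthly health check above can be automated against per-shift stats. A hedged sketch: the input record format (`day_pages`, `night_pages`, `ack_minutes`, `satisfaction`) is hypothetical, while the thresholds are the ones from the card.

```python
def health_red_flags(shifts):
    """Scan a month of per-shift stats and return a list of red-flag strings.

    shifts: list of dicts with keys day_pages, night_pages,
            ack_minutes, satisfaction (1-5 scale).
    """
    flags = []
    if any(s["day_pages"] > 2 for s in shifts):
        flags.append("more than 2 pages in a day shift")
    if any(s["night_pages"] > 0 for s in shifts):
        flags.append("night pages occurred (target is zero)")
    avg_ack = sum(s["ack_minutes"] for s in shifts) / len(shifts)
    if avg_ack > 5:
        flags.append(f"avg time-to-ack {avg_ack:.1f} min (target <5)")
    avg_sat = sum(s["satisfaction"] for s in shifts) / len(shifts)
    if avg_sat < 3.0:
        flags.append(f"satisfaction {avg_sat:.1f} (red flag below 3.0)")
    return flags
```

An empty return list means the rotation is healthy this month; any returned string is a prompt to fix alerts or rebalance the rotation, not to blame the engineer.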
2. What is the recommended on-call rotation design, and why should the secondary always be next week's primary?
Show answer
Use 1-week shifts (Mon 09:00 to Mon 09:00) with a 30-minute handoff overlap and written summary. The secondary is always next week's primary so they gain context before becoming primary. This ensures the incoming primary has already seen the previous week's issues as secondary backup.
Remember: Good on-call = runbooks + escalation + blameless postmortems.
Gotcha: Every alert should be actionable. Can wait until morning? Not page-worthy.
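The "secondary becomes next week's primary" rule is just an offset-by-one walk through the roster. A minimal sketch with placeholder engineer names:

```python
def build_rotation(engineers, weeks):
    """Return (week, primary, secondary) tuples where each week's
    secondary is the following week's primary."""
    n = len(engineers)
    schedule = []
    for w in range(weeks):
        primary = engineers[w % n]
        secondary = engineers[(w + 1) % n]  # shadows now, leads next week
        schedule.append((w + 1, primary, secondary))
    return schedule
```

With `["alice", "bob", "carol"]`, week 1 is alice/bob and week 2 is bob/carol: bob enters his primary week having already backed up alice's incidents.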
3. Why should escalation be reframed as "the system working correctly" rather than as a failure?
Show answer
Engineers often wait too long to escalate because they fear looking incompetent. This delay extends incidents and increases blast radius. Reframing escalation as the system working correctly encourages timely escalation. If escalation to Level 4 (VP Engineering) happens regularly, the problem is the L1-L3 process, not the people.
Remember: primary→secondary→manager→VP. Tools: PagerDuty, Opsgenie, VictorOps.
4. When should a post-incident review be triggered and what must it cover?
Show answer
Trigger for: all SEV-1 and SEV-2 incidents, any incident lasting >1 hour, incidents requiring escalation beyond L2, and near-misses that could have been severe. Must cover: timeline of events, root cause analysis (not blame), what detection/response worked, what failed, and concrete action items with owners and due dates. Hold the review within 48 hours while memory is fresh. Action items without owners and deadlines are worthless.
Remember: Detect→Triage→Mitigate→Resolve→Postmortem. "DTMRP."
5. How do you keep runbooks accurate and useful over time?
Show answer
Link every alert to its runbook (runbook_url annotation in Prometheus). Review and update runbooks during post-incident reviews. Assign runbook ownership to the service-owning team. Include a "last verified" date at the top — stale runbooks are worse than no runbook (they waste time with wrong commands). Test runbooks during game days. Archive runbooks for decommissioned services. Use version control (git) so changes are tracked and reviewable.
Remember: Runbooks = step-by-step guides. Include: check, mitigate, escalate. Keep updated.
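The alert-to-runbook link looks like this in a Prometheus alerting rule. A sketch only: the service name, metric expression, and URL are placeholders; the `runbook_url` key itself is a widely used annotation convention, not a reserved Prometheus field.

```yaml
groups:
  - name: checkout-service
    rules:
      - alert: CheckoutHighErrorRate
        expr: |
          sum(rate(http_requests_total{job="checkout",code=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="checkout"}[5m])) > 0.05
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Checkout 5xx error rate above 5% for 5 minutes"
          runbook_url: "https://runbooks.example.com/checkout/high-error-rate"
```

Alertmanager forwards annotations to the pager, so the 3am responder gets the runbook link in the page itself instead of hunting for it.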
6. How does alert fatigue reduce incident response quality, and what is one structural countermeasure?
Show answer
Alert fatigue causes responders to ignore or slow-respond to alerts. Countermeasure: tune alert thresholds and suppress duplicate/flapping alerts so only actionable alerts fire.
Remember: Detect→Triage→Mitigate→Resolve→Postmortem. "DTMRP."