Quiz: Runbook Craft

4 questions

L0 (1 question)

1. What are the five sections every effective runbook should have?

Show answer

1. Trigger — what alert or event activates this runbook.
2. Diagnose — commands to run and questions to answer before acting.
3. Act — the fix, using decision trees for multiple scenarios.
4. Verify — confirm the fix worked with specific checks and thresholds.
5. Escalate — when and how to call for help.
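The five sections above can be treated as a checklist that a draft runbook either satisfies or does not. The following is a minimal sketch, not a standard schema: the `Runbook` class, the `missing_sections` helper, and the sample text are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class Runbook:
    """Hypothetical container for the five sections listed above."""
    trigger: str    # alert or event that activates the runbook
    diagnose: str   # commands to run and questions to answer before acting
    act: str        # the fix, with a decision tree when there are several scenarios
    verify: str     # specific checks and thresholds that confirm the fix
    escalate: str   # when and how to call for help


def missing_sections(runbook: Runbook) -> list[str]:
    """Return the names of sections that are empty or whitespace-only."""
    return [f.name for f in fields(runbook) if not getattr(runbook, f.name).strip()]


# Illustrative draft with the Verify section left blank.
draft = Runbook(
    trigger="Alert: example-api p99 latency above threshold",
    diagnose="Check the latency dashboard and recent deploys",
    act="Roll back the last deploy if it correlates with the alert",
    verify="",
    escalate="Page the service owner if unresolved",
)
print(missing_sections(draft))  # ['verify']
```

A check like this could run in review or CI so that a runbook missing a section is caught before it is needed during an incident.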

L1 (1 question)

1. What are the five runbook automation levels (L0-L4) and which level should be your minimum target?

Show answer

L0: Fully manual prose instructions. L1: Copy-paste commands. L2: Scripts with parameters. L3: Triggered scripts requiring human approval. L4: Fully automated self-healing. Target L1 as the minimum: copy-paste commands beat prose every time. L1 is achievable immediately and prevents mistyped commands at 3 AM.
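One way to apply the L1 minimum is to record each runbook's automation level and flag those still at L0. This is a sketch under assumptions: the `AutomationLevel` enum, the `below_target` helper, and the runbook names in `inventory` are hypothetical, not part of any existing tool.

```python
from enum import IntEnum


class AutomationLevel(IntEnum):
    """The five levels from the answer above, ordered so they can be compared."""
    L0_MANUAL_PROSE = 0
    L1_COPY_PASTE = 1
    L2_PARAMETERIZED_SCRIPT = 2
    L3_TRIGGERED_WITH_APPROVAL = 3
    L4_SELF_HEALING = 4


# Minimum target stated in the answer above.
MINIMUM_TARGET = AutomationLevel.L1_COPY_PASTE


def below_target(levels: dict[str, AutomationLevel]) -> list[str]:
    """Return runbook names whose level is under the minimum target, sorted."""
    return sorted(name for name, level in levels.items() if level < MINIMUM_TARGET)


# Hypothetical inventory of runbooks and their current levels.
inventory = {
    "disk-full": AutomationLevel.L1_COPY_PASTE,
    "cert-expiry": AutomationLevel.L0_MANUAL_PROSE,
    "queue-backlog": AutomationLevel.L3_TRIGGERED_WITH_APPROVAL,
}
print(below_target(inventory))  # ['cert-expiry']
```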

L2 (1 question)

1. Why are metrics-driven thresholds better than vague descriptions in runbooks? Give an example.

Show answer

Vague: 'Check if latency is high.' The on-call engineer does not know what 'high' means. Metrics-driven: 'Check p99 latency. If > 500ms (normal baseline: 80-120ms), proceed.' This removes ambiguity, sets a concrete trigger, and provides the baseline so the responder knows what healthy looks like. Every runbook check should include: what to check, the threshold, and the normal value.
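The three parts of a check (what to check, the threshold, and the normal value) can be kept together so the runbook text and the decision come from the same numbers. The sketch below reuses the p99 figures from the example above; the `Check` class and its method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    """Hypothetical metrics-driven check: what, threshold, and normal baseline."""
    what: str
    threshold_ms: float
    baseline_ms: tuple[float, float]  # (low, high) of the normal range

    def should_proceed(self, observed_ms: float) -> bool:
        """True only when the observed value is strictly above the threshold."""
        return observed_ms > self.threshold_ms

    def instruction(self) -> str:
        """Render the check as the sentence a responder would read."""
        low, high = self.baseline_ms
        return (
            f"Check {self.what}. If > {self.threshold_ms:g}ms "
            f"(normal baseline: {low:g}-{high:g}ms), proceed."
        )


p99 = Check(what="p99 latency", threshold_ms=500, baseline_ms=(80, 120))
print(p99.instruction())
print(p99.should_proceed(650))  # True
print(p99.should_proceed(110))  # False
```

Keeping the baseline next to the threshold means a responder who sees 110ms can tell it is healthy, not merely below the trigger.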

L3 (1 question)

1. How should you test runbooks, and why is having the author test their own runbook insufficient?

Show answer

Test with game days: inject a real failure and have someone who did NOT write the runbook follow it. The author has implicit knowledge that did not make it into the document. Success criteria: the responder completes the runbook without asking for help, resolution comes in under the time target, and all commands work as documented. Also run tabletop exercises (verbal walkthroughs) and review the runbook after every incident to catch stale commands.
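The game-day success criteria above can be recorded and scored the same way after each exercise. This is a minimal sketch: the `GameDayResult` fields, the `failed_criteria` helper, and the sample numbers are hypothetical and would need adapting to how a team actually logs its exercises.

```python
from dataclasses import dataclass, field


@dataclass
class GameDayResult:
    """Hypothetical record of one game-day run against a runbook."""
    responder_is_author: bool
    asked_for_help: bool
    minutes_to_resolve: float
    target_minutes: float
    failed_commands: list[str] = field(default_factory=list)


def failed_criteria(result: GameDayResult) -> list[str]:
    """Return the criteria this run did not meet; an empty list means success."""
    problems = []
    if result.responder_is_author:
        problems.append("responder wrote the runbook")
    if result.asked_for_help:
        problems.append("responder asked for help")
    # "Under the time target" is read strictly: hitting the target exactly fails.
    if result.minutes_to_resolve >= result.target_minutes:
        problems.append("resolution not under time target")
    if result.failed_commands:
        problems.append("commands did not work as documented")
    return problems


run = GameDayResult(
    responder_is_author=False,
    asked_for_help=True,
    minutes_to_resolve=18,
    target_minutes=30,
)
print(failed_criteria(run))  # ['responder asked for help']
```

Each entry in the returned list points at a gap to fix in the runbook before the next exercise.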