
Quiz: SRE Practices


4 questions

L0 (1 question)

1. What is toil in the SRE context, and what is the target threshold?

Answer: Toil is manual, repetitive, automatable work that scales linearly with service growth and has no enduring value (e.g., manually restarting pods, rotating certs by hand). The SRE target is to spend no more than 50% of each SRE's time on toil.

L1 (1 question)

1. What is an error budget and how does it drive engineering decisions?

Answer: An error budget is the amount of unreliability an SLO permits: with a 99.9% SLO, the budget is the remaining 0.1% that is allowed to fail. When the budget is healthy (more than 50% remaining), ship freely. When it is low (less than 25% remaining), prioritize reliability work and reduce deploy frequency. When it is exhausted, freeze features until reliability improves. The error budget bridges product velocity and operational stability.
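The arithmetic behind this answer can be sketched in a few lines of Python. This is a minimal illustration, not a standard API: the function names, the 30-day window, and the "ship with caution" label for the 25% to 50% band (which the answer leaves unspecified) are all assumptions. For a 99.9% availability SLO over 30 days, the budget works out to roughly 43.2 minutes of downtime.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window.

    The 30-day default window is an assumption for illustration.
    """
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent, clamped at 0.0."""
    budget = error_budget_minutes(slo, window_days)
    return max(0.0, 1.0 - downtime_minutes / budget)


def deploy_policy(remaining: float) -> str:
    """Map the remaining budget fraction to the policy described in the answer."""
    if remaining <= 0.0:
        return "feature freeze"
    if remaining < 0.25:
        return "prioritize reliability"
    if remaining > 0.50:
        return "ship freely"
    # The answer does not define this middle band; this label is an assumption.
    return "ship with caution"


if __name__ == "__main__":
    remaining = budget_remaining(slo=0.999, downtime_minutes=10.8)
    print(deploy_policy(remaining))
```

In practice, teams often compute the budget from failed requests rather than downtime minutes; the same ratio logic applies.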

L2 (1 question)

1. Your team spends 70% of time on toil. What concrete steps do you take to bring it below 50%?

Answer:
1. Catalog all toil tasks with time estimates.
2. Rank by frequency × time cost.
3. Automate the top offenders first (e.g., replace manual cert rotation with cert-manager, add auto-remediation for known alerts).
4. Eliminate false/noisy alerts.
5. Track toil percentage weekly.
6. Negotiate with management to protect automation time.
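The catalog, rank, and track steps above could be sketched as follows. This is a hypothetical example: the `ToilTask` class, the task names, and the numbers are invented for illustration, and "cost" is assumed to mean runs per week multiplied by minutes per run.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToilTask:
    name: str
    runs_per_week: float
    minutes_per_run: float

    @property
    def weekly_minutes(self) -> float:
        # Frequency x time cost, the ranking key from step 2.
        return self.runs_per_week * self.minutes_per_run


def rank_by_cost(tasks: List[ToilTask]) -> List[ToilTask]:
    """Return tasks ordered from most to least weekly time spent."""
    return sorted(tasks, key=lambda t: t.weekly_minutes, reverse=True)


def toil_fraction(tasks: List[ToilTask], team_minutes_per_week: float) -> float:
    """Share of the team's weekly time consumed by the cataloged toil."""
    return sum(t.weekly_minutes for t in tasks) / team_minutes_per_week


if __name__ == "__main__":
    # Invented sample catalog; replace with measured data.
    catalog = [
        ToilTask("manual cert rotation", runs_per_week=2, minutes_per_run=45),
        ToilTask("restart stuck pods", runs_per_week=20, minutes_per_run=10),
        ToilTask("triage noisy alert", runs_per_week=50, minutes_per_run=3),
    ]
    for task in rank_by_cost(catalog):
        print(f"{task.name}: {task.weekly_minutes:.0f} min/week")
    print(f"toil share: {toil_fraction(catalog, team_minutes_per_week=2400):.1%}")
```

Recomputing `toil_fraction` each week gives the trend line for step 5 and shows whether the automation work is paying off.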

L3 (1 question)

1. A product team wants to launch a new service in production. What does an SRE production readiness review cover?

Answer: A production readiness review (PRR) covers:
1. SLOs defined and measured.
2. Monitoring and alerting in place.
3. Runbooks for known failure modes.
4. Capacity planning and load testing done.
5. Graceful degradation strategy.
6. Rollback plan.
7. On-call rotation staffed.
8. Disaster recovery tested.
9. Dependency mapping.
10. Security review complete.

The service should not launch without passing the PRR.
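One way to enforce the "no launch without passing" rule is to treat the PRR as a checklist gate. The sketch below is an assumption about how a team might encode it; the item keys and function names are hypothetical, and an item missing from the status map is treated as not done.

```python
from typing import List, Mapping

# Hypothetical keys mirroring the ten PRR items listed above.
PRR_ITEMS = (
    "slos_defined_and_measured",
    "monitoring_and_alerting",
    "runbooks_for_known_failures",
    "capacity_planning_and_load_testing",
    "graceful_degradation_strategy",
    "rollback_plan",
    "on_call_rotation_staffed",
    "disaster_recovery_tested",
    "dependency_mapping",
    "security_review_complete",
)


def prr_gaps(status: Mapping[str, bool]) -> List[str]:
    """Return the PRR items that are missing or not yet satisfied."""
    return [item for item in PRR_ITEMS if not status.get(item, False)]


def may_launch(status: Mapping[str, bool]) -> bool:
    """A service passes the gate only when every PRR item is satisfied."""
    return not prr_gaps(status)


if __name__ == "__main__":
    status = {item: True for item in PRR_ITEMS}
    status["disaster_recovery_tested"] = False
    print(may_launch(status), prr_gaps(status))
```

Returning the list of gaps, rather than only a boolean, gives the product team a concrete to-do list before the next review.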