Postmortem SLO

10 cards: 🟢 3 easy | 🟡 4 medium | 🔴 3 hard

🟢 Easy (3)

1. What is the difference between an SLI, SLO, and SLA?

SLI (Service Level Indicator) is a metric measuring service quality (e.g., 99.2% request success rate). SLO (Service Level Objective) is an internal target for an SLI (e.g., 99.9% success). SLA (Service Level Agreement) is a contractual commitment with consequences (e.g., 99.5% uptime or refund). The SLA should be less strict than the SLO to provide a buffer.

Remember: "SLI feeds SLO backs SLA." I = indicator, O = objective, A = agreement. The SLO is set stricter than the SLA, so internal alarms fire before the contract is breached.
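
To make the buffer between SLO and SLA concrete, here is a minimal sketch that classifies a measured SLI against both targets. The function name `evaluate` and its return strings are hypothetical; the numbers are the ones used in the answer above.

```python
def evaluate(sli: float, slo: float, sla: float) -> str:
    """Classify a measured SLI against an internal SLO and a contractual SLA.

    All three values are success ratios in [0, 1]. The SLO is expected to be
    at least as strict as the SLA so there is a buffer between them.
    """
    if slo < sla:
        raise ValueError("SLO should be at least as strict as the SLA")
    if sli >= slo:
        return "meeting SLO"
    if sli >= sla:
        return "SLO missed, SLA still met"
    return "SLA breached"


# Figures from the card: SLI 99.2%, SLO 99.9%, SLA 99.5%.
print(evaluate(sli=0.992, slo=0.999, sla=0.995))  # SLA breached
```

The middle outcome is the reason for the buffer: the team learns the SLO was missed while the contractual commitment still holds.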

2. What does "blameless" mean in the context of an incident postmortem?

Blameless means focusing on systems and processes rather than individuals. Instead of "John pushed bad code," you write "The deploy pipeline lacked a canary step." Blameless does NOT mean free of accountability; it means identifying the systemic causes that allowed the failure rather than assigning personal blame.

Remember: "Blameless ≠ actionless." The goal is systemic fixes, not individual blame. Ask "what" and "how," never "who."

3. What makes a good SLI, and why is "CPU utilization" a bad one?

Good SLIs measure what users experience (% of successful HTTP requests, p99 response time). CPU utilization is a bad SLI because it measures infrastructure, not user experience. A server can be at 90% CPU and serving fine, or at 10% CPU and returning errors.

Remember: "SLO = target, SLI = measurement, SLA = contract." Think: I (indicator) feeds O (objective) which backs A (agreement).

Example: SLI = p99 latency; SLO = p99 < 200ms 99.9% of the time; SLA = customer refund if a looser contractual target is breached.
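
As a rough illustration of user-facing SLIs, the sketch below derives a success ratio and a nearest-rank p99 latency from per-request records. The `Request` type, the helper names, and the sample data are all hypothetical; in practice these values usually come from a metrics system rather than raw records.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code returned to the user
    latency_ms: float  # time the user waited for the response


def success_ratio(requests: list[Request]) -> float:
    """Fraction of requests that did not end in a server error (5xx)."""
    ok = sum(1 for r in requests if r.status < 500)
    return ok / len(requests)


def p99_latency_ms(requests: list[Request]) -> float:
    """Nearest-rank 99th percentile of observed latency."""
    ordered = sorted(r.latency_ms for r in requests)
    rank = math.ceil(len(ordered) * 99 / 100)  # 1-based nearest rank
    return ordered[rank - 1]


# Hypothetical sample: 97 fast successes, 2 slow successes, 1 server error.
sample = (
    [Request(200, 50.0)] * 97
    + [Request(200, 900.0)] * 2
    + [Request(503, 20.0)]
)
print(success_ratio(sample))   # 0.99
print(p99_latency_ms(sample))  # 900.0
```

Neither number depends on CPU utilization: both are computed only from what the user received and how long it took.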

🟡 Medium (4)

1. What is an error budget and how is it calculated from an SLO?

Error budget = 1 - SLO. For a 99.9% SLO, the error budget is 0.1%, which equals 43.2 minutes of downtime equivalent over 30 days. When the budget is positive, teams ship features normally. When exhausted, the team should focus on reliability (feature freeze).

Number anchor: 99.9% SLO = 43.2 minutes/month of allowed downtime. 99.99% = 4.3 minutes. 99.95% = 21.6 minutes.
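
The number anchors above follow directly from the formula. A minimal sketch, assuming a 30-day window and a hypothetical helper name `error_budget_minutes`:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime-equivalent minutes for an availability SLO.

    `slo` is a fraction, e.g. 0.999 for 99.9%.
    """
    window_minutes = window_days * 24 * 60  # 43,200 minutes in 30 days
    return (1 - slo) * window_minutes


for slo in (0.999, 0.9995, 0.9999):
    print(f"{slo:.2%} -> {error_budget_minutes(slo):.1f} min")
# 99.90% -> 43.2 min
# 99.95% -> 21.6 min
# 99.99% -> 4.3 min
```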

2. What are the key elements of a quality postmortem action item?

Each action item must have a specific owner, a due date, a priority level, and be concrete and actionable. "Improve monitoring" is not actionable. "Add latency alert at p99 > 500ms by 2024-02-15, owned by @alice, P2" is. Action items should be tracked in an issue tracker.

Remember: "AODC: Actionable, Owned, Dated, Concrete." Every action item needs all four or it's just a wish.
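
One way to enforce the mechanical part of this checklist is to model an action item as a record and flag unset fields. This is only a sketch: the `ActionItem` type and `missing_fields` helper are hypothetical, and whether a description is concrete still needs human review.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None
    priority: Optional[str] = None  # e.g. "P2"


def missing_fields(item: ActionItem) -> list[str]:
    """Return the names of required fields that are still unset."""
    required = ("owner", "due", "priority")
    return [name for name in required if getattr(item, name) is None]


vague = ActionItem("Improve monitoring")
concrete = ActionItem(
    "Add latency alert at p99 > 500ms",
    owner="@alice",
    due=date(2024, 2, 15),
    priority="P2",
)
print(missing_fields(vague))     # ['owner', 'due', 'priority']
print(missing_fields(concrete))  # []
```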

3. What are the four incident severity levels and which ones require a postmortem?

SEV-1 (major customer impact, revenue loss) and SEV-2 (significant impact, degraded service) both require postmortems. SEV-3 (minor impact, workaround available) has an optional postmortem. SEV-4 (no customer impact, internal only) does not require one.

Remember: "Error budget = permission to take risks." When budget is full, ship fast. When budget is low, focus on reliability.

Example: 99.9% SLO = 43.2 minutes/month of allowed downtime. If you've used 40 minutes, freeze deploys.

Remember: "SEV-1 and SEV-2 always get postmortems." SEV-3 is optional. SEV-4 gets a ticket.
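
The severity rules can be written down as a small lookup table. The sketch below assumes severities are passed as strings such as "SEV-2"; the enum and function names are hypothetical.

```python
from enum import Enum


class Postmortem(Enum):
    REQUIRED = "required"
    OPTIONAL = "optional"
    NOT_REQUIRED = "not required"


# Mapping taken from the answer above.
POSTMORTEM_POLICY = {
    "SEV-1": Postmortem.REQUIRED,      # major customer impact, revenue loss
    "SEV-2": Postmortem.REQUIRED,      # significant impact, degraded service
    "SEV-3": Postmortem.OPTIONAL,      # minor impact, workaround available
    "SEV-4": Postmortem.NOT_REQUIRED,  # no customer impact, internal only
}


def postmortem_requirement(severity: str) -> Postmortem:
    """Look up whether a postmortem is needed for a severity label."""
    return POSTMORTEM_POLICY[severity.upper()]


print(postmortem_requirement("SEV-2").value)  # required
```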

4. What is an error budget policy and what happens at each consumption threshold?

An error budget policy defines actions at thresholds: at >50% consumed, review recent changes and enable canary deployments. At 100% consumed, institute a feature freeze, direct all effort to reliability, and require postmortems for budget-depleting incidents. When budget is positive, teams ship at normal pace.

Gotcha: Action items without owners and deadlines become wishful thinking. Each item needs: owner, deadline, and tracking ticket.

Remember: "A postmortem without action items is just a story."

Analogy: Error budget is like a credit card limit: when you're flush, spend freely (ship features). When you're maxed out, pay down debt (focus on reliability).
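
The thresholds in the error budget policy can be sketched as a simple decision function. The function name and return strings are hypothetical, and real policies often add further tiers.

```python
def budget_policy_action(consumed_fraction: float) -> str:
    """Map the fraction of error budget consumed to the policy action.

    0.0 means nothing used; 1.0 or more means the budget is exhausted.
    """
    if consumed_fraction >= 1.0:
        return "feature freeze, all effort on reliability"
    if consumed_fraction > 0.5:
        return "review recent changes, enable canary deployments"
    return "ship at normal pace"


# Example: 25 of the 43.2 budget minutes for a 99.9% SLO have been used.
print(budget_policy_action(25 / 43.2))
# review recent changes, enable canary deployments
```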

🔴 Hard (3)

1. How would you calculate the error budget burn rate in PromQL, and what does a value greater than 1.0 indicate?

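Burn rate = (sum(rate(http_requests_total{status=~"5.."}[1h])) / sum(rate(http_requests_total[1h]))) / (1 - SLO_target), where SLO_target is a placeholder for the target expressed as a fraction (e.g., 0.999); substitute the literal value in the query. A burn rate > 1.0 means you are consuming error budget faster than the steady pace that would use it up exactly at the end of the SLO window, so the budget will run out early. A burn rate of 10 means you will exhaust a 30-day budget in 3 days.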

Number anchor: Burn rate of 1.0 = budget lasts exactly the SLO window. Burn rate of 10 = budget exhausted in 1/10th the time (3 days of a 30-day window).
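
The same arithmetic outside PromQL, as a minimal sketch: the observed error ratio is divided by the budget fraction, and the window length is divided by the result. The function names and the 1% error ratio are assumptions for illustration.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """Observed error ratio relative to the budgeted error ratio (1 - SLO)."""
    return error_ratio / (1 - slo)


def days_to_exhaustion(rate: float, window_days: float = 30) -> float:
    """Days until the full window's budget is gone at a constant burn rate."""
    if rate <= 0:
        return float("inf")  # no errors, so the budget is never consumed
    return window_days / rate


# Hypothetical reading: 1% of requests failing against a 99.9% SLO.
rate = burn_rate(error_ratio=0.01, slo=0.999)
days = days_to_exhaustion(rate)
print(f"burn rate {rate:.1f}, budget gone in {days:.1f} days")
# burn rate 10.0, budget gone in 3.0 days
```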

2. What is the difference between a root cause and a contributing factor in a postmortem, and why must the root cause be specific?

The root cause is the specific technical failure that directly caused the incident (e.g., "migration timed out, leaving table locked"). Contributing factors explain why the root cause was not caught (e.g., "staging DB had only 1000 rows vs 50M in prod"). Vague root causes like "human error" prevent effective action items and recurrence prevention.

3. Why is setting an SLO of 99.999% usually a mistake, and what is the practical impact?

A 99.999% SLO allows only about 26 seconds of downtime per 30-day month. No human can detect, diagnose, and remediate an incident in 26 seconds, so the SLO is unachievable through operational response. It also leaves almost no error budget for deployments or experiments. Start with 1-2 SLOs per service (availability + latency) at realistic targets like 99.9%.

Number anchor: 26 seconds per month. That's less time than it takes to open a laptop and SSH into a server.
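
To see why operational response cannot rescue a five-nines target, the sketch below compares the monthly budget with the length of a single incident. The 5-minute incident duration is an assumption chosen for illustration, not a figure from the card, and the helper names are hypothetical.

```python
def budget_seconds(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in seconds for an availability SLO over the window."""
    return (1 - slo) * window_days * 24 * 60 * 60


def incidents_absorbed(slo: float, incident_seconds: float) -> float:
    """How many incidents of the given length fit in one window's budget."""
    return budget_seconds(slo) / incident_seconds


ASSUMED_INCIDENT_SECONDS = 5 * 60  # assumption: detect + fix takes 5 minutes

for slo in (0.999, 0.9999, 0.99999):
    print(
        f"{slo:.3%}: {budget_seconds(slo):.0f}s budget, "
        f"{incidents_absorbed(slo, ASSUMED_INCIDENT_SECONDS):.2f} incidents"
    )
```

Under that assumption, a 99.9% target absorbs several such incidents per month, while 99.999% cannot absorb even one.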