Incident Postmortem & SLO/SLI - Street-Level Ops

Quick SLO Calculations

SLO: 99.9% availability (30-day window)
Error budget: 0.1% = 43.2 minutes
Per day: 1.44 minutes of allowed downtime

SLO: 99.5% availability
Error budget: 0.5% = 216 minutes (3.6 hours)

SLO: 99.99% availability
Error budget: 0.01% = 4.32 minutes (don't pick this unless you're Google)

Remember: SLO math trick — a 30-day window has 43,200 minutes, so budget minutes = 43,200 × (1 − SLO). 99.9% = 43 min, 99.95% = 22 min, 99.99% = 4 min. Each additional nine cuts your budget by 10x.
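The budget arithmetic above can be sketched as a tiny helper (the function name and the 30-day default window are illustrative):

```python
# Tiny helper for the budget math above; window length is assumed 30 days.
def error_budget_minutes(slo_pct: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime for an availability SLO over the window."""
    window_minutes = window_days * 24 * 60  # 30 days -> 43,200 minutes
    return window_minutes * (1 - slo_pct / 100)

for slo in (99.5, 99.9, 99.95, 99.99):
    print(f"{slo}% -> {error_budget_minutes(slo):.2f} min/month")
```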

Interview tip: "What SLIs would you pick for a checkout service?" Best answer: availability (% of non-5xx responses), latency (p99 under 500ms), and correctness (% of orders that charge the right amount). Picking just availability misses the user experience.
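A toy sketch of those three checkout SLIs computed over sample request records (the record fields and numbers are made up for illustration):

```python
import math

# Toy request records; status/latency_ms/charged_ok are illustrative fields.
requests = [
    {"status": 200, "latency_ms": 120, "charged_ok": True},
    {"status": 200, "latency_ms": 480, "charged_ok": True},
    {"status": 503, "latency_ms": 900, "charged_ok": True},
    {"status": 200, "latency_ms": 210, "charged_ok": False},
]

# Availability: fraction of non-5xx responses
availability = sum(r["status"] < 500 for r in requests) / len(requests)

# Latency: nearest-rank p99 over the observed latencies
latencies = sorted(r["latency_ms"] for r in requests)
p99 = latencies[min(len(latencies) - 1, math.ceil(0.99 * len(latencies)) - 1)]

# Correctness: fraction of orders charged the right amount
correctness = sum(r["charged_ok"] for r in requests) / len(requests)

print(f"availability={availability:.2%} p99={p99}ms correctness={correctness:.2%}")
```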

Pattern: Postmortem Within 48 Hours

Write the postmortem within 48 hours of incident resolution while memory is fresh:

  1. Hour 0-2: Collect artifacts (logs, graphs, chat transcripts, deploy timestamps)
  2. Hour 2-4: Draft timeline from artifacts
  3. Hour 4-24: Write root cause and contributing factors
  4. Hour 24-48: Review meeting with team, finalize action items
  5. Day 3-5: Share with broader org

Pattern: Postmortem Artifacts to Collect

# Deployment history around incident time
kubectl rollout history deployment/grokdevops -n grokdevops

# Events during incident
kubectl get events -n grokdevops --sort-by='.lastTimestamp' \
  --field-selector 'reason!=Pulled'

# Prometheus query: error rate during incident
# Use Grafana explore with the incident time range

# Loki query: error logs during incident
{namespace="grokdevops"} |= "error" # with time range filter

# Git log: what was deployed?
git log --oneline --since="2024-01-15T14:00:00" --until="2024-01-15T15:00:00"

Debug clue: Always capture kubectl get events immediately — Kubernetes garbage-collects events after 1 hour by default. If you wait until the postmortem meeting, the evidence is gone.
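Since events vanish fast, it can help to script the capture. A sketch of a collection wrapper (the directory layout and command list are illustrative — extend it with your own Grafana/Loki exports):

```python
import datetime
import pathlib
import subprocess

# Sketch of an artifact-collection wrapper. The directory layout and the
# command list are illustrative -- add Grafana/Loki exports as needed.
COMMANDS = {
    "rollout-history.txt": ["kubectl", "rollout", "history",
                            "deployment/grokdevops", "-n", "grokdevops"],
    "events.txt": ["kubectl", "get", "events", "-n", "grokdevops",
                   "--sort-by=.lastTimestamp"],
}

def collect(outdir=None):
    """Run each command and save stdout (or the error) under one directory."""
    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    path = pathlib.Path(outdir or f"incident-{stamp}")
    path.mkdir(parents=True, exist_ok=True)
    for fname, cmd in COMMANDS.items():
        try:
            result = subprocess.run(cmd, capture_output=True, text=True)
            (path / fname).write_text(result.stdout or result.stderr)
        except FileNotFoundError:  # kubectl not on PATH
            (path / fname).write_text(f"command not found: {cmd[0]}\n")
    return path

print("artifacts saved to", collect())
```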

One-liner: Quick error-rate check during an incident: curl -s localhost:9090/api/v1/query --data-urlencode 'query=sum(rate(http_requests_total{status=~"5.."}[5m]))/sum(rate(http_requests_total[5m]))' | jq '.data.result[0].value[1]'
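The jq pipe above pulls the single scalar out of Prometheus's instant-query response. The same extraction in Python, against a canned payload (the JSON shape matches the API; the number itself is made up):

```python
import json

# Canned instant-query response in the shape the Prometheus API returns;
# the error-rate value is made up for illustration.
payload = json.loads("""
{"status": "success",
 "data": {"resultType": "vector",
          "result": [{"metric": {},
                      "value": [1705327200, "0.0123"]}]}}
""")

# Same path as the jq filter: .data.result[0].value[1]
error_rate = float(payload["data"]["result"][0]["value"][1])
print(f"error rate: {error_rate:.2%}")  # -> error rate: 1.23%
```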

Pattern: SLO Review Cadence

Frequency   Activity
Daily       Check error budget dashboard
Weekly      Review burn rate trends, flag risks
Monthly     Review SLO attainment, postmortem action items
Quarterly   Adjust SLO targets based on data and business needs
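The weekly burn-rate check boils down to one ratio: observed error rate divided by the budgeted error rate (1 − SLO). A burn rate of 1.0 spends the budget exactly over the full window; 14.4 is the classic fast-burn alert threshold from the SRE Workbook's multiwindow strategy. A minimal sketch (the function name is illustrative):

```python
# Hypothetical helper: how fast are we spending the error budget?
# burn rate = observed error rate / budgeted error rate (1 - SLO).

def burn_rate(observed_error_rate: float, slo_pct: float) -> float:
    budgeted = 1 - slo_pct / 100
    return observed_error_rate / budgeted

print(round(burn_rate(0.001, 99.9), 1))   # on pace: budget lasts the whole window
print(round(burn_rate(0.0144, 99.9), 1))  # fast burn: budget gone in ~2 days
```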

Gotcha: SLOs Without Teeth

An SLO without an error budget policy is just a number on a dashboard.

Bad: "Our SLO is 99.9%" (and nothing happens when it's violated)
Good: "Our SLO is 99.9%. When budget is exhausted: feature freeze, on-call reviews all deploys, team focus shifts to reliability."

Analogy: An error budget is like a credit card balance. You start the month with a limit (e.g., 43 minutes of downtime). Every incident charges against it. When the balance is zero, you stop spending (feature freeze) until the next billing cycle resets it.
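The credit-card analogy can be made literal with a running balance (incident names and downtimes are illustrative):

```python
# Illustrative: start the month with the budget, charge each incident against it.
BUDGET_MINUTES = 43.2  # 99.9% availability over 30 days

incidents = [("bad deploy", 12.0), ("db failover", 25.0), ("cache stampede", 10.0)]

remaining = BUDGET_MINUTES
for name, downtime in incidents:
    remaining -= downtime
    state = "feature freeze" if remaining <= 0 else "ok"
    print(f"{name}: -{downtime:.0f} min, {remaining:.1f} min left ({state})")
```

The third incident tips the balance below zero, which is the point where the error budget policy kicks in.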

Gotcha: Too Many SLOs

Start with 2 per service:

  1. Availability: % of successful requests
  2. Latency: p99 response time

Add more only when you have a specific need (correctness, freshness, throughput).

Default trap: Teams default to 99.99% because "higher is better." But 99.99% gives you 4 minutes/month of error budget — one bad deploy eats it. Start at 99.5% and tighten only when customers demand it.

Gotcha: Gaming the SLO

Teams might exclude endpoints from SLI to look better. Prevent by:

  - Including ALL user-facing endpoints
  - Auditing SLI definitions quarterly
  - Publishing SLIs transparently

Template: Quick Incident Summary (for Chat)

When declaring an incident in Slack/chat:

INCIDENT: [one-line description]
SEVERITY: SEV-1/2/3
IMPACT: [who is affected and how]
STATUS: Investigating / Identified / Mitigating / Resolved
COMMANDER: @name
CHANNEL: #inc-YYYYMMDD-short-name
NEXT UPDATE: [time]

Remember: Incident severity mnemonic: SEV-1 = "customers are screaming," SEV-2 = "customers are grumbling," SEV-3 = "we noticed before customers did." If you cannot articulate the customer impact, you probably have the severity wrong.

Template: Action Item Format

[PM-YYYYMMDD-NN] Title
  Owner: @name
  Priority: P1/P2/P3
  Due: YYYY-MM-DD
  Status: Open/In Progress/Done
  Source: Postmortem for [incident name]
  Description: Specific, actionable task

Anti-Pattern: Blame in Disguise

These sound blameless but aren't:

Sounds blameless                              Actually blame
"The on-call should have escalated sooner"    Blaming the individual
"If the developer had tested..."              Blaming the developer
"The team failed to follow the process"       Blaming the team

Better:

  - "The escalation criteria were unclear" -> Action: Document escalation criteria
  - "The test suite didn't cover this case" -> Action: Add integration test
  - "The process was documented but not discoverable" -> Action: Link process from deploy tool

War story: A team wrote "the engineer should have noticed the alert" in a postmortem. The real fix was that the alert fired in a channel with 200+ daily messages. Moving it to a dedicated channel with PagerDuty integration solved the problem permanently.

Gotcha: Action items without owners and due dates never get done. Track postmortem actions in your issue tracker alongside feature work — if they only live in a Google Doc, they die there.
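One way to give that gotcha teeth: lint action items for missing owners and due dates before the postmortem is closed. The field names follow the template above; the checker itself is a sketch:

```python
import re

# Sketch: validate postmortem action items against the template fields.
REQUIRED = {"Owner": re.compile(r"@\w+"),
            "Due": re.compile(r"\d{4}-\d{2}-\d{2}")}

def missing_fields(item: dict) -> list:
    """Return the names of required fields that are absent or malformed."""
    return [field for field, pattern in REQUIRED.items()
            if not pattern.fullmatch(item.get(field, ""))]

good = {"Owner": "@alice", "Due": "2024-02-01"}
bad = {"Owner": "someone on the team"}  # no @handle, no due date

print(missing_fields(good))  # -> []
print(missing_fields(bad))   # -> ['Owner', 'Due']
```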