Track: SRE & Reliability Engineering

This track covers the practices, tools, and mindset for running reliable production systems.

Prerequisites

  • Completed Level 4 (Operations & Observability) or equivalent
  • Familiar with Prometheus, basic alerting, and incident response concepts

Learning Path

Module 1: SLOs, SLIs & Error Budgets

Step  Activity                Resource
1     Read the primer         topics/postmortem-slo/primer.md
2     Work through exercises  topics/postmortem-slo/exercises.md
3     Practice drills         drills/postmortem_slo_drills.md
4     Review cheat sheet      cheatsheets/postmortem-slo.cheatsheet.md
5     Self-assess             skillchecks/postmortem-slo.skillcheck.md

Module 2: Alerting Rules (PromQL / LogQL)

Step  Activity                  Resource
1     Read the primer           topics/alerting-rules/primer.md
2     Work through exercises    topics/alerting-rules/exercises.md
3     Practice PromQL drills    drills/promql_drills.md
4     Practice LogQL drills     drills/logql_drills.md
5     Practice alerting drills  drills/alerting_rules_drills.md
6     Review cheat sheet        cheatsheets/alerting-rules.cheatsheet.md

Module 3: Container Runtime Debugging

Step  Activity                Resource
1     Read the primer         topics/containers-deep-dive/primer.md
2     Work through exercises  topics/containers-deep-dive/exercises.md
3     Practice drills         drills/container_runtime_drills.md
4     Review cheat sheet      cheatsheets/container-runtime-debug.cheatsheet.md

Module 4: TLS & PKI

Step  Activity                Resource
1     Read the primer         topics/tls-certificates-ops/primer.md
2     Work through exercises  topics/tls-certificates-ops/exercises.md
3     Practice drills         drills/tls_pki_drills.md
4     Review cheat sheet      cheatsheets/tls-pki.cheatsheet.md
5     Self-assess             skillchecks/tls-pki.skillcheck.md
6     Interview scenario      interview-scenarios/12-certificate-expired.md

Module 5: etcd & Backup/DR

Step  Activity                     Resource
1     Practice etcd drills         drills/etcd_drills.md
2     Work through etcd scenarios  scenarios/etcd/etcd-troubleshooting.md
3     Review etcd cheat sheet      cheatsheets/etcd-operations.cheatsheet.md
4     Study backup runbooks        runbooks/kubernetes/etcd_backup_restore.md
5     Study Velero runbook         runbooks/kubernetes/velero_backup_restore.md
6     Review DR runbook            runbooks/kubernetes/disaster_recovery.md
7     Interview scenario           interview-scenarios/14-etcd-space-exceeded.md

Suggested Pace

  • Intensive: 1 module per week (5 weeks total)
  • Steady: 1 module every 2 weeks (10 weeks total)
  • Daily drills: 5 PromQL drills plus 5 drills from the current module