Track: SRE & Reliability Engineering
This track covers the practices, tools, and mindset for running reliable production systems.
Prerequisites
- Completion of Level 4 (Operations & Observability) or equivalent experience
- Familiarity with Prometheus, basic alerting, and incident response concepts
Learning Path
Module 1: SLOs, SLIs & Error Budgets
| Step | Activity | Resource |
|---|---|---|
| 1 | Read the primer | topics/postmortem-slo/primer.md |
| 2 | Work through exercises | topics/postmortem-slo/exercises.md |
| 3 | Practice drills | drills/postmortem_slo_drills.md |
| 4 | Review cheat sheet | cheatsheets/postmortem-slo.cheatsheet.md |
| 5 | Self-assess | skillchecks/postmortem-slo.skillcheck.md |
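As a quick preview of the arithmetic this module builds on, here is a minimal sketch of an error budget calculation. It is not taken from the primer: the function names, the 30-day window, and the request-based SLI are assumptions chosen for illustration.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over a window.

    slo_target is a fraction such as 0.999; the 30-day default window is
    an assumption, not a value prescribed by this track.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent for a request-based SLI.

    Returns 1.0 when no budget is spent, 0.0 when it is exactly exhausted,
    and a negative value when the SLO has been violated.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


if __name__ == "__main__":
    # A 99.9% availability target over 30 days.
    print(round(error_budget_minutes(0.999, 30), 1))
    # 500 failed requests out of 1,000,000 against the same target.
    print(round(budget_remaining(0.999, 999_500, 1_000_000), 3))
```

Under these assumptions, a 99.9% target leaves roughly 43 minutes of budget per 30 days, and 500 failures in a million requests spends about half of it.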
Module 2: Alerting Rules (PromQL / LogQL)
| Step | Activity | Resource |
|---|---|---|
| 1 | Read the primer | topics/alerting-rules/primer.md |
| 2 | Work through exercises | topics/alerting-rules/exercises.md |
| 3 | Practice PromQL drills | drills/promql_drills.md |
| 4 | Practice LogQL drills | drills/logql_drills.md |
| 5 | Practice alerting drills | drills/alerting_rules_drills.md |
| 6 | Review cheat sheet | cheatsheets/alerting-rules.cheatsheet.md |
Module 3: Container Runtime Debugging
| Step | Activity | Resource |
|---|---|---|
| 1 | Read the primer | topics/containers-deep-dive/primer.md |
| 2 | Work through exercises | topics/containers-deep-dive/exercises.md |
| 3 | Practice drills | drills/container_runtime_drills.md |
| 4 | Review cheat sheet | cheatsheets/container-runtime-debug.cheatsheet.md |
Module 4: TLS & PKI
| Step | Activity | Resource |
|---|---|---|
| 1 | Read the primer | topics/tls-certificates-ops/primer.md |
| 2 | Work through exercises | topics/tls-certificates-ops/exercises.md |
| 3 | Practice drills | drills/tls_pki_drills.md |
| 4 | Review cheat sheet | cheatsheets/tls-pki.cheatsheet.md |
| 5 | Self-assess | skillchecks/tls-pki.skillcheck.md |
| 6 | Interview scenario | interview-scenarios/12-certificate-expired.md |
Module 5: etcd & Backup/DR
| Step | Activity | Resource |
|---|---|---|
| 1 | Practice etcd drills | drills/etcd_drills.md |
| 2 | Work through etcd scenarios | scenarios/etcd/etcd-troubleshooting.md |
| 3 | Review etcd cheat sheet | cheatsheets/etcd-operations.cheatsheet.md |
| 4 | Study etcd backup runbook | runbooks/kubernetes/etcd_backup_restore.md |
| 5 | Study Velero runbook | runbooks/kubernetes/velero_backup_restore.md |
| 6 | Review DR runbook | runbooks/kubernetes/disaster_recovery.md |
| 7 | Interview scenario | interview-scenarios/14-etcd-space-exceeded.md |
Suggested Pace
- Intensive: 1 module per week (5 weeks total)
- Steady: 1 module per 2 weeks (10 weeks total)
- Daily drills: 5 PromQL drills + 5 from the current module