Level 7: SRE & Cloud Operations
SLOs, alerting, cloud providers, FinOps, database operations, etcd, disaster recovery. The practices that keep production reliable at scale.
Concepts
slos, error_budgets, alerting_rules, promql, logql, cloud_providers, finops, database_ops, etcd, disaster_recovery, multi_cluster, capacity_planning
Failure Patterns You Should Be Able to Resolve
All patterns from Levels 1-6, plus:
- FP-021: SLO budget exhaustion and burn-rate alerting failures
- FP-022: Cloud provider API throttling and quota limits
- FP-023: Database failover during deployment
- FP-024: etcd space exceeded / compaction failures
- FP-025: Cost spike from orphaned resources
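The arithmetic behind FP-021 (budget exhaustion and burn-rate alerting) can be sketched in a few lines. This is a minimal illustration, not taken from the training assets: the function names are hypothetical, and the default threshold of 14.4 is an assumption based on the commonly cited rule of paging when 2% of a 30-day budget is consumed in one hour.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail, e.g. about 0.001 for a 99.9% SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / error_budget(slo_target)


def budget_consumed(burn: float, window_hours: float, slo_period_hours: float) -> float:
    """Fraction of the whole period's budget spent if `burn` holds for the window."""
    return burn * window_hours / slo_period_hours


def should_page(long_window_ratio: float, short_window_ratio: float,
                slo_target: float, threshold: float = 14.4) -> bool:
    """Multi-window check: both windows must exceed the threshold.

    The long window shows the burn is significant; the short window shows
    it is still happening. The 14.4 default is an assumed convention.
    """
    return (burn_rate(long_window_ratio, slo_target) >= threshold
            and burn_rate(short_window_ratio, slo_target) >= threshold)


if __name__ == "__main__":
    # Hypothetical numbers: 99.9% SLO, 2% errors over both windows.
    print(burn_rate(0.02, 0.999))
    print(should_page(0.02, 0.02, 0.999))
```

A burn rate of 1 means the budget runs out exactly at the end of the SLO period. Requiring both windows to exceed the threshold is one way to avoid paging on an incident that has already recovered.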
Commands You Should Be Fluent With
All commands from Levels 1-6, plus:
- PromQL: rate(), histogram_quantile(), increase(), absent() (SLOs)
- LogQL: {app="..."} |= "error", rate(), count_over_time() (Alerting)
- aws, gcloud, az CLIs for troubleshooting (Cloud)
- etcdctl endpoint status, etcdctl defrag, etcdctl snapshot save (etcd)
- velero backup create, velero restore create (DR)
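Fluency with histogram_quantile() includes knowing that it estimates a quantile from cumulative buckets rather than from raw observations. The sketch below illustrates the linear-interpolation idea in simplified form. It is not Prometheus's actual implementation, and the function name and input shape are hypothetical.

```python
import math


def estimate_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper_bound, ending with the +Inf bucket (math.inf, total_count).
    Simplifying assumption: the lowest bucket starts at 0.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    if not buckets or buckets[-1][0] != math.inf:
        raise ValueError("last bucket must have upper bound +Inf")

    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations, nothing to estimate

    rank = q * total  # position of the target observation
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == math.inf:
                # The quantile falls in the open-ended bucket: the best
                # available answer is the highest finite upper bound.
                return buckets[-2][0] if len(buckets) > 1 else math.nan
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / in_bucket
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


if __name__ == "__main__":
    # Hypothetical latency buckets in seconds.
    latency = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
    print(estimate_quantile(0.95, latency))
```

Because the estimate interpolates within a bucket, its accuracy depends on where the bucket boundaries sit relative to the latency threshold in the SLO.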
Assets to Complete
SLOs & Alerting
- training/library/topics/postmortem-slo/primer.md
- training/library/drills/postmortem_slo_drills.md
- training/library/topics/alerting-rules/primer.md
- training/library/drills/promql_drills.md
- training/library/drills/logql_drills.md
Cloud Operations
- training/library/topics/cloud-deep-dive/primer.md
- training/library/drills/cloud_deep_dive_drills.md
- training/library/topics/aws-troubleshooting/primer.md
- training/library/topics/azure-troubleshooting/primer.md
- training/library/topics/gcp-troubleshooting/primer.md
FinOps
- training/library/topics/finops/primer.md
- training/library/drills/finops_drills.md
- training/library/interview-scenarios/20-cost-spike-investigation.md
Database & Storage
- training/library/topics/database-ops/primer.md
- training/library/drills/database_ops_drills.md
- training/library/interview-scenarios/17-database-failover-during-deploy.md
etcd & DR
- training/library/drills/etcd_drills.md
- training/library/runbooks/kubernetes/etcd_backup_restore.md
- training/library/runbooks/kubernetes/velero_backup_restore.md
- training/library/runbooks/kubernetes/disaster_recovery.md
- training/library/interview-scenarios/14-etcd-space-exceeded.md
Capacity & Reliability
- training/library/topics/capacity-planning/primer.md
- training/library/topics/disaster-recovery/primer.md
- training/library/topics/sre-practices/primer.md
Review (flashcards)
- training/interactive/knowledge/data/cards/cloud.tsv
- training/interactive/knowledge/data/cards/sre.tsv
- training/interactive/knowledge/data/cards/observability.tsv
Pages that link here
- AWS Troubleshooting - Primer
- Azure Troubleshooting - Primer
- Capacity Planning - Primer
- Cloud Provider Deep-Dive (AWS & GCP) - Primer
- Cost Optimization & FinOps - Primer
- Database Operations Drills
- Database Operations on Kubernetes - Primer
- Disaster Recovery & Backup Engineering - Primer
- FinOps & Cost Optimization Drills
- GCP Troubleshooting - Primer
- Incident Postmortem & SLO/SLI Drills
- Incident Postmortem Writing & SLO/SLI - Primer
- Log Analysis & Alerting Rules (PromQL / LogQL) - Primer
- LogQL Drills
- PromQL Drills