Level 7: SRE & Cloud Operations

SLOs, alerting, cloud providers, FinOps, database operations, etcd, disaster recovery. The practices that keep production reliable at scale.

Concepts

slos, error_budgets, alerting_rules, promql, logql, cloud_providers, finops, database_ops, etcd, disaster_recovery, multi_cluster, capacity_planning

Failure Patterns You Should Be Able to Resolve

All patterns from Levels 1-6, plus:

- FP-021: SLO budget exhaustion and burn-rate alerting failures
- FP-022: Cloud provider API throttling and quota limits
- FP-023: Database failover during deployment
- FP-024: etcd space exceeded / compaction failures
- FP-025: Cost spike from orphaned resources
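The arithmetic behind FP-021 is easier to reason about when written out. The following is a minimal sketch, not a definitive implementation. The function names are hypothetical, the 30-day window is an assumption, and the 14.4 paging threshold is a commonly cited value for a 1-hour window (it corresponds to spending 2% of a 30-day budget in one hour), not something this page specifies.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail, e.g. about 0.001 for a 99.9% SLO."""
    return 1.0 - slo_target


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' the budget is being spent.

    A burn rate of 1.0 uses up the budget exactly at the end of the SLO window.
    """
    return error_ratio / error_budget(slo_target)


def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Hours until a full window's budget is gone at a constant burn rate.

    window_days=30 is an assumed SLO window, not a value from this page.
    """
    if rate <= 0:
        return float("inf")
    return window_days * 24 / rate


def should_page(short_window_rate: float, long_window_rate: float,
                threshold: float = 14.4) -> bool:
    """Multi-window check: page only if both windows exceed the threshold.

    Requiring both windows avoids paging on a brief spike that has already
    ended (short window recovers) or on a stale incident (long window lags).
    The default threshold is an assumption, see the note above.
    """
    return short_window_rate >= threshold and long_window_rate >= threshold
```

With a 99.9% SLO, a sustained 1% error ratio is roughly a 10x burn rate, which would exhaust a 30-day budget in about three days. A common failure mode is alerting on the raw error ratio instead, which either pages too late or pages on noise.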

Commands You Should Be Fluent With

All commands from Levels 1-6, plus:

- PromQL: rate(), histogram_quantile(), increase(), absent() (SLOs)
- LogQL: {app="..."} |= "error", rate(), count_over_time() (Alerting)
- aws, gcloud, az CLIs for troubleshooting (Cloud)
- etcdctl endpoint status, etcdctl defrag, etcdctl snapshot save (etcd)
- velero backup create, velero restore create (DR)
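Of the PromQL functions listed, histogram_quantile() is the one whose output is most often misread, because it estimates a quantile from bucket counts; it does not compute one from raw samples. The sketch below approximates the linear interpolation it performs over classic cumulative buckets. It is a simplified illustration under stated assumptions, not Prometheus's actual code: the Python function name and input shape are hypothetical, q is restricted to (0, 1], and native histograms are not covered.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: (upper_bound, cumulative_count) pairs sorted by upper_bound,
    where the last upper bound must be math.inf (the le="+Inf" bucket).
    Assumption: this simplified sketch only accepts 0 < q <= 1.
    """
    if not 0 < q <= 1:
        raise ValueError("this sketch only handles 0 < q <= 1")
    if len(buckets) < 2 or buckets[-1][0] != math.inf:
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan

    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[idx]

    if idx == len(buckets) - 1:
        # Rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    if idx == 0:
        if upper <= 0:
            return upper
        lower, prev_count = 0.0, 0.0  # first bucket is assumed to start at 0
    else:
        lower, prev_count = buckets[idx - 1]

    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)
```

The practical consequence is that the estimate can never be more precise than the bucket boundaries: if a latency SLO threshold does not sit on a bucket edge, the reported p99 near that threshold is an interpolation, not a measurement.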

Assets to Complete

SLOs & Alerting

Cloud Operations

FinOps

Database & Storage

etcd & DR

Capacity & Reliability

Review (flashcards)
