Level 7: SRE & Cloud Operations

This level covers SLOs, alerting, cloud providers, FinOps, database operations, etcd, and disaster recovery: the practices that keep production reliable at scale.

Concepts

slos, error_budgets, alerting_rules, promql, logql, cloud_providers, finops, database_ops, etcd, disaster_recovery, multi_cluster, capacity_planning

Failure Patterns You Should Be Able to Resolve

All patterns from Levels 1-6, plus:

- FP-021: SLO budget exhaustion and burn-rate alerting failures
- FP-022: Cloud provider API throttling and quota limits
- FP-023: Database failover during deployment
- FP-024: etcd space exceeded / compaction failures
- FP-025: Cost spike from orphaned resources
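
For FP-021, it helps to have the arithmetic behind burn-rate alerting at hand. The following is a minimal sketch in Python, not a prescribed implementation: the function names, the 99.9% target, and the 30-day window used in the note afterwards are assumptions chosen for illustration.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail, e.g. about 0.001 for a 99.9% SLO."""
    return 1.0 - slo_target


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' the budget is being spent.

    A burn rate of 1.0 uses up the whole budget exactly at the end of the SLO window.
    """
    return error_ratio / error_budget(slo_target)


def hours_to_exhaustion(error_ratio: float, slo_target: float, window_hours: float) -> float:
    """Hours until a full budget is gone if the current error ratio persists."""
    rate = burn_rate(error_ratio, slo_target)
    if rate <= 0:
        # No errors observed: the budget is never exhausted at this rate.
        return float("inf")
    return window_hours / rate
```

With these definitions, a sustained 1.44% error ratio against a 99.9% SLO gives a burn rate of about 14.4, which would exhaust a 30-day (720-hour) budget in roughly 50 hours. In practice the error ratio would come from a PromQL rate() expression over a chosen lookback window rather than a hard-coded number.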

Commands You Should Be Fluent With

All commands from Levels 1-6, plus:

- PromQL: rate(), histogram_quantile(), increase(), absent() (SLOs)
- LogQL: {app="..."} |= "error", rate(), count_over_time() (Alerting)
- aws, gcloud, az CLIs for troubleshooting (Cloud)
- etcdctl endpoint status, etcdctl defrag, etcdctl snapshot save (etcd)
- velero backup create, velero restore create (DR)
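
To build intuition for what histogram_quantile() returns, the sketch below reproduces the basic idea in Python: find the bucket that contains the target rank, then interpolate linearly inside it. This is a simplified illustration, not Prometheus's actual implementation. It assumes classic cumulative buckets with non-negative upper bounds and a final +Inf bucket, and the input format (a list of tuples) is invented for the example.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) tuples, sorted by
    upper_bound and ending with (math.inf, total_count).
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be in [0, 1]")
    if len(buckets) < 2 or buckets[-1][0] != math.inf:
        # Need at least one finite bucket plus the +Inf bucket.
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan

    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # first bucket is assumed to start at 0
    for upper, count in buckets:
        if count >= rank:
            if upper == math.inf:
                # Cannot interpolate into +Inf: report the highest finite bound.
                return prev_bound
            if count == prev_count:
                # Empty bucket (only reachable when rank is 0): avoid dividing by zero.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    return math.nan


# Hypothetical latency buckets in seconds: 50 requests <= 0.1s, 90 <= 0.5s, 100 <= 1.0s.
example = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
p95 = histogram_quantile(0.95, example)
```

The result is only as precise as the bucket layout: every value inside a bucket is assumed to be evenly spread, so wide buckets around the SLO threshold produce coarse estimates.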

Assets to Complete

SLOs & Alerting

Cloud Operations

FinOps

Database & Storage

etcd & DR

Capacity & Reliability

Review (flashcards)