Interview Scenarios

DevOps/SRE interview scenarios tied to the runtime labs and runbooks in this repository. Each scenario simulates a real incident and includes:

  • The prompt (what happened, what you see)
  • The expected investigation path
  • A strong answer
  • Common traps
  • Links to hands-on practice

How to Use

  1. Read the scenario prompt as if an interviewer just presented it
  2. Think through your investigation path before reading the answer
  3. Practice the commands on a live cluster using the linked runtime lab
  4. Review the linked runbook for a concise reference

Scenarios

#   Scenario                              Difficulty  Lab             Runbook
1   Deployment stuck progressing          Medium      lab-runtime-01  readiness_probe_failed
2   HPA not scaling under load            Medium      lab-runtime-02  hpa_not_scaling
3   Prometheus says target down           Medium      lab-runtime-03  prometheus_target_down
4   Logs disappeared from Grafana Loki    Medium      lab-runtime-04  loki_no_logs
5   Helm upgrade broke prod               Medium      lab-runtime-05  helm_upgrade_failed
6   CI failed due to vulnerability scan   Easy        lab-runtime-06  n/a
7   Config drift detected in production   Hard        lab-runtime-07  n/a
8   Pods OOMKilled under load             Medium      lab-runtime-08  oomkilled
9   RBAC forbidden during deploy          Medium      n/a             rbac_forbidden
10  Ingress returns 404 intermittently    Medium      n/a             ingress_404
11  Server won't POST in the data center  Hard        n/a             n/a
12  TLS certificate expired               Medium      n/a             n/a
13  Secret leaked to Git                  Hard        n/a             n/a
14  etcd database space exceeded          Hard        n/a             etcd_backup_restore
15  100% 503s after Istio rollout         Medium      n/a             n/a
16  GitOps drift causing outage           Medium      n/a             n/a
17  Database failover during deploy       Hard        n/a             n/a
18  Policy engine blocking all deploys    Medium      n/a             n/a
19  Vault tokens expired across services  Hard        n/a             n/a
20  Cloud cost spike investigation        Medium      n/a             n/a
21  Linux server running slow             Medium      n/a             n/a
22  Docker container won't start in prod  Medium      n/a             n/a