Quiz: Ops War Stories

3 questions

L0 (1 question)

1. What single question resolves approximately 40% of production incidents within minutes?

Show answer

'Was anything deployed or changed in the last 4 hours?' Recent changes (deploys, config changes, infrastructure changes) are the most common cause of incidents, accounting for roughly 40% of them. If the answer is yes, treat the change as the prime suspect: roll back first, investigate second.
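
As a rough illustration of asking that question mechanically, the sketch below filters a change log down to entries from the last 4 hours. The `Change` record and the sample entries are hypothetical; a real setup would pull this data from its own deploy tooling or audit log.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Change:
    """One entry in a hypothetical change log (deploy, config, infra)."""
    kind: str
    description: str
    at: datetime


def recent_changes(changes, now, window=timedelta(hours=4)):
    """Return changes made within `window` before `now`, newest first."""
    cutoff = now - window
    hits = [c for c in changes if cutoff <= c.at <= now]
    return sorted(hits, key=lambda c: c.at, reverse=True)


if __name__ == "__main__":
    # Made-up incident time and change log, for illustration only.
    now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    log = [
        Change("deploy", "api v2.3.1", now - timedelta(hours=1)),
        Change("config", "raise pool size", now - timedelta(hours=9)),
    ]
    for c in recent_changes(log, now):
        print(f"rollback candidate: {c.kind} {c.description} at {c.at:%H:%M}")
```

Anything the function returns is a rollback candidate; an empty result suggests looking beyond recent changes.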

L1 (1 question)

1. df shows 100% disk usage but du -sh /* accounts for only 60% of the disk. What are the most likely causes of the missing 40%?

Show answer

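1. Deleted files still held open by a running process (most common): list them with lsof +L1 (or lsof | grep deleted). The space is released only when the process closes the file, so fix by restarting the process.
2. Filesystem reserved blocks: ext4 reserves 5% for root by default; check "Reserved block count" in the output of tune2fs -l <device>.
3. A mount point covering a directory that has data underneath it: du cannot see the hidden files. Unmount the filesystem (or bind-mount / to another location) and check.
4. A large filesystem journal consuming space that du does not report.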
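
Causes 1 and 2 can also be checked without lsof or tune2fs. The following is a minimal, Linux-only sketch: it assumes `/proc/<pid>/fd` symlinks whose targets end in " (deleted)", and it uses `os.statvfs` to estimate the space reserved for root. The function names are illustrative, not part of any existing tool.

```python
import os


def find_deleted_open_files(proc_root="/proc"):
    """Yield (pid, fd, target, size_bytes) for open files already unlinked.

    Relies on /proc/<pid>/fd symlinks whose target ends in " (deleted)".
    Run as root to see every process.
    """
    for pid in os.listdir(proc_root):
        if not pid.isdigit():
            continue
        fd_dir = os.path.join(proc_root, pid, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:  # process exited or permission denied
            continue
        for fd in fds:
            link = os.path.join(fd_dir, fd)
            try:
                target = os.readlink(link)
            except OSError:
                continue
            if not target.endswith(" (deleted)"):
                continue
            try:
                size = os.stat(link).st_size  # follows the fd to the open inode
            except OSError:
                size = 0
            yield int(pid), int(fd), target, size


def reserved_bytes(path="/"):
    """Bytes free for root but not for ordinary users (e.g. ext4 reserved blocks)."""
    st = os.statvfs(path)
    return (st.f_bfree - st.f_bavail) * st.f_frsize
```

Sorting the results of `find_deleted_open_files()` by size shows which process to restart first. Targets marked deleted can also include shared-memory segments, so the list needs a quick sanity check before acting on it.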

L2 (1 question)

1. API response time is 2 seconds but server CPU is at 10%, memory is fine, and disk is fine. Walk through the differential diagnosis.

Show answer

1. DNS resolution delays (extremely common): test with 'time nslookup api.dependency.com', fix with a local DNS cache.
2. Connection pool exhaustion: the application is waiting for a free database/Redis/HTTP connection. Check pool metrics and count established connections with 'ss -tn state established'.
3. Upstream service is slow: your service is fast but blocks on a dependency. Add timeouts and circuit breakers.
4. TCP retransmissions from packet loss: check 'netstat -s | grep -i retrans'. Even 0.5% loss can cause large latency spikes.
5. GC pauses: the application is frozen during stop-the-world garbage collection while average CPU still looks low. Check the GC logs for pause times.
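
Checks 1 and 4 lend themselves to a quick script. The sketch below times a name lookup through the system resolver and computes a retransmission ratio from the text of Linux's `/proc/net/snmp`. Two caveats: `socket.getaddrinfo` goes through local caches and `/etc/hosts`, so it will not match `nslookup` exactly, and the function names are illustrative.

```python
import socket
import time


def time_dns_lookup(host, resolver=socket.getaddrinfo):
    """Return seconds spent resolving `host` via the system resolver."""
    start = time.perf_counter()
    resolver(host, None)
    return time.perf_counter() - start


def tcp_retransmit_ratio(snmp_text):
    """Return RetransSegs / OutSegs from the text of Linux /proc/net/snmp."""
    tcp_lines = [l.split() for l in snmp_text.splitlines() if l.startswith("Tcp:")]
    if len(tcp_lines) < 2:
        raise ValueError("no Tcp header/value pair found")
    header, values = tcp_lines[0], tcp_lines[1]
    stats = dict(zip(header[1:], values[1:]))  # pair column names with counters
    out_segs = int(stats["OutSegs"])
    return int(stats["RetransSegs"]) / out_segs if out_segs else 0.0
```

The counters are cumulative since boot, so reading the file twice a few seconds apart and comparing the deltas gives a better picture of current loss than a single reading. A lookup that takes hundreds of milliseconds, or a ratio near the 0.5% mentioned above, would each be worth following up.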