Quiz: Ops War Stories
3 questions
L0 (1 question)
1. What single question resolves approximately 40% of production incidents within minutes?
Answer:
'Was anything deployed or changed in the last 4 hours?' Recent changes (deploys, config changes, infrastructure changes) are the number one cause of incidents, accounting for roughly 40% of them. If the answer is yes, treat the change as the prime suspect: roll back first, investigate second.
L1 (1 question)
1. df shows 100% disk usage but du -sh /* totals to only 60%. What are the most likely causes of the missing 40%?
Answer:
1. Deleted files still held open by a running process (most common): check with lsof +L1 or lsof | grep deleted, and free the space by restarting the process that holds them.
2. Filesystem reserved blocks: ext4 reserves 5% for root by default; check with tune2fs -l <device>.
3. A mount point covering a directory that already has data underneath it: du cannot see the hidden files, so unmount and check.
4. A large filesystem journal consuming space that du does not report.
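The first two causes can be checked from a script as well as from the shell. The following is a minimal Linux-only sketch, not a replacement for lsof or tune2fs: it assumes /proc is mounted and that the caller may read the target process's fd directory. The function names deleted_open_files and reserved_bytes are illustrative, not part of any standard tool.

```python
import os


def deleted_open_files(pid="self"):
    """Return (fd, target, size_bytes) for deleted files a process still holds open.

    Linux-only sketch: relies on /proc/<pid>/fd symlinks whose target
    ends in ' (deleted)' once the file has been unlinked.
    """
    fd_dir = f"/proc/{pid}/fd"
    results = []
    try:
        entries = os.listdir(fd_dir)
    except OSError:
        return results  # no /proc, no such process, or no permission
    for name in entries:
        link = os.path.join(fd_dir, name)
        try:
            target = os.readlink(link)
            if not target.endswith(" (deleted)"):
                continue
            # stat() follows the /proc link to the still-open inode
            size = os.stat(link).st_size
        except OSError:
            continue  # descriptor was closed while we were scanning
        results.append((int(name), target, size))
    return results


def reserved_bytes(path="/"):
    """Bytes that are free on the filesystem but not available to unprivileged users."""
    st = os.statvfs(path)
    return max(0, st.f_bfree - st.f_bavail) * st.f_frsize
```

Summing the sizes returned by deleted_open_files for a suspect PID gives a rough idea of how much space a restart would release, and reserved_bytes("/") approximates the root-reserved portion that df counts against ordinary users.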
L2 (1 question)
1. API response time is 2 seconds but server CPU is at 10%, memory is fine, and disk is fine. Walk through the differential diagnosis.
Answer:
1. DNS resolution delays (extremely common): test with 'time nslookup api.dependency.com', fix with a local DNS cache.
2. Connection pool exhaustion: the application is waiting for database/Redis/HTTP connections; check pool metrics and the established-connection count from ss -tnp state established.
3. A slow upstream service: your service is fast but blocks on a dependency; add timeouts and circuit breakers.
4. TCP retransmissions from packet loss: check 'netstat -s | grep retransmit'; even 0.5% loss can cause large latency spikes.
5. GC pauses: the application is frozen during garbage collection, and CPU looks idle during stop-the-world pauses.
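To separate the first and fourth causes from the rest, it can help to time name resolution and the TCP connect independently from the affected host. The sketch below is one possible way to do that with the standard library; time_phases is an illustrative name, and the host and port passed to it are whatever dependency you suspect.

```python
import socket
import time


def time_phases(host, port, timeout=3.0):
    """Time DNS resolution and TCP connect separately; returns seconds per phase."""
    t0 = time.perf_counter()
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    t1 = time.perf_counter()

    # Connect to the first resolved address only, so the DNS cost
    # is not mixed into the connect measurement.
    family, socktype, proto, _canonname, addr = infos[0]
    with socket.socket(family, socktype, proto) as sock:
        sock.settimeout(timeout)
        sock.connect(addr)
    t2 = time.perf_counter()

    return {"dns_s": t1 - t0, "connect_s": t2 - t1}


# Example call (hostname taken from the quiz answer, port assumed):
# print(time_phases("api.dependency.com", 443))
```

A large dns_s points toward the resolver, while a connect_s far above the normal round-trip time suggests packet loss or an overloaded peer. If both are small, pool exhaustion, a slow upstream, or GC pauses remain the likelier candidates.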