Grading Rubric

Criterion	Strong (3)	Adequate (2)	Weak (1)
Identified misleading symptom	Ignored the alert volume; immediately checked why pods were restarting (probe failures)	Checked Alertmanager config first, then pivoted to pod investigation	Spent extended time tuning Alertmanager routing, grouping, or inhibition rules
Found root cause in networking domain	Identified DNS latency from extra search domains + ndots:5 as the cause of slow health checks	Found that the health check was slow but did not trace it to DNS resolution	Assumed the health check endpoint or Elasticsearch was the problem
Remediated in Kubernetes domain	Fixed probe design (liveness vs readiness split), increased timeout, and fixed DNS config	Increased the probe timeout but did not fix the DNS config or probe design	Restarted pods or suppressed alerts without fixing the root cause
Cross-domain thinking	Explained the full chain: DNS config -> latency -> probe timeout -> pod restart -> alert storm	Acknowledged DNS and probe interaction but missed the alert storm misdirection	Treated it as a single-domain observability or application issue
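
The first link in that chain (DNS config -> latency) can be made concrete with a small sketch. It approximates how a glibc-style stub resolver orders its lookups when a name has fewer dots than ndots. The search domains and the hostname below are hypothetical, chosen only for illustration; real resolvers differ in details (musl, for example, behaves differently), so treat this as a rough model rather than a description of any specific cluster.

```python
# Rough model of glibc-style search-list expansion (an approximation, not
# a resolver implementation). All domains and hostnames are hypothetical.

SEARCH_DOMAINS = [
    "prod.svc.cluster.local",  # typical Kubernetes-injected entries
    "svc.cluster.local",
    "cluster.local",
    "corp.example.com",        # assumed "extra" search domains
    "example.com",
]


def candidate_queries(name, search_domains, ndots=5):
    """Return the names the resolver would try, in order, until one answers."""
    if name.endswith("."):
        # A trailing dot marks the name as fully qualified: no search list.
        return [name]
    absolute = name + "."
    expanded = [f"{name}.{domain}." for domain in search_domains]
    if name.count(".") >= ndots:
        # Enough dots: try the name as given first, then the search list.
        return [absolute] + expanded
    # Too few dots: every search domain is tried before the name itself.
    return expanded + [absolute]


if __name__ == "__main__":
    for query in candidate_queries("es.logging.example.com", SEARCH_DOMAINS):
        print(query)
```

With these assumed inputs the hostname has three dots, which is below ndots:5, so five search-suffixed names are attempted before the real one, six candidates in total. Resolvers commonly issue both A and AAAA lookups per candidate, so the query count for a single health check can be higher still. Appending a trailing dot to the hostname, or lowering ndots in the pod's DNS config, collapses the list to a single candidate in this model.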

Prerequisite Topic Packs

alerting-rules — needed for Domain A investigation (alert configuration, Alertmanager routing)
k8s-ops (Probes) — needed for understanding liveness vs readiness probe design
dns-deep-dive — needed for Domain B root cause (ndots, search domains, DNS query amplification)
k8s-pods-and-scheduling — needed for Domain C remediation (probe configuration, pod spec)
k8s-debugging-playbook — needed for systematic Kubernetes troubleshooting