Grading Rubric

| Criterion | Strong (3) | Adequate (2) | Weak (1) |
|---|---|---|---|
| Identified misleading symptom | Ignored the alert volume; immediately checked why pods were restarting (probe failures) | Checked Alertmanager config first, then pivoted to pod investigation | Spent extended time tuning Alertmanager routing, grouping, or inhibition rules |
| Found root cause in networking domain | Identified DNS latency from extra search domains + ndots:5 as the cause of slow health checks | Found the health check was slow but not the DNS resolution cause | Assumed the health check endpoint or Elasticsearch was the problem |
| Remediated in kubernetes domain | Fixed probe design (liveness vs readiness split), increased timeout, and fixed DNS config | Increased the probe timeout but did not fix the DNS or probe design | Restarted pods or suppressed alerts without fixing the root cause |
| Cross-domain thinking | Explained the full chain: DNS config -> latency -> probe timeout -> pod restart -> alert storm | Acknowledged DNS and probe interaction but missed the alert storm misdirection | Treated it as a single-domain observability or application issue |
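
For graders who want the "Strong" root cause made concrete, the sketch below approximates how a glibc-style stub resolver expands a hostname when `ndots:5` and a search list are in effect. It is an illustration under stated assumptions, not the scenario's actual configuration: the function name `candidate_queries`, the hostname `es.example.com`, the `prod` namespace, and the two extra search domains are all hypothetical.

```python
def candidate_queries(name, search_domains, ndots=5):
    """Return the names a glibc-style stub resolver would try, in order.

    Simplified model: ignores caching, per-query timeouts, and the
    separate A/AAAA lookups that may be issued for each candidate.
    """
    if name.endswith("."):
        # A trailing dot marks the name as absolute; the search list is skipped.
        return [name]

    expanded = [f"{name}.{domain}." for domain in search_domains]
    as_is = f"{name}."

    if name.count(".") >= ndots:
        # Enough dots: try the name as given first, then the search list.
        return [as_is] + expanded
    # Too few dots: walk the whole search list before trying the name as given.
    return expanded + [as_is]


# Hypothetical pod resolv.conf search list: the three cluster domains for an
# assumed "prod" namespace plus two assumed extra domains.
search = [
    "prod.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "corp.example.com",
    "internal.example.com",
]

for query in candidate_queries("es.example.com", search):
    print(query)
```

Under these assumptions, a name with fewer than five dots is tried against every search domain before it is tried as written, so each extra search domain adds another failed lookup to every health check. If each lookup is slow, the sum can exceed the probe timeout, which is the first link in the chain the rubric expects a strong answer to explain. Appending a trailing dot to the hostname or lowering `ndots` shortens the candidate list in this model.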

Prerequisite Topic Packs

  • alerting-rules — needed for Domain A investigation (alert configuration, Alertmanager routing)
  • k8s-ops (Probes) — needed for understanding liveness vs readiness probe design
  • dns-deep-dive — needed for Domain B root cause (ndots, search domains, DNS query amplification)
  • k8s-pods-and-scheduling — needed for Domain C remediation (probe configuration, pod spec)
  • k8s-debugging-playbook — needed for systematic Kubernetes troubleshooting