
Grading Rubric

Each criterion is scored Strong (3), Adequate (2), or Weak (1).

Identified misleading symptom
  • Strong (3): Used IPMI to access the node; identified the NIC firmware crash from dmesg within 10 minutes
  • Adequate (2): Checked IPMI but took time to interpret the firmware error messages
  • Weak (1): Got stuck at "network unreachable"; only tried rebooting

Found root cause in datacenter domain
  • Strong (3): Identified the specific firmware version and the known bug; checked the fleet for other affected nodes
  • Adequate (2): Found the NIC crash but did not identify it as a known firmware issue
  • Weak (1): Assumed it was a cable or switch problem

Remediated in devops_tooling domain
  • Strong (3): Created a rolling Ansible playbook with drain/update/uncordon steps; ran it against all affected nodes
  • Adequate (2): Updated the firmware on the immediate node but did not automate the fix fleet-wide
  • Weak (1): Manually rebooted the node and waited for the issue to recur

Cross-domain thinking
  • Strong (3): Explained the full chain: firmware bug -> NIC failure -> network loss -> k8s NotReady; proposed fleet-wide automation
  • Adequate (2): Acknowledged the firmware issue but did not connect it to fleet management
  • Weak (1): Treated it as a one-off hardware failure
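The Strong diagnostic path above (out-of-band IPMI access, then a dmesg scan) can be sketched as a short shell session. The BMC address, credentials, interface name, and kernel log lines below are hypothetical placeholders, not output from any specific NIC:

```shell
# Step 1: reach the unreachable node out-of-band via its BMC
# (run from a management host; -H/-U/-P values are placeholders):
#   ipmitool -I lanplus -H <bmc-address> -U <user> -P <password> sol activate
#
# Step 2: on the serial console, scan the kernel log for firmware errors.
# Here we grep a canned sample instead of a live dmesg:
dmesg_sample='[1042.7] eth0: NIC firmware assert detected, resetting adapter
[1042.9] eth0: firmware version 1.2.3'
echo "$dmesg_sample" | grep -ci 'firmware'   # count of matching lines
```

A hit here points the investigation at the datacenter domain (firmware) rather than at cables, switches, or Kubernetes itself.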

Prerequisite Topic Packs

  • k8s-node-lifecycle — needed for Domain A investigation (node conditions, NotReady, pod eviction)
  • server-hardware — needed for Domain B root cause (NIC hardware, firmware, IPMI access)
  • firmware — needed for Domain B root cause (firmware update procedures, known bugs)
  • ipmi-and-ipmitool — needed for Domain B (out-of-band management access)
  • ansible — needed for Domain C remediation (playbook creation, rolling updates, fleet management)
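The fleet-wide remediation described under the Strong criterion could be outlined as an Ansible playbook. This is a minimal sketch, not a tested playbook: the "affected_nodes" inventory group, the vendor tool path, and the firmware version are placeholders, and the drain/uncordon tasks assume kubectl access from the control host:

```yaml
# Rolling NIC firmware remediation: drain -> update -> reboot -> uncordon.
# Placeholders: affected_nodes group, /opt/vendor/nic-fw-update, version 1.2.4.
- hosts: affected_nodes
  serial: 1   # one node at a time, so the cluster keeps serving capacity
  tasks:
    - name: Drain the node from Kubernetes
      command: kubectl drain {{ inventory_hostname }} --ignore-daemonsets --delete-emptydir-data
      delegate_to: localhost

    - name: Apply the fixed NIC firmware (vendor tool path is a placeholder)
      command: /opt/vendor/nic-fw-update --version 1.2.4
      become: true

    - name: Reboot so the new firmware takes effect
      reboot:
      become: true

    - name: Return the node to service
      command: kubectl uncordon {{ inventory_hostname }}
      delegate_to: localhost
```

Running with `serial: 1` mirrors the rubric's rolling requirement; a candidate who only patches the single failed node by hand lands in the Adequate column.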