
Grading Rubric

Each criterion is scored Strong (3), Adequate (2), or Weak (1).

Identified misleading symptom
  • Strong (3): Used IPMI to access the node; identified the NIC firmware crash from dmesg within 10 minutes
  • Adequate (2): Checked IPMI but took time to interpret the firmware error messages
  • Weak (1): Got stuck at "network unreachable"; only tried rebooting

Found root cause in datacenter domain
  • Strong (3): Identified the specific firmware version and the known bug; checked the fleet for other affected nodes
  • Adequate (2): Found the NIC crash but did not identify it as a known firmware issue
  • Weak (1): Assumed it was a cable or switch problem

Remediated in devops_tooling domain
  • Strong (3): Created a rolling Ansible playbook with drain/update/uncordon steps; ran it against all affected nodes
  • Adequate (2): Updated the firmware on the immediate node but did not automate the fix fleet-wide
  • Weak (1): Manually rebooted the node and waited for the issue to recur

Cross-domain thinking
  • Strong (3): Explained the full chain: firmware bug -> NIC failure -> network loss -> k8s NotReady; proposed fleet-wide automation
  • Adequate (2): Acknowledged the firmware issue but did not connect it to fleet management
  • Weak (1): Treated it as a one-off hardware failure
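The Strong diagnostic path above (out-of-band IPMI access, then a dmesg scan) can be sketched as a short shell session. The BMC address, credentials, interface name, and kernel log lines below are hypothetical placeholders, not output from any specific NIC:

```shell
# Step 1: reach the unreachable node out-of-band via its BMC
# (run from a management host; -H/-U/-P values are placeholders):
#   ipmitool -I lanplus -H <bmc-address> -U <user> -P <password> sol activate
#
# Step 2: on the serial console, scan the kernel log for firmware errors.
# Here we grep a canned sample instead of a live dmesg:
dmesg_sample='[1042.7] eth0: NIC firmware assert detected, resetting adapter
[1042.9] eth0: firmware version 1.2.3'
echo "$dmesg_sample" | grep -ci 'firmware'   # count of matching lines
```

A hit here points the investigation at the datacenter domain (firmware) rather than at cables, switches, or Kubernetes itself.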

Prerequisite Topic Packs

  • k8s-node-lifecycle — needed for Domain A investigation (node conditions, NotReady, pod eviction)
  • server-hardware — needed for Domain B root cause (NIC hardware, firmware, IPMI access)
  • firmware — needed for Domain B root cause (firmware update procedures, known bugs)
  • ipmi-and-ipmitool — needed for Domain B (out-of-band management access)
  • ansible — needed for Domain C remediation (playbook creation, rolling updates, fleet management)
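The fleet-wide remediation described under the Strong criterion could be outlined as an Ansible playbook. This is a minimal sketch, not a tested playbook: the "affected_nodes" inventory group, the vendor tool path, and the firmware version are placeholders, and the drain/uncordon tasks assume kubectl access from the control host:

```yaml
# Rolling NIC firmware remediation: drain -> update -> reboot -> uncordon.
# Placeholders: affected_nodes group, /opt/vendor/nic-fw-update, version 1.2.4.
- hosts: affected_nodes
  serial: 1   # one node at a time, so the cluster keeps serving capacity
  tasks:
    - name: Drain the node from Kubernetes
      command: kubectl drain {{ inventory_hostname }} --ignore-daemonsets --delete-emptydir-data
      delegate_to: localhost

    - name: Apply the fixed NIC firmware (vendor tool path is a placeholder)
      command: /opt/vendor/nic-fw-update --version 1.2.4
      become: true

    - name: Reboot so the new firmware takes effect
      reboot:
      become: true

    - name: Return the node to service
      command: kubectl uncordon {{ inventory_hostname }}
      delegate_to: localhost
```

Running with `serial: 1` mirrors the rubric's rolling requirement; a candidate who only patches the single failed node by hand lands in the Adequate column.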