Diagnostic Questions¶

Before revealing the investigation path:¶

A node goes NotReady with 100% ping loss, but IPMI shows the OS is running. What are the possible causes? How do you differentiate between a network cable issue, a switch port issue, and a NIC issue?
You access the node via IPMI and see bnxt_en: firmware heartbeat stalled, resetting in dmesg. What does this tell you about the failure domain? Is this a Linux kernel issue, a driver issue, or a firmware issue?
Three other nodes in the cluster have the same firmware version and two have experienced the same failure recently. What is the most efficient way to identify all affected nodes and verify their firmware versions?
Why is the fix an Ansible playbook (devops tooling) rather than a manual firmware update (datacenter ops) or a Kubernetes configuration change (cordon/drain)?
What process changes would prevent a hardware batch from entering production with outdated firmware? How would you make firmware compliance a continuous check rather than a one-time activity?