Interview Gauntlet: Pods Crash-Looping

Category: Incident Response · Difficulty: L2-L3 · Duration: 15-20 minutes · Domains: Kubernetes, Linux Kernel


Round 1: The Opening

Interviewer: "Your monitoring shows pods in CrashLoopBackOff. They restart, run for about 30 seconds, then crash again. What's your investigation process?"

Strong Answer:

"CrashLoopBackOff means the container is starting and then exiting with a non-zero exit code repeatedly. Kubernetes backs off the restart interval exponentially. My first steps: kubectl get pods -n production to see which pods are affected and their restart counts. Then kubectl describe pod <pod-name> to check the Events section — this tells me the exit code, any OOMKilled status, and whether there are image pull issues. Next, kubectl logs <pod-name> --previous to get the logs from the last crash — --previous is critical because the current container might have already crashed with no logs yet. The exit code matters: exit code 1 is a general application error, exit code 137 usually means OOMKilled (SIGKILL from the kernel), exit code 139 is a segfault. If logs show an application error (missing config, failed database connection, bad migration), that's straightforward. If there are no logs at all and the exit code is 137, the container is being OOM-killed before it can write output — I'd check kubectl describe pod for the Last State section showing reason: OOMKilled."
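The triage sequence above can be sketched as a short script. The pod name and namespace are placeholders, and the helper function just encodes the standard 128+signal exit-code convention described in the answer:

```shell
# Hypothetical pod/namespace names -- substitute your own.
# kubectl get pods -n production                          # affected pods, restart counts
# kubectl describe pod my-app-7d4b9 -n production         # Events, Last State, exit code
# kubectl logs my-app-7d4b9 -n production --previous      # logs from the crashed container

# Map a container exit code to its likely meaning (codes > 128 are 128 + signal number).
explain_exit_code() {
  case "$1" in
    1)   echo "general application error" ;;
    137) echo "SIGKILL (128+9): usually OOMKilled" ;;
    139) echo "SIGSEGV (128+11): segmentation fault" ;;
    132) echo "SIGILL (128+4): illegal instruction" ;;
    *)   if [ "$1" -gt 128 ] 2>/dev/null; then
           echo "killed by signal $(($1 - 128))"
         else
           echo "application-defined exit code $1"
         fi ;;
  esac
}

explain_exit_code 137   # prints: SIGKILL (128+9): usually OOMKilled
```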

Common Weak Answers:

  • "I'd check the logs." — Correct direction but missing --previous. Without --previous, you're looking at the current (possibly empty) container's logs, not the one that crashed.
  • "I'd increase the memory limits." — Might work for OOMKilled but jumps to a solution before diagnosis. The crash could be a code bug, a missing dependency, a config error.
  • "I'd delete the pod and let it recreate." — CrashLoopBackOff means the pod is already being recreated and still crashing. Deleting it just resets the backoff timer.

Round 2: The Probe

Interviewer: "The crash is only happening on one specific node. All pods scheduled to that node crash, but the same pods run fine on other nodes. The node shows Ready status. What do you investigate?"

What the interviewer is testing: The ability to reason about node-specific issues that don't show up in Kubernetes node conditions.

Strong Answer:

"A node-specific crash that affects all pods is almost always a node-level issue: disk, runtime, kernel, or hardware. First, I'd check node conditions beyond Ready: kubectl describe node <node> and look for DiskPressure, MemoryPressure, PIDPressure, or taints. A node can be Ready but have disk pressure. Then I'd check the container runtime: ssh to the node and run crictl ps -a to see if containers are actually starting, and journalctl -u containerd -n 200 (or the Docker daemon's logs, depending on the runtime) for runtime errors. I'd check dmesg | tail -50 on the node for kernel-level messages — OOM killer invocations, segfaults, hardware errors. I'd also check the kubelet logs: journalctl -u kubelet -n 200 --no-pager for scheduling or mount errors. If it's disk-related, df -h and df -i (inodes can be exhausted even when disk space is available). One thing I've seen in practice: a corrupted container image layer in the node's local cache can cause all pods using that image to crash. crictl rmi --prune clears unused images and forces a re-pull."
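A minimal sketch of that node-level sweep. The ssh'd commands are shown as comments (they need a real node), and the small helper just counts OOM-killer lines in a captured dmesg snippet; the sample log below is invented for illustration:

```shell
# On the suspect node (after ssh'ing in):
#   crictl ps -a                              # are containers actually starting?
#   journalctl -u containerd -n 200           # container runtime errors
#   journalctl -u kubelet -n 200 --no-pager   # scheduling or mount errors
#   dmesg | tail -50                          # OOM killer, segfaults, hardware errors
#   df -h && df -i                            # disk space AND inodes

# Count OOM-killer invocations in a captured dmesg log read from stdin.
count_oom_events() {
  grep -c -E 'Out of memory|oom-kill' || true   # || true: grep -c exits 1 on zero matches
}

# Invented sample output for demonstration:
sample_dmesg='[1234.5] oom-kill:constraint=CONSTRAINT_MEMCG,task=java
[1240.1] Out of memory: Killed process 4321 (java)
[1300.9] eth0: link up'

printf '%s\n' "$sample_dmesg" | count_oom_events   # prints 2
```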

Trap Alert:

If the candidate bluffs here: The interviewer will ask "How do you check if it's an inode exhaustion issue vs disk space?" The answer is df -i shows inode usage. A node can have 50% free disk space but 100% inodes consumed if millions of tiny files were created (common with log files or container overlayfs layers). Candidates who don't know about inode exhaustion likely haven't debugged node-level storage issues.
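The disk-vs-inode distinction costs one extra `df` invocation. A small sketch (using `/` as an example mount point) that pulls the usage percentage out of `df` output for either dimension:

```shell
# Usage percentage for disk space (-h) or inodes (-i) on a mount point.
# -P keeps each filesystem on one line so awk's field numbering is stable.
usage_pct() {  # usage_pct <-h|-i> <mountpoint>
  df -P "$1" "$2" | awk 'NR==2 {gsub("%", "", $5); print $5}'
}

echo "disk:   $(usage_pct -h /)%"
echo "inodes: $(usage_pct -i /)%"
# A node can show disk at ~50% while inodes sit at 100% -- that's inode exhaustion.
```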


Round 3: The Constraint

Interviewer: "The node looks healthy by all standard checks — disk is fine, memory is fine, no taints, kubelet is happy. But you strace the crashing container and see it's being killed by a signal: Illegal instruction (core dumped). What is this?"

Strong Answer:

"An illegal instruction signal (SIGILL, which maps to exit code 132) means the CPU attempted to execute an instruction it doesn't support. This is almost always one of two things: either the binary was compiled with CPU instruction set extensions that the node's CPU doesn't have (like AVX-512 instructions on a CPU that only supports AVX2), or there's a genuine hardware fault in the CPU. In Kubernetes, this happens when you have a heterogeneous cluster — say, some nodes are running on newer Intel processors and some on older AMD processors. If the container image was built with compiler optimizations for the newer CPU (like -march=native or -mavx512f), it will crash on nodes with the older CPU. I'd check the CPU model on the affected node versus healthy nodes: cat /proc/cpuinfo | grep 'model name' and grep -c avx512 /proc/cpuinfo. If the flags differ, the fix is a node affinity rule that schedules the pod only on compatible nodes, or rebuilding the image with a more conservative target architecture. If the CPUs are identical, then it might be a genuine hardware fault — cosmic ray bit flip, degraded CPU — and the node should be cordoned and drained for hardware diagnostics."
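The flag comparison can be scripted. The flag lists below are shortened, invented examples; on a real node you'd feed `has_flag` the `flags` line from /proc/cpuinfo:

```shell
# Check whether a CPU flags string contains a given feature flag (whole-word match).
has_flag() {  # has_flag "<flags string>" <flag>
  printf '%s\n' "$1" | tr ' ' '\n' | grep -qx "$2"
}

# Shortened, made-up flag lines for illustration:
new_node_flags='fpu sse2 avx avx2 avx512f avx512dq'
old_node_flags='fpu sse2 avx avx2'

# On a real node: flags="$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)"
if has_flag "$new_node_flags" avx512f && ! has_flag "$old_node_flags" avx512f; then
  echo "avx512f differs: binaries built with -mavx512f will SIGILL on the old node"
fi
```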

The Senior Signal:

What separates a senior answer: Immediately recognizing SIGILL as a CPU instruction compatibility issue, not a generic "crash." Knowing that heterogeneous clusters are a real problem, especially in cloud environments where a cluster might span multiple instance generations (e.g., c5.xlarge with Skylake vs c6i.xlarge with Ice Lake). Also: mentioning -march=native as the root cause at the build level, which is the most common way this happens in practice.


Round 4: The Curveball

Interviewer: "It turns out the node has a slightly different CPU generation from the rest of the cluster — it was added during a scale-up event and the cloud provider gave you a different instance type than requested. How do you prevent this class of issue in the future?"

Strong Answer:

"Several layers. First, at the cluster level: in AWS, use a Launch Template with a specific instance type list for the Auto Scaling Group, not just instance families. If you allow c5.* and c6i.* in the same node group, you'll get mixed CPU generations. In EKS managed node groups, you can pin to specific instance types. If you need multiple instance types for cost optimization (like with Spot instances), put different CPU generations in separate node groups with different labels. Second, at the workload level: use node affinity or nodeSelector to schedule CPU-sensitive workloads on specific node groups. Label the node groups with the CPU generation — a custom key like example.com/cpu-feature: avx512 (the kubernetes.io/ label prefix is reserved for Kubernetes itself, so use your own domain). Third, at the build level: compile binaries with a baseline instruction set rather than -march=native. For Go, this is the default — Go produces portable binaries. For C/C++ and JVM native compilation (GraalVM), you need to specify the target architecture explicitly. Fourth, at the detection level: add a DaemonSet that checks CPU features on every node and labels them automatically, so new nodes that join with unexpected CPU capabilities are immediately visible."
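The auto-labeling DaemonSet idea reduces to a script each pod would run against its host's /proc/cpuinfo. This is a sketch: the label key `example.com/cpu-feature` is an invented convention, and the `kubectl label` call is only printed here rather than executed:

```shell
# Derive a label value from a node's CPU flags string (sketch; key name is made up).
cpu_feature_label() {  # cpu_feature_label "<flags string>"
  if printf '%s\n' "$1" | tr ' ' '\n' | grep -qx avx512f; then
    echo avx512
  else
    echo baseline
  fi
}

# On a real node: flags="$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)"
flags='fpu sse2 avx avx2 avx512f'   # invented sample
label="$(cpu_feature_label "$flags")"
echo "would run: kubectl label node \$NODE_NAME example.com/cpu-feature=$label --overwrite"
```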

Trap Question Variant:

The right answer is "This depends on your cloud provider's behavior." Candidates who say "just pin the instance type" are on the right track, but in practice, cloud providers sometimes substitute compatible instance types during capacity shortages (especially with Spot). Acknowledging that the cloud platform can change what it gives you — and that your infrastructure needs to handle that — is the senior insight. It's fine to say "I don't know the exact behavior of every cloud provider here, but the defense-in-depth approach is to enforce at multiple layers."


Round 5: The Synthesis

Interviewer: "This incident started as 'pods are crashing' and ended at 'the cloud provider gave us a different CPU.' What does this teach you about incident investigation methodology?"

Strong Answer:

"Two lessons. First, always validate your assumptions about the infrastructure. We assumed all nodes in the cluster were identical because they were supposed to be. The Kubernetes scheduler assumed the node was fine because it was Ready. The containers assumed the CPU supported their instruction set. Every layer was making an assumption that was invisibly wrong. In incident response, when the obvious checks come back clean, the issue is usually in a layer you're not looking at — and often it's an assumption you didn't know you were making. Second, incidents that seem bizarre are usually the intersection of two normal things: a container compiled with specific CPU extensions (normal in optimized builds) plus a node with a different CPU (normal in cloud scale-up). Neither is a bug alone. The combination is the bug. This is why good incident response moves from 'what changed' to 'what's different between the working case and the broken case.' The diff between the crashing node and the healthy nodes would have revealed the CPU difference earlier if we'd thought to compare hardware specs. I'd add a postmortem action item to include hardware/instance type comparison as a standard step in the node-specific failure playbook."
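The "diff the working case against the broken case" step can be made literal for CPU flags. The two flag strings here are invented samples; the loop prints the flags the healthy node has that the crashing node lacks:

```shell
healthy_flags='fpu sse2 avx avx2 avx512f avx512dq'   # made-up sample
crashing_flags='fpu sse2 avx avx2'                   # made-up sample

# For each flag on the healthy node, report it if the crashing node lacks it.
missing="$(for f in $healthy_flags; do
  printf '%s\n' $crashing_flags | grep -qx "$f" || echo "$f"
done)"

echo "$missing"   # prints avx512f and avx512dq, one per line
```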

What This Sequence Tested:

| Round | Skill Tested |
| --- | --- |
| 1 | Kubernetes pod debugging fundamentals |
| 2 | Node-level issue isolation and container runtime debugging |
| 3 | Low-level Linux/CPU knowledge and instruction set awareness |
| 4 | Infrastructure guardrails and prevention-oriented thinking |
| 5 | Incident investigation methodology and assumption-checking |
