
Interview Gauntlet: When Automation Went Wrong

Category: Behavioral + Technical Hybrid · Difficulty: L2-L3 · Duration: 15-20 minutes · Domains: Automation, Risk Management


Round 1: The Opening

Interviewer: "Tell me about a time automation went wrong. What broke and what was the impact?"

Strong Answer:

"We had an automated cleanup script that deleted unused Docker images from our container registry to save storage costs. It ran nightly via a CronJob, identified images that hadn't been pulled in 30 days, and deleted them. It worked perfectly for three months. Then one Friday night, it deleted the production image for our core API service. What happened: the API pods had been intentionally scaled to 0 during a maintenance window, and when we tried to scale them back up 2 hours later, Kubernetes couldn't pull the image because the cleanup script had deleted it. The image hadn't been 'pulled' in 30 days because the pods had been running from the cached copy on the node, and that cache was cleared during the maintenance. Impact: the API was unavailable for roughly 2 hours in total, the planned maintenance window plus about 35 minutes of unplanned recovery from discovery, because we had to rebuild the image from the CI pipeline (15 minutes for the build plus 10 minutes to deploy and validate). The reputational impact was significant because it was a self-inflicted outage from a cost-saving automation."
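The flawed heuristic is easy to reproduce. A minimal sketch of the naive selection logic, assuming the registry's access logs have already been summarized into (ref, last-pulled) pairs; the data shape and function name are illustrative, not from the actual script:

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=30)

def select_images_to_delete(images, now=None):
    """Return image refs whose most recent registry pull is older than
    30 days.

    `images` is a list of (ref, last_pulled) tuples, where last_pulled
    is a timezone-aware datetime from the registry's pull-event logs.
    This is the flawed heuristic from the incident: it never asks
    Kubernetes whether the image is running from a node's cache.
    """
    now = now or datetime.now(timezone.utc)
    return [ref for ref, last_pulled in images
            if now - last_pulled > RETENTION]
```

Note that the only input is registry-side pull history; nothing here can see an image that is in production but served from a node cache.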

Common Weak Answers:

  • "A script had a bug and we fixed it." — Too vague. No specifics about what went wrong, why, or the impact.
  • "We accidentally deleted some files." — Doesn't demonstrate understanding of automation-specific failure modes.
  • Stories without real impact — If the automation failure didn't cause meaningful damage, the story doesn't demonstrate risk management lessons.

Round 2: The Probe

Interviewer: "Walk me through the failure chain. Why did 'not pulled in 30 days' seem like a safe heuristic, and where did it break down?"

What the interviewer is testing: Ability to analyze why a reasonable-seeming automation rule produced a bad outcome, and whether the candidate understands the edge cases in container image lifecycle.

Strong Answer:

"The heuristic 'not pulled in 30 days' seemed safe because in normal operation, our CI/CD pipeline builds and deploys new images at least weekly. So any image older than 30 days without a pull was genuinely unused — it was an old version that had been superseded. The failure came from an assumption: that 'not recently pulled' equals 'not in use.' In Kubernetes, once a pod is running, it doesn't pull the image again — the container runtime uses the cached copy on the node. So an image can be actively in production but show zero pulls for months. The cleanup script was querying the registry's access logs for pull events, not querying Kubernetes for which images were actually running. The additional complication: we had imagePullPolicy: IfNotPresent (the default for tagged images), so pods only pull on first scheduling to a node. If a pod is running on a node, it will never pull again until it's rescheduled to a different node. The correct approach would have been: before deleting any image, check all Kubernetes clusters for running pods that reference that image digest. Only delete images that are neither recently pulled NOR currently deployed. This requires cross-referencing the registry with the Kubernetes API, which our simple script didn't do."
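The corrected logic can be sketched as a cross-reference between the registry's stale list and the images referenced by running pods. The pod-spec shape follows the Kubernetes API (`spec.containers[].image`, `spec.initContainers[].image`); the function names are illustrative:

```python
def images_in_use(pods):
    """Collect every image reference from pod specs (containers and
    initContainers), as returned by the Kubernetes API across clusters."""
    refs = set()
    for pod in pods:
        spec = pod.get("spec", {})
        for section in ("containers", "initContainers"):
            for container in spec.get(section, []):
                refs.add(container["image"])
    return refs

def images_safe_to_delete(stale_refs, pods):
    """Delete only images that are BOTH stale by pull age AND not
    referenced by any running pod: either condition alone is unsafe."""
    return sorted(set(stale_refs) - images_in_use(pods))
```

In the incident, the production API image would have appeared in `images_in_use` and been excluded, even though its last registry pull was months old.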

Trap Alert:

If the candidate bluffs here: The interviewer will ask "What's the difference between imagePullPolicy: Always vs IfNotPresent vs Never, and when would you use each?" Always forces a registry pull on every pod start (good for :latest tags, bad for performance). IfNotPresent uses the cached image if it exists on the node (good for immutable tags like v1.2.3, default for tagged images). Never only uses locally-present images (used in air-gapped environments or with pre-loaded images). The default for images tagged :latest is Always; for any other tag, it's IfNotPresent. This is a common Kubernetes gotcha.
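The defaulting rule is small enough to capture in a few lines. This is a sketch of Kubernetes' documented behavior, not code from any real client library:

```python
def default_image_pull_policy(image_ref):
    """Kubernetes' default when imagePullPolicy is omitted: a ':latest'
    tag (or no tag at all, which implies ':latest') defaults to Always;
    any other explicit tag defaults to IfNotPresent."""
    _, sep, tag = image_ref.rpartition(":")
    # A ':' in a registry host port ("host:5000/app") is not a tag.
    if not sep or "/" in tag:
        tag = "latest"
    return "Always" if tag == "latest" else "IfNotPresent"
```

The incident's images were tagged immutably (e.g. v1.2.3), so they defaulted to IfNotPresent and ran from the node cache indefinitely.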


Round 3: The Constraint

Interviewer: "How did you detect that the automation had caused the outage? If the maintenance window hadn't ended that night, how long would it have taken to discover the deleted image?"

Strong Answer:

"We detected it immediately because the scale-up failed — Kubernetes events showed ImagePullBackOff and we traced it back to the missing image within minutes. But you're asking the right question: if we hadn't tried to scale up that night, we might not have discovered the deletion for days or weeks. The image would have been gone from the registry, and the next time a pod was evicted or rescheduled to a different node (which happens regularly during node upgrades or spot instance reclamation), it would fail to pull the missing image. That could happen during normal operation on a Tuesday afternoon without any obvious trigger. This is what makes the failure mode scary — it's a time bomb. The deletion happens silently, and the outage happens later when something triggers a re-pull. To detect this proactively, we should have had a check that compares deployed images against registry contents. A daily job that runs kubectl get pods -A -o jsonpath='{..image}' across all clusters, deduplicates the image references, and verifies each one exists in the registry. If any deployed image is missing from the registry, alert immediately — even if the pods are currently running fine — because the next reschedule will cause an outage."
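The proactive check described above can be sketched as follows: `deployed_images` shells out to the same kubectl query quoted in the answer, and the comparison itself is a pure function. Function names and the alerting hook are illustrative assumptions:

```python
import shlex
import subprocess

def deployed_images():
    """Gather every image reference from running pods in the current
    cluster context, using the jsonpath query quoted above. A real job
    would loop this over every cluster context."""
    out = subprocess.run(
        shlex.split("kubectl get pods -A -o jsonpath={..image}"),
        capture_output=True, text=True, check=True,
    ).stdout
    return set(out.split())

def missing_from_registry(deployed, registry_contents):
    """Deployed images absent from the registry: each one is a latent
    outage that fires on the next pod reschedule, so alert immediately
    even while the pods look healthy."""
    return sorted(set(deployed) - set(registry_contents))
```

Run daily, this converts the "time bomb" into an alert at deletion time instead of an outage at reschedule time.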

The Senior Signal:

What separates a senior answer: Recognizing the time bomb nature of the failure — the deletion and the outage are separated in time, which makes the failure harder to detect and correlate. The proactive check (comparing deployed images against registry contents) shows defensive thinking. Also: understanding that Kubernetes regularly reschedules pods (node upgrades, spot evictions, rebalancing) and any reliance on node-cached images is fragile.


Round 4: The Curveball

Interviewer: "After this incident, a colleague suggests: 'We should stop automating dangerous operations. The cleanup script saved us $200/month in storage but caused a 2-hour outage. The cost doesn't justify the risk.' How do you respond?"

Strong Answer:

"They're right about the cost-benefit analysis of this specific automation — $200/month in storage savings is not worth the risk of a production outage. But the conclusion 'stop automating dangerous operations' is too broad. The right lesson is: automate with appropriate guardrails, not 'don't automate.' The cleanup script's problem wasn't that it was automated — it's that it was automated without safeguards. A human doing the same task manually would have made the same mistake if they applied the same 'not pulled in 30 days' heuristic. The fix is better guardrails, not manual toil. Specifically: add a dry-run mode that reports what would be deleted without deleting. Run the dry-run daily and send the results to Slack for human review. Only actually delete after 7 days on the dry-run list with no objection. Add the cross-reference check against running Kubernetes pods. Add a blocklist for images tagged with certain labels (like protected: true or images matching production-*). Limit deletions per run — if the script would delete more than 10% of the registry, halt and alert. The principle: automate the detection and preparation, gate the destructive action on human approval or verified safety checks. This applies broadly to any automation that modifies or deletes production resources — database cleanup scripts, cloud resource garbage collection, log rotation, certificate renewal. All of these should have dry-run modes and blast radius limits."
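The grace-period and blast-radius guardrails from the answer can be sketched together. Everything here is illustrative: the thresholds, the `first_flagged` bookkeeping, and the report shape are assumptions, not the team's actual implementation:

```python
from datetime import date, timedelta

GRACE_PERIOD = timedelta(days=7)   # time on the dry-run list before deletion
MAX_DELETE_FRACTION = 0.10         # halt if a run would delete >10% of registry

def plan_deletions(candidates, first_flagged, registry_size, today,
                   execute=False):
    """Apply guardrails to a naive candidate list.

    candidates: image refs the pull-age heuristic wants to delete.
    first_flagged: {ref: date the ref first appeared on the dry-run list}.
    With execute=False (the default) nothing is ever deleted; the report
    is what would be posted for human review.
    """
    # Guardrail: only refs that have sat on the dry-run list for 7+ days.
    eligible = [r for r in candidates
                if today - first_flagged.get(r, today) >= GRACE_PERIOD]
    # Guardrail: blast radius limit; a suspiciously large run halts.
    if len(eligible) > MAX_DELETE_FRACTION * registry_size:
        raise RuntimeError(
            f"refusing to delete {len(eligible)} of {registry_size} images")
    report = {"candidates": list(candidates),
              "eligible": eligible,
              "mode": "execute" if execute else "dry-run"}
    return (eligible if execute else []), report
```

The destructive path requires both the explicit `execute=True` and a candidate that survived a week of human-reviewed dry runs.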

Trap Question Variant:

The right answer is nuanced. Candidates who agree ("yeah, stop automating dangerous things") are giving up on a core SRE principle. Candidates who disagree without empathy ("the colleague is wrong") are dismissing a valid concern. The senior answer acknowledges the valid concern (the risk-reward was bad for this specific case), then reframes the solution (better guardrails, not less automation). The phrase "automate the detection, gate the action" is a useful heuristic.


Round 5: The Synthesis

Interviewer: "What guardrails do you put on any automation that touches production systems?"

Strong Answer:

"Five guardrails I apply to every production automation. First, dry-run by default. Every script that creates, modifies, or deletes production resources must have a --dry-run flag that is the default mode. The actual modification requires an explicit --execute or --confirm flag. This prevents accidental execution and enables review. Second, blast radius limits. The script should have a maximum number of resources it can affect per run. If the cleanup script is expected to delete 5 images and suddenly wants to delete 500, it should halt and alert rather than proceeding. I call this the 'are you sure?' threshold — configurable but always present. Third, logging and auditability. Every action the automation takes is logged with a timestamp, the resource affected, the action taken, and the justification (which rule triggered it). After an incident, you need to reconstruct what the automation did and why. Fourth, staged rollout. For automations that run across multiple environments or clusters, process them in order: dev first, wait, staging, wait, then production. If something goes wrong in dev, production is untouched. Fifth, kill switch. Every production automation should have a way to be instantly disabled without modifying code — a feature flag, a ConfigMap, or a simple 'if this file exists, do nothing' check. When an automation is suspected of causing issues, you need to stop it in seconds, not minutes. The meta-principle: treat automation code with the same rigor as application code. Code review, testing, staged rollout, monitoring, and rollback capability. An unreviewed script running as root in a CronJob at midnight is one of the highest-risk things in infrastructure."
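The dry-run-by-default and kill-switch guardrails fit in a short CLI skeleton. The kill-switch path and flag names are hypothetical:

```python
import argparse
import pathlib

# Hypothetical kill-switch path: touching this file disables the job
# without any code change or redeploy.
KILL_SWITCH = pathlib.Path("/etc/automation/cleanup.disabled")

def main(argv=None):
    """Dry-run unless --execute is passed; a file-based kill switch
    stops the job in seconds."""
    parser = argparse.ArgumentParser(description="registry cleanup")
    parser.add_argument("--execute", action="store_true",
                        help="actually delete; the default is dry-run")
    args = parser.parse_args(argv)
    if KILL_SWITCH.exists():
        print("kill switch present; doing nothing")
        return 0
    mode = "EXECUTE" if args.execute else "DRY-RUN"
    print(f"running in {mode} mode")
    # ... detection always runs; the destructive step is gated on
    # args.execute plus the safety checks described above ...
    return 0
```

The same pattern works for a CronJob: the ConfigMap-mounted kill-switch file can be created with one kubectl command when the automation is suspect.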

What This Sequence Tested:

Round 1: Structured incident storytelling with concrete impact
Round 2: Failure chain analysis and container image lifecycle knowledge
Round 3: Proactive failure detection and defensive thinking
Round 4: Balanced judgment on automation risk vs value
Round 5: Production automation guardrail design
