
Interview Gauntlet: Handling a Production Incident

Category: Behavioral + Technical Hybrid
Difficulty: L2-L3
Duration: 15-20 minutes
Domains: Incident Response, Communication


Round 1: The Opening

Interviewer: "Tell me about a time you handled a production incident. Walk me through what happened and what you did."

Strong Answer:

"We had a payment processing outage that lasted about 90 minutes during our busiest period — a Friday afternoon. Our payment service started returning 500 errors on checkout. I was on-call and got paged at 2:15 PM. Within the first 5 minutes, I confirmed the issue was real (not a monitoring glitch), checked the blast radius (100% of payment attempts failing), and posted an initial update to the incident Slack channel: 'Payment service returning 500s, investigating. Customer-facing impact confirmed. No ETA yet.' I then pulled in the payment service owner, who wasn't on-call but was the domain expert. Together, we traced it to a database migration that had been deployed 20 minutes before the outage started. The migration added a new column with a NOT NULL constraint and no default value, which broke all INSERT queries. We had two options: roll back the migration or add a default value. We chose to add a default value with an ALTER TABLE because rolling back the migration risked losing the schema tracking state. The fix was deployed in about 40 minutes from initial page."

Common Weak Answers:

  • "I fixed a bug once that was causing errors." — Too vague. No timeline, no process, no team dynamics. The interviewer is looking for how you handled it, not just that you fixed it.
  • Listing only technical steps without communication — Incident handling is 50% communication. Strong candidates talk about stakeholder updates, team coordination, and timeline management.
  • "I immediately knew what the problem was." — Sounds good but is usually a sign the candidate is oversimplifying or fabricating. Real incidents have uncertainty.

Round 2: The Probe

Interviewer: "What was the root cause, and how did a NOT NULL migration get deployed without catching this? Walk me through the technical failure chain."

What the interviewer is testing: Whether the candidate actually understands the technical root cause deeply, or just knows the headline.

Strong Answer:

"The technical failure chain had three links. First, the migration itself: ALTER TABLE payments ADD COLUMN billing_region VARCHAR(50) NOT NULL;. Without a DEFAULT clause, every INSERT must supply a value for the new column, and the running application code wasn't providing one. This is a well-known anti-pattern for online schema changes. Second, the testing gap: our CI pipeline ran the migration against a test database, but the test suite only tested read operations against the payments table. The INSERT test used a factory that explicitly set all fields including billing_region, so it never hit the constraint error. The gap was that no integration test verified the actual production code path for payment creation. Third, the deployment process: the migration ran as a Helm pre-upgrade hook, which executes before the new application code is deployed. So we had a window where the old code (which doesn't set billing_region) was running against the new schema (which requires it). Even if the new code was correct, the deployment ordering created a guaranteed failure window. The right approach would have been an expand-and-contract migration: first add the column with a DEFAULT value, deploy the new code, then optionally add the NOT NULL constraint after all code paths are updated."
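The failure and the expand step can be reproduced in miniature. A minimal sketch using Python's built-in sqlite3 standing in for PostgreSQL (SQLite rejects the unsafe ALTER at migration time, while PostgreSQL rejects it on a populated table or surfaces it on INSERT; table and column names mirror the scenario):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (id INTEGER PRIMARY KEY, amount INTEGER)")
conn.execute("INSERT INTO payments (amount) VALUES (100)")  # existing data

# Unsafe form: NOT NULL with no DEFAULT. SQLite rejects this outright;
# it is the same class of hazard the incident describes.
try:
    conn.execute("ALTER TABLE payments ADD COLUMN billing_region TEXT NOT NULL")
    print("unsafe ALTER succeeded")
except sqlite3.OperationalError as exc:
    print(f"unsafe ALTER rejected: {exc}")

# Expand step: add the column with a DEFAULT so old code paths keep working.
conn.execute("ALTER TABLE payments ADD COLUMN billing_region TEXT DEFAULT 'unknown'")
conn.execute("INSERT INTO payments (amount) VALUES (200)")  # old code: no billing_region
row = conn.execute("SELECT billing_region FROM payments WHERE amount = 200").fetchone()
print(row[0])  # the default backfills what the old code omits
```

The contract step (adding NOT NULL once all writers set the column) would come later, after the new application code is fully deployed.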

Trap Alert:

If the candidate bluffs here: The interviewer will ask "What SQL would you use to add the column safely?" The safe sequence:

  1. ALTER TABLE payments ADD COLUMN billing_region VARCHAR(50) DEFAULT 'unknown';
  2. UPDATE payments SET billing_region = <actual_value> WHERE billing_region = 'unknown';
  3. ALTER TABLE payments ALTER COLUMN billing_region SET NOT NULL; (optional, once all code paths set the column)

Candidates who can write the safe SQL have done migrations in production. Candidates who hesitate on the syntax but describe the pattern correctly are also fine.


Round 3: The Constraint

Interviewer: "During the incident, how did you communicate to stakeholders? The VP of Product was asking for updates every 5 minutes, the support team needed to know what to tell customers, and the other engineers in the incident channel were all suggesting different fixes."

Strong Answer:

"Communication during an incident is as important as the technical fix, and it's the part most engineers underinvest in. Here's what I did. For the VP: I posted structured updates to the incident channel every 15 minutes with three sections — Current Status (what we know), Impact (quantified: 100% of payments failing, estimated revenue impact based on normal volume), and Next Steps (what we're trying right now, ETA if we have one). When the VP messaged me directly, I redirected them to the channel: 'All updates are going to #incident-2024-0823. I'm focused on the fix.' For the support team: I wrote a one-paragraph customer-facing statement within the first 10 minutes — 'We're experiencing an issue with payment processing. Our team is actively working on a fix. No payment data has been lost; orders will be retryable once service is restored. We expect to have an update within 30 minutes.' For the engineers suggesting fixes: I acknowledged each suggestion ('Good idea, we'll try that if the current approach doesn't work') but maintained a single line of investigation to avoid thrashing. Having one person (me) own the technical direction while delegating communication tasks (I asked a colleague to manage the status page) prevented the common incident anti-pattern of five people investigating five different theories simultaneously."

The Senior Signal:

What separates a senior answer: Structured communication with quantified impact (not just 'payments are broken' but 'X% of payments failing, estimated $Y/minute revenue impact'). Redirecting stakeholders to a single channel instead of context-switching between DMs. Delegating communication tasks (status page, customer-facing statement) to free up the incident commander for technical work. Also: the discipline of maintaining a single investigation thread rather than letting the team diverge.


Round 4: The Curveball

Interviewer: "After the incident, you write a postmortem. The CTO reads it and says: 'This was caused by human error — the developer who wrote the migration should have known better.' How do you respond?"

Strong Answer:

"I'd push back on the 'human error' framing, respectfully but clearly. In my postmortem, I'd use blameless language: 'The migration was written without a DEFAULT clause, and our deployment pipeline did not validate schema compatibility.' The focus should be on system failures, not individual blame. Specifically: why did our system allow this to happen? A developer writing an unsafe migration is expected — people make mistakes. The system should have caught it. I'd identify three system improvements. First, a migration linter: add a pre-commit check that flags dangerous patterns (NOT NULL without DEFAULT on an existing table, column drops, type changes). Tools like squawk for PostgreSQL do exactly this. Second, integration test coverage: add a test that runs a payment creation through the actual production code path against the migrated schema. Third, deployment ordering: decouple migrations from application deployment so the new schema and new code are never running simultaneously with incompatible assumptions — the expand-and-contract pattern I mentioned earlier. The conversation with the CTO is: 'We could blame the developer and send a reminder email, which changes nothing. Or we could add guardrails that prevent this class of error for every future developer. Which outcome do you prefer?' Blameless postmortems aren't about being soft on mistakes — they're about fixing systems instead of blaming people, because blaming people doesn't scale."
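The linter idea in the answer above is simple enough to sketch. A minimal, hypothetical pre-commit check in Python; the regex patterns and the lint_migration helper are illustrative only, not squawk's actual rule set:

```python
import re

# Hypothetical rule set: each entry pairs a regex for a dangerous
# migration pattern with a human-readable finding.
DANGEROUS_PATTERNS = [
    (r"ADD\s+COLUMN\s+\w+\s+\S+\s+NOT\s+NULL(?!.*DEFAULT)",
     "NOT NULL column without DEFAULT"),
    (r"\bDROP\s+COLUMN\b", "column drop"),
    (r"ALTER\s+COLUMN\s+\w+\s+TYPE\b", "column type change"),
]

def lint_migration(sql: str) -> list[str]:
    """Return a list of findings for dangerous patterns in a migration."""
    findings = []
    for pattern, message in DANGEROUS_PATTERNS:
        if re.search(pattern, sql, re.IGNORECASE):
            findings.append(message)
    return findings
```

Wired into CI or a pre-commit hook, a non-empty findings list would fail the build; a real tool like squawk parses the SQL properly instead of pattern-matching, but the guardrail principle is the same.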

Trap Question Variant:

The right answer is clearly "blameless postmortem." But the trap is whether the candidate can articulate why to a skeptical executive. Saying "we should be blameless" is correct but insufficient if the CTO isn't convinced. The senior answer ties blamelessness to concrete outcomes (guardrails, prevention) and frames it as a systems engineering decision, not a cultural philosophy.


Round 5: The Synthesis

Interviewer: "What did you change after this incident, and how did you measure whether the changes actually worked?"

Strong Answer:

"We made three changes and measured each. First, the migration linter: we integrated squawk into our CI pipeline. It catches NOT NULL additions without defaults, column drops, and type changes. We measured it by tracking the number of unsafe migration patterns caught in PRs before merge. In the first quarter, it caught 7 potentially dangerous migrations, confirming it was worth the investment. Second, the expand-and-contract migration policy: we documented a migration safety guide and added a PR template checklist item for database changes. Measurement: zero migration-related incidents in the 6 months after implementation, compared to 3 in the 6 months before. Third, the integration test gap: we added end-to-end payment flow tests that run against a migrated database in CI. But the broader change was cultural. I led a 30-minute 'Migration Safety' workshop for the engineering team, walking through this incident as a case study. The workshop wasn't about blame — it was about patterns: here's what's safe, here's what's dangerous, here's how to check. The most impactful measurement was team feedback: in our quarterly engineering survey, 'confidence in deployment safety' went from 3.1/5 to 4.2/5. That told me the changes weren't just technical guardrails — they were actually changing how the team felt about deploying to production."
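The integration-test gap described above can also be sketched. A minimal example in Python with sqlite3 standing in for the real database; the create_payment helper is a hypothetical stand-in for the production code path, and the point is that the test exercises the same INSERT the application issues, against the migrated schema:

```python
import sqlite3

def create_payment(conn, amount):
    # Stand-in for the production code path: note it does NOT set
    # billing_region, which is exactly the path the factory-based tests missed.
    conn.execute("INSERT INTO payments (amount) VALUES (?)", (amount,))

def test_payment_creation_against_migrated_schema():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE payments (id INTEGER PRIMARY KEY, amount INTEGER)")
    # Apply the migration under test (the safe, DEFAULT-carrying form).
    conn.execute("ALTER TABLE payments ADD COLUMN billing_region TEXT DEFAULT 'unknown'")
    create_payment(conn, 100)  # raises if the schema change broke the insert path
    assert conn.execute("SELECT billing_region FROM payments").fetchone()[0] == "unknown"

test_payment_creation_against_migrated_schema()
```

Unlike a factory that dutifully fills every column, this test fails the moment a migration and the real write path disagree.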

What This Sequence Tested:

Round  Skill Tested
  1    Structured incident response storytelling
  2    Technical depth — understanding the failure chain
  3    Stakeholder communication under pressure
  4    Blameless postmortem advocacy and executive communication
  5    Measurable improvement implementation and organizational impact
