
Interview Gauntlet: API Returning 503s

Category: Incident Response | Difficulty: L2-L3 | Duration: 15-20 minutes | Domains: Networking, DNS


Round 1: The Opening

Interviewer: "Your API is returning 503 errors. Walk me through your first 5 minutes of investigation."

Strong Answer:

"First, I'd scope the impact. Is it 100% of requests or a subset? I'd check the load balancer metrics or the ingress controller dashboard for error rates by status code. Then I'd check if the backend pods are running — kubectl get pods -n production — and whether the service has healthy endpoints: kubectl get endpoints -n production. If pods are running and endpoints exist, I'd look at pod logs and the ingress controller logs to see where the 503 is being generated. A 503 from the ingress controller means it can't reach the backend; a 503 from the application means the app is explicitly returning it (maybe a health check dependency is failing). I'd also check recent deployments — kubectl rollout history — because the most common cause of sudden 503s is a bad deploy. If nothing obvious shows up, I'd widen to infrastructure: is the node healthy, is there a network policy blocking traffic, did someone change a security group?"
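The branching logic in the answer above can be sketched as a small decision helper. Everything here is illustrative: the function name, the input categories, and the suggested next steps are assumptions layered on the answer's reasoning, not a real tool.

```python
# Hypothetical sketch of the Round 1 triage decision tree: given what the
# first few checks showed, suggest where to look next.

def next_step(pods_running: bool, endpoints_exist: bool,
              source_of_503: str) -> str:
    """source_of_503 is 'ingress' or 'app', determined from the logs."""
    if not pods_running:
        return "investigate pod crashes: kubectl describe pod / kubectl logs --previous"
    if not endpoints_exist:
        return "no healthy endpoints: check readiness probes and the service selector"
    if source_of_503 == "ingress":
        return "ingress cannot reach the backend: check network policy and node health"
    if source_of_503 == "app":
        return "app is returning 503 itself: check its health-check dependencies"
    return "widen scope: recent deploys (kubectl rollout history), infra changes"

print(next_step(True, True, "ingress"))
```

The value of writing it down this way is that each branch names the evidence that justifies it, which is exactly what the interviewer is probing for.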

Common Weak Answers:

  • "I'd restart the pods." — Restarting without understanding the cause risks making it worse and destroys evidence.
  • "I'd check the application logs." — This skips the scoping step. You need to know the blast radius before diving into logs.
  • "I'd roll back the last deployment." — Reasonable instinct but premature without confirming the deployment caused it. Rolling back when the issue is DNS or networking wastes time and creates false confidence when the rollback doesn't fix it.

Round 2: The Probe

Interviewer: "You check the metrics and see that it's only 5% of users getting 503s. The other 95% are fine. What does that tell you, and where do you look next?"

What the interviewer is testing: The ability to reason about partial failures — which are harder to diagnose than total outages because they imply a routing or subset problem.

Strong Answer:

"5% failure rate with 95% success tells me the infrastructure is mostly working — this isn't a total service outage. The 5% could be: specific users hitting a specific backend pod that's unhealthy (but still in the endpoint list), traffic to a specific AZ or node that has an issue, a canary deployment serving bad traffic, or a client-side pattern — like mobile users on a specific carrier or region. I'd correlate the 503s with available dimensions: source IP or region, target pod, the load balancer backend that served the request, and the user agent. Most load balancers and ingress controllers log the upstream address in their access logs. If all 503s map to the same upstream pod, that pod has an application issue. If they map to the same node, the node has a networking issue. If they come from a specific source region, it's likely a routing or DNS issue."
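The correlation step described above can be sketched in a few lines: group the 503s by each available dimension and see which one they concentrate on. The records and field names here are synthetic, standing in for parsed access-log entries.

```python
from collections import Counter

# Synthetic parsed access-log records; in practice these come from the
# load balancer or ingress controller access logs.
records = [
    {"status": 200, "upstream": "10.0.1.5", "region": "us"},
    {"status": 503, "upstream": "10.0.1.9", "region": "eu"},
    {"status": 200, "upstream": "10.0.1.5", "region": "eu"},
    {"status": 503, "upstream": "10.0.1.9", "region": "eu"},
]

def concentration(records, dimension):
    """Return (value, fraction) for the most common value of `dimension`
    among the 503 responses."""
    errors = [r[dimension] for r in records if r["status"] == 503]
    if not errors:
        return None, 0.0
    value, count = Counter(errors).most_common(1)[0]
    return value, count / len(errors)

for dim in ("upstream", "region"):
    value, frac = concentration(records, dim)
    print(f"{dim}: {frac:.0%} of 503s share value {value}")
```

A dimension where the fraction is at or near 100% is the one worth chasing; a dimension where the errors spread evenly is probably not the fault line.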

Trap Alert:

If the candidate bluffs here: The interviewer will ask "How would you extract the upstream address from nginx ingress logs?" The answer is the $upstream_addr variable: ingress-nginx's default access-log format includes it, and it can also be added to a custom JSON log format via the log-format-upstream setting. If you haven't looked at your ingress controller's logging configuration, saying "I'd need to check whether our log format includes the upstream address" is better than guessing a field name.
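Once you know the field exists, the extraction itself is mechanical. This sketch assumes the cluster has been configured to emit JSON access logs with an "upstream_addr" field populated from $upstream_addr; both the field name and the sample lines are assumptions about such a custom format.

```python
import json
from collections import Counter

# Synthetic JSON access-log lines; assumes a custom ingress-nginx log
# format with "status" and "upstream_addr" ($upstream_addr) fields.
log_lines = [
    '{"status": 503, "upstream_addr": "10.0.1.9:8080"}',
    '{"status": 200, "upstream_addr": "10.0.1.5:8080"}',
    '{"status": 503, "upstream_addr": "10.0.1.9:8080"}',
]

# Count which upstream served the 503s.
bad_upstreams = Counter(
    entry["upstream_addr"]
    for entry in map(json.loads, log_lines)
    if entry["status"] == 503
)
print(bad_upstreams.most_common())
```

If one upstream dominates the counter, you have your suspect pod; if the 503s spread across all upstreams, the problem is upstream of the pods.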


Round 3: The Constraint

Interviewer: "You've correlated it: all 503s come from users in a specific region — Europe. The European load balancer looks healthy. The pods behind it are passing health checks. What's next?"

Strong Answer:

"If the load balancer is healthy and pods are passing health checks but European users get 503s, I need to look between the user and the load balancer. This points to DNS or an intermediate network layer. I'd check: what DNS record are European users resolving? If we use GeoDNS or latency-based routing (Route 53), European users might be resolving to a different endpoint than US users. I'd do a DNS lookup from a European vantage point — using dig @8.8.8.8 api.example.com from a European server or a tool like dnschecker.org to see what the European resolvers return. If the DNS record points to the right load balancer, I'd check the certificate — maybe a cert renewal failed for the European endpoint. I'd also check for a CDN or proxy layer between users and the LB — CloudFront, Cloudflare, or an on-prem reverse proxy. If there's a CDN, the 503 might be the CDN's error page when it can't reach the origin, which would mean the CDN's health check configuration or its connection to our origin is the issue."
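The "what did each region resolve?" check reduces to comparing observed answers against the expected record. In practice the observed IPs come from running dig at regional vantage points; here they are hard-coded synthetic values, and the expected per-region mapping is hypothetical.

```python
# Hypothetical expected DNS answers per region vs. what regional
# resolvers actually returned (synthetic values for illustration).
expected = {"us": "203.0.113.10", "eu": "203.0.113.20"}
observed = {"us": "203.0.113.10", "eu": "198.51.100.7"}  # stale EU record

mismatches = {
    region: (ip, expected[region])
    for region, ip in observed.items()
    if ip != expected[region]
}

for region, (got, want) in mismatches.items():
    print(f"{region}: resolvers returning {got}, expected {want} "
          f"-> stale or misrouted DNS record")
```

A mismatch confined to one region is strong evidence for the GeoDNS/stale-record hypothesis; matching answers everywhere would push the investigation toward the CDN or certificate layer instead.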

The Senior Signal:

What separates a senior answer: Thinking about DNS as a routing layer, not just a name resolution service. GeoDNS and latency-based routing mean different users resolve to different infrastructure. Also: checking the CDN or proxy layer — in many architectures, the user never talks directly to the load balancer, and the 503 might originate from an intermediate layer that the team forgot about.


Round 4: The Curveball

Interviewer: "You discover it's a DNS TTL issue. The European DNS record was updated 2 hours ago to point to a new load balancer, but some ISPs are caching the old record that points to a decommissioned IP. The TTL was set to 86400 seconds — 24 hours. How do you fix this for the affected users right now, and how do you prevent it in the future?"

Strong Answer:

"For immediate mitigation: I can't force external DNS resolvers to flush their cache, but I have options. First, if the old IP is still in our control, I can bring the old load balancer back up (or point that IP to the new load balancer) so requests to the stale DNS record still reach working infrastructure. This is the fastest fix — takes minutes and covers all affected users. If the old IP has been released, that's worse — we can't reclaim it from the cloud provider easily. In that case, the only short-term option is to wait for the TTL to expire while communicating to affected users. Resolvers that honor the 86400-second TTL will cache the stale record for up to 24 hours, though many refresh sooner than the TTL requires. I'd monitor the error rate and expect it to decay over the next few hours. For prevention: DNS TTLs for production endpoints should be 300 seconds (5 minutes) or less. Before any DNS migration, lower the TTL to 60 seconds at least 48 hours before the change (so the old high TTL has flushed from caches), make the change, verify, then optionally raise the TTL back to 300. And critically: never decommission the old endpoint until you've confirmed DNS propagation is complete and the error rate has returned to baseline. The old infrastructure should be kept running for at least 2x the original TTL as a safety net."
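The timeline math in that answer is worth making concrete. This applies the two rules of thumb stated above — lower the TTL at least 48 hours before the cutover, and keep the old endpoint alive for at least 2x the original TTL — to the incident's 86400-second TTL; the cutover timestamp is hypothetical.

```python
from datetime import datetime, timedelta

old_ttl = timedelta(seconds=86400)       # the incident's 24-hour TTL
cutover = datetime(2024, 6, 1, 12, 0)    # hypothetical migration time

# Drop the TTL to 60s no later than this, so caches of the old
# 24-hour TTL have fully expired before the record changes.
lower_ttl_by = cutover - timedelta(hours=48)

# Safety net: keep the old endpoint serving for 2x the original TTL.
keep_old_endpoint_until = cutover + 2 * old_ttl

print("lower TTL by:         ", lower_ttl_by)
print("keep old endpoint to: ", keep_old_endpoint_until)
```

Written this way, the failure in the scenario is obvious: the old endpoint was decommissioned inside the keep-alive window, while stale 86400-second cache entries were still valid.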

Trap Question Variant:

Interviewer's variant: "Can't you just flush the DNS cache?" The right answer is "I can't force ISP resolvers to flush." Candidates who suggest flushing the DNS cache are thinking about their local machine, not the internet. You have no control over ISP resolver behavior; the only reliable fix is to keep the old endpoint alive. Candidates who understand this have dealt with real DNS migrations.


Round 5: The Synthesis

Interviewer: "This incident started as a simple '503 errors' alert and ended up being a DNS migration gone wrong. What does this tell you about how you'd structure incident response for your team?"

Strong Answer:

"This incident illustrates two things. First, the importance of impact scoping before root cause analysis. We could have spent hours looking at application code and pod health when the issue was in the DNS layer — a completely different part of the stack. Starting with 'who is affected and in what pattern?' immediately narrowed the search space from 'everything' to 'something region-specific.' Second, it shows that incidents often cross team boundaries. DNS is probably managed by a different team than the application pods. In an incident, you need to be able to pull in the right people quickly, which means having clear ownership boundaries documented and an escalation path that doesn't depend on one person knowing the whole stack. For structuring incident response: I'd want a triage checklist that starts with impact scoping (blast radius, user-facing vs internal, geographic or demographic patterns), then moves to recent changes (deploys, DNS changes, infra changes), and only then dives into technical investigation. I'd also want a change log that tracks not just application deploys but DNS changes, CDN config changes, and infrastructure modifications — because those are often the changes that cause the hardest-to-diagnose incidents."

What This Sequence Tested:

Round   Skill Tested
1       Structured triage methodology for service errors
2       Reasoning about partial failures and correlation analysis
3       Network and DNS layer debugging when the obvious is healthy
4       DNS operational knowledge and migration safety practices
5       Incident response process design and cross-team coordination
