Interview Gauntlet: Deploy Succeeded but Old Version Visible
Category: Incident Response · Difficulty: L2-L3 · Duration: 15-20 minutes · Domains: CDN, Service Mesh
Round 1: The Opening
Interviewer: "A deployment completes successfully — Argo CD shows synced, all pods are running the new image, health checks pass. But users report they're still seeing the old version. What's going on?"
Strong Answer:
"If the deployment is truly healthy and serving the new version but users see the old version, there's a layer between the user and the pods that's serving stale content. The usual suspects: a CDN cache (CloudFront, Cloudflare) serving a cached response, a browser cache if the cache-control headers are aggressive, a reverse proxy or API gateway with its own cache, or — in a service mesh — a traffic routing rule that's still sending some traffic to the old version. I'd verify first: curl -v https://api.example.com/version from my machine to see what version the CDN returns, then kubectl exec -it <new-pod> -- curl localhost:8080/version to confirm the pod is actually serving the new version. If the external curl shows the old version but the pod-level curl shows the new version, there's a caching layer in between. I'd check the response headers from the external request: Cache-Control, Age, X-Cache (CloudFront returns Hit from cloudfront or Miss from cloudfront), and CF-Cache-Status for Cloudflare."
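The two-sided check described above, as a shell sketch. The hostname, namespace, labels, and port are illustrative assumptions carried over from the answer, not fixed conventions:

```shell
# 1. External view: what version does the CDN-fronted endpoint serve,
#    and what do its cache headers say? (hostname is illustrative)
curl -sv https://api.example.com/version 2>&1 \
  | grep -iE 'x-cache|age:|cache-control|cf-cache-status'

# 2. Pod view: what is the new pod actually serving?
#    (namespace "production", labels, and port 8080 are assumptions)
POD=$(kubectl get pods -n production -l app=api,version=v2 \
  -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n production "$POD" -- curl -s localhost:8080/version
```

If the external curl reports the old version while the pod-level curl reports the new one, the staleness lives in a layer between them.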
Common Weak Answers:
- "The deployment must have failed." — The premise is that the deployment succeeded. Contradicting the evidence without investigation is a red flag.
- "Check if the image tag is right." — Reasonable but doesn't explain why a correctly tagged pod would show old content to users.
- "Users need to clear their browser cache." — Possible for a frontend deployment, but for an API, this suggests the candidate doesn't understand the cache layers in the infrastructure.
Round 2: The Probe
Interviewer: "You confirm it's a CDN cache issue — CloudFront is serving stale content with an Age: 14400 header (4 hours). But here's the weird part: some API paths return the new version and some return the old version. The CDN cache invalidation you ran 30 minutes ago didn't fix it. Why?"
What the interviewer is testing: Understanding of CDN cache behavior, including edge locations, cache keys, and why invalidations can appear to not work.
Strong Answer:
"A few possibilities for partial cache staleness after invalidation. First, CloudFront has hundreds of edge locations globally. An invalidation takes up to 15 minutes to propagate to all edges, and some edges might not have processed it yet. But at 30 minutes, it should be done. Second, the cache key: CloudFront caches by the full URL including query parameters, and possibly by headers if the behavior is configured that way. If the invalidation pattern was /api/version but some requests include a query parameter like /api/version?format=json, those are separate cache keys. I'd check the invalidation pattern — it needs to be /* for a full flush or a specific wildcard like /api/*. Third, the paths that are returning the new version might have shorter TTLs or Cache-Control: no-cache headers, while the stale paths have longer TTLs. I'd compare the Cache-Control headers between a working path and a stale path. Fourth, there might be multiple CloudFront distributions or behaviors — a common setup has /api/* routed to one origin group and /static/* to another. If the deployment only updated the origin for one behavior but not the other, paths routed through the stale behavior will still serve old content."
Trap Alert:
If the candidate bluffs here: The interviewer will ask "What's the maximum time for a CloudFront invalidation to complete?" AWS documents that invalidations typically complete within 15 minutes, though in practice it's usually faster. The key gotcha is that the first 1,000 invalidation paths per month are free, but beyond that CloudFront charges per path — so teams sometimes invalidate too narrowly to save costs, missing some cache keys. If you don't remember the exact timing, saying "I know it's minutes, not seconds, but I'd check the CloudFront docs for the current SLA" is fine.
Round 3: The Constraint
Interviewer: "The CDN issue is resolved, but now you have a new problem. Some users are being routed to the old version by a service mesh routing rule. You're running Istio with a VirtualService that has canary weights — 90% to v2 (new) and 10% to v1 (old). But v1 should have been decommissioned. Why is it still receiving traffic?"
Strong Answer:
"The canary rollout wasn't completed — someone (or the automation) set the weights to 90/10 but never promoted to 100/0. I'd check the VirtualService: kubectl get virtualservice api-service -n production -o yaml and look at the route weights. If the weights show v1 still receiving 10%, the fix is to update the VirtualService to route 100% to v2 and remove the v1 subset. But before doing that, I'd check if the v1 deployment is even still running: kubectl get deploy -n production -l version=v1. If the v1 pods are gone but the routing rule still references them, that 10% of traffic is hitting a dead endpoint — which would manifest as 503 errors for 10% of users, not old version content. If v1 pods are still running, they're serving stale content. The root cause is a process gap: the canary promotion wasn't automated or the automation failed. In Argo Rollouts or Flagger, promotion is typically automated based on metric analysis. If we're managing VirtualService weights manually, we need a checklist or automation that sets 100/0 after the canary bake time passes."
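Inspecting and promoting the weights might look like the following. The resource names are the ones used in the answer, and the patch assumes the v2 route is the first entry under http[0].route and v1 the second — check the yaml output before patching:

```shell
# What weights is the VirtualService currently applying, per subset?
kubectl get virtualservice api-service -n production \
  -o jsonpath='{.spec.http[0].route[*].destination.subset}{" -> "}{.spec.http[0].route[*].weight}{"\n"}'

# Are v1 pods still running behind that 10%?
kubectl get deploy -n production -l version=v1

# Promote: 100% to v2, drop the v1 route entirely.
# (Assumes route[0] is v2 and route[1] is v1 -- verify first.)
kubectl patch virtualservice api-service -n production --type=json -p='[
  {"op": "replace", "path": "/spec/http/0/route/0/weight", "value": 100},
  {"op": "remove",  "path": "/spec/http/0/route/1"}
]'
```

Removing the v1 subset from the DestinationRule as well keeps the config from silently referencing pods that no longer exist.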
The Senior Signal:
What separates a senior answer: Distinguishing between two failure modes: (1) v1 pods still running and serving stale traffic, and (2) v1 pods gone but routing rule still referencing them, causing 503s. These produce very different user symptoms and require different fixes. Also: recognizing this as a process gap, not just a config error — the canary didn't get promoted, which means the rollout process has a hole.
Round 4: The Curveball
Interviewer: "Turns out the canary weights were supposed to be managed by Flagger, which auto-promotes based on error rate and latency. Flagger was stuck in a 'Progressing' state and never promoted the canary. Why would Flagger get stuck?"
Strong Answer:
"Flagger advances canary weights based on metric queries — typically Prometheus queries for error rate and request duration. If Flagger is stuck in Progressing, it's usually one of these: First, Flagger can't reach Prometheus or the metric query is returning no data. This happens when the Prometheus service endpoint changes, when the metric names change (a common issue after upgrading Istio, which renames metrics), or when the canary hasn't received enough traffic to produce statistically significant metrics. Second, the canary metrics are failing the analysis — error rate is above the threshold or latency is too high. Flagger will keep the canary at its current weight and not advance. I'd check kubectl describe canary api-service -n production for the status conditions and the last analysis result. Third, Flagger itself might be crashlooping or resource-starved — kubectl get pods -n flagger-system and check its logs. I'd check kubectl logs -n flagger-system deploy/flagger --tail=100 for errors. In my experience, the most common cause is the metric query returning empty results because the canary is receiving so little traffic (at 10% weight) that the Prometheus query window doesn't have enough data points. The fix is either lowering the analysis threshold, increasing the canary weight increment, or using a longer metric evaluation window."
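The checks above, in command form. The canary name, namespaces, Prometheus address, and the example query are assumptions from the answer; istio_requests_total is Istio's standard request metric:

```shell
# Why is the Canary stuck? Flagger records each analysis verdict
# as an event on the Canary resource.
kubectl describe canary api-service -n production | sed -n '/Events:/,$p'

# Is the Flagger controller itself healthy?
kubectl get pods -n flagger-system
kubectl logs -n flagger-system deploy/flagger --tail=100 | grep -iE 'error|halt'

# Does the metric query Flagger depends on actually return data?
# (Prometheus address and the exact query are illustrative.)
curl -sG 'http://prometheus.monitoring:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(istio_requests_total{destination_workload="api-service-canary"}[1m]))'
```

An empty result from the last query points at the "not enough canary traffic in the evaluation window" cause described above.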
Trap Question Variant:
The right answer is "I haven't debugged Flagger specifically" if that's true. Flagger is a specific tool and not everyone has used it. The strong fallback is: "I haven't operated Flagger in production, but the general pattern of progressive delivery controllers is: they poll a metrics API, evaluate canary health, and advance weights. When they stall, it's usually a metrics pipeline issue or the canary is failing its health criteria. I'd check the controller's logs and the metric query results." This shows the right mental model even without tool-specific experience.
Round 5: The Synthesis
Interviewer: "This incident involved three layers: CDN caching, service mesh routing, and progressive delivery automation. Each layer independently could have caused the user to see stale content. How do you build a deployment verification process that catches this?"
Strong Answer:
"You need end-to-end deployment verification, not just 'did the pods come up.' I'd implement a deployment verification pipeline that runs after every deployment and checks from the user's perspective inward. Step one: external synthetic check — hit the production URL from outside the cluster (a Lambda or an external monitoring service) and verify the version header or a version endpoint. This catches CDN caching issues. Step two: mesh-level verification — query the Istio telemetry or the VirtualService directly to confirm 100% of traffic is routed to the new version with no residual canary weights. Step three: pod-level verification — confirm all running pods are on the expected image digest using kubectl get pods -o jsonpath and comparing against the expected SHA. Step four: automated rollback trigger — if any verification step fails after a timeout (say 15 minutes post-deploy), automatically open an incident and optionally trigger a rollback. The theme here is: deployment is not done when Argo CD says 'Synced.' Deployment is done when users are confirmed to be receiving the new version through every layer of the stack. I'd implement this as a post-sync hook in Argo CD or a Flagger webhook that runs after promotion."
What This Sequence Tested:
| Round | Skill Tested |
|---|---|
| 1 | Layer-by-layer reasoning about caching and routing |
| 2 | CDN cache mechanics and invalidation debugging |
| 3 | Service mesh traffic management and canary deployment |
| 4 | Progressive delivery controller debugging and intellectual honesty |
| 5 | End-to-end deployment verification design |