Incident Replay: Service Has No Endpoints¶
Setup¶
- System context: Kubernetes production cluster. A Service load balancer is returning 503 errors. Backend pods are running but the Service has no endpoints.
- Time: Tuesday 14:15 UTC
- Your role: On-call SRE / platform engineer
Round 1: Alert Fires¶
[Pressure cue: "Customer-facing API is returning 503 for all requests. Revenue impact: $5K/minute. Auto-escalation triggered."]
What you see:
kubectl get endpoints api-service shows <none> — no endpoints. But kubectl get pods -l app=api-service shows 3 pods in Running state with Ready 1/1.
Choose your action:

- A) Restart the pods to force endpoint registration
- B) Compare the Service selector with the pod labels
- C) Check kube-proxy logs for endpoint sync issues
- D) Delete and recreate the Service
If you chose B (recommended):¶
[Result: kubectl get svc api-service -o jsonpath='{.spec.selector}' shows {"app":"api-svc"}. The pod labels show app: api-service. The selector does not match the labels: api-svc vs api-service. A recent Helm chart update changed the selector. Proceed to Round 2.]
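Why this mismatch produces zero endpoints: a pod backs a Service only when every key/value pair in the Service selector is present in the pod's labels. A minimal sketch of that matching rule (illustrative, not the actual endpoint controller code):

```python
def selector_matches(selector: dict, labels: dict) -> bool:
    # A pod is selected only if every selector key/value pair
    # appears verbatim in the pod's labels.
    return all(labels.get(k) == v for k, v in selector.items())

# The incident: shortened selector vs. unchanged pod label
print(selector_matches({"app": "api-svc"}, {"app": "api-service"}))      # False -> no endpoints
print(selector_matches({"app": "api-service"}, {"app": "api-service"}))  # True
```

Because matching is exact string equality per key, a one-character rename like api-svc silently selects nothing, and the endpoint list goes to <none> while the pods stay Running and Ready.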
If you chose A:¶
[Result: Pods restart but still carry the label app: api-service. The Service selector still looks for app: api-svc. No change.]
If you chose C:¶
[Result: kube-proxy is functioning correctly — it is faithfully proxying to the empty endpoint list. The problem is upstream of kube-proxy.]
If you chose D:¶
[Result: Recreating the Service from the same spec reproduces the same wrong selector. No improvement.]
Round 2: First Triage Data¶
[Pressure cue: "4 minutes of revenue loss. Fix the routing NOW."]
What you see:
The Helm chart for api-service was updated yesterday. The new chart template uses app: api-svc as the selector (shortened name). But the Deployment template still uses app: api-service as the pod label. The selector and label are defined in different template files and nobody noticed the mismatch.
Choose your action:

- A) Patch the Service selector to match the pod labels
- B) Patch the pod labels to match the Service selector
- C) Roll back the Helm release to the previous version
- D) Patch the Helm chart and redeploy
If you chose A (recommended):¶
[Result: kubectl patch svc api-service -p '{"spec":{"selector":{"app":"api-service"}}}' succeeds. NOTE: a Service's selector is a mutable field; it is fields like clusterIP that are immutable and would force a delete-and-recreate. Endpoints populate immediately. Traffic flows. Revenue restored. Proceed to Round 3.]
If you chose B:¶
[Result: Patching pod labels means editing the Deployment's pod template, which triggers a rolling update; worse, a Deployment's spec.selector is immutable, so changing the template labels out from under it is rejected by the API and would force deleting and recreating the Deployment. Much slower fix.]
If you chose C:¶
[Result: Rollback works but also rolls back the legitimate changes from the release. Clean fix is better.]
If you chose D:¶
[Result: Correct long-term fix, but helm upgrade takes time to render, validate, and apply. Too slow during an outage.]
Round 3: Root Cause Identification¶
[Pressure cue: "Traffic restored. Fix the chart and prevent recurrence."]
What you see:
Root cause: The Helm chart refactor split label definitions into a helper template (_helpers.tpl). The Service template used the new helper (api-svc) but the Deployment template used the old hardcoded label (api-service). The mismatch was not caught because there was no test for selector-label consistency.
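The fix pattern is to define the selector labels exactly once and reference that single definition from both the Service and the Deployment pod template, which is what a shared helper in _helpers.tpl is for. A sketch of the invariant in plain Python (hypothetical names, not the chart's actual helper):

```python
# Single source of truth for selector labels, analogous to a shared
# Helm helper in _helpers.tpl that both templates include.
SELECTOR_LABELS = {"app": "api-service"}

service = {
    "kind": "Service",
    "spec": {"selector": dict(SELECTOR_LABELS)},
}
deployment = {
    "kind": "Deployment",
    "spec": {"template": {"metadata": {"labels": dict(SELECTOR_LABELS)}}},
}

# The invariant the refactor broke: selector == pod template labels.
assert service["spec"]["selector"] == deployment["spec"]["template"]["metadata"]["labels"]
```

With one definition there is no second copy to drift; the original incident happened precisely because the Service template and the Deployment template each had their own copy.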
Choose your action:

- A) Fix the Helm chart helper to use consistent labels
- B) Add a Helm test that verifies Service endpoints are populated after deploy
- C) Add a CI check that validates selector-label consistency in Helm templates
- D) All of the above
If you chose D (recommended):¶
[Result: Chart fixed with consistent label helper usage. Helm test added. CI validation prevents future mismatches. Proceed to Round 4.]
If you chose A:¶
[Result: Fixes this instance but does not prevent future mismatches.]
If you chose B:¶
[Result: Post-deploy test catches the issue but after the damage is done.]
If you chose C:¶
[Result: CI check is the strongest prevention but you also need the chart fix and post-deploy verification.]
Round 4: Remediation¶
[Pressure cue: "Service healthy. Harden the deployment pipeline."]
Actions:
1. Verify endpoints are populated: kubectl get endpoints api-service
2. Verify all requests are routing correctly (test from an external client)
3. Deploy the fixed Helm chart via normal pipeline
4. Add helm test for endpoint health to the chart
5. Add selector-label consistency validation to CI
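The CI check in step 5 can be a small script run against the rendered chart (helm template piped through a YAML parser). A hedged sketch, assuming the manifests have already been parsed into dicts:

```python
def find_selector_mismatches(manifests: list) -> list:
    """Return names of Services whose selector matches no workload's
    pod-template labels. `manifests` is a list of rendered chart
    documents already parsed into dicts."""
    workloads = [m for m in manifests
                 if m.get("kind") in ("Deployment", "StatefulSet", "DaemonSet")]
    bad = []
    for m in manifests:
        if m.get("kind") != "Service":
            continue
        selector = m.get("spec", {}).get("selector") or {}
        if not selector:  # selector-less Services (e.g. ExternalName) are legitimate
            continue
        matched = any(
            all(w["spec"]["template"]["metadata"].get("labels", {}).get(k) == v
                for k, v in selector.items())
            for w in workloads)
        if not matched:
            bad.append(m["metadata"]["name"])
    return bad

# The incident's mismatch, reproduced in miniature:
docs = [
    {"kind": "Service", "metadata": {"name": "api-service"},
     "spec": {"selector": {"app": "api-svc"}}},
    {"kind": "Deployment", "metadata": {"name": "api-service"},
     "spec": {"template": {"metadata": {"labels": {"app": "api-service"}}}}},
]
print(find_selector_mismatches(docs))  # ['api-service']
```

Wired into CI after helm template, this check would have failed the chart refactor before it ever reached the cluster.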
Damage Report¶
- Total downtime: 6 minutes of complete API outage
- Blast radius: All customer-facing API traffic; estimated $30K revenue impact
- Optimal resolution time: 3 minutes (compare selector vs labels -> patch service)
- If every wrong choice was made: 30+ minutes of outage with cascading downstream failures
Cross-References¶
- Primer: Kubernetes Services & Ingress
- Primer: Kubernetes Ops
- Primer: Helm
- Footguns: Kubernetes Services & Ingress