Solution

Triage

  1. Check the Service and its selector:
    kubectl describe svc user-service -n prod
    
  2. Check the endpoints:
    kubectl get endpoints user-service -n prod
    
  3. List pods with their labels:
    kubectl get pods -n prod -l app=user-service --show-labels
    kubectl get pods -n prod --show-labels | grep user
    
  4. Compare the selector label key/value with the actual pod labels.
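Step 4 can be mechanized. The sketch below is illustrative (not part of the incident tooling) and shows how Kubernetes equality-based selector matching decides a match: every selector key/value pair must equal the pod's label exactly.

```python
def diff_selector(selector: dict, pod_labels: dict) -> list:
    """Report why an equality-based selector does not match a pod's labels.
    Returns an empty list when the selector matches."""
    problems = []
    for key, want in selector.items():
        got = pod_labels.get(key)
        if got is None:
            problems.append(f"pod is missing label key {key!r}")
        elif got != want:
            problems.append(f"label {key!r}: selector wants {want!r}, pod has {got!r}")
    return problems

# The mismatch from this incident:
print(diff_selector({"app": "user-svc"}, {"app": "user-service"}))
# → ["label 'app': selector wants 'user-svc', pod has 'user-service'"]
```

Feeding in the selector from `kubectl describe svc` and the labels from `--show-labels` makes the mismatch explicit instead of relying on eyeballing.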

Root Cause

The Service selector uses app: user-svc but the pods have the label app: user-service. This mismatch occurred when the junior engineer recreated the Service manifest from memory and used a shortened label value. Since no pods match the selector, the Endpoints object has empty subsets, and the Service has nothing to route traffic to.

The Service itself is valid (it has a ClusterIP and a configured port), but with no backends there is nothing to route to: kube-proxy in iptables mode installs a REJECT rule for services with no endpoints, so connections to the ClusterIP are refused immediately.
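The failure mode can be sketched as the endpoints controller's selection step. Pod names below are hypothetical; the matching rule is Kubernetes equality-based selection.

```python
def select_endpoints(selector, pods):
    """Simplified endpoints controller: keep pods whose labels satisfy
    every key/value pair in the selector."""
    return [name for name, labels in pods
            if all(labels.get(k) == v for k, v in selector.items())]

pods = [("user-service-abc", {"app": "user-service"}),
        ("user-service-def", {"app": "user-service"})]

print(select_endpoints({"app": "user-svc"}, pods))      # → [] (empty subsets)
print(select_endpoints({"app": "user-service"}, pods))  # both pods selected
```

With the shortened selector value, the selected set is empty, which is exactly the empty `subsets` seen on the Endpoints object.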

Fix

  1. Update the Service selector to match the pod labels:

    kubectl patch svc user-service -n prod -p '{"spec":{"selector":{"app":"user-service"}}}'
    
    Or edit the manifest and apply:
    spec:
      selector:
        app: user-service  # was: user-svc
    

  2. Verify endpoints are now populated:

    kubectl get endpoints user-service -n prod
    
    The output should now list pod IP:port pairs in the ENDPOINTS column.

  3. Test connectivity:

    kubectl run test --rm -it --restart=Never --image=busybox -- wget -qO- http://user-service.prod.svc.cluster.local:8080/health
    

Rollback / Safety

  • Changing a Service selector is non-disruptive; it takes effect immediately.
  • If the wrong pods are selected, traffic could be routed to the wrong backend. Verify pod labels are unique per service.
  • Update the source manifest (Helm chart, Kustomize, etc.) to prevent the fix from being overwritten on next deploy.
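The "wrong pods selected" risk above can be checked mechanically. This is an illustrative sketch with hypothetical service and pod names: given each Service's selector, flag any pod matched by more than one Service.

```python
def overlapping_selections(services, pods):
    """Return pods that are selected by more than one Service.
    services: {service_name: selector dict}; pods: {pod_name: labels dict}."""
    owners = {}
    for svc, selector in services.items():
        for name, labels in pods.items():
            if all(labels.get(k) == v for k, v in selector.items()):
                owners.setdefault(name, []).append(svc)
    return {pod: svcs for pod, svcs in owners.items() if len(svcs) > 1}

services = {"user-service": {"app": "user-service"},
            "user-admin":  {"app": "user-service"}}   # overly broad selector
pods = {"user-service-abc": {"app": "user-service"}}

print(overlapping_selections(services, pods))
# → {'user-service-abc': ['user-service', 'user-admin']}
```

An empty result means each pod is owned by at most one Service, which is what you want before and after the selector change.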

Common Traps

  • Checking only the Service, not the Endpoints. Always check kubectl get endpoints (or kubectl get endpointslices on clusters where EndpointSlices have replaced Endpoints) first when debugging service connectivity.
  • Assuming pods are the problem when they show Running/Ready. If pods are healthy, the issue is almost always in the Service selector or readiness probes.
  • Label key typos vs. value typos. Both cause mismatches, and labels are case-sensitive: a selector of App: user-service (capital A) does not match a pod label of app: user-service.
  • Multiple labels in the selector. Selector matching is AND logic: every label key-value pair must match, so a single mismatch on any one of them means no pods are selected.
  • Not checking targetPort. Even with correct selectors, if targetPort does not match the container's listening port, connections will be refused despite endpoints being populated.
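The AND-logic and case-sensitivity traps above follow directly from the equality-based matching rule; a minimal sketch:

```python
def matches(selector, labels):
    # Equality-based selector: every key/value pair must match exactly (AND logic)
    return all(labels.get(k) == v for k, v in selector.items())

pod = {"app": "user-service", "tier": "backend"}

print(matches({"app": "user-service", "tier": "backend"}, pod))   # → True
print(matches({"app": "user-service", "tier": "frontend"}, pod))  # → False (one mismatch fails all)
print(matches({"App": "user-service"}, pod))                      # → False (keys are case-sensitive)
```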