
Thinking Out Loud: Kubernetes Services & Ingress

A senior SRE's internal monologue while working through a real services/ingress issue. This isn't a tutorial — it's a window into how experienced engineers actually think.

The Situation

After a routine ingress-nginx controller upgrade from 1.9 to 1.10, external traffic to three of our twelve production services returns 503. The other nine services work fine. The rollout was considered "non-breaking" per the release notes.

The Monologue

Okay, 503s on three services after an ingress controller upgrade. "Non-breaking" — sure. Let me start with which services are affected and what they have in common.

kubectl get ingress -A -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,HOSTS:.spec.rules[*].host,CLASS:.spec.ingressClassName' | sort

Twelve Ingress resources, all using ingressClassName: nginx. Let me check which ones are actually returning 503.

for host in api.example.com checkout.example.com catalog.example.com admin.example.com; do
  echo -n "$host: "; curl -s -o /dev/null -w "%{http_code}" https://$host/health
  echo
done

503 on checkout, admin, and catalog. 200 on api. Let me check the other eight to confirm... all 200s. So it's specifically those three.

What's different about those three? Let me diff their Ingress specs against a working one.

kubectl get ingress checkout-ingress -n commerce -o yaml > /tmp/broken.yaml
kubectl get ingress api-ingress -n core -o yaml > /tmp/working.yaml
diff /tmp/broken.yaml /tmp/working.yaml

The broken one has the annotation nginx.ingress.kubernetes.io/backend-protocol: "GRPC"; the working one doesn't. Interesting — all three broken services are gRPC backends. Let me verify...

for ns_ing in commerce/checkout-ingress commerce/catalog-ingress admin/admin-ingress; do
  ns=$(echo $ns_ing | cut -d/ -f1)
  ing=$(echo $ns_ing | cut -d/ -f2)
  echo -n "$ing: "
  kubectl get ingress $ing -n $ns -o jsonpath='{.metadata.annotations.nginx\.ingress\.kubernetes\.io/backend-protocol}'
  echo
done

All three: GRPC. The nine working services are all HTTP. So the ingress-nginx 1.10 upgrade broke gRPC backend handling.

Mental Model: Find the Common Denominator

When an upgrade breaks some-but-not-all services, the breakage pattern IS the diagnosis. Find what the broken services share that the working ones don't. It's almost always a config flag, annotation, or feature gate that changed behavior in the new version.
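A minimal sketch of that grouping step. The here-doc is illustrative sample data (host names and annotation values are stand-ins) in place of real kubectl output; in practice you'd pipe a kubectl custom-columns listing of host and backend-protocol annotation through the same awk/sort.

```shell
# Group each ingress host under its backend-protocol annotation so the
# common denominator of the broken set jumps out. An empty second field
# means no annotation, i.e. a plain HTTP backend.
group_by_protocol() {
  awk '{ proto = ($2 == "") ? "HTTP" : $2; print proto, $1 }' | sort
}

# Sample data standing in for `kubectl get ingress -A -o custom-columns=...`
group_by_protocol <<'EOF'
api.example.com
checkout.example.com GRPC
catalog.example.com GRPC
admin.example.com GRPC
payments.example.com
EOF
```

With real cluster output, all three 503-ing hosts land in the same GRPC group, which is the whole diagnosis in one screen of text.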

Let me check the ingress-nginx 1.10 changelog more carefully. I skimmed it before but clearly missed something.

kubectl get deployment ingress-nginx-controller -n ingress-nginx -o jsonpath='{.spec.template.spec.containers[0].image}'

Running registry.k8s.io/ingress-nginx/controller:v1.10.0. Let me check if there's a known issue with gRPC in this version.

kubectl logs deployment/ingress-nginx-controller -n ingress-nginx --tail=50 | grep -i grpc

Hmm, seeing "upstream sent invalid header" errors for the gRPC backends. That's an nginx-level error — the gRPC backends are sending something nginx doesn't like in the new version.

Actually, wait. I just remembered — ingress-nginx 1.10 changed the default value of use-http2 from true to "strict". For gRPC backends, you need HTTP/2 between nginx and the backend. If the strict mode is enforcing something different... let me check the ConfigMap.

kubectl get configmap ingress-nginx-controller -n ingress-nginx -o yaml | grep -i http2

Nothing explicitly set. So it's using the new default. Let me try explicitly enabling HTTP/2 in the ConfigMap.

kubectl edit configmap ingress-nginx-controller -n ingress-nginx
# Add: use-http2: "true"

Actually, no — let me not edit in place. Let me patch it properly so it's trackable.

kubectl patch configmap ingress-nginx-controller -n ingress-nginx \
  --type merge -p '{"data":{"use-http2":"true"}}'

Now I need to trigger a reload. The controller watches the ConfigMap, so it should pick this up.

kubectl logs deployment/ingress-nginx-controller -n ingress-nginx --tail=5 -f

"Configuration changes detected, backend reload required." Good. Let me test the gRPC services now.

curl -s -o /dev/null -w "%{http_code}" https://checkout.example.com/health

Still 503. Hmm. Let me think. For gRPC, the health check endpoint might not be the right test — gRPC uses HTTP/2 natively and /health might be an HTTP/1.1 REST endpoint. Let me test with grpcurl.

grpcurl -plaintext checkout.example.com:443 grpc.health.v1.Health/Check

Wait, that's wrong too — TLS is terminating at ingress. Let me use grpcurl with TLS.

grpcurl checkout.example.com:443 grpc.health.v1.Health/Check

Mental Model: Test at the Right Protocol Layer

When debugging gRPC through a proxy, you can't test with curl — you need a gRPC client. HTTP/1.1 health checks can succeed while gRPC traffic fails because they use different protocol negotiations. Always test with the actual protocol the service uses.
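That mental model is easy to encode. Here's a small sketch that picks the probe command per backend protocol; it prints the command rather than executing it, so the selection logic can be checked without a cluster. The health paths and gRPC method name mirror the ones used above and are illustrative.

```shell
# Choose a protocol-appropriate probe: grpcurl for gRPC backends,
# curl for plain HTTP. Prints the command it would run (dry-run).
probe_cmd() {
  host=$1; proto=$2
  case "$proto" in
    GRPC) echo "grpcurl $host:443 grpc.health.v1.Health/Check" ;;
    *)    echo "curl -s -o /dev/null -w %{http_code} https://$host/health" ;;
  esac
}

probe_cmd checkout.example.com GRPC
probe_cmd api.example.com HTTP
```

The point isn't the two-line case statement — it's that the protocol lives with the service definition, so the test tool should be derived from it rather than hardcoded to curl.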

Still failing. Let me look at what the controller is actually sending upstream.

POD=$(kubectl get pods -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx -o name | head -n1)
kubectl exec -n ingress-nginx "$POD" -- cat /etc/nginx/nginx.conf | grep -A 20 "checkout.example.com"

I see it now — the upstream block is using grpc_pass but the http2 directive isn't set on the listen block. The ConfigMap change might need a full pod restart, not just a reload, for HTTP/2 listener changes.

kubectl rollout restart deployment/ingress-nginx-controller -n ingress-nginx
kubectl rollout status deployment/ingress-nginx-controller -n ingress-nginx

Waiting for rollout... done. Let me test again.

grpcurl checkout.example.com:443 grpc.health.v1.Health/Check

status: SERVING. And let me verify all three broken services.

for host in checkout.example.com catalog.example.com admin.example.com; do
  echo -n "$host: "
  grpcurl $host:443 grpc.health.v1.Health/Check 2>&1 | head -n1
done

All serving. Now let me make sure I didn't break the HTTP services.

for host in api.example.com payments.example.com users.example.com; do
  echo -n "$host: "
  curl -s -o /dev/null -w "%{http_code}" https://$host/health
  echo
done

All 200. We're good. Let me go update the Helm values file for ingress-nginx so this ConfigMap setting persists through future upgrades.
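A sketch of what that persistence step looks like, assuming the controller was installed from the official ingress-nginx Helm chart (where ConfigMap entries live under controller.config in values). The release name, chart reference, and file path are illustrative; the actual helm upgrade is shown as a comment rather than executed.

```shell
# Write a values patch so use-http2 survives the next chart upgrade.
cat > /tmp/ingress-nginx-values-patch.yaml <<'EOF'
controller:
  config:
    use-http2: "true"
EOF

# Apply for real with something like (not executed here):
#   helm upgrade ingress-nginx ingress-nginx/ingress-nginx \
#     -n ingress-nginx -f /tmp/ingress-nginx-values-patch.yaml --reuse-values
grep -q 'use-http2: "true"' /tmp/ingress-nginx-values-patch.yaml && echo "values patch written"
```

The kubectl patch fixed the live cluster; this makes the same setting the source of truth, so the next helm upgrade doesn't silently revert it.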

Also making a note: we need to add gRPC-specific health checks to our canary deployment pipeline. If the canary had tested gRPC endpoints, we'd have caught this before it went to 100%.
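A rough sketch of that canary gate: read a host/protocol list, probe each service with its own protocol, and fail fast on the first bad check. The host list is illustrative, and DRY_RUN=1 prints the commands instead of running them so the gate logic itself is testable offline.

```shell
# Promotion gate: one protocol-appropriate check per service.
# Returns nonzero on the first failure so the canary stops early.
canary_gate() {
  while read -r host proto; do
    case "$proto" in
      GRPC) cmd="grpcurl $host:443 grpc.health.v1.Health/Check" ;;
      *)    cmd="curl -sf https://$host/health" ;;
    esac
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "would run: $cmd"
    else
      $cmd >/dev/null || { echo "canary gate FAILED on $host ($proto)"; return 1; }
    fi
  done
}

DRY_RUN=1 canary_gate <<'EOF'
api.example.com HTTP
checkout.example.com GRPC
EOF
```

Had something like this gated the 1.10 rollout, the gRPC checks would have failed at the canary stage instead of at 100% of traffic.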

What Made This Senior-Level

Junior would: Investigate each broken service individually.
Senior does: Look for the common pattern (all gRPC backends) across broken vs. working services.
Why: Pattern matching across the failure set is faster than individual debugging.

Junior would: Test gRPC services with curl.
Senior does: Use grpcurl to test at the correct protocol layer.
Why: HTTP/1.1 tests can pass while gRPC is completely broken.

Junior would: Edit the ConfigMap in place with kubectl edit.
Senior does: Use kubectl patch for a trackable, scriptable change.
Why: In-place edits are impossible to audit and can't be replicated.

Junior would: Fix the immediate issue and move on.
Senior does: Update the Helm values and flag the canary gap.
Why: The fix needs to survive the next upgrade, and the process needs to catch this class of issue.

Key Heuristics Used

  1. Common Denominator Analysis: When an upgrade breaks a subset of services, the shared trait of the broken set reveals the breaking change.
  2. Test at the Right Protocol Layer: Use protocol-appropriate tools (grpcurl for gRPC, not curl) to avoid false positives.
  3. Reload vs Restart: Some nginx configuration changes require a full process restart, not just a config reload — especially listener-level changes like HTTP/2.

Cross-References

  • Primer — How Services, Endpoints, and Ingress route traffic
  • Street Ops — Ingress controller debugging and annotation reference
  • Footguns — Ingress controller upgrades changing default behavior