Interview Gauntlet: Intermittent gRPC Failures¶
Category: Debugging | Difficulty: L2-L3 | Duration: 15-20 minutes | Domains: gRPC, Load Balancing
Round 1: The Opening¶
Interviewer: "gRPC calls between Service A and Service B fail intermittently — about 5% of calls return UNAVAILABLE. Both services are running in Kubernetes and show healthy. Where do you start?"
Strong Answer:¶
"gRPC UNAVAILABLE typically means the client couldn't reach the server or the connection was refused/reset. With a 5% failure rate and both services healthy, I'm thinking a connection management or load balancing issue. gRPC uses HTTP/2, which multiplexes many requests over a single long-lived TCP connection. In Kubernetes, this creates a specific problem: if Service A connects to Service B via a Kubernetes Service (ClusterIP), kube-proxy does L4 load balancing at connection time. All requests on that single HTTP/2 connection go to the same backend pod. If that pod restarts or is evicted, all in-flight RPCs fail. I'd check: kubectl get pods -l app=service-b -o wide to see if pods are restarting or being rescheduled. kubectl logs deploy/service-a --tail=200 to see the error details — gRPC gives specific status codes and sometimes error messages. I'd also check if there's a load balancer or ingress between the services, and whether it supports HTTP/2 properly. Many L7 load balancers support HTTP/2 on the client side but downgrade to HTTP/1.1 on the backend, which changes the connection semantics."
Common Weak Answers:¶
- "Check if the service is down." — The premise says both services are healthy. The issue is intermittent, suggesting a connection or routing problem, not a total failure.
- "Increase timeout and retries." — This is a band-aid, not a diagnosis. Retries might help availability but mask the underlying issue and increase latency.
- "It's probably a network issue." — Too vague. What specific network issue? The candidate needs to reason about HTTP/2 connection semantics in Kubernetes.
Round 2: The Probe¶
Interviewer: "You observe that the failures correlate with Service B pod restarts during rolling updates. When a Service B pod is terminated, the in-flight gRPC calls to that pod all fail. But you have a 30-second graceful shutdown period. Why aren't the in-flight calls completing?"
What the interviewer is testing: Understanding of gRPC connection lifecycle, Kubernetes graceful termination, and why HTTP/2 long-lived connections complicate rolling updates.
Strong Answer:¶
"There's a race condition in Kubernetes pod termination. When a pod is marked for termination, two things happen concurrently: (1) the kubelet sends SIGTERM to the container, starting the graceful shutdown period, and (2) the Endpoints controller removes the pod from the Service endpoints. But these are not synchronized. The Endpoints update takes time to propagate to kube-proxy (which updates iptables/IPVS rules) and to the DNS cache (for headless services). During this propagation window — which can be several seconds — new connections are still being routed to the terminating pod. For gRPC specifically, the issue is worse because existing HTTP/2 connections aren't automatically drained. When the pod receives SIGTERM, it should start a graceful drain: stop accepting new RPCs, finish in-flight RPCs, then shut down. But if the gRPC server doesn't implement graceful drain (i.e., it calls server.Stop() instead of server.GracefulStop()), it kills connections immediately. I'd check: is Service B's gRPC server calling GracefulStop() on SIGTERM? And is there a preStop hook with a small sleep to allow the Endpoints update to propagate before the server starts draining?"
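The pod-spec side of that fix is small. A minimal sketch, assuming an illustrative service-b deployment (all names hypothetical): the preStop sleep delays SIGTERM so the Endpoints removal can propagate to kube-proxy before the server stops accepting RPCs, and the grace period must cover the sleep plus the drain.

```yaml
# Hypothetical pod template for Service B; names are illustrative.
spec:
  terminationGracePeriodSeconds: 30   # must exceed preStop sleep + drain time
  containers:
    - name: service-b
      image: example/service-b:latest
      ports:
        - containerPort: 50051
      lifecycle:
        preStop:
          exec:
            # Kubelet runs this BEFORE sending SIGTERM, giving the
            # Endpoints update time to reach kube-proxy and DNS caches.
            command: ["sleep", "5"]
```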
Trap Alert:¶
If the candidate bluffs here: The interviewer will ask "What's the difference between server.Stop() and server.GracefulStop() in gRPC?" Stop() forcefully closes all connections and cancels in-flight RPCs. GracefulStop() stops accepting new RPCs, waits for in-flight RPCs to complete, then shuts down; it blocks indefinitely, so production code pairs it with a timer that falls back to Stop(). This is a fundamental distinction. It's fine to say "I know gRPC has a graceful shutdown mechanism but I'd need to check the specific API name in the language we're using."
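The SIGTERM-handling pattern can be sketched in stdlib-only Go. drainOnSignal is a hypothetical helper: it takes plain funcs so the sketch compiles without the grpc package, but in real code you would pass server.GracefulStop and server.Stop from a *grpc.Server.

```go
package main

import (
	"os"
	"os/signal"
	"syscall"
	"time"
)

// drainOnSignal blocks until SIGTERM, then drains gracefully, falling
// back to a hard stop if draining exceeds the deadline. Hypothetical
// helper: pass server.GracefulStop and server.Stop in real code.
func drainOnSignal(gracefulStop, stop func(), deadline time.Duration) {
	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGTERM)
	<-sigCh

	done := make(chan struct{})
	go func() {
		gracefulStop() // stop accepting new RPCs, wait for in-flight ones
		close(done)
	}()
	select {
	case <-done: // drained cleanly within the grace period
	case <-time.After(deadline):
		stop() // deadline exceeded: cancel remaining RPCs
	}
}
```

The deadline should be a little shorter than terminationGracePeriodSeconds minus any preStop sleep, so the hard stop fires before the kubelet sends SIGKILL.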
Round 3: The Constraint¶
Interviewer: "You fix the graceful shutdown. The rolling update failures stop. But you still have 5% UNAVAILABLE errors during normal operation — no rolling updates happening. Service B has 10 pods but all gRPC traffic from Service A goes to only 2 of them. What's happening?"
Strong Answer:¶
"This is the classic gRPC load balancing problem with HTTP/2 and Kubernetes Services. When Service A creates a gRPC channel to service-b.default.svc.cluster.local:50051, the DNS resolution returns a single ClusterIP address. Service A opens one (or a few) HTTP/2 connections to that ClusterIP. kube-proxy does L4 load balancing at connection establishment time, routing the TCP connection to one of the 10 backend pods. But HTTP/2 multiplexes all RPCs over that single connection — so all traffic goes to one pod. If Service A has a few gRPC channels (or the gRPC library creates a few subchannels), traffic goes to 2-3 pods while the other 7-8 sit idle. The 5% failure rate might be because those 2 overloaded pods occasionally hit resource limits. The fix is client-side load balancing. Options: (1) use a headless Service (clusterIP: None) so DNS returns all pod IPs, then configure the gRPC client's name resolver to use DNS and the round_robin load balancing policy. In Go: grpc.Dial("dns:///service-b.default.svc.cluster.local:50051", grpc.WithDefaultServiceConfig(`{"loadBalancingPolicy":"round_robin"}`)). (2) Use an L7 load balancer that understands HTTP/2 — Envoy, Linkerd, or Istio — which can balance individual gRPC requests across backend pods even over a single inbound connection."
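For option (1), the Service side is a one-line change. A minimal sketch, assuming an illustrative service-b deployment (names hypothetical): clusterIP: None makes the cluster DNS return an A record per ready pod, which the dns:/// resolver then hands to the round_robin policy.

```yaml
# Hypothetical headless Service for Service B; names are illustrative.
apiVersion: v1
kind: Service
metadata:
  name: service-b
spec:
  clusterIP: None        # headless: DNS returns all ready pod IPs
  selector:
    app: service-b
  ports:
    - name: grpc
      port: 50051
      targetPort: 50051
```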
The Senior Signal:¶
What separates a senior answer: Understanding that HTTP/2 + L4 load balancing = broken load distribution. This is the single most common gRPC operational issue in Kubernetes and many teams discover it the hard way when one pod is overloaded while nine are idle. Knowing the specific client-side fix (headless service + round_robin policy) or the proxy fix (Envoy/Istio) shows real experience. Also: knowing that the gRPC dns:/// resolver syntax uses three slashes, which is the authority-less URI format that triggers gRPC's built-in DNS resolver.
Round 4: The Curveball¶
Interviewer: "You implement client-side load balancing with a headless service. It works great — traffic distributes evenly. But now when Service B scales up from 10 to 20 pods, the new pods don't receive any traffic until Service A is restarted. Why?"
Strong Answer:¶
"The gRPC client's DNS resolver caches the DNS response. When the headless service returns pod IPs, the gRPC channel resolves them once (or periodically based on the DNS TTL) and establishes subchannels to each IP. When new pods are added, the DNS response changes, but the gRPC client doesn't re-resolve DNS until either the TTL expires or the channel is recreated. The default behavior varies by gRPC implementation. In Go, older grpc-go versions re-resolved on a 30-minute timer; newer versions re-resolve when a subchannel connection fails rather than on a schedule. In Java, it depends on the JVM's DNS caching (networkaddress.cache.ttl in java.security, which defaults to 30 seconds for successful lookups in OpenJDK but can be infinite in some JVM distributions). The fix: force periodic re-resolution. A common approach is setting MaxConnectionAge in the server's keepalive parameters so clients are periodically told to reconnect, which triggers a fresh DNS resolution on the client. Alternatively, use a service mesh like Istio which handles service discovery and load balancing at the proxy level — the Envoy sidecar discovers new endpoints via the Istio control plane (xDS protocol) and routes to them immediately, without DNS resolution delays. A third option: gRPC supports xDS-based load balancing natively (without a sidecar), where the client subscribes to endpoint updates from a control plane."
Trap Question Variant:¶
The right answer is "I know this is a DNS caching issue but the specific re-resolution defaults vary by language." The gRPC DNS resolver behavior is different in Go, Java, Python, and C++. Claiming to know all the defaults precisely is a bluffing signal. The important insight is knowing that DNS-based service discovery has an inherent staleness problem and that more dynamic solutions (xDS, service mesh) solve it. Saying "I'd need to check the specific behavior for our gRPC client library" is perfectly fine.
Round 5: The Synthesis¶
Interviewer: "gRPC in Kubernetes requires special handling for load balancing, graceful shutdown, and service discovery. Was it a mistake to choose gRPC? When would you recommend gRPC vs REST?"
Strong Answer:¶
"gRPC has real advantages: strongly typed contracts via protobuf, efficient binary serialization, bidirectional streaming, and code generation for clients and servers. For internal service-to-service communication with high throughput requirements, it's measurably better than JSON over REST. But the operational overhead is real — HTTP/2 load balancing issues, debugging is harder (you can't just curl a gRPC endpoint without grpcurl), and the tooling ecosystem for monitoring and tracing needs gRPC-aware components. My framework for choosing: use gRPC when you have high-throughput internal communication, need streaming, or have a polyglot environment where shared protobuf contracts prevent interface drift. Use REST when the service is externally-facing (browsers, third-party clients), when the team is small and operational simplicity matters more than performance, or when the request volume is low enough that serialization overhead doesn't matter. The hybrid approach works well: REST for the external API gateway, gRPC for internal service-to-service. And if you adopt gRPC, invest in the infrastructure up front — service mesh for load balancing, gRPC-aware health checks, grpcurl for debugging, and OpenTelemetry instrumentation for tracing. Don't bolt it on after the issues appear."
What This Sequence Tested:¶
| Round | Skill Tested |
|---|---|
| 1 | gRPC fundamentals and HTTP/2 connection model in Kubernetes |
| 2 | Kubernetes graceful termination mechanics and gRPC drain behavior |
| 3 | gRPC load balancing problem diagnosis and client-side vs proxy solutions |
| 4 | DNS-based service discovery limitations and dynamic endpoint discovery |
| 5 | Technology selection framework and pragmatic trade-off communication |