
gRPC Footguns

Mistakes that cause outages or wasted hours.


1. Not propagating deadlines — downstream services keep running after the caller times out

The upstream service sets a 500ms deadline. Your service calls a downstream service but passes context.Background() instead of the caller's context. When the upstream times out, the downstream call keeps running for seconds or minutes, wasting resources and causing a thundering herd as retries pile up: every retry from upstream spawns another orphaned downstream call. Fix: Always pass ctx through. Never use context.Background() or context.TODO() in production request handlers except at the very top of the call stack. In Go: downstream.Call(ctx, req), not downstream.Call(context.Background(), req). Add linters or code-review checks for context.Background() in handler code.


2. Treating gRPC status codes like HTTP status codes — wrong retry logic

Your retry framework retries on status code 13 (INTERNAL), assuming it means "server error, try again." But INTERNAL means the server hit an unexpected bug — the operation may have partially completed, and retrying can cause duplicate actions, double charges, or data corruption. Conversely, teams never retry UNAVAILABLE because they think it's permanent, missing that it's specifically designed to signal "try again." Fix: Follow the gRPC retryability spec: UNAVAILABLE (14) and RESOURCE_EXHAUSTED (8) are safe to retry with backoff. INTERNAL (13), UNKNOWN (2), and DATA_LOSS (15) are not safe to retry without idempotency guarantees. DEADLINE_EXCEEDED (4) depends on whether the operation was idempotent. Build this logic into your interceptors rather than per-call retry code.


3. Layer 4 load balancer in front of gRPC — one backend gets all the traffic

You deploy a gRPC service behind a standard TCP load balancer (AWS NLB, DigitalOcean LB). HTTP/2 multiplexes all RPCs over a single TCP connection, so the load balancer sees one long-lived connection per client and sends every RPC from that client to the same backend. One pod gets hammered while the others sit idle; this shows up as high latency on one backend and near-zero load on the rest. Fix: Use an HTTP/2-aware (Layer 7) load balancer: Envoy, nginx with grpc_pass, GCP Cloud Load Balancing, or AWS ALB with a gRPC target group. Alternatively, use client-side load balancing with DNS discovery (a Kubernetes headless Service plus gRPC's built-in dns resolver). Verify the fix by checking per-pod RPC counts in your metrics — they should be roughly equal.
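The client-side option in Go might look like the following sketch (assumes google.golang.org/grpc; the target name is hypothetical). The dns resolver returns every address behind the name, and the round_robin policy spreads RPCs across them:

```go
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// A Kubernetes headless Service resolves to individual pod IPs, so the
	// dns resolver sees every backend, not one virtual IP.
	conn, err := grpc.Dial(
		"dns:///my-service.default.svc.cluster.local:50051", // hypothetical target
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		// Enable the built-in round_robin load-balancing policy.
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	// Create stubs from conn; RPCs now rotate across resolved backends.
}
```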


4. Server reflection disabled in production — losing visibility when you need it most

Disabling server reflection in production is common for security reasons. But during an incident, you cannot use grpcurl to inspect the live service — you can't quickly check what methods exist, what the request format is, or call a diagnostic RPC. You're debugging blind. Fix: Consider using a separate admin port with reflection enabled, accessible only within the cluster (not exposed externally). Alternatively, generate and ship a .protoset file with each deployment and document the grpcurl command to use it: grpcurl -protoset myservice.protoset <host>:<port> list. At minimum, keep the protoset in your runbook so on-call engineers can use it.
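One way to wire the split, sketched with grpc-go (the port numbers and the omitted "register handlers" step are placeholders):

```go
package main

import (
	"log"
	"net"

	"google.golang.org/grpc"
	"google.golang.org/grpc/reflection"
)

func main() {
	// Public server: service handlers only, no reflection, exposed externally.
	public := grpc.NewServer()
	// register your service handlers on public here...

	// Admin server: reflection enabled, bound to a loopback/cluster-internal
	// port that is never exposed outside the cluster.
	admin := grpc.NewServer()
	reflection.Register(admin)

	adminLis, err := net.Listen("tcp", "127.0.0.1:9091")
	if err != nil {
		log.Fatal(err)
	}
	go admin.Serve(adminLis)

	publicLis, err := net.Listen("tcp", ":50051")
	if err != nil {
		log.Fatal(err)
	}
	log.Fatal(public.Serve(publicLis))
}
```

During an incident, on-call can then run grpcurl against 127.0.0.1:9091 from inside the cluster without reflection ever being reachable externally.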


5. Ignoring connection pool behavior — creating a new gRPC channel per request

Client code creates a new grpc.Dial() (or equivalent) for every request. Each dial opens a fresh TCP connection, performs the TLS handshake, and negotiates HTTP/2. Under load, the server sees thousands of short-lived connections, connection-setup overhead dominates latency, and the server can exhaust its file descriptor limit. Fix: Create one gRPC channel (client connection) per target service at startup and reuse it for all requests. The channel is thread-safe and manages connections internally. In Go: call conn, err := grpc.Dial(addr, opts...) once, store the conn, and create clients from it. Only create multiple channels if you need multiple backends or explicit connection sharding.
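A sketch of the dial-once pattern in grpc-go (the service name and address are hypothetical):

```go
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// One channel per target service, created at startup and shared by all
// request handlers. *grpc.ClientConn is safe for concurrent use.
var userConn *grpc.ClientConn

func mustDial(addr string) *grpc.ClientConn {
	conn, err := grpc.Dial(addr, grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("dial %s: %v", addr, err)
	}
	return conn
}

func main() {
	userConn = mustDial("user-service:50051") // hypothetical address
	defer userConn.Close()
	// client := userpb.NewUserClient(userConn) // stubs share the channel
	// ...serve requests; never grpc.Dial per request.
}
```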


6. Not handling GOAWAY — connections dropped during rolling deployments cause stuck RPCs

During a rolling deployment, your gRPC server sends a GOAWAY frame to signal it's shutting down. A GOAWAY carries the last stream ID the server processed, so well-behaved clients open new RPCs on a fresh connection and transparently retry any RPCs the server never started. Clients that don't handle GOAWAY, or servers that close connections abruptly, kill in-flight RPCs with UNAVAILABLE; if timeouts are long, users see a multi-second hang during every deployment. Fix: Ensure your gRPC client library handles GOAWAY (most do by default). On the server side, shut down gracefully: stop accepting new streams, wait for existing streams to complete, then close. In Go: server.GracefulStop() instead of server.Stop(), bounded by a timeout (30s is common) that falls back to Stop(). Set MaxConnectionAge in the server's keepalive parameters so connections are periodically cycled rather than all dropped at deployment time.
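The server-side pieces fit together as in this grpc-go sketch; the 5-minute MaxConnectionAge and the 30s values are illustrative, not prescriptive:

```go
package main

import (
	"log"
	"net"
	"os"
	"os/signal"
	"syscall"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	srv := grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
		// Cycle connections periodically so clients re-resolve and
		// rebalance before a deployment forces them to.
		MaxConnectionAge:      5 * time.Minute,
		MaxConnectionAgeGrace: 30 * time.Second,
	}))
	// register service handlers on srv here...

	lis, err := net.Listen("tcp", ":50051")
	if err != nil {
		log.Fatal(err)
	}
	go srv.Serve(lis)

	sig := make(chan os.Signal, 1)
	signal.Notify(sig, syscall.SIGTERM, syscall.SIGINT)
	<-sig

	// GracefulStop sends GOAWAY, refuses new streams, and waits for
	// in-flight RPCs. Bound it with a timeout, then force-stop.
	done := make(chan struct{})
	go func() { srv.GracefulStop(); close(done) }()
	select {
	case <-done:
	case <-time.After(30 * time.Second):
		srv.Stop()
	}
}
```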


7. Protobuf field numbering mistake — silently breaking wire compatibility

A developer adds a new field to a proto message and reuses a previously deleted field number. Or they rename a field without changing the number, thinking the name is the key. Protobuf encoding uses field numbers, not names — old clients sending data with field number 5 will have it silently interpreted as the new field's type, causing data corruption or panics on the server, with no error at the serialization layer. Fix: Never reuse field numbers. When you delete a field, add it to the reserved list: reserved 5; and reserved "old_field_name";. Treat proto files like database schemas — backwards compatibility is mandatory. Use buf lint and buf breaking to automatically catch field number reuse and other breaking changes in CI.
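For example, after deleting a field numbered 5 (the field names here are hypothetical), the message should reserve both the number and the name:

```proto
syntax = "proto3";

message User {
  // Field 5 ("legacy_flags") was deleted. Reserving its number and name
  // makes the compiler reject any future reuse with a different meaning.
  reserved 5;
  reserved "legacy_flags";

  string id = 1;    // renaming a field is wire-safe; renumbering is not
  string email = 2;
}
```

With the reservation in place, protoc fails the build if anyone later declares a new field with number 5 or name "legacy_flags".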


8. gRPC-web without a proxy — browser clients can't speak native gRPC

You build a browser frontend that needs to call your gRPC backend. Browser HTTP APIs don't expose the HTTP/2 trailers gRPC uses to carry status codes, so native gRPC doesn't work from browsers. The team assumes they need to rewrite the backend as REST. Fix: Use gRPC-Web with a proxy. Envoy has built-in gRPC-Web transcoding. Alternatively, use grpc-gateway to generate a REST/HTTP+JSON proxy from your proto definitions, which gives you both gRPC (for services) and REST (for browsers/external clients) from one proto file. For pure browser-to-gRPC, gRPC-Web client libraries exist for TypeScript/JavaScript and work with an Envoy or grpcwebproxy sidecar.
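As one illustration, the Envoy HTTP filter chain for gRPC-Web looks roughly like the fragment below (filter names per Envoy's v3 API; the surrounding listener, route, and cluster config is omitted):

```yaml
http_filters:
  - name: envoy.filters.http.grpc_web
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.http.grpc_web.v3.GrpcWeb
  # (a CORS filter usually goes here for browser clients)
  - name: envoy.filters.http.router
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
```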


9. Streaming RPC without flow control — sender overwhelms receiver

You use a client or bidirectional streaming RPC and the sender writes messages as fast as possible. gRPC's HTTP/2 layer has flow control (WINDOW_UPDATE frames), but if the sender never checks Send() errors, or the framework's buffers are large, messages queue up in memory. The sender appears healthy while the receiver is overwhelmed; under GC pressure or OOM, both sides may crash. Fix: Check the error after every Send() call in streaming RPCs. Implement application-level backpressure: read from the stream at a controlled rate, or send explicit flow-control signals. Don't assume the framework will handle it; HTTP/2 flow control does eventually push back, but by then memory is already under pressure. Set reasonable max message sizes: grpc.MaxRecvMsgSize(4*1024*1024).


10. Using grpcurl with -plaintext against a TLS-only server — getting a confusing error

You try to debug a production gRPC endpoint with grpcurl -plaintext myserver:443 list and get an error like "transport: received the unexpected content-type 'text/html; charset=utf-8'" or "EOF". You assume the server is down or the endpoint is wrong. The real problem is that the server expects TLS, and you're sending plaintext HTTP/2. Fix: Remove -plaintext for TLS servers: grpcurl myserver:443 list. If the cert is self-signed or from an internal CA: grpcurl -cacert /path/to/ca.pem myserver:443 list. For dev/test with untrusted certs only: grpcurl -insecure myserver:443 list. The error "unexpected content-type: text/html" usually means an HTTP/1.1 response (a proxy or load balancer sending an error page), not a gRPC response — check if there's a Layer 7 proxy in front.