
Portal | Level: L2 | Domain: DevOps

GraphQL Footguns

Common mistakes that hurt production GraphQL services — and how to avoid them.


  1. Not using DataLoader — shipping the N+1 problem to production. Your User.orders resolver looks innocent: return db.orders.findByUserId(parent.id). A query that fetches 50 users now fires 51 DB queries. Under modest load this becomes the dominant cost in your service. The slowness is hard to attribute because each individual query is fast — the damage is the count.

Fix: Wrap every resolver that calls db.findById(someParentId) with a DataLoader. Instantiate one DataLoader per resource type, per request, in the context factory. Never share DataLoader instances across requests — per-request scoping prevents cache poisoning between users. The batch function must return results in the same order as input IDs.
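A minimal per-request batcher along these lines (a sketch only; the real dataloader package adds per-key caching, error propagation, and scheduling refinements, and the resolver and table names below are hypothetical):

```typescript
// Minimal batching loader in the spirit of DataLoader: collect keys for
// one tick, then issue a single batched fetch.
type BatchFn<K, V> = (keys: readonly K[]) => Promise<V[]>;

class TinyLoader<K, V> {
  private queue: { key: K; resolve: (v: V) => void; reject: (e: unknown) => void }[] = [];
  private scheduled = false;

  constructor(private batchFn: BatchFn<K, V>) {}

  load(key: K): Promise<V> {
    return new Promise((resolve, reject) => {
      this.queue.push({ key, resolve, reject });
      if (!this.scheduled) {
        this.scheduled = true;
        // Flush once the current tick's resolvers have enqueued their keys.
        queueMicrotask(() => this.flush());
      }
    });
  }

  private async flush() {
    const batch = this.queue;
    this.queue = [];
    this.scheduled = false;
    try {
      const values = await this.batchFn(batch.map((item) => item.key));
      // The batch function must return values in the same order as the keys.
      batch.forEach((item, i) => item.resolve(values[i]));
    } catch (e) {
      batch.forEach((item) => item.reject(e));
    }
  }
}

// Hypothetical usage: one loader per resource type, created per request
// in the context factory. A real batch function would run something like
// SELECT * FROM orders WHERE user_id IN (...userIds).
const ordersByUserId = new TinyLoader(async (userIds: readonly number[]) =>
  userIds.map((id) => [{ orderId: id * 100, userId: id }]),
);
```

Fifty `User.orders` resolutions in one tick now produce a single batched call instead of fifty queries.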


  2. No query depth or complexity limits — leaving the door open for DoS. GraphQL lets clients write arbitrarily nested queries. A deeply recursive query — { user { orders { user { orders { user { ... } } } } } } — can exhaust memory or CPU in seconds with no rate-limit protection, because it arrives as a single HTTP request. An attacker (or a poorly-written client) can take down your API without writing a single line of unusual code.

Fix: Install depth limiting (graphql-depth-limit) and complexity scoring (graphql-validation-complexity or graphql-cost-analysis) as validation rules. Set your depth limit based on your deepest legitimate query (typically 5–7 for most APIs). Set complexity limits based on load testing. Treat these as the first line of defense — add them before you go to production, not after the first incident.
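To make the depth rule concrete, here is a sketch of the recursion it performs. The real graphql-depth-limit rule walks the parsed GraphQL AST as a validation rule; the simplified Selection shape below is a stand-in for that AST:

```typescript
// Simplified stand-in for a GraphQL selection set.
interface Selection {
  name: string;
  selections?: Selection[];
}

// Depth of the deepest selection path, counting the root level as 1.
function maxDepth(selections: Selection[], current = 1): number {
  let deepest = current;
  for (const sel of selections) {
    if (sel.selections && sel.selections.length > 0) {
      deepest = Math.max(deepest, maxDepth(sel.selections, current + 1));
    }
  }
  return deepest;
}

function assertDepthLimit(selections: Selection[], limit: number): void {
  const depth = maxDepth(selections);
  if (depth > limit) {
    throw new Error(`Query depth ${depth} exceeds limit ${limit}`);
  }
}

// { user { orders { user { orders } } } } — depth 4.
const recursive: Selection[] = [
  { name: "user", selections: [
    { name: "orders", selections: [
      { name: "user", selections: [{ name: "orders" }] },
    ]},
  ]},
];
```

A validation rule like this runs before execution, so the recursive query is rejected without costing any resolver work.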


  3. Returning null instead of throwing typed errors — clients can't tell what went wrong. A resolver catches an exception and returns null. The client receives { "data": { "user": null } } with no errors field. It cannot distinguish "user not found", "database is down", and "you don't have permission". Clients handle all three incorrectly — usually by silently rendering nothing.

Fix: Throw meaningful error objects in resolvers, or use error union types. At minimum, attach machine-readable extensions.code values (NOT_FOUND, FORBIDDEN, INTERNAL): ApolloError subclasses in Apollo Server 3, or GraphQLError with extensions in Apollo Server 4 and graphql-js. Use formatError to translate unexpected errors to INTERNAL_SERVER_ERROR before they leave the server, so stack traces never reach clients.
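A sketch of the pattern, with framework-independent error classes standing in for GraphQLError/ApolloError (class names are illustrative):

```typescript
// Typed errors carrying a machine-readable extensions.code, following
// the GraphQL convention of an `extensions` object on each error.
class AppError extends Error {
  constructor(message: string, public extensions: { code: string }) {
    super(message);
  }
}

class NotFoundError extends AppError {
  constructor(resource: string, id: string) {
    super(`${resource} ${id} not found`, { code: "NOT_FOUND" });
  }
}

class ForbiddenError extends AppError {
  constructor(message = "Not authorized") {
    super(message, { code: "FORBIDDEN" });
  }
}

// formatError-style sanitizer: known errors pass through with their code;
// anything unexpected becomes a generic INTERNAL_SERVER_ERROR.
function toClientError(err: unknown): { message: string; extensions: { code: string } } {
  if (err instanceof AppError) {
    return { message: err.message, extensions: err.extensions };
  }
  // Never leak stack traces or driver messages to clients.
  return { message: "Internal server error", extensions: { code: "INTERNAL_SERVER_ERROR" } };
}
```

The client can now branch on extensions.code instead of guessing what a bare null means.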


  4. Breaking schema changes deployed without deprecation — surprise outages for clients. Renaming user.fullName to user.name, removing a field, or changing an argument from optional to required — any of these will cause every client that references the old shape to start receiving errors at query execution time. In a federated setup, a breaking change in one subgraph can break the entire composed schema.

Fix: Mark fields deprecated before removing them: @deprecated(reason: "Use 'name' instead. Removing 2024-07-01."). Give clients at least one release cycle (or a defined deprecation window) before deletion. Enforce breaking-change detection in CI using graphql-inspector diff against the production schema. Block deployments that introduce breaking changes without an explicit override.
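In SDL, the deprecation window for the rename above might look like this (field names from the example; the removal date is illustrative):

```graphql
type User {
  name: String!
  fullName: String @deprecated(reason: "Use 'name' instead. Removing 2024-07-01.")
}
```

Clients see the deprecation in introspection-driven tooling and codegen, and graphql-inspector diff will flag the eventual removal as a breaking change in CI.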


  5. Over-fetching in resolvers — fetching full objects when only an ID is needed. A resolver for Order.userId fetches the entire User row from the database just to return user.id. Worse: a Product.categoryName resolver fetches the whole Category object to extract one field. This wastes DB bandwidth and query time on every resolution of that field.

Fix: Inspect what the client actually requested using the info argument: info.fieldNodes shows which sub-fields were selected. If only id was requested, return { id: parent.userId } directly — GraphQL will resolve scalar fields without re-querying. For common patterns, use DataLoader to batch the fetch, and defer the full object load only if non-ID fields are requested.
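A sketch of the short-circuit. To stay self-contained, the requested field names are passed in as a plain set; in a real resolver they come from the info argument's field nodes, and fetchUser is a hypothetical full-object fetch:

```typescript
interface Order { userId: string }
interface User { id: string; name: string }

// Hypothetical expensive path: the full DB fetch we want to avoid.
async function fetchUser(id: string): Promise<User> {
  return { id, name: "loaded-from-db" };
}

async function resolveOrderUser(
  parent: Order,
  requestedFields: Set<string>,
): Promise<{ id: string } | User> {
  // If the client asked only for `id`, the parent already carries it:
  // GraphQL resolves the scalar from this object without another query.
  const onlyId = requestedFields.size === 1 && requestedFields.has("id");
  if (onlyId) {
    return { id: parent.userId };
  }
  return fetchUser(parent.userId);
}
```

The common `{ order { user { id } } }` query now costs nothing beyond the parent fetch, while richer selections still load the full object.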


  6. Not validating input types beyond schema types — business rule violations slip through. GraphQL's type system enforces String, Int, and ID types, but not business constraints. createUser(email: "notanemail") passes schema validation. createOrder(quantity: -500) passes. A date range with startDate after endDate passes. These reach your resolvers unchecked.

Fix: Validate inputs at the resolver boundary before they touch the database. Use a validation library (zod, joi, class-validator) against the input arguments. Custom scalars can carry validation logic (e.g., an Email scalar that validates format). Return clear validation errors with extensions.code: 'VALIDATION_ERROR' and field-level details so clients can surface the right error to the right form field.
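A hand-rolled sketch of boundary validation; in practice zod or joi would express the same rules declaratively, and the input shape here is illustrative:

```typescript
interface CreateOrderInput { email: string; quantity: number }

// Carries field-level details so clients can map errors to form fields.
class ValidationError extends Error {
  extensions = { code: "VALIDATION_ERROR" as const };
  constructor(public fieldErrors: Record<string, string>) {
    super("Invalid input");
  }
}

function validateCreateOrder(input: CreateOrderInput): void {
  const errors: Record<string, string> = {};
  // Schema types already guarantee String/Int; business rules go further.
  if (!/^[^@\s]+@[^@\s]+\.[^@\s]+$/.test(input.email)) {
    errors.email = "must be a valid email address";
  }
  if (!Number.isInteger(input.quantity) || input.quantity < 1) {
    errors.quantity = "must be a positive integer";
  }
  if (Object.keys(errors).length > 0) {
    throw new ValidationError(errors);
  }
}
```

Running this at the top of the mutation resolver keeps `quantity: -500` out of the database and gives the client a field-addressable error payload.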


  7. Subscription memory leaks — active listeners accumulate when clients disconnect. Each WebSocket subscription registers a PubSub listener or event emitter. When the client disconnects — due to page refresh, network drop, or app close — if your subscribe resolver doesn't properly clean up, the listener stays registered. Over hours, hundreds of zombie listeners accumulate. Memory grows. The PubSub system degrades. Events fan out to dead sockets.

Fix: Use async generators with try/finally in subscription resolvers — the finally block runs on both normal completion and client disconnect. Test disconnect handling explicitly: connect a client, verify the subscription metric increases, force-close the connection, verify the metric decreases. Alert on monotonically increasing subscription counts.


  8. Exposing introspection in production — giving attackers a free schema blueprint. __schema and __type queries return your full schema: every type, every field, every argument, every deprecated field with its deprecation reason (which often contains migration instructions that reveal internal design). This is a comprehensive map of your API surface, delivered on demand to anyone who can reach the endpoint.

Fix: Disable introspection in production: introspection: process.env.NODE_ENV !== 'production'. If your internal tooling (Apollo Studio, GraphiQL for internal use) requires introspection, gate it behind authentication: allow introspection only for requests with a valid admin token. Publish your schema through a schema registry instead of exposing it via introspection.
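The gate can be reduced to one predicate (a sketch; with Apollo Server this feeds the introspection constructor option, or you can reject introspection fields with graphql-js's NoSchemaIntrospectionCustomRule validation rule):

```typescript
// Allow introspection everywhere except production, where only
// authenticated internal tooling (e.g. an admin token) may introspect.
function introspectionAllowed(nodeEnv: string, isAdmin: boolean): boolean {
  if (nodeEnv !== "production") return true;
  return isAdmin;
}
```

Wiring it up would look like `introspection: introspectionAllowed(process.env.NODE_ENV ?? "development", isAdminRequest)` in the server config, with `isAdminRequest` derived from your auth layer.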


  9. Not using persisted queries for public APIs — accepting arbitrary query injection. Allowing clients to send any query string is equivalent to accepting arbitrary SQL on a REST endpoint. A malicious or automated client can probe your schema depth, identify expensive operations, and craft queries that maximize server load. The query surface is as wide as your schema.

Fix: Implement Automatic Persisted Queries (APQ) for production traffic. APQ lets clients send a SHA-256 hash of the query; the server caches known queries and rejects unknown hashes after the initial registration. This limits the executable query set to what your clients actually ship. For especially sensitive APIs, use a pre-approved query allowlist (server-side registered queries only, no APQ registration flow).
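The server side of that hash flow, reduced to a Map as a sketch (the real APQ protocol also defines the client retry that carries the full query after a PERSISTED_QUERY_NOT_FOUND response; for a strict allowlist, registerQuery would run only at deploy time):

```typescript
import { createHash } from "node:crypto";

// Known queries, keyed by their SHA-256 hash.
const knownQueries = new Map<string, string>();

function sha256(query: string): string {
  return createHash("sha256").update(query).digest("hex");
}

// APQ registration: the first request carries both hash and full query.
function registerQuery(query: string): string {
  const hash = sha256(query);
  knownQueries.set(hash, query);
  return hash;
}

// Subsequent requests carry only the hash; unknown hashes are rejected.
function lookupQuery(hash: string): string {
  const query = knownQueries.get(hash);
  if (!query) {
    throw new Error("PERSISTED_QUERY_NOT_FOUND");
  }
  return query;
}
```

After warm-up, the executable query set is exactly what your clients ship, and probing requests with novel query strings never reach the executor.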


  10. Ignoring resolver error boundaries — one crashing resolver kills the entire request. An unhandled exception in a resolver for a non-critical field — say, User.avatarUrl — can propagate up and cause the entire query to fail, returning a 500 with no data at all. The client loses the user record, the orders list, and everything else it requested, just because an avatar fetch failed.

Fix: Wrap resolvers in error boundaries. For non-critical nullable fields, catch exceptions explicitly, log them, and return null with a non-fatal error added to the errors array. Reserve hard failures (throwing uncaught) for mutations and required fields where a partial result would be misleading. Use formatError to ensure internal exceptions are sanitized before reaching the client. Test fault isolation explicitly: mock a non-critical resolver to throw, verify other fields still resolve.
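A sketch of such a boundary for non-critical nullable fields; ctx.errors stands in for the per-request non-fatal error collector, and all names are illustrative:

```typescript
type Resolver<T> = () => Promise<T>;

interface RequestErrors {
  errors: { path: string; message: string }[];
}

// Wrap a non-critical resolver: on failure, record a sanitized non-fatal
// error and return null so sibling fields still resolve.
function nonCritical<T>(path: string, resolve: Resolver<T>, ctx: RequestErrors) {
  return async (): Promise<T | null> => {
    try {
      return await resolve();
    } catch (err) {
      // Log `err` internally; surface only a sanitized message.
      ctx.errors.push({ path, message: "Failed to resolve field" });
      return null;
    }
  };
}
```

The test below is the fault-isolation drill from the Fix: one wrapped resolver throws, its sibling still resolves, and the failure is recorded instead of failing the whole request.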

