

Consul Footguns

The mistakes that take down clusters, leak data, or create ghost services at 3 AM.


1. Running an Even Number of Server Agents

You deploy 2 or 4 Consul servers because it feels more redundant. On a network partition, both halves of the cluster have the same number of nodes, so neither side can achieve quorum. Both sides refuse to elect a leader. Your entire service discovery layer goes down.

Fix: Always run an odd number of servers: 3 (tolerates 1 failure) or 5 (tolerates 2 failures). Never 2, 4, or 6. Quorum requires floor(n/2)+1 votes, so an odd count guarantees that at most one side of a partition can reach it.

Under the hood: Consul uses the Raft consensus protocol. A 4-node cluster tolerates exactly 1 failure (same as 3 nodes) because quorum for 4 is 3. You pay for an extra server with zero additional fault tolerance. With 3 nodes, quorum is 2, tolerating 1 failure. With 5, quorum is 3, tolerating 2. Even numbers give you the cost of n nodes but only the fault tolerance of n-1.
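The arithmetic above is easy to verify with a short sketch:

```python
# Raft quorum math for a Consul server cluster (illustrative sketch).

def quorum(n: int) -> int:
    """Votes needed to elect a leader or commit a write."""
    return n // 2 + 1

def fault_tolerance(n: int) -> int:
    """Servers that can fail while the cluster still holds quorum."""
    return n - quorum(n)

for n in range(2, 7):
    print(f"{n} servers: quorum={quorum(n)}, tolerates {fault_tolerance(n)} failure(s)")
```

Running it shows why 4 servers buy nothing over 3: both tolerate exactly one failure.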


2. Not Enabling ACLs (Open Cluster)

You skip ACL bootstrapping to "get up and running faster." Anyone on the network can read your entire KV store (which may contain database URLs, API keys, and feature flags), register fake services, or inject malicious health check data.

Fix: Bootstrap ACLs on day 1 before registering any services. Use consul acl bootstrap, store the bootstrap token in Vault or AWS Secrets Manager, then immediately create narrow-scoped tokens for each service and agent. The one-time cost is hours; the blast radius of an open cluster is unlimited.
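A narrow-scoped policy for one service might look like this sketch (the service name web is illustrative):

```hcl
# Illustrative ACL policy: the "web" service may register itself
# and read the rest of the catalog, nothing more.
service "web" {
  policy = "write"
}
service_prefix "" {
  policy = "read"
}
node_prefix "" {
  policy = "read"
}
```

Load it with consul acl policy create -name web-policy -rules @web-policy.hcl, then attach it to a token with consul acl token create -policy-name web-policy.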


3. Health Check Interval Too Aggressive

You set interval: "1s" on 200 services across 50 nodes. The Consul servers receive 10,000 health check results per second. Leader CPU spikes; Raft writes back up; the cluster starts returning stale data or timing out. The health check system meant to improve reliability now causes instability.

Fix: Use realistic intervals: 10–30 seconds for web services, 15–60 seconds for databases. For services that can detect their own health faster, use TTL checks where the service pushes status, rather than having Consul poll. Reserve 1-second intervals for critical, low-volume services only.
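A service registration combining a realistic poll interval with a push-based TTL check might look like this sketch (service name, port, and endpoint are illustrative):

```json
{
  "service": {
    "name": "web",
    "port": 8080,
    "checks": [
      {
        "name": "HTTP health",
        "http": "http://localhost:8080/health",
        "interval": "15s",
        "timeout": "3s"
      },
      {
        "name": "app heartbeat",
        "ttl": "30s"
      }
    ]
  }
}
```

With the TTL check, the application reports its own status via PUT /v1/agent/check/pass/<check-id>; if it stops reporting for 30 seconds, Consul marks the check critical without any polling load on the servers.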


4. Forgetting deregister_critical_service_after

A service crashes. Its node is still up and the client agent is still running, so Consul marks the service critical but keeps it in the catalog. DNS stops returning it (because you filter on passing), but the catalog entry remains. Hours later you have hundreds of ghost service registrations from crashed containers. Catalog queries slow down; operators cannot tell what is actually running.

Fix: Set deregister_critical_service_after on every health check, tuned to how long replacing a crashed instance should take. A Docker container: "60s". A VM: "10m". A batch job: "1h". This is the most commonly omitted health check field.

Default trap: Consul has no default for deregister_critical_service_after — if you don't set it, the service stays in the catalog as critical forever. The catalog grows unbounded. After months of container churn, you can end up with thousands of ghost entries that slow DNS and API queries. There is no built-in garbage collection for this.
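A check definition with the field set might look like this sketch (name and endpoint are illustrative):

```json
{
  "check": {
    "name": "api health",
    "http": "http://localhost:9090/health",
    "interval": "15s",
    "timeout": "3s",
    "deregister_critical_service_after": "60s"
  }
}
```

With this in place, a crashed container's registration is removed automatically about a minute after its check goes critical, instead of lingering forever.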


5. Keeping the Default Datacenter Name (dc1) in a Multi-DC Setup

You leave the datacenter name as dc1, Consul's built-in default. Later you add a second datacenter. Now your naming is inconsistent: some services are registered under dc1 while others live under real names, and DNS queries such as web.service.us-east-1.consul fail because those services are actually in dc1. Renaming a datacenter after services are registered requires a full re-registration.

Fix: Choose meaningful, lowercase datacenter names before deploying (us-east-1, prod-eu). Set datacenter in the Consul agent config at cluster initialization. Renaming later is painful: datacenter names are baked into service registrations, ACL policies, prepared queries, and intentions.
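The setting itself is a single line in the agent config; the name us-east-1 here is illustrative:

```hcl
# Set before the first agent starts; renaming later means re-registering everything.
datacenter = "us-east-1"
```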


6. Not Taking Snapshots Before Upgrades

You upgrade Consul servers from 1.15 to 1.17 without a snapshot. The upgrade hits a Raft log migration bug. The cluster fails to start after the upgrade. You have no snapshot, so you lose all KV data, ACL tokens, intentions, and prepared queries. Re-registration of services restores the catalog, but application configs, feature flags, and distributed locks stored in KV are gone.

Fix: consul snapshot save pre-upgrade-$(date +%Y%m%d-%H%M%S).snap before every upgrade, every time. Takes 5 seconds. Store the snapshot off-node (S3, GCS, a separate host). Verify the snapshot with consul snapshot inspect after saving.
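The routine above can be scripted as a pre-upgrade step; the S3 bucket name is a placeholder:

```shell
#!/bin/sh
set -e  # abort upgrade prep if any step fails

SNAP="pre-upgrade-$(date +%Y%m%d-%H%M%S).snap"

consul snapshot save "$SNAP"      # captures KV, ACL tokens, intentions, prepared queries
consul snapshot inspect "$SNAP"   # verify the snapshot is readable before trusting it

# Ship it off-node; bucket name is a placeholder.
aws s3 cp "$SNAP" "s3://my-consul-backups/$SNAP"
```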


7. Gossip Encryption Key Rotation Without Staging

You decide to rotate the gossip encryption key in production during business hours. You install the new key, then immediately remove the old key before all agents have received the update. Agents that have not yet pulled the new key can no longer gossip. Half your client agents go dark. Services on those nodes start timing out.

Fix: Key rotation is a four-step process with mandatory verification between each step: (1) install new key, (2) verify all agents have it via consul keyring -list, (3) promote new key, (4) remove old key. Run through the entire procedure in a staging cluster first. Never remove the old key until consul keyring -list confirms every agent has received the new one.
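The four steps map to keyring subcommands like this sketch; OLD_KEY stands in for the current primary key:

```shell
# Generate the replacement key.
NEW_KEY="$(consul keygen)"

consul keyring -install "$NEW_KEY"   # 1. distribute the new key to every agent
consul keyring -list                 # 2. verify every agent now holds both keys
consul keyring -use "$NEW_KEY"       # 3. promote the new key for outgoing gossip
# 4. only after -list confirms every agent has the new key:
consul keyring -remove "$OLD_KEY"    # OLD_KEY is a placeholder for the previous key
```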


8. Intentions Default-Allow in Production

You deploy Connect without configuring intentions. When ACLs are disabled or the ACL default policy is allow, Consul's intention default (no matching intention) is to permit all service-to-service traffic. You believe you have a zero-trust mesh because you enabled mTLS, but any service can reach any other service. A compromised service can reach your database directly.

Fix: Create a global deny-all intention on day 1: consul intention create -deny '*' '*'. Then explicitly allow the connections you need: consul intention create web api. This inverts the default to zero-trust and forces you to model the actual dependency graph. In Consul 1.9+, you can use config entries with service-intentions for more expressive policies.
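As a config entry, the same allow-list posture might look like this sketch (service names are illustrative); apply it with consul config write:

```hcl
# Intentions for the "api" destination: web may connect, everyone else is denied.
Kind = "service-intentions"
Name = "api"
Sources = [
  {
    Name   = "web"
    Action = "allow"
  },
  {
    Name   = "*"
    Action = "deny"
  }
]
```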

Gotcha: Enabling mTLS (Connect) and having intentions are separate concerns. mTLS provides encryption and identity verification — it proves who is talking. Intentions provide authorization — they decide who is allowed to talk. mTLS without intentions is like having ID badges but no access control lists. Every badge-holder gets into every room.


9. Not Setting Session TTL on KV Locks (Orphaned Locks)

Your application acquires a KV lock using a Consul session but does not set a TTL. The application crashes without releasing the lock. Because there is no TTL, the session persists indefinitely. The lock is permanently held. Any other instance attempting to acquire the same lock is blocked forever. Your leader-election mechanism is now stuck.

Fix: Always specify a TTL when creating sessions: "TTL": "30s". The TTL must be between 10 seconds and 86400 seconds. Pair the TTL with a heartbeat: the lock holder should periodically call PUT /v1/session/renew/<session-id> while it holds the lock. If it crashes, the TTL expires and the lock is released automatically.
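A session body with a TTL, sent to PUT /v1/session/create, might look like this sketch (name and values are illustrative):

```json
{
  "Name": "db-migration-lock",
  "TTL": "30s",
  "LockDelay": "15s",
  "Behavior": "release"
}
```

Acquire the lock with PUT /v1/kv/<key>?acquire=<session-id>, and renew the session at roughly half the TTL while holding the lock; if the holder dies, the session expires and the lock frees itself.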


10. Ignoring Anti-Entropy Warnings

Your Consul logs show repeated "Anti-entropy sync slow" or "Skipping service registration" warnings. You ignore them because services seem to be working. Over time, client agents fall out of sync with the server catalog. Services are registered in the local agent's state but not reflected in the catalog. DNS returns incomplete results. The warnings were telling you that the servers were overloaded (high latency writing to Raft) or that the agent's sync loop was timing out.

Fix: Anti-entropy warnings are an early signal of cluster stress. Investigate immediately: check server CPU and disk I/O (iostat -x 1), check Raft health with consul operator raft list-peers and commit latency via the consul.raft.commitTime telemetry metric, and check the number of services per client agent. If you have thousands of services registered on a single client, distribute registration across multiple clients or increase server capacity.



Prerequisites