

HashiCorp Consul - Street-Level Ops

The commands and patterns that experienced Consul operators reach for first — and the gotchas that bite you if you skip the theory.


Core Diagnostic Commands

Cluster Membership

# Show all agents (servers + clients) known to the cluster
consul members

# Filter to servers only
consul members -status=alive | grep server

# Check WAN members (multi-DC)
consul members -wan

# Inspect Raft state: who is the leader, what are follower lag counts
consul operator raft list-peers

list-peers output includes each server's address, node ID, suffrage (Voter/Nonvoter), and commit index. A large delta between the leader's CommitIndex and a follower's LastApplied indicates replication lag — investigate disk I/O or network issues.
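The lag check can be scripted. This sketch parses a hypothetical capture of list-peers output (the real column set and order vary by Consul version, so verify the awk field numbers against your output before relying on it):

```shell
# Hypothetical capture of `consul operator raft list-peers` output.
# Columns assumed: Node ID Address State Voter CommitIndex.
cat <<'EOF' >/tmp/raft-peers.txt
Node   ID    Address        State     Voter  CommitIndex
srv-a  aa11  10.0.0.1:8300  leader    true   105233
srv-b  bb22  10.0.0.2:8300  follower  true   105233
srv-c  cc33  10.0.0.3:8300  follower  true   104890
EOF

# Extract the leader's commit index, then print each follower's lag behind it.
leader_idx=$(awk '$4 == "leader" {print $6}' /tmp/raft-peers.txt)
awk -v l="$leader_idx" '$4 == "follower" {print $1, "lag:", l - $6}' /tmp/raft-peers.txt
# srv-b lag: 0
# srv-c lag: 343
```

A persistent nonzero lag on one follower points at that node's disk or network, not the cluster as a whole.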

Service Catalog

# List all registered services (just names)
consul catalog services

# List services with tags
consul catalog services -tags

# List nodes registered in the catalog
consul catalog nodes

# Nodes providing a specific service
consul catalog nodes -service=web

# Health of a service (all instances)
consul health service web

# Only healthy instances
consul health service web -passing

KV Operations

# Read a key
consul kv get config/web/max_connections

# Write a key
consul kv put config/web/max_connections 200

# Read all keys under a prefix
consul kv get -recurse config/web/

# Export entire KV tree as JSON
consul kv export config/ > kv-backup.json

# Import KV tree
consul kv import @kv-backup.json

# Delete a key
consul kv delete config/web/old_setting

# Delete an entire prefix tree
consul kv delete -recurse config/old/
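Exported values are base64-encoded, which makes backups hard to eyeball. A quick spot-check sketch, run against a hypothetical one-entry export (MjAw is base64 for "200"):

```shell
# Hypothetical `consul kv export` output: a JSON array of key/flags/value
# objects, with each value base64-encoded.
cat <<'EOF' >/tmp/kv-backup.json
[
  {"key": "config/web/max_connections", "flags": 0, "value": "MjAw"}
]
EOF

# Decode the values to spot-check the backup before trusting it.
grep -o '"value": "[^"]*"' /tmp/kv-backup.json | cut -d'"' -f4 | base64 -d
# 200
```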

Connect / Intentions

# Check if source can reach destination
consul intention check web api

# List all intentions
consul intention list

# Create allow intention (positional: source destination)
consul intention create web api

# Create deny intention
consul intention create -deny web db

# Delete an intention
consul intention delete web db

Snapshots

# Save a cluster snapshot (includes KV, ACLs, catalog, sessions)
consul snapshot save backup-$(date +%Y%m%d-%H%M%S).snap

# Inspect a snapshot (metadata only, no restore)
consul snapshot inspect backup-20260319-120000.snap

# Restore from snapshot (overwrites current cluster state; plan a maintenance window)
consul snapshot restore backup-20260319-120000.snap

Snapshots are binary. Take them before every Consul server upgrade and after major KV changes.

Default trap: restoring a snapshot reverts ACL state to the moment the snapshot was taken. Tokens created after that point disappear, and SecretIDs recorded only in Vault or an external secret manager may no longer match any live token. Always verify token functionality after a restore.


Incident Runbooks

Split-Brain / No Leader Elected

Symptoms: consul members shows some servers in failed state; API calls return "No cluster leader".

1. Count alive servers: consul members | grep server | grep alive | wc -l
   - If count < quorum (e.g., 1 alive in a 3-server cluster), you have lost quorum
   - Do NOT restart all servers simultaneously; bring up servers one at a time

2. Check Raft state on each surviving server:
   consul operator raft list-peers

3. If dead servers remain in the Raft configuration, remove them explicitly:
   consul operator raft remove-peer -id=<dead-node-id>
   # remove-peer requires a functioning leader; if no leader can be elected,
   # use the peers.json manual recovery procedure on the surviving servers

4. Last resort: wipe and restore from snapshot:
   a. Stop all Consul server agents
   b. Remove data/ directories
   c. Start servers fresh
   d. consul snapshot restore <latest-snap>

5. Post-incident: identify why you lost quorum
   - Odd number of servers? (an even count can lose quorum on a 50/50 partition while tolerating no extra failures)
   - Disk full? (Raft log cannot be appended)
   - EC2 spot interruptions? (server terminated without graceful leave)

Stale Service Registrations (Ghost Services)

Symptoms: DNS returns IPs for services that are not running; load balancer hits dead instances.

1. Identify stale services:
   consul health service <name>
   # Look for services in "critical" state for an extended period

2. Manually deregister if the node is gone:
   curl -X PUT http://localhost:8500/v1/agent/service/deregister/<service-id>

3. Check if deregister_critical_service_after is set in the service definition:
   curl http://localhost:8500/v1/agent/checks
   # Service definitions live in agent config files (e.g. /etc/consul.d/),
   # not in KV. If the field is missing, ghost services accumulate after crashes

4. Enable auto-deregistration in service config:
   "check": {
     "deregister_critical_service_after": "60s"
   }

5. If the node itself is gone but still in the catalog, deregister it via the API:
   curl -X PUT -d '{"Datacenter": "dc1", "Node": "<dead-node>"}' \
     http://localhost:8500/v1/catalog/deregister
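Step 4's fragment in context: a complete service definition with auto-deregistration might look like the following. The name, port, and health path are hypothetical; adjust them to your service:

```shell
# Hypothetical service definition with auto-deregistration. Drop it into the
# agent's config dir (e.g. /etc/consul.d/) and run `consul reload`.
mkdir -p /tmp/consul.d
cat <<'EOF' >/tmp/consul.d/web.json
{
  "service": {
    "name": "web",
    "port": 8080,
    "check": {
      "http": "http://localhost:8080/health",
      "interval": "10s",
      "timeout": "3s",
      "deregister_critical_service_after": "60s"
    }
  }
}
EOF

# Sanity-check the JSON before reloading (local stand-in for `consul validate`).
python3 -m json.tool /tmp/consul.d/web.json >/dev/null && echo "definition parses"
```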

ACL Bootstrap Failure

Symptoms: consul acl bootstrap returns "ACL bootstrap no longer allowed" — someone already bootstrapped.

1. Check if a bootstrap token exists in your secrets manager (Vault, AWS Secrets Manager)
   — it should have been stored at cluster creation time

2. If truly lost: use the bootstrap reset procedure (Consul 1.4+), with the servers left running:
   a. Run consul acl bootstrap once more; the error message reports a reset index
   b. On the current leader, write that index into the reset file in the data directory:
      echo <index> > <data-dir>/acl-bootstrap-reset
   c. Run consul acl bootstrap again; it generates and returns a fresh bootstrap token

3. After recovery, immediately rotate the bootstrap token and store it in Vault.

4. Prevention: use Terraform or Vault to automate token generation at cluster creation.

Snapshot Restore Not Taking Effect

1. Verify the snapshot is valid before restoring:
   consul snapshot inspect <file>

2. The restore API requires a leader:
   consul operator raft list-peers  # confirm leader exists

3. Restore via API (the CLI wraps this):
   curl -X PUT --data-binary @backup.snap \
     http://localhost:8500/v1/snapshot

4. After restore, the leader replicates the restored state to the other servers via Raft.
   Wait 30 seconds, then verify with:
   consul kv get -recurse config/
   consul catalog services

Operational Patterns

Health Check Tuning

Aggressive intervals waste CPU and fill logs; lenient intervals mean slow failure detection.

Service type       Interval (recommended)  Timeout (recommended)  deregister_after
Web/API            10s                     3s                     90s
Database           15s                     5s                     120s
Background worker  30s                     10s                    300s
Batch job          60s                     30s                    600s

For services that accept bursts of connections (e.g., connection pool exhaustion), use a TTL check with a heartbeat instead of HTTP polling — the service controls its own health signal.
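A sketch of such a TTL-based definition, assuming a hypothetical service named web (the check ID in the heartbeat URL is the auto-generated one; confirm it against your agent's /v1/agent/checks output):

```shell
# Hypothetical TTL-based service definition: the service must heartbeat
# within each 30s window or the check turns critical.
mkdir -p /tmp/consul.d
cat <<'EOF' >/tmp/consul.d/web-ttl.json
{
  "service": {
    "name": "web",
    "port": 8080,
    "check": {
      "ttl": "30s",
      "deregister_critical_service_after": "90s"
    }
  }
}
EOF

python3 -m json.tool /tmp/consul.d/web-ttl.json >/dev/null && echo "JSON OK"

# The service then reports its own health from inside its request loop, e.g.:
#   curl -X PUT http://localhost:8500/v1/agent/check/pass/service:web
```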

Prepared Queries

Prepared queries are saved, parameterized service discovery queries stored in Consul. They support failover, near-affinity routing, and templating.

# Create a prepared query: prefer local DC, fall back to dc2
curl -X POST -d '{
  "Name": "web-ha",
  "Service": {
    "Service": "web",
    "Failover": {"NearestN": 2, "Datacenters": ["dc1", "dc2"]}
  }
}' http://localhost:8500/v1/query

# Execute it
curl http://localhost:8500/v1/query/web-ha/execute

# Use via DNS
dig @127.0.0.1 -p 8600 web-ha.query.consul

Prepared queries are underused. They are the right answer when you need geo-aware failover without changing application code.

Service Mesh Debugging

# List all Envoy sidecar proxies registered in the catalog
# (Connect registers them as <service>-sidecar-proxy)
consul catalog services | grep sidecar-proxy

# Check Connect config for a service (proxy config)
consul config read -kind service-defaults -name web

# Dump Envoy stats via the admin interface (sidecar listens on 19000 by default)
curl http://localhost:19000/stats | grep upstream

# Check Envoy clusters (upstream services)
curl http://localhost:19000/clusters

# Inspect the Connect CA configuration (provider and root settings)
consul connect ca get-config

# Fetch the leaf certificate issued to a service (via the agent API)
curl http://localhost:8500/v1/agent/connect/ca/leaf/web

When a Connect-enabled service cannot reach its upstream:
1. Check consul intention check <source> <dest> (an intention may deny the connection)
2. Check both sidecars are running (kubectl get pods or ps aux | grep envoy)
3. Check Envoy metrics for connection errors (curl localhost:19000/stats | grep cx_none)
4. Verify the destination service is registered and passing health checks
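Step 3 can be turned into a quick triage filter. This sketch runs against a hypothetical saved excerpt of the sidecar's /stats output (stat names follow Envoy's cluster.<name>.* convention; the 5% threshold is an arbitrary example):

```shell
# Hypothetical excerpt of Envoy /stats output captured from the admin port.
cat <<'EOF' >/tmp/envoy-stats.txt
cluster.api.upstream_cx_connect_fail: 17
cluster.api.upstream_cx_total: 240
cluster.api.upstream_rq_timeout: 3
EOF

# Flag the upstream if connect failures exceed 5% of total connections.
awk -F': ' '/upstream_cx_connect_fail/ {f=$2} /upstream_cx_total/ {t=$2}
  END { if (t > 0 && f / t > 0.05) print "connect-fail ratio high:", f "/" t }' \
  /tmp/envoy-stats.txt
# connect-fail ratio high: 17/240
```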

Gossip Encryption Key Rotation

Rotate the gossip encryption key without cluster downtime:

# Step 1: Generate a new key
consul keygen

# Step 2: Add the new key (cluster now accepts both old and new)
consul keyring -install <new-key>

# Step 3: Verify all agents have the new key
consul keyring -list

# Step 4: Promote the new key as primary
consul keyring -use <new-key>

# Step 5: Remove the old key
consul keyring -remove <old-key>

Critical: always stage this in a non-production cluster first. If you accidentally remove the active key before promoting the new one, agents will fail to gossip and the cluster will partition.
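Step 3's verification can be scripted so you never promote early. This sketch checks a hypothetical capture of consul keyring -list output, where [have/total] shows how many agents hold each key (the key strings here are fake placeholders):

```shell
# Illustrative `consul keyring -list` capture for a 3-agent datacenter.
cat <<'EOF' >/tmp/keyring.txt
WAN:
  NEWKEYc9Vl8+TTNU0MoqVvA4w== [3/3]
dc1 (LAN):
  NEWKEYc9Vl8+TTNU0MoqVvA4w== [3/3]
  OLDKEY9fQkOnW5nn0hE1bV9Cg== [3/3]
EOF

new_key='NEWKEYc9Vl8+TTNU0MoqVvA4w=='
# Safe to promote only if every listing of the new key is fully propagated.
if grep -F "$new_key" /tmp/keyring.txt | grep -qv '\[3/3\]'; then
  echo "new key NOT on all agents: do not promote yet"
else
  echo "new key fully propagated: safe to run consul keyring -use"
fi
```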

Remember Consul's quorum requirement: a 3-node cluster tolerates 1 failure, a 5-node cluster tolerates 2. Never use even numbers: a 4-node cluster still tolerates only 1 failure (quorum = 3) while costing an extra server. The formula is: failures tolerated = floor((N - 1) / 2).
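The arithmetic is easy to sanity-check in shell (integer division gives the floor for free):

```shell
# quorum = majority of N; failures tolerated = floor((N - 1) / 2)
for n in 1 2 3 4 5 6 7; do
  echo "servers=$n quorum=$(( n / 2 + 1 )) tolerates=$(( (n - 1) / 2 ))"
done
# servers=3 quorum=2 tolerates=1
# servers=4 quorum=3 tolerates=1
# servers=5 quorum=3 tolerates=2
```

Note that 4 servers tolerate no more failures than 3, and 6 no more than 5.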


Reference: Key Ports

Port    Protocol  Purpose
8300    TCP       Server RPC (Raft, queries)
8301    TCP/UDP   LAN gossip (Serf)
8302    TCP/UDP   WAN gossip (Serf, multi-DC)
8500    TCP       HTTP API and UI
8501    TCP       HTTPS API (if TLS enabled)
8600    TCP/UDP   DNS interface
19000   TCP       Envoy admin (per sidecar, local only)
21000+  TCP       Envoy inbound/outbound proxy listeners
