HashiCorp Consul - Street-Level Ops¶
The commands and patterns that experienced Consul operators reach for first — and the gotchas that bite you if you skip the theory.
Core Diagnostic Commands¶
Cluster Membership¶
# Show all agents (servers + clients) known to the cluster
consul members
# Filter to servers only
consul members -status=alive | grep server
# Check WAN members (multi-DC)
consul members -wan
# Inspect Raft state: who is the leader, what are follower lag counts
consul operator raft list-peers
list-peers output shows each server's node name, ID, address, Raft state (leader/follower), and voter status; recent Consul versions also print the commit index and how far each follower trails the leader. A large, growing gap between a follower and the leader's commit index indicates replication lag — investigate disk I/O or network issues.
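The lag check can be scripted for triage. This is a sketch only: the column positions assumed in the awk program (state in column 4, commit index in column 6) vary by Consul version, so verify them against your own list-peers output first.

```shell
# Sketch: flag followers whose commit index trails the leader's by more than
# a threshold. Column positions are an ASSUMPTION — check them against your
# Consul version's `list-peers` output before relying on this.
lag_check() {
  threshold="$1"
  awk -v t="$threshold" '
    # Assumed columns: Node ID Address State Voter CommitIndex
    $4 == "leader"   { leader_ci = $6 }
    $4 == "follower" { ci[$1] = $6 }
    END {
      for (n in ci)
        if (leader_ci - ci[n] > t)
          printf "%s trails leader by %d commits\n", n, leader_ci - ci[n]
    }'
}

# Usage (requires a running agent):
# consul operator raft list-peers | lag_check 1000
```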
Service Catalog¶
# List all registered services (just names)
consul catalog services
# List services with tags
consul catalog services -tags
# List nodes registered in the catalog
consul catalog nodes
# Nodes providing a specific service
consul catalog nodes -service=web
# Health of a service, all instances (health is exposed via the HTTP API, not a CLI subcommand)
curl http://localhost:8500/v1/health/service/web
# Only healthy instances
curl 'http://localhost:8500/v1/health/service/web?passing'
KV Operations¶
# Read a key
consul kv get config/web/max_connections
# Write a key
consul kv put config/web/max_connections 200
# Read all keys under a prefix
consul kv get -recurse config/web/
# Export entire KV tree as JSON
consul kv export config/ > kv-backup.json
# Import KV tree
consul kv import @kv-backup.json
# Delete a key
consul kv delete config/web/old_setting
# Delete an entire prefix tree
consul kv delete -recurse config/old/
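With concurrent writers, a plain kv put is last-write-wins. A check-and-set update reads the key's ModifyIndex and writes only if it is unchanged. A minimal sketch — the awk extraction assumes the `-detailed` output prints ModifyIndex on its own line, which current versions do:

```shell
# Sketch: check-and-set update of a KV key.
modify_index() {
  # Assumes `consul kv get -detailed` prints a "ModifyIndex  <n>" line
  awk '$1 == "ModifyIndex" { print $2 }'
}

cas_put() {
  key="$1"; value="$2"
  idx=$(consul kv get -detailed "$key" | modify_index)
  # -cas with -modify-index fails instead of silently overwriting a
  # concurrent write — retry from the read on failure
  consul kv put -cas -modify-index="$idx" "$key" "$value"
}

# Usage (requires a running agent):
# cas_put config/web/max_connections 250
```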
Connect / Intentions¶
# Check if source can reach destination
consul intention check web api
# List all intentions
consul intention list
# Create allow intention (positional: source destination)
consul intention create web api
# Create deny intention
consul intention create -deny web db
# Delete an intention
consul intention delete web db
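Since Consul 1.9, intentions are stored as service-intentions config entries, and the config-entry form expresses L7 rules the CLI cannot. A sketch of the equivalent allow/deny pair as a config entry (service names here are illustrative):

```shell
# Sketch: manage intentions declaratively as a service-intentions config
# entry (Consul 1.9+). Allows "web" and denies "public" for service "api".
cat > api-intentions.json <<'EOF'
{
  "Kind": "service-intentions",
  "Name": "api",
  "Sources": [
    { "Name": "web",    "Action": "allow" },
    { "Name": "public", "Action": "deny" }
  ]
}
EOF

# Apply and verify (requires a running agent):
# consul config write api-intentions.json
# consul config read -kind service-intentions -name api
```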
Snapshots¶
# Save a cluster snapshot (includes KV, ACLs, catalog, sessions)
consul snapshot save backup-$(date +%Y%m%d-%H%M%S).snap
# Inspect a snapshot (metadata only, no restore)
consul snapshot inspect backup-20260319-120000.snap
# Restore from snapshot (overwrites ALL current cluster state — plan a maintenance window)
consul snapshot restore backup-20260319-120000.snap
Snapshots are binary. Take them before every Consul server upgrade and after major KV changes.
Default trap: a snapshot captures ACL state only as of the moment it was taken. Restoring rolls ACLs back too — tokens created after the snapshot disappear and must be re-created, and SecretIDs mirrored into Vault or another secret manager may no longer match what Consul holds. Always verify token functionality after a restore.
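The take-a-snapshot-before-every-upgrade habit is easy to automate. A minimal cron-able sketch — the backup directory and retention count are arbitrary local choices, not Consul settings:

```shell
# Sketch: timestamped snapshot with simple retention.
# SNAP_DIR and KEEP are local conventions, not Consul defaults.
SNAP_DIR="${SNAP_DIR:-/var/backups/consul}"
KEEP="${KEEP:-7}"

prune_snapshots() {
  dir="$1"; keep="$2"
  # List newest first; delete everything after the first $keep files
  ls -1t "$dir"/*.snap 2>/dev/null | tail -n +$((keep + 1)) | xargs -r rm -f
}

take_snapshot() {
  mkdir -p "$SNAP_DIR"
  consul snapshot save "$SNAP_DIR/backup-$(date +%Y%m%d-%H%M%S).snap"
  prune_snapshots "$SNAP_DIR" "$KEEP"
}

# take_snapshot   # requires a running agent and snapshot permissions
```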
Incident Runbooks¶
Split-Brain / No Leader Elected¶
Symptoms: consul members shows some servers in failed state; API calls return "No cluster leader".
1. Count alive servers: consul members | grep server | grep alive | wc -l
- If count < quorum (e.g., 1 alive in a 3-server cluster), you have lost quorum
- Do NOT restart all servers simultaneously — bring up servers one at a time
2. Check Raft state on each surviving server:
consul operator raft list-peers
3. If a dead server is stuck in the peer set, remove it (this requires a functioning leader):
consul operator raft remove-peer -id=<dead-node-id>
# If quorum is permanently lost and no leader can be elected, remove-peer cannot help — use peers.json recovery: stop the surviving servers, write raft/peers.json listing only the survivors into each data directory, and restart them
4. Last resort — wipe and restore from snapshot:
a. Stop all Consul server agents
b. Remove data/ directories
c. Start servers fresh
d. consul snapshot restore <latest-snap>
5. Post-incident: identify why you lost quorum
- Even number of servers? (even counts raise split-brain risk on a network partition without raising failure tolerance)
- Disk full? (Raft log cannot be appended)
- EC2 spot interruptions? (server terminated without graceful leave)
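The supported way to force a quorum after permanent server loss is peers.json recovery: with all servers stopped, place a raft/peers.json file listing only the surviving servers in each data directory, then start them. A sketch, assuming Raft protocol 3 (the id/address/non_voter format); the node ID, address, and data_dir below are placeholders for your survivors' real values:

```shell
# Sketch: generate a Raft-protocol-3 peers.json for outage recovery.
# The id and address are PLACEHOLDERS — use each surviving server's real
# node ID (node-id file, or earlier list-peers output) and RPC address.
write_peers_json() {
  data_dir="$1"
  mkdir -p "$data_dir/raft"
  cat > "$data_dir/raft/peers.json" <<'EOF'
[
  {
    "id": "3e86ebab-92ab-49f4-9a86-aabbccddeeff",
    "address": "10.0.0.1:8300",
    "non_voter": false
  }
]
EOF
}

# With ALL servers stopped, run on each survivor against its data_dir:
# write_peers_json /opt/consul   # /opt/consul is an assumed data_dir
# systemctl start consul         # servers elect a leader from this list
```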
Stale Service Registrations (Ghost Services)¶
Symptoms: DNS returns IPs for services that are not running; load balancer hits dead instances.
1. Identify stale services:
curl http://localhost:8500/v1/health/service/<name>
# Look for instances stuck in "critical" state for an extended period
2. Manually deregister if the node is gone:
curl -X PUT http://localhost:8500/v1/agent/service/deregister/<service-id>
3. Check whether deregister_critical_service_after is set in the check definition — it lives in the service registration (agent config file or registration API payload), not in KV
# If missing, ghost services accumulate after crashes
4. Enable auto-deregistration in service config:
"check": {
"deregister_critical_service_after": "60s"
}
5. If the node itself is gone, remove it from the cluster:
consul force-leave <dead-node>
# For externally registered nodes (no agent), use the catalog API instead:
curl -X PUT -d '{"Datacenter": "dc1", "Node": "<dead-node>"}' http://localhost:8500/v1/catalog/deregister
ACL Bootstrap Failure¶
Symptoms: consul acl bootstrap returns "ACL bootstrap no longer allowed" — someone already bootstrapped.
1. Check if a bootstrap token exists in your secrets manager (Vault, AWS Secrets Manager)
— it should have been stored at cluster creation time
2. If truly lost: use the bootstrap reset procedure (Consul 1.4+) — the servers stay running:
a. Run consul acl bootstrap once; the 403 error message includes the current reset index
b. On the leader, write that index to the acl-bootstrap-reset file in its data directory:
echo <index> > <data-dir>/acl-bootstrap-reset
c. Run consul acl bootstrap again — it now succeeds and returns a fresh bootstrap token
3. After recovery, immediately rotate the bootstrap token and store it in Vault.
4. Prevention: use Terraform or Vault to automate token generation at cluster creation.
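The reset index can be scraped from the failed bootstrap attempt. A sketch — the error-text format parsed here ("...reset index: <n>") is an assumption to verify against your Consul version, and the data_dir path is a placeholder:

```shell
# Sketch: ACL bootstrap reset. Parses the reset index out of the error
# from a failed `consul acl bootstrap` (message format ASSUMED), writes it
# to the acl-bootstrap-reset file in the LEADER's data_dir, re-bootstraps.
reset_index_from_error() {
  sed -n 's/.*reset index: \([0-9][0-9]*\).*/\1/p'
}

# On the leader (/opt/consul is an assumed data_dir):
# idx=$(consul acl bootstrap 2>&1 | reset_index_from_error)
# echo "$idx" > /opt/consul/acl-bootstrap-reset
# consul acl bootstrap   # returns a new bootstrap token — store it in Vault
```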
Snapshot Restore Not Taking Effect¶
1. Verify the snapshot is valid before restoring:
consul snapshot inspect <file>
2. The restore API requires a leader:
consul operator raft list-peers # confirm leader exists
3. Restore via API (the CLI wraps this):
curl -X PUT --data-binary @backup.snap \
http://localhost:8500/v1/snapshot
4. After restore, Consul broadcasts the new state to all servers.
Wait 30 seconds, then verify with:
consul kv get -recurse config/
consul catalog services
Operational Patterns¶
Health Check Tuning¶
Aggressive intervals waste CPU and fill logs; lenient intervals mean slow failure detection.
| Service Type | Recommended interval | Recommended timeout | deregister_after |
|---|---|---|---|
| Web/API | 10s | 3s | 90s |
| Database | 15s | 5s | 120s |
| Background worker | 30s | 10s | 300s |
| Batch job | 60s | 30s | 600s |
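The table values drop straight into a service registration. A sketch for the web-tier row — the service name, port, and health URL are hypothetical:

```shell
# Sketch: web-tier registration using the table's recommended check values.
# Name, port, and /healthz URL are hypothetical.
cat > web.json <<'EOF'
{
  "service": {
    "name": "web",
    "port": 8080,
    "check": {
      "http": "http://localhost:8080/healthz",
      "interval": "10s",
      "timeout": "3s",
      "deregister_critical_service_after": "90s"
    }
  }
}
EOF

# Register with a running agent:
# consul services register web.json
```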
For services whose health fluctuates under load (for example, brief connection-pool exhaustion), use a TTL check with a heartbeat instead of HTTP polling — the service controls its own health signal.
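A TTL check works by the service (or a wrapper) PUTting to the agent's check-pass endpoint before the TTL expires; heartbeating at a fraction of the TTL leaves room for missed beats. A sketch — the check ID "service:web" and the 30s TTL are assumptions:

```shell
# Sketch: heartbeat loop for a TTL check. Check ID and TTL are assumptions.
TTL_SECONDS=30

heartbeat_interval() {
  # Beat at 1/3 of the TTL so two consecutive misses are survivable
  echo $(( $1 / 3 ))
}

# while true; do
#   curl -s -X PUT http://localhost:8500/v1/agent/check/pass/service:web
#   sleep "$(heartbeat_interval "$TTL_SECONDS")"
# done
```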
Prepared Queries¶
Prepared queries are saved, parameterized service discovery queries stored in Consul. They support failover, near-affinity routing, and templating.
# Create a prepared query: prefer local DC, fall back to dc2
curl -X POST -d '{
"Name": "web-ha",
"Service": {
"Service": "web",
"Failover": {"NearestN": 2, "Datacenters": ["dc1", "dc2"]}
}
}' http://localhost:8500/v1/query
# Execute it
curl http://localhost:8500/v1/query/web-ha/execute
# Use via DNS
dig @127.0.0.1 -p 8600 web-ha.query.consul
Prepared queries are underused. They are the right answer when you need geo-aware failover without changing application code.
Service Mesh Debugging¶
# List sidecar proxies registered in the catalog (they register as <name>-sidecar-proxy)
consul catalog services | grep sidecar-proxy
# Check Connect config for a service (proxy config)
consul config read -kind service-defaults -name web
# Dump Envoy stats via the admin interface (sidecar listens on 19000 by default)
curl http://localhost:19000/stats | grep upstream
# Check Envoy clusters (upstream services)
curl http://localhost:19000/clusters
# Inspect the Connect CA provider configuration
consul connect ca get-config
# Inspect the leaf certificate a sidecar is actually serving (Envoy admin API)
curl http://localhost:19000/certs
When a Connect-enabled service cannot reach its upstream:
1. Check consul intention check <source> <dest> — intention may deny
2. Check both sidecars are running (kubectl get pods or ps aux | grep envoy)
3. Check Envoy metrics for connection errors (curl localhost:19000/stats | grep cx_none)
4. Verify the destination service is registered and passing health checks
Gossip Encryption Key Rotation¶
Rotate the gossip encryption key without cluster downtime:
# Step 1: Generate a new key
consul keygen
# Step 2: Add the new key (cluster now accepts both old and new)
consul keyring -install <new-key>
# Step 3: Verify all agents have the new key
consul keyring -list
# Step 4: Promote the new key as primary
consul keyring -use <new-key>
# Step 5: Remove the old key
consul keyring -remove <old-key>
Critical: always stage this in a non-production cluster first. If you accidentally remove the active key before promoting the new one, agents will fail to gossip and the cluster will partition.
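Step 3's "verify all agents have the new key" is the step people skip. keyring -list reports per gossip pool how many agents hold each key, so the check can be scripted. A sketch — the "<key>: [have/total]" line format parsed here is an assumption to verify on your version:

```shell
# Sketch: gate `keyring -use` on the new key being installed everywhere.
# ASSUMES `consul keyring -list` prints lines like "<key>: [3/3]" per pool.
key_fully_installed() {
  key="$1"
  awk -v k="$key" '
    index($0, k) {
      # Extract "[have/total]" and compare the two counts
      if (match($0, /\[[0-9]+\/[0-9]+\]/)) {
        split(substr($0, RSTART + 1, RLENGTH - 2), c, "/")
        if (c[1] != c[2]) bad = 1
        seen = 1
      }
    }
    END { exit (seen && !bad) ? 0 : 1 }'
}

# consul keyring -install "$NEW_KEY"
# if consul keyring -list | key_fully_installed "$NEW_KEY"; then
#   consul keyring -use "$NEW_KEY"
# else
#   echo "new key not yet on all agents — do NOT promote"
# fi
```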
Remember Consul's quorum requirement: a 3-node cluster tolerates 1 failure, a 5-node cluster tolerates 2. Avoid even server counts — a 4-node cluster still tolerates only 1 failure (quorum = 3) while costing an extra server. The formula: failures tolerated = floor((N - 1) / 2).
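The arithmetic, pinned down with shell integer division:

```shell
# Quorum math: quorum = floor(N/2) + 1, failures tolerated = N - quorum
# (which equals floor((N - 1) / 2)).
quorum()    { echo $(( $1 / 2 + 1 )); }
tolerated() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 3 4 5; do
  echo "$n servers: quorum $(quorum "$n"), tolerates $(tolerated "$n") failure(s)"
done
```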
Reference: Key Ports¶
| Port | Protocol | Purpose |
|---|---|---|
| 8300 | TCP | Server RPC (Raft, queries) |
| 8301 | TCP/UDP | LAN gossip (Serf) |
| 8302 | TCP/UDP | WAN gossip (Serf, multi-DC) |
| 8500 | TCP | HTTP API and UI |
| 8501 | TCP | HTTPS API (if TLS enabled) |
| 8600 | TCP/UDP | DNS interface |
| 19000 | TCP | Envoy admin (per sidecar, local only) |
| 21000+ | TCP | Envoy inbound/outbound proxy listeners |
Related Resources¶
Related Content¶
- Secrets Management (Topic Pack, L2)
- Envoy Proxy (Topic Pack, L2)