
HashiCorp Consul - Primer

Why This Matters

Who made it: Consul was created by HashiCorp (Armon Dadgar and Mitchell Hashimoto) and released in 2014. It was designed to solve the service discovery problem that HashiCorp encountered while building their other tools (Vagrant, Packer, Terraform). HashiCorp changed Consul's license to BSL 1.1 in August 2023 (along with all their products). Unlike Terraform (which was forked as OpenTofu), Consul has seen no major fork, because the service mesh space already has strong alternatives (Istio, Linkerd, Cilium).

Consul is HashiCorp's answer to a persistent operational problem: how do services find each other, prove they are healthy, communicate securely, and share configuration — in any environment, at any scale? Unlike purpose-built tools that solve one problem (etcd for KV, Envoy alone for mesh), Consul bundles service discovery, health checking, a KV store, and a full service mesh with mTLS into a single, coherent system. It runs on bare metal, VMs, Kubernetes, and spans multiple datacenters. Understanding Consul means understanding how modern distributed systems stay connected and secure.


Architecture

Server Agents and Client Agents

Consul runs as a cluster of server agents (usually 3 or 5) backed by Raft for consensus, and client agents running on every node that needs service registration or discovery.

┌────────────────────────────────────────┐
│              Consul Cluster            │
│                                        │
│   ┌──────────┐   ┌──────────┐          │
│   │ Server 1 │◄─►│ Server 2 │          │
│   │ (Leader) │   │(Follower)│          │
│   └────┬─────┘   └──────────┘          │
│        │  Raft consensus               │
│   ┌────▼─────┐                         │
│   │ Server 3 │                         │
│   │(Follower)│                         │
│   └──────────┘                         │
│                                        │
│   Client Agents (one per node):        │
│   ┌──────────┐   ┌──────────┐          │
│   │ Client A │   │ Client B │          │
│   │ (app01)  │   │ (app02)  │          │
│   └──────────┘   └──────────┘          │
└────────────────────────────────────────┘

Server agents store all cluster state (service catalog, KV data, ACL tokens). Client agents handle service registration for local services, forward queries to servers, and run health checks.

Raft Consensus

Name origin: Raft was named as a deliberate contrast to Paxos (the previous dominant consensus algorithm, named after the Greek island). Diego Ongaro chose "Raft" because it is "a small craft for sailing" — something simple and understandable, unlike the notoriously complex Paxos. The Raft paper's subtitle is literally "In Search of an Understandable Consensus Algorithm."

Consul uses the Raft consensus algorithm (Ongaro and Ousterhout's 2014 paper) for server-to-server replication. Raft requires a quorum of (n/2)+1 servers to commit a write. With 3 servers you can tolerate 1 failure; with 5 servers, 2 failures. The server elected as leader handles all writes; followers replicate and can serve stale reads.
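The quorum arithmetic above can be sketched directly (illustrative only; the function names are ours, not Consul's):

```python
# Sketch: Raft quorum arithmetic for a Consul server cluster of size n.
def quorum(n: int) -> int:
    """Votes needed to commit a write: floor(n/2) + 1."""
    return n // 2 + 1

def fault_tolerance(n: int) -> int:
    """Servers that can fail while the cluster still commits writes."""
    return n - quorum(n)

for n in (3, 4, 5):
    print(f"{n} servers: quorum={quorum(n)}, tolerates {fault_tolerance(n)} failure(s)")
```

Note that 4 servers tolerate no more failures than 3 (quorum rises to 3), which is why odd cluster sizes are recommended.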

Serf Gossip Protocol

Under the hood: Consul's gossip protocol has two layers: LAN gossip (within a datacenter, all agents) and WAN gossip (between datacenters, server agents only). LAN gossip runs on port 8301, WAN gossip on port 8302. A failure detected by gossip propagates to all nodes in O(log N) time — in a 1,000-node cluster, every node knows about a failure within about 2 seconds.

Serf (also HashiCorp) provides the gossip layer used by both Consul server clusters (LAN gossip) and multi-datacenter communication (WAN gossip). Serf implements the SWIM (Scalable Weakly-consistent Infection-style Membership) protocol to propagate membership events, detect failures, and deliver user events. Gossip is O(log N) — it scales to thousands of nodes without central coordination.
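The O(log N) claim can be seen with a toy model (an idealized sketch, not SWIM itself — real gossip is randomized and tolerates duplicate deliveries):

```python
# Toy model: each round, every informed node tells `fanout` new peers,
# so the informed set roughly doubles per round (fanout=1) -> O(log N).
def gossip_rounds(n: int, fanout: int = 1) -> int:
    """Rounds until all n nodes are informed, assuming no duplicate contacts."""
    informed, rounds = 1, 0
    while informed < n:
        informed = min(n, informed * (1 + fanout))
        rounds += 1
    return rounds

print(gossip_rounds(1000))  # 10 rounds for 1,000 nodes
```

Ten idealized rounds at a sub-second gossip interval is consistent with the "about 2 seconds for 1,000 nodes" figure above.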


Service Discovery

Service Registration

Services register with the local Consul client agent via:

  • Config file — drop a JSON or HCL file in the Consul config directory
  • HTTP API — PUT /v1/agent/service/register
  • Kubernetes — via the Helm chart's connect-inject or sync-catalog controllers

{
  "service": {
    "name": "web",
    "id": "web-01",
    "port": 8080,
    "tags": ["v2"],
    "check": {
      "http": "http://localhost:8080/health",
      "interval": "10s",
      "timeout": "2s",
      "deregister_critical_service_after": "60s"
    }
  }
}
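The same registration can be submitted programmatically. A minimal sketch of building the PUT body for /v1/agent/service/register (the agent API uses capitalized field names; the agent address 127.0.0.1:8500 is the default, adjust for your setup):

```python
import json

# Sketch: the registration above, shaped as the agent API expects it.
registration = {
    "Name": "web",
    "ID": "web-01",
    "Port": 8080,
    "Tags": ["v2"],
    "Check": {
        "HTTP": "http://localhost:8080/health",
        "Interval": "10s",
        "Timeout": "2s",
        "DeregisterCriticalServiceAfter": "60s",
    },
}
body = json.dumps(registration)
url = "http://127.0.0.1:8500/v1/agent/service/register"
# e.g. urllib.request.Request(url, data=body.encode(), method="PUT")
print(url)
```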

DNS Interface

Consul serves a DNS interface on port 8600 by default. Services are reachable via:

<service>.service.<datacenter>.consul
<tag>.<service>.service.<datacenter>.consul
<id>.service.<datacenter>.consul

Example: web.service.dc1.consul returns A records for all healthy web instances. SRV records include port information.

In practice you configure your system resolver to forward .consul queries to 127.0.0.1:8600 (or a local dnsmasq/systemd-resolved stub).

HTTP API

The catalog API exposes everything:

# List healthy instances of a service
GET /v1/health/service/web?passing=true

# Full catalog — all registered services
GET /v1/catalog/services

# Nodes providing a service
GET /v1/catalog/service/web

Health Checks

Consul supports multiple check types:

  • HTTP — GET to an endpoint, passes on 2xx
  • TCP — TCP connect check
  • Script — run a shell command, passes on exit 0
  • TTL — service calls /v1/agent/check/pass/<id> within the TTL
  • gRPC — gRPC health protocol
  • Alias — mirrors another check's status

Health check states: passing, warning, critical. Only passing services are returned by default in DNS and /v1/health/service with ?passing=true.
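The ?passing=true filter has a simple client-side equivalent — a sketch, assuming the /v1/health/service response shape (each entry carries the instance plus its checks):

```python
# Sketch: keep only instances whose checks are all 'passing',
# mirroring what ?passing=true does server-side.
def passing_only(entries: list[dict]) -> list[dict]:
    return [e for e in entries
            if all(c["Status"] == "passing" for c in e["Checks"])]

entries = [
    {"Service": {"ID": "web-01"}, "Checks": [{"Status": "passing"}]},
    {"Service": {"ID": "web-02"}, "Checks": [{"Status": "critical"}]},
]
print([e["Service"]["ID"] for e in passing_only(entries)])  # ['web-01']
```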


KV Store

Hierarchical Keys

Consul's KV is a flat store with path-like key names. There is no actual hierarchy — / in keys is a naming convention only. Keys can be up to 512 bytes; values up to 512 KB.

consul kv put config/web/max_connections 200
consul kv get config/web/max_connections
consul kv get -recurse config/web/
consul kv delete config/web/max_connections
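Because the store is flat, -recurse is just a key-prefix scan. A minimal sketch of the same idea over a plain dictionary:

```python
# Sketch: Consul KV is a flat map; '/' hierarchy is naming convention,
# so `consul kv get -recurse <prefix>` amounts to a prefix filter.
store = {
    "config/web/max_connections": "200",
    "config/web/timeout": "5s",
    "config/api/max_connections": "100",
}

def recurse(store: dict, prefix: str) -> dict:
    return {k: v for k, v in store.items() if k.startswith(prefix)}

print(recurse(store, "config/web/"))
```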

Watches

Watches poll or long-poll the KV API and trigger a handler when a key changes. They are configured in Consul agent config or started with consul watch:

consul watch -type=key -key=config/web/max_connections /usr/local/bin/reload-web.sh

The handler receives the new value on stdin as JSON.
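A handler sketch for a key watch, assuming the payload mirrors the KV API representation (where Value is base64-encoded); the simulated stdin is for illustration:

```python
import base64, io, json

# Sketch of a `consul watch -type=key` handler: read the KV entry as
# JSON from stdin and decode the base64 'Value' field.
def handle(stream) -> str:
    entry = json.load(stream)
    return base64.b64decode(entry["Value"]).decode()

# Simulated stdin instead of the real pipe from Consul:
fake_stdin = io.StringIO(json.dumps({
    "Key": "config/web/max_connections",
    "Value": base64.b64encode(b"200").decode(),
}))
print(handle(fake_stdin))  # 200
```

A real handler would read from sys.stdin and then reload or signal the application.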

Sessions and Locking

Sessions are the building block for distributed locking. A session is associated with a set of health checks; if the session's checks go critical, the session is invalidated and any locks it held are released (with configurable behavior: release or delete).

# Create a session (returns session ID)
curl -X PUT -d '{"Name":"my-lock","TTL":"30s"}' \
  http://localhost:8500/v1/session/create

# Acquire a lock on a key
curl -X PUT -d 'lock-holder-data' \
  "http://localhost:8500/v1/kv/locks/my-service?acquire=<session-id>"

# Release the lock
curl -X PUT "http://localhost:8500/v1/kv/locks/my-service?release=<session-id>"

If a lock-holder crashes, the session TTL expires and the lock is automatically released — preventing orphaned locks.
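The acquire/release/invalidate semantics can be modeled in a few lines — a behavioral sketch, not Consul's implementation (class and method names are ours):

```python
# Sketch of session-based locking: a key is acquired only when unheld
# (or held by the same session); invalidating a session frees its locks.
class LockTable:
    def __init__(self):
        self.locks = {}  # key -> holding session ID

    def acquire(self, key: str, session: str) -> bool:
        if self.locks.get(key) in (None, session):
            self.locks[key] = session
            return True
        return False

    def release(self, key: str, session: str) -> bool:
        if self.locks.get(key) == session:
            del self.locks[key]
            return True
        return False

    def invalidate_session(self, session: str):
        """TTL expiry or failed health check: drop every lock it held."""
        self.locks = {k: s for k, s in self.locks.items() if s != session}

kv = LockTable()
print(kv.acquire("locks/my-service", "sess-1"))  # True
print(kv.acquire("locks/my-service", "sess-2"))  # False: already held
kv.invalidate_session("sess-1")                   # holder crashed
print(kv.acquire("locks/my-service", "sess-2"))  # True
```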


Connect (Service Mesh)

Sidecar Proxies

Consul Connect adds a service mesh layer on top of service discovery. Each service gets a sidecar proxy (Envoy by default, or the built-in proxy for simple cases) that handles all inbound/outbound connections.

App A ──► Envoy sidecar A ──(mTLS)──► Envoy sidecar B ──► App B

The sidecar intercepts traffic, establishes mutual TLS using certificates from Consul's built-in CA, and enforces intentions (access control).

Intentions

Intentions define which services can communicate. They are evaluated at the destination sidecar:

# Allow web to talk to api
consul intention create web api

# Deny everything by default (recommended for production)
consul intention create -deny '*' '*'

# Check if a connection would be permitted
consul intention check web api

Intentions are stored in Consul and enforced by the sidecar proxies without any proxy restarts — changes take effect within seconds.
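The deny-by-default pattern above relies on precedence: a sketch of intention evaluation, under the assumption that exact matches beat '*' wildcards (rule tuples and the function are ours, for illustration):

```python
# Sketch: evaluate intentions at the destination, most-specific rule wins.
def allowed(intentions: list[tuple[str, str, str]], src: str, dst: str) -> bool:
    def specificity(rule):
        s, d, _ = rule
        return (s != "*") + (d != "*")
    matching = [r for r in intentions
                if r[0] in (src, "*") and r[1] in (dst, "*")]
    if not matching:
        return False  # assume deny when nothing matches
    best = max(matching, key=specificity)
    return best[2] == "allow"

intentions = [("*", "*", "deny"), ("web", "api", "allow")]
print(allowed(intentions, "web", "api"))  # True: exact rule wins
print(allowed(intentions, "db", "api"))   # False: wildcard deny applies
```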

mTLS

Connect uses a built-in certificate authority to issue short-lived (72-hour) leaf certificates to each service. Certificates encode the SPIFFE-compatible service identity (spiffe://<trust-domain>/ns/<ns>/dc/<dc>/svc/<name>). Services authenticate each other via certificate, so the network path is irrelevant.


ACLs and Security

Tokens and Policies

Consul's ACL system uses tokens (bearer tokens in UUID format) that are associated with policies (HCL or JSON rules granting read/write/deny on resources).

# Policy: allow web service to register and query itself
service "web" {
  policy = "write"
}
service_prefix "" {
  policy = "read"
}
node_prefix "" {
  policy = "read"
}

Tokens can also reference roles (named collections of policies) for easier management.
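How the policy above resolves for a given service name can be sketched, assuming exact rules take precedence over prefix rules and longer prefixes beat shorter ones (the helper is ours, for illustration):

```python
# Sketch: resolve an ACL decision for a service name from exact rules
# and prefix rules, mirroring the service/service_prefix policy above.
def resolve(exact: dict, prefixes: dict, name: str) -> str:
    if name in exact:
        return exact[name]
    candidates = [p for p in prefixes if name.startswith(p)]
    if not candidates:
        return "deny"
    return prefixes[max(candidates, key=len)]  # longest prefix wins

exact = {"web": "write"}      # service "web" { policy = "write" }
prefixes = {"": "read"}       # service_prefix "" { policy = "read" }
print(resolve(exact, prefixes, "web"))  # write
print(resolve(exact, prefixes, "api"))  # read
```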

Bootstrapping ACLs

ACL bootstrapping is a one-time operation on a fresh cluster:

consul acl bootstrap
# Returns: SecretID = <bootstrap token>

After bootstrapping, the bootstrap token is the only initial token with global management privileges. Store it securely — it cannot be recovered.

Gotcha: ACL bootstrapping can only be performed once. If you lose the bootstrap token and have not created any other management tokens, you must reset the ACL system entirely (which requires stopping all servers). Always create a backup management token immediately after bootstrapping and store it in a vault. The single most common Consul disaster recovery failure is a lost bootstrap token.

Token Hierarchy

  • Bootstrap token — global management, store in vault
  • Agent tokens — one per agent, allow agent to register itself
  • Service tokens — scoped to the services a node runs
  • Anonymous token — used for unauthenticated requests (restrict heavily or disable)

Multi-Datacenter

WAN Federation

The classic approach: server agents in each datacenter join a shared WAN gossip pool via consul join -wan. Cross-DC queries are forwarded transparently:

# Query the web service in dc2 from dc1
dig @127.0.0.1 -p 8600 web.service.dc2.consul

WAN federation requires direct reachability between server agents across datacenters on port 8302 (WAN gossip) and 8300 (RPC).

Mesh Gateways

Mesh gateways are an alternative for environments where direct server-to-server connectivity across DCs is not available (e.g., when DCs are in separate cloud VPCs behind NAT). A mesh gateway sits at the edge of each datacenter and proxies Connect traffic between DCs without requiring full server connectivity.

DC1                          DC2
Services ──► MeshGateway ──(mTLS)──► MeshGateway ──► Services

Mesh gateways are the recommended approach for multi-cloud and strict network segmentation scenarios.


Consul on Kubernetes

Helm Chart

HashiCorp publishes an official Helm chart for Consul on Kubernetes:

helm repo add hashicorp https://helm.releases.hashicorp.com
helm install consul hashicorp/consul --values consul-values.yaml

Key values:

global:
  name: consul
  datacenter: dc1
server:
  replicas: 3
  bootstrapExpect: 3
connectInject:
  enabled: true
  default: false  # opt-in per pod
syncCatalog:
  enabled: true
  toConsul: true
  toK8S: false

connect-inject

The connect-inject admission webhook intercepts Pod creation. When a Pod has the annotation consul.hashicorp.com/connect-inject: "true", the webhook adds an Envoy sidecar container and an init container to configure it.

sync-catalog

The sync-catalog controller syncs Kubernetes Services into the Consul service catalog and, optionally, Consul services back into Kubernetes. This lets non-Kubernetes services registered in Consul be discovered by Kubernetes workloads (and vice versa), bridging hybrid environments.


Anti-Entropy and Consistency Model

Anti-Entropy

Each Consul client agent periodically reconciles its local state (registered services, health checks) with the servers. If a service registration is missing from the catalog (e.g., after a server restart), the client re-registers it. Anti-entropy runs every ~60 seconds by default.
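The reconciliation step is essentially a diff of local state against the catalog — a sketch of the idea (names and shapes are ours):

```python
# Sketch of anti-entropy: compare the agent's local registrations with
# the server catalog and report what would be re-registered.
def reconcile(local: dict, catalog: dict) -> list[str]:
    """Return service IDs missing from the catalog."""
    return [sid for sid in local if sid not in catalog]

local = {"web-01": {"name": "web", "port": 8080}}
catalog = {}  # e.g. servers restored from an older snapshot
print(reconcile(local, catalog))  # ['web-01'] -> agent re-registers it
```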

Consistency Modes

Consul offers three consistency modes for reads:

  • default — reads served by the leader; a brief stale window is possible during leadership changes (fast)
  • consistent — linearizable reads via the leader; most expensive, never stale
  • stale — any server can respond; typically within ~50ms of the leader, fastest, appropriate for discovery

For service discovery workloads, stale reads are usually acceptable. For lock coordination (KV sessions), always use consistent.
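The mode is selected per request via a query parameter on read endpoints (?stale or ?consistent; default mode adds nothing). A small sketch of building such a URL (the helper is ours; the agent address is the default 127.0.0.1:8500):

```python
# Sketch: attach a consistency-mode query parameter to a health read.
def health_url(service: str, mode: str = "default") -> str:
    base = f"http://127.0.0.1:8500/v1/health/service/{service}?passing=true"
    if mode in ("stale", "consistent"):
        base += f"&{mode}"
    return base

print(health_url("web", "stale"))
```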


Key Takeaways

  • Consul is a unified control plane: service discovery, KV, health, mesh, ACLs, multi-DC — one binary
  • Server agents run Raft (3 or 5, odd number); client agents run on every node
  • The gossip layer (Serf/SWIM) handles membership detection and propagates events at scale
  • Service discovery works via DNS (.consul domain) or HTTP API; health checks gate traffic
  • Connect adds mTLS and intentions with zero application code changes (sidecar model)
  • ACLs protect everything — bootstrap them on day 1, not day 30
  • Multi-DC works via WAN federation (direct connectivity) or mesh gateways (NAT-friendly)
  • On Kubernetes, the Helm chart wires everything: connect-inject for mesh, sync-catalog for hybrid

Wiki Navigation

Prerequisites

  • Consul Flashcards (CLI) (flashcard_deck, L1) — HashiCorp Consul