Istio Service Mesh — Primer¶
Why This Matters¶
Modern microservice architectures run dozens or hundreds of services. Without a service mesh, every team must independently solve the same problems: mutual TLS between services, circuit breaking, retries with backoff, distributed tracing, and traffic shaping for canary releases. Each team solves it differently, or not at all, leading to inconsistent security posture and gaps in observability.
Istio imposes a uniform layer across all services without requiring application code changes. A single PeerAuthentication policy enables mTLS cluster-wide. A single VirtualService routes 5% of traffic to a new version. Access logs, metrics, and traces flow automatically to your observability stack. The application just makes HTTP or gRPC calls; Istio handles the rest at the sidecar level.
For operators, understanding Istio means you can diagnose mysterious latency (wrong DestinationRule timeout), debug authorization failures (AuthorizationPolicy blocking health checks), and confidently roll out canary releases without redeploying your application. Without this knowledge you are debugging a black box where the network behavior is controlled by resources you cannot see.
Core Concepts¶
1. Architecture: Control Plane and Data Plane¶
Istio has two logical layers:
Data plane: Envoy proxy sidecars injected into every pod. They intercept all inbound and outbound traffic for the pod. Envoy handles load balancing, retries, circuit breaking, mTLS, and telemetry. The application is unaware of its presence.
Control plane: istiod — a single binary that consolidates three former components:
| Former component | Role now inside istiod |
|---|---|
| Pilot | Converts Istio config (VirtualService, DestinationRule, etc.) into Envoy xDS configuration and pushes it to sidecars |
| Citadel | Issues and rotates SPIFFE/X.509 certificates for workload identity (mTLS) |
| Galley | Validates and ingests Istio config from the Kubernetes API |
Developer applies VirtualService
↓
Kubernetes API Server
↓
istiod (Pilot watches, converts to xDS)
↓
Envoy sidecars (receive updated listeners/routes/clusters via xDS gRPC stream)
↓
Traffic is shaped per the new config
The merge into istiod (Istio 1.5, 2020) dramatically simplified operations — previously you managed three separate deployments with separate failure modes.
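You can watch this push pipeline from the operator side. One way (assuming `istioctl` is installed and pointed at the cluster; pod name is a placeholder) is to check whether every sidecar has acknowledged the latest xDS push:

```shell
# List every proxy and whether its listeners, routes, clusters, and
# endpoints are SYNCED with the config istiod last pushed
istioctl proxy-status

# Inspect the routes one specific sidecar actually received
istioctl proxy-config routes reviews-v2-abc123.bookinfo
```

A proxy stuck in STALE here usually means istiod could not push, which is the first thing to check when a VirtualService change seems to have no effect.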
Name origin: "Istio" is Greek for "sail." The project continues the Kubernetes nautical naming theme (Kubernetes = "helmsman," Helm = the tool, etc.). Istio was jointly created by Google, IBM, and Lyft in 2017. Lyft contributed the Envoy proxy, which became the data plane.
2. Traffic Management¶
VirtualService¶
Defines how requests to a hostname are routed. It is not a Kubernetes Service replacement — it sits on top of Services and adds routing logic.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: reviews
namespace: bookinfo
spec:
hosts:
- reviews # must match the Kubernetes Service name (or FQDN)
http:
- match:
- headers:
x-user-group:
exact: canary
route:
- destination:
host: reviews
subset: v2 # defined in DestinationRule
- route:
- destination:
host: reviews
subset: v1
weight: 95
- destination:
host: reviews
subset: v2
weight: 5
Key fields: `hosts` (the hostnames the rule applies to; usually the Kubernetes Service short name or FQDN, and wildcards such as `*.example.com` are allowed), `http[].match` (header/URI/method conditions), `route[].destination.subset` (maps to DestinationRule subsets), `retries`, `timeout`, `fault`.
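As a sketch, a route combining a timeout with a retry policy might look like this (field names are from the VirtualService API; the host `reviews` matches the example above):

```yaml
http:
- route:
  - destination:
      host: reviews
      subset: v1
  timeout: 2s            # end-to-end budget for the request, retries included
  retries:
    attempts: 3          # maximum retry attempts
    perTryTimeout: 500ms # deadline for each individual attempt
    retryOn: 5xx,reset   # retry on 5xx responses and connection resets
```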
DestinationRule¶
Defines traffic policies for a destination after routing. Subsets label pod groups (e.g., version labels). Load balancing, connection pool, and outlier detection live here.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
name: reviews
namespace: bookinfo
spec:
host: reviews
trafficPolicy:
connectionPool:
tcp:
maxConnections: 100
http:
http1MaxPendingRequests: 100
http2MaxRequests: 1000
outlierDetection:
consecutiveGatewayErrors: 5
interval: 30s
baseEjectionTime: 30s
subsets:
- name: v1
labels:
version: v1
- name: v2
labels:
version: v2
trafficPolicy:
loadBalancer:
simple: ROUND_ROBIN
Gateway¶
Manages inbound and outbound traffic at the edge of the mesh. A Gateway resource configures a standalone Envoy deployment (the ingress gateway pod, not a sidecar) and replaces a traditional Ingress controller for mesh-aware traffic.
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
name: bookinfo-gateway
namespace: istio-system
spec:
selector:
istio: ingressgateway # targets the ingress gateway pod
servers:
- port:
number: 443
name: https
protocol: HTTPS
tls:
mode: SIMPLE
credentialName: bookinfo-tls # Kubernetes Secret with TLS cert
hosts:
- bookinfo.example.com
---
# VirtualService must bind to the Gateway
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: bookinfo
spec:
hosts:
- bookinfo.example.com
gateways:
- bookinfo-gateway
- mesh # "mesh" applies to internal east-west traffic too
http:
- route:
- destination:
host: productpage
port:
number: 9080
For egress (outbound traffic from the mesh to external services), use an egress gateway (a Gateway resource bound to the istio-egressgateway deployment) together with a ServiceEntry that registers the external hostname in the mesh.
3. Security¶
mTLS: STRICT vs PERMISSIVE¶
Istio's PeerAuthentication controls whether mTLS is required on inbound connections to a workload:
# Cluster-wide STRICT: all inbound connections must use mTLS
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
name: default
namespace: istio-system
spec:
mtls:
mode: STRICT
| Mode | Behavior |
|---|---|
| `STRICT` | Only mTLS connections accepted. Plaintext rejected. |
| `PERMISSIVE` | Both mTLS and plaintext accepted. Used during migration. |
| `DISABLE` | No mTLS. Do not use in production. |
Hierarchy: mesh-wide (istio-system namespace) → namespace-level → workload-level. More specific wins.
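For example, under the mesh-wide STRICT default shown above, a namespace still migrating legacy plaintext clients can opt into PERMISSIVE (the namespace name here is hypothetical):

```yaml
# Namespace-level policy overrides the mesh-wide default for this namespace
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: legacy
spec:
  mtls:
    mode: PERMISSIVE   # accept both mTLS and plaintext during migration
```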
SPIFFE Identity¶
Every Istio workload gets a SPIFFE identity: spiffe://<trust-domain>/ns/<namespace>/sa/<service-account>. Citadel (inside istiod) issues X.509 certificates encoding this identity. Certificates rotate every 24 hours by default. This enables identity-based AuthorizationPolicy — you authorize by SPIFFE identity, not IP address.
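You can verify the identity a workload is actually presenting by dumping the certificate its sidecar currently holds (pod name is a placeholder); the SPIFFE identity appears in the certificate's SAN field:

```shell
# Dump the workload certificate chain held by the sidecar
istioctl proxy-config secret reviews-v1-abc123.bookinfo -o json
```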
AuthorizationPolicy¶
Controls which workloads (principals) can call which workloads and via which paths. Evaluated at the sidecar, after mTLS handshake.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
name: reviews-policy
namespace: bookinfo
spec:
selector:
matchLabels:
app: reviews
action: ALLOW
rules:
- from:
- source:
principals:
- cluster.local/ns/bookinfo/sa/productpage
to:
- operation:
methods: ["GET"]
paths: ["/reviews/*"]
Gotcha: An empty `AuthorizationPolicy` with `action: ALLOW` and no rules denies all traffic to the selected workloads. This is counter-intuitive — you might expect "allow with no rules" to mean "allow everything," but it means "allow nothing, because no request matches any rule."
Default behavior: once any ALLOW policy selects a workload, all traffic to that workload not explicitly allowed is denied. A common footgun: creating a policy that allows a workload's API paths silently blocks its health check endpoints, which must be allowed explicitly.
4. Observability¶
Metrics¶
The Envoy sidecar exposes a rich set of metrics. The key metric for request observability is:
istio_requests_total{
reporter="destination", # or "source"
source_workload="productpage",
destination_workload="reviews",
response_code="200",
connection_security_policy="mutual_tls"
}
Additional metrics: istio_request_duration_milliseconds, istio_request_bytes, istio_response_bytes, istio_tcp_connections_opened_total.
Prometheus scrapes these from each sidecar on port 15090 (Envoy's Prometheus endpoint); when metrics merging is enabled, the Istio agent serves combined application and Envoy metrics on port 15020.
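With those labels you can derive per-edge error rates. A sketch of a PromQL query, using the metric and label names above:

```promql
# 5xx rate from productpage to reviews, as a fraction of all requests,
# measured at the destination sidecar
sum(rate(istio_requests_total{reporter="destination",
    source_workload="productpage", destination_workload="reviews",
    response_code=~"5.."}[5m]))
/
sum(rate(istio_requests_total{reporter="destination",
    source_workload="productpage", destination_workload="reviews"}[5m]))
```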
Distributed Tracing¶
Istio propagates trace context headers (B3, W3C TraceContext, or Datadog format) between services. The application must forward these headers when making downstream calls — Istio injects them on ingress but cannot propagate them through application logic automatically.
Headers to forward: x-request-id, x-b3-traceid, x-b3-spanid, x-b3-parentspanid, x-b3-sampled, x-b3-flags, x-ot-span-context.
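The forwarding requirement can be illustrated with a small sketch (plain Python; the helper name is hypothetical): when a service handles an inbound request and then calls a downstream service, it must copy these headers from the inbound request onto the outbound one, or the trace breaks at that hop.

```python
def extract_trace_headers(incoming_headers):
    """Return the subset of inbound headers that must be copied onto
    outbound requests so Istio can stitch spans into one trace."""
    trace_headers = (
        "x-request-id",
        "x-b3-traceid",
        "x-b3-spanid",
        "x-b3-parentspanid",
        "x-b3-sampled",
        "x-b3-flags",
        "x-ot-span-context",
    )
    # Header names are case-insensitive; normalize before matching
    lowered = {k.lower(): v for k, v in incoming_headers.items()}
    return {name: lowered[name] for name in trace_headers if name in lowered}


inbound = {
    "X-B3-TraceId": "463ac35c9f6413ad",
    "X-B3-SpanId": "a2fb4a1d1a96d312",
    "x-request-id": "7f3a9b",
    "Content-Type": "application/json",  # not a trace header, dropped
}
outbound = extract_trace_headers(inbound)
print(outbound)
```

In practice an OpenTelemetry or B3-aware client library does this for you; the point is that it happens in application code, not in the sidecar.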
Kiali¶
Kiali is the Istio-native service graph UI. It reads Prometheus metrics and Istio config to render:
- A live topology graph of service-to-service communication
- Traffic volume and error rate on each edge
- mTLS status per connection
- Config validation warnings (VirtualService host mismatches, missing DestinationRule subsets)
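A quick way to open Kiali from a workstation (assuming it was installed alongside Istio, e.g. via the bundled addons) is:

```shell
# Port-forward to the Kiali service and open the UI in a browser
istioctl dashboard kiali
```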
Access Logs¶
Envoy access logs capture every request through the sidecar. When enabled, they are written to the stdout of the istio-proxy container.
Enable structured JSON access logging for easier parsing:
# In IstioOperator or MeshConfig
spec:
meshConfig:
accessLogFile: /dev/stdout
accessLogFormat: |
{"start_time":"%START_TIME%","method":"%REQ(:METHOD)%","path":"%REQ(X-ENVOY-ORIGINAL-PATH?:PATH)%",
"response_code":"%RESPONSE_CODE%","duration":"%DURATION%","upstream_host":"%UPSTREAM_HOST%"}
5. Canary Deployments with Weighted Routing¶
Istio enables traffic-percentage-based canary releases independent of replica counts. This is the key difference from Kubernetes native rollouts (which split traffic by pod ratio).
# 95% → v1, 5% → v2 regardless of replica count
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: frontend
spec:
hosts:
- frontend
http:
- route:
- destination:
host: frontend
subset: v1
weight: 95
- destination:
host: frontend
subset: v2
weight: 5
Progressive rollout: adjust the weight in the VirtualService (5 → 20 → 50 → 100). When you reach 100% on v2, update the Deployment's default image and remove the v1 subset. This lets you test on 5% of real traffic without scaling up v2 replicas to match v1.
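One way to perform each step without editing YAML by hand is a merge patch (resource names match the example above; note that a JSON merge patch replaces the whole `spec.http` list, so both destinations must be included):

```shell
# Shift the canary from 5% to 20% of traffic
kubectl patch virtualservice frontend --type merge -p '
{"spec":{"http":[{"route":[
  {"destination":{"host":"frontend","subset":"v1"},"weight":80},
  {"destination":{"host":"frontend","subset":"v2"},"weight":20}
]}]}}'
```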
6. Fault Injection for Chaos Testing¶
Istio can inject faults into traffic at the proxy level, without touching application code. Useful for testing resilience.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: ratings
spec:
hosts:
- ratings
http:
- fault:
delay:
percentage:
value: 50 # inject 7s delay on 50% of requests
fixedDelay: 7s
abort:
percentage:
value: 10 # return HTTP 503 on 10% of requests
httpStatus: 503
route:
- destination:
host: ratings
subset: v1
War story: A team left a `fault.delay` block in a production VirtualService after a chaos test. For three weeks, 50% of requests to the ratings service had an artificial 7-second delay. The SLO dashboard showed degradation but it was attributed to "backend slowness." Only when someone re-read the VirtualService YAML during an unrelated investigation did they find the stale fault block.
Always remove or disable fault injection after testing. Leaving a fault block in a production VirtualService is one of the easiest ways to cause a self-inflicted outage.
7. Sidecar Injection¶
Automatic injection is enabled by labeling a namespace:
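A minimal example, using the standard injection label:

```shell
# Enable automatic sidecar injection for every NEW pod in the namespace
kubectl label namespace bookinfo istio-injection=enabled

# Verify which namespaces have injection enabled
kubectl get namespace -L istio-injection
```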
Once labeled, every new pod in that namespace gets an istio-proxy sidecar and an istio-init init container (which programs iptables to redirect traffic through the proxy). Existing pods are not affected — they must be restarted.
Manual injection for one-off cases:
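For example (the filename is a placeholder):

```shell
# Render the manifest with the sidecar and init container added,
# then apply the result
istioctl kube-inject -f deployment.yaml | kubectl apply -f -
```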
Opt-out per pod (e.g., for a batch Job that doesn't need the mesh):
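The opt-out is a pod template annotation, shown here on a hypothetical Job:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-report
spec:
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"  # skip sidecar injection for this pod
    spec:
      containers:
      - name: report
        image: reports:latest
      restartPolicy: Never
```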
Sidecar resource for config scoping: by default, every sidecar receives xDS configuration for the entire mesh — all services across all namespaces. In large meshes this wastes memory and slows push convergence. The Sidecar resource scopes what a workload can see:
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
name: default
namespace: bookinfo
spec:
egress:
- hosts:
- "./*" # all services in same namespace
- "istio-system/*" # control plane services
8. Ingress and Egress Gateways¶
The ingress gateway is a dedicated Envoy pod (not a sidecar) running at the cluster edge. It handles TLS termination and routes traffic into the mesh. It is controlled by Gateway + VirtualService resources, not Kubernetes Ingress.
The egress gateway is the symmetric counterpart: all outbound traffic to external services routes through it, providing a single egress point with logging, TLS origination, and policy enforcement.
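A ServiceEntry that registers an external API so the mesh can route and observe traffic to it might look like this (hostname is a placeholder):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: external-payments
spec:
  hosts:
  - payments.example.com   # external hostname added to the mesh registry
  location: MESH_EXTERNAL
  resolution: DNS
  ports:
  - number: 443
    name: https
    protocol: TLS
```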
# Check ingress gateway external IP
kubectl -n istio-system get svc istio-ingressgateway
# Verify gateway pod health
kubectl -n istio-system get pods -l istio=ingressgateway
9. Multi-Cluster and Multi-Mesh¶
Multi-primary: Multiple clusters each run istiod. They share a common root CA (for cross-cluster mTLS) and watch each other's service endpoints. East-west gateways handle cross-cluster traffic.
Primary-remote: One cluster runs istiod; remote clusters run only data plane (sidecars + east-west gateway). Simpler control plane topology but the primary is a single point of failure for config.
Multi-mesh federation: Completely separate meshes that expose selected services to each other via ServiceEntry and cross-mesh trust. Largest blast radius containment.
10. Performance Considerations¶
Istio adds latency because every request traverses two additional network hops (source sidecar → destination sidecar). In practice, p50 overhead is 1–2ms, p99 can be 5–10ms for complex routing configs.
Key performance levers:
| Concern | Mitigation |
|---|---|
| Large mesh xDS config | Use Sidecar resource to scope what each proxy sees |
| Sidecar memory usage | Set resources.limits in IstioOperator (proxy typically needs 128–256Mi) |
| Init container timing | Set holdApplicationUntilProxyStarts: true to prevent app-before-proxy races |
| Envoy config churn | Avoid frequent label/annotation changes that trigger xDS pushes |
| Tracing overhead | Tune sampling rate in MeshConfig.defaultConfig.tracing (1% is common for high-traffic) |
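For example, sampling 1% of traces mesh-wide is a small IstioOperator fragment (same structure as the access-log config earlier in this page):

```yaml
spec:
  meshConfig:
    defaultConfig:
      tracing:
        sampling: 1.0   # percentage of requests to trace; 1.0 = 1%
```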
Under the hood: Envoy uses iptables rules (installed by the `istio-init` container) to intercept all traffic. Inbound traffic is redirected to port 15006, outbound to port 15001. If you see unexpected connection resets after enabling Istio, check `iptables -t nat -L` inside the pod's network namespace to verify the redirect rules.
Key Takeaways¶
- Istio decouples network policy from application code. mTLS, retries, canary routing, and authorization are configured in YAML, not in application libraries.
- `istiod` consolidates Pilot + Citadel + Galley: one deployment, one failure domain.
- `VirtualService` = routing rules. `DestinationRule` = traffic policy and subsets. `Gateway` = edge traffic. These three resources cover 90% of day-to-day Istio config.
- mTLS `PERMISSIVE` mode is for migration only. Every production cluster should run `STRICT`, or you have no actual transport security guarantees.
- The `Sidecar` resource is the most underused performance optimization. Without it, every proxy in a large mesh carries config it will never use.
- Fault injection must be removed after testing. It is the Istio config most likely to be left on accidentally.
- `AuthorizationPolicy` is default-deny once any policy selects a workload. Health check paths must be explicitly allowed or carved out.
Wiki Navigation¶
Prerequisites¶
- Service Mesh (Topic Pack, L3)
- Envoy Proxy (Topic Pack, L2)
Related Content¶
- Istio Flashcards (CLI) (flashcard_deck, L1) — Istio Service Mesh