Istio Service Mesh — Primer¶
Why This Matters¶
Modern microservice architectures run dozens or hundreds of services. Without a service mesh, every team must independently solve the same problems: mutual TLS between services, circuit breaking, retries with backoff, distributed tracing, and traffic shaping for canary releases. Each team solves it differently, or not at all, leading to inconsistent security posture and gaps in observability.
Istio imposes a uniform layer across all services without requiring application code changes. A single PeerAuthentication policy enables mTLS cluster-wide. A single VirtualService routes 5% of traffic to a new version. Access logs, metrics, and traces flow automatically to your observability stack. The application just makes HTTP or gRPC calls; Istio handles the rest at the sidecar level.
For operators, understanding Istio means you can diagnose mysterious latency (wrong DestinationRule timeout), debug authorization failures (AuthorizationPolicy blocking health checks), and confidently roll out canary releases without redeploying your application. Without this knowledge you are debugging a black box where the network behavior is controlled by resources you cannot see.
Core Concepts¶
1. Architecture: Control Plane and Data Plane¶
Istio has two logical layers:
Data plane: Envoy proxy sidecars injected into every pod. They intercept all inbound and outbound traffic for the pod. Envoy handles load balancing, retries, circuit breaking, mTLS, and telemetry. The application is unaware of its presence.
Control plane: istiod — a single binary that consolidates three former components:
| Former component | Role now inside istiod |
|---|---|
| Pilot | Converts Istio config (VirtualService, DestinationRule, etc.) into Envoy xDS configuration and pushes it to sidecars |
| Citadel | Issues and rotates SPIFFE/X.509 certificates for workload identity (mTLS) |
| Galley | Validates and ingests Istio config from the Kubernetes API |
Developer applies VirtualService
↓
Kubernetes API Server
↓
istiod (Pilot watches, converts to xDS)
↓
Envoy sidecars (receive updated listeners/routes/clusters via xDS gRPC stream)
↓
Traffic is shaped per the new config
The merge into istiod (Istio 1.5, 2020) dramatically simplified operations — previously you managed three separate deployments with separate failure modes.
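You can watch this push pipeline from the operator side. One way (assuming `istioctl` is installed and pointed at the cluster; pod name is a placeholder) is to check whether every sidecar has acknowledged the latest xDS push:

```shell
# List every proxy and whether its listeners, routes, clusters, and
# endpoints are SYNCED with the config istiod last pushed
istioctl proxy-status

# Inspect the routes one specific sidecar actually received
istioctl proxy-config routes reviews-v2-abc123.bookinfo
```

A proxy stuck in STALE here usually means istiod could not push, which is the first thing to check when a VirtualService change seems to have no effect.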
Name origin: "Istio" is Greek for "sail." The project continues the Kubernetes nautical naming theme (Kubernetes = "helmsman," Helm = the tool, etc.). Istio was jointly created by Google, IBM, and Lyft in 2017. Lyft contributed the Envoy proxy, which became the data plane.
2. Traffic Management¶
VirtualService¶
Defines how requests to a hostname are routed. It is not a Kubernetes Service replacement — it sits on top of Services and adds routing logic.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: reviews
namespace: bookinfo
spec:
hosts:
- reviews # must match the Kubernetes Service name (or FQDN)
http:
- match:
- headers:
x-user-group:
exact: canary
route:
- destination:
host: reviews
subset: v2 # defined in DestinationRule
- route:
- destination:
host: reviews
subset: v1
weight: 95
- destination:
host: reviews
subset: v2
weight: 5
Key fields: `hosts` (the hostnames the rule applies to; usually the Kubernetes Service short name or FQDN, and wildcards such as `*.example.com` are allowed), `http[].match` (header/URI/method conditions), `route[].destination.subset` (maps to DestinationRule subsets), `retries`, `timeout`, `fault`.
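As a sketch, a route combining a timeout with a retry policy might look like this (field names are from the VirtualService API; the host `reviews` matches the example above):

```yaml
http:
- route:
  - destination:
      host: reviews
      subset: v1
  timeout: 2s            # end-to-end budget for the request, retries included
  retries:
    attempts: 3          # maximum retry attempts
    perTryTimeout: 500ms # deadline for each individual attempt
    retryOn: 5xx,reset   # retry on 5xx responses and connection resets
```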
DestinationRule¶
Defines traffic policies for a destination after routing. Subsets label pod groups (e.g., version labels). Load balancing, connection pool, and outlier detection live here.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
name: reviews
namespace: bookinfo
spec:
host: reviews
trafficPolicy:
connectionPool:
tcp:
maxConnections: 100
http:
http1MaxPendingRequests: 100
http2MaxRequests: 1000
outlierDetection:
consecutiveGatewayErrors: 5
interval: 30s
baseEjectionTime: 30s
subsets:
- name: v1
labels:
version: v1
- name: v2
labels:
version: v2
trafficPolicy:
loadBalancer:
simple: ROUND_ROBIN
Gateway¶
Manages inbound and outbound traffic at the edge of the mesh. A Gateway resource configures a standalone Envoy deployment (the ingress gateway pod, not a sidecar) and replaces a traditional Ingress controller for mesh-aware traffic.
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
name: bookinfo-gateway
namespace: istio-system
spec:
selector:
istio: ingressgateway # targets the ingress gateway pod
servers:
- port:
number: 443
name: https
protocol: HTTPS
tls:
mode: SIMPLE
credentialName: bookinfo-tls # Kubernetes Secret with TLS cert
hosts:
- bookinfo.example.com
---
# VirtualService must bind to the Gateway
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: bookinfo
spec:
hosts:
- bookinfo.example.com
gateways:
- bookinfo-gateway
- mesh # "mesh" applies to internal east-west traffic too
http:
- route:
- destination:
host: productpage
port:
number: 9080
For egress (outbound traffic from the mesh to external services), use an egress gateway (a Gateway resource bound to the istio-egressgateway deployment) together with a ServiceEntry that registers the external hostname in the mesh.
3. Security¶
mTLS: STRICT vs PERMISSIVE¶
Istio's PeerAuthentication controls whether mTLS is required on inbound connections to a workload:
# Cluster-wide STRICT: all inbound connections must use mTLS
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
name: default
namespace: istio-system
spec:
mtls:
mode: STRICT
| Mode | Behavior |
|---|---|
| `STRICT` | Only mTLS connections accepted. Plaintext rejected. |
| `PERMISSIVE` | Both mTLS and plaintext accepted. Used during migration. |
| `DISABLE` | No mTLS. Do not use in production. |
Hierarchy: mesh-wide (istio-system namespace) → namespace-level → workload-level. More specific wins.
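For example, under the mesh-wide STRICT default shown above, a namespace still migrating legacy plaintext clients can opt into PERMISSIVE (the namespace name here is hypothetical):

```yaml
# Namespace-level policy overrides the mesh-wide default for this namespace
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: legacy
spec:
  mtls:
    mode: PERMISSIVE   # accept both mTLS and plaintext during migration
```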
SPIFFE Identity¶
Every Istio workload gets a SPIFFE identity: spiffe://<trust-domain>/ns/<namespace>/sa/<service-account>. Citadel (inside istiod) issues X.509 certificates encoding this identity. Certificates rotate every 24 hours by default. This enables identity-based AuthorizationPolicy — you authorize by SPIFFE identity, not IP address.
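You can verify the identity a workload is actually presenting by dumping the certificate its sidecar currently holds (pod name is a placeholder); the SPIFFE identity appears in the certificate's SAN field:

```shell
# Dump the workload certificate chain held by the sidecar
istioctl proxy-config secret reviews-v1-abc123.bookinfo -o json
```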
AuthorizationPolicy¶
Controls which workloads (principals) can call which workloads and via which paths. Evaluated at the sidecar, after mTLS handshake.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
name: reviews-policy
namespace: bookinfo
spec:
selector:
matchLabels:
app: reviews
action: ALLOW
rules:
- from:
- source:
principals:
- cluster.local/ns/bookinfo/sa/productpage
to:
- operation:
methods: ["GET"]
paths: ["/reviews/*"]
Gotcha: An empty `AuthorizationPolicy` with `action: ALLOW` and no rules denies all traffic to the selected workloads. This is counter-intuitive — you might expect "allow with no rules" to mean "allow everything," but it means "allow nothing, because no request matches any rule."
Default behavior: once any ALLOW policy selects a workload, all traffic to that workload not explicitly allowed is denied. A common footgun: creating a policy that allows a workload's API paths silently blocks its health check endpoints, which must be allowed explicitly.
4. Observability¶
Metrics¶
The Envoy sidecar exposes a rich set of metrics. The key metric for request observability is:
istio_requests_total{
reporter="destination", # or "source"
source_workload="productpage",
destination_workload="reviews",
response_code="200",
connection_security_policy="mutual_tls"
}
Additional metrics: istio_request_duration_milliseconds, istio_request_bytes, istio_response_bytes, istio_tcp_connections_opened_total.
Prometheus scrapes these from each sidecar on port 15090 (Envoy's Prometheus endpoint); when metrics merging is enabled, the Istio agent serves combined application and Envoy metrics on port 15020.
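With those labels you can derive per-edge error rates. A sketch of a PromQL query, using the metric and label names above:

```promql
# 5xx rate from productpage to reviews, as a fraction of all requests,
# measured at the destination sidecar
sum(rate(istio_requests_total{reporter="destination",
    source_workload="productpage", destination_workload="reviews",
    response_code=~"5.."}[5m]))
/
sum(rate(istio_requests_total{reporter="destination",
    source_workload="productpage", destination_workload="reviews"}[5m]))
```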
Distributed Tracing¶
Istio propagates trace context headers (B3, W3C TraceContext, or Datadog format) between services. The application must forward these headers when making downstream calls — Istio injects them on ingress but cannot propagate them through application logic automatically.
Headers to forward: x-request-id, x-b3-traceid, x-b3-spanid, x-b3-parentspanid, x-b3-sampled, x-b3-flags, x-ot-span-context.
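The forwarding requirement can be illustrated with a small sketch (plain Python; the helper name is hypothetical): when a service handles an inbound request and then calls a downstream service, it must copy these headers from the inbound request onto the outbound one, or the trace breaks at that hop.

```python
def extract_trace_headers(incoming_headers):
    """Return the subset of inbound headers that must be copied onto
    outbound requests so Istio can stitch spans into one trace."""
    trace_headers = (
        "x-request-id",
        "x-b3-traceid",
        "x-b3-spanid",
        "x-b3-parentspanid",
        "x-b3-sampled",
        "x-b3-flags",
        "x-ot-span-context",
    )
    # Header names are case-insensitive; normalize before matching
    lowered = {k.lower(): v for k, v in incoming_headers.items()}
    return {name: lowered[name] for name in trace_headers if name in lowered}


inbound = {
    "X-B3-TraceId": "463ac35c9f6413ad",
    "X-B3-SpanId": "a2fb4a1d1a96d312",
    "x-request-id": "7f3a9b",
    "Content-Type": "application/json",  # not a trace header, dropped
}
outbound = extract_trace_headers(inbound)
print(outbound)
```

In practice an OpenTelemetry or B3-aware client library does this for you; the point is that it happens in application code, not in the sidecar.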
Kiali¶
Kiali is the Istio-native service graph UI. It reads Prometheus metrics and Istio config to render:
- A live topology graph of service-to-service communication
- Traffic volume and error rate on each edge
- mTLS status per connection
- Config validation warnings (VirtualService host mismatches, missing DestinationRule subsets)
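A quick way to open Kiali from a workstation (assuming it was installed alongside Istio, e.g. via the bundled addons) is:

```shell
# Port-forward to the Kiali service and open the UI in a browser
istioctl dashboard kiali
```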
Access Logs¶
Envoy access logs capture every request through the sidecar. When enabled, they are written to the stdout of the istio-proxy container.
Enable structured JSON access logging for easier parsing:
# In IstioOperator or MeshConfig
spec:
meshConfig:
accessLogFile: /dev/stdout
accessLogFormat: |
{"start_time":"%START_TIME%","method":"%REQ(:METHOD)%","path":"%REQ(X-ENVOY-ORIGINAL-PATH?:PATH)%",
"response_code":"%RESPONSE_CODE%","duration":"%DURATION%","upstream_host":"%UPSTREAM_HOST%"}
5. Canary Deployments with Weighted Routing¶
Istio enables traffic-percentage-based canary releases independent of replica counts. This is the key difference from Kubernetes native rollouts (which split traffic by pod ratio).
# 95% → v1, 5% → v2 regardless of replica count
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: frontend
spec:
hosts:
- frontend
http:
- route:
- destination:
host: frontend
subset: v1
weight: 95
- destination:
host: frontend
subset: v2
weight: 5
Progressive rollout: adjust the weight in the VirtualService (5 → 20 → 50 → 100). When you reach 100% on v2, update the Deployment's default image and remove the v1 subset. This lets you test on 5% of real traffic without scaling up v2 replicas to match v1.
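One way to perform each step without editing YAML by hand is a merge patch (resource names match the example above; note that a JSON merge patch replaces the whole `spec.http` list, so both destinations must be included):

```shell
# Shift the canary from 5% to 20% of traffic
kubectl patch virtualservice frontend --type merge -p '
{"spec":{"http":[{"route":[
  {"destination":{"host":"frontend","subset":"v1"},"weight":80},
  {"destination":{"host":"frontend","subset":"v2"},"weight":20}
]}]}}'
```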
6. Fault Injection for Chaos Testing¶
Istio can inject faults into traffic at the proxy level, without touching application code. Useful for testing resilience.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: ratings
spec:
hosts:
- ratings
http:
- fault:
delay:
percentage:
value: 50 # inject 7s delay on 50% of requests
fixedDelay: 7s
abort:
percentage:
value: 10 # return HTTP 503 on 10% of requests
httpStatus: 503
route:
- destination:
host: ratings
subset: v1
War story: A team left a `fault.delay` block in a production VirtualService after a chaos test. For three weeks, 50% of requests to the ratings service had an artificial 7-second delay. The SLO dashboard showed degradation but it was attributed to "backend slowness." Only when someone re-read the VirtualService YAML during an unrelated investigation did they find the stale fault block.
Always remove or disable fault injection after testing. Leaving a fault block in a production VirtualService is one of the easiest ways to cause a self-inflicted outage.
7. Sidecar Injection¶
Automatic injection is enabled by labeling a namespace:
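A minimal example, using the standard injection label:

```shell
# Enable automatic sidecar injection for every NEW pod in the namespace
kubectl label namespace bookinfo istio-injection=enabled

# Verify which namespaces have injection enabled
kubectl get namespace -L istio-injection
```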
Once labeled, every new pod in that namespace gets an istio-proxy sidecar and an istio-init init container (which programs iptables to redirect traffic through the proxy). Existing pods are not affected — they must be restarted.
Manual injection for one-off cases:
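For example (the filename is a placeholder):

```shell
# Render the manifest with the sidecar and init container added,
# then apply the result
istioctl kube-inject -f deployment.yaml | kubectl apply -f -
```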
Opt-out per pod (e.g., for a batch Job that doesn't need the mesh):
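The opt-out is a pod template annotation, shown here on a hypothetical Job:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-report
spec:
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"  # skip sidecar injection for this pod
    spec:
      containers:
      - name: report
        image: reports:latest
      restartPolicy: Never
```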
Sidecar resource for config scoping: by default, every sidecar receives xDS configuration for the entire mesh — all services across all namespaces. In large meshes this wastes memory and slows push convergence. The Sidecar resource scopes what a workload can see:
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
name: default
namespace: bookinfo
spec:
egress:
- hosts:
- "./*" # all services in same namespace
- "istio-system/*" # control plane services
8. Ingress and Egress Gateways¶
The ingress gateway is a dedicated Envoy pod (not a sidecar) running at the cluster edge. It handles TLS termination and routes traffic into the mesh. It is controlled by Gateway + VirtualService resources, not Kubernetes Ingress.
The egress gateway is the symmetric counterpart: all outbound traffic to external services routes through it, providing a single egress point with logging, TLS origination, and policy enforcement.
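A ServiceEntry that registers an external API so the mesh can route and observe traffic to it might look like this (hostname is a placeholder):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: external-payments
spec:
  hosts:
  - payments.example.com   # external hostname added to the mesh registry
  location: MESH_EXTERNAL
  resolution: DNS
  ports:
  - number: 443
    name: https
    protocol: TLS
```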
# Check ingress gateway external IP
kubectl -n istio-system get svc istio-ingressgateway
# Verify gateway pod health
kubectl -n istio-system get pods -l istio=ingressgateway
9. Multi-Cluster and Multi-Mesh¶
Multi-primary: Multiple clusters each run istiod. They share a common root CA (for cross-cluster mTLS) and watch each other's service endpoints. East-west gateways handle cross-cluster traffic.
Primary-remote: One cluster runs istiod; remote clusters run only data plane (sidecars + east-west gateway). Simpler control plane topology but the primary is a single point of failure for config.
Multi-mesh federation: Completely separate meshes that expose selected services to each other via ServiceEntry and cross-mesh trust. Largest blast radius containment.
10. Performance Considerations¶
Istio adds latency because every request traverses two additional network hops (source sidecar → destination sidecar). In practice, p50 overhead is 1–2ms, p99 can be 5–10ms for complex routing configs.
Key performance levers:
| Concern | Mitigation |
|---|---|
| Large mesh xDS config | Use Sidecar resource to scope what each proxy sees |
| Sidecar memory usage | Set resources.limits in IstioOperator (proxy typically needs 128–256Mi) |
| Init container timing | Set holdApplicationUntilProxyStarts: true to prevent app-before-proxy races |
| Envoy config churn | Avoid frequent label/annotation changes that trigger xDS pushes |
| Tracing overhead | Tune sampling rate in MeshConfig.defaultConfig.tracing (1% is common for high-traffic) |
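For example, sampling 1% of traces mesh-wide is a small IstioOperator fragment (same structure as the access-log config earlier in this page):

```yaml
spec:
  meshConfig:
    defaultConfig:
      tracing:
        sampling: 1.0   # percentage of requests to trace; 1.0 = 1%
```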
Under the hood: Envoy uses iptables rules (installed by the `istio-init` container) to intercept all traffic. Inbound traffic is redirected to port 15006, outbound to port 15001. If you see unexpected connection resets after enabling Istio, check `iptables -t nat -L` inside the pod's network namespace to verify the redirect rules.
Key Takeaways¶
- Istio decouples network policy from application code. mTLS, retries, canary routing, and authorization are configured in YAML, not in application libraries.
- `istiod` consolidates Pilot + Citadel + Galley: one deployment, one failure domain.
- `VirtualService` = routing rules. `DestinationRule` = traffic policy and subsets. `Gateway` = edge traffic. These three resources cover 90% of day-to-day Istio config.
- mTLS `PERMISSIVE` mode is for migration only. Every production cluster should run `STRICT`, or you have no actual transport security guarantees.
- The `Sidecar` resource is the most underused performance optimization. Without it, every proxy in a large mesh carries config it will never use.
- Fault injection must be removed after testing. It is the Istio config most likely to be left on accidentally.
- `AuthorizationPolicy` is default-deny once any policy selects a workload. Health check paths must be explicitly allowed or carved out.
Wiki Navigation¶
Prerequisites¶
- Service Mesh (Topic Pack, L3)
- Envoy Proxy (Topic Pack, L2)
Related Content¶
- Istio Flashcards (CLI) (flashcard_deck, L1) — Istio Service Mesh