
Progressive Delivery — Primer

Why This Matters

Every deployment is a bet that the new version is better than the old one. Progressive delivery lets you make that bet incrementally — exposing the new version to 5% of traffic, watching the metrics for 10 minutes, then rolling forward or back based on data rather than hope.

Without progressive delivery, a deploy is a binary event: you're either 100% on the old version or 100% on the new one. The rollout window is too short to catch statistical regressions, and rolling back requires another full deploy. With a canary, you limit the blast radius of a bad release to a small percentage of users while observing real traffic behavior.

The tooling (Argo Rollouts, Flagger) automates what teams used to do manually: deploy a small batch, query Prometheus for error rate and latency, compare against a baseline, decide. This tightens the feedback loop from minutes of manual checking to seconds of automated evaluation, and makes progressive delivery repeatable without human toil.

Core Concepts

1. Deployment Strategies Compared

| Strategy | How it works | Blast radius | Rollback speed | Cost |
|---|---|---|---|---|
| Rolling | Replace pods gradually (N at a time) | Up to 100% if metric check is absent | ~= deploy time | No extra resources |
| Recreate | Kill all old, create all new | 100% (brief outage) | Deploy a new release | No extra resources |
| Blue/Green | Stand up full new stack, switch traffic | 0% until cutover | Instant (flip LB) | 2x resource cost |
| Canary | Route small % to new version, increase gradually | Configurable % | Instant abort | Partial extra resources |
| A/B Testing | Route specific users to new version (headers, cookies) | Targeted subset | Instant abort | Partial extra resources |

Name origin: "Canary release" comes from the coal mining practice of bringing canaries into mines. If the canary died, miners knew the air was toxic and retreated. In software, the canary version is the first to "die" (show errors) so you can pull back before the whole fleet is affected.

Canary is the default progressive delivery primitive. Blue/green is useful when schema changes or state mean you can't run old and new simultaneously. A/B is for feature flags at the infrastructure level.

2. Argo Rollouts Architecture

Argo Rollouts extends Kubernetes with a Rollout CRD that replaces Deployment for managed workloads. It adds a rollout controller that orchestrates the traffic split, analysis, and promotion/abort logic.

Install:

kubectl create namespace argo-rollouts
kubectl apply -n argo-rollouts -f https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml

# kubectl plugin
curl -LO https://github.com/argoproj/argo-rollouts/releases/latest/download/kubectl-argo-rollouts-linux-amd64
chmod +x kubectl-argo-rollouts-linux-amd64
sudo mv kubectl-argo-rollouts-linux-amd64 /usr/local/bin/kubectl-argo-rollouts

3. Rollout Resource — Canary Strategy

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-service
  namespace: my-app
spec:
  replicas: 10
  revisionHistoryLimit: 3
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      containers:
        - name: my-service
          image: ghcr.io/myorg/my-service:v1.2.3
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 256Mi
  strategy:
    canary:
      canaryService: my-service-canary    # Service pointing to canary pods
      stableService: my-service-stable    # Service pointing to stable pods
      trafficRouting:
        nginx:
          stableIngress: my-service-ingress
      steps:
        - setWeight: 5          # Route 5% of traffic to canary
        - pause: {duration: 5m} # Wait 5 minutes
        - analysis:             # Run analysis
            templates:
              - templateName: success-rate
            args:
              - name: service-name
                value: my-service-canary
        - setWeight: 25
        - pause: {duration: 10m}
        - setWeight: 50
        - pause: {duration: 10m}
        - setWeight: 100        # Full promotion
      canaryMetadata:
        labels:
          version: canary
      stableMetadata:
        labels:
          version: stable
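The canaryService and stableService named above must exist as ordinary Services selecting the Rollout's pods; the controller narrows each one to the correct ReplicaSet by injecting a pod-template-hash selector at runtime. A minimal sketch, using the names and labels from the Rollout above:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service-canary
  namespace: my-app
spec:
  selector:
    app: my-service   # controller adds a pod-template-hash selector at runtime
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: my-service-stable
  namespace: my-app
spec:
  selector:
    app: my-service
  ports:
    - port: 80
      targetPort: 8080
```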

4. AnalysisTemplate and AnalysisRun

AnalysisTemplate defines what metrics to query and what thresholds constitute success/failure. An AnalysisRun is a running instance, created automatically when a rollout reaches an analysis step.

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
  namespace: my-app
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 60s
      count: 5             # Run 5 measurements
      successCondition: result[0] >= 0.99
      failureLimit: 2      # Allow 2 failed measurements before aborting
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc.cluster.local:9090
          query: |
            sum(rate(http_requests_total{
              job="{{args.service-name}}",
              status!~"5.."
            }[2m])) /
            sum(rate(http_requests_total{
              job="{{args.service-name}}"
            }[2m]))
    - name: latency-p99
      interval: 60s
      count: 5
      successCondition: result[0] < 0.5     # 500ms
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc.cluster.local:9090
          query: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{
                job="{{args.service-name}}"
              }[2m])) by (le)
            )

Multiple metric providers are supported: Prometheus, Datadog, NewRelic, CloudWatch, Web (generic HTTP), Job (run a Kubernetes Job as the analysis).
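For instance, a hedged sketch of the Web provider, which polls an HTTP endpoint and evaluates a condition against the JSON it returns (the URL and field names here are hypothetical):

```yaml
metrics:
  - name: external-health
    interval: 60s
    count: 3
    successCondition: result.healthy == true   # evaluated on the JSON selected below
    provider:
      web:
        url: http://healthcheck.my-app.svc.cluster.local/api/status  # hypothetical endpoint
        jsonPath: "{$.data}"    # result = contents of the "data" field
        timeoutSeconds: 10
```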

Gotcha: The successCondition uses result[0], not result. A common mistake is writing successCondition: result >= 0.99 which silently evaluates wrong. Always index into the result array.

Default trap: The default failureLimit is 0, meaning a single failed measurement aborts the entire rollout. For noisy metrics, set failureLimit: 2 or higher to tolerate transient dips.
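One way to tune this for a noisy metric (the values here are illustrative, not prescriptive):

```yaml
metrics:
  - name: success-rate
    interval: 60s
    count: 10
    failureLimit: 2        # tolerate up to 2 failed measurements out of 10
    inconclusiveLimit: 3   # pause the rollout (rather than abort) after 3 inconclusive results
    successCondition: result[0] >= 0.99
```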

5. Rollout with Blue/Green Strategy

spec:
  strategy:
    blueGreen:
      activeService: my-service-active     # Points to current live (blue)
      previewService: my-service-preview   # Points to new version (green)
      autoPromotionEnabled: false          # Require manual promotion
      prePromotionAnalysis:
        templates:
          - templateName: success-rate
        args:
          - name: service-name
            value: my-service-preview
      postPromotionAnalysis:
        templates:
          - templateName: success-rate
        args:
          - name: service-name
            value: my-service-active       # active has flipped to new version
      scaleDownDelaySeconds: 30            # Keep old stack for 30s after promotion

Blue/green flow:
1. previewService is updated to point to the new pods
2. prePromotionAnalysis runs against the preview service
3. Promotion (automatic or manual) flips activeService to the new pods
4. postPromotionAnalysis runs against the (now live) new version
5. Old pods are scaled down after scaleDownDelaySeconds

6. Traffic Splitting with Nginx and Istio

Nginx traffic splitting uses the nginx.ingress.kubernetes.io/canary-weight annotation on a controller-managed canary Ingress:

# You author only the stable Ingress (no canary annotations) and name it as
# stableIngress in the Rollout spec. The controller then generates and manages
# a second, canary Ingress resembling the one below:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-service-my-service-ingress-canary  # generated name: <rollout>-<ingress>-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "10"  # updated by the controller at each setWeight step
spec:
  rules:
    - host: my-service.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-service-canary
                port:
                  number: 80

Istio traffic splitting via VirtualService:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
spec:
  strategy:
    canary:
      trafficRouting:
        istio:
          virtualService:
            name: my-service-vsvc
            routes:
              - primary       # name of the route in the VirtualService
          destinationRule:
            name: my-service-dr
            canarySubsetName: canary
            stableSubsetName: stable

# VirtualService managed by Argo Rollouts
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: my-service-vsvc
spec:
  hosts:
    - my-service
  http:
    - name: primary
      route:
        - destination:
            host: my-service
            subset: stable
          weight: 90       # managed dynamically
        - destination:
            host: my-service
            subset: canary
          weight: 10       # managed dynamically
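The destinationRule referenced above also needs to exist; the controller updates each subset's labels (adding a rollouts-pod-template-hash) so that stable and canary resolve to the right ReplicaSets. A sketch, assuming the app label used in earlier examples:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: my-service-dr
spec:
  host: my-service
  subsets:
    - name: stable        # subset labels managed by the rollout controller
      labels:
        app: my-service
    - name: canary
      labels:
        app: my-service
```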

7. Header-Based Routing for Testing

Route specific requests to the canary (e.g., internal testers, QA automation) without affecting general traffic:

# With Istio
spec:
  strategy:
    canary:
      trafficRouting:
        managedRoutes:            # declare routes the controller may create and own
          - name: header-route
        istio:
          virtualService:
            name: my-service-vsvc
      steps:
        - setHeaderRoute:
            name: header-route
            match:
              - headerName: X-Canary
                headerValue:
                  exact: "true"
        - pause: {}  # Pause indefinitely — manual promotion
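A setHeaderRoute step makes the controller insert an extra HTTP route into the managed VirtualService, ahead of the weighted route. Roughly, a sketch of the controller-generated output (not something you author yourself):

```yaml
# Approximate shape of what the controller adds to my-service-vsvc
http:
  - name: header-route            # generated from the setHeaderRoute step
    match:
      - headers:
          X-Canary:
            exact: "true"
    route:
      - destination:
          host: my-service
          subset: canary
        weight: 100
  - name: primary                 # the existing weighted route follows
    route:
      - destination:
          host: my-service
          subset: stable
        weight: 100
      - destination:
          host: my-service
          subset: canary
        weight: 0
```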

With Nginx:

      steps:
        - setCanaryScale:
            replicas: 1       # run one canary pod...
        - setWeight: 0        # ...but send it 0% of general traffic
        - setHeaderRoute:
            name: canary-header
            match:
              - headerName: X-Canary-Version
                headerValue:
                  exact: v2
        - pause: {duration: 1h}  # Soak with header-based traffic only
        - setWeight: 10
        - pause: {duration: 10m}
        ...

Interview tip: When asked "how is a canary different from a rolling update," the key answer is traffic control. A rolling update splits traffic by pod ratio (if 1 of 10 pods is new, ~10% of traffic hits it). A canary with Istio or Nginx splits traffic by percentage independent of replica count: you can send 1% to a single canary pod regardless of how many stable pods exist.

8. Flagger

Flagger is the Flux-ecosystem alternative to Argo Rollouts. It uses a Canary CRD and a similar analysis/promotion model but integrates natively with Flux GitOps.

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: my-service
  namespace: my-app
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service
  progressDeadlineSeconds: 600
  service:
    port: 80
    targetPort: 8080
    gateways:
      - public-gateway.istio-system.svc.cluster.local
    hosts:
      - my-service.example.com
  analysis:
    interval: 1m
    threshold: 5        # Max number of failed checks before abort
    maxWeight: 50       # Max canary traffic percentage
    stepWeight: 10      # Increment per step
    metrics:
      - name: request-success-rate
        min: 99
        interval: 1m
      - name: request-duration
        max: 500        # milliseconds
        interval: 30s
    webhooks:
      - name: acceptance-test
        type: pre-rollout
        url: http://flagger-loadtester.test/
        timeout: 30s
        metadata:
          type: bash
          cmd: "curl -sd 'test' http://my-service-canary.my-app/health | grep OK"
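request-success-rate and request-duration are Flagger built-ins; custom metrics come from a MetricTemplate referenced by name. A hedged sketch (the Prometheus address and the query are assumptions for illustration):

```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: error-rate
  namespace: my-app
spec:
  provider:
    type: prometheus
    address: http://prometheus.monitoring.svc.cluster.local:9090
  query: |
    100 * sum(rate(http_requests_total{
      namespace="{{ namespace }}", status=~"5.."
    }[{{ interval }}])) /
    sum(rate(http_requests_total{
      namespace="{{ namespace }}"
    }[{{ interval }}]))
```

It would then be referenced from analysis.metrics via a templateRef together with a thresholdRange, in place of the builtin min/max fields.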

9. Automated Rollback on Metric Degradation

Both Argo Rollouts and Flagger can automatically abort and roll back when analysis fails.

Argo Rollouts abort behavior:
- successCondition fails failureLimit times → the AnalysisRun is marked Failed
- The rollout controller aborts the rollout and rolls back to the stable version
- Pods running the canary image are replaced with the stable image
- The Rollout enters a Degraded state with an abort message

# Check why a rollout aborted
kubectl argo rollouts get rollout my-service
# Look for: "Aborted due to failed analysis..."

kubectl get analysisrun -n my-app
kubectl describe analysisrun my-service-xxxxx -n my-app

10. Feature Flag Integration

Feature flags and progressive delivery are complementary but distinct:
- Progressive delivery controls which users GET the new version (infra-level routing)
- Feature flags control which features are ENABLED within a version (app-level toggle)

Typical layered approach:

10% of traffic → canary pods (new version)
  └── New version has feature flag SDK
      ├── Flag "new-checkout-flow" = ON for 5% of all users
      └── Flag "new-checkout-flow" = OFF for 95% of all users

OpenFeature integration in a canary deployment:

# Application reads feature flags from OpenFeature SDK
# Flag provider (LaunchDarkly, Flagsmith, etc.) serves flags
# Rollout controls WHICH pods serve traffic (infrastructure)
# Feature flags control WHAT behavior those pods exhibit (application)

Quick Reference

# Rollout operations
kubectl argo rollouts list rollouts -n my-app
kubectl argo rollouts get rollout my-service -n my-app
kubectl argo rollouts get rollout my-service -n my-app --watch

# Promote (move to next step or full promotion)
kubectl argo rollouts promote my-service -n my-app
kubectl argo rollouts promote my-service -n my-app --full  # skip all steps

# Abort and rollback
kubectl argo rollouts abort my-service -n my-app

# Retry an aborted rollout (after fixing the issue)
kubectl argo rollouts retry rollout my-service -n my-app

# Pause / resume
kubectl argo rollouts pause my-service -n my-app
kubectl argo rollouts resume my-service -n my-app

# Update image (trigger a new rollout)
kubectl argo rollouts set image my-service \
  my-service=ghcr.io/myorg/my-service:v1.2.4 -n my-app

# Check analysis runs
kubectl get analysisrun -n my-app
kubectl describe analysisrun my-service-abcde -n my-app

# Undo last rollout
kubectl argo rollouts undo my-service -n my-app

# Dashboard (port-forward)
kubectl argo rollouts dashboard -n my-app
# Opens at http://localhost:3100
