
Progressive Delivery — Street-Level Ops

Quick Diagnosis Commands

# Overall rollout state
kubectl argo rollouts list rollouts -A
kubectl argo rollouts get rollout my-service -n my-app --watch

# Is the canary currently running? (assumes pods carry a version label;
# otherwise compare rollouts-pod-template-hash against .status.currentPodHash)
kubectl get pods -n my-app -l app=my-service --show-labels | grep -E "version=canary|version=stable"

# Current traffic split
kubectl get ingress -n my-app -o yaml | grep -E "canary-weight|canary:"

# Analysis runs — are they passing?
kubectl get analysisrun -n my-app
kubectl describe analysisrun -n my-app | grep -E "Phase|Message|Failed"

# Rollout controller logs
kubectl -n argo-rollouts logs -l app.kubernetes.io/name=argo-rollouts --tail=50 -f

# What step is the rollout on?
kubectl get rollout my-service -n my-app -o jsonpath='{.status.currentStepIndex}'

Gotcha: Analysis Passes With Zero Samples

Your AnalysisTemplate has a Prometheus query. The query returns no data (the metric doesn't exist yet, wrong label, wrong job name). Argo Rollouts treats the empty result as 0.0, which satisfies a lax successCondition such as result[0] >= 0.0. The rollout promotes successfully even though zero measurements were taken.

Rule: Guard against empty results. Require a minimum sample count, and set successCondition to a threshold (e.g., result[0] >= 0.99) that an empty or zero result cannot satisfy. Validate your Prometheus query manually before baking it into an AnalysisTemplate.
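
A sketch of that guard in AnalysisTemplate terms. successCondition is evaluated with expr syntax, so len() is available; the 0.99 threshold and metric name are illustrative:

```yaml
# Guarded metric sketch: the len() check means an empty query result
# can no longer satisfy the condition, so zero-sample runs fail loudly.
metrics:
  - name: success-rate
    interval: 60s
    count: 5
    successCondition: len(result) == 1 && result[0] >= 0.99
    failureLimit: 1
```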

# Test the PromQL query manually
curl -s "http://prometheus.monitoring.svc.cluster.local:9090/api/v1/query" \
  --data-urlencode 'query=sum(rate(http_requests_total{job="my-service-canary"}[2m]))' \
  | jq '.data.result'
# If result is empty: your query is wrong

Gotcha: Canary Traffic Split Applies to ALL Requests Including Health Checks

You set 5% canary weight. Your load balancer health checks are also routed 5% to the canary. If the canary is broken and returns 500s on /health, the load balancer may mark the canary Service unhealthy, but the 5% of user traffic still hitting it sees errors, while the error rate may be too diluted in the aggregate metrics for the analysis to abort.

Rule: Make sure your AnalysisTemplate query is scoped to user-facing traffic (not health-check traffic). Filter by path or use a separate metrics endpoint. Also set failureLimit: 0 for critical metrics if even a single failed measurement should abort.
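
One way to scope the query, assuming the app exports a path label on http_requests_total (adjust to whatever labels your instrumentation actually emits):

```yaml
# Hypothetical user-traffic-only query: the path label is an assumption
# about your instrumentation, not something Argo Rollouts provides.
query: |
  sum(rate(http_requests_total{service="{{args.service-name}}",path!="/health",code!~"5.."}[2m])) /
  sum(rate(http_requests_total{service="{{args.service-name}}",path!="/health"}[2m]))
```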


Gotcha: --full Promotion Bypasses Analysis

kubectl argo rollouts promote my-service -n my-app --full

This skips all remaining steps including analysis runs. It's useful for hotfixes but dangerous if used routinely. Once the habit forms, engineers promote fully to "get it done quickly" and the analysis safety net becomes vestigial.

Rule: Reserve --full for genuine emergencies (the new version fixes a live outage and there's no time to wait for analysis). Document it in the incident record. If it's used more than once a week, the rollout steps are too slow and should be tuned — not bypassed.

Scale note: At high traffic volumes, even 5% canary weight generates enough signal to detect regressions within minutes. At low traffic (< 10 RPS), 5% canary may produce so few data points that your AnalysisTemplate never gets a meaningful measurement. For low-traffic services, increase canary weight to 20-30% or extend analysis duration.
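
A quick back-of-envelope check makes the low-traffic problem concrete; the numbers below are illustrative:

```shell
# Data points per analysis interval = RPS * canary weight * interval length.
# At 10 RPS and 5% weight, a 60s Prometheus window sees only ~30 requests.
rps=10; weight=5; interval=60
awk -v r="$rps" -v w="$weight" -v i="$interval" \
  'BEGIN { printf "%d\n", r * (w / 100) * i }'
# → 30
```

Thirty requests is barely enough to distinguish a 1% error-rate regression from noise, which is why bumping the weight or stretching the interval helps.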


Pattern: Canary with Automated Rollback

Full production-ready Rollout with automated rollback on Prometheus signal degradation:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api-service
  namespace: production
spec:
  replicas: 20
  selector:
    matchLabels:
      app: api-service
  template:
    metadata:
      labels:
        app: api-service
    spec:
      containers:
        - name: api
          image: ghcr.io/myorg/api:v2.1.0
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
          resources:
            requests:
              cpu: 200m
              memory: 256Mi
  strategy:
    canary:
      canaryService: api-service-canary
      stableService: api-service-stable
      trafficRouting:
        nginx:
          stableIngress: api-service-ingress
      steps:
        - setWeight: 5
        - pause: {duration: 5m}
        - analysis:
            templates:
              - templateName: http-success-rate
            args:
              - name: service-name
                value: api-service-canary
        - setWeight: 20
        - pause: {duration: 10m}
        - analysis:
            templates:
              - templateName: http-success-rate
              - templateName: latency-p99
            args:
              - name: service-name
                value: api-service-canary
        - setWeight: 50
        - pause: {duration: 15m}
        - setWeight: 100
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: http-success-rate
  namespace: production
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 60s
      count: 5
      successCondition: result[0] >= 0.99
      failureLimit: 1
      inconclusiveLimit: 2
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",code!~"5.."}[2m])) /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[2m]))
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: latency-p99
  namespace: production
spec:
  args:
    - name: service-name
  metrics:
    - name: p99-latency
      interval: 60s
      count: 5
      successCondition: result[0] < 0.3
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{
              service="{{args.service-name}}"}[2m])) by (le))

Pattern: Blue/Green for Schema-Breaking Changes

When a migration must complete before the new code can run, run the migration as an Argo CD PreSync hook Job and use blue/green with manual promotion:

spec:
  strategy:
    blueGreen:
      activeService: api-active
      previewService: api-preview
      autoPromotionEnabled: false    # Wait for human approval
      prePromotionAnalysis:
        templates:
          - templateName: smoke-test
        args:
          - name: service-name
            value: api-preview
      scaleDownDelaySeconds: 60
---
# Smoke test analysis — runs a job instead of querying Prometheus
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: smoke-test
spec:
  args:
    - name: service-name
  metrics:
    - name: smoke
      provider:
        job:
          spec:
            template:
              spec:
                restartPolicy: Never
                containers:
                  - name: smoke
                    image: curlimages/curl:latest
                    command:
                      - sh
                      - -c
                      - |
                        curl -sf "http://{{args.service-name}}/health" && \
                        curl -sf "http://{{args.service-name}}/api/v2/users?limit=1"
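
The migration itself can run ahead of the sync as an Argo CD PreSync hook Job. A minimal sketch, assuming the app is deployed via Argo CD; the Job name, image, and migrate command are placeholders for your own tooling:

```yaml
# Hypothetical migration Job. The hook annotations are Argo CD's;
# everything else here is a placeholder to adapt.
apiVersion: batch/v1
kind: Job
metadata:
  name: api-migrate
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
spec:
  backoffLimit: 1
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: ghcr.io/myorg/api:v2.1.0
          command: ["./migrate", "up"]
```

If the Job fails, Argo CD stops the sync, so the new Rollout spec never reaches the cluster with an unmigrated schema.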

Scenario: Canary Rollout Stalls at 5%

Symptoms: Rollout has been at 5% for 45 minutes. Status shows Paused. No analysis runs visible.

Diagnosis:

kubectl argo rollouts get rollout my-service -n my-app
# Look at "Step" field and "Message" field

# Check if paused at a pause step (indefinite pause = manual gate)
kubectl get rollout my-service -n my-app -o jsonpath='{.spec.strategy.canary.steps}'
# If one step is: {"pause": {}} — no duration = manual promotion required

# Check if analysis failed (rollout aborted)
kubectl get analysisrun -n my-app --sort-by='.metadata.creationTimestamp' | tail -5

Debug clue: When a canary rollout stalls, check the argo-rollouts controller logs first. If the controller itself is crash-looping or under-resourced, all rollouts across all namespaces stall simultaneously. A single Rollout object with an invalid spec can wedge the controller's reconciliation loop.

Possible causes:
1. Step pause: {} with no duration — a manual gate; intentional
2. pause: {duration: 5m} but the rollout controller is behind — check controller logs
3. Analysis run stuck in Running — check Prometheus connectivity from the argo-rollouts controller

Resolution:

# If intentional manual gate — promote
kubectl argo rollouts promote my-service -n my-app

# If analysis is stuck — describe the analysis run
kubectl describe analysisrun my-service-xxxxx -n my-app
# Look for: "Get http://prometheus... connection refused"
# Fix: ensure Prometheus address is correct, check NetworkPolicy


Scenario: Rollout Aborted, Service is Degraded

Symptoms: kubectl argo rollouts get rollout my-service shows Degraded. Some pods are running the new image, some the old.

# Confirm state
kubectl argo rollouts get rollout my-service -n my-app
# Status should show: Degraded, Abort

# Check what failed
kubectl get analysisrun -n my-app
kubectl describe analysisrun <name> -n my-app | grep -A5 "Failure\|Message"

# Current pod versions
kubectl get pods -n my-app -l app=my-service \
  -o custom-columns='NAME:.metadata.name,IMAGE:.spec.containers[0].image,READY:.status.containerStatuses[0].ready'

Resolution:

# The rollout auto-reverts to stable on abort
# Confirm which ReplicaSet is marked stable (returns its pod-template-hash)
kubectl get rollout my-service -n my-app -o jsonpath='{.status.stableRS}'

# Fix the underlying issue, push a new image
kubectl argo rollouts set image my-service \
  my-service=ghcr.io/myorg/my-service:v1.2.4-fixed -n my-app

# Retry the rollout
kubectl argo rollouts retry rollout my-service -n my-app


Emergency: Canary is Causing Errors, Need Instant Rollback

Traffic is split 50/50 and the canary is throwing 500s. Every second matters.

# 1. Abort immediately — rolls back to stable
kubectl argo rollouts abort my-service -n my-app

# 2. Verify rollback is complete (watch for Healthy status)
kubectl argo rollouts get rollout my-service -n my-app --watch

# 3. Confirm traffic is 100% on stable service
kubectl get svc my-service-stable -n my-app
kubectl get ingress -n my-app -o yaml | grep canary-weight
# Should be 0 after abort

# 4. If the controller is slow, scale the canary ReplicaSet directly
#    (Rollouts manage ReplicaSets, not Deployments; find the canary RS first)
kubectl get rs -n my-app -l app=my-service
kubectl scale rs <canary-rs-name> -n my-app --replicas=0
# (Argo Rollouts will reconcile this, but it speeds up recovery)

Useful One-Liners

# Show all rollouts and their step progress
kubectl get rollout -A -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,DESIRED:.spec.replicas,READY:.status.readyReplicas,STEP:.status.currentStepIndex,PHASE:.status.phase'

# Watch canary pod count vs stable
watch -n5 "kubectl get pods -n my-app -l app=my-service --show-labels | grep -c canary; kubectl get pods -n my-app -l app=my-service --show-labels | grep -c stable"

# Get rollout history
kubectl argo rollouts history rollout my-service -n my-app

# Check analysis run results in detail
kubectl get analysisrun -n my-app -o json | jq '.items[].status.metricResults'
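
To narrow that to failed metrics only, the same jq approach works. The here-doc below is a canned sample that mimics the AnalysisRun status schema, so the filter can be sanity-checked without a cluster:

```shell
# Extract names of metrics whose phase is Failed. The here-doc stands in
# for real `kubectl get analysisrun -n my-app -o json` output.
cat <<'EOF' | jq -r '.items[].status.metricResults[] | select(.phase == "Failed") | .name'
{"items":[{"status":{"metricResults":[
  {"name":"success-rate","phase":"Failed","failed":2},
  {"name":"p99-latency","phase":"Successful","failed":0}]}}]}
EOF
# → success-rate
```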

# Pause all rollouts in a namespace (emergency freeze)
for r in $(kubectl get rollout -n my-app -o name); do
  kubectl argo rollouts pause ${r#*/} -n my-app
done

# Force image update (triggers new rollout)
kubectl argo rollouts set image my-service my-service=ghcr.io/myorg/my-service:v1.3.0 -n my-app

# Get current canary weight
kubectl get ingress -n my-app -o jsonpath='{.items[0].metadata.annotations.nginx\.ingress\.kubernetes\.io/canary-weight}'