Portal | Level: L2: Operations | Topics: Progressive Delivery | Domain: Kubernetes
Progressive Delivery — Primer¶
Why This Matters¶
Every deployment is a bet that the new version is better than the old one. Progressive delivery lets you make that bet incrementally — exposing the new version to 5% of traffic, watching the metrics for 10 minutes, then rolling forward or back based on data rather than hope.
Without progressive delivery, a deploy is a binary event: you're either 100% on the old version or 100% on the new one. The rollout window is too short to catch statistical regressions, and rolling back requires another full deploy. With a canary, you limit the blast radius of a bad release to a small percentage of users while observing real traffic behavior.
The tooling (Argo Rollouts, Flagger) automates what teams used to do manually: deploy a small batch, query Prometheus for error rate and latency, compare against a baseline, decide. This closes the feedback loop from minutes to seconds and makes progressive delivery repeatable without human toil.
Core Concepts¶
1. Deployment Strategies Compared¶
| Strategy | How it works | Blast radius | Rollback speed | Cost |
|---|---|---|---|---|
| Rolling | Replace pods gradually (N at a time) | Up to 100% if metric check is absent | ~= deploy time | No extra resources |
| Recreate | Kill all old pods, then create all new | 100% (brief outage) | Requires a full redeploy | No extra resources |
| Blue/Green | Stand up full new stack, switch traffic | 0% until cutover | Instant (flip LB) | 2x resource cost |
| Canary | Route small % to new version, increase gradually | Configurable % | Instant abort | Partial extra resources |
| A/B Testing | Route specific users to new version (headers, cookies) | Targeted subset | Instant abort | Partial extra resources |
Name origin: "Canary release" comes from the coal mining practice of bringing canaries into mines. If the canary died, miners knew the air was toxic and retreated. In software, the canary version is the first to "die" (show errors) so you can pull back before the whole fleet is affected.
Canary is the default progressive delivery primitive. Blue/green is useful when schema changes or state mean you can't run old and new simultaneously. A/B is for feature flags at the infrastructure level.
2. Argo Rollouts Architecture¶
Argo Rollouts extends Kubernetes with a Rollout CRD that replaces Deployment for managed workloads. It adds a rollout controller that orchestrates the traffic split, analysis, and promotion/abort logic.
Install:
```shell
kubectl create namespace argo-rollouts
kubectl apply -n argo-rollouts -f https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml

# kubectl plugin
curl -LO https://github.com/argoproj/argo-rollouts/releases/latest/download/kubectl-argo-rollouts-linux-amd64
chmod +x kubectl-argo-rollouts-linux-amd64
sudo mv kubectl-argo-rollouts-linux-amd64 /usr/local/bin/kubectl-argo-rollouts
```
3. Rollout Resource — Canary Strategy¶
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-service
  namespace: my-app
spec:
  replicas: 10
  revisionHistoryLimit: 3
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      containers:
        - name: my-service
          image: ghcr.io/myorg/my-service:v1.2.3
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 256Mi
  strategy:
    canary:
      canaryService: my-service-canary   # Service pointing to canary pods
      stableService: my-service-stable   # Service pointing to stable pods
      trafficRouting:
        nginx:
          stableIngress: my-service-ingress
      steps:
        - setWeight: 5            # Route 5% of traffic to canary
        - pause: {duration: 5m}   # Wait 5 minutes
        - analysis:               # Run analysis
            templates:
              - templateName: success-rate
            args:
              - name: service-name
                value: my-service-canary
        - setWeight: 25
        - pause: {duration: 10m}
        - setWeight: 50
        - pause: {duration: 10m}
        - setWeight: 100          # Full promotion
      canaryMetadata:
        labels:
          version: canary
      stableMetadata:
        labels:
          version: stable
```
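The steps above define a happy-path timeline you can reason about before shipping. A minimal sketch (plain Python, not the Argo Rollouts API; the assumed ~5 minutes per analysis step comes from the five 60s measurements in the template below):

```python
# Hypothetical sketch: compute the traffic schedule implied by canary steps.
# Step keys mirror the Rollout spec, but this is illustrative code only.
steps = [
    {"setWeight": 5},
    {"pause": {"duration": "5m"}},
    {"analysis": {}},            # assume ~5m for five 60s measurements
    {"setWeight": 25},
    {"pause": {"duration": "10m"}},
    {"setWeight": 50},
    {"pause": {"duration": "10m"}},
    {"setWeight": 100},
]

def schedule(steps, analysis_minutes=5):
    """Return (elapsed_minutes, weight) checkpoints for a happy-path rollout."""
    t, out = 0, []
    for step in steps:
        if "setWeight" in step:
            out.append((t, step["setWeight"]))
        elif "pause" in step:
            t += int(step["pause"]["duration"].rstrip("m"))
        elif "analysis" in step:
            t += analysis_minutes
    return out

for minute, weight in schedule(steps):
    print(f"t={minute:2d}m -> {weight}% canary")
# t= 0m -> 5%, t=10m -> 25%, t=20m -> 50%, t=30m -> 100%
```

Walking the schedule like this before a change makes it obvious how long a full promotion takes (here, about 30 minutes) and where aborts can still occur.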
4. AnalysisTemplate and AnalysisRun¶
AnalysisTemplate defines what metrics to query and what thresholds constitute success/failure. An AnalysisRun is a running instance, created automatically when a rollout reaches an analysis step.
```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
  namespace: my-app
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 60s
      count: 5                 # Run 5 measurements
      successCondition: result[0] >= 0.99
      failureLimit: 2          # Allow 2 failed measurements before aborting
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc.cluster.local:9090
          query: |
            sum(rate(http_requests_total{
              job="{{args.service-name}}",
              status!~"5.."
            }[2m])) /
            sum(rate(http_requests_total{
              job="{{args.service-name}}"
            }[2m]))
    - name: latency-p99
      interval: 60s
      count: 5
      successCondition: result[0] < 0.5   # 500ms
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc.cluster.local:9090
          query: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{
                job="{{args.service-name}}"
              }[2m])) by (le)
            )
```
Multiple metric providers are supported: Prometheus, Datadog, NewRelic, CloudWatch, Web (generic HTTP), Job (run a Kubernetes Job as the analysis).
Gotcha: `successCondition` uses `result[0]`, not `result`. A common mistake is writing `successCondition: result >= 0.99`, which silently evaluates incorrectly. Always index into the result array.

Default trap: the default `failureLimit` is 0, meaning a single failed measurement aborts the entire rollout. For noisy metrics, set `failureLimit: 2` or higher to tolerate transient dips.
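The `failureLimit` semantics can be sketched in a few lines (illustrative Python, not the controller's actual code; function and argument names are made up):

```python
# Hypothetical sketch of how an AnalysisRun evaluates measurements:
# each interval yields a sample, and the run fails once the number of
# failed measurements exceeds failureLimit.
def run_analysis(measurements, threshold=0.99, failure_limit=2):
    """Return 'Successful' or 'Failed' for a series of success-rate samples."""
    failures = 0
    for value in measurements:
        if not value >= threshold:   # successCondition: result[0] >= 0.99
            failures += 1
            if failures > failure_limit:
                return "Failed"      # the rollout controller aborts here
    return "Successful"

print(run_analysis([0.999, 0.985, 0.999, 0.991, 0.999]))    # two dips tolerated
print(run_analysis([0.98, 0.97, 0.96], failure_limit=0))    # default: first dip aborts
```

This is why the default of 0 is a trap for noisy metrics: one transient dip below the threshold ends the rollout.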
5. Rollout with Blue/Green Strategy¶
```yaml
spec:
  strategy:
    blueGreen:
      activeService: my-service-active     # Points to current live (blue)
      previewService: my-service-preview   # Points to new version (green)
      autoPromotionEnabled: false          # Require manual promotion
      prePromotionAnalysis:
        templates:
          - templateName: success-rate
        args:
          - name: service-name
            value: my-service-preview
      postPromotionAnalysis:
        templates:
          - templateName: success-rate
        args:
          - name: service-name
            value: my-service-active       # active has flipped to new version
      scaleDownDelaySeconds: 30            # Keep old stack for 30s after promotion
```
Blue/green flow:
1. previewService is updated to point to new pods
2. prePromotionAnalysis runs against the preview service
3. Promotion (automatic or manual) flips activeService to new pods
4. postPromotionAnalysis runs against the (now live) new version
5. Old pods scaled down after scaleDownDelaySeconds
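The flow above is essentially a two-state selector swap. A tiny sketch (illustrative Python; class and attribute names are made up, with Service selectors modeled as plain strings):

```python
# Hypothetical model of the blue/green promotion: activeService and
# previewService are just label selectors that the controller repoints.
class BlueGreen:
    def __init__(self):
        self.active = "blue"    # activeService selector (live traffic)
        self.preview = None     # previewService selector (new pods)

    def deploy(self, version):
        self.preview = version  # step 1: preview points at new pods

    def promote(self):
        # step 3: flip activeService to the previewed version
        assert self.preview is not None, "nothing to promote"
        self.active, self.preview = self.preview, None

bg = BlueGreen()
bg.deploy("green")
assert (bg.active, bg.preview) == ("blue", "green")  # pre-promotion analysis window
bg.promote()
assert bg.active == "green"                          # post-promotion analysis target
```

The instant-rollback property of blue/green falls out of this model: as long as the old pods still exist (within `scaleDownDelaySeconds`), undoing the promotion is just another selector flip.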
6. Traffic Splitting with Nginx and Istio¶
Nginx traffic splitting uses the `canary-weight` annotation on a canary Ingress:
```yaml
# Argo Rollouts manages these annotations automatically;
# you just define stableIngress in the Rollout spec
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-service-ingress
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "10"   # managed by rollout controller
spec:
  rules:
    - host: my-service.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-service-canary
                port:
                  number: 80
```
Istio traffic splitting via VirtualService:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
spec:
  strategy:
    canary:
      trafficRouting:
        istio:
          virtualService:
            name: my-service-vsvc
            routes:
              - primary   # name of the route in the VirtualService
          destinationRule:
            name: my-service-dr
            canarySubsetName: canary
            stableSubsetName: stable
```

```yaml
# VirtualService managed by Argo Rollouts
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: my-service-vsvc
spec:
  hosts:
    - my-service
  http:
    - name: primary
      route:
        - destination:
            host: my-service
            subset: stable
          weight: 90   # managed dynamically
        - destination:
            host: my-service
            subset: canary
          weight: 10   # managed dynamically
```
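The point of mesh-managed weights is that traffic share is decoupled from replica count, which a plain rolling update cannot do. A trivial sketch of the difference (illustrative Python; function names are made up):

```python
# Rolling update: traffic share follows pod ratio under round-robin balancing.
def rolling_share(new_pods, total_pods):
    """Approximate traffic share hitting new pods in a rolling update."""
    return new_pods / total_pods

# Canary with Istio/Nginx: traffic share is whatever weight the mesh holds,
# regardless of how many canary replicas exist.
def canary_share(weight_percent):
    """Traffic share set by the mesh or ingress controller."""
    return weight_percent / 100

# 1 new pod out of 10 -> ~10% of traffic, whether you want that or not
assert rolling_share(1, 10) == 0.10
# A single canary pod can still receive just 1% when the mesh holds the weight
assert canary_share(1) == 0.01
```

This is the core answer to "how is a canary different from a rolling update": the mesh, not the replica count, decides exposure.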
7. Header-Based Routing for Testing¶
Route specific requests to the canary (e.g., internal testers, QA automation) without affecting general traffic:
```yaml
# With Istio
spec:
  strategy:
    canary:
      trafficRouting:
        managedRoutes:              # required: declares routes the controller may create
          - name: header-route
        istio:
          virtualService:
            name: my-service-vsvc
      steps:
        - setHeaderRoute:
            name: header-route
            match:
              - headerName: X-Canary
                headerValue:
                  exact: "true"
        - pause: {}                 # Pause indefinitely; manual promotion
```
With Nginx:
```yaml
steps:
  - setCanaryScale:
      weight: 0              # 0% of general traffic
  - setHeaderRoute:
      name: canary-header
      match:
        - headerName: X-Canary-Version
          headerValue:
            exact: v2
  - pause: {duration: 1h}    # Soak with header-based traffic only
  - setWeight: 10
  - pause: {duration: 10m}
  # ...
```
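The routing decision behind `setHeaderRoute` is a simple exact match on a request header. A sketch (illustrative Python; header names mirror the example config above):

```python
# Hypothetical sketch of the per-request decision the ingress/mesh applies
# while a setHeaderRoute step is active.
def route(headers, header_name="X-Canary", expected="true"):
    """Return which backend a request is sent to under an exact header match."""
    if headers.get(header_name) == expected:
        return "canary"
    return "stable"

assert route({"X-Canary": "true"}) == "canary"   # QA traffic opts in
assert route({"X-Canary": "false"}) == "stable"  # wrong value falls through
assert route({}) == "stable"                     # general traffic is unaffected
```

Because only requests carrying the header reach the canary, you can soak a release with internal traffic while external users see 0% of it.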
Interview tip: When asked "how is a canary different from a rolling update?", the key answer is traffic control. A rolling update splits traffic by pod ratio (if 1 of 10 pods is new, ~10% of traffic hits it). A canary with Istio or Nginx splits traffic by percentage independent of replica count: you can send 1% to a single canary pod regardless of how many stable pods exist.

8. Flagger¶
Flagger is the Flux-ecosystem alternative to Argo Rollouts. It uses a Canary CRD and a similar analysis/promotion model, but integrates natively with Flux GitOps.
```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: my-service
  namespace: my-app
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service
  progressDeadlineSeconds: 600
  service:
    port: 80
    targetPort: 8080
    gateways:
      - public-gateway.istio-system.svc.cluster.local
    hosts:
      - my-service.example.com
  analysis:
    interval: 1m
    threshold: 5        # Max number of failed checks before abort
    maxWeight: 50       # Max canary traffic percentage
    stepWeight: 10      # Increment per step
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 500      # milliseconds
        interval: 30s
    webhooks:
      - name: acceptance-test
        type: pre-rollout
        url: http://flagger-loadtester.test/
        timeout: 30s
        metadata:
          type: bash
          cmd: "curl -sd 'test' http://my-service-canary.my-app/health | grep OK"
```
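Flagger's promotion schedule follows directly from `stepWeight` and `maxWeight`: the canary weight climbs by `stepWeight` each `interval` until it reaches `maxWeight`, at which point Flagger promotes. Plain arithmetic, not Flagger's actual code:

```python
# Sketch of the weight progression implied by the analysis settings above.
def flagger_weights(step_weight=10, max_weight=50):
    """Weights Flagger walks through before promotion."""
    weights, w = [], 0
    while w < max_weight:
        w = min(w + step_weight, max_weight)
        weights.append(w)
    return weights

print(flagger_weights())   # with interval: 1m, five checks before promotion
```

So the config above spends roughly five minutes (five 1-minute intervals) ramping from 10% to 50% before promoting, assuming every check passes.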
9. Automated Rollback on Metric Degradation¶
Both Argo Rollouts and Flagger can automatically abort and roll back when analysis fails.
Argo Rollouts abort behavior:
- `successCondition` fails more than `failureLimit` times → the AnalysisRun is marked Failed
- The rollout controller aborts the rollout → traffic shifts back to the stable version
- Pods running the canary image are replaced with the stable image
- The Rollout enters a Degraded state with an abort message
```shell
# Check why a rollout aborted
kubectl argo rollouts get rollout my-service
# Look for: "Aborted due to failed analysis..."

kubectl get analysisrun -n my-app
kubectl describe analysisrun my-service-xxxxx -n my-app
```
10. Feature Flag Integration¶
Feature flags and progressive delivery are complementary but distinct:

- Progressive delivery: controls which users GET the new version (infra-level routing)
- Feature flags: controls which features are ENABLED within a version (app-level toggle)
Typical layered approach:
```text
10% of traffic → canary pods (new version)
 └── New version has feature flag SDK
      ├── Flag "new-checkout-flow" = ON for 5% of all users
      └── Flag "new-checkout-flow" = OFF for 95% of all users
```
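Layering the two multiplies exposures. A sketch of the combined math, under the assumption that canary assignment and flag targeting are independent random samples of users (not guaranteed in practice if either layer uses sticky or targeted assignment):

```python
# Hypothetical exposure calculation for the layered setup above.
def exposed_fraction(canary_traffic, flag_rollout):
    """Fraction of all users who hit a canary pod AND have the flag on."""
    return canary_traffic * flag_rollout

# 10% of traffic on canary pods, flag on for 5% of all users
print(f"{exposed_fraction(0.10, 0.05):.3%}")   # 0.500% see the new checkout flow
```

This is why the combination is so useful for risky features: each layer limits blast radius independently, and either one can be dialed to zero without touching the other.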
OpenFeature integration in a canary deployment:

- The application reads feature flags from the OpenFeature SDK
- A flag provider (LaunchDarkly, Flagsmith, etc.) serves flag values
- The Rollout controls WHICH pods serve traffic (infrastructure)
- Feature flags control WHAT behavior those pods exhibit (application)
Quick Reference¶
```shell
# Rollout operations
kubectl argo rollouts list rollouts -n my-app
kubectl argo rollouts get rollout my-service -n my-app
kubectl argo rollouts get rollout my-service -n my-app --watch

# Promote (move to next step or full promotion)
kubectl argo rollouts promote my-service -n my-app
kubectl argo rollouts promote my-service -n my-app --full   # skip all steps

# Abort and roll back
kubectl argo rollouts abort my-service -n my-app

# Retry an aborted rollout (after fixing the issue)
kubectl argo rollouts retry rollout my-service -n my-app

# Pause / resume
kubectl argo rollouts pause my-service -n my-app
kubectl argo rollouts resume my-service -n my-app

# Update image (trigger a new rollout)
kubectl argo rollouts set image my-service \
  my-service=ghcr.io/myorg/my-service:v1.2.4 -n my-app

# Check analysis runs
kubectl get analysisrun -n my-app
kubectl describe analysisrun my-service-abcde -n my-app

# Undo last rollout
kubectl argo rollouts undo my-service -n my-app

# Dashboard (port-forward)
kubectl argo rollouts dashboard -n my-app
# Opens at http://localhost:3100
```
Wiki Navigation¶
Prerequisites¶
- Kubernetes Ops (Production) (Topic Pack, L2)
- ArgoCD & GitOps (Topic Pack, L2)