Progressive Delivery Footguns¶
Mistakes that cause outages, failed rollouts, or false confidence in progressive delivery.
1. Analysis With a Prometheus Query That Returns No Data¶
Your AnalysisTemplate has `successCondition: result[0] >= 0.99`. The Prometheus query uses a label that doesn't exist yet on the canary pods (e.g., `service="my-service-canary"` while the pods are labeled `app="my-service"`), so the query returns an empty vector. With a `>=` threshold this at least fails closed: empty data cannot satisfy `result[0] >= 0.99`, so the analysis fails.
Now invert the condition: `successCondition: result[0] <= 0.01` (error rate below 1%). An empty result evaluated as `0` satisfies `0 <= 0.01`, so the metric succeeds. Your canary "passes" with zero measurements.
Fix: Always test PromQL queries against the real Prometheus before embedding them in an AnalysisTemplate. Require a minimum number of samples (e.g. a `count()`-based volume check) and write the successCondition so that zero data fails it rather than passes it.
Gotcha: Prometheus `rate()` returns an empty vector (not zero) when a counter doesn't exist yet. A canary pod that hasn't received any requests has no `http_requests_total` metric. Your success rate calculation `rate(errors) / rate(total)` returns `NaN` (division by empty vector), which Argo Rollouts may interpret as `0`: a passing error rate. Always add a `count()` or `vector()` fallback.
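A minimal sketch of a fail-closed AnalysisTemplate. The metric name, Prometheus address, label names, and the 1% threshold are illustrative assumptions; the key part is `len(result) == 1`, which makes an empty vector fail the condition instead of passing it:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: error-rate
      interval: 1m
      count: 5
      # Empty vector => len(result) == 0 => condition is false => measurement FAILS
      successCondition: len(result) == 1 && result[0] <= 0.01
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090  # assumed in-cluster address
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",code=~"5.."}[2m]))
            /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[2m]))
```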
2. Using autoPromotionEnabled: true on Blue/Green in Production¶
You set autoPromotionEnabled: true on a blue/green Rollout. Pre-promotion analysis passes. Active service flips to the new (green) stack. The green stack has a bug that only manifests under real load patterns — it crashes when it processes more than 100 concurrent requests. At low preview traffic it was fine. Now 100% of production traffic hits it and the service is down.
Fix: For production blue/green, always use autoPromotionEnabled: false with a manual promotion gate. Let engineering or SRE confirm the preview looks healthy before flipping the active service. The whole point of blue/green is that you can observe the green stack before cutting over.
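A sketch of the production-safe shape (service names are placeholders; the full pod template is omitted):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-service
spec:
  strategy:
    blueGreen:
      activeService: my-service-active    # receives production traffic
      previewService: my-service-preview  # green stack, for inspection only
      autoPromotionEnabled: false         # a human runs: kubectl argo rollouts promote my-service
```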
3. Not Having a Readiness Probe on the Canary¶
Pods are scheduled and the rollout proceeds to weight the canary at 5%. The canary containers are starting but not ready — they're still initializing the application (connecting to DB, warming caches). Requests hit the canary during this window and get connection refused errors. The AnalysisTemplate doesn't catch it because the error rate window is short.
Fix: Every Rollout template must have a readiness probe. The rollout controller will not route traffic to pods that are not Ready. initialDelaySeconds must be long enough for the application to fully initialize.
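A pod-template fragment inside the Rollout, sketched with assumed port, path, and timings; tune `initialDelaySeconds` to your app's actual startup (DB connect, cache warmup):

```yaml
spec:
  template:
    spec:
      containers:
        - name: app
          image: my-service:1.2.3        # placeholder image
          readinessProbe:
            httpGet:
              path: /healthz             # assumed health endpoint
              port: 8080
            initialDelaySeconds: 15      # must cover full app initialization
            periodSeconds: 5
            failureThreshold: 3
```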
4. Forgetting That Rollout Pauses Block GitOps Reconciliation¶
Your Rollout is paused at a manual step. A developer pushes a config change to Git (not an image change). ArgoCD tries to sync the Application. It sees the Rollout spec has changed and re-applies it. If ArgoCD's sync strategy isn't careful about Rollout resources, this can restart the rollout from step 0, losing the paused state and re-routing traffic unexpectedly.
Fix: Use argocd.argoproj.io/sync-options: Force=false on the Rollout resource in ArgoCD to prevent force-applying. Or use ArgoCD's ignoreDifferences for the .status subresource of Rollouts. Coordinate config-only changes through a separate path that doesn't touch the Rollout spec's image field.
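An ArgoCD Application fragment sketching the `ignoreDifferences` approach, so ArgoCD stops diffing fields the Rollouts controller owns:

```yaml
spec:
  ignoreDifferences:
    - group: argoproj.io
      kind: Rollout
      jsonPointers:
        - /status   # controller-owned; never a reason to re-sync
```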
5. Setting Canary Weight Without Stable/Canary Services¶
You define a Rollout with setWeight: 10 but don't create canaryService and stableService (or don't configure a traffic routing provider). Argo Rollouts falls back to pod-count-based splitting: it creates 1 canary pod out of 10 total — which IS ~10% of pod capacity but is NOT traffic splitting at the load balancer level. If the LB uses round-robin, a client might hit the canary many times in a row. More importantly, header-based routing won't work.
Fix: For real traffic splitting, you must define canaryService, stableService, and a trafficRouting block pointing at Nginx, Istio, or another supported mesh/ingress. Pod-count splitting alone is not reliable for canary analysis.
Under the hood: Without a traffic management provider, Argo Rollouts controls canary weight by scaling pod counts: `setWeight: 10` with 10 replicas = 1 canary pod + 9 stable pods. But Kubernetes Service round-robin doesn't guarantee exactly 10% of requests go to the canary; it depends on connection reuse, keep-alive, and load balancer behavior. Istio VirtualService or Nginx canary annotations provide actual request-level traffic splitting.
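A canary strategy fragment with real traffic routing, sketched for the Nginx provider (service and ingress names are assumptions):

```yaml
spec:
  strategy:
    canary:
      canaryService: my-service-canary   # selector patched to the canary ReplicaSet
      stableService: my-service-stable
      trafficRouting:
        nginx:
          stableIngress: my-service-ingress  # your existing Ingress for the stable service
      steps:
        - setWeight: 10                  # now an actual request-level split
        - pause: {duration: 10m}
        - setWeight: 50
        - pause: {duration: 10m}
```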
6. Comparing Canary to Baseline Without Accounting for Traffic Differences¶
You use the canary's success rate query to determine whether to promote. At 5% weight, the canary receives less traffic and the confidence interval on the error rate is wide. A success rate of 98.5% on 200 requests might be statistically indistinguishable from 99.2% on 4000 requests. You auto-promote, and the actual error rate at 100% traffic is 1.5% — enough to breach your SLO.
Fix: Set a minimum traffic threshold before running analysis. Use longer interval and higher count at low canary weights. Or use Kayenta (Netflix's automated canary analysis service) which does proper statistical comparison between canary and baseline traffic.
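A metrics fragment sketching both ideas: a traffic-volume gate and a wider measurement window at low canary weight. The 100-request floor, intervals, and thresholds are illustrative assumptions:

```yaml
metrics:
  - name: request-volume
    interval: 5m
    count: 6
    # Don't trust the error rate on a handful of requests
    successCondition: len(result) == 1 && result[0] >= 100
    provider:
      prometheus:
        address: http://prometheus.monitoring.svc:9090
        query: sum(increase(http_requests_total{service="{{args.service-name}}"}[5m]))
  - name: success-rate
    interval: 5m        # longer window => more requests per sample at 5% weight
    count: 6            # 30 minutes of data before a verdict
    successCondition: len(result) == 1 && result[0] >= 0.99
    provider:
      prometheus:
        address: http://prometheus.monitoring.svc:9090
        query: |
          sum(rate(http_requests_total{service="{{args.service-name}}",code!~"5.."}[5m]))
          /
          sum(rate(http_requests_total{service="{{args.service-name}}"}[5m]))
```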
7. Not Setting scaleDownDelaySeconds on Blue/Green¶
You promote green to active. Blue is immediately terminated. A request in flight is being processed by a blue pod — it gets a connection reset. Also, if the post-promotion analysis fails and you need to roll back, the blue pods are already gone. You're now rolling back by deploying old pods from scratch, which takes time.
Fix: Set scaleDownDelaySeconds: 60 (or more) to keep the old stack running briefly after promotion. This allows:
1. In-flight requests to complete
2. Post-promotion analysis to run while rollback is still instant (flip the Service back)
3. A brief window to manually verify before the old stack disappears
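A blue/green strategy fragment with the delay in place (120 seconds is an assumption; size it to your longest in-flight request plus verification time):

```yaml
spec:
  strategy:
    blueGreen:
      activeService: my-service-active
      previewService: my-service-preview
      autoPromotionEnabled: false
      scaleDownDelaySeconds: 120   # blue stays up: requests drain, rollback is just a Service flip
```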
8. Running Analysis Against the Stable Service Instead of the Canary¶
You're measuring success rate of my-service-stable in your AnalysisTemplate. It passes. You promote. The canary was actually broken — the analysis was validating the old version, not the new one.
Fix: Double-check the args section of your analysis step. The service-name argument should resolve to the CANARY service, not stable. At the post-sync stage, it should resolve to the now-active service.
Debug clue: `kubectl get analysisrun -n <ns> -o yaml` shows the resolved arguments and measurement results. If `message` fields say "success" but your canary was clearly broken, check the `args`: the service name is the most common misconfiguration. Argo Rollouts substitutes `{{args.service-name}}` at creation time, so the resolved value is visible in the AnalysisRun spec.
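A canary steps fragment showing the argument wired to the canary service (names are placeholders; the `error-rate` template is assumed to take a `service-name` arg):

```yaml
strategy:
  canary:
    canaryService: my-service-canary
    stableService: my-service-stable
    steps:
      - setWeight: 10
      - analysis:
          templates:
            - templateName: error-rate
          args:
            - name: service-name
              value: my-service-canary   # the CANARY service, not my-service-stable
```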
9. Ignoring inconclusiveLimit — Rollout Hangs Forever¶
Your analysis cannot reach a verdict. If a measurement satisfies neither successCondition nor failureCondition (both are set), it is Inconclusive; if the provider itself is unreachable (Prometheus down), measurements come back as errors instead. Either way the AnalysisRun never concludes cleanly, the Rollout sits at the analysis step indefinitely, and the release is blocked.
Fix: Bound both cases. Set `inconclusiveLimit` (e.g. 3) for measurements that decide nothing, and `consecutiveErrorLimit` for provider errors. Once a limit is exceeded the AnalysisRun terminates and the Rollout stops progressing instead of hanging silently. An aborted rollout is recoverable (retry after fixing); a hung rollout blocks all further deployments.
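A metric fragment sketching both limits (the specific numbers are assumptions; check the exact semantics of these fields against the Argo Rollouts version you run):

```yaml
metrics:
  - name: success-rate
    interval: 1m
    count: 5
    failureLimit: 2
    inconclusiveLimit: 3        # stop after 3 measurements that satisfy neither condition
    consecutiveErrorLimit: 3    # stop if the provider (e.g. Prometheus) errors 3x in a row
    successCondition: len(result) == 1 && result[0] >= 0.99
    failureCondition: len(result) == 1 && result[0] < 0.95
```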
10. Using Feature Branches for Rollout Testing in Prod¶
Engineers create "test" Rollouts that point at feature branch images in the production cluster to validate behavior before merging. These Rollouts pile up, consume resources, and sometimes route production traffic because someone misconfigured the service selector.
Fix: Feature branch validation belongs in a staging or preview environment. Production Rollouts should only deploy images from the main branch via the standard CI/CD pipeline. Enforce this with AppProject source repository restrictions in ArgoCD.
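An AppProject fragment sketching the restriction (the repo URL and namespace pattern are hypothetical placeholders for your GitOps repo and prod namespaces):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: production
  namespace: argocd
spec:
  sourceRepos:
    - https://github.com/my-org/gitops-config.git  # only the trusted GitOps repo may deploy here
  destinations:
    - namespace: "prod-*"
      server: https://kubernetes.default.svc
```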
11. Not Accounting for Session Affinity in Canary¶
Your application uses sticky sessions (cookie-based or IP-based). A user gets routed to the canary pod. Subsequent requests use sticky routing and always hit the canary. That user gets 100% canary exposure while you think they're in a 5% canary pool. If the canary has a bug that manifests over time (memory leak, state corruption), that user is disproportionately affected.
Fix: If you use session affinity, either disable it during canary rollouts (accept that users may see different versions) or use header-based canary routing instead of weight-based, and clearly communicate to users that they're in a preview group.
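A sketch of header-based canary routing with Argo Rollouts and Istio (names and the `X-Canary` header are assumptions; header routes require an Istio-style traffic router and the route declared under `managedRoutes`):

```yaml
strategy:
  canary:
    canaryService: my-service-canary
    stableService: my-service-stable
    trafficRouting:
      managedRoutes:
        - name: canary-preview       # must be declared before setHeaderRoute can use it
      istio:
        virtualService:
          name: my-service-vsvc
    steps:
      - setHeaderRoute:
          name: canary-preview
          match:
            - headerName: X-Canary   # only opted-in preview users hit the canary
              headerValue:
                exact: "true"
      - pause: {}                    # hold while the preview group exercises the canary
```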