Decision Tree: Roll Back or Fix Forward?¶
Category: Operational Decisions
Starting Question: "Something broke after a deployment — should I roll back or fix forward?"
Estimated traversal: 2-5 minutes
Domains: deployments, incident-response, release-engineering, SRE
The Tree¶
Something broke after a deployment — roll back or fix forward?
│
└── [Check 1] Is the service completely down or are users actively impacted?
    ├── YES (production down, revenue impact, error rate > 50%)
    │   └── [Check 2] Is rollback safe? (no DB migration ran, stateless service)
    │       ├── YES → ✅ ROLL BACK IMMEDIATELY
    │       │         (kubectl rollout undo / helm rollback / revert pipeline)
    │       └── NO (migration already ran, stateful data written)
    │           ├── [Check 3] Is the fix known, tested, and < 5 lines of code?
    │           │   ├── YES + hotfix deploy < 15 min → ✅ FIX FORWARD (hotfix path)
    │           │   └── NO or untested
    │           │       └── [Check 4] Can a feature flag disable the broken behavior?
    │           │           ├── YES → ✅ FEATURE FLAG OFF + escalate for fix
    │           │           └── NO → ⚠️ ESCALATE — neither path is safe alone
    │           └── [Check 5] Is the blast radius growing (error rate rising)?
    │               ├── YES → ⚠️ ESCALATE IMMEDIATELY + partial traffic split
    │               └── NO (stable, not worsening) → attempt fix forward with a timer
    │
    └── NO (partial degradation, < 10% error rate, workaround exists)
        ├── [Check 6] How long ago was the deployment?
        │   ├── < 30 minutes ago
        │   │   └── [Check 7] Did a DB migration or schema change run?
        │   │       ├── NO → ✅ ROLL BACK (low-risk, recent, clean)
        │   │       └── YES
        │   │           └── Is the migration purely additive (new column, new table)?
        │   │               ├── YES → ✅ ROLL BACK (old code tolerates new schema)
        │   │               └── NO (dropped column, type change) → fix forward only
        │   ├── 30 min – 2 hours ago
        │   │   └── [Check 8] Has state drifted? (writes, caches, queues processed)
        │   │       ├── Minimal drift → rollback still viable, with caution
        │   │       └── Significant drift → ✅ FIX FORWARD preferred
        │   └── > 2 hours ago
        │       └── [Check 9] Is the fix known and low-risk?
        │           ├── YES → ✅ FIX FORWARD (state has drifted, rollback risky)
        │           └── NO → ✅ PARTIAL ROLLBACK with traffic split
        │
        └── [Check 10] Do you have a hotfix deployment path < 15 min?
            ├── YES → ✅ FIX FORWARD via hotfix pipeline
            └── NO → evaluate rollback cost vs waiting for a full deploy cycle
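The "attempt fix forward with a timer" leaf can be made mechanical: timebox the fix attempt and fall back to rollback when the budget expires. A minimal sketch using coreutils `timeout`; the fix and rollback commands are placeholders for your own procedures.

```shell
# Timebox a fix-forward attempt; if it exceeds the budget, the answer
# flips to rollback (the rollback command in the comment is illustrative).
fix_forward_with_timer() {
  local budget_secs=$1; shift
  if timeout "$budget_secs" "$@"; then
    echo "fixed-forward"               # hotfix landed inside the budget
  else
    echo "timer-expired-roll-back"     # e.g. kubectl rollout undo deployment/myapp
  fi
}

# e.g. a 15-minute budget around a hypothetical fix script:
# fix_forward_with_timer 900 ./apply_hotfix.sh
```

The point of the wrapper is that the rollback decision is made before the attempt starts, not renegotiated mid-incident.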
Node Details¶
Check 1: Assess production impact severity¶
Command/method:
# Overall 5xx error ratio over the last 5 minutes
# (promtool needs the server address; adjust the URL for your setup)
kubectl exec -it prometheus-pod -- promtool query instant http://localhost:9090 \
  'sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))'
# Synthetic monitor / uptime check
curl -o /dev/null -sw "%{http_code}\n" https://service.example.com/health
# Check recent logs for error volume
kubectl logs -l app=myservice --since=5m | grep -c "ERROR"
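The thresholds in the tree (error rate > 50% vs < 10%) can be applied mechanically once you have the counts. A small sketch, assuming the error/total counts come from the commands above; the function name and the intermediate "DEGRADED" bucket are illustrative.

```shell
# Classify severity from raw request counts using the tree's thresholds.
classify_severity() {
  local errors=$1 total=$2 pct=0
  [ "$total" -gt 0 ] && pct=$(( errors * 100 / total ))
  if [ "$pct" -ge 50 ]; then
    echo "CRITICAL"    # YES branch of Check 1: proceed to Check 2
  elif [ "$pct" -ge 10 ]; then
    echo "DEGRADED"    # borderline: treat as actively user-impacting
  else
    echo "PARTIAL"     # NO branch of Check 1: proceed to Check 6 / Check 10
  fi
}

classify_severity 620 1000   # prints CRITICAL
classify_severity 40 1000    # prints PARTIAL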
Check 2: Is rollback safe?¶
Command/method:
# Check if a DB migration ran as part of this deploy
kubectl get job -n myapp | grep migrate
kubectl logs job/migrate-v2-3-0 --tail=20
# Check for stateful components that wrote data
kubectl exec -it db-pod -- psql -c "SELECT MAX(created_at) FROM orders;"
# Verify service is stateless (check for local disk writes, session state)
kubectl get deployment myapp -o jsonpath='{.spec.template.spec.volumes}'
Check 3: Is the fix known, tested, and < 5 lines?¶
Command/method: Review the diff in the deploy pipeline or PR. Count the changed lines. Confirm the fix was reproduced locally or in staging.
What you're looking for: A 1-3 line targeted fix (wrong config value, off-by-one, missing null check) with an obvious correct form. Anything that requires fresh reasoning or more than 10 minutes of testing fails this check.
Common pitfall: "It's just a config change" — config changes can have a wide blast radius. Treat YAML/config edits with the same rigor as code.
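"Count changed lines" can be scripted against `git diff --numstat` (tab-separated added/deleted/path). A sketch; the function name is illustrative and the 5-line budget is the one from this check.

```shell
# Succeeds only if the total added+deleted lines on stdin (numstat format)
# stay under the 5-line hotfix budget.
diff_is_hotfix_sized() {
  awk '{ n += $1 + $2 } END { exit (n < 5) ? 0 : 1 }'
}

# Typical usage against the deployed tag (tag name assumed):
#   git diff --numstat v2.3.0..HEAD | diff_is_hotfix_sized && echo "hotfix-sized"
printf '1\t1\tconfig/app.yaml\n' | diff_is_hotfix_sized && echo "hotfix-sized"
```

A size gate is only half the check; the "obvious correct form" half still needs a human.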
Check 4: Feature flag availability¶
Command/method:
# Check if flagging infrastructure exists
curl -s https://flags.internal/api/v1/flags | jq '.[] | select(.key == "feature-xyz")'
# Toggle flag off (LaunchDarkly CLI example)
ld feature-flags update feature-xyz --value false --environment production
# Verify flag is respected by the service
kubectl logs -l app=myservice --since=1m | grep "feature-xyz"
Check 6: Deployment age and state drift¶
Command/method:
# When did the deployment complete?
kubectl rollout history deployment/myapp | tail -3
kubectl get deployment myapp -o jsonpath='{.status.conditions[?(@.type=="Progressing")].lastUpdateTime}'
# How much data has been written since deploy?
kubectl exec -it db-pod -- psql -c \
  "SELECT COUNT(*) FROM orders WHERE created_at > NOW() - INTERVAL '30 minutes';"
# Queue depth / messages processed
kubectl exec -it rabbitmq-pod -- rabbitmqctl list_queues name messages
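The deployment timestamp can be bucketed into the tree's three age bands directly. A sketch, assuming the timestamp comes from a source like the Progressing condition's lastUpdateTime and that GNU date is available; the second parameter exists only to make the function testable.

```shell
# Map a deployment timestamp to the <30min / 30min-2h / >2h bands of Check 6.
age_bucket() {
  local deployed_at=$1
  local now=${2:-$(date -u +%s)}     # current epoch, injectable for testing
  local mins=$(( (now - $(date -u -d "$deployed_at" +%s)) / 60 ))
  if [ "$mins" -lt 30 ]; then
    echo "recent"        # rollback-friendly window
  elif [ "$mins" -le 120 ]; then
    echo "drift-check"   # run Check 8 before deciding
  else
    echo "stale"         # fix forward usually preferred
  fi
}

age_bucket "2024-01-01T12:00:00Z" "$(date -u -d '2024-01-01T12:20:00Z' +%s)"   # recent
```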
Check 8: State drift assessment¶
Command/method:
# Per-table write activity (cumulative counters — compare against a
# pre-deploy baseline to see what changed since the deployment)
kubectl exec -it db-pod -- psql -c \
  "SELECT relname, n_tup_ins, n_tup_upd, n_tup_del
   FROM pg_stat_user_tables ORDER BY n_tup_ins DESC LIMIT 10;"
# Cache invalidation risk
redis-cli --scan --pattern "session:*" | wc -l
# Event log / audit trail entries since deployment
kubectl logs -l app=myservice --since=2h | grep "WRITE\|INSERT\|UPDATE" | wc -l
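The three probes above can be folded into a single minimal/significant verdict. A sketch; the thresholds (1000 rows, 100 queued messages, 500 write-log lines) are illustrative and should be tuned per service.

```shell
# One drift verdict from the three signals gathered above.
drift_verdict() {
  local rows=$1 queued=$2 writes=$3
  if [ "$rows" -gt 1000 ] || [ "$queued" -gt 100 ] || [ "$writes" -gt 500 ]; then
    echo "significant"   # fix forward preferred
  else
    echo "minimal"       # rollback still viable, with caution
  fi
}

drift_verdict 12 0 40     # prints minimal
drift_verdict 5000 3 40   # prints significant
```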
Terminal Actions¶
✅ Action: Roll Back Immediately¶
Do:
# Kubernetes / Helm
kubectl rollout undo deployment/myapp
# OR, with Helm (pick the target revision from 'helm history myapp';
# revision 0 means the immediately previous release)
helm rollback myapp 1 --wait --timeout 5m
# Verify rollback completed
kubectl rollout status deployment/myapp --timeout=5m
kubectl get pods -l app=myapp -w
# Confirm previous image is running
kubectl get deployment myapp -o jsonpath='{.spec.template.spec.containers[0].image}'
# Check error rate dropped
watch -n5 'kubectl logs -l app=myapp --since=1m | grep -c ERROR'
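The watch loop above runs until interrupted; during an incident a bounded check that fails loudly is often more useful. A sketch with the probe command injected, so it can be the kubectl pipeline above, a PromQL query, or a stub; names are illustrative.

```shell
# Give the rollback N attempts to bring the error count under a threshold;
# return non-zero if it never settles, so the on-call knows to escalate.
errors_settled() {
  local threshold=$1 attempts=$2 interval=$3; shift 3
  local i count
  for (( i = 0; i < attempts; i++ )); do
    count=$("$@")                          # probe must print a number
    [ "$count" -le "$threshold" ] && return 0
    sleep "$interval"
  done
  return 1                                 # still erroring after rollback
}

# e.g. 12 probes, 5s apart, tolerating up to 5 errors/minute:
# errors_settled 5 12 5 sh -c 'kubectl logs -l app=myapp --since=1m | grep -c ERROR'
```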
✅ Action: Fix Forward via Hotfix Path¶
Do:
# 1. Create hotfix branch from the deployed tag
git checkout -b hotfix/v2.3.1 tags/v2.3.0
# 2. Apply the minimal fix
# 3. Push through accelerated pipeline (skip non-critical tests)
git push origin hotfix/v2.3.1
# 4. Trigger hotfix deploy (your CI/CD system)
# 5. Monitor deploy progress
kubectl rollout status deployment/myapp --timeout=15m
# 6. Verify fix resolves the issue
curl -s https://service.example.com/health | jq .
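Step 1's tag bump (v2.3.0 to v2.3.1) is easy to get wrong under pressure; it can be derived mechanically. A sketch, assuming a vX.Y.Z tag scheme; the function name is illustrative.

```shell
# Derive the next hotfix tag by incrementing the patch component.
next_hotfix_tag() {
  local tag=$1
  local base=${tag%.*}      # e.g. v2.3
  local patch=${tag##*.}    # e.g. 0
  echo "${base}.$(( patch + 1 ))"
}

next_hotfix_tag v2.3.0   # prints v2.3.1
```

Usage in step 1 would then be: git checkout -b "hotfix/$(next_hotfix_tag v2.3.0)" tags/v2.3.0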
✅ Action: Feature Flag Off¶
Do:
# 1. Disable the flag for all users
ld feature-flags update broken-feature --value false --environment production
# 2. Verify flag change propagated (SDK poll interval is usually 30s)
sleep 35
curl -s https://service.example.com/api/test-endpoint
# 3. Confirm error rate dropping
kubectl logs -l app=myapp --since=2m | grep -c ERROR
# 4. File a bug and schedule fix in next deploy
✅ Action: Partial Rollback with Traffic Split¶
Do:
# Route 10% traffic to old version, 90% to new
kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: myapp
spec:
  hosts:
  - myapp
  http:
  - route:
    - destination:
        host: myapp-v2
      weight: 90
    - destination:
        host: myapp-v1
      weight: 10
EOF
# Monitor error rates on both versions
kubectl logs -l app=myapp,version=v1 --since=2m | grep -c ERROR
kubectl logs -l app=myapp,version=v2 --since=2m | grep -c ERROR
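Comparing the two counts and deciding which way to move the weights can also be scripted. A sketch; the 2x ratio guardrail is illustrative, not a standard, and the function name is made up.

```shell
# Decide which version the Istio weights should shift toward, based on
# the per-version error counts gathered above.
pick_stable_version() {
  local v1_errors=$1 v2_errors=$2
  if [ "$v2_errors" -gt $(( 2 * v1_errors )) ]; then
    echo "shift-to-v1"   # new version is markedly worse
  elif [ "$v1_errors" -gt $(( 2 * v2_errors )) ]; then
    echo "shift-to-v2"   # old version is markedly worse
  else
    echo "hold"          # comparable: keep the split, keep watching
  fi
}

pick_stable_version 3 40   # prints shift-to-v1
```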
⚠️ Warning: Escalate — Neither Option Is Safe¶
When: A migration ran AND it is not backward-compatible AND the fix is unknown or untested AND the blast radius is growing.
Risk: Rolling back corrupts data (old code writes to the new schema). Fixing forward deploys untested code to an already-broken system.
Mitigation:
1. Stop all writes if possible (circuit breaker, read-only mode)
2. Page a senior engineer and/or DBA immediately
3. Capture a DB snapshot before any further action
4. Do not act until there is a consensus plan
Edge Cases¶
- Multi-service deployment: If the broken deploy involved coordinated changes to multiple services (e.g., API + consumer), rolling back one without the other may create a worse state. Check deployment coordination before rolling back any single service.
- Canary already at 100%: If your canary phase completed before the issue surfaced, "rollback" is a full redeployment of the previous version — factor the full deploy time into your decision.
- Shared library update: If the deployment updated a shared library used by 10 services, a rollback of one service is insufficient. You need to coordinate the rollback across all consumers.
- Third-party dependency change: If the deployment included a vendor API version bump, "rollback" in your service may not revert the vendor-side behavior change.
- Stateless but with external side effects: A "stateless" service that sent emails, charged cards, or provisioned cloud resources on broken code paths has already created real-world effects that rollback cannot undo.
Cross-References¶
- Topic Packs: Deployments, incident-response
- Runbooks: rollback-procedure.md, hotfix-deploy.md
- Related trees: should-i-page.md, config-change.md