
Decision Tree: Roll Back or Fix Forward?

Category: Operational Decisions
Starting Question: "Something broke after a deployment — should I roll back or fix forward?"
Estimated traversal: 2-5 minutes
Domains: deployments, incident-response, release-engineering, SRE


The Tree

Something broke after a deployment → roll back or fix forward?
└── [Check 1] Is the service completely down or are users actively impacted?
    ├── YES (production down, revenue impact, error rate > 50%)
    │   └── [Check 2] Is rollback safe? (no DB migration ran, stateless service)
    │       ├── YES → ROLL BACK IMMEDIATELY
    │       │         (kubectl rollout undo / helm rollback / revert pipeline)
    │       └── NO (migration already ran, stateful data written)
    │           ├── [Check 3] Is the fix known, tested, and < 5 lines of code?
    │           │   ├── YES + hotfix deploy < 15 min → FIX FORWARD (hotfix path)
    │           │   └── NO or untested
    │           │       └── [Check 4] Can a feature flag disable the broken behavior?
    │           │           ├── YES → FEATURE FLAG OFF + escalate for fix
    │           │           └── NO → ⚠️ ESCALATE (neither path is safe alone)
    │           └── [Check 5] Is blast radius growing (error rate rising)?
    │               ├── YES → ⚠️ ESCALATE IMMEDIATELY + partial traffic split
    │               └── NO (stable, not worsening) → attempt fix forward with a timer
    └── NO (partial degradation, < 10% error rate, workaround exists)
        └── [Check 6] How long ago was the deployment?
            ├── < 30 minutes ago
            │   └── [Check 7] Did any DB migration or schema change run?
            │       ├── NO → ROLL BACK (low-risk, recent, clean)
            │       └── YES (a migration ran)
            │           └── Is the migration purely additive (new column, new table)?
            │               ├── YES → ROLL BACK (old code tolerates new schema)
            │               └── NO (dropped column, type change) → fix forward only
            ├── 30 min – 2 hours ago
            │   └── [Check 8] Has state drifted? (writes, caches, queues processed)
            │       ├── Minimal drift → rollback still viable, with caution
            │       └── Significant drift → FIX FORWARD preferred
            └── > 2 hours ago
                ├── [Check 9] Is the fix known and low-risk?
                │   ├── YES → FIX FORWARD (state has drifted, rollback risky)
                │   └── NO → PARTIAL ROLLBACK with traffic split
                └── [Check 10] Do you have a hotfix deployment path < 15 min?
                    ├── YES → FIX FORWARD via hotfix pipeline
                    └── NO → evaluate rollback cost vs waiting for a full deploy cycle
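The top branch of the tree can be condensed into a small helper for runbook automation. This is a minimal sketch, assuming the error rate is sampled as an integer percentage; the output labels are illustrative, not a standard:

```shell
# Sketch of the tree's first branch: Check 1 (impact) then Check 2 (rollback safety).
# The 50% threshold comes from Check 1 above; migration_ran is "yes" or "no".
decide() {
  local error_rate=$1 migration_ran=$2
  if [ "$error_rate" -gt 50 ]; then
    if [ "$migration_ran" = "no" ]; then
      echo "ROLL BACK IMMEDIATELY"
    else
      echo "evaluate fix forward / feature flag (Checks 3-5)"
    fi
  else
    echo "check deployment age (Check 6)"
  fi
}
```

A helper like this is only as good as its inputs: feed it the measured error rate from Check 1, not a guess.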

Node Details

Check 1: Assess production impact severity

Command/method:

# Error rate over last 5 minutes
kubectl exec -it prometheus-pod -- promtool query instant http://localhost:9090 \
  'rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])'

# Synthetic monitor / uptime check
curl -o /dev/null -sw "%{http_code}\n" https://service.example.com/health

# Check recent logs for error volume
kubectl logs -l app=myservice --since=5m | grep -c "ERROR"
What you're looking for: Error rate > 50% = complete outage. Any P1 alert firing = treat as down.

Common pitfall: A partial outage in one region can look like a total outage on aggregate dashboards. Check per-region metrics before declaring a P1.
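The thresholds above can be folded into a severity classifier. A minimal sketch; the 10-50% "investigate" band is an assumption of this example, since the tree only defines the > 50% and < 10% cases:

```shell
# Classify severity from an error-rate sample (integer percent, 0-100).
# > 50% = complete outage (Check 1 YES), < 10% = partial-degradation branch.
classify_severity() {
  local rate=$1
  if [ "$rate" -gt 50 ]; then
    echo "P1-rollback-path"
  elif [ "$rate" -ge 10 ]; then
    echo "P2-investigate"      # assumed middle band; tune to your SLOs
  else
    echo "P3-partial-path"
  fi
}

# Usage: classify_severity "$(compute_error_rate_percent)"
```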

Check 2: Is rollback safe?

Command/method:

# Check if a DB migration ran as part of this deploy
kubectl get job -n myapp | grep migrate
kubectl logs job/migrate-v2-3-0 --tail=20

# Check for stateful components that wrote data
kubectl exec -it db-pod -- psql -c "SELECT MAX(created_at) FROM orders;"

# Verify service is stateless (check for local disk writes, session state)
kubectl get deployment myapp -o jsonpath='{.spec.template.spec.volumes}'
What you're looking for: No migration job = rollback is likely safe. If a migration did run, check whether it was additive only (new table or column) or destructive (drop, type change, rename).

Common pitfall: Migrations that add NOT NULL columns without defaults are safe to roll back schema-wise but may cause insert failures in old code. Always read the migration SQL before deciding.
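The additive-vs-destructive triage can be partially automated with a keyword scan of the migration file. A sketch only: the pattern list is a starting point, not an exhaustive classifier, and the example path is a placeholder:

```shell
# Return success (0) if the migration file contains destructive DDL.
# Covers the cases named above: drops, type changes, renames.
is_destructive() {
  grep -Eiq 'DROP (TABLE|COLUMN)|ALTER COLUMN [^;]* TYPE|RENAME (TO|COLUMN)' "$1"
}

# Usage (path is an example):
#   is_destructive db/migrations/20240101_orders.sql && echo "fix forward only"
```

A hit means fix forward only; no hit is a rollback candidate, but still read the SQL by hand before acting.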

Check 3: Is the fix known, tested, and < 5 lines?

Command/method: Review the diff in the deploy pipeline or PR. Count changed lines. Confirm the fix was reproduced locally or in staging.

What you're looking for: A 1-3 line targeted fix (wrong config value, off-by-one, missing null check) with an obvious correct form. Anything requiring fresh reasoning or more than 10 minutes of testing fails this check.

Common pitfall: "It's just a config change" — config changes can have a wide blast radius. Treat YAML/config edits with the same rigor as code.
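Counting changed lines can be scripted from `git diff --numstat` output. A small sketch; the branch names in the usage comment are examples:

```shell
# Sum added + deleted lines from `git diff --numstat` piped on stdin.
# (numstat prints "added<TAB>deleted<TAB>path" per file; "-" for binaries sums as 0)
diff_lines() { awk '{ n += $1 + $2 } END { print n + 0 }'; }

# Usage: git diff --numstat main...hotfix/v2.3.1 | diff_lines
# Compare the result against the "< 5 lines" bar from this check.
```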

Check 4: Feature flag availability

Command/method:

# Check if flagging infrastructure exists
curl -s https://flags.internal/api/v1/flags | jq '.[] | select(.key == "feature-xyz")'

# Toggle flag off (LaunchDarkly CLI example)
ld feature-flags update feature-xyz --value false --environment production

# Verify flag is respected by the service
kubectl logs -l app=myservice --since=1m | grep "feature-xyz"
What you're looking for: Flag infrastructure deployed and the specific feature gated. Confirm the flag is respected in code (grep the codebase for the flag key).

Common pitfall: A flag exists, but the code path that is broken is not behind the flag — the broken behavior is a side effect. Verify the flag actually disables the breaking code path.

Check 6: Deployment age and state drift

Command/method:

# When did the deployment complete?
kubectl rollout history deployment/myapp | tail -3
# The newest ReplicaSet's creation time marks the deployment event
kubectl get rs -l app=myapp --sort-by=.metadata.creationTimestamp

# How much data has been written since deploy?
kubectl exec -it db-pod -- psql -c \
  "SELECT COUNT(*) FROM orders WHERE created_at > NOW() - INTERVAL '30 minutes';"

# Queue depth / messages processed
kubectl exec -it rabbitmq-pod -- rabbitmqctl list_queues name messages
What you're looking for: < 30 min with low write volume = rollback viable. > 2 hours or high write volume = state has drifted and rollback may corrupt or lose data.

Common pitfall: Clock drift between nodes means "30 minutes ago" may not be reliable. Use deployment event timestamps from Kubernetes, not wall clock.
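To follow the pitfall's advice, compute deployment age from a Kubernetes timestamp rather than eyeballing the wall clock. A sketch assuming GNU `date`; the `kubectl` query in the usage comment is one way to fetch the newest ReplicaSet's creation time:

```shell
# Minutes elapsed since an ISO-8601 timestamp (as emitted by kubectl).
minutes_since() {
  local then now
  then=$(date -d "$1" +%s)   # GNU date; on BSD/macOS use: date -j -f ...
  now=$(date -u +%s)
  echo $(( (now - then) / 60 ))
}

# Usage (fetches the newest ReplicaSet's creationTimestamp):
#   ts=$(kubectl get rs -l app=myapp --sort-by=.metadata.creationTimestamp \
#        -o jsonpath='{.items[-1:].metadata.creationTimestamp}')
#   minutes_since "$ts"
```

Feed the result into Check 6's < 30 min / 30 min-2 h / > 2 h buckets.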

Check 8: State drift assessment

Command/method:

# Write activity per table (cumulative since stats reset; snapshot before
# and after the incident window and compare)
kubectl exec -it db-pod -- psql -c \
  "SELECT relname, n_tup_ins + n_tup_upd + n_tup_del AS writes
   FROM pg_stat_user_tables ORDER BY writes DESC LIMIT 10;"

# Cache invalidation risk
redis-cli --scan --pattern "session:*" | wc -l

# Event log / audit trail entries since deployment
kubectl logs -l app=myservice --since=2h | grep "WRITE\|INSERT\|UPDATE" | wc -l
What you're looking for: Low write activity (< 100 rows, < 1000 cache keys) = drift manageable. High activity = rollback creates data inconsistency.

Common pitfall: Read-heavy services can still have state drift through cache population. A rolled-back service may serve stale cache that the new code would have invalidated.
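The two signals above combine into a simple drift gate. A sketch using the thresholds stated in this check (< 100 rows, < 1000 cache keys); tune both per service:

```shell
# Classify drift from rows written since deploy and session cache keys.
drift_level() {
  local rows=$1 keys=$2
  if [ "$rows" -lt 100 ] && [ "$keys" -lt 1000 ]; then
    echo "minimal"       # rollback still viable, with caution
  else
    echo "significant"   # FIX FORWARD preferred
  fi
}

# Usage: drift_level "$rows_written" "$(redis-cli --scan --pattern 'session:*' | wc -l)"
```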


Terminal Actions

✅ Action: Roll Back Immediately

Do:

# Kubernetes / Helm
kubectl rollout undo deployment/myapp
# OR
helm rollback myapp --wait --timeout 5m   # no revision = previous release

# Verify rollback completed
kubectl rollout status deployment/myapp --timeout=5m
kubectl get pods -l app=myapp -w

# Confirm previous image is running
kubectl get deployment myapp -o jsonpath='{.spec.template.spec.containers[0].image}'

# Check error rate dropped
watch -n5 'kubectl logs -l app=myapp --since=1m | grep -c ERROR'
Verify: Health check passes, error rate < 1%, on-call alert resolves within 2 minutes of rollback. Runbook: rollback-procedure.md
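The "health check passes within 2 minutes" verification can be scripted instead of watched by hand. A sketch; the URL and time budget are placeholders:

```shell
# Poll a health endpoint until it returns 200 or the budget (seconds) expires.
# Returns 0 on healthy, 1 on timeout.
wait_healthy() {
  local url=$1 budget=${2:-120} interval=${3:-5}
  local deadline=$(( $(date +%s) + budget ))
  while [ "$(date +%s)" -lt "$deadline" ]; do
    code=$(curl --max-time 2 -o /dev/null -sw '%{http_code}' "$url")
    [ "$code" = "200" ] && return 0
    sleep "$interval"
  done
  return 1
}

# Usage: wait_healthy https://service.example.com/health 120 || page_oncall
```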

✅ Action: Fix Forward via Hotfix Path

Do:

# 1. Create hotfix branch from the deployed tag
git checkout -b hotfix/v2.3.1 tags/v2.3.0

# 2. Apply the minimal fix
# 3. Push through accelerated pipeline (skip non-critical tests)
git push origin hotfix/v2.3.1

# 4. Trigger hotfix deploy (your CI/CD system)
# 5. Monitor deploy progress
kubectl rollout status deployment/myapp --timeout=15m

# 6. Verify fix resolves the issue
curl -s https://service.example.com/health | jq .
Verify: Symptom resolved, no new errors introduced, error rate returns to baseline. Runbook: hotfix-deploy.md

✅ Action: Feature Flag Off

Do:

# 1. Disable the flag for all users
ld feature-flags update broken-feature --value false --environment production

# 2. Verify flag change propagated (SDK poll interval is usually 30s)
sleep 35
curl -s https://service.example.com/api/test-endpoint

# 3. Confirm error rate dropping
kubectl logs -l app=myapp --since=2m | grep -c ERROR

# 4. File a bug and schedule fix in next deploy
Verify: Error rate drops within 60 seconds of flag disable (SDK propagation time). Runbook: feature-flags.md

✅ Action: Partial Rollback with Traffic Split

Do:

# Route 10% traffic to old version, 90% to new
kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: myapp
spec:
  hosts:
  - myapp            # required: the host clients address; v1/v2 below are versioned Services
  http:
  - route:
    - destination:
        host: myapp-v2
      weight: 90
    - destination:
        host: myapp-v1
      weight: 10
EOF

# Monitor error rates on both versions
kubectl logs -l app=myapp,version=v1 --since=2m | grep -c ERROR
kubectl logs -l app=myapp,version=v2 --since=2m | grep -c ERROR
Verify: Reduced error rate confirms old version is healthier. Shift traffic further toward old version if confirmed. Runbook: traffic-split.md

⚠️ Warning: Escalate — Neither Option Is Safe

When: Migration ran AND is not backward-compatible AND fix is unknown or untested AND blast radius is growing.

Risk: Rolling back corrupts data (old code writes to new schema). Fixing forward deploys untested code to an already-broken system.

Mitigation:

  1. Stop all writes if possible (circuit breaker, read-only mode)
  2. Page senior engineer and/or DBA immediately
  3. Capture a DB snapshot before any further action
  4. Do not act until there is a consensus plan
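Mitigation step 1 ("stop all writes") can be made concrete for Postgres. A sketch only: `db-pod` and `myapp_db` are placeholders, and the setting affects new transactions; in-flight sessions keep their old setting until terminated:

```shell
# Make new transactions read-only by default for this database (Postgres).
kubectl exec -it db-pod -- psql -c \
  "ALTER DATABASE myapp_db SET default_transaction_read_only = on;"

# Existing connections are unaffected until they reconnect; force them off:
kubectl exec -it db-pod -- psql -c \
  "SELECT pg_terminate_backend(pid) FROM pg_stat_activity
   WHERE datname = 'myapp_db' AND pid <> pg_backend_pid();"
```

Reverse with `ALTER DATABASE myapp_db RESET default_transaction_read_only;` once the consensus plan is agreed.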


Edge Cases

  • Multi-service deployment: If the broken deploy involved coordinated changes to multiple services (e.g., API + consumer), rolling back one without the other may create a worse state. Check deployment coordination before rolling back any single service.
  • Canary already at 100%: If your canary phase completed before the issue surfaced, "rollback" is a full redeployment of the previous version — factor the full deploy time into your decision.
  • Shared library update: If the deployment updated a shared library used by 10 services, a rollback of one service is insufficient. You need to coordinate the rollback across all consumers.
  • Third-party dependency change: If the deployment included a vendor API version bump, "rollback" in your service may not revert the vendor-side behavior change.
  • Stateless but with external side effects: A "stateless" service that sent emails, charged cards, or provisioned cloud resources on broken code paths has already created real-world effects that rollback cannot undo.

Cross-References