
Decision Tree: Roll Back or Fix Forward?

Category: Operational Decisions
Starting Question: "Something broke after a deployment — should I roll back or fix forward?"
Estimated traversal: 2-5 minutes
Domains: deployments, incident-response, release-engineering, SRE


The Tree

Something broke after a deployment → roll back or fix forward?
└── [Check 1] Is the service completely down or are users actively impacted?
    ├── YES (production down, revenue impact, error rate > 50%)
    │   └── [Check 2] Is rollback safe? (no DB migration ran, stateless service)
    │       ├── YES → ROLL BACK IMMEDIATELY
    │       │         (kubectl rollout undo / helm rollback / revert pipeline)
    │       └── NO (migration already ran, stateful data written)
    │           ├── [Check 3] Is the fix known, tested, and < 5 lines of code?
    │           │   ├── YES + hotfix deploy < 15 min → FIX FORWARD (hotfix path)
    │           │   └── NO or untested
    │           │       └── [Check 4] Can a feature flag disable the broken behavior?
    │           │           ├── YES → FEATURE FLAG OFF + escalate for fix
    │           │           └── NO → ⚠️ ESCALATE (neither path is safe alone)
    │           └── [Check 5] Is blast radius growing (error rate rising)?
    │               ├── YES → ⚠️ ESCALATE IMMEDIATELY + partial traffic split
    │               └── NO (stable, not worsening) → attempt fix forward with a timer
    └── NO (partial degradation, < 10% error rate, workaround exists)
        └── [Check 6] How long ago was the deployment?
            ├── < 30 minutes ago
            │   └── [Check 7] Did any DB migration or schema change run?
            │       ├── NO → ROLL BACK (low-risk, recent, clean)
            │       └── YES (a migration ran)
            │           └── Is the migration purely additive (new column, new table)?
            │               ├── YES → ROLL BACK (old code tolerates new schema)
            │               └── NO (dropped column, type change) → fix forward only
            ├── 30 min – 2 hours ago
            │   └── [Check 8] Has state drifted? (writes, caches, queues processed)
            │       ├── Minimal drift → rollback still viable, with caution
            │       └── Significant drift → FIX FORWARD preferred
            └── > 2 hours ago
                ├── [Check 9] Is the fix known and low-risk?
                │   ├── YES → FIX FORWARD (state has drifted, rollback risky)
                │   └── NO → PARTIAL ROLLBACK with traffic split
                └── [Check 10] Do you have a hotfix deployment path < 15 min?
                    ├── YES → FIX FORWARD via hotfix pipeline
                    └── NO → evaluate rollback cost vs waiting for a full deploy cycle
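The top branch of the tree can be condensed into a small helper for runbook automation. This is a minimal sketch, assuming the error rate is sampled as an integer percentage; the output labels are illustrative, not a standard:

```shell
# Sketch of the tree's first branch: Check 1 (impact) then Check 2 (rollback safety).
# The 50% threshold comes from Check 1 above; migration_ran is "yes" or "no".
decide() {
  local error_rate=$1 migration_ran=$2
  if [ "$error_rate" -gt 50 ]; then
    if [ "$migration_ran" = "no" ]; then
      echo "ROLL BACK IMMEDIATELY"
    else
      echo "evaluate fix forward / feature flag (Checks 3-5)"
    fi
  else
    echo "check deployment age (Check 6)"
  fi
}
```

A helper like this is only as good as its inputs: feed it the measured error rate from Check 1, not a guess.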

Node Details

Check 1: Assess production impact severity

Command/method:

# Error rate over last 5 minutes
kubectl exec -it prometheus-pod -- promtool query instant http://localhost:9090 \
  'rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])'

# Synthetic monitor / uptime check
curl -o /dev/null -sw "%{http_code}\n" https://service.example.com/health

# Check recent logs for error volume
kubectl logs -l app=myservice --since=5m | grep -c "ERROR"
What you're looking for: Error rate > 50% = complete outage. Any P1 alert firing = treat as down.

Common pitfall: A partial outage in one region can look like a total outage on aggregate dashboards. Check per-region metrics before declaring a P1.
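The thresholds above can be folded into a severity classifier. A minimal sketch; the 10-50% "investigate" band is an assumption of this example, since the tree only defines the > 50% and < 10% cases:

```shell
# Classify severity from an error-rate sample (integer percent, 0-100).
# > 50% = complete outage (Check 1 YES), < 10% = partial-degradation branch.
classify_severity() {
  local rate=$1
  if [ "$rate" -gt 50 ]; then
    echo "P1-rollback-path"
  elif [ "$rate" -ge 10 ]; then
    echo "P2-investigate"      # assumed middle band; tune to your SLOs
  else
    echo "P3-partial-path"
  fi
}

# Usage: classify_severity "$(compute_error_rate_percent)"
```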

Check 2: Is rollback safe?

Command/method:

# Check if a DB migration ran as part of this deploy
kubectl get job -n myapp | grep migrate
kubectl logs job/migrate-v2-3-0 --tail=20

# Check for stateful components that wrote data
kubectl exec -it db-pod -- psql -c "SELECT MAX(created_at) FROM orders;"

# Verify service is stateless (check for local disk writes, session state)
kubectl get deployment myapp -o jsonpath='{.spec.template.spec.volumes}'
What you're looking for: No migration job = rollback is likely safe. If a migration did run, check whether it was additive only (new table or column) or destructive (drop, type change, rename).

Common pitfall: Migrations that add NOT NULL columns without defaults are safe to roll back schema-wise but may cause insert failures in old code. Always read the migration SQL before deciding.
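The additive-vs-destructive triage can be partially automated with a keyword scan of the migration file. A sketch only: the pattern list is a starting point, not an exhaustive classifier, and the example path is a placeholder:

```shell
# Return success (0) if the migration file contains destructive DDL.
# Covers the cases named above: drops, type changes, renames.
is_destructive() {
  grep -Eiq 'DROP (TABLE|COLUMN)|ALTER COLUMN [^;]* TYPE|RENAME (TO|COLUMN)' "$1"
}

# Usage (path is an example):
#   is_destructive db/migrations/20240101_orders.sql && echo "fix forward only"
```

A hit means fix forward only; no hit is a rollback candidate, but still read the SQL by hand before acting.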

Check 3: Is the fix known, tested, and < 5 lines?

Command/method: Review the diff in the deploy pipeline or PR. Count changed lines. Confirm the fix was reproduced locally or in staging.

What you're looking for: A 1-3 line targeted fix (wrong config value, off-by-one, missing null check) with an obvious correct form. Anything requiring fresh reasoning or more than 10 minutes of testing fails this check.

Common pitfall: "It's just a config change" — config changes can have a wide blast radius. Treat YAML/config edits with the same rigor as code.
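Counting changed lines can be scripted from `git diff --numstat` output. A small sketch; the branch names in the usage comment are examples:

```shell
# Sum added + deleted lines from `git diff --numstat` piped on stdin.
# (numstat prints "added<TAB>deleted<TAB>path" per file; "-" for binaries sums as 0)
diff_lines() { awk '{ n += $1 + $2 } END { print n + 0 }'; }

# Usage: git diff --numstat main...hotfix/v2.3.1 | diff_lines
# Compare the result against the "< 5 lines" bar from this check.
```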

Check 4: Feature flag availability

Command/method:

# Check if flagging infrastructure exists
curl -s https://flags.internal/api/v1/flags | jq '.[] | select(.key == "feature-xyz")'

# Toggle flag off (LaunchDarkly CLI example)
ld feature-flags update feature-xyz --value false --environment production

# Verify flag is respected by the service
kubectl logs -l app=myservice --since=1m | grep "feature-xyz"
What you're looking for: Flag infrastructure deployed and the specific feature gated. Confirm the flag is respected in code (grep the codebase for the flag key).

Common pitfall: A flag exists, but the code path that is broken is not behind the flag — the broken behavior is a side effect. Verify the flag actually disables the breaking code path.

Check 6: Deployment age and state drift

Command/method:

# When did the deployment complete?
kubectl rollout history deployment/myapp | tail -3
# The newest ReplicaSet's creation time marks the deployment event
kubectl get rs -l app=myapp --sort-by=.metadata.creationTimestamp

# How much data has been written since deploy?
kubectl exec -it db-pod -- psql -c \
  "SELECT COUNT(*) FROM orders WHERE created_at > NOW() - INTERVAL '30 minutes';"

# Queue depth / messages processed
kubectl exec -it rabbitmq-pod -- rabbitmqctl list_queues name messages
What you're looking for: < 30 min with low write volume = rollback viable. > 2 hours or high write volume = state has drifted and rollback may corrupt or lose data.

Common pitfall: Clock drift between nodes means "30 minutes ago" may not be reliable. Use deployment event timestamps from Kubernetes, not wall clock.
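To follow the pitfall's advice, compute deployment age from a Kubernetes timestamp rather than eyeballing the wall clock. A sketch assuming GNU `date`; the `kubectl` query in the usage comment is one way to fetch the newest ReplicaSet's creation time:

```shell
# Minutes elapsed since an ISO-8601 timestamp (as emitted by kubectl).
minutes_since() {
  local then now
  then=$(date -d "$1" +%s)   # GNU date; on BSD/macOS use: date -j -f ...
  now=$(date -u +%s)
  echo $(( (now - then) / 60 ))
}

# Usage (fetches the newest ReplicaSet's creationTimestamp):
#   ts=$(kubectl get rs -l app=myapp --sort-by=.metadata.creationTimestamp \
#        -o jsonpath='{.items[-1:].metadata.creationTimestamp}')
#   minutes_since "$ts"
```

Feed the result into Check 6's < 30 min / 30 min-2 h / > 2 h buckets.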

Check 8: State drift assessment

Command/method:

# Write activity per table (cumulative since stats reset; snapshot before
# and after the incident window and compare)
kubectl exec -it db-pod -- psql -c \
  "SELECT relname, n_tup_ins + n_tup_upd + n_tup_del AS writes
   FROM pg_stat_user_tables ORDER BY writes DESC LIMIT 10;"

# Cache invalidation risk
redis-cli --scan --pattern "session:*" | wc -l

# Event log / audit trail entries since deployment
kubectl logs -l app=myservice --since=2h | grep "WRITE\|INSERT\|UPDATE" | wc -l
What you're looking for: Low write activity (< 100 rows, < 1000 cache keys) = drift manageable. High activity = rollback creates data inconsistency.

Common pitfall: Read-heavy services can still have state drift through cache population. A rolled-back service may serve stale cache that the new code would have invalidated.
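The two signals above combine into a simple drift gate. A sketch using the thresholds stated in this check (< 100 rows, < 1000 cache keys); tune both per service:

```shell
# Classify drift from rows written since deploy and session cache keys.
drift_level() {
  local rows=$1 keys=$2
  if [ "$rows" -lt 100 ] && [ "$keys" -lt 1000 ]; then
    echo "minimal"       # rollback still viable, with caution
  else
    echo "significant"   # FIX FORWARD preferred
  fi
}

# Usage: drift_level "$rows_written" "$(redis-cli --scan --pattern 'session:*' | wc -l)"
```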


Terminal Actions

✅ Action: Roll Back Immediately

Do:

# Kubernetes / Helm
kubectl rollout undo deployment/myapp
# OR
helm rollback myapp --wait --timeout 5m   # no revision = previous release

# Verify rollback completed
kubectl rollout status deployment/myapp --timeout=5m
kubectl get pods -l app=myapp -w

# Confirm previous image is running
kubectl get deployment myapp -o jsonpath='{.spec.template.spec.containers[0].image}'

# Check error rate dropped
watch -n5 'kubectl logs -l app=myapp --since=1m | grep -c ERROR'
Verify: Health check passes, error rate < 1%, on-call alert resolves within 2 minutes of rollback. Runbook: rollback-procedure.md
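The "health check passes within 2 minutes" verification can be scripted instead of watched by hand. A sketch; the URL and time budget are placeholders:

```shell
# Poll a health endpoint until it returns 200 or the budget (seconds) expires.
# Returns 0 on healthy, 1 on timeout.
wait_healthy() {
  local url=$1 budget=${2:-120} interval=${3:-5}
  local deadline=$(( $(date +%s) + budget ))
  while [ "$(date +%s)" -lt "$deadline" ]; do
    code=$(curl --max-time 2 -o /dev/null -sw '%{http_code}' "$url")
    [ "$code" = "200" ] && return 0
    sleep "$interval"
  done
  return 1
}

# Usage: wait_healthy https://service.example.com/health 120 || page_oncall
```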

✅ Action: Fix Forward via Hotfix Path

Do:

# 1. Create hotfix branch from the deployed tag
git checkout -b hotfix/v2.3.1 tags/v2.3.0

# 2. Apply the minimal fix
# 3. Push through accelerated pipeline (skip non-critical tests)
git push origin hotfix/v2.3.1

# 4. Trigger hotfix deploy (your CI/CD system)
# 5. Monitor deploy progress
kubectl rollout status deployment/myapp --timeout=15m

# 6. Verify fix resolves the issue
curl -s https://service.example.com/health | jq .
Verify: Symptom resolved, no new errors introduced, error rate returns to baseline. Runbook: hotfix-deploy.md

✅ Action: Feature Flag Off

Do:

# 1. Disable the flag for all users
ld feature-flags update broken-feature --value false --environment production

# 2. Verify flag change propagated (SDK poll interval is usually 30s)
sleep 35
curl -s https://service.example.com/api/test-endpoint

# 3. Confirm error rate dropping
kubectl logs -l app=myapp --since=2m | grep -c ERROR

# 4. File a bug and schedule fix in next deploy
Verify: Error rate drops within 60 seconds of flag disable (SDK propagation time). Runbook: feature-flags.md

✅ Action: Partial Rollback with Traffic Split

Do:

# Route 10% traffic to old version, 90% to new
kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: myapp
spec:
  hosts:
  - myapp            # required: the host clients address; v1/v2 below are versioned Services
  http:
  - route:
    - destination:
        host: myapp-v2
      weight: 90
    - destination:
        host: myapp-v1
      weight: 10
EOF

# Monitor error rates on both versions
kubectl logs -l app=myapp,version=v1 --since=2m | grep -c ERROR
kubectl logs -l app=myapp,version=v2 --since=2m | grep -c ERROR
Verify: Reduced error rate confirms old version is healthier. Shift traffic further toward old version if confirmed. Runbook: traffic-split.md

⚠️ Warning: Escalate — Neither Option Is Safe

When: Migration ran AND is not backward-compatible AND fix is unknown or untested AND blast radius is growing.

Risk: Rolling back corrupts data (old code writes to new schema). Fixing forward deploys untested code to an already-broken system.

Mitigation:

  1. Stop all writes if possible (circuit breaker, read-only mode)
  2. Page senior engineer and/or DBA immediately
  3. Capture a DB snapshot before any further action
  4. Do not act until there is a consensus plan
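Mitigation step 1 ("stop all writes") can be made concrete for Postgres. A sketch only: `db-pod` and `myapp_db` are placeholders, and the setting affects new transactions; in-flight sessions keep their old setting until terminated:

```shell
# Make new transactions read-only by default for this database (Postgres).
kubectl exec -it db-pod -- psql -c \
  "ALTER DATABASE myapp_db SET default_transaction_read_only = on;"

# Existing connections are unaffected until they reconnect; force them off:
kubectl exec -it db-pod -- psql -c \
  "SELECT pg_terminate_backend(pid) FROM pg_stat_activity
   WHERE datname = 'myapp_db' AND pid <> pg_backend_pid();"
```

Reverse with `ALTER DATABASE myapp_db RESET default_transaction_read_only;` once the consensus plan is agreed.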


Edge Cases

  • Multi-service deployment: If the broken deploy involved coordinated changes to multiple services (e.g., API + consumer), rolling back one without the other may create a worse state. Check deployment coordination before rolling back any single service.
  • Canary already at 100%: If your canary phase completed before the issue surfaced, "rollback" is a full redeployment of the previous version — factor the full deploy time into your decision.
  • Shared library update: If the deployment updated a shared library used by 10 services, a rollback of one service is insufficient. You need to coordinate the rollback across all consumers.
  • Third-party dependency change: If the deployment included a vendor API version bump, "rollback" in your service may not revert the vendor-side behavior change.
  • Stateless but with external side effects: A "stateless" service that sent emails, charged cards, or provisioned cloud resources on broken code paths has already created real-world effects that rollback cannot undo.

Cross-References