Change Management - Street-Level Ops¶
Quick Diagnosis Commands¶
# ── What changed recently? ──
# Git (application code)
git log --oneline --since="2 hours ago"
git diff HEAD~3..HEAD --stat
# Kubernetes
kubectl get events --sort-by=.metadata.creationTimestamp -A | tail -30
kubectl rollout history deployment/myapp -n production
# System packages
rpm -qa --last | head -20 # RHEL/CentOS
grep " install " /var/log/dpkg.log | tail -20 # Debian/Ubuntu
journalctl -u packagekit --since "1 day ago"
# Configuration files (if using etckeeper)
cd /etc && git log --oneline -10
# Infrastructure (Terraform)
terraform show -json | jq '.values.root_module.resources | length'
# DNS
dig +short myapp.example.com # Current resolution
# Compare with expected value
# SSL/TLS certificates
echo | openssl s_client -connect api.example.com:443 2>/dev/null | \
openssl x509 -noout -dates
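The `notAfter` line from `openssl x509 -dates` is easier to act on as a number of days. A minimal sketch of the conversion, assuming GNU `date` (Linux) and using a hardcoded example value in place of live openssl output:

```shell
# Turn an openssl "notAfter=" line into days-until-expiry (GNU date assumed).
# The not_after value is a hardcoded example, not live certificate data.
not_after="notAfter=Jan  1 00:00:00 2030 GMT"
expiry_epoch=$(date -d "${not_after#notAfter=}" +%s)  # strip the key, parse the date
now_epoch=$(date +%s)
days_left=$(( (expiry_epoch - now_epoch) / 86400 ))
echo "Certificate expires in $days_left days"
```

Alert when `days_left` drops below your renewal lead time rather than waiting for the handshake to fail.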
Pattern: Change Ticket Template¶
Every normal change should answer these questions before execution:
## Change Request: [CHG-XXXX]
### Summary
One sentence: what is being changed and why.
### Details
- **Service(s) affected:** api-gateway, auth-service
- **Environment:** production
- **Change type:** Normal
- **Risk level:** Medium
- **Change window:** 2026-03-17 14:00-15:00 UTC
### Pre-flight Checklist
- [ ] Tested in staging environment
- [ ] Peer reviewed (link to PR/review)
- [ ] Rollback procedure documented and tested
- [ ] Monitoring dashboards bookmarked
- [ ] On-call team notified
- [ ] Backup/snapshot taken
- [ ] Dependent teams notified (if cross-service)
### Execution Steps
1. [Exact command or action]
2. [Exact command or action]
3. [Verification step]
### Rollback Steps
1. [Exact command or action]
2. [Verification step]
Estimated rollback time: X minutes
### Validation Criteria
- [ ] Health check returns 200
- [ ] Error rate < 0.1%
- [ ] p99 latency < 200ms
- [ ] No new errors in logs
- [ ] Dependent services healthy
### Post-Change
- [ ] Monitoring watched for 15+ minutes
- [ ] Change ticket closed
- [ ] Team notified of completion
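The checklists above lend themselves to a mechanical gate. A sketch that counts unchecked boxes in a ticket body (inlined here for illustration; in practice read it from a file) and refuses to proceed while any remain:

```shell
# Count unchecked "- [ ]" items in a change ticket and block execution if any remain.
# The ticket body is inlined for illustration only.
ticket='- [x] Tested in staging environment
- [ ] Rollback procedure documented and tested
- [x] On-call team notified'
unchecked=$(printf '%s\n' "$ticket" | grep -c '^- \[ \]')
echo "$unchecked unchecked item(s)"
if [ "$unchecked" -gt 0 ]; then
  echo "Pre-flight incomplete; do not execute the change."
fi
```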
Pattern: Pre-Flight Checklist Execution¶
Before every non-trivial change:
# 1. Verify current state (baseline)
echo "=== Baseline Metrics ==="
curl -s https://api.example.com/health | jq .
echo "Error rate (check Grafana dashboard)"
echo "Current replica count:"
kubectl get deployment myapp -n production -o jsonpath='{.spec.replicas}'
# 2. Verify rollback readiness
echo "=== Rollback Readiness ==="
kubectl rollout history deployment/myapp -n production | tail -5
echo "Previous image:"
kubectl rollout history deployment/myapp -n production --revision=2 | grep -i image
# 3. Verify no active incidents
echo "=== Incident Check ==="
# Check PagerDuty / OpsGenie / incident channel
echo "Confirm: no active incidents? (y/n)"
# 4. Verify change window
echo "=== Timing ==="
echo "Current time: $(date -u '+%Y-%m-%d %H:%M UTC')"
echo "Planned window: 14:00-15:00 UTC"
echo "Is this within the change window? (y/n)"
# 5. Verify team availability
echo "=== Team ==="
echo "Executor: @alice"
echo "Reviewer on standby: @bob"
echo "On-call aware: @charlie"
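The y/n prompts above only print; they never read an answer. A small helper (a sketch; the function name is mine) can actually gate the script on each confirmation:

```shell
# confirm: print a prompt to stderr, read the reply, succeed only on exact "y".
confirm() {
  printf '%s (y/n) ' "$1" >&2
  read -r reply
  [ "$reply" = "y" ]
}

# Usage: abort the pre-flight unless every gate is explicitly confirmed.
echo y | confirm "No active incidents?" && echo "gate passed"
```

Call it as `confirm "Within change window?" || exit 1` after each check.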
Pattern: Rollback Triggers and Execution¶
Define clear thresholds. When they're breached, don't debate — roll back.
# Automated rollback trigger (in deployment pipeline)
#!/bin/bash
DEPLOY_TIME=$(date +%s)
WATCH_DURATION=300 # 5 minutes
while [ $(( $(date +%s) - DEPLOY_TIME )) -lt $WATCH_DURATION ]; do
  ERROR_RATE=$(curl -s "http://prometheus:9090/api/v1/query" \
    --data-urlencode 'query=rate(http_requests_total{status=~"5.."}[1m]) / rate(http_requests_total[1m])' \
    | jq -r '.data.result[0].value[1] // "0"')  # "// 0" guards against an empty result set
  if (( $(echo "$ERROR_RATE > 0.01" | bc -l) )); then
    echo "ERROR RATE $ERROR_RATE exceeds threshold. ROLLING BACK."
    kubectl rollout undo deployment/myapp -n production
    exit 1
  fi
  sleep 10
done
echo "Change validated. Error rate stable."
Manual rollback commands to have ready:
# Kubernetes deployment
kubectl rollout undo deployment/myapp -n production
# Database migration (depends on tool)
flyway undo # note: undo migrations require Flyway Teams; or
alembic downgrade -1
# Config change
git revert HEAD && git push # Then deploy
# DNS change
# Have the previous record values documented
# aws route53 change-resource-record-sets --hosted-zone-id Z123 \
# --change-batch file://rollback-dns.json
# Terraform
terraform apply -target=module.myapp -var="image_tag=previous-tag"
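When you want to pin the rollback to a specific revision (`kubectl rollout undo --to-revision=N`) rather than trust the implicit previous one, you can parse the revision number out of the rollout history. A sketch, with the history text inlined for illustration:

```shell
# Parse the second-to-last revision number from `kubectl rollout history`
# text output (inlined here for illustration).
history='REVISION  CHANGE-CAUSE
1         <none>
2         <none>
3         <none>'
prev=$(printf '%s\n' "$history" | awk 'NR > 1 { r = prev; prev = $1 } END { print r }')
echo "previous revision: $prev"
# then: kubectl rollout undo deployment/myapp -n production --to-revision=$prev
```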
Gotcha: "It's Just a Config Change"¶
The phrase "it's just a config change" has preceded more outages than any code deployment. Config changes can:

- Change connection strings (instant data loss if wrong database)
- Modify timeouts (cascade failures across services)
- Toggle feature flags (expose unfinished features to all users)
- Adjust rate limits (overwhelm downstream services)
Treat config changes with the same process as code changes: review, test, deploy incrementally, validate.
War story: A one-character change to a connection pool `max_idle` setting (10 to 1) caused a cascading outage across three services. Each service opened new connections for every request, exhausting the database's connection limit in minutes. The change was marked "low risk" because it was "just a number."
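One way to make that review mechanical is to diff the old and new config and flag any high-risk keys. A sketch, with the key list and inlined configs purely illustrative:

```shell
# Flag changes to high-risk keys between two config versions.
# The risky-key list and inlined configs are illustrative, not exhaustive.
old_cfg='max_idle=10
timeout=30'
new_cfg='max_idle=1
timeout=30'
risky='max_idle|timeout|rate_limit|connection_string'
printf '%s\n' "$old_cfg" > /tmp/old.cfg
printf '%s\n' "$new_cfg" > /tmp/new.cfg
changed=$(diff /tmp/old.cfg /tmp/new.cfg | grep -E "^[<>] ($risky)=" || true)
if [ -n "$changed" ]; then
  echo "Risky keys changed; require full change review:"
  echo "$changed"
fi
```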
Gotcha: Friday Deploys¶
You finish a feature Friday at 3 PM. Ship it! By 6 PM, the team is offline. By Saturday morning, the on-call engineer is fighting a bug they've never seen in code they didn't write.
Risk
^
│ ████
│ ████ ████
│ ████ ████ ████
│ ████ ████ ████
│ ████ ████ ████
│ ████ ████ ████
└──────────────────────>
Mon Tue Wed Thu Fri
← Deploy Risk →
If you must deploy on Friday, at least deploy in the morning and monitor through the afternoon.
Pattern: Communication Templates¶
# Pre-change (post to #deploys or equivalent)
🔄 DEPLOYING: auth-service v2.14.3 → v2.15.0
Window: 14:00-14:30 UTC
Changes: JWT token refresh optimization, dependency updates
Rollback: kubectl rollout undo (< 1 min)
Owner: @alice | Reviewer: @bob
# Post-change success
✅ COMPLETE: auth-service v2.15.0 deployed
Validated: health checks passing, error rate normal, latency normal
Duration: 8 minutes (14:02-14:10 UTC)
# Post-change rollback
⚠️ ROLLED BACK: auth-service v2.15.0 → v2.14.3
Reason: p99 latency spike to 800ms (threshold: 200ms)
Rolled back at: 14:18 UTC
Services restored: 14:19 UTC
Follow-up: investigating root cause, will re-attempt after fix
Pattern: Post-Change Validation Script¶
#!/bin/bash
# post-change-validate.sh — run after every deploy
SERVICE="myapp"
NAMESPACE="production"
WAIT_SECONDS=60
CHECK_INTERVAL=10
echo "Starting post-change validation for $SERVICE..."
echo "Waiting ${WAIT_SECONDS}s for changes to stabilize..."
sleep $WAIT_SECONDS
PASS=true
# Check 1: Pods running
READY=$(kubectl get deployment "$SERVICE" -n "$NAMESPACE" \
-o jsonpath='{.status.readyReplicas}')
DESIRED=$(kubectl get deployment "$SERVICE" -n "$NAMESPACE" \
-o jsonpath='{.spec.replicas}')
if [ "$READY" != "$DESIRED" ]; then
echo "FAIL: Only $READY/$DESIRED pods ready"
PASS=false
else
echo "PASS: $READY/$DESIRED pods ready"
fi
# Check 2: No restarts in last 2 minutes
RESTARTS=$(kubectl get pods -n "$NAMESPACE" -l app="$SERVICE" \
-o jsonpath='{.items[*].status.containerStatuses[*].restartCount}')
echo "Pod restart counts: $RESTARTS"
# Check 3: Health endpoint
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" \
"https://api.example.com/health")
if [ "$HTTP_CODE" != "200" ]; then
echo "FAIL: Health check returned $HTTP_CODE"
PASS=false
else
echo "PASS: Health check returned 200"
fi
# Check 4: No error logs
ERROR_COUNT=$(kubectl logs -n "$NAMESPACE" -l app="$SERVICE" \
--since=2m 2>/dev/null | grep -c -i "error\|panic\|fatal")
if [ "$ERROR_COUNT" -gt 0 ]; then
echo "WARN: $ERROR_COUNT error log lines in last 2 minutes"
fi
if [ "$PASS" = true ]; then
echo "✓ Post-change validation PASSED"
else
echo "✗ Post-change validation FAILED — consider rollback"
exit 1
fi
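The single `sleep 60` above is a blunt instrument: a check that fails once may pass a moment later. A small retry helper (a sketch; the function name is mine) makes each check tolerant of brief flapping without hiding a real failure:

```shell
# retry: run a check up to N times, INTERVAL seconds apart, before failing.
retry() {
  n=$1; interval=$2; shift 2
  i=1
  while [ "$i" -le "$n" ]; do
    "$@" && return 0   # check passed
    sleep "$interval"
    i=$((i + 1))
  done
  return 1             # all attempts exhausted
}

retry 3 0 true && echo "check passed"
retry 2 0 false || echo "check failed after 2 attempts"
```

In the validation script, `retry 6 10 curl -sf https://api.example.com/health` replaces the fixed wait with up to a minute of polling.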
Gotcha: Silent Changes¶
Someone SSH'd into a prod box and edited a config file. Nobody knows. Two weeks later, the server is replaced (immutable infra), the config is gone, and the service breaks. Nobody can reproduce the fix because it was never documented.
Prevention:
- Disable direct SSH to production (use bastion with audit logging)
- Use configuration management (Ansible, Puppet, Chef) as the source of truth
- Run etckeeper to auto-commit /etc changes
- Alert on file changes in critical directories (auditd, AIDE, OSSEC)
- Make it easier to go through the process than to go around it
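Tools like auditd or AIDE do this properly; the core idea is just checksum drift detection. A minimal sketch, using a throwaway file to simulate a silent edit (paths are illustrative):

```shell
# Detect unmanaged edits by comparing a config checksum against a baseline.
# /tmp/demo.conf stands in for a real config file; run the check from cron.
CONF=/tmp/demo.conf
printf 'max_idle=10\n' > "$CONF"
baseline=$(sha256sum "$CONF" | awk '{print $1}')
printf 'max_idle=1\n' > "$CONF"   # simulate someone editing the file by hand
current=$(sha256sum "$CONF" | awk '{print $1}')
if [ "$baseline" != "$current" ]; then
  echo "DRIFT: $CONF changed outside change management"
fi
```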
Pattern: Emergency Change Workflow¶
When production is down and you need to change something NOW:
1. Declare incident (incident channel, PagerDuty)
2. Identify the fix
3. Get verbal approval from incident commander
4. Execute the change
5. Validate service restoration
6. Communicate resolution
7. Create change ticket retroactively (within 24 hours)
8. Post-incident review includes the emergency change
The key difference: you still document it. "We were in a hurry" is not a reason to have zero record of what changed.
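The steps above are easier to document retroactively if every command is logged as it runs. A sketch of a tiny wrapper (the path and function name are mine) that timestamps each emergency command before executing it:

```shell
# Record every emergency command with a UTC timestamp before running it,
# so the retroactive change ticket has an exact trail. Path is illustrative.
CHANGELOG=/tmp/emergency-change.log
: > "$CHANGELOG"   # start a fresh log for this incident
run() {
  printf '%s | %s\n' "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" "$*" >> "$CHANGELOG"
  "$@"
}

run echo "scaling down ingestion"
run echo "restarting auth-service"
cat "$CHANGELOG"
```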
Remember: The most dangerous changes are the ones made under time pressure with the fewest witnesses. Emergency changes account for a disproportionate share of repeat incidents because they bypass the feedback loops (review, testing, documentation) that prevent recurrence.