
Decision Tree: How to Handle This Config Change?

Category: Operational Decisions
Starting Question: "I need to make a config change to a running system — what process?"
Estimated traversal: 3–5 minutes
Domains: configuration-management, change-management, deployments, SRE


The Tree

I need to make a config change to a running system — what process?
├── [Check 1] Is this an emergency? (production incident is driving the change)
│   ├── YES — system is currently broken and this change is the fix
│   │   ├── [Check 2] Is the current system state healthy enough to tolerate a change?
│   │   │   ├── NO (system is crashing, pods restarting) → stabilize first
│   │   │   │   └── → ⚠️ STABILIZE, then apply emergency change
│   │   │   └── YES (degraded but not crashing)
│   │   │       ├── [Check 3] Is this change reversible in < 5 minutes?
│   │   │       │   ├── YES → ✅ EMERGENCY CHANGE with immediate rollback plan documented
│   │   │       │   └── NO (requires restart, migration, or manual cleanup)
│   │   │       │       └── → ⚠️ PAGE SECOND APPROVER before proceeding
│   │
│   └── NO — this is a planned change (optimization, new feature, routine update)
│       │
│       ├── [Check 4] What is the blast radius?
│       │   │
│       │   ├── Single pod / single instance (affects one workload only)
│       │   │   ├── [Check 5] Has this been tested in a non-prod environment?
│       │   │   │   ├── YES (staging validated within the last 7 days)
│       │   │   │   │   ├── [Check 6] Does the change require a restart or pod rollout?
│       │   │   │   │   │   ├── YES → ✅ STANDARD PR + ROLLING RESTART
│       │   │   │   │   │   └── NO (hot reload / live config) → ✅ CONFIGMAP UPDATE
│       │   │   │   └── NO (untested in staging)
│       │   │   │       └── → ✅ TEST IN STAGING FIRST (block: do not apply to prod)
│       │   │
│       │   ├── Entire deployment / all pods in a namespace
│       │   │   ├── [Check 5] Has this been tested in a non-prod environment?
│       │   │   │   ├── YES
│       │   │   │   │   ├── [Check 7] Is there a canary or traffic-split path available?
│       │   │   │   │   │   ├── YES → ✅ CANARY ROLLOUT (1 pod → 10% → 50% → 100%)
│       │   │   │   │   │   └── NO → ✅ ROLLING RESTART with readiness probe validation
│       │   │   │   └── NO → ✅ TEST IN STAGING FIRST
│       │   │
│       │   ├── Entire cluster or all regions
│       │   │   ├── [Check 8] Is there a change freeze in effect?
│       │   │   │   ├── YES → ✅ CHANGE FREEZE HOLD — get exception or wait
│       │   │   │   └── NO
│       │   │       ├── [Check 9] Is it reversible within 5 minutes?
│       │   │       │   ├── YES + tested in staging → ✅ BLUE-GREEN SWAP or staged rollout
│       │   │       │   └── NO → ⚠️ ESCALATE — requires change management sign-off
│       │   │
│       │   └── Shared infrastructure (database, message broker, load balancer, DNS)
│       │       ├── [Check 10] Does the change require downtime?
│       │       │   ├── YES → ✅ MAINTENANCE WINDOW — schedule, notify, execute
│       │       │   └── NO → ✅ CHANGE MANAGEMENT REVIEW + canary + rollback plan
│       │
│       └── [Check 11] Is the system currently degraded (not an emergency, but not healthy)?
│           ├── YES → ⚠️ WARNING: Applying changes to a degraded system is high risk
│           │         Resolve the degradation first, or get explicit approval to proceed
│           └── NO (system is healthy) → proceed with blast-radius checks above

Node Details

Check 1: Emergency vs planned change

Command/method:

# Is there an active incident driving this change?
pd incident list --statuses triggered,acknowledged | grep your-service

# Is there an open P1/P2 that this change resolves?
gh issue list --label "incident,P1" --state open | grep your-service

# Check current error rate to assess urgency
kubectl exec -it prometheus-pod -- promtool query instant \
  'rate(http_requests_total{service="myapp",status=~"5.."}[5m])'
What you're looking for: Active P1/P2 incident + this config change is the identified fix = emergency path. "I want to do this while I have time" = planned path, even if it feels urgent.

Common pitfall: Treating a P3 or a personal sense of urgency as an emergency to skip change management. The emergency path exists for production-down scenarios, not convenience.

Check 2: Current system stability

Command/method:

# Pod health
kubectl get pods -n production -l app=myapp
kubectl describe pods -n production -l app=myapp | grep -A5 "Conditions:"

# Recent restarts
kubectl get pods -n production -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\n"}{end}'

# Is the system in a restart loop?
kubectl get pods -n production | grep -E "CrashLoopBackOff|Error|OOMKilled"
What you're looking for: All pods Running with 0–1 restarts in the last hour = stable enough to proceed. Any CrashLoopBackOff or restart count > 3 = stabilize first.

Common pitfall: Making a config change while pods are thrashing. The noise from the existing instability makes it impossible to tell if your change helped or made things worse.
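The restart-count check above can be turned into a mechanical gate. A minimal sketch, assuming bash: it feeds the jsonpath output from the command above through awk, falling back to invented sample data when no cluster is reachable so the logic itself is visible.

```shell
# Hypothetical gate (not part of this runbook): block the change if any pod
# has restarted more than 3 times. Real input comes from the jsonpath command
# above; the printf sample stands in when kubectl is unavailable.
pod_restarts="$(kubectl get pods -n production -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\n"}{end}' 2>/dev/null || true)"
pod_restarts="${pod_restarts:-$(printf 'myapp-abc\t0\nmyapp-def\t5\n')}"  # sample data

# Flag any pod with more than 3 restarts
unstable=$(echo "$pod_restarts" | awk -F'\t' '$2 > 3 {print $1}')
if [ -n "$unstable" ]; then
  echo "BLOCK: stabilize these pods first: $unstable"
else
  echo "OK: stable enough to proceed"
fi
```

With the sample data, `myapp-def` trips the gate; against a real cluster the same script gives a clean yes/no instead of an eyeball judgment.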

Check 4: Blast radius assessment

Command/method:

# What does this ConfigMap / Secret apply to?
kubectl get configmap myapp-config -n production -o yaml | grep -A5 "metadata:"

# How many pods will be affected? (select pods on the ConfigMap's first label)
kubectl get pods -n production -l "$(kubectl get configmap myapp-config -n production -o json \
  | jq -r '.metadata.labels | to_entries[0] | "\(.key)=\(.value)"')"

# Does this config apply to multiple namespaces?
kubectl get configmap --all-namespaces --field-selector metadata.name=myapp-config

# Estimate affected pods/services
kubectl get deployment -n production --selector="app=myapp" \
  -o jsonpath='{.items[*].status.replicas}'
What you're looking for: 1 pod = low blast radius. All pods in a deployment = medium. All deployments in a cluster or shared infra = high.

Common pitfall: A ConfigMap change that looks single-service but is actually mounted by 5 different deployments. Enumerate the ConfigMap's consumers — every workload that mounts it as a volume or references it via envFrom — before assuming narrow blast radius.
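The hidden-consumers pitfall can be checked mechanically with a jq filter that finds every deployment mounting or envFrom-ing a given ConfigMap. A sketch, run here against a stand-in for `kubectl get deployments -n production -o json` (the deployment names are invented):

```shell
# Sample of what `kubectl get deployments -o json` returns, trimmed to the
# fields the filter needs. "unrelated" does not consume myapp-config.
deployments_json='{"items":[
  {"metadata":{"name":"myapp"},
   "spec":{"template":{"spec":{
     "volumes":[{"configMap":{"name":"myapp-config"}}],
     "containers":[{"envFrom":[]}]}}}},
  {"metadata":{"name":"reporting"},
   "spec":{"template":{"spec":{
     "volumes":[],
     "containers":[{"envFrom":[{"configMapRef":{"name":"myapp-config"}}]}]}}}},
  {"metadata":{"name":"unrelated"},
   "spec":{"template":{"spec":{"volumes":[],"containers":[{}]}}}}]}'

# Select deployments that reference the ConfigMap via a volume OR via envFrom
consumers=$(echo "$deployments_json" | jq -r --arg cm myapp-config '
  .items[]
  | select(
      ([.spec.template.spec.volumes[]?.configMap.name] | index($cm)) or
      ([.spec.template.spec.containers[].envFrom[]?.configMapRef.name] | index($cm))
    )
  | .metadata.name')
echo "$consumers"
```

Against the sample it lists `myapp` and `reporting` but not `unrelated`; piping real cluster JSON through the same filter gives the true blast radius.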

Check 5: Non-prod testing validation

Command/method:

# Apply change to staging first
kubectl config use-context staging
kubectl apply -f config-change.yaml

# Run smoke tests
kubectl exec -it test-pod -n staging -- ./smoke-tests.sh

# Check for errors after applying
kubectl logs -l app=myapp -n staging --since=5m | grep -E "ERROR|FATAL|panic"

# Validate the config was loaded correctly
kubectl exec -it myapp-pod-in-staging -- env | grep CONFIG_KEY
kubectl exec -it myapp-pod-in-staging -- curl localhost:8080/debug/config
What you're looking for: Zero errors after the config is applied in staging, smoke tests pass, and the config value is confirmed via the debug endpoint.

Common pitfall: Validating in staging without confirming that staging is sufficiently representative of production. A staging environment with a different auth provider, different secret values, or different network topology will not catch config errors that only manifest in production.
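The representativeness pitfall can be checked by diffing the staging and production copies of the ConfigMap before trusting a staging validation. A sketch: the context names in the commented commands are assumptions, and two invented sample files stand in here so the comparison itself is runnable.

```shell
# In the real flow the two files come from the two clusters, e.g.:
#   kubectl --context staging get configmap myapp-config -o yaml > /tmp/cm-staging.yaml
#   kubectl --context production get configmap myapp-config -o yaml > /tmp/cm-prod.yaml
# Sample files (invented values) stand in for illustration:
cat > /tmp/cm-staging.yaml <<'EOF'
data:
  AUTH_PROVIDER: mock-oidc
  WORKER_THREADS: "8"
EOF
cat > /tmp/cm-prod.yaml <<'EOF'
data:
  AUTH_PROVIDER: corp-oidc
  WORKER_THREADS: "4"
EOF

if diff -u /tmp/cm-staging.yaml /tmp/cm-prod.yaml > /tmp/cm.diff; then
  echo "staging matches production: validation is representative"
else
  echo "DRIFT DETECTED: staging validation may not cover production behavior"
  grep '^[+-][^+-]' /tmp/cm.diff   # show only the changed lines, not diff headers
fi
```

Here the differing auth provider surfaces immediately, which is exactly the kind of gap that lets a staging-green change fail in production.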

Check 6: Restart vs hot reload

Command/method:

# Does the application support hot config reload?
kubectl exec -it myapp-pod -- curl -s localhost:8080/debug/config-reload
# OR
kubectl exec -it myapp-pod -- kill -HUP 1  # SIGHUP for many apps

# Check application documentation or code
grep -r "SIGHUP\|viper.WatchConfig\|hot.reload\|config.watch" app/src/ | head -10

# Nginx example
kubectl exec -it nginx-pod -- nginx -s reload

# Check if ConfigMap change requires pod restart
kubectl get deployment myapp -o jsonpath='{.spec.template.spec.containers[*].envFrom}'
What you're looking for: If config is loaded at startup only (env vars from ConfigMap, not mounted as file with inotify), you need a rolling restart. If the app watches the file or has a reload endpoint, you can apply without restart.

Common pitfall: Assuming all ConfigMap changes take effect immediately. Environment variables injected from ConfigMaps are only read at pod startup — changing the ConfigMap does not update running pods.
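One common way to remove this ambiguity entirely (a general pattern, not something this runbook mandates) is to stamp the pod template with a hash of the config file, so any config edit changes the template and forces a rollout. The filename and annotation key below are illustrative:

```shell
# Sample ConfigMap file (stands in for the real file under version control)
cat > /tmp/myapp-configmap.yaml <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: myapp-config
data:
  WORKER_THREADS: "8"
EOF

# Hash the file; embedding this value in the pod template means any edit to
# the config changes the template and therefore triggers a rolling restart
config_hash=$(sha256sum /tmp/myapp-configmap.yaml | cut -c1-16)
echo "checksum/config annotation value: $config_hash"

# Against a real cluster this would apply the annotation (commented out here):
#   kubectl patch deployment myapp -n production -p \
#     "{\"spec\":{\"template\":{\"metadata\":{\"annotations\":{\"checksum/config\":\"$config_hash\"}}}}}"
```

With this in place, "did the pods pick up the change?" has one answer: yes, because the rollout is the delivery mechanism.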

Check 8: Change freeze status

Command/method:

# Check change freeze calendar
curl -s "https://wiki.internal/api/change-freeze" | jq '.active'

# Check Slack announcements channel for freeze notices
# Search: "change freeze" in #platform-announcements

# Check deployment guardrails
cat /workspace/_guardrails/change-freeze.conf 2>/dev/null || echo "no local freeze config"
What you're looking for: Active freeze = no production changes without explicit exception approval from change manager or VP Eng.

Common pitfall: Forgetting that change freezes often apply to all environments (including staging deploys that promote to production). Confirm scope of freeze before assuming staging is exempt.

Check 11: System health pre-check

Command/method:

# Full system health before making any change
kubectl get pods -n production | grep -v Running | grep -v Completed
kubectl get nodes | grep -vw Ready   # -w so "NotReady" nodes still show up
kubectl get pvc -n production | grep -v Bound

# SLO burn rate — are we burning error budget right now?
kubectl exec -it prometheus-pod -- promtool query instant \
  '(1 - sum(rate(http_requests_total{status!~"5.."}[1h])) / sum(rate(http_requests_total[1h]))) / 0.001'

# Any ongoing deploys?
kubectl rollout status deployment/myapp -n production 2>&1 | head -5
What you're looking for: All pods Running, all nodes Ready, all PVCs Bound, no ongoing rollouts, SLO burn rate < 5x = healthy enough to proceed.

Common pitfall: Making a config change while a deployment rollout is still in progress. The combination of a mid-rollout and a config change makes debugging extremely difficult.
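These pre-checks can be folded into a single go/no-go decision. A sketch with sample inputs standing in for the command outputs above; the 5x burn-rate threshold is the one this section uses, and the sample values are invented:

```shell
# Inputs: in the real flow each variable holds the output of the corresponding
# kubectl/promtool command above; an empty string means that check passed.
not_running_pods=""     # pods not Running/Completed
not_ready_nodes=""      # nodes not Ready
unbound_pvcs=""         # PVCs not Bound
burn_rate="1.8"         # sample SLO burn-rate query result

go=yes
if [ -n "$not_running_pods$not_ready_nodes$unbound_pvcs" ]; then go=no; fi
# awk exits 0 when the burn rate is under the 5x threshold
if ! awk -v r="$burn_rate" 'BEGIN { exit !(r < 5) }'; then go=no; fi

echo "config-change gate: $go"
```

Encoding the gate this way makes the "healthy enough to proceed" call reproducible instead of a judgment made differently by each engineer.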


Terminal Actions

✅ Action: Standard PR + Deploy Pipeline

Do:

# 1. Make change in version-controlled config file
vim kubernetes/production/myapp-configmap.yaml

# 2. Open a PR with description of what changed and why
git add kubernetes/production/myapp-configmap.yaml
git commit -m "config: increase myapp worker threads from 4 to 8 for Q4 load"
gh pr create --title "config: increase myapp worker threads" \
  --body "Reason: profiling showed CPU idle while threads blocked on I/O. Tested in staging for 24h."

# 3. Merge PR and let pipeline deploy
# 4. Watch deployment rollout
kubectl rollout status deployment/myapp -n production --timeout=10m

# 5. Verify config took effect
kubectl exec -n production -it $(kubectl get pods -n production -l app=myapp -o name | head -1) -- \
  env | grep WORKER_THREADS
Verify: Pipeline succeeds, rollout completes, config value confirmed in running pod, error rate unchanged after deploy.
Runbook: standard-deploy.md

✅ Action: Emergency Change with Immediate Rollback Plan

Do:

# 1. Document the rollback command BEFORE making the change
ROLLBACK_CMD="kubectl apply -f kubernetes/production/myapp-configmap.yaml.bak"
echo "ROLLBACK: $ROLLBACK_CMD" | tee /tmp/emergency-change-$(date +%Y%m%d%H%M).txt

# 2. Back up current config
kubectl get configmap myapp-config -n production -o yaml > \
  kubernetes/production/myapp-configmap.yaml.bak

# 3. Apply the change
kubectl apply -f kubernetes/production/myapp-configmap-emergency.yaml

# 4. Trigger rolling restart to pick up new config
kubectl rollout restart deployment/myapp -n production

# 5. Watch and validate — set 5-minute timer
kubectl rollout status deployment/myapp -n production --timeout=5m
kubectl logs -n production -l app=myapp --since=2m | grep -c ERROR

# 6. If worse, roll back immediately
# $ROLLBACK_CMD && kubectl rollout restart deployment/myapp -n production
Verify: Error rate decreased, no new errors introduced.
Post-incident: open a follow-up ticket to get this change through the standard PR process.
Runbook: emergency-change.md
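The "set a 5-minute timer and roll back if worse" step is easy to fumble under pressure. A generic sketch that wires apply, validation, and rollback into one function; the stand-in commands at the bottom are placeholders for the real `kubectl apply`, the error-rate check, and the documented `$ROLLBACK_CMD`:

```shell
# apply_with_rollback APPLY VALIDATE ROLLBACK: run APPLY, then VALIDATE;
# if validation fails, run ROLLBACK automatically instead of relying on a
# human remembering the rollback command mid-incident.
apply_with_rollback() {
  if ! sh -c "$1"; then
    echo "apply failed, nothing to roll back"
    return 1
  fi
  if sh -c "$2"; then
    echo "change validated"
  else
    echo "validation failed, rolling back"
    sh -c "$3"
  fi
}

# Illustration with stand-in commands (a change that applies but fails validation):
result=$(apply_with_rollback "true" "false" "echo 'rollback executed'")
echo "$result"
```

Because the rollback path is part of the script, it gets exercised the moment validation fails rather than minutes later.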

✅ Action: Canary Rollout (1 pod → 10% → 50% → 100%)

Do:

# Step 1: Create a canary copy of the deployment that uses the new ConfigMap
kubectl apply -f myapp-configmap-canary.yaml
kubectl apply -f myapp-canary-deployment.yaml  # 1 replica, labeled canary=true

# Step 2: Watch the canary pod for 10 minutes
kubectl logs -n production -l canary=true --since=10m | grep -c ERROR

# Step 3: If healthy, run the canary at ~10% of total capacity
kubectl scale deployment myapp-canary -n production --replicas=1  # out of 10 total pods

# Step 4: Monitor error rate in Prometheus for 15 minutes
# Step 5: Expand to 50%, then promote: apply the new config to the main deployment
kubectl apply -f myapp-configmap-new.yaml
kubectl rollout restart deployment/myapp -n production

# Step 6: Once 100% is healthy, remove the canary deployment
kubectl delete deployment myapp-canary -n production
Verify: Each stage runs for at least 10 minutes without an error-rate increase before advancing.
Runbook: canary-deploy.md
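The per-stage "if healthy" judgment can be made explicit by comparing canary and baseline error ratios. A sketch with invented sample values; the 2x-of-baseline threshold is an assumed policy, and in the real flow each value would come from a Prometheus ratio query over canary vs stable pods:

```shell
canary_err="0.012"     # sample: canary pods' 5xx ratio over the last 10m
baseline_err="0.010"   # sample: stable pods' 5xx ratio over the same window

# Promote only if the canary error ratio is within 2x of baseline (assumed policy)
if awk -v c="$canary_err" -v b="$baseline_err" 'BEGIN { exit !(c <= 2 * b) }'; then
  decision="PROMOTE"
else
  decision="HOLD"
fi
echo "canary decision: $decision"
```

Writing the threshold down also means the promote/hold call survives a handoff between engineers mid-rollout.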

✅ Action: ConfigMap Update + Rolling Restart

Do:

# 1. Apply updated ConfigMap
kubectl apply -f myapp-configmap-updated.yaml

# 2. Trigger rolling restart to pick up new config
kubectl rollout restart deployment/myapp -n production

# 3. Monitor rollout
kubectl rollout status deployment/myapp -n production

# 4. Verify new config is loaded
for pod in $(kubectl get pods -n production -l app=myapp -o name); do
  echo "$pod: $(kubectl exec -n production $pod -- env | grep CONFIG_KEY)"
done
Verify: All pods show new config value, rollout completes without pod failures.
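Step 4's eyeball check can be turned into a pass/fail. A sketch over sample loop output (in the real flow `pod_values` is what the verification loop above prints; the pod names and value are invented):

```shell
# Sample per-pod env dump, as produced by the loop above
pod_values="$(printf 'pod/myapp-abc: WORKER_THREADS=8\npod/myapp-def: WORKER_THREADS=4\n')"
expected="WORKER_THREADS=8"

# Any line not showing the expected value is a pod still on the old config
stale=$(echo "$pod_values" | grep -v "$expected" || true)
if [ -z "$stale" ]; then
  echo "all pods on new config"
else
  echo "stale pods still on old config:"
  echo "$stale"
fi
```

A stale pod usually means its rollout replacement has not happened yet, which is exactly the case where declaring victory early causes confusion later.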

✅ Action: Change Freeze Hold

Do:

# 1. Document the change you need to make
# 2. Request an exception if it's blocking critical work
# Template: "Requesting change freeze exception for [change]. Reason: [impact if delayed]. Risk: [low/medium — reversible in < 5 min]. Approver needed: [change manager name]."

# 3. Schedule for next maintenance window
# 4. Add to change calendar so it's not forgotten

gh issue create --repo org/platform \
  --title "FROZEN: config change for myapp — schedule for post-freeze" \
  --label "change-freeze,config" \
  --body "Blocked by change freeze until [date]. Change: increase worker threads. Validation complete."
Verify: The issue is tracked and scheduled. Follow up after freeze lifts.

⚠️ Warning: Applying Change to a Degraded System

When: The system is already experiencing errors, high latency, or pod restarts when you want to make a config change.

Risk: It becomes impossible to determine if the config change helped, had no effect, or made things worse. If the system crashes after your change, you will be blamed regardless of causation.

Mitigation: Stabilize the system to a known baseline state before making any config changes, unless the config change is the identified fix for the current degradation (use emergency path).


Edge Cases

  • Config change via environment variables vs mounted files: Env-var config requires pod restart to take effect; mounted ConfigMap files update in place but only if the application watches for file changes. Know your app's config-loading behavior before choosing the path.
  • Secret rotation vs config change: Rotating credentials (database passwords, API keys) looks like a config change but has additional steps: update the external system first, then update the Kubernetes Secret, then restart. The ordering matters.
  • Multi-cluster config drift: If the same ConfigMap exists in multiple clusters and they have drifted, applying a "small" config change may look very different across clusters. Always diff the current state across all target clusters before applying.
  • Config validation at deploy time: Some config values (regex patterns, JSON schemas, YAML) are only validated when the application loads them. A syntactically valid but semantically wrong config will pass kubectl apply and only fail at pod startup.
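The last edge case can be partly defused client-side: semantically risky values (embedded JSON, regex patterns) can be validated locally before `kubectl apply` ever sees them. A sketch with invented config keys:

```shell
# Sample config values that would pass `kubectl apply` even if broken,
# because apply only checks YAML syntax, not the values' semantics
routing_rules='{"default": "pool-a", "overrides": {"eu": "pool-b"}}'
retry_status_pattern='^5[0-9][0-9]$'

# Validate that the embedded JSON actually parses
echo "$routing_rules" | jq -e . > /dev/null && echo "routing_rules: valid JSON"

# Validate that the regex compiles and matches what it is supposed to
echo "503" | grep -Eq "$retry_status_pattern" && echo "retry_status_pattern: matches 503"
echo "200" | grep -Eq "$retry_status_pattern" || echo "retry_status_pattern: correctly rejects 200"
```

Running checks like these in CI on the config repo catches the "valid YAML, broken value" class of failure before it reaches a pod startup.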

Cross-References