Decision Tree: How to Handle This Config Change?

Category: Operational Decisions
Starting Question: "I need to make a config change to a running system — what process?"
Estimated traversal: 3-5 minutes
Domains: configuration-management, change-management, deployments, SRE

The Tree
I need to make a config change to a running system — what process?
│
├── [Check 1] Is this an emergency? (production incident is driving the change)
│ ├── YES — system is currently broken and this change is the fix
│ │ ├── [Check 2] Is the current system state healthy enough to tolerate a change?
│ │ │ ├── NO (system is crashing, pods restarting) → stabilize first
│ │ │ │ └── → ⚠️ STABILIZE, then apply emergency change
│ │ │ └── YES (degraded but not crashing)
│ │ │ ├── [Check 3] Is this change reversible in < 5 minutes?
│ │ │ │ ├── YES → ✅ EMERGENCY CHANGE with immediate rollback plan documented
│ │ │ │ └── NO (requires restart, migration, or manual cleanup)
│ │ │ │ └── → ⚠️ PAGE SECOND APPROVER before proceeding
│ │
│ └── NO — this is a planned change (optimization, new feature, routine update)
│ │
│ ├── [Check 4] What is the blast radius?
│ │ │
│ │ ├── Single pod / single instance (affects one workload only)
│ │ │ ├── [Check 5] Has this been tested in a non-prod environment?
│ │ │ │ ├── YES (staging validated within the last 7 days)
│ │ │ │ │ ├── [Check 6] Does the change require a restart or pod rollout?
│ │ │ │ │ │ ├── YES → ✅ STANDARD PR + ROLLING RESTART
│ │ │ │ │ │ └── NO (hot reload / live config) → ✅ CONFIGMAP UPDATE
│ │ │ │ └── NO (untested in staging)
│ │ │ │ └── → ✅ TEST IN STAGING FIRST (block: do not apply to prod)
│ │ │
│ │ ├── Entire deployment / all pods in a namespace
│ │ │ ├── [Check 5] Has this been tested in a non-prod environment?
│ │ │ │ ├── YES
│ │ │ │ │ ├── [Check 7] Is there a canary or traffic-split path available?
│ │ │ │ │ │ ├── YES → ✅ CANARY ROLLOUT (1 pod → 10% → 50% → 100%)
│ │ │ │ │ │ └── NO → ✅ ROLLING RESTART with readiness probe validation
│ │ │ │ └── NO → ✅ TEST IN STAGING FIRST
│ │ │
│ │ ├── Entire cluster or all regions
│ │ │ ├── [Check 8] Is there a change freeze in effect?
│ │ │ │ ├── YES → ✅ CHANGE FREEZE HOLD — get exception or wait
│ │ │ │ └── NO
│ │ │ ├── [Check 9] Is it reversible within 5 minutes?
│ │ │ │ ├── YES + tested in staging → ✅ BLUE-GREEN SWAP or staged rollout
│ │ │ │ └── NO → ⚠️ ESCALATE — requires change management sign-off
│ │ │
│ │ └── Shared infrastructure (database, message broker, load balancer, DNS)
│ │ ├── [Check 10] Does the change require downtime?
│ │ │ ├── YES → ✅ MAINTENANCE WINDOW — schedule, notify, execute
│ │ │ └── NO → ✅ CHANGE MANAGEMENT REVIEW + canary + rollback plan
│ │
│ └── [Check 11] Is the system currently degraded (not an emergency, but not healthy)?
│ ├── YES → ⚠️ WARNING: Applying changes to a degraded system is high risk
│ │ Resolve the degradation first, or get explicit approval to proceed
│ └── NO (system is healthy) → proceed with blast-radius checks above
Node Details
Check 1: Emergency vs planned change
Command/method:
# Is there an active incident driving this change?
pd incident list --statuses triggered,acknowledged | grep your-service
# Is there an open P1/P2 that this change resolves?
gh issue list --label "incident,P1" --state open | grep your-service
# Check current error rate to assess urgency
kubectl exec -it prometheus-pod -- promtool query instant \
'rate(http_requests_total{service="myapp",status=~"5.."}[5m])'
Check 2: Current system stability
Command/method:
# Pod health
kubectl get pods -n production -l app=myapp
kubectl describe pods -n production -l app=myapp | grep -A5 "Conditions:"
# Recent restarts
kubectl get pods -n production -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\n"}{end}'
# Is the system in a restart loop?
kubectl get pods -n production | grep -E "CrashLoopBackOff|Error|OOMKilled"
Check 4: Blast radius assessment
Command/method:
# What does this ConfigMap / Secret apply to?
kubectl get configmap myapp-config -n production -o yaml | grep -A5 "metadata:"
# How many pods will be affected?
kubectl get pods -n production -l "$(kubectl get configmap myapp-config -o jsonpath='{.metadata.labels}' | jq -r 'to_entries | .[] | "\(.key)=\(.value)"' | head -1)"
# Does this config apply to multiple namespaces?
kubectl get configmap myapp-config --all-namespaces
# Estimate affected pods/services
kubectl get deployment -n production --selector="app=myapp" \
-o jsonpath='{.items[*].status.replicas}'
# Check the ConfigMap's consumers across namespaces before assuming a narrow blast radius
kubectl get pods --all-namespaces -o yaml | grep myapp-config
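When grep-based checks are inconclusive, the deployment JSON itself can be filtered for any reference to the ConfigMap, whether via envFrom, an env valueFrom, or a mounted volume. A sketch assuming jq is available and myapp-config is the target name; the sample file stands in for live kubectl output:

```shell
# Enumerate every deployment that references the ConfigMap "myapp-config".
# The sample file stands in for: kubectl get deployments -n production -o json
cat > /tmp/deploys.json <<'EOF'
{"items":[
  {"metadata":{"name":"myapp"},
   "spec":{"template":{"spec":{"containers":[
     {"name":"app","envFrom":[{"configMapRef":{"name":"myapp-config"}}]}]}}}},
  {"metadata":{"name":"other"},
   "spec":{"template":{"spec":{"containers":[{"name":"app"}]}}}}
]}
EOF
# Recursively search each item for configMapRef, configMapKeyRef, or volume
# configMap references and print only the matching deployment names
jq -r '
  .items[]
  | select([.. | objects
            | select(.configMapRef?.name == "myapp-config"
                     or .configMapKeyRef?.name == "myapp-config"
                     or .configMap?.name == "myapp-config")]
           | length > 0)
  | .metadata.name' /tmp/deploys.json
```

Against the sample above this prints only myapp; against a live cluster, pipe `kubectl get deployments -n production -o json` into the same filter.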
Check 5: Non-prod testing validation
Command/method:
# Apply change to staging first
kubectl config use-context staging
kubectl apply -f config-change.yaml
# Run smoke tests
kubectl exec -it test-pod -n staging -- ./smoke-tests.sh
# Check for errors after applying
kubectl logs -l app=myapp -n staging --since=5m | grep -E "ERROR|FATAL|panic"
# Validate the config was loaded correctly
kubectl exec -it myapp-pod-in-staging -- env | grep CONFIG_KEY
kubectl exec -it myapp-pod-in-staging -- curl localhost:8080/debug/config
Check 6: Restart vs hot reload
Command/method:
# Does the application support hot config reload?
kubectl exec -it myapp-pod -- curl -s localhost:8080/debug/config-reload
# OR
kubectl exec -it myapp-pod -- kill -HUP 1 # SIGHUP for many apps
# Check application documentation or code
grep -r "SIGHUP\|viper.WatchConfig\|hot.reload\|config.watch" app/src/ | head -10
# Nginx example
kubectl exec -it nginx-pod -- nginx -s reload
# Check if ConfigMap change requires pod restart
kubectl get deployment myapp -o jsonpath='{.spec.template.spec.containers[*].envFrom}'
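The envFrom check above can be turned into a mechanical decision: env-based config needs a pod restart, while volume-mounted ConfigMaps update in place (if the application watches the files). A sketch assuming jq; the deployment and ConfigMap names are examples, and the sample file stands in for live kubectl output:

```shell
# Decide restart vs hot reload from the deployment spec. The sample file
# stands in for: kubectl get deployment myapp -n production -o json
cat > /tmp/deploy.json <<'EOF'
{"spec":{"template":{"spec":{
  "containers":[{"name":"app","envFrom":[{"configMapRef":{"name":"myapp-config"}}]}],
  "volumes":[]}}}}
EOF
# Any envFrom reference to the ConfigMap means env vars are baked in at pod
# start, so a restart is required for the change to take effect
if jq -e '[.spec.template.spec.containers[].envFrom? // [] | .[]
           | select(.configMapRef.name == "myapp-config")] | length > 0' \
     /tmp/deploy.json >/dev/null; then
  echo "restart required (env-based config)"
else
  echo "hot reload may be possible (volume-based config)"
fi
```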
Check 8: Change freeze status
Command/method:
# Check change freeze calendar
curl -s "https://wiki.internal/api/change-freeze" | jq '.active'
# Check Slack announcements channel for freeze notices
# Search: "change freeze" in #platform-announcements
# Check deployment guardrails
cat /workspace/_guardrails/change-freeze.conf 2>/dev/null || echo "no local freeze config"
Check 11: System health pre-check
Command/method:
# Full system health before making any change
kubectl get pods -n production | grep -v Running | grep -v Completed
kubectl get nodes | grep -v Ready
kubectl get pvc -n production | grep -v Bound
# SLO burn rate — are we burning error budget right now?
kubectl exec -it prometheus-pod -- promtool query instant \
'(1 - sum(rate(http_requests_total{status!~"5.."}[1h])) / sum(rate(http_requests_total[1h]))) / 0.001'
# Any ongoing deploys?
kubectl rollout status deployment/myapp -n production 2>&1 | head -5
Terminal Actions
✅ Action: Standard PR + Deploy Pipeline
Do:
# 1. Make change in version-controlled config file
vim kubernetes/production/myapp-configmap.yaml
# 2. Open a PR with description of what changed and why
git add kubernetes/production/myapp-configmap.yaml
git commit -m "config: increase myapp worker threads from 4 to 8 for Q4 load"
gh pr create --title "config: increase myapp worker threads" \
--body "Reason: profiling showed CPU idle while threads blocked on I/O. Tested in staging for 24h."
# 3. Merge PR and let pipeline deploy
# 4. Watch deployment rollout
kubectl rollout status deployment/myapp -n production --timeout=10m
# 5. Verify config took effect
kubectl exec -it $(kubectl get pods -l app=myapp -o name | head -1) -- \
env | grep WORKER_THREADS
✅ Action: Emergency Change with Immediate Rollback Plan
Do:
# 1. Document the rollback command BEFORE making the change
ROLLBACK_CMD="kubectl apply -f kubernetes/production/myapp-configmap.yaml.bak"
echo "ROLLBACK: $ROLLBACK_CMD" | tee /tmp/emergency-change-$(date +%Y%m%d%H%M).txt
# 2. Back up current config
kubectl get configmap myapp-config -n production -o yaml > \
kubernetes/production/myapp-configmap.yaml.bak
# 3. Apply the change
kubectl apply -f kubernetes/production/myapp-configmap-emergency.yaml
# 4. Trigger rolling restart to pick up new config
kubectl rollout restart deployment/myapp -n production
# 5. Watch and validate — set 5-minute timer
kubectl rollout status deployment/myapp -n production --timeout=5m
kubectl logs -l app=myapp --since=2m | grep -c ERROR
# 6. If worse, roll back immediately
# $ROLLBACK_CMD && kubectl rollout restart deployment/myapp -n production
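The 5-minute validation in steps 5-6 can be wrapped in a simple guard. A sketch: the threshold of 50 errors is an assumption to tune per service, and in production the count would come from the kubectl logs command above:

```shell
# Error-spike guard for the post-change validation window.
# In production, feed it the output of:
#   kubectl logs -l app=myapp -n production --since=2m | grep -c ERROR
check_and_rollback() {
  local errors=$1 threshold=${2:-50}   # threshold is an assumed default
  if [ "$errors" -gt "$threshold" ]; then
    echo "rollback"   # here you would run: eval "$ROLLBACK_CMD" and restart
  else
    echo "hold"
  fi
}
check_and_rollback 120   # error spike: roll back
check_and_rollback 3     # within budget: hold
```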
✅ Action: Canary Rollout (1 pod → 10% → 50% → 100%)
Do:
# Step 1: Create a canary deployment (a copy of myapp named myapp-canary,
# labeled canary=true, replicas: 1) that mounts the new ConfigMap
kubectl apply -f myapp-configmap-new.yaml
kubectl apply -f myapp-canary-deployment.yaml
# Step 2: Watch the canary pod for 10 minutes
kubectl logs -l canary=true -n production --since=10m | grep -c ERROR
# Step 3: If healthy, expand to 10% of total capacity
kubectl scale deployment myapp-canary -n production --replicas=1  # out of 10 total pods
# Step 4: Monitor error rate in Prometheus for 15 minutes
# Step 5: Expand to 50%, then 100%: apply the same config under the name the
# main deployment mounts, then restart it
kubectl apply -f myapp-configmap-updated.yaml
kubectl rollout restart deployment/myapp -n production
# Step 6: Once the main rollout is healthy, remove the canary
kubectl delete deployment myapp-canary -n production
✅ Action: ConfigMap Update + Rolling Restart
Do:
# 1. Apply updated ConfigMap
kubectl apply -f myapp-configmap-updated.yaml
# 2. Trigger rolling restart to pick up new config
kubectl rollout restart deployment/myapp -n production
# 3. Monitor rollout
kubectl rollout status deployment/myapp -n production
# 4. Verify new config is loaded
for pod in $(kubectl get pods -l app=myapp -o name); do
echo "$pod: $(kubectl exec $pod -- env | grep CONFIG_KEY)"
done
✅ Action: Change Freeze Hold
Do:
# 1. Document the change you need to make
# 2. Request an exception if it's blocking critical work
# Template: "Requesting change freeze exception for [change]. Reason: [impact if delayed]. Risk: [low/medium — reversible in < 5 min]. Approver needed: [change manager name]."
# 3. Schedule for next maintenance window
# 4. Add to change calendar so it's not forgotten
gh issue create --repo org/platform \
--title "FROZEN: config change for myapp — schedule for post-freeze" \
--label "change-freeze,config" \
--body "Blocked by change freeze until [date]. Change: increase worker threads. Validation complete."
⚠️ Warning: Applying Change to a Degraded System
When: The system is already experiencing errors, high latency, or pod restarts when you want to make a config change.
Risk: It becomes impossible to determine whether the config change helped, had no effect, or made things worse. If the system crashes after your change, you will be blamed regardless of causation.
Mitigation: Stabilize the system to a known baseline state before making any config changes, unless the config change is the identified fix for the current degradation (use the emergency path).
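One way to make "known baseline state" concrete is to snapshot health signals before touching anything. A sketch: the kubectl sources are shown as comments because the right signals vary per service:

```shell
# Snapshot a pre-change baseline so post-change comparisons have a reference.
capture_baseline() {
  local dir=$1
  mkdir -p "$dir"
  # In production, populate from live signals, e.g.:
  #   kubectl get pods -n production -l app=myapp > "$dir/pods.txt"
  #   kubectl logs -l app=myapp -n production --since=10m | grep -c ERROR > "$dir/errors.txt"
  date -u +%Y-%m-%dT%H:%M:%SZ > "$dir/captured-at.txt"
}
capture_baseline /tmp/config-change-baseline
cat /tmp/config-change-baseline/captured-at.txt
```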
Edge Cases
- Config change via environment variables vs mounted files: Env-var config requires pod restart to take effect; mounted ConfigMap files update in place but only if the application watches for file changes. Know your app's config-loading behavior before choosing the path.
- Secret rotation vs config change: Rotating credentials (database passwords, API keys) looks like a config change but has additional steps: update the external system first, then update the Kubernetes Secret, then restart. The ordering matters.
- Multi-cluster config drift: If the same ConfigMap exists in multiple clusters and they have drifted, applying a "small" config change may look very different across clusters. Always diff the current state across all target clusters before applying.
- Config validation at deploy time: Some config values (regex patterns, JSON schemas, YAML) are only validated when the application loads them. A syntactically valid but semantically wrong config will pass kubectl apply and only fail at pod startup.
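For the env-var case in the first bullet, a common pattern (not prescribed by this runbook; all names are examples) is to stamp a hash of the ConfigMap onto the pod template, so any config edit forces the rolling restart that env-var config needs:

```shell
# Compute a short hash of the ConfigMap manifest; stamping it on the pod
# template turns any config edit into a forced rolling restart. The manifest
# below is a stand-in for: kubectl get configmap myapp-config -o yaml
cat > /tmp/myapp-config.yaml <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: myapp-config
data:
  WORKER_THREADS: "8"
EOF
CONFIG_HASH=$(sha256sum /tmp/myapp-config.yaml | cut -c1-16)
echo "config-hash: $CONFIG_HASH"
# Then annotate the pod template so the change rolls out:
#   kubectl patch deployment myapp -n production -p \
#     "{\"spec\":{\"template\":{\"metadata\":{\"annotations\":{\"config-hash\":\"$CONFIG_HASH\"}}}}}"
```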
Cross-References
- Topic Packs: Configuration Management, change-management
- Runbooks: standard-deploy.md, emergency-change.md
- Related trees: rollback-or-fix-forward.md, should-i-page.md