Portal | Level: L1: Foundations | Topics: Change Management, SRE Practices, Incident Response | Domain: DevOps & Tooling
Change Management - Primer¶
Why This Matters¶
Most outages aren't caused by hardware failure or unknown bugs. They're caused by changes — someone deployed something, changed a config, rotated a credential, or ran a migration. Change management is the discipline that turns "someone changed something" into "we made a planned, reviewed, reversible change with a known blast radius."
This isn't about bureaucracy. It's about the difference between "we pushed a bad config and rolled back in 2 minutes" and "we pushed a bad config on Friday at 5 PM and nobody knew until Monday morning."
If your team has ever said "what changed?" during an incident, your change management process has a gap.
Fun fact: According to Gartner and repeated industry studies, 60-80% of production outages are caused by changes, not by hardware failure or software bugs. The 2019 DORA (DevOps Research and Assessment) State of DevOps Report found that elite performers deploy more frequently but have lower change failure rates — because they invest in automated testing, canary deployments, and fast rollback, not because they skip change management. The lesson: the goal is not fewer changes, it is safer changes.
Change Categories¶
Not all changes carry the same risk or need the same process:
| Category | Definition | Approval | Examples |
|---|---|---|---|
| Standard | Pre-approved, low-risk, repeatable | Pre-authorized | Scaling replicas, cert rotation, dep bump |
| Normal | Planned, reviewed, scheduled | CAB or peer | Schema migration, new service deploy, DNS |
| Emergency | Unplanned, needed to restore service | Expedited | Hotfix for production outage |
┌─────────────────────────────────────────────────────────┐
│ Change Flow │
│ │
│ Standard Normal Emergency │
│ ┌──────┐ ┌──────┐ ┌──────┐ │
│ │Create│ │Create│ │Create│ │
│ │ticket│ │ticket│ │ticket│ │
│ └──┬───┘ └──┬───┘ └──┬───┘ │
│ │ │ │ │
│ │ ┌────▼────┐ ┌────▼────┐ │
│ │ │Peer │ │Expedited│ │
│ │ │review │ │approval │ │
│ │ └────┬────┘ └────┬────┘ │
│ │ │ │ │
│ │ ┌────▼────┐ │ │
│ │ │Schedule │ │ │
│ │ │window │ │ │
│ │ └────┬────┘ │ │
│ │ │ │ │
│ ┌──▼──────────────────────▼────────────────────▼──┐ │
│ │ Execute Change │ │
│ └──────────────────────┬──────────────────────────┘ │
│ │ │
│ ┌──────────────────────▼──────────────────────────┐ │
│ │ Validate & Close │ │
│ └─────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
Standard Changes¶
These should be your goal for most changes. A standard change is:
- Well-understood (you've done it many times before)
- Low-risk (blast radius is small, easily reversible)
- Repeatable (same procedure every time)
- Pre-approved (doesn't need per-instance review)
Automate standard changes. A CI/CD pipeline deploying a tested, reviewed code change is a standard change. Don't wrap it in bureaucracy.
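One way to keep standard changes fast while still gating everything else is a small check in the pipeline. This is a minimal sketch, not a real CI API: the change-type names, the pre-approved list, and the ticket convention are all illustrative.

```shell
# Hypothetical pipeline gate: pre-approved standard changes pass through,
# everything else needs a change ticket. Names are illustrative.

is_standard_change() {
  # Pre-approved standard change types (example list)
  case "$1" in
    scale-replicas|cert-rotation|dependency-bump) return 0 ;;
    *) return 1 ;;
  esac
}

require_approval() {
  local change_type="$1" ticket="$2"
  if is_standard_change "$change_type"; then
    echo "standard change: pre-approved, proceeding"
  elif [ -n "$ticket" ]; then
    echo "normal change: ticket $ticket present, proceeding"
  else
    echo "normal change without a ticket: blocked" >&2
    return 1
  fi
}
```

In a real pipeline the change type and ticket would come from commit metadata or pipeline variables; the point is that the standard path requires zero human interaction.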
Normal Changes¶
Anything that isn't standard or emergency. These get reviewed, scheduled, and communicated:
- Database schema changes
- Infrastructure provisioning (new clusters, network changes)
- Major version upgrades
- DNS changes
- Security policy changes
Emergency Changes¶
Used only to restore service during an active incident. The process is compressed but not eliminated:
- Someone still approves (incident commander or on-call lead)
- The change is still documented (retroactively if needed)
- Post-incident review evaluates whether the emergency change was correct
Risk Assessment¶
Before any normal change, assess:
| Factor | Low Risk | High Risk |
|---|---|---|
| Blast radius | Single service, single region | Cross-service, multi-region |
| Reversibility | One-command rollback, < 5 min | Requires data migration, hours |
| Dependency count | No downstream consumers | Many services depend on this |
| Testing confidence | Tested in staging, load tested | "It works on my machine" |
| Time sensitivity | Can execute anytime | Must happen during maintenance |
| Data impact | No data changes | Schema change, data migration |
| Previous incidents | This change type has never caused issues | Similar change caused an outage last quarter |
Score each factor. High risk on any dimension means extra review, smaller blast radius (canary), or a dedicated change window.
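The scoring can be mechanical. A minimal sketch, assuming one point per high-risk factor in the table above; the one-point scheme and the verdict strings are illustrative, not a standard:

```shell
# Sketch: score each of the seven factors as 0 (low risk) or 1 (high risk),
# in table order: blast radius, reversibility, dependencies, testing,
# timing, data impact, previous incidents.

score_change() {
  local total=0 f
  for f in "$@"; do
    total=$((total + f))
  done
  if [ "$total" -eq 0 ]; then
    echo "low risk: standard review"
  else
    echo "high risk on $total factor(s): canary + extra review"
  fi
}
```

A real implementation would pull these signals from service metadata (dependency graphs, deploy history) rather than asking humans to self-score.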
Rollback Criteria¶
Define rollback triggers before you start, not when things are on fire:
ROLLBACK if any of these occur within 15 minutes of change:
- Error rate exceeds 1% (baseline: 0.05%)
- p99 latency exceeds 500ms (baseline: 120ms)
- Sustained 5xx responses from the changed service
- Dependent service health checks fail
- Data integrity check fails
- Change author's gut says something is wrong (this is valid)
The rollback plan must answer:
1. How — exact commands or pipeline to execute
2. Who — who has authority to trigger it
3. When — time window after which rollback is no longer safe
4. Verification — how to confirm rollback succeeded
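Rollback criteria like the ones above are easiest to enforce when they're executable. A sketch, assuming the error rate and p99 latency have already been fetched from your metrics system (e.g. via a Prometheus query) and are passed in as plain values; the thresholds match the example criteria:

```shell
# Sketch: decide whether the rollback criteria are met.
# Arguments: error rate as a percentage, p99 latency in whole milliseconds.

should_rollback() {
  local error_rate_pct="$1" p99_ms="$2"
  # Error rate threshold: > 1% (awk handles the float comparison)
  if awk -v e="$error_rate_pct" 'BEGIN { exit !(e > 1.0) }'; then
    echo "ROLLBACK: error rate ${error_rate_pct}% exceeds 1%"
    return 0
  fi
  # Latency threshold: p99 > 500ms
  if [ "$p99_ms" -gt 500 ]; then
    echo "ROLLBACK: p99 ${p99_ms}ms exceeds 500ms"
    return 0
  fi
  echo "metrics within thresholds"
  return 1
}
```

Run on a timer for the first 15 minutes after the change; a zero exit status triggers the rollback pipeline. The gut-feel criterion, of course, stays manual.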
Change Windows¶
Not all hours are equal:
Risk Level by Time
┌───────────────────────────────────────────────┐
│ Mon Tue Wed Thu Fri │ Sat │ Sun │
│ ═══ ═══ ═══ ═══ ═══ │ ═══ │ ═══ │
│ ● ● ● ● ✖ │ ◐ │ ◐ │
│ │ │ │
│ ● = Good change window │ │ │
│ ◐ = Acceptable (reduced │ │ │
│ staff, slower resp) │ │ │
│ ✖ = Avoid (end of week, │ │ │
│ reduced monitoring) │ │ │
└───────────────────────────────────────────────┘
Best practice: Tuesday through Thursday, during business hours when the full team is available.
Anti-pattern: Friday afternoon, before holidays, or during on-call handoff.
Within a day:
06:00-09:00 ← Pre-traffic, good for infra changes
09:00-11:00 ← Traffic ramping, avoid
11:00-14:00 ← Peak traffic, avoid changes
14:00-17:00 ← Post-peak, acceptable window
17:00-22:00 ← Off-hours, good but reduced staff
22:00-06:00 ← Maintenance window, best for high-risk
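The weekday and intraday guidance above can be encoded as a coarse pipeline gate. A sketch, assuming ISO weekday numbering (1=Mon .. 7=Sun, matching `date +%u`) and 24-hour time; the three verdict strings are illustrative:

```shell
# Sketch: classify a proposed deploy time per the change-window guidance.
# weekday: 1=Mon .. 7=Sun; hour: 0-23.

change_window() {
  local weekday="$1" hour="$2"
  if [ "$weekday" -eq 5 ]; then
    echo "avoid: Friday"
  elif [ "$weekday" -ge 6 ]; then
    echo "acceptable: weekend, reduced staff"
  elif [ "$hour" -ge 9 ] && [ "$hour" -lt 14 ]; then
    echo "avoid: traffic ramp / peak"
  else
    echo "good"
  fi
}

# Usage with the real clock (GNU date):
#   change_window "$(date +%u)" "$(date +%H)"
```

A gate like this should warn rather than hard-block: emergency changes legitimately happen at the worst times.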
Change Freeze Discipline¶
Change freezes exist for high-risk periods: Black Friday, end-of-quarter, product launches.
Rules:
1. Define the window — exact start and end time, communicated 2+ weeks ahead
2. Define exceptions — what constitutes an emergency that overrides the freeze
3. Enforce it — block deployments at the pipeline level, not just with a Slack message
4. Prepare before — complete all planned changes 48+ hours before the freeze starts
5. Don't stack changes — don't batch up changes to deploy the moment the freeze ends
# Example: block deploys in CI/CD during freeze (dates are illustrative)
freeze_start="2026-11-25"
freeze_end="2026-12-02"
today="$(date +%Y-%m-%d)"
if [ "$today" \> "$freeze_start" ] && [ "$today" \< "$freeze_end" ]; then
  echo "CHANGE FREEZE ACTIVE. Deploy blocked. Contact incident commander for exceptions."
  exit 1
fi
Communication¶
Every change needs communication proportional to its risk:
| Change Type | Communication |
|---|---|
| Standard | Automated notification in deploy channel |
| Normal | Pre-change announcement, execute, post-change validation |
| Emergency | Real-time updates in incident channel |
| High-risk | Advance email to stakeholders, dedicated Slack thread |
Communication Template (Normal Change)¶
📋 CHANGE NOTIFICATION — [CHG-1234]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
What: Upgrading PostgreSQL from 14.9 to 14.11
When: 2026-03-17 14:00-15:00 UTC
Who: @alice (executor), @bob (reviewer)
Impact: 30-second read-only window during failover
Rollback: Promote old primary, ETA 2 minutes
Status: SCHEDULED
Pre-change checklist:
[x] Tested in staging
[x] Backup verified
[x] Rollback procedure tested
[ ] Change executed
[ ] Post-change validation
[ ] Change closed
Analogy: Change management is like a pre-flight checklist for pilots. The checklist does not make flying slower — it makes flying safer. A 747 pilot runs through the same checklist before every flight, even after 10,000 hours of experience. Standard changes are like routine pre-flight checks (pre-approved, same every time). Normal changes are like filing a flight plan (reviewed, scheduled). Emergency changes are like an in-flight diversion (necessary, documented after the fact).
War story: A team deployed a database migration on a Friday afternoon without a rollback plan. The migration corrupted an index, causing queries to return wrong results. By the time anyone noticed on Monday morning, three days of incorrect data had been served to customers. The fix took two weeks of data reconciliation. The lesson: never deploy schema changes without a tested rollback procedure, and never deploy on Fridays unless you are prepared to work the weekend.
Post-Change Validation¶
The change isn't done when the command finishes. It's done when you've confirmed it worked:
# Application-level check (api.example.com is a placeholder for your endpoint)
status="$(curl -s https://api.example.com/health | jq -r .status)"
[ "$status" = "healthy" ] || { echo "health check failed: $status"; exit 1; }

# Error rate check (Prometheus):
#   rate(http_requests_total{status=~"5.."}[5m])   -- should be near 0

# Latency check (Prometheus):
#   histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
#   -- should be within the pre-change baseline

# Dependency checks: verify downstream services are still healthy

# Data integrity check (if applicable):
#   SELECT count(*) FROM critical_table;   -- compare to the pre-change count
Wait at least 15 minutes (one full monitoring cycle) before declaring the change successful. Some failures are delayed — cache expiration, connection pool exhaustion, slow memory leak.
The CAB (Change Advisory Board)¶
For organizations that use formal CAB review:
- CAB reviews normal and high-risk changes
- Meets on a fixed schedule (e.g., weekly)
- Members: ops leads, dev leads, security, service owners
- Output: approve, reject, or request more information
Make CAB efficient:
- Require a complete change ticket before the meeting
- Pre-screen changes — don't waste CAB time on standard changes
- Time-box each review (5 minutes for normal, 15 for high-risk)
- Track approval-to-execution time — if it's weeks, your process is too slow
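Tracking approval-to-execution time only works if someone computes it. A tiny sketch, assuming you can export approval and execution timestamps (epoch seconds) from your ticketing system; the 7-day threshold is illustrative:

```shell
# Sketch: flag changes whose approval-to-execution lag suggests the
# process is too slow. Timestamps are epoch seconds from the ticket log.

approval_lag_days() {
  local approved="$1" executed="$2"
  echo $(( (executed - approved) / 86400 ))
}

lag_verdict() {
  local days="$1"
  if [ "$days" -gt 7 ]; then
    echo "too slow: ${days}d from approval to execution"
  else
    echo "ok: ${days}d"
  fi
}
```

Review the distribution, not just the average: a few week-long outliers usually point at changes that were approved and then silently deprioritized.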
Key Takeaways¶
- Most outages are caused by changes, not failures. Respect the change.
- Classify changes: standard (automate), normal (review + schedule), emergency (expedite but document).
- Define rollback criteria before you start, not during the fire.
- Tuesday-Thursday, business hours, full team available = best change window.
- Change freezes need enforcement (pipeline blocks), not just announcements.
- Post-change validation is part of the change. Wait 15 minutes minimum.
- Communication is proportional to risk. Over-communicate high-risk changes.
Wiki Navigation¶
Related Content¶
- Capacity Planning (Topic Pack, L2) — SRE Practices
- Change Management Flashcards (CLI) (flashcard_deck, L1) — Change Management
- Chaos Engineering Scripts (CLI) (Exercise Set, L2) — Incident Response
- Debugging Methodology (Topic Pack, L1) — Incident Response
- Incident Command & On-Call (Topic Pack, L2) — Incident Response
- Incident Response Flashcards (CLI) (flashcard_deck, L1) — Incident Response
- Incident Simulator (18 scenarios) (CLI) (Exercise Set, L2) — Incident Response
- Investigation Engine (CLI) (Exercise Set, L2) — Incident Response
- Ops War Stories & Pattern Recognition (Topic Pack, L2) — Incident Response
- Postmortems & SLOs (Topic Pack, L2) — Incident Response