
Portal | Level: L1: Foundations | Topics: Change Management, SRE Practices, Incident Response | Domain: DevOps & Tooling

Change Management - Primer

Why This Matters

Most outages aren't caused by hardware failure or unknown bugs. They're caused by changes — someone deployed something, changed a config, rotated a credential, or ran a migration. Change management is the discipline that turns "someone changed something" into "we made a planned, reviewed, reversible change with a known blast radius."

This isn't about bureaucracy. It's about the difference between "we pushed a bad config and rolled back in 2 minutes" and "we pushed a bad config on Friday at 5 PM and nobody knew until Monday morning."

If your team has ever said "what changed?" during an incident, your change management process has a gap.

Fun fact: According to Gartner and repeated industry studies, 60-80% of production outages are caused by changes, not by hardware failure or software bugs. The 2019 DORA (DevOps Research and Assessment) State of DevOps Report found that elite performers deploy more frequently but have lower change failure rates — because they invest in automated testing, canary deployments, and fast rollback, not because they skip change management. The lesson: the goal is not fewer changes, it is safer changes.


Change Categories

Not all changes carry the same risk or need the same process:

Category    Definition                             Approval         Examples
---------   ------------------------------------   --------------   -------------------------------------------------
Standard    Pre-approved, low-risk, repeatable     Pre-authorized   Scaling replicas, cert rotation, dependency bump
Normal      Planned, reviewed, scheduled           CAB or peer      Schema migration, new service deploy, DNS
Emergency   Unplanned, needed to restore service   Expedited        Hotfix for production outage
 ┌─────────────────────────────────────────────────────────┐
 │                    Change Flow                           │
 │                                                          │
 │   Standard                Normal              Emergency  │
 │   ┌──────┐               ┌──────┐            ┌──────┐  │
 │   │Create│               │Create│            │Create│  │
 │   │ticket│               │ticket│            │ticket│  │
 │   └──┬───┘               └──┬───┘            └──┬───┘  │
 │      │                      │                    │      │
 │      │                 ┌────▼────┐          ┌────▼────┐ │
 │      │                 │Peer     │          │Expedited│ │
 │      │                 │review   │          │approval │ │
 │      │                 └────┬────┘          └────┬────┘ │
 │      │                      │                    │      │
 │      │                 ┌────▼────┐               │      │
 │      │                 │Schedule │               │      │
 │      │                 │window   │               │      │
 │      │                 └────┬────┘               │      │
 │      │                      │                    │      │
 │   ┌──▼──────────────────────▼────────────────────▼──┐   │
 │   │              Execute Change                      │   │
 │   └──────────────────────┬──────────────────────────┘   │
 │                          │                               │
 │   ┌──────────────────────▼──────────────────────────┐   │
 │   │              Validate & Close                    │   │
 │   └─────────────────────────────────────────────────┘   │
 └─────────────────────────────────────────────────────────┘

Standard Changes

These should be your goal for most changes. A standard change is:

- Well-understood (done many times before)
- Low-risk (small blast radius, easily reversible)
- Repeatable (same procedure every time)
- Pre-approved (doesn't need per-instance review)

Automate standard changes. A CI/CD pipeline deploying a tested, reviewed code change is a standard change. Don't wrap it in bureaucracy.
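As a sketch of how that pre-approval can live in the pipeline itself (the function, category names, and ticket format here are illustrative, not from any specific CI system), a deploy gate can wave standard changes through automatically while requiring a ticket for everything else:

```shell
#!/bin/sh
# Hypothetical deploy gate: standard changes are pre-approved and ship
# automatically; normal and emergency changes must carry a ticket ID.
require_ticket() {
  category="$1"
  ticket="$2"
  case "$category" in
    standard)
      echo "pre-approved: proceeding" ;;
    normal|emergency)
      if [ -z "$ticket" ]; then
        echo "ERROR: $category change needs a change ticket" >&2
        return 1
      fi
      echo "proceeding under $ticket" ;;
    *)
      echo "ERROR: unknown category '$category'" >&2
      return 1 ;;
  esac
}
```

The point of encoding it this way is that "pre-approved" becomes machine-checkable: nobody has to decide per-deploy whether review is needed.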

Normal Changes

Anything that isn't standard or emergency. These get reviewed, scheduled, and communicated:

- Database schema changes
- Infrastructure provisioning (new clusters, network changes)
- Major version upgrades
- DNS changes
- Security policy changes

Emergency Changes

Used only to restore service during an active incident. The process is compressed but not eliminated:

- Someone still approves (incident commander or on-call lead)
- The change is still documented (retroactively if needed)
- The post-incident review evaluates whether the emergency change was correct


Risk Assessment

Before any normal change, assess:

Factor               Low Risk                               High Risk
------------------   ------------------------------------   --------------------------------------------
Blast radius         Single service, single region          Cross-service, multi-region
Reversibility        One-command rollback, < 5 min          Requires data migration, hours
Dependency count     No downstream consumers                Many services depend on this
Testing confidence   Tested in staging, load tested         "It works on my machine"
Time sensitivity     Can execute anytime                    Must happen during a maintenance window
Data impact          No data changes                        Schema change, data migration
Previous incidents   This change type never caused issues   Similar change caused an outage last quarter

Score each factor. High risk on any dimension means extra review, smaller blast radius (canary), or a dedicated change window.
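A minimal sketch of that scoring, assuming a convention (invented here) where each factor is tagged `low:` or `high:` by the reviewer; any high-risk factor escalates the review path:

```shell
#!/bin/sh
# Hypothetical risk scorer: arguments are "low:<factor>" or
# "high:<factor>". Zero high-risk factors -> peer review; otherwise
# escalate to extra review plus a canary or change window.
score_change() {
  score=0
  for factor in "$@"; do
    case "$factor" in
      high:*) score=$((score + 1)) ;;
    esac
  done
  if [ "$score" -eq 0 ]; then
    echo "score=$score route=peer-review"
  else
    echo "score=$score route=extra-review+canary"
  fi
}
```

Example: `score_change high:blast-radius low:reversibility high:data-impact` escalates, because a single high-risk dimension is enough to change the route.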


Rollback Criteria

Define rollback triggers before you start, not when things are on fire:

ROLLBACK if any of these occur within 15 minutes of change:
- Error rate exceeds 1% (baseline: 0.05%)
- p99 latency exceeds 500ms (baseline: 120ms)
- Any 5xx responses from the changed service
- Dependent service health checks fail
- Data integrity check fails
- Change author's gut says something is wrong (this is valid)
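The measurable triggers above can be sketched as a small gate. In practice the values would come from your monitoring system (for example a Prometheus query); here they are passed in directly for illustration, and only the first two criteria are shown:

```shell
#!/bin/sh
# Hypothetical rollback gate: thresholds mirror the criteria above.
should_rollback() {
  error_rate_pct="$1"   # current error rate, percent
  p99_ms="$2"           # current p99 latency, milliseconds
  # awk handles the float comparison; sh arithmetic is integer-only
  awk -v e="$error_rate_pct" -v p="$p99_ms" 'BEGIN {
    if (e > 1.0) { print "ROLLBACK: error rate " e "% exceeds 1%"; exit }
    if (p > 500) { print "ROLLBACK: p99 " p "ms exceeds 500ms"; exit }
    print "OK: within thresholds"
  }'
}
```

Note the gut-feeling trigger deliberately resists automation; the gate handles the objective criteria so the human can focus on the subjective one.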

The rollback plan must answer:

1. How — exact commands or pipeline to execute
2. Who — who has authority to trigger it
3. When — time window after which rollback is no longer safe
4. Verification — how to confirm rollback succeeded
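A wrapper script can encode those answers so they are enforced rather than remembered. This is a sketch under invented placeholders (the authorized list, the deadline, and the echoed command are all illustrative):

```shell
#!/bin/sh
# Hypothetical rollback wrapper: who (authorized list), when (deadline),
# how (the command at the end). Verification stays with the health
# checks that follow the rollback.
AUTHORIZED="alice bob"
ROLLBACK_DEADLINE=1760000000   # unix time after which rollback is unsafe

rollback() {
  user="$1"
  now="$2"   # current unix time, injected as an argument for testability
  case " $AUTHORIZED " in
    *" $user "*) ;;                              # who: must be on the list
    *) echo "DENIED: $user not authorized"; return 1 ;;
  esac
  if [ "$now" -gt "$ROLLBACK_DEADLINE" ]; then   # when: inside the safe window
    echo "DENIED: past safe rollback window"
    return 1
  fi
  echo "rolling back"                            # how: run the real command here
}
```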


Change Windows

Not all hours are equal:

 Risk Level by Time
 ┌───────────────────────────────────────────────┐
 │  Mon  Tue  Wed  Thu  Fri  │  Sat  │  Sun     │
 │  ═══  ═══  ═══  ═══  ═══  │  ═══  │  ═══     │
 │   ●    ●    ●    ●    ✖   │   ◐   │   ◐      │
 │                            │       │          │
 │  ● = Good change window   │       │          │
 │  ◐ = Acceptable (reduced  │       │          │
 │      staff, slower resp)  │       │          │
 │  ✖ = Avoid (end of week,  │       │          │
 │      reduced monitoring)  │       │          │
 └───────────────────────────────────────────────┘

Best practice: Tuesday through Thursday, during business hours when the full team is available.

Anti-pattern: Friday afternoon, before holidays, or during on-call handoff.

Within a day:

 06:00-09:00  ← Pre-traffic, good for infra changes
 09:00-11:00  ← Traffic ramping, avoid
 11:00-14:00  ← Peak traffic, avoid changes
 14:00-17:00  ← Post-peak, acceptable window
 17:00-22:00  ← Off-hours, good but reduced staff
 22:00-06:00  ← Maintenance window, best for high-risk
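The hourly table can be encoded as a simple lookup, so tooling can warn before an ill-timed deploy. This sketch mirrors the table above (merging the two "avoid" bands into one) and assumes hours are in the service's local time; tune the thresholds to your own traffic curve:

```shell
#!/bin/sh
# Map an hour of day (0-23) to the change windows in the table above.
window_for_hour() {
  h="$1"
  if [ "$h" -ge 22 ] || [ "$h" -lt 6 ]; then
    echo "maintenance window: best for high-risk"
  elif [ "$h" -lt 9 ]; then
    echo "pre-traffic: good for infra changes"
  elif [ "$h" -lt 14 ]; then
    echo "peak traffic: avoid changes"
  elif [ "$h" -lt 17 ]; then
    echo "post-peak: acceptable window"
  else
    echo "off-hours: good but reduced staff"
  fi
}
```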


Change Freeze Discipline

Change freezes exist for high-risk periods: Black Friday, end-of-quarter, product launches.

Rules:

1. Define the window — exact start and end time, communicated 2+ weeks ahead
2. Define exceptions — what constitutes an emergency that overrides the freeze
3. Enforce it — block deployments at the pipeline level, not just with a Slack message
4. Prepare before — complete all planned changes 48+ hours before the freeze starts
5. Don't stack changes — don't batch up changes to deploy the moment the freeze ends

# Example: block deploys in CI/CD during a freeze window (dates illustrative)
today="$(date +%Y-%m-%d)"
if [ "$today" \> "2026-11-25" ] && [ "$today" \< "2026-12-02" ]; then
  echo "CHANGE FREEZE ACTIVE. Deploy blocked. Contact incident commander for exceptions."
  exit 1
fi

Communication

Every change needs communication proportional to its risk:

Change Type   Communication
-----------   --------------------------------------------------------
Standard      Automated notification in the deploy channel
Normal        Pre-change announcement, execute, post-change validation
Emergency     Real-time updates in the incident channel
High-risk     Advance email to stakeholders, dedicated Slack thread

Communication Template (Normal Change)

📋 CHANGE NOTIFICATION — [CHG-1234]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

What: Upgrading PostgreSQL from 14.9 to 14.11
When: 2026-03-17 14:00-15:00 UTC
Who: @alice (executor), @bob (reviewer)
Impact: 30-second read-only window during failover
Rollback: Promote old primary, ETA 2 minutes
Status: SCHEDULED

Pre-change checklist:
[x] Tested in staging
[x] Backup verified
[x] Rollback procedure tested
[ ] Change executed
[ ] Post-change validation
[ ] Change closed

Analogy: Change management is like a pre-flight checklist for pilots. The checklist does not make flying slower — it makes flying safer. A 747 pilot runs through the same checklist before every flight, even after 10,000 hours of experience. Standard changes are like routine pre-flight checks (pre-approved, same every time). Normal changes are like filing a flight plan (reviewed, scheduled). Emergency changes are like an in-flight diversion (necessary, documented after the fact).

War story: A team deployed a database migration on a Friday afternoon without a rollback plan. The migration corrupted an index, causing queries to return wrong results. By the time anyone noticed on Monday morning, three days of incorrect data had been served to customers. The fix took two weeks of data reconciliation. The lesson: never deploy schema changes without a tested rollback procedure, and never deploy on Fridays unless you are prepared to work the weekend.

Post-Change Validation

The change isn't done when the command finishes. It's done when you've confirmed it worked:

# Application-level checks
curl -s https://api.example.com/health | jq .status
# Should return "healthy"

# Error rate check (Prometheus)
# rate(http_requests_total{status=~"5.."}[5m]) should be near 0

# Latency check
# histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
# Should be within normal range

# Dependency checks
# Verify downstream services are still healthy

# Data integrity check (if applicable)
# SELECT count(*) FROM critical_table;
# Compare to pre-change count

Wait at least 15 minutes (one full monitoring cycle) before declaring the change successful. Some failures are delayed — cache expiration, connection pool exhaustion, slow memory leak.
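One way to enforce that wait is a soak loop that keeps re-running a health check for the full period instead of checking once. This is a sketch; the check command, count, and interval are placeholders to adapt:

```shell
#!/bin/sh
# Soak loop: run a check N times at a fixed interval; any failure aborts.
soak() {
  checks="$1"; interval="$2"; shift 2
  i=0
  while [ "$i" -lt "$checks" ]; do
    if ! "$@"; then
      echo "FAILED after $i passing checks - consider rollback"
      return 1
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  echo "PASSED: $checks consecutive checks"
}

# 15-minute soak: 15 checks, 60 seconds apart
# soak 15 60 curl -fsS https://api.example.com/health
```

Running the check repeatedly is what catches the delayed failure modes listed above, which a single post-deploy check would miss.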


The CAB (Change Advisory Board)

For organizations that use formal CAB review:

  • CAB reviews normal and high-risk changes
  • Meets on a fixed schedule (e.g., weekly)
  • Members: ops leads, dev leads, security, service owners
  • Output: approve, reject, or request more information

Make CAB efficient:

- Require a complete change ticket before the meeting
- Pre-screen changes — don't waste CAB time on standard changes
- Time-box each review (5 minutes for normal, 15 for high-risk)
- Track approval-to-execution time — if it's weeks, your process is too slow


Key Takeaways

  1. Most outages are caused by changes, not failures. Respect the change.
  2. Classify changes: standard (automate), normal (review + schedule), emergency (expedite but document).
  3. Define rollback criteria before you start, not during the fire.
  4. Tuesday-Thursday, business hours, full team available = best change window.
  5. Change freezes need enforcement (pipeline blocks), not just announcements.
  6. Post-change validation is part of the change. Wait 15 minutes minimum.
  7. Communication is proportional to risk. Over-communicate high-risk changes.

Wiki Navigation

  • Capacity Planning (Topic Pack, L2) — SRE Practices
  • Change Management Flashcards (CLI) (flashcard_deck, L1) — Change Management
  • Chaos Engineering Scripts (CLI) (Exercise Set, L2) — Incident Response
  • Debugging Methodology (Topic Pack, L1) — Incident Response
  • Incident Command & On-Call (Topic Pack, L2) — Incident Response
  • Incident Response Flashcards (CLI) (flashcard_deck, L1) — Incident Response
  • Incident Simulator (18 scenarios) (CLI) (Exercise Set, L2) — Incident Response
  • Investigation Engine (CLI) (Exercise Set, L2) — Incident Response
  • Ops War Stories & Pattern Recognition (Topic Pack, L2) — Incident Response
  • Postmortems & SLOs (Topic Pack, L2) — Incident Response