
Portal | Level: L1: Foundations | Topics: Change Management, SRE Practices, Incident Response | Domain: DevOps & Tooling

Change Management - Primer

Why This Matters

Most outages aren't caused by hardware failure or unknown bugs. They're caused by changes — someone deployed something, changed a config, rotated a credential, or ran a migration. Change management is the discipline that turns "someone changed something" into "we made a planned, reviewed, reversible change with a known blast radius."

This isn't about bureaucracy. It's about the difference between "we pushed a bad config and rolled back in 2 minutes" and "we pushed a bad config on Friday at 5 PM and nobody knew until Monday morning."

If your team has ever said "what changed?" during an incident, your change management process has a gap.

Fun fact: According to Gartner and repeated industry studies, 60-80% of production outages are caused by changes, not by hardware failure or software bugs. The 2019 DORA (DevOps Research and Assessment) State of DevOps Report found that elite performers deploy more frequently but have lower change failure rates — because they invest in automated testing, canary deployments, and fast rollback, not because they skip change management. The lesson: the goal is not fewer changes, it is safer changes.


Change Categories

Not all changes carry the same risk or need the same process:

Category    Definition                             Approval         Examples
---------   ------------------------------------   --------------   -------------------------------------------------
Standard    Pre-approved, low-risk, repeatable     Pre-authorized   Scaling replicas, cert rotation, dependency bump
Normal      Planned, reviewed, scheduled           CAB or peer      Schema migration, new service deploy, DNS
Emergency   Unplanned, needed to restore service   Expedited        Hotfix for production outage
 ┌─────────────────────────────────────────────────────────┐
 │                    Change Flow                           │
 │                                                          │
 │   Standard                Normal              Emergency  │
 │   ┌──────┐               ┌──────┐            ┌──────┐  │
 │   │Create│               │Create│            │Create│  │
 │   │ticket│               │ticket│            │ticket│  │
 │   └──┬───┘               └──┬───┘            └──┬───┘  │
 │      │                      │                    │      │
 │      │                 ┌────▼────┐          ┌────▼────┐ │
 │      │                 │Peer     │          │Expedited│ │
 │      │                 │review   │          │approval │ │
 │      │                 └────┬────┘          └────┬────┘ │
 │      │                      │                    │      │
 │      │                 ┌────▼────┐               │      │
 │      │                 │Schedule │               │      │
 │      │                 │window   │               │      │
 │      │                 └────┬────┘               │      │
 │      │                      │                    │      │
 │   ┌──▼──────────────────────▼────────────────────▼──┐   │
 │   │              Execute Change                      │   │
 │   └──────────────────────┬──────────────────────────┘   │
 │                          │                               │
 │   ┌──────────────────────▼──────────────────────────┐   │
 │   │              Validate & Close                    │   │
 │   └─────────────────────────────────────────────────┘   │
 └─────────────────────────────────────────────────────────┘

Standard Changes

These should be your goal for most changes. A standard change is:

- Well-understood (done many times before)
- Low-risk (small blast radius, easily reversible)
- Repeatable (same procedure every time)
- Pre-approved (doesn't need per-instance review)

Automate standard changes. A CI/CD pipeline deploying a tested, reviewed code change is a standard change. Don't wrap it in bureaucracy.
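As a sketch of how that pre-approval can live in the pipeline itself (the function, category names, and ticket format here are illustrative, not from any specific CI system), a deploy gate can wave standard changes through automatically while requiring a ticket for everything else:

```shell
#!/bin/sh
# Hypothetical deploy gate: standard changes are pre-approved and ship
# automatically; normal and emergency changes must carry a ticket ID.
require_ticket() {
  category="$1"
  ticket="$2"
  case "$category" in
    standard)
      echo "pre-approved: proceeding" ;;
    normal|emergency)
      if [ -z "$ticket" ]; then
        echo "ERROR: $category change needs a change ticket" >&2
        return 1
      fi
      echo "proceeding under $ticket" ;;
    *)
      echo "ERROR: unknown category '$category'" >&2
      return 1 ;;
  esac
}
```

The point of encoding it this way is that "pre-approved" becomes machine-checkable: nobody has to decide per-deploy whether review is needed.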

Normal Changes

Anything that isn't standard or emergency. These get reviewed, scheduled, and communicated:

- Database schema changes
- Infrastructure provisioning (new clusters, network changes)
- Major version upgrades
- DNS changes
- Security policy changes

Emergency Changes

Used only to restore service during an active incident. The process is compressed but not eliminated:

- Someone still approves (incident commander or on-call lead)
- The change is still documented (retroactively if needed)
- The post-incident review evaluates whether the emergency change was correct


Risk Assessment

Before any normal change, assess:

Factor               Low Risk                               High Risk
------------------   ------------------------------------   --------------------------------------------
Blast radius         Single service, single region          Cross-service, multi-region
Reversibility        One-command rollback, < 5 min          Requires data migration, hours
Dependency count     No downstream consumers                Many services depend on this
Testing confidence   Tested in staging, load tested         "It works on my machine"
Time sensitivity     Can execute anytime                    Must happen during a maintenance window
Data impact          No data changes                        Schema change, data migration
Previous incidents   This change type never caused issues   Similar change caused an outage last quarter

Score each factor. High risk on any dimension means extra review, smaller blast radius (canary), or a dedicated change window.
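A minimal sketch of that scoring, assuming a convention (invented here) where each factor is tagged `low:` or `high:` by the reviewer; any high-risk factor escalates the review path:

```shell
#!/bin/sh
# Hypothetical risk scorer: arguments are "low:<factor>" or
# "high:<factor>". Zero high-risk factors -> peer review; otherwise
# escalate to extra review plus a canary or change window.
score_change() {
  score=0
  for factor in "$@"; do
    case "$factor" in
      high:*) score=$((score + 1)) ;;
    esac
  done
  if [ "$score" -eq 0 ]; then
    echo "score=$score route=peer-review"
  else
    echo "score=$score route=extra-review+canary"
  fi
}
```

Example: `score_change high:blast-radius low:reversibility high:data-impact` escalates, because a single high-risk dimension is enough to change the route.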


Rollback Criteria

Define rollback triggers before you start, not when things are on fire:

ROLLBACK if any of these occur within 15 minutes of change:
- Error rate exceeds 1% (baseline: 0.05%)
- p99 latency exceeds 500ms (baseline: 120ms)
- Any 5xx responses from the changed service
- Dependent service health checks fail
- Data integrity check fails
- Change author's gut says something is wrong (this is valid)
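The measurable triggers above can be sketched as a small gate. In practice the values would come from your monitoring system (for example a Prometheus query); here they are passed in directly for illustration, and only the first two criteria are shown:

```shell
#!/bin/sh
# Hypothetical rollback gate: thresholds mirror the criteria above.
should_rollback() {
  error_rate_pct="$1"   # current error rate, percent
  p99_ms="$2"           # current p99 latency, milliseconds
  # awk handles the float comparison; sh arithmetic is integer-only
  awk -v e="$error_rate_pct" -v p="$p99_ms" 'BEGIN {
    if (e > 1.0) { print "ROLLBACK: error rate " e "% exceeds 1%"; exit }
    if (p > 500) { print "ROLLBACK: p99 " p "ms exceeds 500ms"; exit }
    print "OK: within thresholds"
  }'
}
```

Note the gut-feeling trigger deliberately resists automation; the gate handles the objective criteria so the human can focus on the subjective one.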

The rollback plan must answer:

1. How — exact commands or pipeline to execute
2. Who — who has authority to trigger it
3. When — time window after which rollback is no longer safe
4. Verification — how to confirm rollback succeeded
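A wrapper script can encode those answers so they are enforced rather than remembered. This is a sketch under invented placeholders (the authorized list, the deadline, and the echoed command are all illustrative):

```shell
#!/bin/sh
# Hypothetical rollback wrapper: who (authorized list), when (deadline),
# how (the command at the end). Verification stays with the health
# checks that follow the rollback.
AUTHORIZED="alice bob"
ROLLBACK_DEADLINE=1760000000   # unix time after which rollback is unsafe

rollback() {
  user="$1"
  now="$2"   # current unix time, injected as an argument for testability
  case " $AUTHORIZED " in
    *" $user "*) ;;                              # who: must be on the list
    *) echo "DENIED: $user not authorized"; return 1 ;;
  esac
  if [ "$now" -gt "$ROLLBACK_DEADLINE" ]; then   # when: inside the safe window
    echo "DENIED: past safe rollback window"
    return 1
  fi
  echo "rolling back"                            # how: run the real command here
}
```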


Change Windows

Not all hours are equal:

 Risk Level by Time
 ┌───────────────────────────────────────────────┐
 │  Mon  Tue  Wed  Thu  Fri  │  Sat  │  Sun     │
 │  ═══  ═══  ═══  ═══  ═══  │  ═══  │  ═══     │
 │   ●    ●    ●    ●    ✖   │   ◐   │   ◐      │
 │                            │       │          │
 │  ● = Good change window   │       │          │
 │  ◐ = Acceptable (reduced  │       │          │
 │      staff, slower resp)  │       │          │
 │  ✖ = Avoid (end of week,  │       │          │
 │      reduced monitoring)  │       │          │
 └───────────────────────────────────────────────┘

Best practice: Tuesday through Thursday, during business hours when the full team is available.

Anti-pattern: Friday afternoon, before holidays, or during on-call handoff.

Within a day:

 06:00-09:00  ← Pre-traffic, good for infra changes
 09:00-11:00  ← Traffic ramping, avoid
 11:00-14:00  ← Peak traffic, avoid changes
 14:00-17:00  ← Post-peak, acceptable window
 17:00-22:00  ← Off-hours, good but reduced staff
 22:00-06:00  ← Maintenance window, best for high-risk
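The hourly table can be encoded as a simple lookup, so tooling can warn before an ill-timed deploy. This sketch mirrors the table above (merging the two "avoid" bands into one) and assumes hours are in the service's local time; tune the thresholds to your own traffic curve:

```shell
#!/bin/sh
# Map an hour of day (0-23) to the change windows in the table above.
window_for_hour() {
  h="$1"
  if [ "$h" -ge 22 ] || [ "$h" -lt 6 ]; then
    echo "maintenance window: best for high-risk"
  elif [ "$h" -lt 9 ]; then
    echo "pre-traffic: good for infra changes"
  elif [ "$h" -lt 14 ]; then
    echo "peak traffic: avoid changes"
  elif [ "$h" -lt 17 ]; then
    echo "post-peak: acceptable window"
  else
    echo "off-hours: good but reduced staff"
  fi
}
```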


Change Freeze Discipline

Change freezes exist for high-risk periods: Black Friday, end-of-quarter, product launches.

Rules:

1. Define the window — exact start and end time, communicated 2+ weeks ahead
2. Define exceptions — what constitutes an emergency that overrides the freeze
3. Enforce it — block deployments at the pipeline level, not just with a Slack message
4. Prepare before — complete all planned changes 48+ hours before the freeze starts
5. Don't stack changes — don't batch up changes to deploy the moment the freeze ends

# Example: block deploys in CI/CD during a freeze window (dates illustrative)
today="$(date +%Y-%m-%d)"
if [ "$today" \> "2026-11-25" ] && [ "$today" \< "2026-12-02" ]; then
  echo "CHANGE FREEZE ACTIVE. Deploy blocked. Contact incident commander for exceptions."
  exit 1
fi

Communication

Every change needs communication proportional to its risk:

Change Type   Communication
-----------   --------------------------------------------------------
Standard      Automated notification in the deploy channel
Normal        Pre-change announcement, execute, post-change validation
Emergency     Real-time updates in the incident channel
High-risk     Advance email to stakeholders, dedicated Slack thread

Communication Template (Normal Change)

📋 CHANGE NOTIFICATION — [CHG-1234]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

What: Upgrading PostgreSQL from 14.9 to 14.11
When: 2026-03-17 14:00-15:00 UTC
Who: @alice (executor), @bob (reviewer)
Impact: 30-second read-only window during failover
Rollback: Promote old primary, ETA 2 minutes
Status: SCHEDULED

Pre-change checklist:
[x] Tested in staging
[x] Backup verified
[x] Rollback procedure tested
[ ] Change executed
[ ] Post-change validation
[ ] Change closed

Analogy: Change management is like a pre-flight checklist for pilots. The checklist does not make flying slower — it makes flying safer. A 747 pilot runs through the same checklist before every flight, even after 10,000 hours of experience. Standard changes are like routine pre-flight checks (pre-approved, same every time). Normal changes are like filing a flight plan (reviewed, scheduled). Emergency changes are like an in-flight diversion (necessary, documented after the fact).

War story: A team deployed a database migration on a Friday afternoon without a rollback plan. The migration corrupted an index, causing queries to return wrong results. By the time anyone noticed on Monday morning, three days of incorrect data had been served to customers. The fix took two weeks of data reconciliation. The lesson: never deploy schema changes without a tested rollback procedure, and never deploy on Fridays unless you are prepared to work the weekend.

Post-Change Validation

The change isn't done when the command finishes. It's done when you've confirmed it worked:

# Application-level checks
curl -s https://api.example.com/health | jq .status
# Should return "healthy"

# Error rate check (Prometheus)
# rate(http_requests_total{status=~"5.."}[5m]) should be near 0

# Latency check
# histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
# Should be within normal range

# Dependency checks
# Verify downstream services are still healthy

# Data integrity check (if applicable)
# SELECT count(*) FROM critical_table;
# Compare to pre-change count

Wait at least 15 minutes (one full monitoring cycle) before declaring the change successful. Some failures are delayed — cache expiration, connection pool exhaustion, slow memory leak.
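One way to enforce that wait is a soak loop that keeps re-running a health check for the full period instead of checking once. This is a sketch; the check command, count, and interval are placeholders to adapt:

```shell
#!/bin/sh
# Soak loop: run a check N times at a fixed interval; any failure aborts.
soak() {
  checks="$1"; interval="$2"; shift 2
  i=0
  while [ "$i" -lt "$checks" ]; do
    if ! "$@"; then
      echo "FAILED after $i passing checks - consider rollback"
      return 1
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  echo "PASSED: $checks consecutive checks"
}

# 15-minute soak: 15 checks, 60 seconds apart
# soak 15 60 curl -fsS https://api.example.com/health
```

Running the check repeatedly is what catches the delayed failure modes listed above, which a single post-deploy check would miss.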


The CAB (Change Advisory Board)

For organizations that use formal CAB review:

  • CAB reviews normal and high-risk changes
  • Meets on a fixed schedule (e.g., weekly)
  • Members: ops leads, dev leads, security, service owners
  • Output: approve, reject, or request more information

Make CAB efficient:

- Require a complete change ticket before the meeting
- Pre-screen changes — don't waste CAB time on standard changes
- Time-box each review (5 minutes for normal, 15 for high-risk)
- Track approval-to-execution time — if it's weeks, your process is too slow


Key Takeaways

  1. Most outages are caused by changes, not failures. Respect the change.
  2. Classify changes: standard (automate), normal (review + schedule), emergency (expedite but document).
  3. Define rollback criteria before you start, not during the fire.
  4. Tuesday-Thursday, business hours, full team available = best change window.
  5. Change freezes need enforcement (pipeline blocks), not just announcements.
  6. Post-change validation is part of the change. Wait 15 minutes minimum.
  7. Communication is proportional to risk. Over-communicate high-risk changes.

Wiki Navigation

  • Capacity Planning (Topic Pack, L2) — SRE Practices
  • Change Management Flashcards (CLI) (flashcard_deck, L1) — Change Management
  • Chaos Engineering Scripts (CLI) (Exercise Set, L2) — Incident Response
  • Debugging Methodology (Topic Pack, L1) — Incident Response
  • Incident Command & On-Call (Topic Pack, L2) — Incident Response
  • Incident Response Flashcards (CLI) (flashcard_deck, L1) — Incident Response
  • Incident Simulator (18 scenarios) (CLI) (Exercise Set, L2) — Incident Response
  • Investigation Engine (CLI) (Exercise Set, L2) — Incident Response
  • Ops War Stories & Pattern Recognition (Topic Pack, L2) — Incident Response
  • Postmortems & SLOs (Topic Pack, L2) — Incident Response