Change Management
10 cards: 🟢 3 easy | 🟡 4 medium | 🔴 3 hard
🟢 Easy (3)
1. What are the three change categories and their approval requirements?
Standard (pre-approved, low-risk, repeatable), Normal (planned, peer-reviewed or CAB-approved, scheduled), and Emergency (unplanned, expedited approval, used only to restore service during an active incident).
Remember: peer review catches errors automation misses. At minimum, a change record should state what's changing, why, the rollback plan, the risk assessment, and who approved it.
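The three categories above can be sketched as a small decision function. This is only an illustration: the `ChangeRequest` fields and the `APPROVAL` wording are hypothetical and not taken from any particular ticketing system.

```python
from dataclasses import dataclass
from enum import Enum


class ChangeCategory(Enum):
    STANDARD = "standard"
    NORMAL = "normal"
    EMERGENCY = "emergency"


# Approval path per category, paraphrasing the card above.
APPROVAL = {
    ChangeCategory.STANDARD: "pre-approved, no per-change review",
    ChangeCategory.NORMAL: "peer review or CAB approval, then scheduled",
    ChangeCategory.EMERGENCY: "expedited approval during an active incident",
}


@dataclass
class ChangeRequest:
    # Hypothetical fields; a real ticketing system will name these differently.
    summary: str
    restores_service_in_active_incident: bool = False
    on_preapproved_list: bool = False


def categorize(change: ChangeRequest) -> ChangeCategory:
    """Pick a category using the definitions from the card."""
    if change.restores_service_in_active_incident:
        return ChangeCategory.EMERGENCY
    if change.on_preapproved_list:
        return ChangeCategory.STANDARD
    # Everything else is planned work that needs review and scheduling.
    return ChangeCategory.NORMAL
```

For example, `categorize(ChangeRequest("rotate TLS cert", on_preapproved_list=True))` returns `ChangeCategory.STANDARD`, and the matching approval path can be looked up in `APPROVAL`.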
2. What days and times are considered the best change windows?
Tuesday through Thursday, during business hours when the full team is available. Avoid Friday afternoons, the days before holidays, and on-call handoffs.
Fun fact: 'No deploy Friday' is a near-universal rule in ops culture, because a Friday change that goes wrong tends to become a weekend incident.
Gotcha: 'business hours' means when engineers who can fix problems are available, not necessarily 9-5 for global services.
Gotcha: 'off-hours' maintenance windows still affect global teams. Coordinate across time zones and communicate widely.
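A minimal sketch of the window check, assuming business hours of 09:00 to 17:00 in the responders' local time. The constants are placeholders, and holidays and on-call handoffs are not modeled here.

```python
from datetime import datetime, time

# Assumed values: adjust to when your responders are actually available.
GOOD_WEEKDAYS = {1, 2, 3}  # Monday is 0, so this is Tuesday through Thursday
WINDOW_START = time(9, 0)
WINDOW_END = time(17, 0)


def in_change_window(local_now: datetime) -> bool:
    """Return True if local_now falls inside the preferred change window."""
    return (
        local_now.weekday() in GOOD_WEEKDAYS
        and WINDOW_START <= local_now.time() < WINDOW_END
    )
```

A deploy script could call `in_change_window(datetime.now())` and ask for explicit confirmation when it returns `False`.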
3. How long should you wait after a change before declaring it successful?
At least 15 minutes (one full monitoring cycle). Treat that as a minimum, because some failures are delayed by cache expiration, connection pool exhaustion, or slow memory leaks.
Gotcha: connection pool exhaustion can take 30-60 min to appear, and memory leaks can take hours. Tune soak time to your service's known failure modes.
🟡 Medium (4)
1. What four questions must a rollback plan answer?
1. How: the exact commands or pipeline to execute.
2. Who: who has authority to trigger the rollback.
3. When: the time window after which rollback is no longer safe.
4. Verification: how to confirm the rollback succeeded.
Remember: every change needs a rollback plan. If the rollback plan is 'restore from backup,' that's not a plan, it's a prayer. Test your rollback.
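One way to make the four questions hard to skip is to require them as fields on the change record. The sketch below is illustrative only; `RollbackPlan` and `missing_answers` are hypothetical names, not part of any existing tool.

```python
from dataclasses import dataclass, fields


@dataclass
class RollbackPlan:
    how: str           # exact commands or pipeline to run
    who: str           # person or role with authority to trigger rollback
    when: str          # point after which rollback is no longer safe
    verification: str  # how to confirm the rollback succeeded


def missing_answers(plan: RollbackPlan) -> list[str]:
    """Return the names of any of the four questions left blank."""
    return [f.name for f in fields(plan) if not getattr(plan, f.name).strip()]
```

A change ticket could be rejected automatically whenever `missing_answers(plan)` is non-empty. Note that this only checks the plan is filled in; it does not replace actually testing the rollback.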
2. Name at least five factors used in change risk assessment.
Blast radius (single service vs cross-service), reversibility (one-command rollback vs hours-long migration), dependency count, testing confidence, time sensitivity, data impact (schema changes), and history of previous incidents from similar changes.
3. What are the five rules for managing a change freeze?
1. Define the exact window and communicate 2+ weeks ahead.
2. Define exceptions for emergencies that override the freeze.
3. Enforce at the pipeline level, not just via Slack messages.
4. Complete all planned changes 48+ hours before the freeze.
5. Do not batch up changes to deploy the moment the freeze ends.
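Rules 2 and 3 above (emergency exceptions, enforcement at the pipeline level) can be sketched as a gate that a deploy job runs before doing anything else. The freeze dates and the `is_emergency` flag below are hypothetical; in practice they would come from shared configuration and from the incident process.

```python
from datetime import datetime, timezone

# Hypothetical freeze window; in practice this would come from shared config.
FREEZE_START = datetime(2024, 12, 20, 0, 0, tzinfo=timezone.utc)
FREEZE_END = datetime(2025, 1, 2, 0, 0, tzinfo=timezone.utc)


def deploy_allowed(now: datetime, is_emergency: bool = False) -> bool:
    """Block deploys inside the freeze window unless flagged as an emergency."""
    in_freeze = FREEZE_START <= now < FREEZE_END
    return is_emergency or not in_freeze
```

A pipeline step could call `deploy_allowed(datetime.now(timezone.utc))` and exit with a non-zero status when it returns `False`, so the freeze holds even if someone misses the announcement.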
4. How should communication scale with change risk?
Standard changes get automated notifications in the deploy channel. Normal changes get pre-change announcements and post-change validation messages. Emergency changes get real-time updates in the incident channel. High-risk changes get advance email to stakeholders and a dedicated Slack thread.
Remember: measure before you optimize. Profile the actual bottleneck, because premature optimization wastes effort on the wrong component. Amdahl's Law applies everywhere.
🔴 Hard (3)
1. What makes a change qualify as a standard change and why is this classification a goal?
A standard change is well-understood (done many times), low-risk (small blast radius, easily reversible), repeatable (same procedure each time), and pre-approved (no per-instance review needed). The goal is to classify as many changes as standard as possible because they can be fully automated via CI/CD without bureaucratic overhead.
2. What is a CAB (Change Advisory Board) and how should it be made efficient?
A CAB reviews normal and high-risk changes, meets on a fixed schedule, and includes ops leads, dev leads, security, and service owners. Make it efficient by requiring complete change tickets before the meeting, pre-screening to skip standard changes, time-boxing reviews (5 min normal, 15 min high-risk), and tracking approval-to-execution time.
3. What specific metrics should define rollback triggers, and when should they be defined?
Define rollback triggers before the change starts. Triggers should include: error rate exceeding a threshold (e.g., 1% vs 0.05% baseline), p99 latency exceeding limits (e.g., 500ms vs 120ms baseline), any 5xx responses from the changed service, dependent service health check failures, data integrity check failures, and the change author's intuition that something is wrong.
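The measurable triggers above can be written down as code before the change starts, which forces the thresholds to be explicit. This is a sketch under assumptions: the `Snapshot` shape is hypothetical, the default thresholds are the example values from the card, and the author's intuition is left out because it cannot be automated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Example values from the card; set your own before the change starts.
    max_error_rate: float = 0.01       # 1%, against a 0.05% baseline
    max_p99_latency_ms: float = 500.0  # against a 120 ms baseline


@dataclass(frozen=True)
class Snapshot:
    # Hypothetical shape for metrics sampled after the change.
    error_rate: float
    p99_latency_ms: float
    count_5xx: int
    dependencies_healthy: bool
    data_integrity_ok: bool


def rollback_reasons(m: Snapshot, t: Thresholds = Thresholds()) -> list[str]:
    """Return every trigger that fired; an empty list means no trigger fired."""
    reasons = []
    if m.error_rate > t.max_error_rate:
        reasons.append("error rate above threshold")
    if m.p99_latency_ms > t.max_p99_latency_ms:
        reasons.append("p99 latency above limit")
    if m.count_5xx > 0:
        reasons.append("5xx responses from changed service")
    if not m.dependencies_healthy:
        reasons.append("dependent service health check failed")
    if not m.data_integrity_ok:
        reasons.append("data integrity check failed")
    return reasons
```

Returning the list of reasons rather than a single boolean keeps the rollback decision auditable: the reasons can be pasted straight into the change ticket or incident channel.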