Change Management Footguns

  1. "It's just a config change" — skipping review entirely. Config changes have caused some of the largest outages in tech history. A mistyped connection string, a wrong timeout value, a feature flag flipped to the wrong audience. Config changes bypass compiler checks, unit tests, and often integration tests entirely.

Fix: Route config changes through the same review process as code: version control, peer review, staged rollout. Use validation schemas for config files. Deploy config changes incrementally (canary) when possible.

War story: Facebook's 2021 global outage (6+ hours, taking down WhatsApp, Instagram, and Facebook) was caused by a configuration change to the backbone routers. The command was meant to assess available backbone capacity, but a bug in the audit tool let the faulty command through. The misconfiguration withdrew the BGP routes for Facebook's DNS nameservers, making the entire company unreachable from the internet. Config changes to network infrastructure are among the highest-blast-radius changes in any organization.
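
The validation-schema idea from the fix above can be sketched in a few lines. This is a minimal illustration with hypothetical config keys and a hand-rolled schema, not a real validation library; the point is that the check runs in CI, before the change reaches production.

```python
# Config validation sketch. SCHEMA, the key names, and the allowed
# audiences are all assumptions for illustration.

SCHEMA = {
    "db_connection_string": str,
    "request_timeout_ms": int,
    "feature_flag_audience": str,
}

ALLOWED_AUDIENCES = {"internal", "beta", "all"}

def validate_config(config: dict) -> list[str]:
    """Return a list of human-readable errors; an empty list means valid."""
    errors = []
    for key, expected_type in SCHEMA.items():
        if key not in config:
            errors.append(f"missing required key: {key}")
        elif not isinstance(config[key], expected_type):
            errors.append(f"{key}: expected {expected_type.__name__}, "
                          f"got {type(config[key]).__name__}")
    if config.get("request_timeout_ms", 0) <= 0:
        errors.append("request_timeout_ms must be positive")
    if config.get("feature_flag_audience") not in ALLOWED_AUDIENCES:
        errors.append("feature_flag_audience must be one of "
                      + ", ".join(sorted(ALLOWED_AUDIENCES)))
    return errors
```

A mistyped timeout or a flag flipped to the wrong audience fails the build instead of paging someone at 3 AM.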

  2. Deploying on Friday afternoon. You ship at 4 PM on Friday. A subtle bug surfaces at 7 PM. The author is at dinner. The reviewer is offline. The on-call engineer has never seen this code. What should have been a 5-minute rollback becomes a 3-hour investigation with half the context missing.

Fix: Hard rule: no non-emergency deploys after Wednesday for high-risk changes, after Thursday for normal changes. If your CI/CD pipeline doesn't enforce this, add a gate.
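
A pipeline gate for this rule can be very small. The cutoff days below mirror the rule in the fix (Wednesday for high-risk, Thursday for normal), but the risk labels and emergency override are assumptions about how your pipeline classifies changes.

```python
# Deploy-day gate sketch: block non-emergency deploys late in the week.
import datetime

CUTOFF = {"high": 2, "normal": 3}  # weekday(): Mon=0, so Wed=2, Thu=3

def deploy_allowed(risk: str, when: datetime.date, emergency: bool = False) -> bool:
    """True if a deploy of the given risk level may proceed on that date."""
    if emergency:
        return True  # emergency fixes bypass the calendar gate
    return when.weekday() <= CUTOFF[risk]
```

Wire it in as a required CI step and the rule enforces itself instead of relying on memory.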

  3. No rollback plan — "we'll figure it out." You deploy a database migration that adds columns, changes indexes, and updates 50 million rows. It breaks something. Rolling back means... what, exactly? Nobody wrote it down. The migration tool doesn't support automatic rollback for DDL changes.

Fix: Every change ticket must have a rollback section with exact commands. If a change can't be rolled back (e.g., destructive migration), that's a risk factor that escalates the review level. Test the rollback procedure in staging.
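
The "every ticket must have a rollback section" rule can be linted automatically. The field names below are assumptions, not any real ticketing system's schema; the idea is that a ticket with no concrete rollback commands never reaches the deploy stage, and an irreversible change is escalated, as the fix describes.

```python
# Change-ticket linter sketch (hypothetical ticket fields).

def review_level(ticket: dict) -> str:
    """Reject tickets without rollback commands; escalate irreversible ones."""
    rollback = ticket.get("rollback", {})
    if not rollback.get("commands"):
        raise ValueError("ticket rejected: rollback section must list exact commands")
    if not rollback.get("reversible", True):
        return "senior-review"  # destructive change: escalate the review level
    return "standard-review"
```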

  4. Skipping pre-flight checks because "we've done this before." The team has deployed this service 200 times. Routine. Except this time someone merged a PR that changes the health check endpoint. The deploy succeeds but the load balancer can't health-check the new pods. You skip the pre-flight checklist because it's "just another deploy."

Fix: Pre-flight checklists exist for the one time in 200 when something is different. Automate the checklist into your deployment pipeline so it can't be skipped. Airline pilots don't skip pre-flight because they've flown 10,000 times.

Remember: The Atul Gawande "Checklist Manifesto" principle applies directly to deployments: checklists catch errors that expertise misses. The most dangerous moment is when something is routine — expertise breeds complacency. Automate the checklist into CI/CD gates so it's impossible to skip, not just easy to follow.
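
Automating the checklist can be as simple as a list of named check functions that the pipeline runs before rollout. The two checks below are illustrative assumptions; the first one is exactly the scenario above, where the service's health path changed but the load balancer's did not.

```python
# Pre-flight checklist sketch: every check must pass or the deploy aborts.

def check_health_endpoint_matches(lb_config: dict, service_config: dict) -> bool:
    """The load balancer must probe the same path the service exposes."""
    return lb_config["health_path"] == service_config["health_path"]

def check_migrations_applied(pending_migrations: list) -> bool:
    """No schema migrations may be left pending at deploy time."""
    return len(pending_migrations) == 0

def run_preflight(checks: list) -> list[str]:
    """Run every (name, callable) pair; return the names of failed checks."""
    return [name for name, fn in checks if not fn()]
```

A non-empty return value fails the pipeline stage, so the checklist cannot be skipped on the 200th routine deploy.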

  5. Silent changes — no announcement, no ticket, no record. An engineer SSHes into a production server and tweaks a kernel parameter to fix a performance issue. It works. No ticket, no Slack message, no commit. Three months later, the server is replaced and the issue returns. Nobody remembers the fix.

Fix: Every production change needs a record: a ticket, a commit, a message — something searchable. Make the official path (pipeline, config management) easier than the unofficial path (SSH and edit). Use auditd to detect unauthorized changes.
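
Beyond auditd, a simple drift check catches silent tweaks after the fact: compare live settings against a version-controlled baseline. The parameter names below are illustrative, and the baseline dict stands in for a file in config management.

```python
# Drift-detection sketch: surface "silent" SSH tweaks as explicit diffs.

def detect_drift(baseline: dict, live: dict) -> dict:
    """Return {param: (expected, actual)} for every mismatched or
    undocumented parameter. Empty dict means no drift."""
    drift = {}
    for param, expected in baseline.items():
        actual = live.get(param)
        if actual != expected:
            drift[param] = (expected, actual)
    for param in live.keys() - baseline.keys():
        drift[param] = (None, live[param])  # setting with no baseline entry
    return drift
```

Run it on a schedule and the three-months-later mystery becomes a same-day alert with a named parameter attached.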

  6. Stacking changes right after a freeze ends. The change freeze lifts Monday morning. Five teams deploy simultaneously because they've been queuing changes for two weeks. Three of the five changes interact badly. Which one caused the outage? Nobody knows because everything changed at once.

Fix: Stagger post-freeze deploys. Enforce a "one change at a time" policy for the first 24-48 hours after a freeze. Prioritize changes by risk and deploy lowest-risk first.

Gotcha: The Monday after a code freeze is often the most dangerous deployment day in an organization. Google's SRE book calls this the "launch queue" anti-pattern. If 10 teams are blocked by a freeze and all deploy Monday morning, you have 10 simultaneous changes and no way to isolate failures. Some teams now enforce a "deployment lottery": a randomized deployment order with mandatory 2-hour gaps between teams.
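
The staggering policy above is easy to mechanize: sort the queued changes by risk and space them out with a mandatory gap. The risk scores and the 2-hour default are assumptions for illustration.

```python
# Post-freeze scheduler sketch: lowest risk deploys first, with enforced gaps
# so any failure stays attributable to a single change.

def schedule(queue: list[dict], gap_hours: int = 2) -> list[tuple[int, str]]:
    """Return (start_offset_hours, team) pairs in deployment order."""
    ordered = sorted(queue, key=lambda change: change["risk"])
    return [(i * gap_hours, change["team"]) for i, change in enumerate(ordered)]
```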

  7. Emergency changes without follow-up documentation. Production is down. You hotfix it in 10 minutes. Crisis averted, high-fives all around. Nobody writes the change ticket. Nobody documents what was changed. The emergency fix introduced technical debt that compounds silently until the next incident.

Fix: Emergency changes get retroactive documentation within 24 hours — no exceptions. Include: what changed, why, who approved, what was the impact, and what permanent fix is needed. Add it to the post-incident review.

  8. No post-change validation — "the deploy succeeded." The CI/CD pipeline shows green. The deploy completed. The engineer closes the ticket. But "deployed successfully" and "working correctly" are different things. A successful deploy can still break functionality if the new code has a logic error that tests didn't cover.

Fix: Post-change validation is a mandatory step, not optional. Check health endpoints, error rates, latency, and at least one functional test. Wait 15 minutes before declaring success. Automate this validation into the pipeline.
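
The validation step might look like the sketch below: the deploy is only declared done once health, error-rate, and latency checks pass against live metrics. The thresholds and the metrics-fetching callable are assumptions; in a real pipeline this check would run repeatedly across the 15-minute soak window rather than once.

```python
# Post-deploy validation sketch: "deployed" is not "working correctly".

def validate_deploy(fetch_metrics, max_error_rate: float = 0.01,
                    max_p99_ms: int = 500) -> bool:
    """fetch_metrics() returns a snapshot of live service metrics;
    all three conditions must hold for the deploy to be declared good."""
    m = fetch_metrics()
    return (m["healthy"]
            and m["error_rate"] <= max_error_rate
            and m["p99_latency_ms"] <= max_p99_ms)
```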

  9. Changing multiple things at once to "save time." You bundle a dependency update, a config change, and a feature flag flip into one deploy because scheduling three change windows feels wasteful. Something breaks. You can't isolate which change caused it, so you have to roll back all three.

Fix: One logical change per deploy. If three things need to change, that's three deploys with validation between each. The time you "save" by bundling is always less than the time you spend debugging a bundled failure.
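
The "three deploys with validation between each" loop can be sketched directly. The `apply` and `validate` callables are stand-ins for your deploy tooling; the key property is that the loop stops at the first failing change, so the culprit is already isolated.

```python
# Sequential-deploy sketch: one logical change at a time, validated between.

def deploy_sequentially(changes: list[str], apply, validate):
    """Apply each change, validating after every step.
    Return (deployed, failed) where failed is the first bad change or None."""
    deployed = []
    for change in changes:
        apply(change)
        if not validate():
            return deployed, change  # stop here: the culprit is known
        deployed.append(change)
    return deployed, None
```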

  10. Treating change management as bureaucracy instead of safety. The process exists but engineers see it as overhead. They find workarounds: deploy without tickets, use personal AWS credentials, push directly to main. The process becomes a checkbox exercise that catches nothing because nobody takes it seriously.

Fix: Make the process lightweight for low-risk changes — standard changes should be automated and friction-free. Reserve heavyweight review for high-risk changes. If engineers are working around the process, the process is too heavy for the risk level. Fix the process, don't blame the engineers.
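
Risk-tiered routing is the mechanism that keeps the process proportionate. The tier criteria below are assumptions; the shape is what matters: pre-approved standard changes flow through with zero friction, and only genuinely risky changes hit heavyweight review.

```python
# Risk-tiering sketch: route each change to a review path matching its risk.

def review_path(change: dict) -> str:
    if change.get("touches_prod_data") or change.get("irreversible"):
        return "cab-review"      # heavyweight: change advisory board
    if change.get("pre_approved_template"):
        return "auto-approve"    # standard change: automated, friction-free
    return "peer-review"         # default: one reviewer, lightweight
```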