The Network Change Window
Category: The Close Call | Domains: networking, change-management | Read time: ~5 min
Setting the Scene
We were migrating from a flat /16 network to a segmented hub-and-spoke topology across three AWS VPCs. The project had been running for four months. We were on Change Request #47 — updating route tables in the transit gateway to shift traffic for the last batch of application subnets. It was a Wednesday, 9 PM, during our standard change window.
The change had been reviewed by two network engineers, approved by the CAB (Change Advisory Board), and had a rollback plan. Everything by the book.
What Happened
The change request specified adding three new route table entries to the transit gateway and removing two deprecated ones. The routes being removed pointed to a pair of /24 subnets (10.42.8.0/24 and 10.42.9.0/24) that, according to our CMDB, had been decommissioned in October.
I wrote the Terraform and ran terraform plan. Clean: 3 to add, 2 to destroy. I submitted the MR and it went to the CAB meeting on Monday.
During the CAB review, Marcus — our most senior network engineer, six months from retirement — asked a question nobody expected: "What about the legacy payment gateway? The one that runs on the bare-metal boxes in the colo cage. Doesn't that use the 10.42.8.0 range?"
The room went quiet. Our CMDB said those subnets were decommissioned. Our network diagrams didn't show them. But Marcus remembered, because he'd been the one to set up the VPN tunnel to the colo facility in 2019.
I captured traffic toward that range with tcpdump. There it was: 400 packets per second flowing to 10.42.8.17. Our legacy payment reconciliation system, the one that runs overnight batch jobs to settle transactions with three major banks, was actively using that subnet. Every night at midnight, it pulled transaction files via SFTP, processed them, and pushed settlements back.
If we'd deleted those routes, the batch job would have silently failed. No error — the packets would just vanish into a black hole. We wouldn't have noticed until the banks called on Friday asking why $2.3 million in settlements hadn't arrived.
The Moment of Truth
Marcus's institutional memory was the only thing standing between us and a routing black hole that would have disrupted bank settlements for 30% of our transaction volume. The CMDB was wrong. The network diagrams were wrong. The only correct record was in Marcus's head, and he was six months from taking it with him into retirement.
The Aftermath
We canceled the route removal, updated the CMDB, and added the colo subnets to our network diagrams. We ran a full subnet audit using nmap scans and flow log analysis across every VPC and transit gateway attachment — it took two weeks and found three other "decommissioned" subnets with active traffic. We also started a knowledge capture project with Marcus: weekly recorded sessions documenting every piece of tribal knowledge about our network topology.
The Lessons
- Change review processes exist for a reason: The CAB meeting felt like bureaucratic overhead until Marcus asked his question. That one question saved us from a multi-million-dollar settlement failure. Process isn't overhead — it's the safety net.
- Tribal knowledge is a risk: When critical information exists only in one person's head, it's not knowledge — it's a single point of failure. Document it, verify it, and make it searchable.
- Document all subnets and verify before removing: Never trust the CMDB as the sole source of truth for network topology. Verify with traffic analysis (flow logs, tcpdump, NetFlow) before removing any route or subnet. If traffic is flowing, something is using it.
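
The traffic check in the last lesson can be sketched in a few lines of Python. This is a minimal illustration, not the tooling we actually ran. It assumes records in the default VPC Flow Logs format (version 2, space-separated, with srcaddr, dstaddr, and packets as the fourth, fifth, and ninth fields), and the function name packets_per_subnet is hypothetical.

```python
import ipaddress


def packets_per_subnet(flow_log_lines, cidrs):
    """Sum packets whose source or destination falls inside each CIDR.

    Assumes the default VPC Flow Logs v2 field order:
    version account-id interface-id srcaddr dstaddr srcport dstport
    protocol packets bytes start end action log-status
    """
    networks = [ipaddress.ip_network(c) for c in cidrs]
    counts = {str(net): 0 for net in networks}
    for line in flow_log_lines:
        fields = line.split()
        # Skip short lines and the header row some exports include.
        if len(fields) < 14 or fields[0] == "version":
            continue
        try:
            src = ipaddress.ip_address(fields[3])
            dst = ipaddress.ip_address(fields[4])
            packets = int(fields[8])
        except ValueError:
            # NODATA and SKIPDATA records carry "-" in these fields.
            continue
        for net in networks:
            if src in net or dst in net:
                counts[str(net)] += packets
    return counts
```

Called with the CIDRs a change proposes to remove, any non-zero count means something is still using that range and the removal needs a closer look.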
What I'd Do Differently
I'd implement continuous network discovery using VPC Flow Logs piped into a tool like Kentik or ElastiFlow. Any MR that removes a route would automatically check flow logs for traffic on the affected subnets in the last 90 days. If traffic exists, the MR is blocked. No human memory required.
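
As a rough illustration of that gate, the sketch below reads the JSON form of a saved plan (the output of terraform show -json) and lists the routes it would delete whose CIDR still shows traffic. The function names, the ROUTE_TYPES set, and the packets_by_cidr mapping are hypothetical. The mapping is assumed to hold a per-CIDR packet count for the last 90 days, produced by whatever flow analysis is in place, and the exact-string CIDR match is a simplification.

```python
import json

# Assumption: the Terraform resource types whose deletion should trigger the check.
ROUTE_TYPES = {"aws_ec2_transit_gateway_route", "aws_route"}


def cidrs_being_removed(plan):
    """Return destination CIDRs of route resources the plan would delete."""
    removed = []
    for rc in plan.get("resource_changes", []):
        change = rc.get("change", {})
        # A replacement shows up as ["delete", "create"], so it is caught too.
        if rc.get("type") in ROUTE_TYPES and "delete" in change.get("actions", []):
            before = change.get("before") or {}
            cidr = before.get("destination_cidr_block")
            if cidr:
                removed.append(cidr)
    return removed


def blocked_removals(plan, packets_by_cidr):
    """Return the CIDRs slated for removal that still carry traffic."""
    return [c for c in cidrs_being_removed(plan) if packets_by_cidr.get(c, 0) > 0]


def load_plan(path):
    """Load a plan previously written with: terraform show -json plan.out > plan.json"""
    with open(path) as fh:
        return json.load(fh)
```

A CI job could fail the pipeline whenever blocked_removals returns a non-empty list, which turns the question Marcus asked into a check that runs on every change.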
The Quote
"The network doesn't care what your CMDB says. It cares what's in the route table."
Cross-References
- Topic Packs: Networking, Change Management
- Case Studies: Network Migration Planning (if relevant)