Anti-Primer: Load Balancing
Everything that can go wrong, will — and in this story, it does.
The Setup
A network engineer is changing the load-balancing configuration during a maintenance window. The network serves 500 users and a dozen production services. The change was planned last month, but the engineer implementing it was not part of the planning.
The Timeline
Hour 0: No Rollback Plan
The engineer makes configuration changes without saving the current running config. The deadline was looming, and this seemed like the fastest path forward. The result: the new config breaks connectivity, and the previous state cannot be restored without a full outage.
Footgun #1: No Rollback Plan. Configuration changes are made without saving the current running config, so when the new config breaks connectivity, the previous state cannot be restored without a full outage.
Nobody notices yet. The engineer moves on to the next task.
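The missing rollback step can be sketched in a few lines. This is a minimal illustration, not the primer's procedure: it assumes the running config has already been retrieved as plain text (how that happens depends on the device and is not shown), and the function names `save_snapshot` and `config_diff` are hypothetical.

```python
import difflib
from datetime import datetime, timezone
from pathlib import Path


def save_snapshot(config_text: str, backup_dir: Path,
                  label: str = "running-config") -> Path:
    """Write the config text to a UTC-timestamped file and return its path."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = backup_dir / f"{label}-{stamp}.txt"
    path.write_text(config_text, encoding="utf-8")
    return path


def config_diff(before: str, after: str) -> list[str]:
    """Return a unified diff between two config texts (empty if identical)."""
    return list(
        difflib.unified_diff(
            before.splitlines(),
            after.splitlines(),
            fromfile="before",
            tofile="after",
            lineterm="",
        )
    )
```

Taking the snapshot before the change gives a known-good file to restore from, and diffing it against the post-change config shows exactly what was altered if something breaks.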
Hour 1: Testing in Production
The engineer skips the lab environment and applies changes directly to production gear. Under time pressure, the team chose speed over caution. The result: a misconfiguration causes a broadcast storm, and the entire VLAN goes down.
Footgun #2: Testing in Production. Changes skip the lab environment and go directly to production gear, where a misconfiguration causes a broadcast storm and the entire VLAN goes down.
The first mistake is still invisible, making the next shortcut feel justified.
Hour 2: Ignoring Layer 1
The team spends 2 hours debugging routing when the real issue is a bad fiber patch cable. Nobody pushed back because the shortcut looked harmless in the moment. The result: resolution is delayed while the team chases a software problem that does not exist.
Footgun #3: Ignoring Layer 1. The team spends 2 hours debugging routing when the issue is a bad fiber patch cable, delaying resolution while chasing a software problem that does not exist.
Pressure is mounting. The team is behind schedule and cutting more corners.
Hour 3: Overlapping Subnets
The engineer assigns an IP range that overlaps with another VLAN without checking the IPAM. The team had gotten away with similar shortcuts before, so nobody raised a flag. The result: duplicate IP addresses cause intermittent connectivity on both networks.
Footgun #4: Overlapping Subnets. An IP range that overlaps with another VLAN is assigned without checking the IPAM, and the resulting duplicate IP addresses cause intermittent connectivity on both networks.
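An overlap like this one can be caught before assignment with a simple check. The sketch below uses Python's standard `ipaddress` module and assumes the existing allocations have been exported from the IPAM as a name-to-CIDR mapping. The function name `find_overlaps` and the VLAN names and ranges in the usage example are hypothetical.

```python
import ipaddress


def find_overlaps(candidate: str, allocated: dict[str, str]) -> list[str]:
    """Return the names of allocated ranges that overlap the candidate CIDR."""
    # strict=True rejects a candidate with host bits set, e.g. 10.0.10.1/24.
    new_net = ipaddress.ip_network(candidate, strict=True)
    return [
        name
        for name, cidr in allocated.items()
        if new_net.overlaps(ipaddress.ip_network(cidr))
    ]


if __name__ == "__main__":
    # Hypothetical allocations standing in for an IPAM export.
    allocated = {"vlan10": "10.0.10.0/24", "vlan20": "10.0.20.0/24"}
    conflicts = find_overlaps("10.0.10.128/25", allocated)
    if conflicts:
        print("Overlaps with:", ", ".join(conflicts))
```

A check like this only covers ranges recorded in the IPAM. Comparing against live ARP tables is still needed to catch addresses in use that were never documented.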
By hour 3, the compounding failures have reached critical mass. Pages fire. The war room fills up. The team scrambles to understand what went wrong while the system burns.
The Postmortem
Root Cause Chain
| # | Mistake | Consequence | Could Have Been Prevented By |
|---|---|---|---|
| 1 | No Rollback Plan | New config breaks connectivity; cannot restore the previous state without a full outage | Primer: Always back up the running configuration before making any changes |
| 2 | Testing in Production | A misconfiguration causes a broadcast storm; the entire VLAN goes down | Primer: Test all changes in a lab or staging environment first |
| 3 | Ignoring Layer 1 | Delays resolution while chasing a software problem that does not exist | Primer: Always check the physical layer first: link lights, cable integrity, SFP seating |
| 4 | Overlapping Subnets | Duplicate IP addresses cause intermittent connectivity for both networks | Primer: Check IPAM and ARP tables before assigning any new IP ranges |
Damage Report
- Downtime: 1-4 hours of connectivity loss or degraded throughput
- Data loss: None directly, but dependent services may lose in-flight data
- Customer impact: Timeouts, connection failures, or complete network unreachability
- Engineering time to remediate: 8-16 engineer-hours including physical layer verification
- Reputation cost: Network team credibility damaged; possible SLA credits to internal customers
What the Primer Teaches
- Footgun #1: The primer's section on rollback plans would have taught the engineer to always back up the running configuration before making any changes.
- Footgun #2: The primer's section on testing in production would have taught the engineer to test all changes in a lab or staging environment first.
- Footgun #3: The primer's section on Layer 1 would have taught the engineer to always check the physical layer first: link lights, cable integrity, SFP seating.
- Footgun #4: The primer's section on overlapping subnets would have taught the engineer to check IPAM and ARP tables before assigning any new IP ranges.
Cross-References
- Primer — The right way
- Footguns — The mistakes catalogued
- Street Ops — How to do it in practice