Incident Replay: DHCP Relay Broken
Setup
- System context: Corporate network with centralized DHCP server. Remote office VLAN clients cannot obtain IP addresses after a router upgrade. DHCP relay agent on the router forwards requests to the central DHCP server.
- Time: Monday 08:00 UTC
- Your role: Network engineer
Round 1: Alert Fires
[Pressure cue: "Remote office users reporting 'No network connection.' All 50 workstations in VLAN 200 cannot get DHCP leases. Office is dead. Helpdesk overwhelmed."]
What you see: Workstations show APIPA addresses (169.254.x.x), which clients self-assign when no DHCP offer arrives. The DHCP server log shows no DISCOVER packets from VLAN 200. Other VLANs are working fine.
Choose your action:
- A) Restart the DHCP server
- B) Check the DHCP relay configuration on the router for VLAN 200
- C) Assign static IPs to workstations as a workaround
- D) Check if VLAN 200 is configured correctly on the switch
If you chose B (recommended):
[Result: show running-config interface vlan200 reveals that the ip helper-address command is missing. The router upgrade rewrote the config, and the DHCP relay line for VLAN 200 was not carried over. Other VLANs still have their helper-address because they were in a different config section that was preserved. Proceed to Round 2.]
If you chose A:
[Result: DHCP server is healthy — it serves other VLANs fine. Not a server issue.]
If you chose C:
[Result: Assigning static IPs to 50 workstations would take hours and create management overhead.]
If you chose D:
[Result: VLAN 200 is correctly configured on the switches. The L2 path is fine — the issue is L3 relay.]
Round 2: First Triage Data
[Pressure cue: "DHCP relay missing on VLAN 200. Fix it."]
What you see: The router upgrade 3 days ago merged the new config with the old one. The ip helper-address 10.0.0.10 line was dropped from the VLAN 200 SVI during the merge. Existing leases expired over the weekend, and users noticed Monday morning.
Choose your action:
- A) Add the ip helper-address back to the VLAN 200 SVI
- B) Add the helper-address and also verify all other VLANs still have theirs
- C) Roll back the entire router config to the pre-upgrade version
- D) Set up a local DHCP server at the remote office instead
If you chose B (recommended):
[Result: Under interface vlan200, ip helper-address 10.0.0.10 is applied. An audit of all other VLANs confirms that VLAN 200 was the only one affected. Workstations begin obtaining leases within 30 seconds. Proceed to Round 3.]
If you chose A:
[Result: Fixes VLAN 200 but you should verify other VLANs too.]
If you chose C:
[Result: Rolling back loses the upgrade benefits (security patches, bug fixes). Overkill for a one-line config fix.]
If you chose D:
[Result: Local DHCP is complex — split-scope, conflict detection, management overhead. Not justified.]
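The VLAN audit from option B can be scripted rather than eyeballed. The following is a minimal sketch, assuming the running config has been saved as text in the usual IOS-style layout (an unindented interface line followed by indented sub-commands). The function names, the sample config, and every address other than 10.0.0.10 are hypothetical and chosen only for illustration.

```python
import re


def svi_helpers(config_text):
    """Map each VLAN interface (SVI) to the ip helper-address targets it has."""
    helpers = {}
    current = None
    for line in config_text.splitlines():
        if line.startswith("interface "):
            name = line.split(None, 1)[1].strip()
            # Track only SVIs (interface VlanN); physical ports are skipped.
            current = name if name.lower().startswith("vlan") else None
            if current is not None:
                helpers[current] = []
        elif current is not None and line.startswith(" "):
            match = re.match(r"\s+ip helper-address\s+(\S+)", line)
            if match:
                helpers[current].append(match.group(1))
        else:
            # Any other unindented line (including "!") ends the block.
            current = None
    return helpers


def missing_relay(config_text):
    """Return the SVIs that have no ip helper-address at all."""
    found = svi_helpers(config_text)
    return sorted(name for name, addrs in found.items() if not addrs)


# Hypothetical post-upgrade config; only 10.0.0.10 comes from the incident.
SAMPLE = """\
interface Vlan100
 ip address 10.1.100.1 255.255.255.0
 ip helper-address 10.0.0.10
!
interface Vlan200
 ip address 10.1.200.1 255.255.255.0
!
interface GigabitEthernet0/1
 switchport mode trunk
!
"""

print(missing_relay(SAMPLE))  # ['Vlan200']
```

An SVI without a helper is not always a fault (a management VLAN with static addressing may not need one), so the output is a list to review, not a list of confirmed errors.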
Round 3: Root Cause Identification
[Pressure cue: "Office is back online. Document the failure."]
What you see: Root cause: The router config merge during the upgrade dropped the ip helper-address from VLAN 200's SVI. A pre-upgrade config backup was taken, but the post-upgrade config was never diffed against it to verify completeness.
Choose your action:
- A) Add a post-upgrade config diff check to the upgrade procedure
- B) Use a config management tool (RANCID/Oxidized) to track and alert on config changes
- C) Add DHCP lease monitoring to detect relay failures faster
- D) All of the above
If you chose D (recommended):
[Result: Config diff catches missing lines. Config management tracks all changes. DHCP monitoring detects lease failures within minutes. Proceed to Round 4.]
If you chose A:
[Result: Manual diff is good but automated comparison is better.]
If you chose B:
[Result: Config management catches changes but might not flag a missing line as anomalous.]
If you chose C:
[Result: Faster detection but does not prevent the configuration error.]
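The post-upgrade diff check from option A could look roughly like the sketch below. It assumes both the pre-upgrade backup and the post-upgrade config are available as plain text, and it uses Python's standard difflib to report backup lines that have no counterpart after the upgrade, together with the section they belonged to. The function names and sample configs are hypothetical; this is not how RANCID or Oxidized work internally.

```python
import difflib


def section_of(lines, idx):
    """Walk back to the nearest unindented line, i.e. the section header."""
    if not lines[idx].startswith(" "):
        return None
    for back in range(idx - 1, -1, -1):
        if not lines[back].startswith(" "):
            return lines[back].strip()
    return None


def dropped_lines(pre_upgrade, post_upgrade):
    """Return (section, line) pairs from the backup that the matcher could
    not align with any line in the post-upgrade config."""
    pre = pre_upgrade.splitlines()
    post = post_upgrade.splitlines()
    matcher = difflib.SequenceMatcher(a=pre, b=post, autojunk=False)
    dropped = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag not in ("delete", "replace"):
            continue
        for idx in range(i1, i2):
            text = pre[idx].strip()
            if text in ("", "!"):
                continue  # separators carry no configuration
            dropped.append((section_of(pre, idx), text))
    return dropped


# Hypothetical configs; only the 10.0.0.10 helper comes from the incident.
PRE = """\
interface Vlan100
 ip address 10.1.100.1 255.255.255.0
 ip helper-address 10.0.0.10
!
interface Vlan200
 ip address 10.1.200.1 255.255.255.0
 ip helper-address 10.0.0.10
!
"""

POST = """\
interface Vlan100
 ip address 10.1.100.1 255.255.255.0
 ip helper-address 10.0.0.10
!
interface Vlan200
 ip address 10.1.200.1 255.255.255.0
!
"""

for section, line in dropped_lines(PRE, POST):
    print(f"{section}: missing '{line}'")
```

Lines that the upgrade intentionally changed will also show up, since a changed line no longer matches its old form. The output therefore needs a human review against the upgrade's expected changes before sign-off.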
Round 4: Remediation
[Pressure cue: "Office working. Close the incident."]
Actions:
1. Verify all 50 workstations have DHCP leases by checking the DHCP server lease table
2. Verify DHCP relay is configured on all required VLANs
3. Take a post-fix config backup
4. Add a config diff check to the router upgrade runbook
5. Deploy DHCP lease monitoring for all remote offices
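The lease monitoring in the last action can start as a simple count of active leases per subnet, alerting when a subnet falls below an expected floor. The sketch below assumes the active lease IPs have already been exported from the DHCP server as a list of strings (how that export happens is vendor-specific and not shown). The subnet ranges, floor values, and function names are hypothetical.

```python
import ipaddress


def active_leases_per_subnet(lease_ips, subnets):
    """Count how many active lease IPs fall inside each monitored subnet."""
    networks = {name: ipaddress.ip_network(cidr) for name, cidr in subnets.items()}
    counts = {name: 0 for name in networks}
    for ip in lease_ips:
        addr = ipaddress.ip_address(ip)
        for name, net in networks.items():
            if addr in net:
                counts[name] += 1
    return counts


def subnets_below_floor(lease_ips, subnets, floors):
    """Return subnets whose active lease count is under the expected floor.

    A subnet with no configured floor defaults to a floor of 1, so a subnet
    with zero leases is always reported.
    """
    counts = active_leases_per_subnet(lease_ips, subnets)
    return sorted(name for name, n in counts.items() if n < floors.get(name, 1))


# Hypothetical monitoring inputs for two VLANs.
SUBNETS = {"VLAN100": "10.1.100.0/24", "VLAN200": "10.1.200.0/24"}
FLOORS = {"VLAN100": 1, "VLAN200": 25}  # e.g. half of the 50 workstations
LEASES = ["10.1.100.20", "10.1.100.21"]  # nothing leased in VLAN 200

print(subnets_below_floor(LEASES, SUBNETS, FLOORS))  # ['VLAN200']
```

A floor tied to business hours would reduce false alarms, since lease counts naturally drop overnight and on weekends when workstations are powered off.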
Damage Report
- Total downtime: ~3 hours Monday morning (users arrived to broken networking)
- Blast radius: 50 workstations at one remote office; entire office unproductive
- Optimal resolution time: 5 minutes (check relay config -> add helper-address -> verify)
- If every wrong choice was made: 4+ hours with server restarts and static IP assignments
Cross-References
- Primer: DHCP & IPAM
- Primer: Networking
- Primer: VLANs
- Footguns: Networking