Incident Replay: VLAN Trunk Mistag
Setup
- System context: New VLAN (VLAN 500) added for IoT devices. Devices on the new VLAN cannot communicate with the gateway. Other VLANs on the same switch work fine.
- Time: Thursday 11:00 UTC
- Your role: Network engineer
Round 1: Alert Fires
[Pressure cue: "IoT team reports none of their 30 devices on VLAN 500 can get DHCP leases or reach the gateway. Other VLANs are working."]
What you see:
VLAN 500 exists on the access switch, but show interface trunk on the uplink shows that the trunk was configured with an explicit allowed VLAN list, and VLAN 500 is not in it.
Choose your action:
- A) Apply a quick workaround to restore service
- B) Investigate the root cause systematically
- C) Escalate to the vendor or upstream provider
- D) Check if a recent change caused the issue
If you chose A:
[Result: Workaround provides temporary relief but masks the underlying issue. You will need to circle back.]
If you chose B (recommended):
[Result: Systematic investigation reveals the root cause. The trunk port is configured with switchport trunk allowed vlan 1-499. When VLAN 500 was created on the access switch, nobody updated the trunk's allowed list, so VLAN 500 frames are dropped at the trunk port. Proceed to Round 2.]
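The root cause comes down to a membership test: is VLAN 500 inside the range that the allowed list describes? The following is a minimal Python sketch of that check, not a vendor tool. The function names `parse_allowed_vlans` and `is_vlan_allowed` are hypothetical, and the parser assumes the list uses only comma-separated IDs and `low-high` ranges.

```python
def parse_allowed_vlans(spec: str) -> set[int]:
    """Expand an allowed-VLAN list such as "1-499" or "10,20,30-40" into a set of IDs."""
    vlans: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas or whitespace
        if "-" in part:
            low, high = part.split("-", 1)
            vlans.update(range(int(low), int(high) + 1))  # ranges are inclusive
        else:
            vlans.add(int(part))
    return vlans


def is_vlan_allowed(spec: str, vlan_id: int) -> bool:
    """Return True if vlan_id is covered by the allowed-VLAN list."""
    return vlan_id in parse_allowed_vlans(spec)


if __name__ == "__main__":
    allowed = "1-499"  # the allowed list from this scenario's trunk port
    print(is_vlan_allowed(allowed, 499))  # True
    print(is_vlan_allowed(allowed, 500))  # False
```

The range 1-499 is inclusive, so VLAN 499 passes and VLAN 500 falls one ID outside it, which matches the symptom that only the new VLAN is affected.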
If you chose C:
[Result: Vendor/upstream confirms the issue is on your side. Time wasted on external coordination.]
If you chose D:
[Result: Change log review helps narrow the timeline but does not directly identify the technical cause. Partial progress.]
Round 2: First Triage Data
[Pressure cue: "Root cause identified. Apply the fix."]
What you see:
The trunk port is configured with switchport trunk allowed vlan 1-499. When VLAN 500 was created on the access switch, nobody updated the trunk's allowed list, so VLAN 500 frames are dropped at the trunk port.
Choose your action:
- A) Apply the targeted fix
- B) Apply the fix and verify with testing
- C) Apply a broader fix that addresses the class of problem
- D) Document and schedule the fix for the next maintenance window
If you chose B (recommended):
[Result: Run switchport trunk allowed vlan add 500 on both ends of the trunk. Verify with show vlan brief and show interface trunk. Test DHCP and gateway connectivity from VLAN 500. Service restored and verified. Proceed to Round 3.]
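The `add` keyword in the fix matters. Assuming Cisco IOS-style semantics, the same command without `add` replaces the whole allowed list instead of extending it. The sketch below models both outcomes with plain Python sets; `compress_vlans`, `allowed_after_add`, and `allowed_after_replace` are hypothetical helper names used only for illustration.

```python
def compress_vlans(vlans: set[int]) -> str:
    """Render VLAN IDs as a compact list, e.g. {1, 2, 3, 10} -> "1-3,10"."""
    ordered = sorted(vlans)
    parts: list[str] = []
    i = 0
    while i < len(ordered):
        start = end = ordered[i]
        # Extend the current run while the next ID is consecutive.
        while i + 1 < len(ordered) and ordered[i + 1] == end + 1:
            i += 1
            end = ordered[i]
        parts.append(str(start) if start == end else f"{start}-{end}")
        i += 1
    return ",".join(parts)


def allowed_after_add(current: set[int], vlan_id: int) -> set[int]:
    """Model of '... allowed vlan add <id>': the existing list is kept and extended."""
    return current | {vlan_id}


def allowed_after_replace(current: set[int], vlan_id: int) -> set[int]:
    """Model of '... allowed vlan <id>' without 'add': the list is overwritten."""
    return {vlan_id}


if __name__ == "__main__":
    current = set(range(1, 500))  # allowed list 1-499 from this scenario
    print(compress_vlans(allowed_after_add(current, 500)))      # 1-500
    print(compress_vlans(allowed_after_replace(current, 500)))  # 500
```

In this model, dropping `add` would leave only VLAN 500 on the trunk and cut off VLANs 1-499, which is why verifying the allowed list after the change is part of the recommended path.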
If you chose A:
[Result: Fix applied but not verified. May not be complete.]
If you chose C:
[Result: Broader fix is correct long-term but takes longer to implement during an incident.]
If you chose D:
[Result: Delaying the fix extends the outage or degradation. Apply now if possible.]
Round 3: Root Cause Identification
[Pressure cue: "Service restored. Document and prevent recurrence."]
What you see:
Root cause confirmed: The trunk port is configured with switchport trunk allowed vlan 1-499. When VLAN 500 was created on the access switch, nobody updated the trunk's allowed list, so VLAN 500 frames are dropped at the trunk port.
Choose your action:
- A) Document the fix in the runbook
- B) Add monitoring to detect this condition
- C) Add the fix to automation/configuration management
- D) All of the above
If you chose D (recommended):
[Result: Documentation, monitoring, and automation all updated. Defense in depth prevents recurrence. Proceed to Round 4.]
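One way the monitoring piece could work is a periodic drift check that compares the VLANs defined on a switch against each trunk's allowed list. The sketch below assumes both have already been collected into Python sets by some other means. The function name `missing_trunk_vlans` and the port labels `uplink-a` and `uplink-b` are hypothetical.

```python
def missing_trunk_vlans(
    defined_vlans: set[int],
    trunk_allowed: dict[str, set[int]],
) -> dict[str, list[int]]:
    """Return, per trunk port, the defined VLANs that its allowed list omits."""
    report: dict[str, list[int]] = {}
    for port, allowed in trunk_allowed.items():
        missing = sorted(defined_vlans - allowed)
        if missing:  # only report trunks that actually have a gap
            report[port] = missing
    return report


if __name__ == "__main__":
    defined = {10, 20, 500}  # illustrative VLANs defined on the access switch
    trunks = {
        "uplink-a": set(range(1, 500)),  # allowed list 1-499, as in this incident
        "uplink-b": set(range(1, 501)),  # allowed list 1-500
    }
    print(missing_trunk_vlans(defined, trunks))  # {'uplink-a': [500]}
```

Not every VLAN is meant to cross every trunk, so a real check would likely compare against an intended per-trunk list and not flag every gap as an error.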
If you chose A:
[Result: Documentation helps but relies on humans remembering to check it.]
If you chose B:
[Result: Monitoring detects faster but does not prevent.]
If you chose C:
[Result: Automation prevents recurrence but needs monitoring for edge cases.]
Round 4: Remediation
[Pressure cue: "Verify everything and close the incident."]
Actions:
1. Verify service is functioning correctly end-to-end
2. Verify monitoring detects the condition
3. Update runbooks and configuration management
4. Schedule post-mortem review
5. Check for similar issues across the infrastructure
Damage Report
- Total downtime: Varies based on path chosen
- Blast radius: The 30 IoT devices on VLAN 500; other VLANs on the same switch were unaffected
- Optimal resolution time: 8 minutes
- If every wrong choice was made: 45 minutes + additional damage
Cross-References
- Primer: Networking
- Primer: Networking Troubleshooting
- Footguns: Networking