
Incident Replay: Firewall Shadow Rule

Setup

  • System context: Corporate firewall with 500+ rules. A new allow rule was added for a partner integration but traffic is still being blocked.
  • Time: Thursday 10:00 UTC
  • Your role: Network security engineer

Round 1: Alert Fires

[Pressure cue: "Partner integration team says their traffic is blocked despite the new firewall rule added yesterday. 'We confirmed the rule is there.'"]

What you see: The new rule (rule 487) allows TCP/443 from the partner IP range, but packet counters show 0 hits on it. Traffic logs show the partner traffic being dropped.

Choose your action:

  • A) Move the new rule to the top of the ruleset
  • B) Check if an earlier rule matches and drops the partner traffic first
  • C) Check if the partner IP is correct
  • D) Disable the firewall temporarily to test

If you chose A:

[Result: Moving the rule to the top restores partner traffic, but it violates the rule-ordering policy; other rules depend on evaluation order.]

If you chose B:

[Result: Rule 142 is a broad deny: 'deny tcp any any range 443-445'. It was added months ago to block SMB-over-HTTPS on TCP/445, but its port range also covers 443, so it matches the partner traffic before rule 487 is ever evaluated. The new rule is shadowed. Proceed to Round 2.]

If you chose C:

[Result: Partner IP is correct. The rule is correct — it is just never reached.]

If you chose D:

[Result: Never disable a production firewall for testing. This would expose the entire network.]
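The shadowing found in option B can be checked mechanically: rule 142 matches any source on ports 443-445, so it covers every packet rule 487 would match. A minimal sketch using a simplified rule model (the partner IP range and field layout below are illustrative assumptions, not a vendor schema):

```python
import ipaddress

# Simplified rule model: (action, source network, dst port low, dst port high)
rule_142 = ("deny", ipaddress.ip_network("0.0.0.0/0"), 443, 445)        # broad SMB-over-HTTPS block
rule_487 = ("allow", ipaddress.ip_network("203.0.113.0/24"), 443, 443)  # partner range (illustrative)

def shadows(earlier, later):
    """True if `earlier` matches every packet `later` would match,
    i.e. the later rule can never fire under first-match evaluation."""
    _, e_net, e_lo, e_hi = earlier
    _, l_net, l_lo, l_hi = later
    return l_net.subnet_of(e_net) and e_lo <= l_lo and l_hi <= e_hi

print(shadows(rule_142, rule_487))  # True: rule 487 is never reached
```

Real rulesets match on more fields (protocol, source port, zones), but the containment test is the same: the later rule is shadowed when its match set is a subset of an earlier, differently-acting rule's match set.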

Round 2: First Triage Data

[Pressure cue: "Problem scoped. Apply the fix."]

What you see: Round 1 identified the cause: rule 142 matches the partner traffic before rule 487 is evaluated. You need a fix that makes rule 487 reachable without weakening the SMB block, then verification.

Choose your action:

  • A) Apply the quick targeted fix
  • B) Apply the comprehensive fix with verification
  • C) Apply a workaround while planning the proper fix
  • D) Escalate to a specialist team

If you chose A:

[Result: Quick fix resolves the immediate issue but may not be durable. Proceed cautiously.]

If you chose B:

[Result: Comprehensive fix applied with verification steps. Issue resolved. Proceed to Round 3.]

If you chose C:

[Result: Workaround buys time but the root cause remains. Acceptable short-term.]

If you chose D:

[Result: The specialist is unavailable or adds delay. With the root cause already known, try the fix yourself first.]

Round 3: Root Cause Identification

[Pressure cue: "Fix applied. Document root cause and prevention."]

What you see: The root cause is confirmed, and the process or configuration gap that allowed it (no shadow check on rule changes) is identified.

Choose your action:

  • A) Fix the specific instance only
  • B) Fix the instance and add monitoring
  • C) Fix the instance, add monitoring, and update procedures
  • D) Comprehensive: fix + monitor + procedure + automation

If you chose D:

[Result: All layers addressed. Immediate fix, detection, process, and automation. Proceed to Round 4.]

If you chose A:

[Result: Fixes this case but the same mistake can recur.]

If you chose B:

[Result: Better detection next time but does not prevent recurrence.]

If you chose C:

[Result: Good coverage but automation reduces human error further.]
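The automation layer in option D could include a pre-deploy lint that scans the whole ruleset for shadowed pairs: any later rule fully covered by an earlier rule with a different action. A sketch under the same simplified rule model as before (indices stand in for rule numbers; not tied to any vendor's config format):

```python
import ipaddress

def find_shadowed(rules):
    """Return (earlier_index, later_index) pairs where an earlier rule with a
    different action fully covers a later rule, so the later rule never fires."""
    hits = []
    for i, (e_act, e_net, e_lo, e_hi) in enumerate(rules):
        for j in range(i + 1, len(rules)):
            l_act, l_net, l_lo, l_hi = rules[j]
            if (l_act != e_act
                    and l_net.subnet_of(e_net)
                    and e_lo <= l_lo and l_hi <= e_hi):
                hits.append((i, j))
    return hits

ruleset = [
    ("deny", ipaddress.ip_network("0.0.0.0/0"), 443, 445),         # rule 142 (index 0)
    ("allow", ipaddress.ip_network("203.0.113.0/24"), 443, 443),   # rule 487 (index 1)
]
print(find_shadowed(ruleset))  # [(0, 1)]
```

Run as a CI gate on rule changes, a check like this would have flagged rule 487 as unreachable before it shipped.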

Round 4: Remediation

[Pressure cue: "Service restored. Verify and close."]

Actions:

  1. Verify the service is functioning correctly
  2. Verify monitoring detects the fix
  3. Update runbooks and procedures
  4. Schedule follow-up actions (automation, infrastructure changes)
  5. Close the incident with a post-mortem
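Steps 1 and 2 can be scripted: after the fix, the hit counter on rule 487 should increase as partner traffic matches it. The counter snapshots below are hypothetical (counter formats and collection commands vary by vendor); the comparison is the point:

```python
# Hypothetical counter snapshots: rule number -> packet hit count.
# How these are collected depends on the firewall platform.
before = {142: 98213, 487: 0}
after = {142: 98213, 487: 152}

def fix_verified(before, after, new_rule=487):
    """The new allow rule now matches traffic it never saw before the fix."""
    return after[new_rule] > before[new_rule]

print(fix_verified(before, after))  # True
```

Pairing this with a synthetic test connection from the partner range gives a positive end-to-end confirmation rather than relying on organic traffic.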

Damage Report

  • Total downtime: Varies based on path taken
  • Blast radius: Affected service and dependent systems
  • Optimal resolution time: 15 minutes
  • If every wrong choice was made: 120 minutes + additional damage

Cross-References