Thinking Out Loud: Networking Troubleshooting¶
A senior SRE's internal monologue while working through a real networking issue. This isn't a tutorial — it's a window into how experienced engineers actually think.
The Situation¶
An internal microservice (order-processor) can reach the database and Redis, but HTTP requests to another internal service (inventory-api) are timing out. The two services sit on different VLANs. The issue started after a "routine" network change by the networking team last night.
The Monologue¶
One service can't reach another, but can reach other things. This isn't a total network outage — it's specific to this path. Let me start from the basics and work up the stack.
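The first probes, at the bottom of the stack, look roughly like this (hostnames are this incident's internal names):

```shell
# Does the name resolve, and does ICMP get through at L3?
dig +short inventory-api.internal
ping -c 3 inventory-api.internal
```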
Ping works. So we have basic IP connectivity (ICMP). That tells me: DNS resolves, routing works, ARP works, the host is reachable at L3. The problem is likely at L4 (TCP) or L7 (HTTP).
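The L4 check that produced the timeout, roughly:

```shell
# TCP handshake to the service port: -z scan-only, -v verbose, -w 3s timeout
nc -zv -w 3 inventory-api.internal 8080
```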
Connection timed out. So TCP to port 8080 is failing even though ICMP works. That's the classic "ping works but TCP doesn't" scenario. The usual suspects: firewall rule blocking the port, the service isn't listening on that port, or there's a stateful firewall issue.
Let me verify the service is listening.
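On the inventory-api host itself, something like:

```shell
# Is anything listening on TCP 8080? (-l listening, -t tcp, -n numeric, -p process)
sudo ss -ltnp | grep :8080
```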
`LISTEN 0 128 0.0.0.0:8080`. The service IS listening on 8080. So the packet is being dropped somewhere between the two hosts. This is a firewall or ACL issue.
Mental Model: The Layer Isolation Technique¶
When diagnosing network connectivity issues, test each layer independently: DNS -> ICMP (L3) -> TCP SYN (L4) -> Application (L7). If ICMP works but TCP doesn't, the problem is a stateful firewall or ACL at L4. If TCP works but the application times out, the problem is L7 (application, proxy, or middleware). Each layer that passes eliminates a class of root causes.
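The whole ladder, as a sketch (the host, port, and `/healthz` path are placeholders):

```shell
host=inventory-api.internal; port=8080

dig +short "$host"                          # DNS
ping -c 1 -W 2 "$host"                      # L3: ICMP
nc -zv -w 3 "$host" "$port"                 # L4: TCP SYN to the port
curl -sS -m 5 "http://$host:$port/healthz"  # L7: application (placeholder path)
```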
Let me check for firewalls on both ends.
```shell
# On the source (order-processor)
sudo iptables -L OUTPUT -n --line-numbers | grep -i drop
sudo iptables -L FORWARD -n --line-numbers | grep -i drop

# On the destination (inventory-api)
ssh inventory-api.internal 'sudo iptables -L INPUT -n --line-numbers | grep -i drop'
```
No explicit DROP rules on either host's iptables. But the networking team made a change last night. They might have changed something in the middle — a network firewall, an ACL on the switch, or a security group.
Let me trace the path.
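Both traces, roughly (TCP traceroute needs root for raw sockets):

```shell
# Trace with TCP SYNs to the actual port — follows the same treatment the app traffic gets
sudo traceroute -T -p 8080 inventory-api.internal

# Compare with an ICMP trace to identify the silent hop
sudo traceroute -I inventory-api.internal
```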
Using TCP traceroute to port 8080 specifically. The trace shows hops 1-3 responding, then hop 4 shows * * * — packets are being dropped at hop 4. Let me identify that hop.
ICMP traceroute shows hop 4 as 10.1.0.1 — that's the core router/firewall. So the firewall at the core is dropping TCP port 8080 between these two VLANs.
Mental Model: Compare Working and Broken Paths¶
When one path fails but others succeed, compare them. If order-processor can reach the database (port 5432) but not inventory-api (port 8080), and both destinations sit on the same VLAN, the differentiator is the port or the destination host. Test the same port to a different host, and a different port to the same host, to isolate whether it's a port-based or host-based rule.
Let me test a different port to the same host, and the same port to a different host.
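The differential probes might look like this (the second hostname is a stand-in for any other service on the far VLAN that listens on 8080):

```shell
# Different port, same host
nc -zv -w 3 inventory-api.internal 22

# Same port, different host (placeholder name)
nc -zv -w 3 another-service.internal 8080
```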
SSH (port 22) to inventory-api works. Port 8080 to the database fails — the database doesn't listen on 8080, so a failure is expected either way, though a timeout rather than a refused connection already hints at a drop in the path. Let me try port 8080 to another service that actually listens on it.
Also times out. So port 8080 is blocked between these VLANs regardless of the destination. The networking team's "routine change" likely added or modified an ACL that blocks port 8080.
Let me check if HTTPS (443) works.
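Same probe, port 443:

```shell
nc -zv -w 3 inventory-api.internal 443
```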
Connection succeeded. So port 443 is allowed but port 8080 is blocked. The networking team probably tightened the inter-VLAN ACL to only allow "standard" ports (80, 443, 22) and blocked everything else. Internal microservices using non-standard ports got caught in the crossfire.
Before I call the networking team, let me confirm by checking from a machine on the same VLAN as inventory-api.
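From a box on inventory-api's own VLAN (the health path is a placeholder):

```shell
curl -sS -m 5 http://inventory-api.internal:8080/healthz
```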
Works locally. So the service is fine — it's purely a network path issue between VLANs.
Let me also verify there isn't a workaround. Can order-processor reach inventory-api through the internal load balancer on port 443 instead?
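Testing the LB path from order-processor's side (placeholder path again):

```shell
curl -sS -m 5 https://inventory-api.lb.internal/healthz
```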
Works through the load balancer on 443. The LB terminates TLS and forwards to port 8080 from within inventory-api's VLAN, so the 8080 hop never crosses the inter-VLAN ACL. I can use this as a temporary workaround while the networking team fixes the ACL.
Let me update the order-processor config to use the LB endpoint temporarily.
```shell
kubectl set env deployment/order-processor -n orders \
  INVENTORY_API_URL=https://inventory-api.lb.internal
```
Now let me verify the service is working.
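A quick verification pass (the grep is just a spot check of recent log lines):

```shell
kubectl rollout status deployment/order-processor -n orders
kubectl logs deployment/order-processor -n orders --since=5m | grep -i inventory
```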
Requests to inventory-api are succeeding through the LB. Good. Users are unblocked.
Now I need to file the ticket with the networking team. The key information they need: port 8080 is blocked between VLAN 10 and VLAN 20 (both service VLANs), it started after last night's change, and it affects all inter-VLAN traffic on non-standard ports. They need to add port 8080 (and probably a range of service ports like 8000-9000) to the ACL.
One more thing — I'm going to add a synthetic check that tests inter-VLAN connectivity on our service ports. If the networking team makes another change that breaks us, I want to know within minutes, not when users complain.
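A minimal sketch of that check, assuming bash (for its `/dev/tcp` pseudo-device) and placeholder targets — a real version would feed a metrics or alerting pipeline instead of printing:

```shell
#!/usr/bin/env bash
# Synthetic inter-VLAN connectivity check: short TCP connect per target,
# one "OK host:port" or "FAIL host:port" line each.
# Targets below are placeholders for our inter-VLAN service ports.

probe_tcp() {  # probe_tcp HOST PORT [TIMEOUT_SECS] -> exit 0 iff TCP connect succeeds
  timeout "${3:-3}" bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

check_targets() {
  local t
  for t in "$@"; do
    if probe_tcp "${t%:*}" "${t#*:}" 3; then
      echo "OK $t"
    else
      echo "FAIL $t"
    fi
  done
}

check_targets "inventory-api.internal:8080" "inventory-api.internal:443"
```

Wired into cron or a blackbox exporter, a FAIL line on any service port pages us long before a user notices.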
What Made This Senior-Level¶
| Junior Would... | Senior Does... | Why |
|---|---|---|
| See "timeout" and debug the application | Test layer by layer (ICMP, TCP, HTTP) to isolate the failure to L4 | Layer isolation eliminates whole categories of causes instantly |
| Not think to compare working vs broken paths | Test the same port to different hosts AND different ports to the same host | This comparison reveals whether the block is port-based, host-based, or path-based |
| Wait for the networking team to fix the ACL before unblocking users | Find and implement a workaround (route through the LB on an allowed port) | Users shouldn't wait for a cross-team fix when a workaround exists |
| Just report the issue and move on | Add synthetic monitoring for inter-VLAN connectivity | Prevent the next "routine network change" from causing the same surprise |
Key Heuristics Used¶
- Layer Isolation: Test DNS, then ICMP (L3), then TCP SYN (L4), then application (L7). Each passing layer eliminates a class of root causes.
- Differential Testing: When one path fails, test variations (same port/different host, different port/same host) to isolate the variable causing the failure.
- Workaround Before Root Fix: Find a way to restore service immediately, then pursue the proper fix through the responsible team.
Cross-References¶
- Primer — TCP/IP stack, how firewalls and ACLs work, and VLAN fundamentals
- Street Ops — The networking troubleshooting toolkit (ping, nc, traceroute, tcpdump)
- Footguns — "Routine" network changes breaking non-standard ports and the "ping works so the network is fine" fallacy