
Thinking Out Loud: Networking Troubleshooting

A senior SRE's internal monologue while working through a real networking issue. This isn't a tutorial — it's a window into how experienced engineers actually think.

The Situation

An internal microservice (order-processor) can reach the database and Redis, but HTTP requests to another internal service (inventory-api) are timing out. The two services sit on different VLANs, with a core router/firewall between them. The issue started after a "routine" network change by the networking team last night.

The Monologue

One service can't reach another, but can reach other things. This isn't a total network outage — it's specific to this path. Let me start from the basics and work up the stack.

ping -c 3 inventory-api.internal

Ping works. So we have basic IP connectivity (ICMP). That tells me: DNS resolves, routing works, ARP works at each hop, and the host is reachable at L3. The problem is likely at L4 (TCP) or L7 (HTTP).

nc -zv -w 5 inventory-api.internal 8080

Connection timed out. So TCP to port 8080 is failing even though ICMP works. That's the classic "ping works but TCP doesn't" scenario. The usual suspects: firewall rule blocking the port, the service isn't listening on that port, or there's a stateful firewall issue.

Let me verify the service is listening.

ssh inventory-api.internal 'ss -tlnp | grep 8080'

LISTEN 0 128 0.0.0.0:8080. The service IS listening on 8080. So the packet is being dropped somewhere between the two hosts. This is a firewall or ACL issue.

Mental Model: The Layer Isolation Technique

When diagnosing network connectivity issues, test each layer independently: DNS -> ICMP (L3) -> TCP SYN (L4) -> Application (L7). If ICMP works but TCP doesn't, the problem is a stateful firewall or ACL at L4. If TCP works but the application times out, the problem is L7 (application, proxy, or middleware). Each layer that passes eliminates a class of root causes.
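That laddering can be scripted so it runs the same way every time. A minimal sketch, assuming `getent`, `ping`, `nc`, and `curl` are installed; the host and port at the bottom are just this incident's values:

```shell
#!/usr/bin/env bash
# Layer-isolation probe: run each layer's check independently and report
# PASS/FAIL per layer instead of stopping at the first failure.
probe_layers() {
  local host="$1" port="$2"
  # DNS: does the name resolve at all?
  getent hosts "$host" >/dev/null 2>&1 \
    && echo "DNS : PASS" || echo "DNS : FAIL"
  # L3: basic IP reachability via ICMP.
  ping -c 1 -W 2 "$host" >/dev/null 2>&1 \
    && echo "L3  : PASS (ICMP)" || echo "L3  : FAIL (ICMP)"
  # L4: does a TCP SYN to the port complete the handshake?
  nc -z -w 2 "$host" "$port" >/dev/null 2>&1 \
    && echo "L4  : PASS (TCP $port)" || echo "L4  : FAIL (TCP $port)"
  # L7: does the application answer an HTTP request?
  curl -s -o /dev/null --connect-timeout 2 "http://$host:$port/" \
    && echo "L7  : PASS (HTTP)" || echo "L7  : FAIL (HTTP)"
}

probe_layers inventory-api.internal 8080
```

Reading top to bottom, the first FAIL line names the layer where the investigation should focus.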

Let me check for firewalls on both ends.

# On the source (order-processor) — locally generated traffic traverses OUTPUT
sudo iptables -L OUTPUT -n --line-numbers | grep -iE 'drop|reject'

# On the destination (inventory-api) — inbound traffic traverses INPUT
ssh inventory-api.internal 'sudo iptables -L INPUT -n --line-numbers | grep -iE "drop|reject"'

No DROP or REJECT rules in either host's iptables. But the networking team made a change last night. They might have changed something in the middle — a network firewall, an ACL on the switch, or a security group.

Let me trace the path.

traceroute -T -p 8080 inventory-api.internal

Using TCP traceroute to port 8080 specifically. The trace shows hops 1-3 responding, then nothing but * * * from hop 4 onward — the probes are being dropped at or just past hop 4. Let me identify that hop.

traceroute inventory-api.internal

ICMP traceroute shows hop 4 as 10.1.0.1 — that's the core router/firewall. So the firewall at the core is dropping TCP port 8080 between these two VLANs.

Mental Model: Compare Working and Broken Paths

When one path fails but others succeed, compare them. If order-processor can reach the database (port 5432) but not inventory-api (port 8080), and both destinations sit across the same VLAN boundary, the differentiator is the port or the destination host. Test the same port to a different host, and a different port to the same host, to isolate whether it's a port-based or host-based rule.
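The matrix can be sketched as a small loop — the host:port pairs are this incident's, and `nc` is assumed to be installed:

```shell
# Differential test matrix: vary host and port independently to see
# whether the block follows the port or the host.
check_matrix() {
  local pair host port
  for pair in "$@"; do
    host=${pair%:*}
    port=${pair#*:}
    if nc -z -w 3 "$host" "$port" 2>/dev/null; then
      echo "$pair open"
    else
      echo "$pair blocked/closed"
    fi
  done
}

check_matrix \
  inventory-api.internal:8080 \
  inventory-api.internal:22 \
  user-service.internal:8080 \
  database.internal:5432
```

A failure in one cell whose row and column neighbors succeed localizes the rule to that port or that host.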

Let me test a different port to the same host, and the same port to a different host.

nc -zv -w 5 inventory-api.internal 22
nc -zv -w 5 database.internal 8080

SSH (port 22) to inventory-api works. Port 8080 to database fails — and the failure mode matters: "Connection refused" would mean the packet reached the host (path open, nothing listening), while a timeout means it was dropped en route. Either way, the database doesn't listen on 8080, so let me try port 8080 to another service that actually listens on it.

nc -zv -w 5 user-service.internal 8080

Also times out. So port 8080 is blocked between these VLANs regardless of the destination. The networking team's "routine change" likely added or modified an ACL that blocks port 8080.

Let me check if HTTPS (443) works.

nc -zv -w 5 inventory-api.internal 443

Connection succeeded. So port 443 is allowed but port 8080 is blocked. The networking team probably tightened the inter-VLAN ACL to only allow "standard" ports (80, 443, 22) and blocked everything else. Internal microservices using non-standard ports got caught in the crossfire.

Before I call the networking team, let me confirm the service itself is healthy by hitting it locally on its own host.

ssh inventory-api.internal 'curl -v --connect-timeout 5 http://localhost:8080/health'

Works locally. So the service is fine — it's purely a network path issue between VLANs.

Let me also verify there isn't a workaround. Can order-processor reach inventory-api through the internal load balancer on port 443 instead?

curl -v --connect-timeout 5 https://inventory-api.lb.internal/health

Works through the load balancer on 443. The LB terminates TLS and forwards to port 8080 from inside inventory-api's VLAN, so that final hop never crosses the inter-VLAN ACL. I can use this as a temporary workaround while the networking team fixes the ACL.

Let me update the order-processor config to use the LB endpoint temporarily.

kubectl set env deployment/order-processor -n orders INVENTORY_API_URL=https://inventory-api.lb.internal

Now let me verify the service is working.

kubectl logs deployment/order-processor -n orders --tail=10

Requests to inventory-api are succeeding through the LB. Good. Users are unblocked.

Now I need to file the ticket with the networking team. The key information they need: port 8080 is blocked between the two service VLANs (VLAN 10 and VLAN 20), the block started after last night's change, and it affects all inter-VLAN traffic on non-standard ports. They need to add port 8080 (and probably a range of service ports like 8000-9000) to the ACL.

One more thing — I'm going to add a synthetic check that tests inter-VLAN connectivity on our service ports. If the networking team makes another change that breaks us, I want to know within minutes, not when users complain.
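A minimal sketch of such a synthetic check, suitable for cron or a monitoring sidecar — the endpoint list and the alert action are placeholders, and `nc` is assumed:

```shell
# Synthetic inter-VLAN connectivity check: probe each host:port pair we
# depend on across the VLAN boundary and flag any that are unreachable.
check_endpoints() {
  local pair host port failed=0
  for pair in "$@"; do
    host=${pair%:*}
    port=${pair#*:}
    if ! nc -z -w 3 "$host" "$port" 2>/dev/null; then
      echo "ALERT: cannot reach $pair" >&2
      failed=1
    fi
  done
  return $failed
}

# Service ports we depend on across VLANs (illustrative list).
check_endpoints \
  inventory-api.internal:8080 \
  user-service.internal:8080 \
  database.internal:5432 \
  || echo "inter-VLAN connectivity check failed; page the on-call"
```

Run every minute with the fallback action wired to the paging system, this turns the next silent ACL change into a page within minutes instead of a user complaint.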

What Made This Senior-Level

| Junior would... | Senior does... | Why |
| --- | --- | --- |
| See "timeout" and debug the application | Test layer by layer (ICMP, TCP, HTTP) to isolate the failure to L4 | Layer isolation eliminates whole categories of causes instantly |
| Not think to compare working vs. broken paths | Test the same port to different hosts AND different ports to the same host | This comparison reveals whether the block is port-based, host-based, or path-based |
| Wait for the networking team to fix the ACL before unblocking users | Find and implement a workaround (route through the LB on an allowed port) | Users shouldn't wait for a cross-team fix when a workaround exists |
| Just report the issue and move on | Add synthetic monitoring for inter-VLAN connectivity | Prevent the next "routine network change" from causing the same surprise |

Key Heuristics Used

  1. Layer Isolation: Test DNS, then ICMP (L3), then TCP SYN (L4), then application (L7). Each passing layer eliminates a class of root causes.
  2. Differential Testing: When one path fails, test variations (same port/different host, different port/same host) to isolate the variable causing the failure.
  3. Workaround Before Root Fix: Find a way to restore service immediately, then pursue the proper fix through the responsible team.

Cross-References

  • Primer — TCP/IP stack, how firewalls and ACLs work, and VLAN fundamentals
  • Street Ops — The networking troubleshooting toolkit (ping, nc, traceroute, tcpdump)
  • Footguns — "Routine" network changes breaking non-standard ports and the "ping works so the network is fine" fallacy