Networking - Street Ops¶

What experienced network troubleshooters know.

The Troubleshooting Playbook¶

Step 1: Isolate the Layer¶

Most people jump to "restart the service." Experienced engineers isolate the layer first:

# L1: Physical link
ethtool eth0 | grep -E "Speed|Link detected"
# No link? Bad cable, bad port, bad NIC, or port shut on switch

# L2: ARP / MAC
ip neigh show
arping -c 3 -I eth0 192.168.1.1
# No ARP reply? Wrong VLAN, STP blocking, or MAC filtering

# L3: IP routing
ip route get 10.20.30.40
ping -c 3 -W 2 10.20.30.40
traceroute -n 10.20.30.40
# No route? Check default gateway. Asymmetric path? Check return route.

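# L4: Port reachability
ss -tlnp | grep :8080                                  # on the server: is anything listening?
curl -v --connect-timeout 5 http://10.20.30.40:8080    # from the client
# Connection refused? Nothing is listening on that port. Timeout? Packets are being dropped (firewall, route, host down).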

# DNS
dig @127.0.0.1 myservice.example.com
dig @8.8.8.8 myservice.example.com
# Different answers? Local resolver is stale or misconfigured.

Remember: "Connection refused" (TCP RST) and "Connection timed out" mean completely different things. Refused = the host is reachable but nothing is listening. Timeout = packets are being silently dropped (firewall, wrong route, host down). This single distinction saves 30 minutes of misdirected troubleshooting.
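The refused-versus-timeout distinction can be scripted. The sketch below maps curl's exit status to a likely cause. classify_curl_exit is a hypothetical helper name and the diagnosis strings are this example's own wording, not curl output. Exit codes 6, 7, and 28 are the ones curl documents for a DNS failure, a failed connect, and a timeout; code 7 also covers ICMP unreachable, not only a TCP RST.

```shell
# Map a curl exit status ($1) to the most likely network-level cause.
# Hypothetical helper: the labels before the colon are made up for this sketch.
classify_curl_exit() {
    case "$1" in
        0)  echo "connected: L3 and L4 are fine, look at the application" ;;
        6)  echo "dns: could not resolve the host name" ;;
        7)  echo "refused-or-unreachable: host answered with RST or ICMP unreachable" ;;
        28) echo "timeout: packets silently dropped (firewall, wrong route, host down)" ;;
        *)  echo "other: curl exit $1, see the EXIT CODES section of man curl" ;;
    esac
}

# Usage (target address is the example one from the playbook):
#   curl -s -o /dev/null --connect-timeout 5 http://10.20.30.40:8080
#   classify_curl_exit $?
```

A "timeout" result points at firewalls and routing; a "refused" result points at the service itself.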

Step 2: Common Patterns¶

"It works from one server but not another"
- Different VLAN? Check ip addr - are they on the same subnet?
- Firewall rule applies to one source IP but not the other
- Asymmetric routing: traffic goes out one path and comes back another, so a stateful firewall drops it

"It worked yesterday"
- What changed? Check switch logs, firewall rule changes, DHCP lease expiry
- DNS TTL expired and the record changed
- Certificate expired (TLS)

"It's slow"
- MTU mismatch: ping -M do -s 1472 destination (if it fails, the path MTU is smaller than 1500)
- Duplex mismatch: ethtool eth0 - should report Full, not Half
- Packet loss: mtr -n destination - look for loss that starts at one hop and persists through the final hop; loss at a single intermediate hop only is usually ICMP rate limiting
- DNS slow: dig +stats example.com - check the "Query time" line

"Intermittent connectivity"
- NIC flapping: dmesg -T | grep -i "link" - look for repeated up/down cycles
- STP reconvergence: happens when the topology changes (cable pulled, switch rebooted)
- ARP table overflow: unlikely, but it happens in flat L2 networks with thousands of hosts
- Duplicate IP: arping -D -I eth0 <IP> - if you get a reply, another host is using that IP

Gotchas That Burn People¶

MTU Black Holes¶

The TCP handshake succeeds (small SYN/ACK packets get through), but large data transfers stall silently. This happens when three things combine: the path has a lower MTU than expected, the DF (Don't Fragment) bit is set, and a firewall blocks the ICMP "fragmentation needed" messages that would tell the sender to shrink its packets.

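Diagnose: run ping -M do -s 1472 <dest> and decrease the size until it works. The largest payload that gets through is your path MTU minus 28 bytes (20-byte IPv4 header plus 8-byte ICMP header). Fix: lower the interface or tunnel MTU to match, clamp the TCP MSS at the tunnel endpoint, or allow ICMP "fragmentation needed" (type 3, code 4) through the firewall.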
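Decreasing the ping size by hand is a search problem, so it can be automated. The following is a minimal sketch of a binary search for the largest payload that passes with DF set. The names find_path_mtu and probe_ping are hypothetical, and the 1472-byte upper bound assumes a standard 1500-byte Ethernet MTU on the first hop. The ping flags are the Linux iputils ones already used in this guide.

```shell
# Default probe: one DF-flagged ping of the given payload size.
# Called as: probe_ping <dest> <payload_bytes>; returns 0 if a reply came back.
probe_ping() {
    ping -M do -c 1 -W 2 -s "$2" "$1" >/dev/null 2>&1
}

# Binary-search the largest ICMP payload that reaches <dest>, then print the
# implied path MTU. $1 = destination, $2 = optional probe command.
# The body runs in a subshell so its variables do not leak into the caller.
find_path_mtu() (
    dest=$1
    probe=${2:-probe_ping}
    lo=0        # largest payload known to pass (0 = none confirmed yet)
    hi=1472     # assumption: 1500-byte MTU minus 28 bytes of headers
    while [ "$lo" -lt "$hi" ]; do
        mid=$(( (lo + hi + 1) / 2 ))
        if "$probe" "$dest" "$mid"; then
            lo=$mid
        else
            hi=$(( mid - 1 ))
        fi
    done
    if [ "$lo" -eq 0 ]; then
        echo "no probe succeeded; is $dest reachable at all?" >&2
        exit 1
    fi
    # Add back the 20-byte IPv4 header and the 8-byte ICMP header.
    echo $(( lo + 28 ))
)
```

Running find_path_mtu 10.20.30.40 uses the real ping probe and takes about a dozen probes. Passing a different probe function as the second argument is only there so the search logic can be exercised without touching the network.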

Debug clue: MTU blackholes are especially common with VPNs, VXLAN overlays, and GRE tunnels — anything that adds encapsulation overhead. If "SSH works but SCP stalls" or "small API calls succeed but file uploads fail," suspect MTU immediately.

DNS Caching Layers¶

There are typically 3-4 layers of DNS caching between your app and the authoritative server:
1. Application cache (JVM, browser)
2. OS resolver cache (systemd-resolved, nscd)
3. Local DNS server (dnsmasq, CoreDNS, corporate resolver)
4. Upstream resolver (ISP, 8.8.8.8)

When you change a DNS record and "it's not working," the old record is most likely still cached at one of these layers. Check each one. dig @<server> lets you query each resolver layer specifically; application caches have to be checked in the application itself.
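Querying each resolver layer by hand gets tedious, so here is a small sketch that asks several servers for the same name and prints the answers side by side. compare_dns_layers, resolve_at, and the RESOLVE_CMD override are hypothetical names introduced for this example; the only real tool involved is dig.

```shell
# Default lookup: A records for <name> from <server>, sorted, on one line.
# Called as: resolve_at <server> <name>
resolve_at() {
    dig +short "@$1" "$2" A | sort | tr '\n' ' '
}

# Print one "<server> <answers>" line per resolver.
# $1 = name to look up, remaining arguments = resolver addresses.
# RESOLVE_CMD (hypothetical hook) can name a replacement for resolve_at.
compare_dns_layers() (
    name=$1
    shift
    for server in "$@"; do
        answer=$(${RESOLVE_CMD:-resolve_at} "$server" "$name")
        printf '%-16s %s\n' "$server" "$answer"
    done
)

# Usage:
#   compare_dns_layers myservice.example.com 127.0.0.1 8.8.8.8
```

If the lines disagree, the resolver still showing the old answer is the stale layer. In dig's full (non-short) output, the TTL column on a cached answer counts down, which tells you how long that layer will keep serving it.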

Remember: "TTL = Time To Lie." The record may have changed at the source, but every cache in the chain will serve the old answer until its copy's TTL expires. When planning a DNS migration, lower the TTL to 60s at least one full old-TTL period before the cutover (48 hours ahead covers most records), then raise it back afterward.

Private RFC1918 Overlap¶

10.0.0.0/8, 172.16.0.0/12, and 192.168.0.0/16 are the private ranges. When you VPN into a corporate network and your home network uses the same subnet as a remote resource, the directly connected local route wins and traffic goes to your home LAN instead of through the VPN. This is why corporate networks should avoid the subnets that home routers commonly default to, such as 192.168.0.0/24 and 192.168.1.0/24.
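An overlap like this can be checked before it bites. The sketch below does the subnet arithmetic in plain shell; ip_to_int, ip_in_cidr, and cidrs_overlap are hypothetical helper names, and the code assumes well-formed IPv4 input without leading zeros in the octets.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
# Runs in a subshell so the IFS change stays local.
ip_to_int() (
    IFS=.
    set -- $1
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
)

# Succeed if address $1 falls inside CIDR block $2 (e.g. 192.168.1.0/24).
ip_in_cidr() (
    ip=$(ip_to_int "$1")
    net=$(ip_to_int "${2%/*}")
    bits=${2#*/}
    mask=$(( (4294967295 << (32 - bits)) & 4294967295 ))
    [ $(( ip & mask )) -eq $(( net & mask )) ]
)

# Succeed if two CIDR blocks overlap: the network address of the more
# specific block must fall inside the less specific one.
cidrs_overlap() (
    a_bits=${1#*/}
    b_bits=${2#*/}
    if [ "$a_bits" -le "$b_bits" ]; then
        ip_in_cidr "${2%/*}" "$1"
    else
        ip_in_cidr "${1%/*}" "$2"
    fi
)

# Usage: compare the home LAN against a route pushed by the VPN.
#   cidrs_overlap 192.168.1.0/24 192.168.0.0/16 && echo "overlap: expect routing trouble"
```

The subnets to compare come from ip route show on the client and from whatever routes the VPN pushes.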

Half-Open TCP Connections¶

A client thinks a connection is established but the server has no record of it (server crashed, firewall dropped the connection silently). The client keeps sending on what it thinks is an open connection, gets no response, eventually times out.

Symptom: ss shows ESTABLISHED connections with no data flowing. Fix: application-level keepalives (TCP keepalive alone isn't enough - default is 2 hours).

Default trap: Linux TCP keepalive defaults are tcp_keepalive_time=7200 (2 hours of idle time before the first probe), tcp_keepalive_intvl=75 (75s between probes), and tcp_keepalive_probes=9. An idle, dead connection therefore takes about 7200 + 9 * 75 = 7875 seconds (roughly 2 hours 11 minutes) to detect. Cloud load balancers with idle timeouts (AWS ALB defaults to 60s) silently drop your connections long before the first keepalive probe fires.
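The detection time follows directly from the three sysctls, so it is easy to compute for any tuning you are considering. This is a minimal sketch; both function names are hypothetical, and the formula is the usual approximation (idle time plus one interval per unanswered probe).

```shell
# Approximate seconds before the kernel gives up on an idle, dead peer.
# $1 = tcp_keepalive_time, $2 = tcp_keepalive_intvl, $3 = tcp_keepalive_probes
keepalive_detect_seconds() {
    echo $(( $1 + $2 * $3 ))
}

# Same calculation using the values currently set on this Linux host.
current_keepalive_detect_seconds() {
    keepalive_detect_seconds \
        "$(cat /proc/sys/net/ipv4/tcp_keepalive_time)" \
        "$(cat /proc/sys/net/ipv4/tcp_keepalive_intvl)" \
        "$(cat /proc/sys/net/ipv4/tcp_keepalive_probes)"
}

# Usage:
#   keepalive_detect_seconds 7200 75 9   # the Linux defaults
#   keepalive_detect_seconds 30 10 3     # an example of a much tighter tuning
```

Keep in mind that these timers only apply to sockets that have SO_KEEPALIVE enabled.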

Under the hood: TCP keepalive (SO_KEEPALIVE) is an OS-level mechanism, enabled per socket, that sends empty ACK probes on idle connections. Cloud load balancers (default idle timeouts: AWS ALB 60s, Azure LB 4 min, GCP 10 min) drop idle connections on their own schedule, and with the default timers the first probe arrives far too late to matter. An L7 proxy may also not count an empty probe as activity. Your app must send real data or HTTP/2 PING frames more often than the LB idle timeout, or connections die silently.

tcpdump Recipes¶

# See all traffic on an interface
tcpdump -i eth0 -nn

# DNS queries
tcpdump -i eth0 -nn port 53

# HTTP traffic to a specific host
tcpdump -i eth0 -nn host 10.20.30.40 and port 80

# Packets with SYN set: connection attempts and their SYN-ACK replies
tcpdump -i eth0 -nn 'tcp[tcpflags] & tcp-syn != 0'

# All ICMP, including errors (reveals firewalls, PMTUD issues, unreachable hosts)
tcpdump -i eth0 -nn icmp

# ARP traffic
tcpdump -i eth0 -nn arp

# Write to file for Wireshark analysis
tcpdump -i eth0 -w /tmp/capture.pcap -c 1000

# Read back from file
tcpdump -r /tmp/capture.pcap -nn

One-liner: a quick capture that shows only packets with SYN or RST set (connection attempts and resets): tcpdump -i eth0 -nn 'tcp[tcpflags] & (tcp-syn|tcp-rst) != 0'. This filters out the data noise and shows exactly which connections are being opened or rejected.

Linux Network Configuration Commands¶

# Show interfaces and IPs
ip addr show
ip -4 addr show           # IPv4 only
ip link show               # Link layer info

# Show routing table
ip route show
ip route get 10.20.30.40   # Which route would be used?

# Show ARP/neighbor table
ip neigh show

# Add a static route
ip route add 10.20.0.0/16 via 192.168.1.1

# Show listening ports
ss -tlnp                   # TCP listening, numeric, process
ss -ulnp                   # UDP listening
ss -s                      # Summary statistics

# DNS resolution
dig example.com            # Full query
dig +short example.com     # Just the answer
dig @8.8.8.8 example.com   # Query specific server
dig -x 1.2.3.4             # Reverse lookup

# Continuous traceroute
mtr -n 10.20.30.40

# NetworkManager
nmcli device status        # All interfaces
nmcli connection show      # All connections
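nmcli connection modify eth0 ipv4.addresses 192.168.1.100/24   # "eth0" is the connection name here, not the device
nmcli connection up eth0   # re-activate the connection to apply the change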

Gotcha: ip addr add and ip route add changes are not persistent; they vanish on reboot. For permanent changes, use nmcli connection modify (RHEL/Fedora), netplan (Ubuntu), or /etc/network/interfaces (Debian). This catches people who fix a routing issue at 3 AM, only to have it break again at the next reboot.

Quick Reference¶

Cheatsheet: Networking
Deep Dive: TCP/IP Deep Dive