Networking Troubleshooting Street Ops

The Systematic Debug Workflow

When someone says "I can't reach X," run this flowchart:

Step 1: Is it DNS?
  dig <hostname>
  dig @8.8.8.8 <hostname>   # bypass local resolver
  If no resolution -> DNS problem. Stop here and fix DNS.
  If resolves to wrong IP -> DNS problem (stale record, wrong zone).
  If resolves correctly -> continue.

Step 2: Is it routing?
  ping <ip>                  # can you reach the IP at all?
  traceroute <ip>            # where do packets stop?
  ip route get <ip>          # what route would be used?
  If ping fails -> check local routing and firewalls. ICMP may simply be filtered, so confirm with step 3.
  If traceroute dies at a hop -> routing/firewall issue at that hop.
  If ping works -> continue.

Step 3: Is it the port/firewall?
  nc -zv <ip> <port>         # TCP connection test
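  ss -tlnp | grep <port>     # on the target: is anything listening?
  sudo iptables -L -n -v     # on the target: check local firewall rules
  If connection refused -> host reachable, but nothing is listening on that port (or a firewall rejected it).
  If timeout -> packets are being dropped silently, usually by a firewall.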
  If connected -> continue.

Step 4: Is it the application?
  curl -v http://<ip>:<port>/   # test HTTP
  openssl s_client -connect <ip>:443   # test TLS
  If HTTP error -> application problem.
  If TLS error -> certificate or TLS config problem.
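
The four steps can be chained into one helper. This is a minimal sketch rather than a definitive tool: the function name triage, the timeouts, and the return codes are assumptions made for illustration, and it only covers plain HTTP over IPv4. It treats a failed ping as a hint, not proof, because ICMP is often filtered.

```bash
#!/usr/bin/env bash
# triage HOST PORT: walk DNS -> routing -> port -> application and report
# the first layer that fails. Name, timeouts and return codes are illustrative.
triage() {
  local host=$1 port=$2 ip ping_ok=yes

  # Step 1: DNS. Keep the first IPv4 address dig returns (skips CNAME lines).
  ip=$(dig +short "$host" A | grep -E '^[0-9]+(\.[0-9]+){3}$' | head -n 1)
  if [ -z "$ip" ]; then
    echo "dns: $host did not resolve"
    return 1
  fi

  # Step 2: routing. A lost ping is only a hint, because ICMP may be filtered.
  ping -c 2 -W 2 "$ip" >/dev/null 2>&1 || ping_ok=no

  # Step 3: port. nc -z only attempts the TCP handshake.
  if ! nc -z -w 3 "$ip" "$port" >/dev/null 2>&1; then
    if [ "$ping_ok" = no ]; then
      echo "routing: $ip unreachable by ping and TCP"
      return 2
    fi
    echo "port: $ip answers ping but $port does not connect"
    return 3
  fi

  # Step 4: application. --fail makes curl return non-zero on HTTP >= 400.
  if ! curl -s -o /dev/null --fail --max-time 5 "http://$ip:$port/"; then
    echo "app: TCP connects to $ip:$port but the HTTP request fails"
    return 4
  fi
  echo "ok: $host ($ip) port $port answers HTTP"
}
```

Usage: triage example.com 80. The prefix of the output line (dns, routing, port, app) names the step to dig into next.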

DNS Debugging

Quick checks

dig example.com                  # default resolver
dig @8.8.8.8 example.com        # Google DNS (bypass local)
dig +short example.com           # just the answer
dig +trace example.com           # full delegation chain
dig example.com MX               # specific record type
dig -x 1.2.3.4                  # reverse DNS

Common DNS failure modes

Resolution works from one place but not another:
- Check /etc/resolv.conf -- different nameservers configured.
- Check /etc/nsswitch.conf -- order of resolution (files, dns).
- Check /etc/hosts -- local override.
- If using systemd-resolved: resolvectl status.

Intermittent DNS failures:
- DNS server overloaded or flapping.
- MTU issues causing large DNS responses to be dropped (TCP fallback failing).
- Firewall blocking DNS responses (check both UDP/53 and TCP/53).

DNS resolves but to wrong IP:
- Stale DNS cache: run resolvectl flush-caches (systemd-resolve --flush-caches on older releases) or restart systemd-resolved.
- Wrong DNS zone or record.
- CDN/anycast returning different IPs based on location (expected behavior).
- /etc/hosts override taking precedence.

Check resolution order

getent hosts example.com    # uses nsswitch.conf order (may check /etc/hosts first)
dig example.com             # queries DNS directly (skips /etc/hosts)
# If getent and dig return different results, check /etc/hosts and nsswitch.conf
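
The comparison between getent and dig can be automated. The sketch below is illustrative only: the function name compare_resolution and the output wording are assumptions, and it looks at IPv4 answers only. It uses getent ahostsv4 for the NSS view and dig +short for the DNS view, and reports a mismatch when the address applications would use is not among the DNS answers.

```bash
# compare_resolution HOST: compare what NSS (getent) and DNS (dig) return.
# Function name and messages are illustrative assumptions.
compare_resolution() {
  local host=$1 nss dns dns_list
  # First IPv4 address as applications see it (honours nsswitch.conf and /etc/hosts).
  nss=$(getent ahostsv4 "$host" | awk '{print $1; exit}')
  # All IPv4 addresses straight from DNS; CNAME lines are filtered out.
  dns=$(dig +short "$host" A | grep -E '^[0-9]+(\.[0-9]+){3}$')
  dns_list=$(printf '%s' "$dns" | tr '\n' ',')

  if [ -z "$nss" ] && [ -z "$dns" ]; then
    echo "unresolved: neither NSS nor DNS knows $host"
  elif [ -n "$nss" ] && printf '%s\n' "$dns" | grep -qxF "$nss"; then
    echo "consistent: $nss"
  else
    echo "mismatch: NSS=${nss:-none} DNS=${dns_list:-none}; check /etc/hosts and nsswitch.conf"
  fi
}
```

Usage: compare_resolution example.com. A mismatch with NSS pointing at 127.0.0.1 or a private address usually means an /etc/hosts entry is overriding DNS.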

Reading tcpdump Output

Essential captures

# Capture traffic to/from a host on a specific port
tcpdump -i any host 10.0.0.5 and port 443 -nn

# Capture DNS traffic
tcpdump -i any port 53 -nn

# Capture with full packet content (for HTTP debugging)
tcpdump -i any host 10.0.0.5 and port 80 -nn -A -s0

# Write to file for Wireshark analysis
tcpdump -i any host 10.0.0.5 -nn -w /tmp/capture.pcap

# Capture initial SYN packets only (connection attempts, excluding SYN-ACK replies)
tcpdump -i any 'tcp[tcpflags] & (tcp-syn|tcp-ack) == tcp-syn' -nn

Reading the output

14:23:01.123456 IP 10.0.0.1.54321 > 10.0.0.2.443: Flags [S], seq 12345, win 65535
14:23:01.123789 IP 10.0.0.2.443 > 10.0.0.1.54321: Flags [S.], seq 67890, ack 12346, win 65535
14:23:01.124000 IP 10.0.0.1.54321 > 10.0.0.2.443: Flags [.], ack 67891, win 65535

Flag meanings:
- [S] = SYN (connection initiation)
- [S.] = SYN-ACK (connection accepted)
- [.] = ACK
- [P.] = PSH-ACK (data push)
- [F.] = FIN-ACK (connection close)
- [R.] = RST-ACK (abrupt reset; also the normal reply to a SYN sent to a port with nothing listening)
- [R] = RST without ACK (reset sent in reply to a segment for a connection the sender does not know about)

What the TCP handshake tells you:

SYN ->          (client initiates)
     <- SYN-ACK (server accepts)
ACK ->          (connection established)

If you see SYN but no SYN-ACK:
  - Packet not reaching server (routing/firewall)
  - Server is dropping the SYN (firewall, full backlog)

If you see SYN and get RST:
  - Nothing listening on that port
  - Or a firewall rule rejecting with a TCP reset

If you see SYN-ACK but no ACK from client:
  - Client-side firewall blocking the response
  - Asymmetric routing (SYN goes one path, SYN-ACK comes back another)
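
Spotting a SYN with no SYN-ACK by eye gets hard in a busy capture. The following sketch scans saved tcpdump -nn text output and prints every client/server pair that sent a SYN and never received a SYN-ACK. The function name unanswered_syns is an assumption, and the parsing relies on the line layout shown above (source, ">", destination, then "Flags").

```bash
# unanswered_syns: read `tcpdump -nn` text on stdin, print "client > server"
# for every SYN that never got a SYN-ACK. Name and layout are illustrative.
unanswered_syns() {
  awk '
    {
      # Locate the "Flags" token so extra columns (e.g. from -i any) do not matter.
      for (i = 4; i < NF; i++) if ($i == "Flags") break
      if (i >= NF) next                      # not a TCP line
      src = $(i - 3)
      dst = $(i - 1); sub(/:$/, "", dst)     # drop the trailing colon
      flags = $(i + 1); sub(/,$/, "", flags) # drop the trailing comma
      if (flags == "[S]")  pending[src " > " dst] = 1
      if (flags == "[S.]") delete pending[dst " > " src]
    }
    END { for (k in pending) print k }
  ' | sort
}
```

Usage: tcpdump -i any port 443 -nn -c 500 | unanswered_syns. Pairs that show up repeatedly point at a drop on the path or on the server, as described above.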

MTU Issues

Symptoms

Small requests work, large requests fail or hang.
SSH works, SCP/SFTP of large files hangs.
VPN/tunnel connections have mysterious failures.
ping -M do -s 1472 <host> fails but ping <host> works.

Diagnosis

# Test MTU by sending different-sized pings with Don't Fragment flag
ping -M do -s 1472 <host>    # 1472 payload + 28 bytes of IP/ICMP headers = 1500 (standard MTU)
ping -M do -s 1400 <host>    # try smaller if 1472 fails
# Binary search to find the working MTU

# Check interface MTU
ip link show <interface>      # look for mtu value

# Common MTU values
# 1500 = standard Ethernet
# 1476 = typical inside a GRE tunnel (1500 minus 24 bytes of GRE/IP overhead)
# 1420 = typical inside a WireGuard tunnel
# 9000 = jumbo frames (datacenter)
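
The binary search mentioned above can be scripted. This is a sketch under stated assumptions: find_max_payload and probe_ping are hypothetical names, the search range defaults to 0..1472 bytes, and the probe is passed in by name so it can be swapped out. It returns the largest ICMP payload that passes with Don't Fragment set; add 28 to get the path MTU over IPv4.

```bash
# probe_ping HOST SIZE: succeeds if an unfragmented ICMP payload of SIZE bytes
# gets a reply from HOST (Linux iputils ping syntax).
probe_ping() { ping -M do -c 1 -W 1 -s "$2" "$1" >/dev/null 2>&1; }

# find_max_payload HOST [PROBE_FN] [LOW] [HIGH]
# Binary search for the largest payload size the probe accepts.
# Prints 0 if even the smallest probe fails (host down or ICMP filtered).
find_max_payload() {
  local host=$1 probe=${2:-probe_ping} lo=${3:-0} hi=${4:-1472} mid
  while [ "$lo" -lt "$hi" ]; do
    mid=$(( (lo + hi + 1) / 2 ))     # round up so the loop always makes progress
    if "$probe" "$host" "$mid"; then
      lo=$mid                        # mid fits: search the upper half
    else
      hi=$(( mid - 1 ))              # mid is too big: search the lower half
    fi
  done
  echo "$lo"
}
```

Usage: echo "path MTU: $(( $(find_max_payload <host>) + 28 ))". The search needs about 11 probes for the default range, so it finishes in a few seconds even when most probes time out.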

Fix

# Temporary
ip link set <interface> mtu 1400

# Permanent (depends on distro)
# /etc/sysconfig/network-scripts/ifcfg-<iface>: MTU=1400
# /etc/netplan/*.yaml: mtu: 1400
# nmcli: nmcli connection modify <conn> 802-3-ethernet.mtu 1400

# Enable Path MTU Discovery (should be on by default)
sysctl net.ipv4.ip_no_pmtu_disc=0

Connection State Diagnosis with ss

# Show all TCP connections with state
ss -tan

# Show listening sockets with process names
ss -tlnp

# Show established connections
ss -tn state established

# Show connections to a specific port
ss -tn dport = :443

# Show connection counts by state (NR>1 skips the header line)
ss -tan | awk 'NR>1 {print $1}' | sort | uniq -c | sort -rn

# Show socket memory usage
ss -tm

Connection states to watch for

Many TIME_WAIT (printed by ss as TIME-WAIT): Normal for busy servers. Sockets waiting to close cleanly. Usually harmless unless they exhaust the ephemeral port range. net.ipv4.tcp_tw_reuse=1 lets the kernel reuse them, but only for outgoing connections.

Many CLOSE_WAIT: The remote end closed the connection but the local application hasn't called close(). This is an application bug. The application is leaking connections.

Many SYN_RECV: Connection backlog is filling up. Either a SYN flood attack or the server can't accept connections fast enough. Check net.core.somaxconn and net.ipv4.tcp_max_syn_backlog.

Many ESTABLISHED with no traffic: Idle connections being held open. Check keepalive settings.
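
The state checks above can be turned into a quick filter over ss -tan output. This is a rough sketch: the function name flag_states and the default threshold of 100 are arbitrary assumptions, and the right limit depends on the workload. Note that ss prints state names with hyphens (CLOSE-WAIT, SYN-RECV, TIME-WAIT).

```bash
# flag_states [LIMIT]: read `ss -tan` output on stdin and flag states whose
# count exceeds LIMIT. Name and default threshold are illustrative.
flag_states() {
  local limit=${1:-100}
  awk -v limit="$limit" '
    NR > 1 { n[$1]++ }               # skip the header, count each state
    END {
      if (n["CLOSE-WAIT"] + 0 > limit + 0)
        print "CLOSE-WAIT=" n["CLOSE-WAIT"] " application is probably leaking sockets"
      if (n["SYN-RECV"] + 0 > limit + 0)
        print "SYN-RECV=" n["SYN-RECV"] " accept backlog pressure or SYN flood"
      if (n["TIME-WAIT"] + 0 > limit + 0)
        print "TIME-WAIT=" n["TIME-WAIT"] " normal unless ephemeral ports run out"
    }
  '
}
```

Usage: ss -tan | flag_states 500. No output means none of the three states is above the limit.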

Firewall Debugging

iptables (legacy but still common)

# List all rules with packet counts
sudo iptables -L -n -v
sudo iptables -L -n -v -t nat    # NAT rules

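# Watch drop counters change in real time
sudo watch -n1 'iptables -L -n -v | grep DROP'
# Or add a LOG rule just before the DROP rule. Note that -I INPUT with no
# rule number inserts at the top of the chain and logs every inbound packet:
sudo iptables -I INPUT -j LOG --log-prefix "IPT-DROP: "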

nftables (modern replacement)

sudo nft list ruleset

firewalld (RHEL/CentOS)

sudo firewall-cmd --list-all
sudo firewall-cmd --list-ports
sudo firewall-cmd --add-port=8080/tcp    # temporary
sudo firewall-cmd --add-port=8080/tcp --permanent && sudo firewall-cmd --reload

ufw (Ubuntu)

sudo ufw status verbose
sudo ufw allow 8080/tcp

The "is it the firewall?" test

# Temporarily disable firewall to test
sudo iptables -F         # flush all rules (DANGEROUS on remote servers)
# OR add a specific allow:
sudo iptables -I INPUT -p tcp --dport <port> -j ACCEPT
# Test. If it works -> firewall was blocking. Re-enable and add proper rule.

Never flush iptables on a remote server unless you have out-of-band access (console, IPMI). If default policy is DROP and you flush, you lose SSH.

Default trap: iptables -F flushes rules but does NOT reset the default chain policy. If someone set iptables -P INPUT DROP at any point, flushing removes the ACCEPT rules while leaving the DROP policy -- instant lockout. Always check the default policies before flushing, for example with iptables -S | grep '^-P'.
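
A guard in front of the flush makes the policy check hard to forget. The sketch below reads iptables -S output and fails if the INPUT or OUTPUT policy is anything other than ACCEPT. The function name policies_allow_flush is an assumption, and the check covers only the filter table that iptables -S prints by default.

```bash
# policies_allow_flush: read `iptables -S` output on stdin.
# Exit 0 if INPUT and OUTPUT policies are ACCEPT, else print why and exit 1.
# Function name and message text are illustrative.
policies_allow_flush() {
  awk '
    $1 == "-P" && ($2 == "INPUT" || $2 == "OUTPUT") && $3 != "ACCEPT" {
      print $2 " policy is " $3 ": flushing would lock you out"
      bad = 1
    }
    END { exit bad + 0 }
  '
}
```

Usage: sudo iptables -S | policies_allow_flush && sudo iptables -F. The flush runs only when the guard passes; it is still no substitute for out-of-band access.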

curl for Network Debugging

# Verbose output showing connection details
curl -v https://example.com

# Show timing breakdown (-s hides the progress meter)
curl -s -o /dev/null -w "DNS: %{time_namelookup}s\nConnect: %{time_connect}s\nTLS: %{time_appconnect}s\nFirstByte: %{time_starttransfer}s\nTotal: %{time_total}s\n" https://example.com

# Resolve to a specific IP (bypass DNS)
curl --resolve example.com:443:10.0.0.5 https://example.com

# Use specific source interface
curl --interface eth0 https://example.com

# Ignore TLS errors (debugging only)
curl -k https://self-signed.example.com

# Show response headers only
curl -I https://example.com

# Follow redirects
curl -L https://example.com

# Test with specific HTTP method
curl -X POST -d '{"key":"value"}' -H "Content-Type: application/json" https://api.example.com

Interpreting curl timing:

time_namelookup high? DNS problem.
time_connect - time_namelookup high? Network latency or routing issue.
time_appconnect - time_connect high? TLS handshake slow.
time_starttransfer - time_appconnect high? Server processing slow.
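
curl reports cumulative times, so the subtractions above are easy to get wrong by hand. This sketch does the arithmetic with awk. The function name phase_breakdown and the output labels are assumptions. It treats a time_appconnect of 0, which curl reports for plain HTTP, as "no TLS phase".

```bash
# phase_breakdown NAMELOOKUP CONNECT APPCONNECT STARTTRANSFER
# Convert curl's cumulative timings (seconds) into per-phase durations.
# Function name and labels are illustrative.
phase_breakdown() {
  awk -v dns="$1" -v conn="$2" -v tls="$3" -v first="$4" 'BEGIN {
    if (tls + 0 == 0) tls = conn     # plain HTTP: no TLS handshake happened
    printf "dns=%.3f tcp=%.3f tls=%.3f server=%.3f\n",
           dns, conn - dns, tls - conn, first - tls
  }'
}
```

Usage: phase_breakdown $(curl -s -o /dev/null -w '%{time_namelookup} %{time_connect} %{time_appconnect} %{time_starttransfer}' https://example.com). The largest of the four numbers shows which phase to investigate.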

Common Failure Patterns

"Connection refused"

Nothing listening on that port. Check ss -tlnp on the target.
Service is listening on localhost (127.0.0.1) only, not the external IP.
Service is listening on a different port than the one you are testing.

"Connection timed out"

Firewall dropping packets (no RST sent back).
Routing problem -- packets not reaching destination.
Target host is down.
Severe congestion or packet loss on the network path.

"No route to host"

Missing route in routing table. Check ip route.
ICMP "host unreachable" received from a router.
Outgoing interface is down. Check ip link.

"Name or service not known"

DNS resolution failure. Check /etc/resolv.conf and try dig.
Network partition preventing DNS queries.

Slow connections (not failing)

MTU issues causing fragmentation or PMTUD failures.
TCP retransmits from packet loss: ss -ti shows retransmit counts.
DNS lookup slow: check curl timing breakdown.
TLS handshake slow: certificate chain too long, OCSP stapling not configured.
TCP window scaling issues.
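
The error-message patterns in this section map naturally onto a lookup. The following is a small sketch for use in scripts or runbooks; the function name explain_error and the wording of each hint are assumptions that summarize the lists above.

```bash
# explain_error "MESSAGE": map a client error message to the most likely cause.
# Function name and hint wording are illustrative.
explain_error() {
  case "$1" in
    *"Connection refused"*)
      echo "reached host: nothing listening on that address/port" ;;
    *"timed out"*)
      echo "silent drop: firewall, routing problem, or host down" ;;
    *"No route to host"*)
      echo "routing: missing route, ICMP unreachable, or interface down" ;;
    *"Name or service not known"*)
      echo "dns: resolution failed, check /etc/resolv.conf and dig" ;;
    *)
      echo "unknown: capture with tcpdump" ;;
  esac
}
```

Usage: explain_error "$(ssh -o ConnectTimeout=5 <host> true 2>&1)".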

Network Performance Checks

# Check interface for errors
ip -s link show <iface>
ethtool -S <iface> | grep -i error

# Check TCP retransmits
cat /proc/net/snmp | grep -A1 Tcp   # look at RetransSegs
ss -ti | grep retrans                # per-connection

# Check for dropped packets
netstat -s | grep -i drop
ss -s                                 # socket counts by protocol and state

# Bandwidth test (if iperf3 available)
iperf3 -c <server>                   # TCP throughput
iperf3 -c <server> -u -b 100M       # UDP throughput

# Path analysis
mtr -rw <host>                       # shows loss and latency per hop
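
RetransSegs is easier to judge as a share of segments sent. The sketch below reads /proc/net/snmp and prints retransmitted segments as a percentage of OutSegs. The function name retrans_pct is an assumption, and the counters are cumulative since boot, so compare two readings taken some time apart to see the current rate.

```bash
# retrans_pct: read /proc/net/snmp on stdin, print RetransSegs as a
# percentage of OutSegs (both cumulative since boot). Name is illustrative.
retrans_pct() {
  awk '
    # First Tcp: line holds the column names; remember where each one sits.
    $1 == "Tcp:" && !seen { for (i = 2; i <= NF; i++) col[$i] = i; seen = 1; next }
    # Second Tcp: line holds the values.
    $1 == "Tcp:" {
      if (!("OutSegs" in col) || !("RetransSegs" in col)) next
      out = $(col["OutSegs"]); re = $(col["RetransSegs"])
      if (out > 0) printf "%.2f\n", 100 * re / out
    }
  '
}
```

Usage: retrans_pct < /proc/net/snmp.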

Decision Tree: Is It DNS?

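Can you reach the service by IP when the hostname fails?
  curl http://<ip>:<port>/
  |
  Yes -> It's DNS. Fix DNS.
  No  -> It's not (only) DNS. Continue with routing/firewall/service checks.

This single test confirms or eliminates DNS as the culprit in seconds. For HTTPS or name-based virtual hosts, use curl --resolve <hostname>:<port>:<ip> so the request still carries the right Host header and SNI.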

Heuristics

Always test with IP first, then hostname. This immediately separates DNS problems from everything else.
"Connection refused" means you reached the host. The kernel sent back RST because nothing is listening. This is good news -- the network path works.
"Connection timed out" usually means firewall. The packet was silently dropped. Less commonly, the host is unreachable.
Check both directions. A can reach B doesn't mean B can reach A. Asymmetric routing and firewall rules are directional.
When in doubt, tcpdump. Packets don't lie. If you can see the SYN leaving and no SYN-ACK coming back, the problem is on the remote end or in between.
MTU issues are sneaky. They cause intermittent failures that depend on packet size. If "small requests work but large transfers fail," think MTU.
Check the obvious first. ip addr -- is there an IP assigned? ip link -- is the interface up? Before pulling out tcpdump, make sure the basics are right.
Persistent CLOSE_WAIT is almost always an application bug. TIME_WAIT is normal. Don't confuse them.

Power One-Liners

All TCP connection states at a glance

ss -tan | awk 'NR>1 {state[$1]++} END {for(s in state) printf "%-15s %d\n", s, state[s]}'

Breakdown: ss -tan = all TCP sockets, numeric. Awk skips header, counts each state (ESTABLISHED, TIME_WAIT, CLOSE_WAIT, etc.). Reveals connection leaks, SYN floods, or half-open connections.

[!TIP] When to use: Diagnosing "too many open files", connection pool exhaustion, or SYN flood attacks.

Connections per IP (top talkers)

ss -tn | awk 'NR>1 {sub(/:[^:]+$/, "", $5); ip[$5]++} END {for(i in ip) printf "%5d %s\n", ip[i], i}' | sort -rn | head -20

Breakdown: ss -tn for TCP numeric. Awk strips the trailing :port from the peer address field, which also keeps IPv6 addresses intact. Counts per IP, sorts descending.

[!TIP] When to use: Identifying which clients are hammering your service during load issues.

List all listening ports with owning process

ss -tlnp | awk 'NR>1 {printf "%-6s %-25s %s\n", $1, $4, $6}'

or the classic:

lsof -Pan -i tcp -i udp

[!TIP] When to use: Security auditing — "what's listening and who owns it?"

Monitor HTTP connections live

watch -n1 "ss -tn state established '( dport = :80 or dport = :443 )' | awk 'NR>1 {split(\$5,a,\":\"); ip[a[1]]++} END {for(i in ip) printf \"%5d %s\n\", ip[i], i}' | sort -rn"

[!TIP] When to use: Real-time visibility during traffic spikes or suspected attacks.

Capture MySQL queries on the wire

tshark -i eth0 -f 'tcp port 3306' -Y 'mysql.query' -T fields -e mysql.query

Breakdown: tshark (Wireshark CLI) captures packets on port 3306, filters for MySQL query protocol, extracts just the query text. No MySQL access or slow-query log needed.

[!TIP] When to use: Debugging query patterns without application access, finding N+1 queries, verifying ORM behavior.

TCP proxy / traffic logger with a FIFO and netcat

mkfifo /tmp/backpipe
cat /tmp/backpipe | nc -l -p 8080 | tee -a request.log | nc target-host 80 | tee -a response.log > /tmp/backpipe

Breakdown: This builds a full-duplex relay in one line using a named pipe (FIFO) to close the loop. Inbound traffic hits nc -l (listen), gets logged to request.log via tee, forwarded to the real server via second nc. Responses flow back through tee into response.log and back to the client via the FIFO. It's a transparent man-in-the-middle proxy built from pipe fittings.

[!TIP] When to use: Debugging HTTP traffic between services without modifying either side, quick protocol inspection, intercepting traffic in dev environments.

Caveat: Single-connection only (nc exits after one connection). For persistent use, wrap in a while true loop or use socat instead. The -l -p form is traditional netcat syntax; OpenBSD netcat expects nc -l 8080.

Audible alert when host comes back online

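until ping -c 1 -W 2 <host> >/dev/null 2>&1; do sleep 5; done; printf '\a'; echo "HOST IS UP"

Breakdown: ping -c 1 -W 2 sends one probe and waits up to 2 seconds for a reply. The until loop retries every 5 seconds and exits on the first reply, then printf '\a' rings the terminal bell. A bare ping -i 5 -a <host> also beeps on each reply, but it never exits on its own, so a command chained after it with && does not run until you interrupt it.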

[!TIP] When to use: Waiting for a server to come back after reboot/maintenance.