Networking Troubleshooting Street Ops¶
The Systematic Debug Workflow¶
When someone says "I can't reach X," run this flowchart:
Step 1: Is it DNS?
dig <hostname>
dig @8.8.8.8 <hostname> # bypass local resolver
If no resolution -> DNS problem. Stop here and fix DNS.
If resolves to wrong IP -> DNS problem (stale record, wrong zone).
If resolves correctly -> continue.
Step 2: Is it routing?
ping <ip> # can you reach the IP at all?
traceroute <ip> # where do packets stop?
ip route get <ip> # what route would be used?
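If ping fails -> check local routing, check firewall. ICMP is often filtered, so a failed ping alone does not prove the host is unreachable; run Step 3 anyway.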
If traceroute dies at a hop -> routing/firewall issue at that hop.
If ping works -> continue.
Step 3: Is it the port/firewall?
nc -zv <ip> <port> # TCP connection test
ss -tlnp | grep <port> # run on the target host: is anything listening?
sudo iptables -L -n -v # check local firewall rules
If connection refused -> service not listening on that port.
If timeout -> firewall blocking.
If connected -> continue.
Step 4: Is it the application?
curl -v http://<ip>:<port>/ # test HTTP
openssl s_client -connect <ip>:443 # test TLS
If HTTP error -> application problem.
If TLS error -> certificate or TLS config problem.
DNS Debugging¶
Quick checks¶
dig example.com # default resolver
dig @8.8.8.8 example.com # Google DNS (bypass local)
dig +short example.com # just the answer
dig +trace example.com # full delegation chain
dig example.com MX # specific record type
dig -x 1.2.3.4 # reverse DNS
Common DNS failure modes¶
Resolution works from one place but not another:
- Check /etc/resolv.conf -- different nameservers configured.
- Check /etc/nsswitch.conf -- order of resolution (files, dns).
- Check /etc/hosts -- local override.
- If using systemd-resolved: resolvectl status.
Intermittent DNS failures:
- DNS server overloaded or flapping.
- MTU issues causing large DNS responses to be dropped (TCP fallback failing).
- Firewall blocking DNS responses (check both UDP/53 and TCP/53).
DNS resolves but to wrong IP:
- Stale DNS cache: resolvectl flush-caches (systemd-resolve --flush-caches on older systemd releases) or restart systemd-resolved.
- Wrong DNS zone or record.
- CDN/anycast returning different IPs based on location (expected behavior).
- /etc/hosts override taking precedence.
Check resolution order¶
getent hosts example.com # uses nsswitch.conf order (may check /etc/hosts first)
dig example.com # queries DNS directly (skips /etc/hosts)
# If getent and dig return different results, check /etc/hosts and nsswitch.conf
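The getent-versus-dig comparison can be scripted. The following is a minimal sketch, not a definitive tool: the function names compare_answers and check_resolution are made up for illustration, and it assumes glibc's getent (for the ahostsv4 database) and dig are installed. It compares only the first IPv4 address from each source.

```shell
# compare_answers NSS_IP DNS_IP: report whether the two lookups agree.
compare_answers() {
  if [ -z "$1" ] && [ -z "$2" ]; then
    echo "no answer from either source: resolution is failing"
  elif [ "$1" = "$2" ]; then
    echo "match: $1"
  else
    echo "mismatch: getent='$1' dig='$2' (check /etc/hosts and nsswitch.conf)"
  fi
}

# check_resolution NAME: run both lookups and compare the first IPv4 address.
# getent follows nsswitch.conf; dig asks the DNS resolver directly.
check_resolution() {
  nss_ip=$(getent ahostsv4 "$1" | awk '{print $1; exit}')
  dns_ip=$(dig +short A "$1" | awk '/^[0-9.]+$/ {print; exit}')
  compare_answers "$nss_ip" "$dns_ip"
}
```

Run it as check_resolution example.com. Names with several A records can come back in a different order from each source, so a reported mismatch is a prompt to look closer rather than proof of an override.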
Reading tcpdump Output¶
Essential captures¶
# Capture traffic to/from a host on a specific port
tcpdump -i any host 10.0.0.5 and port 443 -nn
# Capture DNS traffic
tcpdump -i any port 53 -nn
# Capture with full packet content (for HTTP debugging)
tcpdump -i any host 10.0.0.5 and port 80 -nn -A -s0
# Write to file for Wireshark analysis
tcpdump -i any host 10.0.0.5 -nn -w /tmp/capture.pcap
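# Capture packets with the SYN flag set (matches both SYN and SYN-ACK)
tcpdump -i any 'tcp[tcpflags] & tcp-syn != 0' -nn
# Capture initial SYNs only (connection attempts)
tcpdump -i any 'tcp[tcpflags] & (tcp-syn|tcp-ack) == tcp-syn' -nn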
Reading the output¶
14:23:01.123456 IP 10.0.0.1.54321 > 10.0.0.2.443: Flags [S], seq 12345, win 65535
14:23:01.123789 IP 10.0.0.2.443 > 10.0.0.1.54321: Flags [S.], seq 67890, ack 12346, win 65535
14:23:01.124000 IP 10.0.0.1.54321 > 10.0.0.2.443: Flags [.], ack 67891, win 65535
Flag meanings:
- [S] = SYN (connection initiation)
- [S.] = SYN-ACK (connection accepted)
- [.] = ACK
- [P.] = PSH-ACK (data push)
- [F.] = FIN-ACK (connection close)
- [R.] = RST-ACK (connection reset -- abrupt close; also the usual reply to a SYN when nothing is listening on the port)
- [R] = RST (reset without ACK -- sent in reply to a segment that matches no open connection)
What the TCP handshake tells you:¶
SYN -> (client initiates)
<- SYN-ACK (server accepts)
ACK -> (connection established)
If you see SYN but no SYN-ACK:
- Packet not reaching server (routing/firewall)
- Server is dropping the SYN (firewall, full backlog)
If you see SYN and get RST:
- Nothing listening on that port
If you see SYN-ACK but no ACK from client:
- Client-side firewall blocking the response
- Asymmetric routing (SYN goes one path, SYN-ACK comes back another)
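The handshake rules above can be applied mechanically to a capture. This is a rough sketch under stated assumptions: classify_handshake is a hypothetical helper name, the input is tcpdump -nn text output for a single connection attempt, and any RST in the capture is treated as the answer to the SYN.

```shell
# classify_handshake: read tcpdump -nn text output on stdin and summarise
# what happened to the connection attempt.
classify_handshake() {
  awk '
    /Flags \[S\],/    { syn++ }      # initial SYN
    /Flags \[S\.\],/  { synack++ }   # SYN-ACK
    /Flags \[R\.?\],/ { rst++ }      # RST or RST-ACK
    END {
      if (!syn)         print "no SYN seen: the client never attempted a connection"
      else if (rst)     print "SYN answered with RST: nothing listening on that port"
      else if (!synack) print "SYN without SYN-ACK: packet lost or dropped (routing/firewall/backlog)"
      else              print "SYN and SYN-ACK seen: server is reachable and accepting"
    }'
}
```

Typical use would be something like tcpdump -i any host 10.0.0.5 and port 443 -nn -c 20 | classify_handshake, captured while the client retries.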
MTU Issues¶
Symptoms¶
- Small requests work, large requests fail or hang.
- SSH works, SCP/SFTP of large files hangs.
- VPN/tunnel connections have mysterious failures.
- ping -M do -s 1472 <host> fails but ping <host> works.
Diagnosis¶
# Test MTU by sending different-sized pings with Don't Fragment flag
ping -M do -s 1472 <host> # 1472 + 28 bytes header = 1500 (standard MTU)
ping -M do -s 1400 <host> # try smaller if 1472 fails
# Binary search to find the working MTU
# Check interface MTU
ip link show <interface> # look for mtu value
# Common MTU values
# 1500 = standard Ethernet
# 1476 = typical inside a GRE tunnel (1500 - 24 bytes of IP+GRE overhead)
# 1420 = typical inside a WireGuard tunnel
# 9000 = jumbo frames (datacenter)
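The binary search mentioned above can be automated. A minimal sketch, assuming Linux iputils ping (for -M do) and that every size at or below the true limit passes; the function names probe_df_ping and find_max_payload are illustrative, not standard tools.

```shell
# probe_df_ping HOST SIZE: succeed if one Don't Fragment ping carrying
# SIZE payload bytes gets a reply from HOST.
probe_df_ping() {
  ping -M do -c 1 -W 1 -s "$2" "$1" >/dev/null 2>&1
}

# find_max_payload LO HI PROBE [ARGS...]
# Binary-search the largest SIZE in [LO, HI] for which `PROBE ARGS... SIZE`
# succeeds. Prints the size, or prints nothing and returns 1 if none pass.
find_max_payload() (
  lo=$1; hi=$2; shift 2
  best=
  while [ "$lo" -le "$hi" ]; do
    mid=$(( (lo + hi) / 2 ))
    if "$@" "$mid"; then
      best=$mid; lo=$(( mid + 1 ))   # worked: try larger
    else
      hi=$(( mid - 1 ))              # failed: try smaller
    fi
  done
  [ -n "$best" ] && echo "$best"
)
```

For example, find_max_payload 1200 1472 probe_df_ping 10.0.0.5 prints the largest payload that passes; add 28 bytes of IP and ICMP headers to get the path MTU. A single lost reply can mislead the search, so rerun it before changing any interface settings.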
Fix¶
# Temporary
sudo ip link set <interface> mtu 1400
# Permanent (depends on distro)
# /etc/sysconfig/network-scripts/ifcfg-<iface>: MTU=1400
# /etc/netplan/*.yaml: mtu: 1400
# nmcli: nmcli connection modify <conn> 802-3-ethernet.mtu 1400
# Enable Path MTU Discovery (should be on by default)
sudo sysctl -w net.ipv4.ip_no_pmtu_disc=0
Connection State Diagnosis with ss¶
# Show all TCP connections with state
ss -tan
# Show listening sockets with process names
ss -tlnp
# Show established connections
ss -tn state established
# Show connections to a specific port
ss -tn dport = :443
# Show connection counts by state
ss -tan | awk 'NR>1 {print $1}' | sort | uniq -c | sort -rn
# Show socket memory usage
ss -tm
Connection states to watch for¶
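Many TIME_WAIT: Normal for busy servers. Sockets waiting to close cleanly. Usually harmless unless they exhaust the ephemeral port range. net.ipv4.tcp_tw_reuse=1 lets the kernel reuse TIME_WAIT sockets for new outgoing connections; it does not help a server that accumulates them on inbound connections.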
Many CLOSE_WAIT: The remote end closed the connection but the local application hasn't called close(). This is an application bug. The application is leaking connections.
Many SYN_RECV: Connection backlog is filling up. Either a SYN flood attack or the server can't accept connections fast enough. Check net.core.somaxconn and net.ipv4.tcp_max_syn_backlog.
Many ESTABLISHED with no traffic: Idle connections being held open. Check keepalive settings.
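These checks can be turned into a quick filter over ss output. A sketch only: state_hints is a hypothetical name and the default threshold of 100 is an arbitrary assumption to tune per host. Note that ss spells the states with hyphens (CLOSE-WAIT, TIME-WAIT, SYN-RECV), unlike the netstat-style names used in the text.

```shell
# state_hints [LIMIT]: read `ss -tan` output on stdin and print a hint for
# each watched state whose socket count exceeds LIMIT (default 100).
state_hints() {
  awk -v limit="${1:-100}" '
    NR > 1 { count[$1]++ }   # skip the header, tally the State column
    END {
      if (count["CLOSE-WAIT"] > limit)
        print "CLOSE-WAIT=" count["CLOSE-WAIT"] ": local application is not closing its sockets"
      if (count["SYN-RECV"] > limit)
        print "SYN-RECV=" count["SYN-RECV"] ": backlog filling, check somaxconn and tcp_max_syn_backlog"
      if (count["TIME-WAIT"] > limit)
        print "TIME-WAIT=" count["TIME-WAIT"] ": normal under load, watch the ephemeral port range"
    }'
}
```

Usage: ss -tan | state_hints 500. No output means none of the watched states crossed the limit.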
Firewall Debugging¶
iptables (legacy but still common)¶
# List all rules with packet counts
sudo iptables -L -n -v
sudo iptables -L -n -v -t nat # NAT rules
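# Watch drop counters change (-d highlights differences between refreshes)
sudo watch -d -n1 'iptables -L -n -v | grep DROP'
# Or log packets before they reach the DROP rule. -I with no rule number inserts
# at the top of INPUT, so this logs every inbound packet; remove it after testing:
sudo iptables -I INPUT -j LOG --log-prefix "IPT-DROP: "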
nftables (modern replacement)¶
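The commands for this subsection are missing from this copy. As a stand-in rather than the original content, the ruleset can usually be listed with sudo nft list ruleset, and the sketch below (nft_blocking_rules is a hypothetical helper) filters that listing for lines that can block traffic.

```shell
# nft_blocking_rules: read `nft list ruleset` output on stdin and print,
# with line numbers, every line containing a drop or reject verdict or policy.
nft_blocking_rules() {
  grep -nwE 'drop|reject'
}
```

Usage: sudo nft list ruleset | nft_blocking_rules. A chain whose header shows policy drop blocks anything its rules do not explicitly accept.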
firewalld (RHEL/CentOS)¶
sudo firewall-cmd --list-all
sudo firewall-cmd --list-ports
sudo firewall-cmd --add-port=8080/tcp # temporary
sudo firewall-cmd --add-port=8080/tcp --permanent && sudo firewall-cmd --reload
ufw (Ubuntu)¶
The "is it the firewall?" test¶
# Temporarily disable firewall to test
sudo iptables -F # flush all rules (DANGEROUS on remote servers)
# OR add a specific allow:
sudo iptables -I INPUT -p tcp --dport <port> -j ACCEPT
# Test. If it works -> firewall was blocking. Re-enable and add proper rule.
Never flush iptables on a remote server unless you have out-of-band access (console, IPMI). If default policy is DROP and you flush, you lose SSH.
Default trap:
iptables -F flushes rules but does NOT reset the default chain policy. If someone set iptables -P INPUT DROP at any point, flushing removes the ACCEPT rules while leaving the DROP policy -- instant lockout. Always check iptables -L | head -3 for the default policy before flushing.
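The policy check can be scripted so it is harder to skip. A minimal sketch with a hypothetical helper name, chains_with_drop_policy; it parses the chain headers printed by iptables -L, with or without -v.

```shell
# chains_with_drop_policy: read `iptables -L -n` (or -L -n -v) output on
# stdin and print the name of every built-in chain whose policy is DROP.
chains_with_drop_policy() {
  awk '$1 == "Chain" && $3 == "(policy" {
         policy = $4
         sub(/\)$/, "", policy)   # non-verbose form ends with ")"
         if (policy == "DROP") print $2
       }'
}
```

Usage: sudo iptables -L -n | chains_with_drop_policy. If INPUT appears in the output, flushing will cut off SSH unless the policy is changed to ACCEPT first.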
curl for Network Debugging¶
# Verbose output showing connection details
curl -v https://example.com
# Show timing breakdown
curl -o /dev/null -w "DNS: %{time_namelookup}s\nConnect: %{time_connect}s\nTLS: %{time_appconnect}s\nFirstByte: %{time_starttransfer}s\nTotal: %{time_total}s\n" https://example.com
# Resolve to a specific IP (bypass DNS)
curl --resolve example.com:443:10.0.0.5 https://example.com
# Use specific source interface
curl --interface eth0 https://example.com
# Ignore TLS errors (debugging only)
curl -k https://self-signed.example.com
# Show response headers only
curl -I https://example.com
# Follow redirects
curl -L https://example.com
# Test with specific HTTP method
curl -X POST -d '{"key":"value"}' -H "Content-Type: application/json" https://api.example.com
Interpreting curl timing:¶
- time_namelookup high? DNS problem.
- time_connect - time_namelookup high? Network latency or routing issue.
- time_appconnect - time_connect high? TLS handshake slow.
- time_starttransfer - time_appconnect high? Server processing slow.
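The subtractions above are easy to get wrong by hand, so here is a small sketch that does them. curl_phases is a hypothetical helper; it takes the five cumulative timers in the order shown and prints per-phase durations. It assumes that time_appconnect is 0 for plain HTTP, in which case the TLS phase is reported as zero.

```shell
# curl_phases NAMELOOKUP CONNECT APPCONNECT STARTTRANSFER TOTAL
# Convert curl's cumulative timers (seconds) into per-phase durations.
curl_phases() {
  awk -v dns="$1" -v conn="$2" -v tls="$3" -v first="$4" -v total="$5" 'BEGIN {
    if (tls == 0) tls = conn   # no TLS handshake happened (plain HTTP)
    printf "dns=%.3f tcp=%.3f tls=%.3f server=%.3f transfer=%.3f\n",
           dns, conn - dns, tls - conn, first - tls, total - first
  }'
}
```

Usage: curl_phases $(curl -s -o /dev/null -w '%{time_namelookup} %{time_connect} %{time_appconnect} %{time_starttransfer} %{time_total}' https://example.com). The largest value points at the phase to investigate.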
Common Failure Patterns¶
"Connection refused"¶
- Nothing listening on that port. Check ss -tlnp on the target.
- Service is listening on localhost (127.0.0.1) only, not the external IP.
- Service is listening on wrong port.
"Connection timed out"¶
- Firewall dropping packets (no RST sent back).
- Routing problem -- packets not reaching destination.
- Target host is down.
- Network path congestion.
"No route to host"¶
- Missing route in routing table. Check ip route.
- ICMP "host unreachable" received from a router.
- Interface down.
"Name or service not known"¶
- DNS resolution failure. Check /etc/resolv.conf and try dig.
- Network partition preventing DNS queries.
Slow connections (not failing)¶
- MTU issues causing fragmentation or PMTUD failures.
- TCP retransmits from packet loss: ss -ti shows retransmit counts.
- DNS lookup slow: check curl timing breakdown.
- TLS handshake slow: certificate chain too long, OCSP stapling not configured.
- TCP window scaling issues.
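The four error strings in this section map to a first check each, which can be captured in a lookup. This is an illustrative sketch (explain_error is a made-up name) that matches on the usual glibc and tool wording; other platforms may phrase the errors differently.

```shell
# explain_error MESSAGE: print the first thing to check for a client error.
explain_error() {
  case "$1" in
    *"Connection refused"*)
      echo "host reached, port closed: check ss -tlnp on the target and the bind address" ;;
    *"timed out"*)
      echo "packets dropped or host down: check firewall, routing, and whether the target is up" ;;
    *"No route to host"*)
      echo "no usable route or ICMP unreachable: check ip route and interface state" ;;
    *"Name or service not known"*)
      echo "DNS failure: check /etc/resolv.conf and try dig" ;;
    *)
      echo "unrecognized error: capture with tcpdump" ;;
  esac
}
```

Usage: explain_error "$(curl -sS http://10.0.0.5:8080/ 2>&1 >/dev/null)".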
Network Performance Checks¶
# Check interface for errors
ip -s link show <iface>
ethtool -S <iface> | grep -i error
# Check TCP retransmits
grep '^Tcp:' /proc/net/snmp # compare RetransSegs against OutSegs
ss -ti | grep retrans # per-connection
# Check for dropped packets
netstat -s | grep -i drop
ss -s # socket counts by type and state (does not report drops)
# Bandwidth test (if iperf3 available)
iperf3 -c <server> # TCP throughput
iperf3 -c <server> -u -b 100M # UDP throughput
# Path analysis
mtr -rw <host> # shows loss and latency per hop
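RetransSegs on its own says little; the ratio to OutSegs is more useful. A minimal sketch, with retrans_rate as a hypothetical helper name, that reads the /proc/net/snmp format and locates the columns by header name instead of assuming fixed positions.

```shell
# retrans_rate: read /proc/net/snmp on stdin and print TCP retransmitted
# segments as a percentage of segments sent.
retrans_rate() {
  awk '
    $1 == "Tcp:" && !have_header {        # first Tcp: line holds column names
      for (i = 2; i <= NF; i++) col[$i] = i
      have_header = 1
      next
    }
    $1 == "Tcp:" {                        # second Tcp: line holds the values
      out = $(col["OutSegs"]); retrans = $(col["RetransSegs"])
      if (out > 0) printf "%.2f%%\n", 100 * retrans / out
    }'
}
```

Usage: retrans_rate < /proc/net/snmp. The counters are cumulative since boot, so sample twice and compare if you need the current rate.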
Decision Tree: Is It DNS?¶
Can you reach the service by IP?
curl http://<ip>:<port>/
|
Yes -> It's DNS. Fix DNS.
No -> It's not (only) DNS. Continue with routing/firewall/service checks.
This single test eliminates or confirms DNS as the culprit in 5 seconds.
Heuristics¶
- Always test with IP first, then hostname. This immediately separates DNS problems from everything else.
- "Connection refused" means you reached the host. The kernel sent back RST because nothing is listening. This is good news -- the network path works.
- "Connection timed out" usually means firewall. The packet was silently dropped. Less commonly, the host is unreachable.
- Check both directions. A can reach B doesn't mean B can reach A. Asymmetric routing and firewall rules are directional.
- When in doubt, tcpdump. Packets don't lie. If you can see the SYN leaving and no SYN-ACK coming back, the problem is on the remote end or in between.
- MTU issues are sneaky. They cause intermittent failures that depend on packet size. If "small requests work but large transfers fail," think MTU.
- Check the obvious first. ip addr -- is there an IP assigned? ip link -- is the interface up? Before pulling out tcpdump, make sure the basics are right.
- CLOSE_WAIT is always an application bug. TIME_WAIT is normal. Don't confuse them.
Power One-Liners¶
All TCP connection states at a glance¶
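The command for this entry is missing from this copy. The following is a reconstruction consistent with the breakdown below, offered as an assumption rather than the original text; count_states is a hypothetical wrapper name.

```shell
# count_states: read `ss -tan` output on stdin, skip the header line, and
# print the number of sockets in each TCP state, busiest first.
count_states() {
  awk 'NR > 1 { n[$1]++ } END { for (s in n) printf "%6d %s\n", n[s], s }' | sort -rn
}
```

Usage: ss -tan | count_states.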
Breakdown: ss -tan = all TCP sockets, numeric. Awk skips header, counts each state (ESTABLISHED, TIME_WAIT, CLOSE_WAIT, etc.). Reveals connection leaks, SYN floods, or half-open connections.
[!TIP] When to use: Diagnosing "too many open files", connection pool exhaustion, or SYN flood attacks.
Connections per IP (top talkers)¶
ss -tn | awk 'NR>1 {split($5,a,":"); ip[a[1]]++} END {for(i in ip) printf "%5d %s\n", ip[i], i}' | sort -rn | head -20
Breakdown: ss -tn for TCP numeric. Awk splits the peer address field on : to extract IP (dropping port). Counts per IP, sorts descending.
[!TIP] When to use: Identifying which clients are hammering your service during load issues.
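Splitting on the first colon breaks for IPv6 peers, whose addresses contain colons themselves. A variant that strips only the trailing port is sketched below; top_talkers is a hypothetical name, and it assumes ss prints IPv6 peers in the [addr]:port form.

```shell
# top_talkers [N]: read `ss -tn` output on stdin and print the N peers
# (default 20) with the most connections. Handles IPv4 and IPv6 peers.
top_talkers() {
  awk 'NR > 1 {
         peer = $5
         sub(/:[^:]*$/, "", peer)    # drop the trailing :port only
         gsub(/^\[|\]$/, "", peer)   # strip the brackets around IPv6 addresses
         hits[peer]++
       }
       END { for (p in hits) printf "%5d %s\n", hits[p], p }' | sort -rn | head -n "${1:-20}"
}
```

Usage: ss -tn | top_talkers 10.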
List all listening ports with owning process¶
or the classic:
[!TIP] When to use: Security auditing — "what's listening and who owns it?"
Monitor HTTP connections live¶
watch -n1 "ss -tn state established '( dport = :80 or dport = :443 )' | awk 'NR>1 {split(\$4,a,\":\"); ip[a[1]]++} END {for(i in ip) printf \"%5d %s\n\", ip[i], i}' | sort -rn"
[!TIP] When to use: Real-time visibility during traffic spikes or suspected attacks.
Capture MySQL queries on the wire¶
Breakdown: tshark (Wireshark CLI) captures packets on port 3306, filters for MySQL query protocol, extracts just the query text. No MySQL access or slow-query log needed.
[!TIP] When to use: Debugging query patterns without application access, finding N+1 queries, verifying ORM behavior.
TCP proxy / traffic logger with a FIFO and netcat¶
mkfifo /tmp/backpipe
cat /tmp/backpipe | nc -l -p 8080 | tee -a request.log | nc target-host 80 | tee -a response.log > /tmp/backpipe
Breakdown: This builds a full-duplex relay in one line using a named pipe (FIFO) to close the loop. Inbound traffic hits nc -l (listen), gets logged to request.log via tee, forwarded to the real server via second nc. Responses flow back through tee into response.log and back to the client via the FIFO. It's a transparent man-in-the-middle proxy built from pipe fittings.
[!TIP] When to use: Debugging HTTP traffic between services without modifying either side, quick protocol inspection, intercepting traffic in dev environments.
Caveat: Single-connection only (nc exits after one connection). For persistent use, wrap in a while true loop or use socat instead.
Audible alert when host comes back online¶
Breakdown: -i 5 sends a ping every 5 seconds. -a rings the terminal bell on each reply. To block until the host responds instead, run ping -c 1 -W <timeout> in a loop.
[!TIP] When to use: Waiting for a server to come back after reboot/maintenance.
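The loop variant mentioned in the breakdown could look like the sketch below. wait_until is a hypothetical helper that retries any command, so it is not tied to ping; the interval and the ping flags in the usage line are assumptions to adjust.

```shell
# wait_until INTERVAL CMD [ARGS...]: rerun CMD every INTERVAL seconds until
# it succeeds, then ring the terminal bell once.
wait_until() (
  interval=$1; shift
  until "$@" >/dev/null 2>&1; do
    sleep "$interval"
  done
  printf '\a'
)
```

Usage: wait_until 5 ping -c 1 -W 2 <host>. Because it accepts any command, the same helper can wait for a port instead, for example wait_until 5 nc -z <host> 22.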