Networking Troubleshooting - Primer

Why This Matters

When someone says "the network is down," they mean something is not connecting and they do not know why. Your job is to turn that into a specific diagnosis: which layer, which hop, which direction.

Network problems are among the most common categories of production incident. DNS misconfigurations, firewall rules blocking traffic, MTU mismatches, and exhausted connection pools cause outages at every scale. A systematic bottom-up approach finds the root cause faster than guessing and restarting things.

Core Concepts

Remember: The network troubleshooting mantra: "Start at the bottom, work up." Mnemonic: "LNDTA" — Link, Network, DNS, TCP, App. At each layer, ask one question: "Does this work?" If yes, move up. If no, you found your failing layer. This prevents the most common mistake: jumping straight to the application layer and wasting time there when the problem is a missing route or firewall rule.

1. Systematic Approach: Layer by Layer

Debug bottom-up. Each layer depends on those below. The L1 to L5 labels here are steps in this checklist, not OSI layer numbers.

L1 Link:    Is the interface up?
L2 Network: IP assigned? Route exists?
L3 DNS:     Name resolves correctly?
L4 TCP/UDP: Port reachable? Connection succeeds?
L5 App:     Service responds correctly?

At each layer: "Does this work?" If yes, move up. If no, you found your problem.
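
The bottom-up walk can be sketched as a small shell function that runs one check per layer and stops at the first failure. The name run_layers and the "label::command" argument format are conventions invented for this sketch, and eth0, 10.0.1.50, and example.com in the usage comment are placeholders.

```shell
# run_layers: run checks in the order given and stop at the first failure.
# Each argument has the form "label::command" (a convention made up for
# this sketch). The command runs under sh -c and its output is discarded.
run_layers() {
  for entry in "$@"; do
    label=${entry%%::*}   # text before the first "::"
    cmd=${entry#*::}      # text after the first "::"
    if sh -c "$cmd" >/dev/null 2>&1; then
      echo "ok   $label"
    else
      echo "FAIL $label"
      return 1
    fi
  done
  echo "all layers passed"
}

# Usage (placeholders: eth0, 10.0.1.50, example.com):
#   run_layers \
#     "link::ip link show eth0 | grep -q 'state UP'" \
#     "network::ip route get 10.0.1.50" \
#     "dns::getent hosts example.com" \
#     "tcp::nc -z -w 3 example.com 443" \
#     "app::curl -fsS -o /dev/null https://example.com/"
```

The first line that starts with FAIL names the layer to investigate; everything below it has already passed.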

Name origin: The "ping" command was written by Mike Muuss in December 1983, named after the sonar pulse that submarines send out — you send a signal and listen for the echo. Muuss wrote it in one evening to debug a network problem. It uses ICMP Echo Request/Reply (type 8/0), defined in RFC 792 (1981).

2. ping, traceroute, and mtr

# Basic connectivity
ping -c 4 10.0.1.50

# Path tracing
traceroute 10.0.1.50
traceroute -T -p 443 example.com  # TCP mode

# Combined real-time path analysis
mtr -r -c 10 example.com
mtr -T -P 443 example.com         # TCP mode

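In mtr output, loss at a single hop that does not carry through to later hops is often ICMP rate limiting (harmless). Loss that starts at one hop and persists through every later hop to the destination indicates a real problem at or just before that hop.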
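
That rule of thumb can be applied mechanically to a saved report. The sketch below assumes the usual mtr -r text layout, where each hop line starts with the hop number followed by ".|--", then the host, then the loss percentage. The function name persistent_loss_hop is hypothetical.

```shell
# persistent_loss_hop: read an `mtr -r` report on stdin and find where
# loss begins that continues all the way to the final hop.
# Assumes hop lines look like: "  2.|-- 10.0.0.1   30.0%   10   1.2 ..."
persistent_loss_hop() {
  awk '
    /^ *[0-9]+[.][|]--/ { n++; host[n] = $2; loss[n] = $3 + 0 }
    END {
      if (n == 0) { print "no hops parsed"; exit 1 }
      start = 0
      # Walk backward from the destination while loss is non-zero.
      for (i = n; i >= 1 && loss[i] > 0; i--) start = i
      if (start == 0)
        print "no loss at the final hop"
      else
        printf "loss persists from hop %d (%s) to the destination\n", start, host[start]
    }'
}

# Usage:
#   mtr -r -c 10 example.com | persistent_loss_hop
```

Loss that shows up only at intermediate hops is reported as "no loss at the final hop", which matches the rate-limiting case.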

3. DNS Debugging

War story: A production outage lasted 3 hours because the team kept investigating application code while the real problem was a stale DNS record. The TTL had expired, but a local resolver was caching aggressively. Running dig @8.8.8.8 from the start would have found it in 30 seconds. Rule: always test DNS from at least two different resolvers when something "suddenly stops working."

Most "network issues" are DNS. Always check early.

dig example.com                # Basic lookup
dig @8.8.8.8 example.com      # Specific server
dig +short example.com         # Short answer
dig +trace example.com         # Full resolution
dig -x 10.0.1.50              # Reverse lookup
dig example.com MX             # Record type

Symptom                       Likely cause
NXDOMAIN                      Name does not exist
SERVFAIL                      DNS server error
Timeout                       DNS server unreachable
Wrong IP                      Stale record or wrong zone
Works by IP, fails by name    DNS broken

cat /etc/resolv.conf           # Check DNS config
resolvectl status              # systemd-resolved
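
The symptom table can be turned into a quick triage helper. This is a minimal sketch that reads the status field from dig's response header; the function names dig_status and explain_status are hypothetical, and a missing header is treated as a timeout, which is an assumption that holds when no server answered at all.

```shell
# dig_status: read full `dig` output on stdin and print the header status
# (NOERROR, NXDOMAIN, SERVFAIL, ...). Prints TIMEOUT when no response
# header is present, which is assumed to mean no server answered.
dig_status() {
  status=$(sed -n 's/.*status: \([A-Z]*\),.*/\1/p' | head -n 1)
  echo "${status:-TIMEOUT}"
}

# explain_status: map a status to the likely cause from the symptom table.
explain_status() {
  case "$1" in
    NOERROR)  echo "answer returned: verify the IP is the one you expect" ;;
    NXDOMAIN) echo "name does not exist" ;;
    SERVFAIL) echo "DNS server error" ;;
    TIMEOUT)  echo "DNS server unreachable" ;;
    *)        echo "unrecognized status: $1" ;;
  esac
}

# Usage:
#   explain_status "$(dig example.com | dig_status)"
```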

Debug clue: When dig works but your application still cannot resolve names, check /etc/nsswitch.conf. The hosts: line controls resolution order. If it says hosts: files dns, the system checks /etc/hosts first. A stale entry in /etc/hosts overriding DNS is a surprisingly common cause of "DNS works from dig but not from the app."
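
A rough way to check for this override is to read the lookup order and then search the hosts file for the name. The sketch below uses two hypothetical helpers; both accept an optional file path so they can be pointed at a copy of the config.

```shell
# hosts_lookup_order [FILE]: print the sources on the "hosts:" line,
# e.g. "files dns". FILE defaults to /etc/nsswitch.conf.
hosts_lookup_order() {
  awk '$1 == "hosts:" { $1 = ""; sub(/^ +/, ""); print; exit }' \
    "${1:-/etc/nsswitch.conf}"
}

# hosts_file_entries NAME [FILE]: print every IP that FILE maps NAME to.
# FILE defaults to /etc/hosts. Comment lines and trailing comments are
# ignored.
hosts_file_entries() {
  awk -v name="$1" '
    /^[ \t]*#/ { next }
    {
      for (i = 2; i <= NF; i++) {
        if ($i ~ /^#/) break              # rest of the line is a comment
        if ($i == name) { print $1; break }
      }
    }' "${2:-/etc/hosts}"
}

# Usage:
#   hosts_lookup_order
#   hosts_file_entries example.com
#   getent hosts example.com     # resolves the way most applications do
#   dig +short example.com       # asks DNS only, ignoring /etc/hosts
```

If hosts_file_entries prints an address and "files" comes before "dns" in the lookup order, that entry wins over whatever dig reports.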

4. TCP Connection Debugging

# Listening ports with process names
ss -tlnp

# Established connections
ss -tnp

# Connections to a specific port
ss -tnp dst :443

# Socket summary
ss -s

# Test port reachability
nc -zv -w 3 example.com 443

One-liner: ss -tlnp means "show me what is listening, with process names." It replaces the old netstat -tlnp and is faster because it queries the kernel over netlink instead of parsing /proc text files. Run it as root to see process names for sockets owned by other users.

Connection states to know:

State          Meaning
ESTABLISHED    Working connection
TIME_WAIT      Closed, waiting to expire (normal)
CLOSE_WAIT     Remote closed, app has not (bug)
SYN_SENT       Connection attempt in progress

Many CLOSE_WAIT = application bug (not closing connections). Many TIME_WAIT = normal under load.
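
To see whether CLOSE_WAIT or TIME_WAIT is piling up, count sockets per state. A minimal sketch follows, with a hypothetical function name; note that ss prints abbreviated, hyphenated state names (ESTAB, TIME-WAIT, CLOSE-WAIT, SYN-SENT) rather than the underscore forms used in the table.

```shell
# count_states: read `ss -tan` output on stdin and print "STATE COUNT",
# most frequent state first. The first line is skipped as the header.
count_states() {
  awk 'NR > 1 { count[$1]++ } END { for (s in count) print s, count[s] }' |
    sort -k2,2nr -k1,1
}

# Usage:
#   ss -tan | count_states
```

Running it a few times a minute apart shows whether a count is stable or growing, which matters more than any single snapshot.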

Under the hood: The ss command (socket statistics) replaced netstat in modern Linux. ss queries the kernel over a netlink socket (the sock_diag interface) and falls back to /proc/net/tcp and /proc/net/tcp6 only when netlink is unavailable, while netstat always parses the /proc/net/* text files. On a server with 50,000+ connections, ss typically completes in milliseconds where netstat can take seconds. The ss tool is part of the iproute2 package, maintained since 1999.

5. Packet Capture with tcpdump

tcpdump -i eth0 host 10.0.1.50      # By host
tcpdump -i eth0 port 443            # By port
tcpdump -i eth0 -w capture.pcap     # Save file
tcpdump -r capture.pcap             # Read file
tcpdump -A -i eth0 port 80          # ASCII dump
tcpdump -c 100 -i eth0 port 443     # Limit count

Interview tip: When asked "how would you debug a connection problem between two services?" the strongest answer follows layers: 1) ping (L3 reachability), 2) traceroute (path), 3) nc/telnet to port (L4 reachability), 4) curl -v (L7 application). Jumping straight to the app layer is the most common mistake — and the reason interviewers ask this question.

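What to look for:

  • SYN without SYN-ACK: packets filtered or dropped, or the host is unreachable
  • RST in reply to a SYN: connection refused (nothing listening, or a firewall REJECT)
  • RST on an established connection: forcibly closed
  • Retransmissions: packet loss or congestion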
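
Counting TCP flags in a capture makes these signals easy to spot. The sketch below assumes tcpdump's default text output, where a SYN prints as "Flags [S]", a SYN-ACK as "Flags [S.]", and a reset as "Flags [R]" or "Flags [R.]". The function name flag_summary is hypothetical.

```shell
# flag_summary: read tcpdump text output on stdin and count SYN, SYN-ACK,
# and RST segments.
flag_summary() {
  awk '
    /Flags \[S\],/   { syn++ }
    /Flags \[S\.\],/ { synack++ }
    /Flags \[R\.?\],/ { rst++ }
    END { printf "syn=%d synack=%d rst=%d\n", syn, synack, rst }'
}

# Usage:
#   tcpdump -nn -r capture.pcap 'tcp port 443' | flag_summary
#
# To capture only SYN and RST segments in the first place:
#   tcpdump -nn -i eth0 'tcp[tcpflags] & (tcp-syn|tcp-rst) != 0'
```

Several SYNs with synack=0 and rst=0 point at silent drops; a rising rst count points at refusals or forced closes.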

6. Common Failure Patterns

Connection refused (ECONNREFUSED): Host reachable, nothing listening on that port.

ss -tlnp | grep :8080   # Is anything listening?

Connection timeout: Packets dropped silently. Firewall or routing issue.

iptables -L -n | grep 8080
ip route get 10.0.1.50
traceroute -T -p 8080 10.0.1.50

Connection reset (ECONNRESET): Remote forcibly closed. Could be firewall stateful inspection, load balancer timeout, or app crash.
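
The three connection failures above can be told apart from the error text a client prints. This is a rough sketch: the function names are hypothetical, and the exact wording of error messages varies between nc implementations, so the patterns are an assumption to adjust for your system.

```shell
# classify_connect_error: map a client error message to a failure pattern.
classify_connect_error() {
  case "$1" in
    *[Rr]efused*)
      echo "refused: host reachable, nothing listening on that port" ;;
    *"timed out"*|*[Tt]imeout*)
      echo "timeout: packets dropped silently, check firewall and routing" ;;
    *[Rr]eset*)
      echo "reset: remote or a middlebox closed the connection" ;;
    *)
      echo "unclassified" ;;
  esac
}

# probe_port HOST PORT: try a TCP connect with nc and classify any error.
probe_port() {
  if err=$(nc -zv -w 3 "$1" "$2" 2>&1); then
    echo "open"
  else
    classify_connect_error "$err"
  fi
}

# Usage:
#   probe_port 10.0.1.50 8080
```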

DNS timeout:

dig @8.8.8.8 example.com    # External works?
# If yes, internal DNS is the problem
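
The same comparison can be scripted so the conclusion is printed for you. In this sketch the function names and the DIG variable are hypothetical; DIG exists only so the lookup command can be swapped out when testing without network access.

```shell
# DIG names the lookup command; override it to test without a network.
DIG=${DIG:-dig}

# answers RESOLVER NAME: sorted A records for NAME from RESOLVER.
# Lines starting with ";" (dig's error and comment lines) are dropped.
answers() {
  $DIG +short "@$1" "$2" A | grep -v '^;' | sort
}

# compare_resolvers NAME INTERNAL EXTERNAL
compare_resolvers() {
  internal=$(answers "$2" "$1")
  external=$(answers "$3" "$1")
  if [ -z "$internal" ] && [ -n "$external" ]; then
    echo "external works, internal does not: internal DNS is the problem"
  elif [ -z "$internal" ] && [ -z "$external" ]; then
    echo "neither resolver answered: check the name and outbound port 53"
  elif [ "$internal" = "$external" ]; then
    echo "resolvers agree"
  else
    echo "resolvers disagree: possible stale cache or split-horizon zone"
  fi
}

# Usage (10.0.0.2 stands in for your internal resolver):
#   compare_resolvers example.com 10.0.0.2 8.8.8.8
```

A disagreement is not always a fault, since internal zones often return private addresses on purpose.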

7. MTU Issues

Analogy: MTU is like the maximum box size a conveyor belt will accept. If you try to send a box that is too large and stamped "do not fragment" (the DF bit), it gets rejected at the belt. The error message (ICMP "fragmentation needed") tells the sender to use smaller boxes — but if a firewall blocks ICMP, the sender never gets the message, and the transfer just hangs. This is the classic "PMTUD black hole" problem.

MTU problems: small requests work, large transfers fail or hang.

ip link show eth0 | grep mtu
# Test path MTU: 1472 payload = 1500 MTU - 20 (IPv4 header) - 8 (ICMP header)
ping -c 4 -s 1472 -M do 10.0.1.50

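Common in: VPN tunnels, container networks, cloud encapsulation, cross-network paths. Fix by lowering the MTU on the sending interface, or by making sure PMTUD can work, which requires that ICMP "fragmentation needed" messages are not blocked along the path.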
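
When the 1472-byte ping fails, a binary search over payload sizes finds the largest packet the path accepts. This is a sketch under stated assumptions: IPv4, Linux iputils ping (for -M do and -W), and a path with no unrelated packet loss, since a single lost probe would skew the result. The function names are hypothetical.

```shell
# ping_payload MTU: largest ICMP payload that fits in one IPv4 packet of
# that MTU (20-byte IPv4 header + 8-byte ICMP header = 28 bytes).
ping_payload() {
  echo $(( $1 - 28 ))
}

# probe SIZE HOST: succeed if one ping with the DF bit set gets a reply.
probe() {
  ping -c 1 -W 1 -s "$1" -M do "$2" >/dev/null 2>&1
}

# find_path_mtu HOST [LOW] [HIGH]: binary search for the largest payload
# that passes, then print the corresponding MTU.
find_path_mtu() {
  host=$1
  lo=${2:-500}     # payload assumed small enough to pass
  hi=${3:-1472}    # payload that fills a 1500-byte MTU
  if ! probe "$lo" "$host"; then
    echo "even $lo-byte payloads fail: probably not an MTU problem" >&2
    return 1
  fi
  while [ "$lo" -lt "$hi" ]; do
    mid=$(( (lo + hi + 1) / 2 ))
    if probe "$mid" "$host"; then
      lo=$mid
    else
      hi=$(( mid - 1 ))
    fi
  done
  echo $(( lo + 28 ))
}

# Usage:
#   find_path_mtu 10.0.1.50
```

The search needs about ten probes to cover the default range, so it finishes in a few seconds even when most probes time out.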

Gotcha: You test with telnet host 443 and it works, so you assume the connection is fine. But telnet only tests the TCP handshake, not TLS. If the real problem is a certificate mismatch, an expired cert, or a TLS version incompatibility, telnet will show success while curl -v https://host fails. Always test with the actual protocol.

8. Firewall Rule Debugging

# iptables
iptables -L -n --line-numbers
# Add debug logging
iptables -I INPUT 1 -p tcp --dport 8080 \
  -j LOG --log-prefix "DEBUG-8080: "
journalctl -k | grep "DEBUG-8080"
# Watch counters
watch -n1 'iptables -L -n -v | grep 8080'

# firewalld (RHEL/CentOS)
firewall-cmd --list-all
firewall-cmd --add-port=8080/tcp --permanent
firewall-cmd --reload

Firewall checklist:

  1. Is a firewall active?
  2. Is there a DROP/REJECT matching this traffic?
  3. Right chain? (INPUT vs FORWARD)
  4. Rule order correct? (first match wins)
  5. Default policy? (DROP at end?)
  6. Cloud: check security groups/NACLs separately
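
Checklist items 2, 4, and 5 can be partly automated by listing, in order, the rules in one chain that could match a port. This is a rough filter, not a rule evaluator: it reads iptables -S output, treats any rule without a --dport match (including multiport rules) as a candidate for every port, and does not understand port ranges or non-terminating targets such as LOG. The function name candidate_rules is hypothetical.

```shell
# candidate_rules CHAIN PORT: read `iptables -S` output on stdin and print
# "POSITION TARGET SCOPE" for each rule in CHAIN that names PORT or has no
# --dport restriction, followed by the chain's default policy.
candidate_rules() {
  awk -v chain="$1" -v port="$2" '
    $1 == "-P" && $2 == chain { policy = $3 }
    $1 == "-A" && $2 == chain {
      n++                          # position of the rule within the chain
      target = "-"; dport = ""
      for (i = 3; i <= NF; i++) {
        if ($i == "-j") target = $(i + 1)
        if ($i == "--dport") dport = $(i + 1)
      }
      if (dport == "" || dport == port)
        print n, target, (dport == "" ? "any-port" : "port-" dport)
    }
    END { print "policy", (policy == "" ? "none" : policy) }'
}

# Usage:
#   iptables -S | candidate_rules INPUT 8080
```

Read the output top to bottom: an earlier DROP shadows a later ACCEPT for the same port, and the policy line applies only if no rule matched.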

What Experienced People Know

  • Always ask "From where to where?" Network problems are directional.
  • Check DNS first. Even when you are sure it is not DNS. It is usually DNS.
  • curl -v shows DNS, TCP connect, TLS handshake, and full request/response. Best HTTP debug tool.
  • ss -tlnp is faster and better than netstat. Make it muscle memory.
  • tcpdump captures are proof. A pcap file ends disagreements about whether packets arrive.
  • ICMP blocked does not mean host is down. Test with the actual protocol.
  • Between containers/pods, run tcpdump inside the container, not on the host (different namespaces).
  • In AWS, security groups are stateful (return traffic auto-allowed). NACLs are stateless (need rules in both directions). Other clouds have similar but not identical models.
  • MTU issues look like "large transfers fail, small ones work." Test with ping -s 1472 -M do early.
  • openssl s_client -connect host:443 debugs TLS issues: cert expiry, wrong CN, chain problems.
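
Beyond reading curl -v output by eye, curl's --write-out timers show how long each phase took. The timers are cumulative, so this sketch subtracts neighbors to get per-phase durations. It assumes an HTTPS URL (for plain HTTP, time_appconnect is 0 and the tls figure is meaningless), and the function name phase_times is hypothetical.

```shell
# phase_times: read "namelookup connect appconnect starttransfer"
# (cumulative seconds, as printed by curl -w) on stdin and print the
# time spent in each phase.
phase_times() {
  awk '{
    printf "dns=%.3f tcp=%.3f tls=%.3f server=%.3f\n",
           $1, $2 - $1, $3 - $2, $4 - $3
  }'
}

# Usage:
#   curl -s -o /dev/null \
#     -w '%{time_namelookup} %{time_connect} %{time_appconnect} %{time_starttransfer}\n' \
#     https://example.com/ | phase_times
```

A large dns value sends you back to the DNS section, a large tcp value suggests path latency or loss, and a large server value points at the application.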
