Level: L1: Foundations | Topics: Networking Troubleshooting, Networking Troubleshooting Tools, Linux Networking Tools, TCP/IP | Domain: Networking
Networking Troubleshooting - Primer
Why This Matters
When someone says "the network is down," they mean something is not connecting and they do not know why. Your job is to turn that into a specific diagnosis: which layer, which hop, which direction.
Network problems are the most common production incident category. DNS misconfigurations, firewall rules blocking traffic, MTU mismatches, exhausted connection pools -- these cause outages at every scale. A systematic bottom-up approach finds root cause faster than guessing and restarting things.
Core Concepts
Remember: The network troubleshooting mantra: "Start at the bottom, work up." Mnemonic: "LNDTA" — Link, Network, DNS, TCP, App. At each layer, ask one question: "Does this work?" If yes, move up. If no, you found your failing layer. This prevents the most common mistake: jumping straight to the application layer and wasting time there when the problem is a missing route or firewall rule.
1. Systematic Approach: Layer by Layer
Debug bottom-up. Each layer depends on those below. The L1 to L5 labels in this list number the steps of the checklist; they are not OSI layer numbers.
L1 Link: Is the interface up?
L2 Network: IP assigned? Route exists?
L3 DNS: Name resolves correctly?
L4 TCP/UDP: Port reachable? Connection succeeds?
L5 App: Service responds correctly?
At each layer: "Does this work?" If yes, move up. If no, you found your problem.
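The bottom-up walk can be sketched as a small shell helper that runs one check per layer and stops at the first failure. This is a minimal sketch, not a standard tool: `check_layers` is a hypothetical name, and the interface, address, and hostname in the commented invocation are placeholders.

```shell
#!/bin/sh
# check_layers LABEL CMD [LABEL CMD ...]
# Runs each CMD in order (lowest layer first) and stops at the first failure.
check_layers() {
  while [ "$#" -ge 2 ]; do
    label=$1
    cmd=$2
    shift 2
    if sh -c "$cmd" >/dev/null 2>&1; then
      echo "ok   $label"
    else
      echo "FAIL $label"
      return 1
    fi
  done
  echo "all layers passed"
}

# Example invocation (eth0, 10.0.1.50, and example.com are placeholders):
# check_layers \
#   "Link"    "ip link show eth0 | grep -q 'state UP'" \
#   "Network" "ip route get 10.0.1.50" \
#   "DNS"     "getent hosts example.com" \
#   "TCP"     "nc -z -w 3 example.com 443" \
#   "App"     "curl -fsS -o /dev/null https://example.com/"
```

The first `FAIL` line names the layer to investigate; everything below it has already passed.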
Name origin: The "ping" command was written by Mike Muuss in December 1983, named after the sonar pulse that submarines send out — you send a signal and listen for the echo. Muuss wrote it in one evening to debug a network problem. It uses ICMP Echo Request/Reply (type 8/0), defined in RFC 792 (1981).
2. ping, traceroute, and mtr
# Basic connectivity
ping -c 4 10.0.1.50
# Path tracing
traceroute 10.0.1.50
traceroute -T -p 443 example.com # TCP mode
# Combined real-time path analysis
mtr -r -c 10 example.com
mtr -T -P 443 example.com # TCP mode
In mtr output, single-hop loss is often ICMP rate limiting (harmless). Loss persisting from one hop onward indicates a real problem at that hop.
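The rule above (isolated loss is usually rate limiting, loss that continues to the final hop is real) can be sketched as a filter over per-hop loss values. The function name is hypothetical, and the input format is an assumption: one loss percentage per line, hop 1 first. Extracting that column from an `mtr -r` report depends on the mtr version and is left out here.

```shell
#!/bin/sh
# first_persistent_loss_hop: read one loss percentage per line (hop 1 first).
# Walks back from the final hop while loss stays above zero and prints the
# earliest hop in that unbroken run, or "none" if the final hop shows no loss.
first_persistent_loss_hop() {
  awk '
    { loss[NR] = $1 + 0 }          # "+ 0" also tolerates a trailing % sign
    END {
      start = 0
      for (i = NR; i >= 1; i--) {
        if (loss[i] > 0)
          start = i
        else
          break
      }
      if (start)
        print start
      else
        print "none"
    }
  '
}
```

For example, losses of 0, 40, 0, 0 give `none` (hop 2 is only deprioritizing ICMP), while 0, 0, 10, 12, 11 give `3`.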
3. DNS Debugging
War story: A production outage lasted 3 hours because the team kept investigating application code while the real problem was a stale DNS record. The TTL had expired, but a local resolver was caching aggressively. Running `dig @8.8.8.8` from the start would have found it in 30 seconds. Rule: always test DNS from at least two different resolvers when something "suddenly stops working."
Most "network issues" are DNS. Always check early.
dig example.com # Basic lookup
dig @8.8.8.8 example.com # Specific server
dig +short example.com # Short answer
dig +trace example.com # Full resolution
dig -x 10.0.1.50 # Reverse lookup
dig example.com MX # Record type
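The "two resolvers" rule can be sketched as a comparison of `dig +short` answers. `compare_resolvers` and the overridable `DIG` variable are hypothetical names introduced for this sketch; the resolver addresses passed in are up to the caller.

```shell
#!/bin/sh
# DIG can be overridden, for example with a stub when no network is available.
DIG=${DIG:-dig}

# compare_resolvers NAME RESOLVER_A RESOLVER_B
# Prints "match" when both resolvers return the same set of answers.
compare_resolvers() {
  a=$("$DIG" +short "@$2" "$1" | sort)
  b=$("$DIG" +short "@$3" "$1" | sort)
  if [ "$a" = "$b" ]; then
    echo "match"
  else
    echo "MISMATCH"
    echo "$2: $a"
    echo "$3: $b"
    return 1
  fi
}

# Example: compare_resolvers example.com 10.0.0.2 8.8.8.8
```

A mismatch is a lead, not a verdict: split-horizon setups intentionally return different answers to internal and public resolvers.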
| Symptom | Likely cause |
|---|---|
| NXDOMAIN | Name does not exist |
| SERVFAIL | DNS server error |
| Timeout | DNS server unreachable |
| Wrong IP | Stale record or wrong zone |
| Works by IP, fails by name | DNS broken |
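The table can be applied mechanically by reading the `status:` field from dig's header line. A minimal sketch, assuming dig's usual `;; ->>HEADER<<- opcode: QUERY, status: ..., id: ...` header; `explain_dig_status` is a hypothetical helper name.

```shell
#!/bin/sh
# explain_dig_status: read full dig output on stdin, print the likely cause.
explain_dig_status() {
  rcode=$(sed -n 's/.*status: \([A-Z]*\),.*/\1/p' | head -n 1)
  case "$rcode" in
    NOERROR)  echo "NOERROR: answered; if the IP is wrong, suspect a stale record or wrong zone" ;;
    NXDOMAIN) echo "NXDOMAIN: name does not exist" ;;
    SERVFAIL) echo "SERVFAIL: DNS server error" ;;
    "")       echo "no status line: DNS server unreachable or query timed out" ;;
    *)        echo "$rcode: uncommon response code, check the server logs" ;;
  esac
}

# Example: dig example.com | explain_dig_status
```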
Debug clue: When `dig` works but your application still cannot resolve names, check `/etc/nsswitch.conf`. The `hosts:` line controls resolution order. If it says `hosts: files dns`, the system checks `/etc/hosts` first. A stale entry in `/etc/hosts` overriding DNS is a surprisingly common cause of "DNS works from dig but not from the app."
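To check for the override described above, one option is to search the hosts file for the name directly. A minimal sketch; `hosts_override` is a hypothetical helper, and the optional second argument exists only so the function can be pointed at a file other than `/etc/hosts`.

```shell
#!/bin/sh
# hosts_override NAME [FILE]
# Prints every entry in a hosts file (default /etc/hosts) that defines NAME.
hosts_override() {
  awk -v name="$1" '
    { sub(/#.*/, "") }                       # ignore comments
    {
      for (i = 2; i <= NF; i++)
        if ($i == name) { print; break }     # field 1 is the address
    }
  ' "${2:-/etc/hosts}"
}

# Example: hosts_override api.internal
```

Comparing `getent hosts NAME` (which follows `nsswitch.conf`, as most applications do) with `dig +short NAME` (which queries DNS only) shows the same discrepancy from the other direction.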
4. TCP Connection Debugging
# Listening ports with process names
ss -tlnp
# Established connections
ss -tnp
# Connections to a specific port
ss -tnp dst :443
# Socket summary
ss -s
# Test port reachability
nc -zv -w 3 example.com 443
One-liner: `ss -tlnp` means "show me what is listening, with process names." This replaces the old `netstat -tlnp` and is faster because it reads directly from kernel socket tables instead of parsing `/proc`.
Connection states to know:
| State | Meaning |
|---|---|
| ESTABLISHED | Working connection |
| TIME_WAIT | Closed, waiting to expire (normal) |
| CLOSE_WAIT | Remote closed, app has not (bug) |
| SYN_SENT | Connection attempt in progress |
Many CLOSE_WAIT = application bug (not closing connections). Many TIME_WAIT = normal under load.
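To see whether CLOSE_WAIT or TIME_WAIT dominates, the state column of `ss -tan` can be tallied. A minimal sketch; `count_tcp_states` is a hypothetical helper name. Note that `ss` prints abbreviated, hyphenated state names (`ESTAB`, `TIME-WAIT`, `CLOSE-WAIT`, `SYN-SENT`) rather than the netstat-style names in the table.

```shell
#!/bin/sh
# count_tcp_states: read `ss -tan` output on stdin, print "COUNT STATE" lines,
# most frequent state first. The first input line is the column header.
count_tcp_states() {
  awk 'NR > 1 { count[$1]++ }
       END { for (s in count) print count[s], s }' | sort -rn
}

# Example: ss -tan | count_tcp_states
```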
Under the hood: The `ss` command (socket statistics) replaced `netstat` in modern Linux. `ss` queries the kernel over netlink (the `sock_diag` interface) and falls back to `/proc/net/tcp` and `/proc/net/tcp6` only when netlink is unavailable, while `netstat` parses the `/proc/net/*` text files. On a server with 50,000+ connections, `ss` completes in milliseconds where `netstat` can take seconds. The `ss` tool is part of the `iproute2` package, maintained since 1999.
5. Packet Capture with tcpdump
tcpdump -i eth0 host 10.0.1.50 # By host
tcpdump -i eth0 port 443 # By port
tcpdump -i eth0 -w capture.pcap # Save file
tcpdump -r capture.pcap # Read file
tcpdump -A -i eth0 port 80 # ASCII dump
tcpdump -c 100 -i eth0 port 443 # Limit count
Interview tip: When asked "how would you debug a connection problem between two services?" the strongest answer follows layers: 1) ping (L3 reachability), 2) traceroute (path), 3) nc/telnet to port (L4 reachability), 4) curl -v (L7 application). Jumping straight to the app layer is the most common mistake — and the reason interviewers ask this question.
What to look for:
- SYN without SYN-ACK: filtered or dropped along the path (a refused connection answers the SYN with a RST instead)
- RST packets: connection refused or forcibly closed
- Retransmissions: packet loss or congestion
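The flag patterns above can be counted from tcpdump's text output. A minimal sketch, assuming tcpdump's usual `Flags [S]`, `Flags [S.]`, and `Flags [R]` / `Flags [R.]` notation; `summarize_flags` is a hypothetical helper name.

```shell
#!/bin/sh
# summarize_flags: read tcpdump text output on stdin and count
# SYN, SYN-ACK, and RST segments.
summarize_flags() {
  awk '
    /Flags \[S\],/    { syn++ }
    /Flags \[S\.\],/  { synack++ }
    /Flags \[R\.?\],/ { rst++ }
    END { printf "syn=%d synack=%d rst=%d\n", syn, synack, rst }
  '
}

# Example: tcpdump -nn -r capture.pcap | summarize_flags
```

Several SYNs with `synack=0` and `rst=0` point at silent drops; SYNs matched by RSTs point at a closed port or an active reject.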
6. Common Failure Patterns
Connection refused (ECONNREFUSED): Host reachable, nothing listening on that port.
Connection timeout: Packets dropped silently. Firewall or routing issue.
Connection reset (ECONNRESET): Remote forcibly closed. Could be firewall stateful inspection, load balancer timeout, or app crash.
DNS timeout: DNS server unreachable. Test against a specific server with `dig @8.8.8.8 example.com`.
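The patterns above map fairly directly onto the error text that clients such as `nc` and `curl` print. A minimal sketch; `classify_connect_error` is a hypothetical helper, and the matched substrings are the common Linux error strings, which can differ by tool and locale.

```shell
#!/bin/sh
# classify_connect_error MESSAGE: map a client error message to a likely cause.
classify_connect_error() {
  case "$1" in
    *"Connection refused"*)
      echo "refused: host reachable, nothing listening on that port" ;;
    *"timed out"*|*"Timed out"*)
      echo "timeout: packets dropped silently, check firewall and routing" ;;
    *"Connection reset"*)
      echo "reset: remote forcibly closed, check firewall state, load balancer timeout, app crash" ;;
    *"Could not resolve"*|*"Name or service not known"*)
      echo "dns: name resolution failed, debug DNS first" ;;
    *)
      echo "unknown: capture traffic with tcpdump" ;;
  esac
}

# Example: classify_connect_error "$(nc -zv -w 3 example.com 443 2>&1)"
```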
7. MTU Issues
Analogy: MTU is like the maximum box size a conveyor belt will accept. If you try to send a box that is too large and stamped "do not fragment" (the DF bit), it gets rejected at the belt. The error message (ICMP "fragmentation needed") tells the sender to use smaller boxes — but if a firewall blocks ICMP, the sender never gets the message, and the transfer just hangs. This is the classic "PMTUD black hole" problem.
MTU problems: small requests work, large transfers fail or hang.
ip link show eth0 | grep mtu
# Test path MTU with DF set (1472 = 1500 - 20-byte IPv4 header - 8-byte ICMP header)
ping -c 4 -s 1472 -M do 10.0.1.50
Common in: VPN tunnels, container networks, cloud encapsulation, cross-network paths. Fix by lowering MTU or enabling PMTUD (requires ICMP not blocked).
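The single 1472-byte probe can be extended into a binary search for the largest MTU the path accepts. This is a sketch under assumptions: it relies on the Linux iputils `ping` flags used above, it assumes the low bound passes, and `icmp_payload`, `probe`, `find_path_mtu`, and the `PROBE` override are hypothetical names.

```shell
#!/bin/sh
# icmp_payload MTU: largest `ping -s` value that fits one unfragmented IPv4
# packet (MTU minus 20-byte IPv4 header minus 8-byte ICMP header).
icmp_payload() { echo $(( $1 - 28 )); }

# probe HOST SIZE: succeeds if a ping with DF set and SIZE payload bytes is answered.
probe() { ping -c 1 -W 1 -M do -s "$2" "$1" >/dev/null 2>&1; }

# find_path_mtu HOST LOW_MTU HIGH_MTU: binary search between the two bounds.
# Set PROBE to another command or function name to replace the real ping.
find_path_mtu() {
  host=$1
  lo=$2
  hi=$3
  while [ "$lo" -lt "$hi" ]; do
    mid=$(( ($lo + $hi + 1) / 2 ))
    if ${PROBE:-probe} "$host" "$(icmp_payload "$mid")"; then
      lo=$mid
    else
      hi=$(( $mid - 1 ))
    fi
  done
  echo "$lo"
}

# Example: find_path_mtu 10.0.1.50 576 1500
```

A result below the interface MTU (for instance 1400 on a path with 1500-byte interfaces) suggests encapsulation overhead somewhere along the path.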
Gotcha: A common TCP debugging trap: you test with `telnet host 443` and it works, so you assume the connection is fine. But `telnet` only tests the TCP handshake, not TLS. If the real problem is a certificate mismatch, expired cert, or TLS version incompatibility, `telnet` will show success while `curl -v https://host` will fail. Always test with the actual protocol.
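One way to see how far a request got is to scan `curl -v` output for the marker lines of each stage. This is a rough sketch: the matched lines (`Trying`, `Connected to`, `SSL connection using`, `< HTTP/`) are what common curl builds print, but the wording varies with curl version and TLS backend, and `furthest_stage` is a hypothetical helper name.

```shell
#!/bin/sh
# furthest_stage: read `curl -v` output on stdin and print the last stage reached:
#   dns  = name resolved, TCP connect attempted but not completed
#   tcp  = TCP connected, TLS handshake not completed
#   tls  = TLS established, no HTTP response seen
#   http = an HTTP response line arrived
furthest_stage() {
  awk '
    /^\* +Trying /             { stage = "dns" }
    /^\* Connected to /        { stage = "tcp" }
    /^\* SSL connection using/ { stage = "tls" }
    /^< HTTP\//                { stage = "http" }
    END { print (stage == "" ? "none" : stage) }
  '
}

# Example (curl writes verbose output to stderr):
# curl -v https://host 2>&1 | furthest_stage
```

A result of `tcp` is exactly the case where `telnet` reports success but the TLS handshake is failing.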
8. Firewall Rule Debugging
# iptables
iptables -L -n --line-numbers
# Add debug logging
iptables -I INPUT 1 -p tcp --dport 8080 \
-j LOG --log-prefix "DEBUG-8080: "
journalctl -k | grep "DEBUG-8080"
# Watch counters
watch -n1 'iptables -L -n -v | grep 8080'
# firewalld (RHEL/CentOS)
firewall-cmd --list-all
firewall-cmd --add-port=8080/tcp --permanent
firewall-cmd --reload
Firewall checklist:
1. Is a firewall active?
2. Is there a DROP/REJECT matching this traffic?
3. Right chain? (INPUT vs FORWARD)
4. Rule order correct? (first match wins)
5. Default policy? (DROP at end?)
6. Cloud: check security groups/NACLs separately
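Because the first matching rule wins, a quick way to approach checklist items 2 and 4 is to list every DROP/REJECT rule that sits above the ACCEPT rule for a port. This is a rough heuristic, not a rule evaluator: it ignores source, interface, protocol, and state matches, so each printed rule is only a candidate. `shadowing_rules` is a hypothetical helper; the input is assumed to be `iptables -S` output for a single chain.

```shell
#!/bin/sh
# shadowing_rules PORT: read `iptables -S INPUT` output on stdin and print each
# DROP/REJECT rule that appears before the first ACCEPT rule for PORT,
# prefixed with its position in the chain.
shadowing_rules() {
  awk -v port="$1" '
    $1 != "-A" { next }                     # skip policy lines such as "-P INPUT DROP"
    { n++ }
    $0 ~ ("--dport " port "( |$)") && /-j ACCEPT/ { found = 1; exit }
    /-j (DROP|REJECT)/ { print n ": " $0 }
    END { if (!found) print "no ACCEPT rule for port " port }
  '
}

# Example: iptables -S INPUT | shadowing_rules 8080
```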
What Experienced People Know
- Always ask "From where to where?" Network problems are directional.
- Check DNS first. Even when you are sure it is not DNS. It is usually DNS.
- `curl -v` shows DNS, TCP connect, TLS handshake, and full request/response. Best HTTP debug tool.
- `ss -tlnp` is faster and better than netstat. Make it muscle memory.
- tcpdump captures are proof. A pcap file ends disagreements about whether packets arrive.
- ICMP blocked does not mean host is down. Test with the actual protocol.
- Between containers/pods, run tcpdump inside the container, not on the host (different namespaces).
- Cloud security groups are stateful (return traffic auto-allowed). NACLs are stateless (need rules both directions).
- MTU issues look like "large transfers fail, small ones work." Test with `ping -s 1472 -M do` early.
- `openssl s_client -connect host:443` debugs TLS issues: cert expiry, wrong CN, chain problems.