Portal | Level: L2: Operations | Topics: Linux Networking Tools, Packet Path | Domain: Linux
Linux Network Packet Flow¶
Scope¶
This document explains what happens to a packet on Linux:
- from NIC receive to userspace socket
- from userspace send to NIC transmit
- with routing, conntrack, netfilter, bridge, NAT, and local delivery in the picture
- in both host and container-heavy environments
This is the mental model you need for:
- debugging dropped packets
- tracing weird latency
- understanding Docker / Kubernetes networking
- making sense of iptables, nftables, tc, ip rule, and ip route
- answering interview questions without hand-waving
Big picture¶
A packet is not "handled by Linux" in one giant lump. It moves through a series of layers and hook points.
Receive path¶
wire
-> NIC
-> DMA into RAM
-> interrupt or NAPI poll
-> driver builds sk_buff
-> ingress path
-> optional XDP / tc ingress
-> netfilter PREROUTING
-> routing decision
-> local delivery
-> forwarding
-> bridge path
-> protocol handler (TCP/UDP/ICMP/...)
-> socket receive queue
-> userspace read/recv
Transmit path¶
userspace send/write
-> socket layer
-> TCP/UDP/IP stack
-> routing decision
-> netfilter OUTPUT / POSTROUTING
-> qdisc / tc egress
-> driver queue
-> NIC DMA
-> wire
The packet is usually represented inside the kernel as an sk_buff (skb). If you understand that one object is being classified, routed, rewritten, queued, and finally transmitted or delivered, the whole stack becomes much less mystical.
The main building blocks¶
NIC and driver¶
The network card receives frames from the wire. The driver:
- coordinates DMA so packet data lands in memory
- exposes RX and TX rings
- acknowledges interrupts
- participates in NAPI polling
- hands packets to the kernel networking stack
Important consequences:
- packet loss can happen before the IP stack even sees the packet
- RX ring starvation, IRQ affinity, or driver bugs can look like "network problems"
- high packet rates are often an interrupt / queue / CPU placement problem, not just a bandwidth problem
sk_buff¶
Linux uses the sk_buff structure as the canonical packet wrapper. It tracks:
- pointers to packet data
- protocol headers
- device information
- metadata such as marks, priority, checksum state, timestamps, conntrack association, and routing information
The payload may be linear or fragmented. That matters for offloads and for packet mangling.
NAPI¶
At high packet rates Linux avoids taking one interrupt per packet. Drivers typically use NAPI:
- hardware signals receive activity
- interrupt schedules polling
- kernel polls a bounded amount of RX work
- if traffic subsides, interrupts are re-enabled
Why you care:
- it improves throughput and reduces interrupt storms
- it can increase latency if CPUs are pinned or overloaded badly
- tuning IRQ affinity and queue placement matters on multi-core hosts
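The per-CPU counters behind these symptoms are exposed in /proc/net/softnet_stat. A minimal sketch (Linux only; field positions follow the kernel's documented layout) that flags CPUs whose input backlog overflowed or whose NAPI poll budget ran out with work still pending:

```python
def softnet_stats():
    """Parse /proc/net/softnet_stat: one line of hex fields per CPU.
    Field 0 = packets processed, field 1 = dropped (input backlog full),
    field 2 = time_squeeze (NAPI budget exhausted with work remaining)."""
    rows = []
    with open("/proc/net/softnet_stat") as f:
        for cpu, line in enumerate(f):
            fields = [int(x, 16) for x in line.split()]
            rows.append({"cpu": cpu,
                         "processed": fields[0],
                         "dropped": fields[1],
                         "time_squeeze": fields[2]})
    return rows

for r in softnet_stats():
    if r["dropped"] or r["time_squeeze"]:
        print("CPU %(cpu)d: dropped=%(dropped)d squeeze=%(time_squeeze)d" % r)
```

Non-zero `time_squeeze` on a busy host is a hint that IRQ affinity or queue placement, not bandwidth, is the real problem.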
Routing subsystem¶
The kernel decides whether a packet is:
- for the local host
- to be forwarded elsewhere
- to be bridged at L2
- to be dropped / blackholed / rejected
- subject to policy routing
This is where ip route, ip rule, routing tables, marks, and VRFs matter.
Netfilter / nftables / iptables¶
Netfilter provides hook points in the stack. Rulesets implemented via nftables or iptables can:
- filter
- NAT
- mark
- log
- redirect
- classify
Classic hook names:
- PREROUTING
- INPUT
- FORWARD
- OUTPUT
- POSTROUTING
These are not random names; they describe where in the path the packet currently is.
Conntrack¶
Connection tracking tracks flows and flow state such as:
- NEW
- ESTABLISHED
- RELATED
- INVALID
It is central to:
- stateful firewalling
- many NAT use cases
- service load balancing patterns
Conntrack is also a common production bottleneck when tables overflow or timeouts are wrong.
Receive path in detail¶
1. Frame arrives at the NIC¶
An Ethernet frame hits the card. The NIC:
- verifies enough of the frame to accept it
- may validate checksums or coalesce frames, depending on enabled offloads
- places data into host memory using DMA
- updates RX descriptor rings
At this stage, the CPU may not yet have touched the packet body.
Failure modes here¶
- bad cable / switch / duplex / physical errors
- RX drops in hardware
- small ring sizes
- driver bugs
- IRQ pinned to a saturated CPU
- packet rate too high for polling budget
Tools¶
- ip -s link
- ethtool -S eth0
- ethtool -k eth0
- ethtool -l eth0
- cat /proc/interrupts
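The counters these tools report are also exposed directly in sysfs, which is handy for scripted checks. A minimal sketch (the interface name `lo` is just an always-present placeholder; substitute your real NIC):

```python
from pathlib import Path

def nic_stats(iface: str) -> dict:
    """Read per-interface counters from /sys/class/net/<iface>/statistics
    (the same numbers `ip -s link` prints)."""
    stats_dir = Path("/sys/class/net") / iface / "statistics"
    return {p.name: int(p.read_text()) for p in stats_dir.iterdir()}

s = nic_stats("lo")  # "lo" exists on every Linux host
# rx_dropped / rx_errors climbing here means loss before the IP stack
# ever saw the packet.
for key in ("rx_packets", "rx_dropped", "rx_errors", "tx_packets"):
    print(key, s.get(key, 0))
```

Sampling these counters twice and diffing is usually more useful than a single snapshot.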
2. Driver and NAPI handoff¶
The driver notices RX work, usually by interrupt, and schedules NAPI polling. It creates or fills an skb and hands it to the receive path.
This is the point where Linux meaningfully "owns" the packet.
Important detail: packets may already be coalesced, checksummed, or partially offloaded depending on NIC features.
Why offload awareness matters¶
If you sniff on the host and see strange checksum behavior, it may be because:
- checksum offload is happening later than you think
- segmentation offload means the kernel sees a large logical packet that is later split by NIC hardware
- packet captures can mislead you if you do not account for offloads
3. Optional early packet processing: XDP and tc ingress¶
Before the packet proceeds further, optional fast-path mechanisms may act on it.
XDP¶
XDP runs very early in the driver receive path, in some cases before a full sk_buff is even allocated. It is used for:
- ultra-fast drop
- filtering
- load balancing
- DDoS mitigation
- redirection
The verdicts are XDP_PASS, XDP_DROP, XDP_TX (transmit back out the same NIC), XDP_REDIRECT, and XDP_ABORTED.
tc ingress¶
The traffic control (tc) ingress hook can classify, police, mark, or redirect packets. It is slower than XDP but more tightly integrated with traditional Linux networking behavior.
Why you care¶
If packets vanish before normal firewall rules, check whether:
- an XDP program is attached
- a tc filter exists
- a CNI plugin or security product inserted ingress logic
4. Netfilter PREROUTING¶
This is the first major L3/L4 policy point for normal IP processing.
Typical uses:
- DNAT
- marking
- filtering decisions before local-vs-forward routing choice
- transparent proxying tricks
Conceptually:
packet just entered host
-> should we rewrite destination?
-> should we mark it?
-> should we drop it?
If DNAT occurs here, later routing happens based on the translated destination.
5. Routing decision¶
Linux now asks: where should this packet go?
Possible outcomes:
- local delivery to the host
- forwarding to another interface
- bridge forwarding if in bridge path
- drop due to no route / policy / rp_filter / explicit firewall action
The route lookup considers:
- destination prefix
- policy routing rules
- fwmark
- incoming interface
- VRF / network namespace context
- source constraints for some cases
Local delivery¶
If the packet is for a local address, it heads toward protocol handlers and eventually a socket.
Forwarding¶
If the host is acting as a router and forwarding is enabled, it may go through FORWARD and then out another interface.
Common confusion¶
A host can be both:
- an endpoint for some addresses
- a router for other traffic
- a bridge for L2 forwarding
- a NAT box
These paths overlap but are not identical.
6. Local delivery path¶
If the destination is local:
Netfilter INPUT¶
This is where you commonly allow or deny inbound traffic destined for the local host.
Examples:
- allow SSH to the server
- allow Prometheus scrape traffic
- drop random inbound junk
Protocol demux¶
The IP layer passes to the next protocol:
- TCP
- UDP
- ICMP
- SCTP
- others
The transport layer then tries to match a socket by:
- destination IP
- destination port
- source tuple where relevant
- namespace
- socket options like SO_REUSEPORT
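The SO_REUSEPORT case is easy to demonstrate: two sockets bind the same address and port, and the kernel load-balances incoming connections between them. A minimal sketch (Linux 3.9+; the helper name is illustrative):

```python
import socket

def make_listener(port: int) -> socket.socket:
    """Listener that opts into kernel load-balancing via SO_REUSEPORT."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(("127.0.0.1", port))
    s.listen()
    return s

a = make_listener(0)          # port 0: the kernel picks a free port
port = a.getsockname()[1]
b = make_listener(port)       # would raise EADDRINUSE without SO_REUSEPORT
print("two listeners on port", port)
a.close(); b.close()
```

Note that both sockets must set the option before bind; you cannot retrofit it onto an existing listener.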
Socket receive queue¶
If a matching socket exists, data goes to the receive queue. Userspace then pulls it with:
- recv
- read
- recvmsg
- accept for the connection-setup path
If no listener exists¶
For TCP:
- kernel usually replies with RST
For UDP:
- packet is dropped; an ICMP unreachable may be generated depending on conditions
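Both behaviors can be observed from userspace with plain sockets on loopback. A sketch (the UDP error only surfaces on a connected socket, and in some environments the ICMP may be suppressed, hence the fallbacks):

```python
import socket

# Find a local port with no listener: bind, note the number, close.
probe = socket.socket()
probe.bind(("127.0.0.1", 0))
port = probe.getsockname()[1]
probe.close()

# TCP: SYN to the closed port -> kernel answers RST -> ECONNREFUSED.
tcp_refused = False
try:
    socket.create_connection(("127.0.0.1", port), timeout=2)
except ConnectionRefusedError:
    tcp_refused = True
print("TCP refused:", tcp_refused)

# UDP: a datagram to a closed port triggers ICMP port unreachable.
# On a *connected* UDP socket, Linux reports it as ECONNREFUSED
# on a later send/recv.
u = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
u.connect(("127.0.0.1", port))
u.settimeout(2)
u.send(b"ping")
try:
    u.recv(16)
    print("UDP: no error surfaced")
except ConnectionRefusedError:
    print("UDP: ICMP unreachable surfaced as ECONNREFUSED")
except socket.timeout:
    print("UDP: ICMP suppressed or delayed")
u.close()
```

This is also why an unconnected UDP sender never learns the destination is closed: there is no connected socket for the kernel to attach the ICMP error to.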
7. Forwarding path¶
If the packet is not for the local host and forwarding is enabled:
FORWARD¶
This is the filtering point for routed transit traffic.
POSTROUTING¶
Typical uses:
- SNAT / MASQUERADE
- marks
- final packet policy before egress
Why forwarding breaks in real life¶
Common reasons:
- net.ipv4.ip_forward=0
- firewall allows local input but not forwarding
- reverse path filtering
- missing route back
- broken conntrack state
- MTU issues
- asymmetric routing
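The first items on this list are one file read away: every sysctl is a file under /proc/sys, with dots mapped to path components. A minimal sketch:

```python
from pathlib import Path

def sysctl(name, default=None):
    """Read a sysctl via /proc/sys; e.g. net.ipv4.ip_forward
    maps to /proc/sys/net/ipv4/ip_forward."""
    path = Path("/proc/sys") / name.replace(".", "/")
    try:
        return path.read_text().strip()
    except OSError:
        return default

print("ip_forward:", sysctl("net.ipv4.ip_forward"))
print("rp_filter (all):", sysctl("net.ipv4.conf.all.rp_filter"))
```

`rp_filter` of 1 (strict) silently drops packets whose return route would leave a different interface, which is exactly the asymmetric-routing trap above.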
8. Bridge path¶
If Linux is acting as a bridge, frame processing can stay at layer 2.
Bridge logic decides which port should receive a frame based on MAC learning and forwarding tables.
Important bridge features:
- MAC learning
- STP / RSTP behavior depending on setup
- VLAN filtering
- optional bridge netfilter interaction
Container networking often uses Linux bridges, so a lot of "container networking" is really just "Linux bridge plus veth plus NAT plus policy rules."
Transmit path in detail¶
1. Userspace writes to a socket¶
An application performs:
- send
- sendmsg
- write
- connect + write for TCP stream traffic
For TCP, the kernel manages:
- connection state
- retransmission
- congestion control
- segmentation
- ACK processing
For UDP, the path is simpler: datagram in, datagram out.
2. Socket layer and transport processing¶
The kernel turns application data into transport and network packets.
For TCP this includes:
- sequence numbers
- congestion window
- retransmission timers
- segmentation
- checksums
- state transitions
For UDP this includes:
- datagram framing
- checksums
- optional fragmentation downstream if MTU requires it
3. Output routing decision¶
The kernel picks:
- egress interface
- next hop
- source address
- route attributes
- policy-based overrides if present
This is where:
- wrong source address selection
- weird ip rule matches
- missing routes
turn into black holes.
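One cheap way to inspect this decision is a connected UDP socket: connect() performs the route lookup and source-address selection without sending a single packet, and getsockname() reveals what the kernel chose. A sketch (the function name and port choice are illustrative):

```python
import socket

def route_probe(dst: str, port: int = 9):
    """Ask the kernel which source address it would select for dst.
    connect() on a UDP socket only stores the route; nothing is sent."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.connect((dst, port))       # route lookup happens here
        return s.getsockname()[0]    # kernel-selected source address
    except OSError as e:
        return f"no route: {e}"      # e.g. ENETUNREACH -> a black hole
    finally:
        s.close()

# A loopback destination must select a loopback source:
print(route_probe("127.0.0.1"))
```

This is the programmatic equivalent of `ip route get <dst>`, and a quick way to catch wrong-source-address surprises.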
4. Netfilter OUTPUT¶
This affects packets generated locally by the host.
Typical use cases:
- host egress filtering
- local-service redirection
- service mesh / proxy tricks
- packet marking for policy routing
Do not confuse INPUT and OUTPUT:
- INPUT = packet destined to the local host
- OUTPUT = packet created by the local host
5. Netfilter POSTROUTING¶
This is where final egress NAT often happens.
Common examples:
- container subnet egress MASQUERADE
- host acting as NAT gateway
- policy marks before wire
6. Traffic control / qdisc / egress¶
Before the driver transmits, Linux can queue and shape traffic.
Key ideas:
- every interface has a qdisc
- qdiscs manage queueing, fairness, delay, shaping, and sometimes drops
- fq, fq_codel, htb, and mq commonly appear
This is where you shape bandwidth, enforce fairness, or accidentally create latency.
When qdisc matters¶
- VoIP / latency-sensitive apps
- multi-tenant hosts
- egress congestion
- bufferbloat mitigation
- CNI bandwidth policies
7. Driver TX queue and NIC transmit¶
The driver maps packet buffers for DMA, places descriptors into TX rings, and the NIC eventually transmits on the wire.
At very high rates, bottlenecks can be:
- qdisc lock contention
- TX queue imbalance
- CPU softirq saturation
- offload mismatch
- NIC queue count too small
- NUMA placement problems
Routing, policy routing, and marks¶
Standard routing picks the longest-prefix match from the main routing table. Policy routing adds extra decision layers via ip rule.
Examples of selectors:
- source address
- fwmark
- incoming interface
- TOS / DSCP
- UID in some scenarios
This is heavily used in:
- multi-homed systems
- VPN split routing
- CNI plugins
- transparent proxies
- traffic engineering
Fwmark¶
A packet mark is just metadata attached in the kernel. Rules can later say:
- if mark 0x1, consult table 100
- if mark 0x2, send to different gateway
Marks are extremely useful and extremely easy to lose track of.
Conntrack and NAT¶
Conntrack records flow state. NAT uses that state so response traffic can be rewritten consistently.
DNAT¶
Changes destination, usually early:
- incoming packet to public IP:443
- rewrite to internal IP:8443
SNAT / MASQUERADE¶
Changes source, usually late:
- internal packet from 10.0.0.5
- rewrite source to public IP of egress interface
Why conntrack breaks things¶
Problems include:
- table exhaustion
- stale state
- asymmetric routing
- timeouts too long or too short
- NAT mapping collisions in pathological cases
Useful commands¶
- conntrack -L
- nft list ruleset
- iptables-save
- sysctl net.netfilter.nf_conntrack_max
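Table exhaustion in particular can be watched programmatically; the counters live next to the sysctl above. A sketch (the files are absent when the nf_conntrack module is not loaded, so the helper degrades to None):

```python
from pathlib import Path

def conntrack_usage():
    """Current vs. maximum conntrack entries, or None when the
    nf_conntrack module is not loaded."""
    base = Path("/proc/sys/net/netfilter")
    try:
        count = int((base / "nf_conntrack_count").read_text())
        limit = int((base / "nf_conntrack_max").read_text())
    except OSError:
        return None
    return count, limit

usage = conntrack_usage()
if usage is None:
    print("conntrack not loaded")
else:
    count, limit = usage
    print("entries=%d max=%d (%.1f%% full)" % (count, limit, 100.0 * count / limit))
```

Alerting when usage crosses, say, 80% is far cheaper than debugging the "nf_conntrack: table full, dropping packet" incident afterwards.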
Network namespaces and containers¶
Containers do not invent a new networking stack. They reuse Linux primitives.
Common pattern: each container gets its own network namespace, wired to the host through a veth pair and, often, a bridge.
Inside the container, it looks like it has its own:
- interfaces
- routes
- sockets
- firewall namespace context
But those are namespace-isolated kernel views, not separate kernels.
Important consequences¶
- a container packet often crosses namespaces, veth, bridge, netfilter, and NAT before leaving the box
- "it works on the host but not the pod" often means namespace, policy, or bridge path differences
- packet path length in container setups is much more complex than on a plain host
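Those namespace-isolated views can be compared directly: every network namespace has a unique inode, visible as a symlink under /proc. A minimal sketch:

```python
import os

def netns_id(pid="self") -> str:
    """Return the network-namespace identity of a process, e.g.
    'net:[4026531840]'. Two processes share a network stack
    if and only if these strings match."""
    return os.readlink(f"/proc/{pid}/ns/net")

print("this process:", netns_id())
```

Comparing `netns_id(<host pid>)` against `netns_id(<container pid>)` tells you immediately whether two processes see the same interfaces, routes, and firewall rules, which settles most "works on the host, not in the pod" arguments quickly.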
Practical debugging workflow¶
Step 1: establish where the packet dies¶
Ask:
- does it reach the NIC?
- does the host receive it?
- does it hit local socket or forwarding path?
- does it leave the host?
- does reply traffic come back?
Tools¶
- tcpdump -i any
- tcpdump -i eth0
- ip -s link
- ss -tulpn
- ip route get <dst>
- nft list ruleset
Step 2: separate layers¶
Layer 1/2 questions¶
- interface up?
- carrier?
- VLAN correct?
- bridge forwarding?
- MAC learning?
Layer 3 questions¶
- route exists?
- source IP sane?
- rp_filter?
- policy routing?
Layer 4 questions¶
- listener present?
- conntrack state?
- firewall port allowed?
- MTU / MSS mismatch?
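"Listener present?" can be answered from the same kernel data `ss -tln` reads. A sketch parsing /proc/net/tcp (IPv4 only; state 0A is TCP_LISTEN, and the local port is hex-encoded):

```python
import socket

def listening_ports():
    """Ports with an IPv4 TCP socket in LISTEN state, straight
    from /proc/net/tcp (state column == '0A')."""
    ports = set()
    with open("/proc/net/tcp") as f:
        next(f)                                   # skip header line
        for line in f:
            fields = line.split()
            local, state = fields[1], fields[3]
            if state == "0A":                     # TCP_LISTEN
                ports.add(int(local.split(":")[1], 16))
    return ports

# Sanity check: open a listener and confirm it appears.
s = socket.socket()
s.bind(("127.0.0.1", 0))
s.listen()
port = s.getsockname()[1]
found = port in listening_ports()
print(port, "listed:", found)
s.close()
```

Remember that IPv6 or dual-stack listeners live in /proc/net/tcp6, a classic source of "the port looks closed" confusion.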
Step 3: check offloads before trusting captures¶
Sometimes the capture looks insane because:
- checksum offload makes outbound packets look bad before NIC fixes them
- GRO/LRO changes packet shapes
- TSO/GSO changes segmentation behavior
Do not build a fantasy diagnosis from a capture taken without offload awareness.
Common production failure patterns¶
1. "Port is open on the server but unreachable remotely"¶
Could be:
- service bound to loopback
- INPUT rules dropping
- cloud SG / ACL outside host
- routing asymmetry
- wrong interface
- rp_filter
2. "Containers can reach out but nothing can reach them"¶
Usually one of:
- bridge NAT only configured for egress
- no port publish / DNAT
- no route to overlay / container subnet
- policy rules inserted by container runtime
3. "Intermittent drops under load"¶
Possibilities:
- RX ring overflow
- conntrack table full
- CPU softirq saturation
- qdisc drops
- NIC queue imbalance
- MTU / fragmentation trouble
4. "Packets visible in tcpdump but application never sees them"¶
Often:
- wrong socket binding
- namespace mismatch
- firewall drop later in path
- packet reaches host but not that socket
- protocol mismatch
- userspace backlog or accept queue issue
Key mental model¶
A Linux packet path is a conveyor belt with decision stations. When debugging, never ask "is the network broken?" — ask "which exact station did the packet fail to pass?"
References¶
- Linux kernel networking docs
- Linux networking and device APIs
- Linux bridge documentation
- cgroup v2
- man 7 cgroups
- Docker networking docs
- Docker bridge driver docs
- Kubernetes services and networking
- Kubernetes virtual IPs and service proxies
Wiki Navigation¶
Prerequisites¶
- Linux Ops (Topic Pack, L0)
Related Content¶
- Case Study: API Latency Spike — BGP Route Leak, Fix Is Network ACL (Case Study, L2) — Linux Networking Tools
- Case Study: ARP Flux Duplicate IP (Case Study, L2) — Linux Networking Tools
- Case Study: DHCP Relay Broken (Case Study, L1) — Linux Networking Tools
- Case Study: Duplex Mismatch Symptoms (Case Study, L1) — Linux Networking Tools
- Case Study: IPTables Blocking Unexpected (Case Study, L2) — Linux Networking Tools
- Case Study: Jumbo Frames Partial (Case Study, L2) — Linux Networking Tools
- Case Study: Service Mesh 503s — Envoy Misconfigured, RBAC Policy (Case Study, L2) — Linux Networking Tools
- Case Study: Source Routing Policy Miss (Case Study, L2) — Linux Networking Tools
- Case Study: Stuck NFS Mount (Case Study, L2) — Linux Networking Tools
- Deep Dive: AWS VPC Internals (deep_dive, L2) — Linux Networking Tools
Pages that link here¶
- ARP Flux / Duplicate IP
- AWS VPC Internals
- DHCP Not Working on Remote VLAN
- Duplex Mismatch
- Jumbo Frames Enabled But Some Paths Failing
- Kubernetes Networking
- Primer
- Scenario: Duplex Mismatch Causing Slow Transfers and Late Collisions
- Symptoms
- Symptoms
- Symptoms: API Latency Spike, BGP Route Leak, Fix Is Network ACL
- Symptoms: Service Mesh 503s, Envoy Misconfigured, Root Cause Is RBAC Policy
- TCP/IP Deep Dive
- Traffic From Specific Source Not Taking Expected Path
- Wireshark & Packet Analysis - Primer