Portal | Level: L2: Operations | Topics: Linux Networking Tools, Packet Path | Domain: Linux
Linux Network Packet Flow¶
Scope¶
This document explains what happens to a packet on Linux:
- from NIC receive to userspace socket
- from userspace send to NIC transmit
- with routing, conntrack, netfilter, bridge, NAT, and local delivery in the picture
- in both host and container-heavy environments
This is the mental model you need for:
- debugging dropped packets
- tracing weird latency
- understanding Docker / Kubernetes networking
- making sense of iptables, nftables, tc, ip rule, and ip route
- answering interview questions without hand-waving
Big picture¶
A packet is not "handled by Linux" in one giant lump. It moves through a series of layers and hook points.
Receive path¶
wire
-> NIC
-> DMA into RAM
-> interrupt or NAPI poll
-> driver builds sk_buff
-> ingress path
-> optional XDP / tc ingress
-> netfilter PREROUTING
-> routing decision
-> local delivery
-> forwarding
-> bridge path
-> protocol handler (TCP/UDP/ICMP/...)
-> socket receive queue
-> userspace read/recv
Transmit path¶
userspace send/write
-> socket layer
-> TCP/UDP/IP stack
-> routing decision
-> netfilter OUTPUT / POSTROUTING
-> qdisc / tc egress
-> driver queue
-> NIC DMA
-> wire
The packet is usually represented inside the kernel as an sk_buff (skb). If you understand that one object is being classified, routed, rewritten, queued, and finally transmitted or delivered, the whole stack becomes much less mystical.
The main building blocks¶
NIC and driver¶
The network card receives frames from the wire. The driver:
- coordinates DMA so packet data lands in memory
- exposes RX and TX rings
- acknowledges interrupts
- participates in NAPI polling
- hands packets to the kernel networking stack
Important consequences:
- packet loss can happen before the IP stack even sees the packet
- RX ring starvation, IRQ affinity, or driver bugs can look like "network problems"
- high packet rates are often an interrupt / queue / CPU placement problem, not just a bandwidth problem
sk_buff¶
Linux uses the sk_buff structure as the canonical packet wrapper. It tracks:
- pointers to packet data
- protocol headers
- device information
- metadata such as marks, priority, checksum state, timestamps, conntrack association, and routing information
The payload may be linear or fragmented. That matters for offloads and for packet mangling.
NAPI¶
At high packet rates Linux avoids taking one interrupt per packet. Drivers typically use NAPI:
- hardware signals receive activity
- interrupt schedules polling
- kernel polls a bounded amount of RX work
- if traffic subsides, interrupts are re-enabled
Why you care:
- it improves throughput and reduces interrupt storms
- it can increase latency if CPUs are pinned or overloaded badly
- tuning IRQ affinity and queue placement matters on multi-core hosts
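The per-CPU counters behind these symptoms are exposed in /proc/net/softnet_stat. A minimal sketch (Linux only; field positions follow the kernel's documented layout) that flags CPUs whose input backlog overflowed or whose NAPI poll budget ran out with work still pending:

```python
def softnet_stats():
    """Parse /proc/net/softnet_stat: one line of hex fields per CPU.
    Field 0 = packets processed, field 1 = dropped (input backlog full),
    field 2 = time_squeeze (NAPI budget exhausted with work remaining)."""
    rows = []
    with open("/proc/net/softnet_stat") as f:
        for cpu, line in enumerate(f):
            fields = [int(x, 16) for x in line.split()]
            rows.append({"cpu": cpu,
                         "processed": fields[0],
                         "dropped": fields[1],
                         "time_squeeze": fields[2]})
    return rows

for r in softnet_stats():
    if r["dropped"] or r["time_squeeze"]:
        print("CPU %(cpu)d: dropped=%(dropped)d squeeze=%(time_squeeze)d" % r)
```

Non-zero `time_squeeze` on a busy host is a hint that IRQ affinity or queue placement, not bandwidth, is the real problem.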
Routing subsystem¶
The kernel decides whether a packet is:
- for the local host
- to be forwarded elsewhere
- to be bridged at L2
- to be dropped / blackholed / rejected
- subject to policy routing
This is where ip route, ip rule, routing tables, marks, and VRFs matter.
Netfilter / nftables / iptables¶
Netfilter provides hook points in the stack. Rulesets implemented via nftables or iptables can:
- filter
- NAT
- mark
- log
- redirect
- classify
Classic hook names:
- PREROUTING
- INPUT
- FORWARD
- OUTPUT
- POSTROUTING
These are not random names; they describe where in the path the packet currently is.
Conntrack¶
Connection tracking tracks flows and flow state such as:
- NEW
- ESTABLISHED
- RELATED
- INVALID
It is central to:
- stateful firewalling
- many NAT use cases
- service load balancing patterns
Conntrack is also a common production bottleneck when tables overflow or timeouts are wrong.
Receive path in detail¶
1. Frame arrives at the NIC¶
An Ethernet frame hits the card. The NIC:
- verifies enough of the frame to accept it
- may validate checksums or coalesce frames, depending on enabled offloads
- places data into host memory using DMA
- updates RX descriptor rings
At this stage, the CPU may not yet have touched the packet body.
Failure modes here¶
- bad cable / switch / duplex / physical errors
- RX drops in hardware
- small ring sizes
- driver bugs
- IRQ pinned to a saturated CPU
- packet rate too high for polling budget
Tools¶
- ip -s link
- ethtool -S eth0
- ethtool -k eth0
- ethtool -l eth0
- cat /proc/interrupts
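The counters these tools report are also exposed directly in sysfs, which is handy for scripted checks. A minimal sketch (the interface name `lo` is just an always-present placeholder; substitute your real NIC):

```python
from pathlib import Path

def nic_stats(iface: str) -> dict:
    """Read per-interface counters from /sys/class/net/<iface>/statistics
    (the same numbers `ip -s link` prints)."""
    stats_dir = Path("/sys/class/net") / iface / "statistics"
    return {p.name: int(p.read_text()) for p in stats_dir.iterdir()}

s = nic_stats("lo")  # "lo" exists on every Linux host
# rx_dropped / rx_errors climbing here means loss before the IP stack
# ever saw the packet.
for key in ("rx_packets", "rx_dropped", "rx_errors", "tx_packets"):
    print(key, s.get(key, 0))
```

Sampling these counters twice and diffing is usually more useful than a single snapshot.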
2. Driver and NAPI handoff¶
The driver notices RX work, usually by interrupt, and schedules NAPI polling. It creates or fills an skb and hands it to the receive path.
This is the point where Linux meaningfully "owns" the packet.
Important detail: packets may already be coalesced, checksummed, or partially offloaded depending on NIC features.
Why offload awareness matters¶
If you sniff on the host and see strange checksum behavior, it may be because:
- checksum offload is happening later than you think
- segmentation offload means the kernel sees a large logical packet that is later split by NIC hardware
- packet captures can mislead you if you do not account for offloads
3. Optional early packet processing: XDP and tc ingress¶
Before the packet proceeds further, optional fast-path mechanisms may act on it.
XDP¶
XDP runs very early in the driver receive path, in some cases before a full sk_buff is even allocated. It is used for:
- ultra-fast drop
- filtering
- load balancing
- DDoS mitigation
- redirection
The verdicts are XDP_PASS, XDP_DROP, XDP_TX (transmit back out the same NIC), XDP_REDIRECT, and XDP_ABORTED.
tc ingress¶
The traffic control (tc) ingress hook can classify, police, mark, or redirect packets. It is slower than XDP but more tightly integrated with traditional Linux networking behavior.
Why you care¶
If packets vanish before normal firewall rules, check whether:
- an XDP program is attached
- a tc filter exists
- a CNI plugin or security product inserted ingress logic
4. Netfilter PREROUTING¶
This is the first major L3/L4 policy point for normal IP processing.
Typical uses:
- DNAT
- marking
- filtering decisions before local-vs-forward routing choice
- transparent proxying tricks
Conceptually:
packet just entered host
-> should we rewrite destination?
-> should we mark it?
-> should we drop it?
If DNAT occurs here, later routing happens based on the translated destination.
5. Routing decision¶
Linux now asks: where should this packet go?
Possible outcomes:
- local delivery to the host
- forwarding to another interface
- bridge forwarding if in bridge path
- drop due to no route / policy / rp_filter / explicit firewall action
The route lookup considers:
- destination prefix
- policy routing rules
- fwmark
- incoming interface
- VRF / network namespace context
- source constraints for some cases
Local delivery¶
If the packet is for a local address, it heads toward protocol handlers and eventually a socket.
Forwarding¶
If the host is acting as a router and forwarding is enabled, it may go through FORWARD and then out another interface.
Common confusion¶
A host can be both:
- an endpoint for some addresses
- a router for other traffic
- a bridge for L2 forwarding
- a NAT box
These paths overlap but are not identical.
6. Local delivery path¶
If the destination is local:
Netfilter INPUT¶
This is where you commonly allow or deny inbound traffic destined for the local host.
Examples:
- allow SSH to the server
- allow Prometheus scrape traffic
- drop random inbound junk
Protocol demux¶
The IP layer passes to the next protocol:
- TCP
- UDP
- ICMP
- SCTP
- others
The transport layer then tries to match a socket by:
- destination IP
- destination port
- source tuple where relevant
- namespace
- socket options like SO_REUSEPORT
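The SO_REUSEPORT case is easy to demonstrate: two sockets bind the same address and port, and the kernel load-balances incoming connections between them. A minimal sketch (Linux 3.9+; the helper name is illustrative):

```python
import socket

def make_listener(port: int) -> socket.socket:
    """Listener that opts into kernel load-balancing via SO_REUSEPORT."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(("127.0.0.1", port))
    s.listen()
    return s

a = make_listener(0)          # port 0: the kernel picks a free port
port = a.getsockname()[1]
b = make_listener(port)       # would raise EADDRINUSE without SO_REUSEPORT
print("two listeners on port", port)
a.close(); b.close()
```

Note that both sockets must set the option before bind; you cannot retrofit it onto an existing listener.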
Socket receive queue¶
If a matching socket exists, data goes to the receive queue. Userspace then pulls it with:
- recv
- read
- recvmsg
- accept for the connection-setup path
If no listener exists¶
For TCP:
- kernel usually replies with RST
For UDP:
- packet is dropped; an ICMP unreachable may be generated depending on conditions
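Both behaviors can be observed from userspace with plain sockets on loopback. A sketch (the UDP error only surfaces on a connected socket, and in some environments the ICMP may be suppressed, hence the fallbacks):

```python
import socket

# Find a local port with no listener: bind, note the number, close.
probe = socket.socket()
probe.bind(("127.0.0.1", 0))
port = probe.getsockname()[1]
probe.close()

# TCP: SYN to the closed port -> kernel answers RST -> ECONNREFUSED.
tcp_refused = False
try:
    socket.create_connection(("127.0.0.1", port), timeout=2)
except ConnectionRefusedError:
    tcp_refused = True
print("TCP refused:", tcp_refused)

# UDP: a datagram to a closed port triggers ICMP port unreachable.
# On a *connected* UDP socket, Linux reports it as ECONNREFUSED
# on a later send/recv.
u = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
u.connect(("127.0.0.1", port))
u.settimeout(2)
u.send(b"ping")
try:
    u.recv(16)
    print("UDP: no error surfaced")
except ConnectionRefusedError:
    print("UDP: ICMP unreachable surfaced as ECONNREFUSED")
except socket.timeout:
    print("UDP: ICMP suppressed or delayed")
u.close()
```

This is also why an unconnected UDP sender never learns the destination is closed: there is no connected socket for the kernel to attach the ICMP error to.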
7. Forwarding path¶
If the packet is not for the local host and forwarding is enabled:
FORWARD¶
This is the filtering point for routed transit traffic.
POSTROUTING¶
Typical uses:
- SNAT / MASQUERADE
- marks
- final packet policy before egress
Why forwarding breaks in real life¶
Common reasons:
- net.ipv4.ip_forward=0
- firewall allows local input but not forwarding
- reverse path filtering
- missing route back
- broken conntrack state
- MTU issues
- asymmetric routing
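The first items on this list are one file read away: every sysctl is a file under /proc/sys, with dots mapped to path components. A minimal sketch:

```python
from pathlib import Path

def sysctl(name, default=None):
    """Read a sysctl via /proc/sys; e.g. net.ipv4.ip_forward
    maps to /proc/sys/net/ipv4/ip_forward."""
    path = Path("/proc/sys") / name.replace(".", "/")
    try:
        return path.read_text().strip()
    except OSError:
        return default

print("ip_forward:", sysctl("net.ipv4.ip_forward"))
print("rp_filter (all):", sysctl("net.ipv4.conf.all.rp_filter"))
```

`rp_filter` of 1 (strict) silently drops packets whose return route would leave a different interface, which is exactly the asymmetric-routing trap above.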
8. Bridge path¶
If Linux is acting as a bridge, frame processing can stay at layer 2.
Bridge logic decides which port should receive a frame based on MAC learning and forwarding tables.
Important bridge features:
- MAC learning
- STP / RSTP behavior depending on setup
- VLAN filtering
- optional bridge netfilter interaction
Container networking often uses Linux bridges, so a lot of "container networking" is really just "Linux bridge plus veth plus NAT plus policy rules."
Transmit path in detail¶
1. Userspace writes to a socket¶
An application performs:
- send
- sendmsg
- write
- connect + write for TCP stream traffic
For TCP, the kernel manages:
- connection state
- retransmission
- congestion control
- segmentation
- ACK processing
For UDP, the path is simpler: datagram in, datagram out.
2. Socket layer and transport processing¶
The kernel turns application data into transport and network packets.
For TCP this includes:
- sequence numbers
- congestion window
- retransmission timers
- segmentation
- checksums
- state transitions
For UDP this includes:
- datagram framing
- checksums
- optional fragmentation downstream if MTU requires it
3. Output routing decision¶
The kernel picks:
- egress interface
- next hop
- source address
- route attributes
- policy-based overrides if present
This is where:
- wrong source address selection
- weird ip rule matches
- missing routes
turn into black holes.
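One cheap way to inspect this decision is a connected UDP socket: connect() performs the route lookup and source-address selection without sending a single packet, and getsockname() reveals what the kernel chose. A sketch (the function name and port choice are illustrative):

```python
import socket

def route_probe(dst: str, port: int = 9):
    """Ask the kernel which source address it would select for dst.
    connect() on a UDP socket only stores the route; nothing is sent."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.connect((dst, port))       # route lookup happens here
        return s.getsockname()[0]    # kernel-selected source address
    except OSError as e:
        return f"no route: {e}"      # e.g. ENETUNREACH -> a black hole
    finally:
        s.close()

# A loopback destination must select a loopback source:
print(route_probe("127.0.0.1"))
```

This is the programmatic equivalent of `ip route get <dst>`, and a quick way to catch wrong-source-address surprises.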
4. Netfilter OUTPUT¶
This affects packets generated locally by the host.
Typical use cases:
- host egress filtering
- local-service redirection
- service mesh / proxy tricks
- packet marking for policy routing
Do not confuse INPUT and OUTPUT:
- INPUT = packet destined to the local host
- OUTPUT = packet created by the local host
5. Netfilter POSTROUTING¶
This is where final egress NAT often happens.
Common examples:
- container subnet egress MASQUERADE
- host acting as NAT gateway
- policy marks before wire
6. Traffic control / qdisc / egress¶
Before the driver transmits, Linux can queue and shape traffic.
Key ideas:
- every interface has a qdisc
- qdiscs manage queueing, fairness, delay, shaping, and sometimes drops
- fq, fq_codel, htb, and mq commonly appear
This is where you shape bandwidth, enforce fairness, or accidentally create latency.
When qdisc matters¶
- VoIP / latency-sensitive apps
- multi-tenant hosts
- egress congestion
- bufferbloat mitigation
- CNI bandwidth policies
7. Driver TX queue and NIC transmit¶
The driver maps packet buffers for DMA, places descriptors into TX rings, and the NIC eventually transmits on the wire.
At very high rates, bottlenecks can be:
- qdisc lock contention
- TX queue imbalance
- CPU softirq saturation
- offload mismatch
- NIC queue count too small
- NUMA placement problems
Routing, policy routing, and marks¶
Standard routing picks the longest-prefix match from the main routing table. Policy routing adds extra decision layers via ip rule.
Examples of selectors:
- source address
- fwmark
- incoming interface
- TOS / DSCP
- UID in some scenarios
This is heavily used in:
- multi-homed systems
- VPN split routing
- CNI plugins
- transparent proxies
- traffic engineering
Fwmark¶
A packet mark is just metadata attached in the kernel. Rules can later say:
- if mark 0x1, consult table 100
- if mark 0x2, send to different gateway
Marks are extremely useful and extremely easy to lose track of.
Conntrack and NAT¶
Conntrack records flow state. NAT uses that state so response traffic can be rewritten consistently.
DNAT¶
Changes destination, usually early:
- incoming packet to public IP:443
- rewrite to internal IP:8443
SNAT / MASQUERADE¶
Changes source, usually late:
- internal packet from 10.0.0.5
- rewrite source to public IP of egress interface
Why conntrack breaks things¶
Problems include:
- table exhaustion
- stale state
- asymmetric routing
- timeouts too long or too short
- NAT mapping collisions in pathological cases
Useful commands¶
- conntrack -L
- nft list ruleset
- iptables-save
- sysctl net.netfilter.nf_conntrack_max
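Table exhaustion in particular can be watched programmatically; the counters live next to the sysctl above. A sketch (the files are absent when the nf_conntrack module is not loaded, so the helper degrades to None):

```python
from pathlib import Path

def conntrack_usage():
    """Current vs. maximum conntrack entries, or None when the
    nf_conntrack module is not loaded."""
    base = Path("/proc/sys/net/netfilter")
    try:
        count = int((base / "nf_conntrack_count").read_text())
        limit = int((base / "nf_conntrack_max").read_text())
    except OSError:
        return None
    return count, limit

usage = conntrack_usage()
if usage is None:
    print("conntrack not loaded")
else:
    count, limit = usage
    print("entries=%d max=%d (%.1f%% full)" % (count, limit, 100.0 * count / limit))
```

Alerting when usage crosses, say, 80% is far cheaper than debugging the "nf_conntrack: table full, dropping packet" incident afterwards.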
Network namespaces and containers¶
Containers do not invent a new networking stack. They reuse Linux primitives.
Common pattern: each container gets its own network namespace, wired to the host through a veth pair and, often, a bridge.
Inside the container, it looks like it has its own:
- interfaces
- routes
- sockets
- firewall namespace context
But those are namespace-isolated kernel views, not separate kernels.
Important consequences¶
- a container packet often crosses namespaces, veth, bridge, netfilter, and NAT before leaving the box
- "it works on the host but not the pod" often means namespace, policy, or bridge path differences
- packet path length in container setups is much more complex than on a plain host
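Those namespace-isolated views can be compared directly: every network namespace has a unique inode, visible as a symlink under /proc. A minimal sketch:

```python
import os

def netns_id(pid="self") -> str:
    """Return the network-namespace identity of a process, e.g.
    'net:[4026531840]'. Two processes share a network stack
    if and only if these strings match."""
    return os.readlink(f"/proc/{pid}/ns/net")

print("this process:", netns_id())
```

Comparing `netns_id(<host pid>)` against `netns_id(<container pid>)` tells you immediately whether two processes see the same interfaces, routes, and firewall rules, which settles most "works on the host, not in the pod" arguments quickly.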
Practical debugging workflow¶
Step 1: establish where the packet dies¶
Ask:
- does it reach the NIC?
- does the host receive it?
- does it hit local socket or forwarding path?
- does it leave the host?
- does reply traffic come back?
Tools¶
- tcpdump -i any
- tcpdump -i eth0
- ip -s link
- ss -tulpn
- ip route get <dst>
- nft list ruleset
Step 2: separate layers¶
Layer 1/2 questions¶
- interface up?
- carrier?
- VLAN correct?
- bridge forwarding?
- MAC learning?
Layer 3 questions¶
- route exists?
- source IP sane?
- rp_filter?
- policy routing?
Layer 4 questions¶
- listener present?
- conntrack state?
- firewall port allowed?
- MTU / MSS mismatch?
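"Listener present?" can be answered from the same kernel data `ss -tln` reads. A sketch parsing /proc/net/tcp (IPv4 only; state 0A is TCP_LISTEN, and the local port is hex-encoded):

```python
import socket

def listening_ports():
    """Ports with an IPv4 TCP socket in LISTEN state, straight
    from /proc/net/tcp (state column == '0A')."""
    ports = set()
    with open("/proc/net/tcp") as f:
        next(f)                                   # skip header line
        for line in f:
            fields = line.split()
            local, state = fields[1], fields[3]
            if state == "0A":                     # TCP_LISTEN
                ports.add(int(local.split(":")[1], 16))
    return ports

# Sanity check: open a listener and confirm it appears.
s = socket.socket()
s.bind(("127.0.0.1", 0))
s.listen()
port = s.getsockname()[1]
found = port in listening_ports()
print(port, "listed:", found)
s.close()
```

Remember that IPv6 or dual-stack listeners live in /proc/net/tcp6, a classic source of "the port looks closed" confusion.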
Step 3: check offloads before trusting captures¶
Sometimes the capture looks insane because:
- checksum offload makes outbound packets look bad before NIC fixes them
- GRO/LRO changes packet shapes
- TSO/GSO changes segmentation behavior
Do not build a fantasy diagnosis from a capture taken without offload awareness.
Common production failure patterns¶
1. "Port is open on the server but unreachable remotely"¶
Could be:
- service bound to loopback
- INPUT rules dropping
- cloud SG / ACL outside host
- routing asymmetry
- wrong interface
- rp_filter
2. "Containers can reach out but nothing can reach them"¶
Usually one of:
- bridge NAT only configured for egress
- no port publish / DNAT
- no route to overlay / container subnet
- policy rules inserted by container runtime
3. "Intermittent drops under load"¶
Possibilities:
- RX ring overflow
- conntrack table full
- CPU softirq saturation
- qdisc drops
- NIC queue imbalance
- MTU / fragmentation trouble
4. "Packets visible in tcpdump but application never sees them"¶
Often:
- wrong socket binding
- namespace mismatch
- firewall drop later in path
- packet reaches host but not that socket
- protocol mismatch
- userspace backlog or accept queue issue
Key mental model¶
A Linux packet path is a conveyor belt with decision stations. When debugging, never ask "is the network broken?" — ask "which exact station did the packet fail to pass?"
References¶
- Linux kernel networking docs
- Linux networking and device APIs
- Linux bridge documentation
- cgroup v2
- man 7 cgroups
- Docker networking docs
- Docker bridge driver docs
- Kubernetes services and networking
- Kubernetes virtual IPs and service proxies
Wiki Navigation¶
Prerequisites¶
- Linux Ops (Topic Pack, L0)
Related Content¶
- Case Study: API Latency Spike — BGP Route Leak, Fix Is Network ACL (Case Study, L2) — Linux Networking Tools
- Case Study: ARP Flux Duplicate IP (Case Study, L2) — Linux Networking Tools
- Case Study: DHCP Relay Broken (Case Study, L1) — Linux Networking Tools
- Case Study: Duplex Mismatch Symptoms (Case Study, L1) — Linux Networking Tools
- Case Study: IPTables Blocking Unexpected (Case Study, L2) — Linux Networking Tools
- Case Study: Jumbo Frames Partial (Case Study, L2) — Linux Networking Tools
- Case Study: Service Mesh 503s — Envoy Misconfigured, RBAC Policy (Case Study, L2) — Linux Networking Tools
- Case Study: Source Routing Policy Miss (Case Study, L2) — Linux Networking Tools
- Case Study: Stuck NFS Mount (Case Study, L2) — Linux Networking Tools
- Deep Dive: AWS VPC Internals (deep_dive, L2) — Linux Networking Tools
Pages that link here¶
- ARP Flux / Duplicate IP
- AWS VPC Internals
- DHCP Not Working on Remote VLAN
- Duplex Mismatch
- Jumbo Frames Enabled But Some Paths Failing
- Kubernetes Networking
- Primer
- Scenario: Duplex Mismatch Causing Slow Transfers and Late Collisions
- Symptoms
- Symptoms
- Symptoms: API Latency Spike, BGP Route Leak, Fix Is Network ACL
- Symptoms: Service Mesh 503s, Envoy Misconfigured, Root Cause Is RBAC Policy
- TCP/IP Deep Dive
- Traffic From Specific Source Not Taking Expected Path
- Wireshark & Packet Analysis - Primer