
Linux Network Packet Flow

Scope

This document explains what happens to a packet on Linux:

  • from NIC receive to userspace socket
  • from userspace send to NIC transmit
  • with routing, conntrack, netfilter, bridge, NAT, and local delivery in the picture
  • in both host and container-heavy environments

This is the mental model you need for:

  • debugging dropped packets
  • tracing weird latency
  • understanding Docker / Kubernetes networking
  • making sense of iptables, nftables, tc, ip rule, and ip route
  • answering interview questions without hand-waving

Big picture

A packet is not "handled by Linux" in one giant lump. It moves through a series of layers and hook points.

Receive path

wire
  -> NIC
  -> DMA into RAM
  -> interrupt or NAPI poll
  -> driver builds sk_buff
  -> ingress path
  -> optional XDP / tc ingress
  -> netfilter PREROUTING
  -> routing decision
      -> local delivery
      -> forwarding
      -> bridge path
  -> protocol handler (TCP/UDP/ICMP/...)
  -> socket receive queue
  -> userspace read/recv

Transmit path

userspace send/write
  -> socket layer
  -> TCP/UDP/IP stack
  -> routing decision
  -> netfilter OUTPUT / POSTROUTING
  -> qdisc / tc egress
  -> driver queue
  -> NIC DMA
  -> wire

The packet is usually represented inside the kernel as an sk_buff (skb). If you understand that one object is being classified, routed, rewritten, queued, and finally transmitted or delivered, the whole stack becomes much less mystical.
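The station-by-station view above can be sketched in a few lines. This is an illustrative model, not kernel code: every name below is invented, and the point is only that one mutable object (the skb) accumulates decisions as it passes each station.

```python
# Toy model of an sk_buff moving through decision stations.
# All names and addresses are invented for illustration.

def prerouting(skb):
    skb["hooks"].append("PREROUTING")     # first netfilter hook on receive
    return skb

def route(skb):
    skb["hooks"].append("routing")
    # Pretend 10.0.0.1 is a local address of this host.
    skb["verdict"] = "local" if skb["dst"] == "10.0.0.1" else "forward"
    return skb

def deliver_or_forward(skb):
    skb["hooks"].append("INPUT" if skb["verdict"] == "local" else "FORWARD")
    return skb

skb = {"dst": "10.0.0.1", "hooks": [], "verdict": None}
for station in (prerouting, route, deliver_or_forward):
    skb = station(skb)

print(skb["hooks"])  # ['PREROUTING', 'routing', 'INPUT']
```

Debugging is exactly this: working out which station produced a verdict you did not expect.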


The main building blocks

NIC and driver

The network card receives frames from the wire. The driver:

  • coordinates DMA so packet data lands in memory
  • exposes RX and TX rings
  • acknowledges interrupts
  • participates in NAPI polling
  • hands packets to the kernel networking stack

Important consequences:

  • packet loss can happen before the IP stack even sees the packet
  • RX ring starvation, IRQ affinity, or driver bugs can look like "network problems"
  • high packet rates are often an interrupt / queue / CPU placement problem, not just a bandwidth problem

sk_buff

Linux uses the sk_buff structure as the canonical packet wrapper. It tracks:

  • pointers to packet data
  • protocol headers
  • device information
  • metadata such as marks, priority, checksum state, timestamps, conntrack association, and routing information

The payload may be linear or fragmented. That matters for offloads and for packet mangling.

NAPI

At high packet rates Linux avoids taking one interrupt per packet. Drivers typically use NAPI:

  1. hardware signals receive activity
  2. interrupt schedules polling
  3. kernel polls a bounded amount of RX work
  4. if traffic subsides, interrupts are re-enabled

Why you care:

  • it improves throughput and reduces interrupt storms
  • it can increase latency if CPUs are pinned or overloaded badly
  • tuning IRQ affinity and queue placement matters on multi-core hosts
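The bounded-poll idea can be simulated in a few lines. This is a toy model, not driver code: the real budget lives in the driver and in `net.core.netdev_budget`, and the number 64 below is just a common per-poll default.

```python
# Toy simulation of NAPI's bounded polling.
from collections import deque

BUDGET = 64                      # max packets handled per poll invocation

rx_ring = deque(range(200))      # 200 packets waiting in the RX ring
polls = 0
processed = 0

while rx_ring:
    polls += 1
    for _ in range(min(BUDGET, len(rx_ring))):
        rx_ring.popleft()        # driver pulls one packet off the ring
        processed += 1
    # In the kernel: if the budget was exhausted, stay in polling mode;
    # otherwise re-enable interrupts and leave the loop.

print(polls, processed)          # 4 polls to drain 200 packets at budget 64
```

The budget is what keeps one busy NIC from monopolizing a CPU: work beyond the budget waits for the next poll rather than starving everything else in softirq context.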

Routing subsystem

The kernel decides whether a packet is:

  • for the local host
  • to be forwarded elsewhere
  • to be bridged at L2
  • to be dropped / blackholed / rejected
  • subject to policy routing

This is where ip route, ip rule, routing tables, marks, and VRFs matter.

Netfilter / nftables / iptables

Netfilter provides hook points in the stack. Rulesets implemented via nftables or iptables can:

  • filter
  • NAT
  • mark
  • log
  • redirect
  • classify

Classic hook names:

  • PREROUTING
  • INPUT
  • FORWARD
  • OUTPUT
  • POSTROUTING

These are not arbitrary names; each one describes where in the path the packet is when the hook fires.
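Which hooks a packet traverses follows directly from where it came from and where routing sends it. A minimal sketch of the classic netfilter hook orderings:

```python
# Hook sequences from the classic netfilter traversal diagram.
def hook_path(origin, verdict):
    if origin == "local":                    # generated by this host
        return ["OUTPUT", "POSTROUTING"]
    if verdict == "local":                   # inbound, destined for this host
        return ["PREROUTING", "INPUT"]
    return ["PREROUTING", "FORWARD", "POSTROUTING"]  # routed transit

print(hook_path("wire", "local"))    # ['PREROUTING', 'INPUT']
print(hook_path("wire", "forward"))  # ['PREROUTING', 'FORWARD', 'POSTROUTING']
print(hook_path("local", None))      # ['OUTPUT', 'POSTROUTING']
```

Note what is absent: transit traffic never touches INPUT or OUTPUT, which is why "but my INPUT rule allows it" is a non-answer for forwarded packets.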

Conntrack

Connection tracking tracks flows and flow state such as:

  • NEW
  • ESTABLISHED
  • RELATED
  • INVALID

It is central to:

  • stateful firewalling
  • many NAT use cases
  • service load balancing patterns

Conntrack is also a common production bottleneck when tables overflow or timeouts are wrong.
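The core of NEW-vs-ESTABLISHED classification is simpler than it sounds: has this 5-tuple, or its reverse, been seen before? A toy version (ignoring TCP state machines, timeouts, and INVALID handling entirely):

```python
# Minimal flow-state sketch; the real conntrack entry is far richer.
flows = set()

def classify(src, sport, dst, dport, proto):
    tup = (src, sport, dst, dport, proto)
    rev = (dst, dport, src, sport, proto)      # reply direction
    if tup in flows or rev in flows:
        return "ESTABLISHED"
    flows.add(tup)
    return "NEW"

a = classify("10.0.0.5", 40000, "1.2.3.4", 443, "tcp")  # first packet
b = classify("1.2.3.4", 443, "10.0.0.5", 40000, "tcp")  # the reply
print(a, b)  # NEW ESTABLISHED
```

The reverse-tuple check is the whole trick: it is what lets a stateful firewall say "allow replies to outbound connections" with one rule.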


Receive path in detail

1. Frame arrives at the NIC

An Ethernet frame hits the card. The NIC:

  • verifies enough of the frame to accept it
  • may validate checksums or assist with segmentation offloads in hardware
  • places data into host memory using DMA
  • updates RX descriptor rings

At this stage, the CPU may not yet have touched the packet body.

Failure modes here

  • bad cable / switch / duplex / physical errors
  • RX drops in hardware
  • small ring sizes
  • driver bugs
  • IRQ pinned to a saturated CPU
  • packet rate too high for polling budget

Tools

  • ip -s link
  • ethtool -S eth0
  • ethtool -k eth0
  • ethtool -l eth0
  • cat /proc/interrupts

2. Driver and NAPI handoff

The driver notices RX work, usually by interrupt, and schedules NAPI polling. It creates or fills an skb and hands it to the receive path.

This is the point where Linux meaningfully "owns" the packet.

Important detail: packets may already be coalesced, checksummed, or partially offloaded depending on NIC features.

Why offload awareness matters

If you sniff on the host and see strange checksum behavior, it may be because:

  • checksum offload is happening later than you think
  • segmentation offload means the kernel sees a large logical packet that is later split by NIC hardware
  • packet captures can mislead you if you do not account for offloads

3. Optional early packet processing: XDP and tc ingress

Before the packet proceeds further, optional fast-path mechanisms may act on it.

XDP

XDP runs very early in the driver path, in some cases before an sk_buff is even allocated. It is used for:

  • ultra-fast drop
  • filtering
  • load balancing
  • DDoS mitigation
  • redirection

Program verdicts are XDP_PASS, XDP_DROP, XDP_TX (transmit back out the same interface), XDP_REDIRECT, and XDP_ABORTED.

tc ingress

The traffic control ingress hook can classify, police, or redirect packets. It is slower than XDP but more integrated with traditional Linux networking behavior.

Why you care

If packets vanish before normal firewall rules, check whether:

  • an XDP program is attached
  • a tc filter exists
  • a CNI plugin or security product inserted ingress logic

4. Netfilter PREROUTING

This is the first major L3/L4 policy point for normal IP processing.

Typical uses:

  • DNAT
  • marking
  • filtering decisions before local-vs-forward routing choice
  • transparent proxying tricks

Conceptually:

packet just entered host
  -> should we rewrite destination?
  -> should we mark it?
  -> should we drop it?

If DNAT occurs here, later routing happens based on the translated destination.


5. Routing decision

Linux now asks: where should this packet go?

Possible outcomes:

  • local delivery to the host
  • forwarding to another interface
  • bridge forwarding if in bridge path
  • drop due to no route / policy / rp_filter / explicit firewall action

The route lookup considers:

  • destination prefix
  • policy routing rules
  • fwmark
  • incoming interface
  • VRF / network namespace context
  • source address constraints in some cases

Local delivery

If the packet is for a local address, it heads toward protocol handlers and eventually a socket.

Forwarding

If the host is acting as a router and forwarding is enabled, it may go through FORWARD and then out another interface.

Common confusion

A host can be both:

  • an endpoint for some addresses
  • a router for other traffic
  • a bridge for L2 forwarding
  • a NAT box

These paths overlap but are not identical.


6. Local delivery path

If the destination is local:

PREROUTING
  -> route says "this host"
  -> INPUT
  -> protocol handler
  -> socket

Netfilter INPUT

This is where you commonly allow or deny inbound traffic destined for the local host.

Examples:

  • allow SSH to the server
  • allow Prometheus scrape traffic
  • drop random inbound junk

Protocol demux

The IP layer passes to the next protocol:

  • TCP
  • UDP
  • ICMP
  • SCTP
  • others

The transport layer then tries to match a socket by:

  • destination IP
  • destination port
  • source tuple where relevant
  • namespace
  • socket options like SO_REUSEPORT
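The precedence idea behind that lookup can be sketched simply: an exact local-address bind beats a wildcard bind. (The real kernel uses hashed lookup tables and handles SO_REUSEPORT groups; the socket names below are invented.)

```python
# Toy transport demux: most-specific listening socket wins.
listeners = {
    ("0.0.0.0", 80):   "wildcard-http",   # bound to all addresses
    ("127.0.0.1", 80): "loopback-http",   # bound to loopback only
}

def demux(dst_ip, dst_port):
    # Exact (ip, port) match first, wildcard bind as fallback.
    return (listeners.get((dst_ip, dst_port))
            or listeners.get(("0.0.0.0", dst_port)))

print(demux("127.0.0.1", 80))  # loopback-http
print(demux("192.0.2.7", 80))  # wildcard-http
print(demux("192.0.2.7", 22))  # None -> no listener
```

This is why "the service is listening on port 80" is ambiguous: bound to which address, in which namespace?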

Socket receive queue

If a matching socket exists, data goes to the receive queue. Userspace then pulls it with:

  • recv
  • read
  • recvmsg
  • accept for connection setup path

If no listener exists

For TCP:

  • kernel usually replies with RST

For UDP:

  • packet is dropped; an ICMP unreachable may be generated depending on conditions
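The TCP case is directly observable from userspace: a SYN to a local port with no listener is answered with an RST, which the connecting side sees as "connection refused". The free-port probe below is just a trick to find a port that is very likely closed.

```python
import socket

# Ask the kernel for a free port, then close it so nothing is listening.
probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
probe.bind(("127.0.0.1", 0))
port = probe.getsockname()[1]
probe.close()

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.settimeout(2)
try:
    s.connect(("127.0.0.1", port))
    outcome = "connected"
except ConnectionRefusedError:
    outcome = "refused"          # the kernel's RST, surfaced as an errno
finally:
    s.close()

print(outcome)  # refused
```

Contrast this with a firewall DROP, where no RST comes back and the client times out instead; REJECT vs DROP is exactly this difference.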

7. Forwarding path

If the packet is not for the local host and forwarding is enabled:

PREROUTING
  -> route says "forward"
  -> FORWARD
  -> POSTROUTING
  -> egress qdisc / driver
  -> NIC

FORWARD

This is the filtering point for routed transit traffic.

POSTROUTING

Typical uses:

  • SNAT / MASQUERADE
  • marks
  • final packet policy before egress

Why forwarding breaks in real life

Common reasons:

  • net.ipv4.ip_forward=0
  • firewall allows local input but not forwarding
  • reverse path filtering
  • missing route back
  • broken conntrack state
  • MTU issues
  • asymmetric routing

8. Bridge path

If Linux is acting as a bridge, frame processing can stay at layer 2.

Bridge logic decides which port should receive a frame based on MAC learning and forwarding tables.

Important bridge features:

  • MAC learning
  • STP / RSTP behavior depending on setup
  • VLAN filtering
  • optional bridge netfilter interaction

Container networking often uses Linux bridges, so a lot of "container networking" is really just "Linux bridge plus veth plus NAT plus policy rules."


Transmit path in detail

1. Userspace writes to a socket

An application performs:

  • send
  • sendmsg
  • write
  • connect + write for TCP stream traffic

For TCP, the kernel manages:

  • connection state
  • retransmission
  • congestion control
  • segmentation
  • ACK processing

For UDP, the path is simpler: datagram in, datagram out.


2. Socket layer and transport processing

The kernel turns application data into transport and network packets.

For TCP this includes:

  • sequence numbers
  • congestion window
  • retransmission timers
  • segmentation
  • checksums
  • state transitions

For UDP this includes:

  • datagram framing
  • checksums
  • optional fragmentation downstream if MTU requires it

3. Output routing decision

The kernel picks:

  • egress interface
  • next hop
  • source address
  • route attributes
  • policy-based overrides if present

This is where:

  • wrong source address selection
  • weird ip rule matches
  • VRF confusion
  • missing routes

turn into black holes.
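Source address selection is easy to observe without sending a single packet: `connect()` on a UDP socket performs the route lookup and binds a source address immediately.

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.connect(("127.0.0.1", 9))      # loopback destination -> loopback source
src_ip, src_port = s.getsockname()
s.close()

print(src_ip)  # 127.0.0.1
```

`ip route get <dst>` shows the same decision from the command line, including the selected source address and egress interface; when a daemon sends from "the wrong IP" on a multi-homed host, this lookup is the place to start.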


4. Netfilter OUTPUT

This affects packets generated locally by the host.

Typical use cases:

  • host egress filtering
  • local-service redirection
  • service mesh / proxy tricks
  • packet marking for policy routing

Do not confuse INPUT and OUTPUT:

  • INPUT = packet to local host
  • OUTPUT = packet created by local host

5. Netfilter POSTROUTING

This is where final egress NAT often happens.

Common examples:

  • container subnet egress MASQUERADE
  • host acting as NAT gateway
  • policy marks before wire

6. Traffic control / qdisc / egress

Before the driver transmits, Linux can queue and shape traffic.

Key ideas:

  • every interface has a qdisc
  • qdiscs manage queueing, fairness, delay, shaping, and sometimes drops
  • fq, fq_codel, htb, and mq commonly appear

This is where you shape bandwidth, enforce fairness, or accidentally create latency.

When qdisc matters

  • VoIP / latency-sensitive apps
  • multi-tenant hosts
  • egress congestion
  • bufferbloat mitigation
  • CNI bandwidth policies
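The mechanism behind shaping qdiscs such as tbf is a token bucket: packets pass only when enough tokens have accumulated at the configured rate. A deliberately simplified sketch (real qdiscs queue packets rather than dropping immediately, and work in bytes per second rather than per tick):

```python
def shape(packets, rate_tokens_per_tick, bucket_cap):
    tokens, sent, dropped = 0, 0, 0
    for size in packets:                         # one packet per tick
        tokens = min(bucket_cap, tokens + rate_tokens_per_tick)
        if size <= tokens:
            tokens -= size
            sent += 1
        else:
            dropped += 1                         # tbf would queue first
    return sent, dropped

# 10 packets of 100 units at 50 units/tick: only every other one fits.
sent, dropped = shape([100] * 10, rate_tokens_per_tick=50, bucket_cap=200)
print(sent, dropped)  # 5 5
```

The bucket capacity is the burst allowance: a large bucket lets idle time be "saved up" and spent as a burst, which is exactly the knob behind latency-vs-throughput trade-offs in shaping.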

7. Driver TX queue and NIC transmit

The driver maps packet buffers for DMA, places descriptors into TX rings, and the NIC eventually transmits on the wire.

At very high rates, bottlenecks can be:

  • qdisc lock contention
  • TX queue imbalance
  • CPU softirq saturation
  • offload mismatch
  • NIC queue count too small
  • NUMA placement problems

Routing, policy routing, and marks

Standard routing picks the longest-prefix match from the main routing table. Policy routing adds extra decision layers via ip rule.

Examples of selectors:

  • source address
  • fwmark
  • incoming interface
  • TOS / DSCP
  • UID in some scenarios

This is heavily used in:

  • multi-homed systems
  • VPN split routing
  • CNI plugins
  • transparent proxies
  • traffic engineering
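Longest-prefix match itself is worth internalizing: among all routes whose prefix covers the destination, the most specific wins. A small sketch using documentation addresses:

```python
import ipaddress

routes = {
    "0.0.0.0/0":   "via 192.0.2.1 dev eth0",    # default route
    "10.0.0.0/8":  "via 10.0.0.1 dev eth1",
    "10.1.2.0/24": "dev eth2",                   # most specific
}

def lookup(dst):
    addr = ipaddress.ip_address(dst)
    best = max(
        (ipaddress.ip_network(p) for p in routes
         if addr in ipaddress.ip_network(p)),
        key=lambda n: n.prefixlen,               # longest prefix wins
    )
    return routes[str(best)]

print(lookup("10.1.2.9"))  # dev eth2
print(lookup("10.9.9.9"))  # via 10.0.0.1 dev eth1
print(lookup("8.8.8.8"))   # via 192.0.2.1 dev eth0
```

`ip route get <dst>` performs this lookup (plus policy rules) against the live tables, which makes it far more trustworthy than eyeballing `ip route show`.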

Fwmark

A packet mark is just metadata attached in the kernel. Rules can later say:

  • if mark 0x1, consult table 100
  • if mark 0x2, send to different gateway

Marks are extremely useful and extremely easy to lose track of.
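The rule evaluation model is simple: rules are checked in priority order, and the first match selects a table. A sketch with invented priorities and table numbers:

```python
# Each rule: (priority, match criteria, routing table).
rules = [
    (100,   {"fwmark": 0x1}, "table 100"),  # ip rule add fwmark 0x1 table 100
    (200,   {"fwmark": 0x2}, "table 200"),
    (32766, {},              "main"),       # the default rule -> main table
]

def select_table(fwmark):
    for _prio, match, table in sorted(rules):   # ascending priority
        if match.get("fwmark") in (None, fwmark):
            return table
    return "main"

print(select_table(0x1))  # table 100
print(select_table(0x7))  # main (no mark rule matched)
```

Losing track of marks usually means exactly this: some earlier hook set (or cleared) the fwmark, so a different rule matched than the one you expected.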


Conntrack and NAT

Conntrack records flow state. NAT uses that state so response traffic can be rewritten consistently.

DNAT

Changes destination, usually early:

  • incoming packet to public IP:443
  • rewrite to internal IP:8443

SNAT / MASQUERADE

Changes source, usually late:

  • internal packet from 10.0.0.5
  • rewrite source to public IP of egress interface
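The dependency between NAT and conntrack can be shown in miniature: record the translation when the first packet is rewritten, then apply the reverse mapping to replies. All addresses below are documentation/example values, and real conntrack keys on full 5-tuples including ports.

```python
nat = {}  # (reply src, reply dst) -> (reply src, original internal dst)

def snat_out(src, dst, public_ip="203.0.113.1"):
    nat[(dst, public_ip)] = (dst, src)   # remember how to un-NAT the reply
    return public_ip, dst                # packet leaves with public source

def un_snat_in(src, dst):
    return nat.get((src, dst), (src, dst))

out_src, out_dst = snat_out("10.0.0.5", "198.51.100.9")
reply = un_snat_in("198.51.100.9", "203.0.113.1")
print(out_src, out_dst)  # 203.0.113.1 198.51.100.9
print(reply)             # ('198.51.100.9', '10.0.0.5')
```

This is also why asymmetric routing breaks NAT: if the reply enters through a host that never saw the original packet, there is no entry to reverse the translation with.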

Why conntrack breaks things

Problems include:

  • table exhaustion
  • stale state
  • asymmetric routing
  • timeouts too long or too short
  • NAT mapping collisions in pathological cases

Useful commands

  • conntrack -L
  • nft list ruleset
  • iptables-save
  • sysctl net.netfilter.nf_conntrack_max

Network namespaces and containers

Containers do not invent a new networking stack. They reuse Linux primitives.

Common pattern:

container netns
  -> veth pair
  -> host namespace bridge
  -> NAT / routing / policy
  -> host NIC

Inside the container, it looks like it has its own:

  • interfaces
  • routes
  • sockets
  • firewall namespace context

But those are namespace-isolated kernel views, not separate kernels.

Important consequences

  • a container packet often crosses namespaces, veth, bridge, netfilter, and NAT before leaving the box
  • "it works on the host but not the pod" often means namespace, policy, or bridge path differences
  • packet path length in container setups is much more complex than on a plain host

Practical debugging workflow

Step 1: establish where the packet dies

Ask:

  • does it reach the NIC?
  • does the host receive it?
  • does it hit local socket or forwarding path?
  • does it leave the host?
  • does reply traffic come back?

Tools

  • tcpdump -i any
  • tcpdump -i eth0
  • ip -s link
  • ss -tulpn
  • ip route get <dst>
  • nft list ruleset

Step 2: separate layers

Layer 1/2 questions

  • interface up?
  • carrier?
  • VLAN correct?
  • bridge forwarding?
  • MAC learning?

Layer 3 questions

  • route exists?
  • source IP sane?
  • rp_filter?
  • policy routing?

Layer 4 questions

  • listener present?
  • conntrack state?
  • firewall port allowed?
  • MTU / MSS mismatch?

Step 3: check offloads before trusting captures

Sometimes the capture looks insane because:

  • checksum offload makes outbound packets look bad before NIC fixes them
  • GRO/LRO changes packet shapes
  • TSO/GSO changes segmentation behavior

Do not build a fantasy diagnosis from a capture taken without offload awareness.


Common production failure patterns

1. "Port is open on the server but unreachable remotely"

Could be:

  • service bound to loopback
  • INPUT rules dropping
  • cloud SG / ACL outside host
  • routing asymmetry
  • wrong interface
  • rp_filter

2. "Containers can reach out but nothing can reach them"

Usually one of:

  • bridge NAT only configured for egress
  • no port publish / DNAT
  • no route to overlay / container subnet
  • policy rules inserted by container runtime

3. "Intermittent drops under load"

Possibilities:

  • RX ring overflow
  • conntrack table full
  • CPU softirq saturation
  • qdisc drops
  • NIC queue imbalance
  • MTU / fragmentation trouble

4. "Packets visible in tcpdump but application never sees them"

Often:

  • wrong socket binding
  • namespace mismatch
  • firewall drop later in path
  • packet reaches host but not that socket
  • protocol mismatch
  • userspace backlog or accept queue issue

Key mental model

A Linux packet path is a conveyor belt with decision stations. When debugging, never ask "is the network broken?" — ask "which exact station did the packet fail to pass?"

