# Linux Networking: Bridges, Bonds, and VLANs
- lesson
- network-namespaces
- veth-pairs
- linux-bridges
- bonding/lacp
- vlans
- macvlan/ipvlan
- tap/tun
- docker-networking
- ovs
- tc
- kubernetes-cni
Topics: network namespaces, veth pairs, Linux bridges, bonding/LACP, VLANs, macvlan/ipvlan, tap/tun, Docker networking, OVS, tc, Kubernetes CNI
Level: L1–L2 (Foundations → Operations)
Time: 75–90 minutes
Prerequisites: None (everything is explained from scratch)
## The Mission
You just inherited a bare-metal server that needs to host four isolated tenant workloads. Each tenant gets its own network segment. Two tenants need VLAN access to the physical network. The server has two 10G NICs that should be bonded for redundancy. And the whole thing needs to resemble — at a conceptual level — what Docker and Kubernetes do under the hood.
By the end of this lesson, you'll have built the whole setup from scratch using nothing
but ip commands. More importantly, you'll understand why container networking works
the way it does, because you'll have built it yourself, piece by piece:
- Network namespaces: the isolation primitive that makes containers possible
- veth pairs: the virtual cables that connect isolated worlds
- Linux bridges: the software switches that tie everything together
- VLANs: Layer 2 segmentation on a single wire
- Bonding: turning two NICs into one for redundancy and bandwidth
- How Docker's bridge networking is just namespaces + veth + bridge + iptables
- Where Kubernetes CNI picks up the story
We start with a single namespace. We end with a multi-tenant network. Let's go.
## Part 1: Network Namespaces — Your Own Private Network Stack
Every process on Linux shares the same network stack by default — the same interfaces, the same routing table, the same iptables rules. Network namespaces change that. A namespace gets its own everything: interfaces, routes, ARP table, firewall rules, sockets. It's a complete network stack in a box.
Name Origin: The first Linux namespace (mount, 2002) used the flag `CLONE_NEWNS` — "new namespace" — because nobody expected there would be more than one type. Every subsequent namespace got a more specific name: `CLONE_NEWPID`, `CLONE_NEWNET`, etc. The mount namespace is still stuck with the generic flag as a historical accident.
Let's create one:
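Creating the namespace is a single command (the name `tenant1` is the one this lesson uses throughout):

```shell
# Create a network namespace named tenant1
ip netns add tenant1

# It now shows up in the namespace list
ip netns list
```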
Now look inside it:
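Listing the link devices from inside the namespace shows just how empty it is:

```shell
# Run `ip link` inside the tenant1 namespace
ip netns exec tenant1 ip link
```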
Output:
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
That's it. One loopback interface, and it's DOWN. No eth0. No routes. No connectivity. This namespace is completely isolated from the host and from every other namespace.
# Bring up loopback (you'll need this for local communication)
ip netns exec tenant1 ip link set lo up
# Check the routing table — it's empty
ip netns exec tenant1 ip route
Nothing comes back. This namespace can't reach anything. That's the point.
Mental Model: Think of a network namespace as a brand new computer with no network cables plugged in. It has a network stack, but no connections. Everything you want it to reach, you have to wire up yourself.
### What lives in a namespace
Each namespace has its own:
| Resource | Isolated? | Example |
|---|---|---|
| Interfaces | Yes | `lo`, `eth0`, veth, bridges |
| Routing table | Yes | `ip route` shows different routes per namespace |
| ARP/neighbor table | Yes | `ip neigh` is per-namespace |
| iptables/nftables rules | Yes | Firewall rules are namespace-scoped |
| Sockets | Yes | A port 80 listener in ns1 doesn't conflict with ns2 |
| `/proc/net/*` | Yes | Each namespace has its own proc network files |
Interview Bridge: "How does a container get its own IP address and routing table?" The answer is network namespaces. Every container runtime (Docker, containerd, CRI-O) creates a network namespace per container (or per pod in Kubernetes). That's the entire isolation mechanism. There's no magic.
## Part 2: veth Pairs — Virtual Ethernet Cables
A namespace with no connections is useless. You need a way to get packets in and out.
Enter veth pairs.
Name Origin: `veth` = virtual Ethernet. A veth pair is two virtual Ethernet interfaces connected back-to-back. Whatever goes in one end comes out the other. Think of it as a virtual crossover cable with an interface on each end.
# Create a veth pair: veth-host and veth-tenant1
ip link add veth-host type veth peer name veth-tenant1
You now have two interfaces on the host. Let's move one end into the namespace:
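Moving an interface between namespaces is one `ip link set ... netns` command:

```shell
# Move veth-tenant1 into the tenant1 namespace
ip link set veth-tenant1 netns tenant1

# It no longer exists on the host...
ip link show veth-tenant1 2>/dev/null || echo "gone from the host"

# ...but it's visible inside tenant1
ip netns exec tenant1 ip link show veth-tenant1
```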
Now veth-tenant1 has vanished from the host — it only exists inside tenant1. But the
two ends are still connected. Assign IPs and bring them up:
# Host side
ip addr add 10.0.1.1/24 dev veth-host
ip link set veth-host up
# Tenant side (run inside the namespace)
ip netns exec tenant1 ip addr add 10.0.1.2/24 dev veth-tenant1
ip netns exec tenant1 ip link set veth-tenant1 up
Test it:
# From the host, ping the tenant
ping -c 2 10.0.1.2
# From the tenant, ping the host
ip netns exec tenant1 ping -c 2 10.0.1.1
Both should work. You just connected an isolated namespace to the host using a virtual cable.
Under the Hood: When you write to one end of a veth pair, the kernel's `veth_xmit()` function takes the packet, flips the source/destination device pointers, and delivers it to the peer's receive path — as if it arrived from a physical wire. There's no copy; the same `sk_buff` (socket buffer) is passed to the other end. This is why veth pairs have near-zero overhead.
### The problem with point-to-point
What we built works for one namespace. But what if you have four tenants that all need to talk to each other and to the host? You'd need a veth pair between every pair of namespaces — that's 6 pairs for 4 namespaces, 10 pairs for 5, and it scales as n(n-1)/2.
This is exactly the problem that switches solve in the physical world. In the virtual world, we use a Linux bridge.
## Flashcard Check #1
Cover the answers. Test yourself.
| Question | Answer |
|---|---|
| What kernel feature gives a container its own network stack? | Network namespace (`CLONE_NEWNET`) |
| What does `ip netns exec tenant1 bash` do? | Opens a shell inside the tenant1 network namespace |
| What is a veth pair? | Two virtual Ethernet interfaces connected back-to-back — a virtual cable |
| Why can't you see veth-tenant1 on the host after moving it? | It was moved into the tenant1 namespace; interfaces belong to exactly one namespace |
| What's the scaling problem with veth-only connectivity? | Point-to-point pairs scale as n(n-1)/2 — you need a bridge |
## Part 3: Linux Bridges — Software Switches
A Linux bridge is a Layer 2 switch implemented in the kernel. It learns MAC addresses, forwards frames between ports, and acts as the central meeting point for veth pairs, physical NICs, VLAN interfaces, and tap devices.
Name Origin: The term "bridge" comes from the original networking device that "bridged" two separate network segments, allowing them to act as one. The Linux bridge implementation dates back to the 2.2 kernel era (late 1990s). The old tool was `brctl` (bridge control); the modern equivalent is `ip link add type bridge`.
# Create a bridge
ip link add br-tenant type bridge
ip link set br-tenant up
# Give the bridge an IP (this becomes the gateway for tenants)
ip addr add 10.0.1.1/24 dev br-tenant
Now connect namespaces to it. Let's set up two tenants this time:
# Clean up the earlier point-to-point setup
ip link del veth-host 2>/dev/null
# Create namespace and veth pairs for tenant1 and tenant2
for i in 1 2; do
ip netns add tenant${i} 2>/dev/null
ip link add veth-br-t${i} type veth peer name veth-t${i}
ip link set veth-t${i} netns tenant${i}
# Attach host end to the bridge
ip link set veth-br-t${i} master br-tenant
ip link set veth-br-t${i} up
# Configure inside the namespace
ip netns exec tenant${i} ip addr add 10.0.1.$((i+1))/24 dev veth-t${i}
ip netns exec tenant${i} ip link set veth-t${i} up
ip netns exec tenant${i} ip link set lo up
# Set the bridge as the default gateway
ip netns exec tenant${i} ip route add default via 10.0.1.1
done
Let's break down what just happened:
| Command | What it does |
|---|---|
| `ip link add veth-br-t1 type veth peer name veth-t1` | Create a veth pair |
| `ip link set veth-t1 netns tenant1` | Move one end into the namespace |
| `ip link set veth-br-t1 master br-tenant` | Attach the other end to the bridge |
| `ip netns exec tenant1 ip route add default via 10.0.1.1` | Route traffic through the bridge |
Test connectivity:
# Tenant1 → Tenant2 (through the bridge)
ip netns exec tenant1 ping -c 2 10.0.1.3
# Tenant2 → Host (through the bridge)
ip netns exec tenant2 ping -c 2 10.0.1.1
Both tenants can reach each other and the host through the bridge. The bridge does MAC learning — it knows which MAC is behind which port, just like a physical switch.
Trivia: The `docker0` bridge that Docker creates automatically is exactly this — a Linux bridge. When you run `docker run`, Docker creates a veth pair, moves one end into the container's network namespace, and attaches the other to `docker0`. Every default Docker container network is built on the same primitives you just used.
### Giving tenants internet access
Right now, tenants can reach the host and each other, but not the outside world. For that, you need IP forwarding and NAT — the same thing your home router does:
# Enable IP forwarding
sysctl -w net.ipv4.ip_forward=1
# NAT outbound traffic from the bridge network
iptables -t nat -A POSTROUTING -s 10.0.1.0/24 ! -o br-tenant -j MASQUERADE
# Allow forwarding for established connections
iptables -A FORWARD -i br-tenant -j ACCEPT
iptables -A FORWARD -o br-tenant -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
Now tenants can reach the internet:
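A quick check from inside a tenant (8.8.8.8 is just a convenient public address for illustration — any reachable external IP works):

```shell
# Outbound traffic is now forwarded and SNATed through the host
ip netns exec tenant1 ping -c 2 8.8.8.8
```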
Under the Hood: That MASQUERADE rule is doing SNAT — rewriting the source IP of outbound packets from 10.0.1.x to whatever IP is on the outgoing interface. The kernel's conntrack module remembers the mapping so return packets are translated back. This is exactly what Docker does when you `docker run` without `-p`. With `-p 8080:80`, Docker adds a DNAT rule in the PREROUTING chain to forward incoming traffic on host port 8080 to the container's port 80.
## Part 4: How Docker Networking Actually Works
Now that you've built a bridge network from scratch, here's the punchline: Docker's
default bridge network is this exact setup, automated.
When you run `docker run -d --name web -p 8080:80 nginx`, Docker:

- Creates a network namespace for the container
- Creates a veth pair
- Moves one end (`eth0` inside the container) into the namespace
- Attaches the other end to the `docker0` bridge
- Assigns an IP from the bridge's subnet (e.g., 172.17.0.2/16)
- Adds an iptables MASQUERADE rule for outbound NAT
- Adds an iptables DNAT rule to forward host:8080 → container:80
- Adds DNS configuration pointing to Docker's embedded DNS server (127.0.0.11)
You can see all of this:
# See the docker0 bridge
ip link show docker0
bridge link show
# See the veth pairs
ip link show type veth
# See Docker's iptables rules
iptables -t nat -L -n -v | grep -A5 DOCKER
# See the container's namespace
pid=$(docker inspect --format '{{.State.Pid}}' web)
nsenter -t $pid -n ip addr
nsenter -t $pid -n ip route
Gotcha: Docker's default bridge does not provide DNS resolution between containers by name. Only user-defined bridge networks (`docker network create`) get Docker's built-in DNS. This is why `docker-compose` always creates a custom network — so services can reach each other by name.
### The macvlan alternative
Sometimes you don't want NAT. You want the container to appear as a real host on the
physical network. Docker's macvlan driver does this:
docker network create -d macvlan \
--subnet=10.100.0.0/24 \
--gateway=10.100.0.1 \
-o parent=eth0 \
direct_net
docker run --network direct_net --ip 10.100.0.50 -d nginx
The container gets 10.100.0.50 directly on the physical network. No NAT, no bridge. The switch sees the container's MAC address as a separate host.
Gotcha: With macvlan, the container can reach everything on the network except the host itself. This is a known kernel limitation — the host's interface and its macvlan children can't communicate at Layer 2. You need a separate macvlan interface on the host or a different physical NIC for host-to-container traffic.
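The usual workaround is the one hinted at above: give the host its own macvlan child on the same parent NIC and send host-to-container traffic through it. A sketch, using this lesson's example names (`eth0`, the 10.100.0.0/24 subnet, the container at 10.100.0.50 — adjust for your network; 10.100.0.200 is an arbitrary unused address):

```shell
# Create a macvlan interface for the host itself on the same parent NIC
ip link add macvlan-host link eth0 type macvlan mode bridge
ip addr add 10.100.0.200/32 dev macvlan-host
ip link set macvlan-host up

# Host route: reach the container's macvlan IP via the host's own
# macvlan child instead of eth0 (which the kernel would block at L2)
ip route add 10.100.0.50/32 dev macvlan-host
```

With this in place, `ping 10.100.0.50` from the host goes out `macvlan-host`, and the macvlan bridge mode delivers it between siblings.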
## Flashcard Check #2
| Question | Answer |
|---|---|
| What does `ip link set veth-br master br0` do? | Attaches the veth interface to bridge br0 (like plugging a cable into a switch port) |
| What iptables chain does Docker use for port forwarding (`-p`)? | PREROUTING, with a DNAT rule |
| Why doesn't Docker's default bridge support container name DNS? | Only user-defined networks get Docker's embedded DNS server |
| What does MASQUERADE do in the POSTROUTING chain? | Rewrites the source IP to the outgoing interface's IP (dynamic SNAT) |
| Why can't a macvlan container reach its host? | Kernel limitation: a physical interface and its macvlan children can't communicate at L2 |
## Part 5: VLANs — Segmenting the Wire
So far, all our tenants share the same Layer 2 domain. Tenant1 can see Tenant2's broadcast traffic. For real isolation, you need VLANs — separate broadcast domains on the same physical wire.
Name Origin: VLAN = Virtual Local Area Network. Standardized as IEEE 802.1Q in 1998, VLANs were invented because moving a user between departments used to require physically re-cabling their switch port. The "virtual" means the segmentation is logical, not physical.
Trivia: The 802.1Q tag is only 4 bytes — inserted between the source MAC and the EtherType field. Those 4 bytes contain a 12-bit VLAN ID, giving you 4,094 usable VLANs (0 and 4095 are reserved). That seemed enormous in 1998. It became a hard constraint that drove the invention of VXLAN (24-bit ID, 16 million segments) for cloud-scale multi-tenancy.
### Creating VLAN interfaces on Linux
First, load the kernel module:
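The module is `8021q`, and it's worth verifying it actually loaded:

```shell
# Load the 802.1Q VLAN tagging module
modprobe 8021q

# Verify it's present
lsmod | grep 8021q
```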
Gotcha: If the `8021q` module isn't loaded, Linux will happily create the VLAN interface and show it as UP, but no tagged frames will be sent or received. Everything looks fine, nothing works. Always verify the module is loaded.
Now create VLAN interfaces on a physical NIC:
# Create VLAN 100 on eth0
ip link add link eth0 name eth0.100 type vlan id 100
ip addr add 10.100.0.5/24 dev eth0.100
ip link set eth0.100 up
# Create VLAN 200 on eth0
ip link add link eth0 name eth0.200 type vlan id 200
ip addr add 10.200.0.5/24 dev eth0.200
ip link set eth0.200 up
# Verify — look for "vlan protocol 802.1Q id 100"
ip -d link show eth0.100
The switch port connected to eth0 must be a trunk carrying VLANs 100 and 200. If it's an access port, tagged frames are silently dropped.
### VLAN-aware bridges for tenant isolation
Here's where it gets powerful. You can create a separate bridge per VLAN, giving each tenant true Layer 2 isolation:
# Bridge for VLAN 100 tenants
ip link add br-vlan100 type bridge
ip link set br-vlan100 up
ip link set eth0.100 master br-vlan100
# Bridge for VLAN 200 tenants
ip link add br-vlan200 type bridge
ip link set br-vlan200 up
ip link set eth0.200 master br-vlan200
# Connect tenant3 to VLAN 100
ip netns add tenant3
ip link add veth-br-t3 type veth peer name veth-t3
ip link set veth-t3 netns tenant3
ip link set veth-br-t3 master br-vlan100
ip link set veth-br-t3 up
ip netns exec tenant3 ip addr add 10.100.0.10/24 dev veth-t3
ip netns exec tenant3 ip link set veth-t3 up
# Connect tenant4 to VLAN 200
ip netns add tenant4
ip link add veth-br-t4 type veth peer name veth-t4
ip link set veth-t4 netns tenant4
ip link set veth-br-t4 master br-vlan200
ip link set veth-br-t4 up
ip netns exec tenant4 ip addr add 10.200.0.10/24 dev veth-t4
ip netns exec tenant4 ip link set veth-t4 up
Now tenant3 is on VLAN 100 and tenant4 is on VLAN 200. They're completely isolated at Layer 2 — tenant3's broadcasts never reach tenant4, and vice versa. Exactly like being on different physical switches.
# tenant3 can reach other VLAN 100 hosts
ip netns exec tenant3 ping -c 2 10.100.0.5
# tenant4 can reach other VLAN 200 hosts
ip netns exec tenant4 ping -c 2 10.200.0.5
# tenant3 CANNOT reach tenant4 (different L2 domain)
ip netns exec tenant3 ping -c 2 10.200.0.10 # fails — no route, different VLAN
Mental Model: A bridge-per-VLAN is like having multiple physical switches inside your server. Each bridge is a switch, each VLAN interface is an uplink to the physical network, and each veth pair is a patch cable to a namespace. The namespaces are the servers.
## Part 6: Bonding — Two NICs, One Fate
A single NIC is a single point of failure. Bonding combines multiple physical interfaces into one logical interface for redundancy and aggregate bandwidth.
### Bonding modes at a glance
| Mode | Name | What it does | Switch config? |
|---|---|---|---|
| 0 | balance-rr | Round-robin packets across links | Yes (static LAG) |
| 1 | active-backup | One link active, others standby | No |
| 2 | balance-xor | Hash-based distribution | Yes (static LAG) |
| 3 | broadcast | Send on all links | Yes |
| 4 | 802.3ad (LACP) | Dynamic aggregation with negotiation | Yes (LACP) |
| 5 | balance-tlb | Adaptive transmit load balance | No |
| 6 | balance-alb | Adaptive TX+RX load balance | No |
Remember: "1 for simple, 4 for fast." Mode 1 (active-backup) is the safe default — no switch coordination needed, instant failover. Mode 4 (LACP) is production standard when you want both bandwidth and redundancy, but requires switch configuration.
### Setting up mode 4 (LACP)
# Create the bond
ip link add bond0 type bond mode 802.3ad
# Set fast LACP rate (1-second PDU interval, 3-second failure detection)
ip link set bond0 type bond lacp_rate fast
# Set hash policy for good traffic distribution
ip link set bond0 type bond xmit_hash_policy layer3+4
# Enable link monitoring (100ms polling)
ip link set bond0 type bond miimon 100
# Add member interfaces
ip link set eth0 down
ip link set eth1 down
ip link set eth0 master bond0
ip link set eth1 master bond0
# Bring everything up
ip link set bond0 up
ip addr add 10.0.0.1/24 dev bond0
Let's break down the key options:
| Option | Value | Why |
|---|---|---|
| `mode 802.3ad` | LACP | Dynamic negotiation, detects one-sided failures |
| `lacp_rate fast` | 1-second PDUs | Failure detection in 3 seconds (vs 90 seconds on slow) |
| `xmit_hash_policy layer3+4` | Hash on IP+port | Distributes flows across links |
| `miimon 100` | Poll every 100ms | Detects physical link failure |
Gotcha: The default `lacp_rate` is `slow` — PDUs every 30 seconds, failure detection at 90 seconds. That's a minute and a half of sending traffic into a dead link. Always set `lacp_rate fast` in production.
### Verifying the bond
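The bond's live state is exposed through procfs:

```shell
# Full bond status: mode, hash policy, and per-member LACP details
cat /proc/net/bonding/bond0
```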
Look for:
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer3+4 (1)
LACP rate: fast
MII Status: up
Slave Interface: eth0
MII Status: up
Aggregator ID: 1
Partner Mac Address: aa:bb:cc:dd:ee:ff # <-- switch's MAC
Slave Interface: eth1
MII Status: up
Aggregator ID: 1 # <-- same ID = correctly bundled
Partner Mac Address: aa:bb:cc:dd:ee:ff
Debug Clue: If `Partner Mac Address` shows `00:00:00:00:00:00`, the switch isn't sending LACP PDUs. Either the switch port isn't configured for LACP, the switch is in passive mode (and so is your host), or there's a physical layer issue. If the two members show different Aggregator IDs, they're not actually bundled — check for speed/duplex mismatches.
### Setting up mode 1 (active-backup)
When you don't control the switch or just need simple failover:
ip link add bond0 type bond mode active-backup
ip link set bond0 type bond miimon 100
ip link set bond0 type bond primary eth0
ip link set eth0 master bond0
ip link set eth1 master bond0
ip link set bond0 up
No switch configuration needed. eth0 handles all traffic. If eth0 goes down, eth1 takes
over immediately. When eth0 recovers, it becomes active again (because of primary eth0).
### War Story: The Bonding Mode That Split the Brain
A team configured a 2x10G bond on their database servers using mode 4 (LACP). Everything worked great — until a firmware update on the switch silently changed the port-channel configuration from LACP to static. The Linux side kept sending LACP PDUs. The switch ignored them. The bond stayed "up" because physical links were fine, but the switch now treated each port independently. Inbound traffic arrived on both ports with different MAC forwarding, causing duplicate packets and MAC flapping across the switch fabric. The database saw intermittent connection resets. It took three days to diagnose because monitoring only checked "bond0 is up" — nobody was checking whether the LACP partner was actually responding.
The fix: monitor `/proc/net/bonding/bond0` for `Partner Mac Address: 00:00:00:00:00:00` and alert on it. Also: always use `lacp_rate fast` so you detect switch-side misconfigurations in seconds, not minutes.
## Part 7: VLANs on a Bond — The Full Stack
In production, you don't put VLANs on a bare NIC. You put them on a bond. The layering looks like this:
┌─────────────┐
│ br-vlan100 │ ← bridge (switch for VLAN 100)
└──────┬──────┘
│
┌──────┴──────┐
│ bond0.100 │ ← VLAN sub-interface
└──────┬──────┘
│
┌──────┴──────┐
│ bond0 │ ← bond (2x10G LACP)
└──┬──────┬──┘
│ │
┌──┴──┐┌──┴──┐
│eth0 ││eth1 │ ← physical NICs
└─────┘└─────┘
Build it:
# Assume bond0 already exists from the previous section
# Create VLAN interfaces on the bond
ip link add link bond0 name bond0.100 type vlan id 100
ip link add link bond0 name bond0.200 type vlan id 200
ip link set bond0.100 up
ip link set bond0.200 up
# Create bridges for each VLAN
ip link add br-vlan100 type bridge
ip link add br-vlan200 type bridge
ip link set br-vlan100 up
ip link set br-vlan200 up
# Attach VLAN interfaces to their bridges
ip link set bond0.100 master br-vlan100
ip link set bond0.200 master br-vlan200
# Give bridges IPs (optional — if this host routes between VLANs)
ip addr add 10.100.0.1/24 dev br-vlan100
ip addr add 10.200.0.1/24 dev br-vlan200
Now you can connect tenant namespaces to these bridges exactly like before. Each tenant lands on a VLAN with full Layer 2 isolation, carried over a redundant bonded link.
Gotcha: When you switch from bare NICs to a bond, delete the old VLAN interfaces first. An `eth0.100` and a `bond0.100` can coexist — one will work, the other will silently drop traffic, and you'll spend hours confused about why half your connections fail.
## Part 8: Other Virtual Interface Types
veth pairs and bridges aren't the only virtual interfaces. Here's the extended family:
### tap and tun
Name Origin: `tun` = tunnel (operates at Layer 3, IP packets). `tap` = network tap (operates at Layer 2, Ethernet frames). The names describe what level of the stack they expose to userspace.
# Create a tap device
ip tuntap add dev tap0 mode tap
ip link set tap0 up
# Create a tun device
ip tuntap add dev tun0 mode tun
ip link set tun0 up
tap/tun devices let userspace programs send and receive packets by reading/writing a file descriptor. This is how VPNs work — OpenVPN reads encrypted packets from the network, decrypts them, and writes cleartext packets into a tun device. The kernel routes them as if they arrived on a real interface.
| Device | Layer | Delivers to userspace | Used by |
|---|---|---|---|
| `tun` | L3 | Raw IP packets | OpenVPN, userspace WireGuard implementations |
| `tap` | L2 | Ethernet frames | QEMU/KVM VMs, OpenVPN (bridge mode) |
### macvlan and ipvlan
Both create virtual interfaces on a physical NIC. The key difference:
| Feature | macvlan | ipvlan |
|---|---|---|
| MAC address | Unique per child | Shared with parent |
| Switch sees | Multiple MACs per port | One MAC per port |
| Host-to-child L2 | Broken (kernel limitation) | Works |
| Use case | Containers as "real" hosts | Environments with MAC port-security limits |
# macvlan — each child gets its own MAC
ip link add macvlan0 link eth0 type macvlan mode bridge
# ipvlan — all children share parent's MAC
ip link add ipvlan0 link eth0 type ipvlan mode l2
## Part 9: Traffic Control (tc) — One-Minute Overview
The tc command controls how the kernel queues outbound packets. Two things worth
knowing:
# Limit outbound bandwidth to 10 Mbit
tc qdisc add dev veth-br-t1 root tbf rate 10mbit burst 32kbit latency 400ms
# Simulate 100ms latency and 1% packet loss (chaos engineering)
# (an alternative root qdisc — delete the tbf one first, or this add fails)
tc qdisc add dev veth-br-t1 root netem delay 100ms loss 1%
# Remove whatever root qdisc is installed
tc qdisc del dev veth-br-t1 root
Interview Bridge: "How would you test whether your application handles network latency gracefully?" Use `tc netem`. This is what chaos engineering tools (Pumba, Chaos Mesh) use under the hood.
## Part 10: Open vSwitch (OVS) — When Linux Bridges Aren't Enough
When you need thousands of virtual ports, OpenFlow programming, or VXLAN tunnel endpoints, you reach for Open vSwitch:
# Create a switch and add ports
ovs-vsctl add-br ovs-br0
ovs-vsctl add-port ovs-br0 eth0
ovs-vsctl add-port ovs-br0 veth-br-t1
# Add a VXLAN tunnel to another host
ovs-vsctl add-port ovs-br0 vxlan0 -- \
set Interface vxlan0 type=vxlan options:remote_ip=10.0.0.2
ovs-vsctl show
OVS is the networking backbone of OpenStack and several Kubernetes CNI plugins (Antrea, OVN-Kubernetes).
Trivia: OVS was developed at Nicira (founded by Martin Casado, who also invented OpenFlow as part of his PhD at Stanford). VMware acquired Nicira in 2012 for $1.26 billion. OVS remains open source.
## Part 11: Kubernetes CNI — Where This All Comes Together
Everything we've built in this lesson — namespaces, veth pairs, bridges, VLANs, OVS — is exactly what Kubernetes CNI plugins do. CNI (Container Network Interface) is a specification: the kubelet calls a CNI binary, passes it a namespace path, and says "set up networking for this pod."
Different CNI plugins use different strategies:
| CNI Plugin | Strategy | What it creates |
|---|---|---|
| Flannel (VXLAN) | Overlay | Bridge + veth pair + VXLAN tunnel per node |
| Calico (no overlay) | Routing | veth pair + BGP routes (no bridge) |
| Cilium | eBPF | veth pair, bypasses iptables entirely |
| Weave | Overlay | Bridge + veth + encrypted tunnel |
| Multus | Meta-CNI | Delegates to multiple CNIs per pod |
But they all start with the same two steps:
- Create a veth pair
- Move one end into the pod's network namespace
The differences are in step 3: how traffic gets from the veth's host end to other pods and the outside world.
Mental Model: Every Kubernetes CNI plugin is answering the same question: "I have a veth pair. The pod end has an IP. How does a packet from this pod reach a pod on another node?" Flannel says "wrap it in VXLAN." Calico says "route it with BGP." Cilium says "program eBPF to forward it." The primitives are always the same.
## Flashcard Check #3
| Question | Answer |
|---|---|
| What Linux bonding mode uses LACP for dynamic negotiation? | Mode 4 (802.3ad) |
| Why should you set `lacp_rate fast`? | The default slow rate takes 90 seconds to detect a dead link; fast detects in 3 seconds |
| What's the relationship between `bond0.100` and `br-vlan100`? | `bond0.100` is a VLAN sub-interface attached to bridge `br-vlan100` as an uplink |
| What does `tc netem delay 100ms` do? | Adds 100ms of simulated latency to outbound packets |
| How does the tun device differ from tap? | tun passes L3 (IP) packets to userspace; tap passes L2 (Ethernet) frames |
| What two steps do ALL Kubernetes CNI plugins share? | Create a veth pair, move one end into the pod namespace |
| How does Docker implement port forwarding (`-p`)? | DNAT rule in iptables PREROUTING chain |
## Exercises
### Exercise 1: Build a two-namespace bridge (5 minutes)
Create two namespaces (ns1 and ns2), a bridge, and connect them. Verify they can
ping each other.
Hint: Follow the pattern from Part 3: create bridge, create veth pairs, move ends into namespaces, attach host ends to bridge, assign IPs, bring everything up.

Solution:
ip link add br0 type bridge
ip link set br0 up
for i in 1 2; do
ip netns add ns${i}
ip link add veth-br${i} type veth peer name veth${i}
ip link set veth${i} netns ns${i}
ip link set veth-br${i} master br0
ip link set veth-br${i} up
ip netns exec ns${i} ip addr add 10.0.0.${i}/24 dev veth${i}
ip netns exec ns${i} ip link set veth${i} up
ip netns exec ns${i} ip link set lo up
done
ip netns exec ns1 ping -c 2 10.0.0.2
### Exercise 2: Isolate with VLANs (10 minutes)
Extend Exercise 1. Create two bridges, one per VLAN (100 and 200). Put ns1 on VLAN 100 and ns2 on VLAN 200. Verify they cannot ping each other.
Hint: You'll need VLAN sub-interfaces on a parent interface (or you can use separate bridges without VLAN uplinks for pure L2 isolation between namespaces).

### Exercise 3: Trace Docker's network setup (15 minutes)
Run docker run -d --name trace-me nginx. Then:
1. Find the container's PID
2. Find its veth pair on the host
3. Confirm the veth is attached to the docker0 bridge
4. List the iptables NAT rules Docker created
5. Enter the container's network namespace with nsenter and run ip route
Hint: Every command you need appears in Part 4: `docker inspect --format '{{.State.Pid}}'`, `ip link show type veth`, `bridge link show`, `iptables -t nat -L -n -v`, and `nsenter -t PID -n`.
### Exercise 4: Bond + VLAN (20 minutes, requires two NICs or VMs)
Set up a mode 1 bond with two interfaces, create a VLAN 100 sub-interface on the bond,
and verify connectivity. Check /proc/net/bonding/bond0 and pull a cable (or bring
down an interface) to test failover.
## Cheat Sheet
### Namespace operations
| Task | Command |
|---|---|
| Create namespace | ip netns add NAME |
| List namespaces | ip netns list |
| Run command in namespace | ip netns exec NAME COMMAND |
| Delete namespace | ip netns del NAME |
| Enter container's netns | nsenter -t PID -n COMMAND |
### veth and bridge operations
| Task | Command |
|---|---|
| Create veth pair | ip link add NAME type veth peer name PEER |
| Move interface to namespace | ip link set NAME netns NSNAME |
| Create bridge | ip link add NAME type bridge |
| Attach port to bridge | ip link set NAME master BRIDGE |
| Show bridge members | bridge link show |
| Show bridge MAC table | bridge fdb show br BRIDGE |
### VLAN operations
| Task | Command |
|---|---|
| Load 802.1Q module | modprobe 8021q |
| Create VLAN interface | ip link add link PARENT name PARENT.VID type vlan id VID |
| Show VLAN details | ip -d link show PARENT.VID |
| Capture tagged frames | tcpdump -eni PARENT 'vlan VID' |
### Bond operations
| Task | Command |
|---|---|
| Create bond | ip link add bond0 type bond mode 802.3ad |
| Set LACP rate | ip link set bond0 type bond lacp_rate fast |
| Set hash policy | ip link set bond0 type bond xmit_hash_policy layer3+4 |
| Add member | ip link set ethX master bond0 |
| Check status | cat /proc/net/bonding/bond0 |
| Set monitoring | ip link set bond0 type bond miimon 100 |
### Traffic control (tc)
| Task | Command |
|---|---|
| Limit bandwidth | tc qdisc add dev DEV root tbf rate 10mbit burst 32kbit latency 400ms |
| Simulate latency | tc qdisc add dev DEV root netem delay 100ms |
| Remove qdisc | tc qdisc del dev DEV root |
## Takeaways
- Network namespaces are the foundation of container networking. Every container gets its own network stack through `CLONE_NEWNET`. No namespace, no isolation.
- veth pairs are the standard way to get packets across namespace boundaries. They're virtual cables. One end in the namespace, one end on the host (or bridge). Every container runtime uses them.
- Docker's bridge network is just namespace + veth + bridge + iptables NAT. Once you understand the primitives, Docker networking stops being magic and starts being predictable.
- LACP (mode 4) is the production standard for NIC bonding. Always set `lacp_rate fast` and `miimon 100`. Monitor the partner MAC address — if it's all zeros, your bond is not actually bonded.
- VLANs are Layer 2 isolation on a single wire. The 802.1Q tag is 4 bytes. Load the `8021q` module. Make sure the switch port is a trunk. The 4,094 VLAN limit is why clouds use VXLAN.
- Every Kubernetes CNI plugin starts with the same two steps: create a veth pair, move one end into the pod's namespace. The difference is what happens after that.
## Related Lessons
- What Happens When You Click a Link — follows a packet end-to-end through DNS, TCP, TLS
- iptables: Following a Packet Through the Chains — deep dive into the netfilter framework Docker uses for NAT
- The Hanging Deploy — processes, namespaces, and cgroups from the PID perspective
- Kubernetes Services: How Traffic Finds Your Pod — what happens after CNI sets up the namespace
- The Subnet Calculator in Your Head — IP addressing fundamentals for the VLAN subnets in this lesson