BGP: How the Internet Routes Your Packets
Topics: BGP, routing, autonomous systems, internet architecture, datacenter networking, Kubernetes networking, security
Level: L1–L2 (Foundations → Operations)
Time: 60–90 minutes
Prerequisites: None (everything is explained from scratch)
The Mission¶
It's Monday morning. Your monitoring lights up: customers in Europe can't reach your app. US traffic is fine. Nothing changed on your side — no deploys, no DNS updates, no infra changes. Your hosting provider says everything is green.
Thirty minutes later you find it: someone else's network is announcing your IP prefix to the internet. Traffic destined for your servers is being swallowed by a network in Eastern Europe. Your IPs have been hijacked.
This is not theoretical. It has happened to YouTube, Amazon, Google, and hundreds of smaller companies. It happens because BGP — the protocol that routes every packet on the internet — was built on trust, not verification.
By the end of this lesson you'll understand:
- What BGP actually does and why it matters even if you never configure a router
- How a single misconfiguration can redirect traffic for millions of users
- The route selection algorithm that determines where your packets go
- How modern datacenters use BGP inside the building (not just between ISPs)
- How Kubernetes uses BGP to route traffic to your pods
- What RPKI and ROA are, and why they're the future of internet routing security
Part 1: What Is BGP, Actually?¶
Every device on the internet needs to answer one question for every packet: where do I send this next?
Your laptop answers simply: "send everything to my default gateway." Your home router answers almost as simply: "send everything to my ISP." But your ISP? Your ISP has to know how to reach every publicly routable IP address on the internet. That's over a million distinct routes.
BGP — Border Gateway Protocol — is how those million routes get distributed. Every major network on the internet runs BGP to tell its neighbors: "I can reach these IP addresses. Send me traffic for them."
Name Origin: BGP = Border Gateway Protocol. The "border" refers to the boundary between networks. A "gateway" is old-school for "router." BGP runs at the borders between autonomous systems, the organizational boundaries of the internet. The current version, BGP-4, is defined in RFC 4271 (2006), but the protocol dates back to RFC 1105 (1989). Two engineers, Kirk Lougheed and Yakov Rekhter, sketched the first version on two napkins at an IETF meeting. Yes, the protocol that routes the entire internet was designed on napkins.
Trivia: Those "two napkins" are part of internet folklore. Lougheed and Rekhter needed a protocol to replace EGP (Exterior Gateway Protocol), which couldn't handle the growing internet's topology. Their napkin design became RFC 1105 within months. The napkin originals have not survived, but the story is referenced in multiple IETF histories and by Rekhter himself in conference talks.
The Autonomous System: The Internet's Organizational Unit¶
The internet isn't one network — it's roughly 75,000 independently operated networks, each identified by an Autonomous System Number (ASN).
Name Origin: AS = Autonomous System. "Autonomous" because each AS sets its own internal routing policy — nobody outside tells it how to route within its borders. ASNs are assigned by Regional Internet Registries (RIRs): ARIN (North America), RIPE NCC (Europe/Middle East), APNIC (Asia-Pacific), AFRINIC (Africa), and LACNIC (Latin America).
Examples you'll recognize:
| ASN | Operator | What they carry |
|---|---|---|
| AS15169 | Google | Google services, YouTube, Cloud |
| AS16509 | Amazon | AWS, amazon.com |
| AS13335 | Cloudflare | CDN, DNS (1.1.1.1) |
| AS32934 | Meta | Facebook, Instagram, WhatsApp |
| AS7018 | AT&T | Consumer and enterprise ISP |
Every public IP address on the internet belongs to an AS. When you browse a website, your packets traverse multiple autonomous systems to get there, and each AS uses BGP to decide the next hop.
Path Vector: BGP's Core Idea¶
BGP is a path vector protocol. When a network announces a route, it includes the full list of autonomous systems the announcement has traversed. This AS path serves two purposes:
- Loop prevention — if a router sees its own AS in the path, it rejects the route
- Path selection — shorter AS paths are generally preferred
# What a BGP route looks like (simplified)
Prefix: 203.0.113.0/24
AS Path: AS3356 → AS16509
Next Hop: 198.51.100.1
Origin: IGP
# Translation: "To reach 203.0.113.0/24, send traffic to 198.51.100.1.
# It will cross AS3356 (Lumen/Level3) then AS16509 (Amazon)."
Mental Model: Think of BGP like a postal system between countries. Each country (autonomous system) doesn't need to know the internal streets of every other country — it just needs to know which neighboring country to hand the mail to. BGP is the agreement between countries about which mail to accept and where to forward it.
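The two jobs of the AS path can be sketched in a few lines of Python. This is a minimal illustration, not how any real BGP implementation is structured; the `Route` class and the `accept`/`export` helpers are hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str
    as_path: tuple[int, ...]  # leftmost ASN is the most recent hop


def accept(route: Route, local_asn: int) -> bool:
    """Loop prevention: reject any route whose AS path already contains our ASN."""
    return local_asn not in route.as_path


def export(route: Route, local_asn: int) -> Route:
    """When advertising to an eBGP neighbor, prepend our own ASN to the path."""
    return Route(route.prefix, (local_asn,) + route.as_path)


origin = Route("203.0.113.0/24", (16509,))   # as announced by AS16509
via_3356 = export(origin, 3356)              # AS3356 passes it on

print(via_3356.as_path)                      # (3356, 16509)
print(accept(via_3356, local_asn=64500))     # True: 64500 is not in the path
print(accept(via_3356, local_asn=16509))     # False: the origin sees itself
```

The path only ever grows on the left as the announcement moves away from the origin, which is why the rightmost ASN identifies who originated the prefix.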
Part 2: iBGP vs eBGP — Two Flavors, Same Protocol¶
BGP comes in two modes that look the same on the wire but behave differently:
eBGP (external BGP) — sessions between routers in different autonomous systems. This is the BGP of the internet. When AT&T peers with Google at an internet exchange point, that's eBGP.
iBGP (internal BGP) — sessions between routers in the same autonomous system. Used to distribute externally-learned routes to all routers within an organization.
| Property | eBGP | iBGP |
|---|---|---|
| Peers in | Different ASes | Same AS |
| TTL default | 1 (directly connected) | 255 (can cross internal hops) |
| AS path modified? | Prepends own ASN | Does not modify AS path |
| Next-hop behavior | Sets next-hop to self | Preserves original next-hop |
| Full mesh required? | No | Yes (or use route reflectors) |
The iBGP full-mesh requirement is the gotcha that catches everyone. If you have 5 iBGP routers, you need 10 sessions (n(n-1)/2). With 50 routers, that's 1,225 sessions. This is why route reflectors exist: a route reflector is a designated router that receives routes from all iBGP peers and reflects them to everyone else, reducing the mesh to a hub-and-spoke.
Gotcha: iBGP does NOT modify the AS path when advertising routes to other iBGP peers. This means iBGP cannot use the AS path for loop prevention like eBGP does. Instead, iBGP uses two rules: (1) a route learned from an iBGP peer is never re-advertised to another iBGP peer (which is why you need full mesh or route reflectors), and (2) the ORIGINATOR_ID and CLUSTER_LIST attributes prevent loops in route-reflector topologies.
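To make the session arithmetic concrete, here is a small sketch comparing a full mesh with a route-reflector design. The function names are hypothetical, and the reflector model is an assumption: every client peers with every reflector, and the reflectors are fully meshed with each other.

```python
def full_mesh_sessions(routers: int) -> int:
    """Number of iBGP sessions when every router peers with every other: n(n-1)/2."""
    return routers * (routers - 1) // 2


def route_reflector_sessions(routers: int, reflectors: int = 2) -> int:
    """Sessions when clients peer only with reflectors (assumed model, see lead-in)."""
    clients = routers - reflectors
    return clients * reflectors + full_mesh_sessions(reflectors)


for n in (5, 50):
    print(n, full_mesh_sessions(n), route_reflector_sessions(n))
```

With 50 routers and two reflectors, this model needs 97 sessions instead of 1,225.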
Flashcard Check #1¶
Test yourself before moving on. Cover the answers.
| Question | Answer |
|---|---|
| What does BGP stand for? | Border Gateway Protocol |
| What is an autonomous system? | An independently operated network with its own routing policy, identified by an ASN |
| What type of routing protocol is BGP? | Path vector |
| What's the difference between eBGP and iBGP? | eBGP runs between different ASes; iBGP runs within the same AS |
| Why can't 50 routers all run iBGP without route reflectors? | Full mesh requires 1,225 sessions, which is impractical to configure and maintain |
| How does BGP prevent routing loops? | Each route carries the full AS path; a router rejects routes containing its own ASN |
Part 3: The BGP Route Selection Algorithm¶
When a BGP router receives multiple routes to the same prefix, it doesn't just pick the shortest path. It runs a decision tree with 13+ steps. Most traffic engineering happens in the first few.
Here's the simplified decision tree that covers 95% of real-world behavior:
Multiple routes to same prefix
│
┌─────────▼──────────┐
│ 1. Highest LOCAL │
│ PREFERENCE? │ ← Set by your network. "I prefer this path."
└────────┬───────────┘
│ tie
┌────────▼───────────┐
│ 2. Shortest │
│ AS_PATH? │ ← Fewer hops through fewer networks
└────────┬───────────┘
│ tie
┌────────▼───────────┐
│ 3. Lowest ORIGIN │
│ type? │ ← IGP < EGP < Incomplete
└────────┬───────────┘
│ tie
┌────────▼───────────┐
│ 4. Lowest MED │
│ (Multi-Exit │ ← "If you're sending me traffic,
│ Discriminator)? │ use THIS entrance."
└────────┬───────────┘
│ tie
┌────────▼───────────┐
│ 5. eBGP over iBGP? │ ← Prefer externally learned routes
└────────┬───────────┘
│ tie
┌────────▼───────────┐
│ 6. Lowest IGP │
│ metric to │ ← "Which exit is closest to ME?"
│ next hop? │ (hot potato routing)
└────────┬───────────┘
│ tie
┌────────▼───────────┐
│ 7. Oldest route │ ← Stability wins
│ 8. Lowest router ID │ ← Tiebreaker
└────────────────────┘
Remember: The mnemonic for the first five steps: "Lovers Prefer Short Meetings Externally." Local preference (highest) → AS path length (shortest) → Origin type (lowest) → MED (lowest) → eBGP over iBGP. This covers the attributes you can actually use for traffic engineering.
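The simplified decision tree can be expressed as a sort key, which makes the "first difference wins" behavior easy to see. This is a sketch under stated assumptions: the `Path` fields are hypothetical names, vendor-specific steps (such as Cisco's weight) are omitted, and MED is compared across all paths, whereas real routers by default compare MED only between paths from the same neighboring AS.

```python
from dataclasses import dataclass
from enum import IntEnum


class Origin(IntEnum):
    IGP = 0
    EGP = 1
    INCOMPLETE = 2


@dataclass(frozen=True)
class Path:
    name: str
    local_pref: int = 100
    as_path: tuple[int, ...] = ()
    origin: Origin = Origin.IGP
    med: int = 0
    ebgp: bool = True
    igp_metric: int = 0
    age_seconds: int = 0
    router_id: int = 0


def sort_key(p: Path):
    """Smaller tuple wins; each element mirrors one step of the simplified tree."""
    return (
        -p.local_pref,     # 1. highest LOCAL_PREF
        len(p.as_path),    # 2. shortest AS_PATH
        p.origin,          # 3. lowest origin type
        p.med,             # 4. lowest MED
        not p.ebgp,        # 5. eBGP (False sorts first) over iBGP
        p.igp_metric,      # 6. lowest IGP metric to next hop
        -p.age_seconds,    # 7. oldest route
        p.router_id,       # 8. lowest router ID
    )


def best_path(paths):
    return min(paths, key=sort_key)


short = Path("via-A", local_pref=100, as_path=(64501, 64500))
preferred = Path("via-B", local_pref=200, as_path=(64502, 64503, 64500))
print(best_path([short, preferred]).name)  # via-B: LOCAL_PREF beats the shorter path
```

Because the key is a tuple, a later step is only consulted when every earlier step ties, which is the same short-circuit behavior the tree describes.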
The Two Most Important Knobs¶
LOCAL_PREF (step 1) — set within your AS to say "prefer this path." Higher wins. Default is 100. If you set LOCAL_PREF 200 on routes from Provider A and leave Provider B at 100, all traffic in your network prefers Provider A. This is how you do primary/backup with two ISPs.
AS_PATH prepending (step 2): artificially lengthening the AS path to make a route look less attractive to your neighbors. If you announce your prefix through two ISPs but prepend your ASN three times on one, each ISP receives:
Via ISP-A: AS64500 # path length 1, preferred
Via ISP-B: AS64500 AS64500 AS64500 AS64500 # path length 4 (1 original + 3 prepended), backup only
Gotcha: AS path prepending is a blunt instrument. You're asking the entire internet to prefer one path over another, but you can only influence, not guarantee. A neighboring AS with a high LOCAL_PREF for the longer path will still use it — LOCAL_PREF beats AS path length. Prepending more than 3x rarely adds value and just wastes resources.
Part 4: Route Announcements, Withdrawals, and the Pakistan/YouTube Disaster¶
How Routes Propagate¶
When an AS wants the internet to reach its IP space, it announces the prefix to its BGP neighbors. Those neighbors evaluate the route, and if they accept it, they announce it to their neighbors (prepending their own ASN to the path). The announcement ripples outward across the internet.
When an AS wants to stop advertising a route, it sends a withdrawal. Withdrawals also propagate, but convergence (everyone agreeing the route is gone) can take minutes — far slower than announcements.
# Timeline of a route announcement
T=0s: AS64500 announces 203.0.113.0/24 to its neighbors
T=1s: Direct peers (AS3356, AS174) install the route
T=5s: Second-tier networks learn the route
T=30s: Most of the internet has converged
T=2min: Slow peers and route dampening catch up
The Longest Prefix Match: Why Hijacking Works¶
Here's the rule that makes BGP both powerful and dangerous: the most specific route always wins.
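If AS64500 announces 203.0.113.0/24 and AS99999 announces 203.0.113.0/25 (a more
specific prefix covering half the range), every router that accepts the /25 will send
traffic for 203.0.113.0–203.0.113.127 to AS99999, even if AS99999 has no legitimate claim
to those IPs. In practice most networks filter IPv4 announcements longer than /24, so real hijacks usually announce a /24 (or shorter) carved out of a larger block.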
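A longest-prefix-match lookup is short enough to write out. The sketch below uses Python's standard `ipaddress` module and a plain dictionary as the routing table; the function name and the table layout are assumptions for illustration, and a real router uses a trie or hardware lookup rather than a linear scan.

```python
import ipaddress


def longest_prefix_match(table, destination):
    """Return (prefix, next_hop) for the most specific matching route, or None."""
    addr = ipaddress.ip_address(destination)
    best = None
    for prefix, next_hop in table.items():
        network = ipaddress.ip_network(prefix)
        if addr in network and (best is None or network.prefixlen > best[0].prefixlen):
            best = (network, next_hop)
    return None if best is None else (str(best[0]), best[1])


table = {
    "203.0.113.0/24": "AS64500",   # legitimate announcement
    "203.0.113.0/25": "AS99999",   # more-specific announcement
}

print(longest_prefix_match(table, "203.0.113.10"))   # ('203.0.113.0/25', 'AS99999')
print(longest_prefix_match(table, "203.0.113.200"))  # ('203.0.113.0/24', 'AS64500')
```

Note that the lookup never asks who is entitled to the prefix. It only compares prefix lengths, which is the whole problem.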
This is exactly what happened to YouTube in 2008.
War Story: On February 24, 2008, the Pakistani government ordered ISPs to block YouTube. Pakistan Telecom (AS17557) created an internal route for part of YouTube's address space, `208.65.153.0/24`, pointing to a null route (blackhole). But an engineer made a mistake: instead of staying internal, the route leaked to PCCW (AS3491), a major Hong Kong-based transit provider. PCCW accepted the route and propagated it globally. YouTube itself was announcing the covering prefix `208.65.152.0/22`, so Pakistan Telecom's /24 was more specific and the hijacked route won everywhere. YouTube was unreachable worldwide for approximately two hours. The fix required PCCW to manually filter the bogus route, while YouTube (Google) announced the same /24 and then two /25s to compete. Source: RIPE NCC analysis, Renesys (now Dyn/Oracle) blog post, February 2008.
The YouTube hijack was accidental. But the same technique is used intentionally — for surveillance, cryptocurrency theft, and traffic interception.
Trivia: In April 2018, a BGP hijack redirected traffic destined for Amazon's Route 53 DNS service. The attackers announced more-specific routes for Route 53's IP space through an ISP in Ohio, redirected DNS queries for MyEtherWallet.com to a fake site, and stole approximately $150,000 in cryptocurrency. The hijack lasted about two hours. Source: Ars Technica, Oracle/Dyn Internet Intelligence reporting, April 2018.
Part 5: Defending Against Hijacks — RPKI, ROA, and Prefix Filtering¶
BGP's trust problem is real: any AS can announce any prefix, and neighbors believe it by default. Three defenses exist, layered on top of each other.
1. Prefix Filtering (the manual seatbelt)¶
Every BGP session should have filters that define exactly which prefixes a neighbor is allowed to send. If your customer is AS64500, you build a filter that allows only the prefixes they are registered to originate and denies everything else.
# Conceptual prefix filter (router config pseudocode)
ip prefix-list FROM-CUSTOMER permit 203.0.113.0/24
ip prefix-list FROM-CUSTOMER deny 0.0.0.0/0 le 32
route-map CUSTOMER-IN permit 10
match ip address prefix-list FROM-CUSTOMER
neighbor 10.0.0.2 route-map CUSTOMER-IN in
The problem: maintaining manual prefix lists for thousands of peers is operationally brutal. This is where RPKI comes in.
2. RPKI and ROA (the cryptographic fix)¶
RPKI (Resource Public Key Infrastructure) lets IP address holders cryptographically sign which ASes are authorized to announce their prefixes. The signed object is called a ROA (Route Origin Authorization).
# What a ROA says (conceptual)
Prefix: 203.0.113.0/24
Max prefix len: /24
Authorized AS: AS64500
Signed by: the prefix holder, with a certificate chained to ARIN (the RIR that allocated the IP space)
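When a router receives a BGP route, it checks: "Is there a ROA covering this prefix? Does the originating AS match, and is the announced prefix no longer than the ROA's max prefix length?" Three outcomes:
| ROA Status | Meaning | Action |
|---|---|---|
| Valid | A covering ROA exists; origin AS and prefix length both match | Accept |
| Invalid | A covering ROA exists, but the origin AS does NOT match or the prefix is longer than the max length | Drop (if enforcing) |
| Not Found | No ROA covers this prefix | Accept (for now) |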
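The three outcomes follow a simple procedure, standardized as route origin validation in RFC 6811. The sketch below is a minimal model of that procedure, assuming ROAs are already cryptographically verified; the `ROA` class and `validate` function are hypothetical names, not part of any validator's API.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class ROA:
    prefix: str
    max_length: int
    asn: int


def validate(route_prefix: str, origin_asn: int, roas) -> str:
    """Classify an announcement as Valid, Invalid, or Not Found."""
    route = ipaddress.ip_network(route_prefix)
    covered = False
    for roa in roas:
        roa_net = ipaddress.ip_network(roa.prefix)
        # A ROA "covers" the route if the route sits inside the ROA's prefix.
        if route.version != roa_net.version or not route.subnet_of(roa_net):
            continue
        covered = True
        if roa.asn == origin_asn and route.prefixlen <= roa.max_length:
            return "Valid"
    return "Invalid" if covered else "Not Found"


roas = [ROA("203.0.113.0/24", max_length=24, asn=64500)]
print(validate("203.0.113.0/24", 64500, roas))   # Valid
print(validate("203.0.113.0/24", 99999, roas))   # Invalid: wrong origin AS
print(validate("203.0.113.0/25", 64500, roas))   # Invalid: longer than max length
print(validate("198.51.100.0/24", 64500, roas))  # Not Found: no covering ROA
```

The max-length check is what lets a ROA protect against more-specific announcements, not just against the wrong origin AS.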
Under the Hood: RPKI validation doesn't happen on the router itself. Routers run a lightweight RPKI-to-Router (RTR) protocol client that talks to an RPKI validator (like Routinator, Fort, or rpki-client). The validator fetches signed ROA objects from the five RIR trust anchors, validates the cryptographic chain, and feeds the results to the router. The router marks each received route as Valid, Invalid, or Not Found.
As of 2025, roughly 50% of IPv4 prefixes and 60% of IPv6 prefixes have ROAs registered. Adoption is growing but not universal. The "Not Found" category remains large, which is why most operators accept Not Found routes — dropping them would break too much of the internet.
3. BGP Communities: Signaling Without New Protocols¶
BGP communities are tags attached to routes that carry policy information between ASes.
They're 32-bit values, typically written as two 16-bit numbers: ASN:value.
# Common well-known communities (RFC 1997, RFC 7999)
65535:65281 NO_EXPORT: don't advertise outside this AS
65535:65282 NO_ADVERTISE: don't advertise to any peer
65535:65283 NO_EXPORT_SUBCONFED: don't export outside the local confederation sub-AS
65535:666 BLACKHOLE: ask the neighbor to drop traffic for this prefix
# Operator-defined communities (illustrative values for a provider with AS64500; every provider publishes its own)
64500:100 learned from customer
64500:123 learned at a specific IX
64500:666 blackhole this prefix (DDoS mitigation)
The blackhole community is the one you'll use in an emergency. If your IP space is under DDoS attack, you announce the targeted prefix with the blackhole community to your upstream provider. They install a null route — traffic for that prefix is dropped at their edge, keeping the rest of your network alive.
Interview Bridge: "How would you mitigate a DDoS attack at the network level?" The answer involves RTBH (Remote Triggered Black Hole) using BGP communities. You sacrifice the attacked prefix to save everything else.
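Since a standard community is a single 32-bit value written as two 16-bit halves, packing and unpacking is plain bit arithmetic. The helper names below are hypothetical, and `64500:666` is an illustrative value built from a documentation ASN rather than any provider's real blackhole community.

```python
def encode_community(asn: int, value: int) -> int:
    """Pack ASN:value into the 32-bit integer carried in the BGP attribute."""
    if not (0 <= asn <= 0xFFFF and 0 <= value <= 0xFFFF):
        raise ValueError("standard communities hold two 16-bit numbers")
    return (asn << 16) | value


def decode_community(community: int) -> tuple[int, int]:
    """Split the 32-bit integer back into its ASN and value halves."""
    return community >> 16, community & 0xFFFF


def format_community(community: int) -> str:
    asn, value = decode_community(community)
    return f"{asn}:{value}"


blackhole = encode_community(64500, 666)
print(hex(blackhole))               # 0xfbf4029a
print(format_community(blackhole))  # 64500:666
```

The 16-bit limit on the first half is why operators with 4-byte ASNs turn to large communities (RFC 8092), which this sketch does not cover.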
Flashcard Check #2¶
| Question | Answer |
|---|---|
| What is the first step in BGP route selection? | Highest LOCAL_PREFERENCE |
| What makes BGP hijacking possible? | Longest prefix match — a more-specific route always wins, regardless of who announces it |
| What is a ROA? | Route Origin Authorization — a cryptographic statement that an AS is authorized to announce a prefix |
| What are the three RPKI validation states? | Valid, Invalid, Not Found |
| What does the NO_EXPORT community do? | Tells neighbors not to advertise the route outside the AS |
| What happened in the Pakistan/YouTube incident? | Pakistan Telecom's /24 route for part of YouTube's /22 leaked globally via PCCW; the more-specific prefix won everywhere, blackholing YouTube for ~2 hours |
Part 6: BGP in the Datacenter — Not Just for ISPs Anymore¶
Here's where BGP gets interesting for DevOps engineers. Around 2012, hyperscalers like Facebook and Microsoft started doing something that would have seemed insane to 1990s network engineers: running BGP inside their datacenters, on every switch, replacing traditional protocols entirely.
The Clos Topology and eBGP Everywhere¶
Modern datacenters use a Clos topology (also called spine-leaf or fat-tree). Every leaf switch connects to every spine switch. No spanning tree. No VLANs stretching across racks. Pure Layer 3 routing.
Spine1 (AS65000) Spine2 (AS65000)
/ | \ / | \
Leaf1 Leaf2 Leaf3 Leaf1 Leaf2 Leaf3
AS65001 AS65002 AS65003
| | |
Rack1 Rack2 Rack3
Each leaf gets its own ASN. Each spine gets a shared ASN. Every link runs eBGP. The result:
- No spanning tree — every link is active, ECMP distributes traffic across all spines
- Sub-second failover — if a spine dies, BGP withdraws routes in milliseconds with BFD
- Uniform latency: any server reaches any server in another rack over the same path length (leaf→spine→leaf)
- Simple troubleshooting: `show bgp summary` tells you the state of the whole fabric
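ECMP keeps every link busy by hashing each flow onto one of the equal-cost next hops, so packets of one flow stay in order on one path while different flows spread out. The sketch below illustrates the idea only: real switches hash in hardware with vendor-specific functions, and `pick_spine` with SHA-256 is a stand-in chosen for this example.

```python
import hashlib


def pick_spine(flow, spines):
    """Map a flow's 5-tuple onto one of the equal-cost spines, deterministically."""
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(spines)
    return spines[index]


spines = ["spine1", "spine2"]
# (source IP, destination IP, protocol, source port, destination port)
flow = ("10.0.1.5", "10.0.2.9", 6, 40000, 443)

print(pick_spine(flow, spines))                 # same answer every time for this flow
print(pick_spine(flow, ["spine2"]))             # spine1 withdrawn: only spine2 remains
```

When a spine fails and BGP withdraws its routes, the leaf simply hashes over the remaining next hops.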
Trivia: Facebook published their datacenter fabric design around 2014, proving you could build a massive datacenter network with BGP as the sole routing protocol and zero Spanning Tree. This design — documented in RFC 7938 (August 2016, "Use of BGP for Routing in Large-Scale Data Centers") — became the template for modern leaf-spine architectures industry-wide. BGP was never designed for this, which makes its success inside the datacenter one of networking's great plot twists.
Why Not OSPF Inside the Datacenter?¶
OSPF works for small-to-medium datacenter fabrics, but BGP won the hyperscale race for specific reasons:
| Concern | OSPF | BGP |
|---|---|---|
| Policy control | Limited — metric-based | Rich — LOCAL_PREF, communities, AS path |
| Scaling | Area design gets complex at 500+ nodes | Flat eBGP scales linearly |
| Convergence | Fast, but flooding can be disruptive | Controlled with BFD + timers |
| Multi-tenancy | No native support | VRF + EVPN address families |
| Operational model | Requires area architecture planning | Simple: each leaf is its own AS, spines share one |
Mental Model: Think of OSPF as a group chat — every router shares everything with everyone in the area. BGP is more like bilateral agreements — each pair of neighbors negotiates exactly what to share. In a datacenter with thousands of switches, the bilateral model gives you finer control.
Part 7: BGP and Kubernetes — Calico, MetalLB, and Pod Routing¶
If you run Kubernetes on bare metal or in a datacenter (not managed cloud), there's a good chance BGP is routing traffic to your pods right now.
Calico: BGP for Pod Networking¶
Calico is a Kubernetes CNI plugin that uses BGP to distribute pod routes between nodes. Instead of encapsulating pod traffic in VXLAN tunnels (like Flannel), Calico can advertise each node's pod CIDR via BGP, making pod IPs routable on the physical network.
# What Calico does on each node:
# 1. Assigns pod IPs from the node's CIDR (e.g., 10.244.1.0/24)
# 2. Creates routes for local pods
# 3. Runs BIRD (a BGP daemon) to advertise pod routes to other nodes
# On Node 1 (pods in 10.244.1.0/24):
$ ip route show | grep cali
10.244.1.5 dev cali1234abcd scope link # Pod A
10.244.1.9 dev cali5678efgh scope link # Pod B
10.244.2.0/24 via 192.168.1.102 dev eth0 # Node 2's pods — learned via BGP
# Check Calico's BGP peering status:
$ calicoctl node status
Calico process is running.
IPv4 BGP status
+---------------+-------------------+-------+----------+-------------+
| PEER ADDRESS  | PEER TYPE         | STATE | SINCE    | INFO        |
+---------------+-------------------+-------+----------+-------------+
| 192.168.1.102 | node-to-node mesh | up    | 09:15:32 | Established |
| 192.168.1.103 | node-to-node mesh | up    | 09:15:35 | Established |
+---------------+-------------------+-------+----------+-------------+
Under the Hood: Calico runs BIRD on each node (the name is a recursive acronym: BIRD Internet Routing Daemon). BIRD maintains BGP sessions with other nodes and advertises the node's pod address blocks, plus individual pod /32 routes where a pod's IP falls outside those blocks. When a pod is created, Calico programs a route for it on the node, and any route the peers do not already have is advertised within seconds. When the addresses are released, the route is withdrawn. It's the same announce/withdraw cycle that runs the internet, now running inside your cluster.
In larger clusters, Calico supports BGP route reflectors to avoid the full-mesh problem (same concept as ISP route reflectors). You designate 2-3 nodes as reflectors, and all other nodes peer only with them.
MetalLB: LoadBalancer Services via BGP¶
In cloud Kubernetes, type: LoadBalancer services get an external IP automatically from
the cloud provider. On bare metal, there's no cloud API to call. MetalLB fills this gap
using BGP.
MetalLB runs a speaker pod on each node. When a LoadBalancer service is created, MetalLB assigns an IP from a configured pool and announces it via BGP to the upstream network. The datacenter's ToR (top-of-rack) switches see the BGP route and start forwarding traffic to the node.
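# MetalLB BGP configuration (simplified)
apiVersion: metallb.io/v1beta2
kind: BGPPeer
metadata:
  name: tor-switch
  namespace: metallb-system
spec:
  myASN: 64512
  peerASN: 65000
  peerAddress: 10.0.0.1 # ToR switch
---
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: production
  namespace: metallb-system
spec:
  addresses:
    - 203.0.113.100-203.0.113.200 # IPs to assign to LoadBalancer services
---
# Without a BGPAdvertisement, MetalLB assigns IPs from the pool but does not announce them.
# The resource name below is an illustrative choice.
apiVersion: metallb.io/v1beta1
kind: BGPAdvertisement
metadata:
  name: production-bgp
  namespace: metallb-system
spec:
  ipAddressPools:
    - production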
When you create a service, MetalLB announces the assigned IP to the ToR switch via BGP. The ToR installs it in its routing table. Traffic from the internet arrives at your cluster without any cloud provider magic — just BGP.
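The assign-then-announce flow can be modeled in a few lines. This is a toy model, not MetalLB's code: `ServicePool` and its methods are hypothetical, allocation is simply "first free address", and the /32 announcement reflects MetalLB's default of advertising each service IP as a host route (an assumption worth checking against your version's aggregation settings).

```python
import ipaddress


class ServicePool:
    """Toy allocator for a 'start-end' address range like the pool above."""

    def __init__(self, range_spec: str):
        start_text, end_text = range_spec.split("-")
        start = int(ipaddress.ip_address(start_text.strip()))
        end = int(ipaddress.ip_address(end_text.strip()))
        self.free = [ipaddress.ip_address(n) for n in range(start, end + 1)]
        self.assigned = {}

    def assign(self, service: str):
        """Give the service the first free IP; repeat calls return the same IP."""
        if service not in self.assigned:
            if not self.free:
                raise RuntimeError("address pool exhausted")
            self.assigned[service] = self.free.pop(0)
        return self.assigned[service]

    def announcement(self, service: str) -> str:
        """The host route a BGP speaker would advertise for this service."""
        return f"{self.assigned[service]}/32"


pool = ServicePool("203.0.113.100-203.0.113.200")
pool.assign("web")
print(pool.announcement("web"))  # 203.0.113.100/32
```

The range in the example holds 101 addresses, so the 102nd LoadBalancer service would stay pending until an address is freed.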
Interview Bridge: "How do you expose Kubernetes services on bare metal without a cloud load balancer?" MetalLB in BGP mode is the standard answer. Know that it announces routes to upstream routers, and know the difference between its L2 mode (ARP-based, single-node bottleneck) and BGP mode (true ECMP across nodes).
Part 8: Route Leaks, Graceful Restart, and Other Things That Break¶
Route Leaks¶
A route leak is when an AS advertises routes it learned from one peer to another peer in violation of the expected routing policy. Unlike a hijack (announcing someone else's prefix), a leak announces legitimate routes through the wrong path.
The classic leak: a customer AS learns full internet routes from its transit provider, then accidentally re-advertises all of them to another provider. Suddenly, traffic for half the internet is trying to flow through a small customer network that can't handle it.
War Story: On June 24, 2019, a BGP optimizer appliance at DQE Communications (AS33154), a small Pennsylvania ISP, generated more-specific versions of routes it had learned. DQE's customer Allegheny Technologies (AS396531) then leaked over 20,000 of those routes to its other transit provider, Verizon (AS701), which accepted them without filtering and propagated them onward. The result was major routing disruption for Cloudflare, Amazon, and others for about two hours. The root cause was the optimizer's routes being re-advertised without proper filtering at any point along the chain. Source: Cloudflare blog, "How Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline Today," June 24, 2019.
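The "expected routing policy" that a leak violates is usually the customer/peer/provider export convention (often called valley-free routing, a term this lesson does not otherwise use). A minimal sketch of that rule follows; the `Rel` enum and function names are hypothetical, and real policies have many exceptions.

```python
from enum import Enum


class Rel(Enum):
    CUSTOMER = "customer"
    PEER = "peer"
    PROVIDER = "provider"


def may_export(learned_from: Rel, export_to: Rel) -> bool:
    """Customer routes go to everyone; peer and provider routes go only to customers."""
    if learned_from is Rel.CUSTOMER:
        return True
    return export_to is Rel.CUSTOMER


def is_leak(learned_from: Rel, export_to: Rel) -> bool:
    return not may_export(learned_from, export_to)


# The classic leak: routes learned from one provider re-advertised to another.
print(is_leak(Rel.PROVIDER, Rel.PROVIDER))  # True
# Normal transit: a customer's routes announced upstream.
print(is_leak(Rel.CUSTOMER, Rel.PROVIDER))  # False
```

Routes a network originates itself behave like customer routes in this model: they may be announced to anyone.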
Graceful Restart¶
When a BGP speaker restarts (software upgrade, crash, daemon restart), its peers normally treat all routes from it as withdrawn — causing a brief traffic disruption. BGP Graceful Restart (RFC 4724) lets the peer keep forwarding traffic on the old routes while the restarting speaker re-establishes its sessions and re-advertises its routes.
# Timeline without graceful restart:
T=0s: Router restarts
T=0s: Peers withdraw all routes from this router
T=30s: Router finishes booting, re-establishes BGP
T=60s: Routes re-converge
→ 60 seconds of disruption
# Timeline with graceful restart:
T=0s: Router restarts
T=0s: Peers mark routes as "stale" but KEEP FORWARDING
T=30s: Router finishes booting, re-advertises routes
T=31s: Peers clear stale flag, continue forwarding
→ Near-zero disruption
Gotcha: Graceful restart has a timer (default varies by vendor, often 120 seconds). If the restarting router doesn't come back before the timer expires, all stale routes are withdrawn. If your router takes longer than the timer to boot, you get the worst of both worlds: stale routes forwarding into a black hole for the duration, followed by a mass withdrawal. Size the restart timer to your actual boot time + margin.
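The timer trade-off is easy to reason about with a tiny model. The function below is a hypothetical helper, not vendor behavior: it assumes peers forward on stale routes until either the router returns or the restart timer expires, whichever comes first.

```python
def restart_outcome(boot_seconds: int, restart_timer_seconds: int) -> dict:
    """Summarize what peers do while a router with graceful restart reboots."""
    if boot_seconds <= restart_timer_seconds:
        # Router came back in time: stale routes are refreshed, nothing is withdrawn.
        return {"stale_forwarding_seconds": boot_seconds, "mass_withdrawal": False}
    # Timer expired first: peers forwarded on stale routes, then withdrew them all.
    return {"stale_forwarding_seconds": restart_timer_seconds, "mass_withdrawal": True}


print(restart_outcome(boot_seconds=30, restart_timer_seconds=120))
print(restart_outcome(boot_seconds=180, restart_timer_seconds=120))
```

The second call is the "worst of both worlds" case: two minutes of forwarding on stale routes, followed by the withdrawal that graceful restart was meant to avoid.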
The "Two Napkins" Problem, 37 Years Later¶
BGP was designed when the internet was a few hundred networks run by people who knew each other. The trust model — "I'll believe what my neighbors tell me" — was adequate. Today, BGP connects 75,000+ autonomous systems operated by organizations with wildly different incentives and competence levels. The protocol has no built-in authentication of route announcements. RPKI is the bolt-on fix, but adoption is still partial.
This is the internet's dirty secret: the system that routes all of humanity's digital communication is still, at its core, a trust-based system running on a protocol sketched on napkins.
Flashcard Check #3¶
| Question | Answer |
|---|---|
| What CNI plugin uses BGP to distribute pod routes? | Calico (runs BIRD BGP daemon on each node) |
| What does MetalLB do in BGP mode? | Announces LoadBalancer service IPs to upstream routers via BGP, providing external IPs on bare metal |
| What is a route leak? | When an AS re-advertises routes to a peer in violation of expected policy (not a hijack — the routes are legitimate, just misadvertised) |
| What does BGP Graceful Restart do? | Lets peers keep forwarding on stale routes while the restarting router re-establishes BGP sessions |
| What topology do modern datacenters use with BGP? | Clos / spine-leaf — each leaf has its own ASN, eBGP on every link, no spanning tree |
| Why did hyperscalers choose BGP over OSPF for datacenter routing? | Richer policy control, linear scaling, no area planning, native multi-tenancy support |
Exercises¶
Exercise 1: Read a BGP Looking Glass (5 minutes)¶
BGP looking glasses are public tools that let you see how the internet routes traffic. No router access needed.
- Go to https://lg.he.net/ (Hurricane Electric's looking glass)
- Enter your company's public IP (or any IP you know, like `8.8.8.8`)
- Look at the BGP route:
  - What is the origin AS?
  - How long is the AS path?
  - Are there multiple paths? (indicates the destination is multihomed)
What to look for
For `8.8.8.8`, you should see AS15169 (Google) as the origin. The AS path length depends on which looking glass location you query from: shorter paths from well-connected IXPs, longer paths from smaller networks. If you see multiple paths, Google is reachable through several transit providers from the looking glass's perspective.
Exercise 2: Check RPKI Validation Status (5 minutes)¶
- Go to https://rpki-validator.ripe.net/
- Enter any prefix (e.g., `1.1.1.0/24`, Cloudflare)
- Check: Is it Valid, Invalid, or Not Found?
- Try a few more prefixes: your employer's, your ISP's, a random university
What you'll find
Major providers (Cloudflare, Google, Amazon, Microsoft) will show Valid: they've published ROAs. Smaller organizations often show Not Found: they haven't set up RPKI yet. If you find an Invalid result, that's interesting. It means someone is announcing a prefix that contradicts the ROA, which could be a hijack or a misconfiguration.
Exercise 3: Trace a BGP Hijack Scenario (15 minutes)¶
Walk through this scenario on paper:
- AS64500 legitimately owns `198.51.100.0/23` and announces it to the internet
- AS99999 (malicious) announces `198.51.100.0/24`, a more-specific prefix
Questions:
1. Which announcement wins for traffic to 198.51.100.1? Why?
2. What about traffic to 198.51.101.1?
3. How could AS64500 defend against this before it happens? (two methods)
4. How could AS64500 respond after the hijack is detected? (two methods)
Answers
1. AS99999's /24 wins for `198.51.100.1`: longest prefix match. The /24 is more specific than the /23.
2. AS64500's /23 still wins for `198.51.101.1`: the /24 only covers .100.x, not .101.x.
3. Prevention: (a) Register ROAs via RPKI, so validators can flag AS99999's announcement as Invalid. (b) Ask transit providers to filter and accept only AS64500 for this prefix.
4. Response: (a) Announce their own /24s (`198.51.100.0/24` + `198.51.101.0/24`) to compete at the same specificity. (b) Contact upstream providers to filter the bogus route. (c) Alert via NANOG/mailing lists to get community pressure on AS99999's upstream.
Exercise 4: Inspect Calico BGP on a Kubernetes Cluster (15 minutes)¶
If you have a Kubernetes cluster running Calico:
# 1. Check BGP peering status
calicoctl node status
# 2. Look at the routes Calico has programmed
ip route show | grep -E 'cali|bird|bgp'
# 3. Find a pod IP and trace its route
kubectl get pod -o wide # pick a pod on another node
ip route get <pod-ip> # how does the kernel route to it?
# 4. Watch what happens when you delete a pod
kubectl delete pod <name>
# In another terminal, watch routes change:
ip monitor route
What you'll see
`calicoctl node status` shows BGP peers (other nodes) and their state. `ip route show` will have per-pod /32 routes for local pods (via `cali*` interfaces) and per-node aggregate routes for remote pods (via the node's physical interface). When you delete a pod, its local /32 route disappears within seconds. Remote nodes usually keep the aggregate route for the node's block, so BIRD only sends a BGP withdrawal if that pod's /32 was being advertised individually or the whole block is released.
Cheat Sheet¶
| Concept | Key Detail |
|---|---|
| BGP type | Path vector protocol (RFC 4271) |
| AS number range | 1–64495 (public 2-byte), 64512–65534 (private 2-byte), 131072–4199999999 (public 4-byte), 4200000000–4294967294 (private 4-byte) |
| eBGP vs iBGP | eBGP = between ASes, iBGP = within an AS |
| Route selection order | LOCAL_PREF → AS path length → Origin → MED → eBGP > iBGP → IGP metric → age → router ID |
| Longest prefix match | /25 beats /24 beats /23 — always. This enables hijacking. |
| RPKI/ROA | Cryptographic proof of which AS can announce a prefix. States: Valid, Invalid, Not Found |
| BGP communities | 32-bit tags for signaling policy: NO_EXPORT, blackhole, location tagging |
| Clos/spine-leaf | Modern DC topology — eBGP on every link, each leaf its own ASN, no STP |
| Calico | K8s CNI using BGP (BIRD daemon) to distribute pod routes between nodes |
| MetalLB | Bare-metal LoadBalancer — announces service IPs via BGP to ToR switches |
| Graceful restart | Peers keep stale routes during speaker restart (RFC 4724) |
| BGP port | TCP 179 |
| Convergence time | Announcements: seconds to minutes. Withdrawals: slower. Graceful restart: near-zero |
| AS path prepending | Artificially lengthen path to steer traffic away from a link |
Key commands (when you have access to a BGP router):
show bgp summary # Peer status and route counts
show bgp ipv4 unicast # Full BGP table
show bgp <prefix> # Specific route with all attributes
show bgp neighbors <ip> advertised-routes # What you're sending
show bgp neighbors <ip> received-routes # What they're sending you
show bgp community <community> # Routes with a specific tag
Key commands (Linux host with Calico):
calicoctl node status # BGP peer state
ip route show | grep cali # Pod routes
ip route get <pod-ip> # Trace kernel routing decision
calicoctl get bgpPeer -o yaml # Calico BGP peering config
calicoctl get ipPool -o yaml # IP pools allocated to pods
Takeaways¶
- BGP is the internet's routing protocol, and it works on trust: any AS can announce any prefix. RPKI is the fix, but adoption is still partial.
- Longest prefix match is the most important routing concept. It's why routing works, why CIDR works, and why BGP hijacks work. A /25 always beats a /24.
- The route selection algorithm is a decision tree, not a single metric. LOCAL_PREF and AS path length are the two knobs you actually use for traffic engineering.
- BGP is not just for ISPs anymore. Modern datacenters run eBGP on every switch (Clos/spine-leaf). Your Kubernetes cluster might be running BGP right now (Calico, MetalLB).
- The Pakistan/YouTube 2008 hijack is the canonical example of what goes wrong when route announcements aren't verified: a single misconfiguration took YouTube offline worldwide for two hours.
- Know your defenses: prefix filtering (manual), RPKI/ROA (cryptographic), and BGP communities (signaling). Layers of protection because no single layer is complete.
Related Lessons¶
- What Happens When You Click a Link — follows a request through DNS, TCP, TLS, and routing
- Kubernetes Services: How Traffic Finds Your Pod — the Kubernetes networking model, kube-proxy, and service routing
- The Subnet Calculator in Your Head — CIDR, prefix lengths, and the math behind longest prefix match
- Why DNS Is Always the Problem — DNS resolution, TTLs, and the failures that look like routing issues
- IPTables: Following a Packet Through the Chains — packet filtering and NAT at the Linux host level