# Kubernetes: From Scratch to Production Upgrade

Tags: lesson, kubeadm, eks, etcd, certificates, control-plane-architecture, cni, rbac, cluster-upgrades, disaster-recovery

Topics: kubeadm, EKS, etcd, certificates, control plane architecture, CNI, RBAC, cluster upgrades, disaster recovery
Level: L1–L3 (Foundations through Advanced Ops)
Time: 90–120 minutes
Strategy: Build-up + incident-driven
## The Mission
You inherited a 12-node bare-metal Kubernetes cluster running 1.29.4. It powers an e-commerce platform handling 8,000 requests per second. The cluster was built with kubeadm 18 months ago, and nobody has touched the control plane since. Your job: upgrade it to 1.30 with zero downtime, defuse the certificate time bomb ticking in /etc/kubernetes/pki, and document the process so the next person doesn't start from scratch.
Along the way, you'll also set up a parallel EKS cluster for the company's new microservices, because your team is going multi-environment. By the end of this lesson, you'll understand how Kubernetes clusters are born, how they grow, and how they die when nobody maintains them.
Let's build a cluster from nothing, then upgrade one under pressure.
## Part 1: What kubeadm Actually Does

Before you upgrade, you need to understand what kubeadm built. Most people run `kubeadm init` once and never think about it again. That's how clusters die silently.
### The init sequence
sudo kubeadm init \
--control-plane-endpoint "k8s-api.internal.example.com:6443" \
--pod-network-cidr "10.244.0.0/16" \
--service-cidr "10.96.0.0/12" \
--upload-certs
That single command does an enormous amount of work:
| Step | What happens | Why it matters |
|---|---|---|
| 1 | Generates a Certificate Authority (CA) | Every component authenticates via TLS signed by this CA |
| 2 | Creates certificates for API server, etcd, kubelet, front-proxy | These expire in 1 year by default |
| 3 | Generates kubeconfig files | Admin, controller-manager, scheduler each get their own |
| 4 | Writes static pod manifests to /etc/kubernetes/manifests/ | kubelet watches this directory and runs them directly |
| 5 | Starts etcd, API server, controller-manager, scheduler as static pods | The control plane is running |
| 6 | Applies RBAC rules and the bootstrap token | Workers can join using the token |
| 7 | Marks the node as a control plane node (taints it) | No workloads land here by default |
Under the Hood: kubeadm doesn't use Deployments or DaemonSets for the control plane. It writes YAML manifests directly to `/etc/kubernetes/manifests/`, and the kubelet's static pod watcher picks them up. This means `kubectl get pods -n kube-system` shows pods like `kube-apiserver-cp-1` and `etcd-cp-1`, but they aren't managed by any controller. If you delete the manifest file, the pod disappears. If you edit the file, the kubelet restarts the pod with the new config. This is how kubeadm upgrades work: it rewrites the manifest files.
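For illustration, here is a skeletal manifest in the style kubeadm writes. This is heavily trimmed and the paths and flags are illustrative: the real kube-apiserver manifest carries dozens of flags, volume mounts, and probes.

```shell
# Illustrative only: a minimal static-pod-style manifest. The kubelet names
# the running pod <manifest-name>-<node-name>, e.g. kube-apiserver-cp-1.
cat > /tmp/kube-apiserver-sketch.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver
  namespace: kube-system
spec:
  hostNetwork: true
  containers:
  - name: kube-apiserver
    image: registry.k8s.io/kube-apiserver:v1.30.4
    command:
    - kube-apiserver
    - --etcd-servers=https://127.0.0.1:2379
EOF
```

Dropping a file like this into `/etc/kubernetes/manifests/` is all it takes for the kubelet to start the pod; there is no controller involved.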
### The join sequence
After init, you get a kubeadm join command with a token and a CA cert hash:
# Worker node join
sudo kubeadm join k8s-api.internal.example.com:6443 \
--token abcdef.0123456789abcdef \
--discovery-token-ca-cert-hash sha256:e3b0c44298fc1c149afbf4c8996fb924...
# Additional control plane node join
sudo kubeadm join k8s-api.internal.example.com:6443 \
--token abcdef.0123456789abcdef \
--discovery-token-ca-cert-hash sha256:e3b0c44298fc1c149afbf4c8996fb924... \
--control-plane --certificate-key <key-from-upload-certs>
The --control-plane flag is the difference between adding a worker and adding a
control plane node. With it, kubeadm copies the CA certs, generates node-specific certs,
and writes control plane static pod manifests on the new node.
Gotcha: The bootstrap token expires after 24 hours by default. If you need to add nodes later, generate a new one: `kubeadm token create --print-join-command`. This is the most common "why can't my new node join" question.
## Flashcard Check #1
| Question | Answer (cover this column) |
|---|---|
| What directory does kubeadm write control plane manifests to? | /etc/kubernetes/manifests/ — kubelet watches it as static pods |
| How does a worker join the cluster? | kubeadm join with a bootstrap token and CA cert hash |
| What's the default bootstrap token lifetime? | 24 hours |
| Why does kubeadm init need --control-plane-endpoint? | For HA — all nodes must reach the API through a single DNS name or load balancer |
## Part 2: The Four Control Plane Components
Your cluster has a brain, and it has four parts. When something goes wrong, you need to know which part is misfiring.
### kube-apiserver — The Front Door
Every interaction with the cluster goes through the API server. kubectl, the scheduler,
the controller-manager, kubelets, CI/CD pipelines — all of them talk to the API server and
nothing else. The API server is the only component that talks to etcd.
# Check if the API server is responding
kubectl cluster-info
# Kubernetes control plane is running at https://k8s-api.internal.example.com:6443
# Check API server pod health
kubectl get pods -n kube-system -l component=kube-apiserver
# View API server logs (useful when kubectl itself is flaky)
sudo crictl logs $(sudo crictl ps --name kube-apiserver -q)
Trivia: The Kubernetes API server is completely stateless. It stores nothing locally — every read comes from etcd, every write goes to etcd. You can run 1, 3, or 10 API server instances behind a load balancer, and they're all identical. This statelessness is what makes the API server horizontally scalable, and why etcd performance directly determines API server responsiveness.
### kube-controller-manager — The Reconciliation Engine
The controller-manager runs dozens of control loops that watch the cluster's actual state and push it toward the desired state. The ReplicaSet controller creates pods. The Node controller marks nodes as NotReady. The Endpoints controller populates service endpoints.
# Check controller-manager health
kubectl get pods -n kube-system -l component=kube-controller-manager
# Check leader election (only one controller-manager is active in HA setups;
# modern releases record the leader in a Lease object)
kubectl get lease kube-controller-manager -n kube-system -o yaml
Mental Model: Think of the controller-manager as the "desired state enforcement department." You say "I want 3 replicas." The ReplicaSet controller sees 2 running, creates 1. You drain a node. The node controller marks it NotReady. The endpoint controller removes its pods from Service endpoints. Every controller runs a simple loop: observe → compare → act. That's reconciliation. That's all of Kubernetes.
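The observe → compare → act loop can be sketched in a few lines of bash. This is a toy model, not real controller code: `desired` and `actual` stand in for API reads, and the "act" step is an echo instead of a pod create/delete.

```shell
# Toy reconciliation loop: converge actual replica count toward desired.
# In a real controller, "observe" is an API watch and "act" is an API write.
desired=3
actual=1
while [ "$actual" -ne "$desired" ]; do
  if [ "$actual" -lt "$desired" ]; then
    actual=$((actual + 1))          # act: create one pod
    echo "created pod ($actual/$desired)"
  else
    actual=$((actual - 1))          # act: delete one pod
    echo "deleted pod ($actual/$desired)"
  fi
done
echo "reconciled: $actual replicas"
```

The key property: the loop never assumes its last action succeeded. It re-observes every iteration, which is why controllers recover from partial failures for free.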
### kube-scheduler — The Matchmaker
The scheduler watches for pods with no nodeName (unscheduled) and assigns them to nodes.
It runs a two-phase algorithm: filter (which nodes CAN run this pod?) then score
(which of those nodes is BEST?).
# Check scheduler health
kubectl get pods -n kube-system -l component=kube-scheduler
# See why a pod isn't scheduled
kubectl describe pod <pending-pod> -n <namespace> | grep -A 10 Events
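A toy model of the two phases: node names and capacities below are made up, and the scoring rule (prefer the most free CPU after placement) is only one of many signals the real scheduler weighs.

```shell
# Toy two-phase scheduling: filter nodes with enough free CPU (millicores),
# then score the survivors. Real scheduler plugins consider memory, affinity,
# taints, topology spread, and more.
pod_request=500
nodes="node-a:300 node-b:900 node-c:1500"   # name:free-cpu (hypothetical)

best="" best_score=-1
for entry in $nodes; do
  node=${entry%%:*}
  free=${entry##*:}
  # Filter: can this node run the pod at all?
  [ "$free" -ge "$pod_request" ] || continue
  # Score: prefer the node that keeps the most CPU free after placement.
  score=$(( free - pod_request ))
  if [ "$score" -gt "$best_score" ]; then
    best_score=$score
    best=$node
  fi
done
echo "scheduled on: $best"
```

Here the filter phase eliminates node-a (300 < 500), and scoring picks node-c over node-b.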
### etcd — The Brain
etcd stores every Kubernetes object. We covered etcd in depth in the etcd lesson — here we focus on what you need to know for cluster operations.
# Quick health check (run on a control plane node)
export ETCDCTL_API=3
ETCD_CERTS="--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key"
etcdctl endpoint health --cluster $ETCD_CERTS
etcdctl endpoint status --write-out=table --cluster $ETCD_CERTS
etcdctl member list --write-out=table $ETCD_CERTS
Name Origin: etcd = `/etc` (Unix configuration directory) + "d" (distributed). Created by CoreOS in 2013, it uses the Raft consensus algorithm — chosen specifically because it's understandable at 3 AM. The original Raft paper's title was literally "In Search of an Understandable Consensus Algorithm."
### Stacked vs External etcd
kubeadm supports two topologies:
STACKED (default): EXTERNAL:
┌──────────────────┐ ┌──────────────────┐
│ Control Plane 1 │ │ Control Plane 1 │
│ ┌─────────────┐ │ │ (no etcd) │
│ │ API Server │ │ └──────────────────┘
│ │ etcd │ │ │
│ │ scheduler │ │ ┌─────┴─────────────┐
│ │ ctrl-mgr │ │ │ etcd cluster │
│ └─────────────┘ │ │ (3 dedicated │
└──────────────────┘ │ nodes) │
└───────────────────┘
Stacked: etcd runs on the same nodes as the control plane. Simpler to set up, fewer machines. But losing a control plane node also loses an etcd member.
External: etcd runs on its own dedicated nodes. Harder to set up, more machines, but etcd failures and control plane failures are decoupled. For enterprise bare-metal clusters, external etcd is the production-grade choice.
Remember: Quorum rule for etcd: you need (N/2) + 1 members alive. 3-member cluster tolerates 1 failure. 5-member cluster tolerates 2. Always odd numbers — a 4-member cluster has the same quorum as 5 (needs 3) but tolerates fewer failures.
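The quorum arithmetic from the Remember box can be computed directly; a quick sketch:

```shell
# Quorum math for an N-member etcd cluster:
#   quorum = floor(N/2) + 1, tolerated failures = N - quorum
quorum()    { echo $(( $1 / 2 + 1 )); }
tolerated() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 3 4 5; do
  echo "members=$n quorum=$(quorum $n) tolerates=$(tolerated $n)"
done
```

The loop makes the odd-number rule visible: a 4-member cluster needs a quorum of 3 yet still only tolerates 1 failure, so the fourth member buys you nothing but risk.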
## Part 3: The Upgrade — 1.29 to 1.30
This is the mission. Your cluster is on 1.29.4 and needs to reach 1.30.x. Kubernetes only supports upgrading one minor version at a time — no skipping.
### Pre-flight: The Checklist That Saves You
Before touching anything:
# 1. Read the release notes (non-negotiable)
# https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.30.md
# 2. Check current versions
kubectl get nodes -o wide
kubectl version   # (--short was removed in recent kubectl releases)
# 3. Back up etcd (THE MOST IMPORTANT STEP)
etcdctl snapshot save /backup/etcd-pre-upgrade-$(date +%Y%m%d-%H%M%S).db \
--endpoints=https://127.0.0.1:2379 $ETCD_CERTS
# Verify the backup
etcdctl snapshot status /backup/etcd-pre-upgrade-*.db --write-out=table
# 4. Back up PKI certificates
sudo cp -r /etc/kubernetes/pki /backup/pki-pre-upgrade-$(date +%Y%m%d)
# 5. Check PDBs — any that will block drains?
kubectl get pdb -A
# 6. Check version skew rules
# kubelet may trail the API server by up to 3 minor versions (never newer)
# kubectl must be within 1 minor version of the API server
# Upgrade order: control plane first, workers second
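The skew check in step 6 can be mechanized. A sketch, encoding the published policy for recent releases (kubelet never newer than the API server, and at most 3 minors older); many teams deliberately keep the gap to 1 during rolling upgrades:

```shell
# Validate kubelet-vs-API-server skew from two "vMAJOR.MINOR.PATCH" strings.
skew_ok() {
  local api_minor kubelet_minor gap
  api_minor=$(echo "$1" | cut -d. -f2)
  kubelet_minor=$(echo "$2" | cut -d. -f2)
  gap=$(( api_minor - kubelet_minor ))
  # kubelet must never lead, and may trail by at most 3 minors (k8s >= 1.28)
  [ "$gap" -ge 0 ] && [ "$gap" -le 3 ]
}

skew_ok v1.30.4 v1.29.4 && echo "ok: kubelet trails by one minor"
skew_ok v1.30.4 v1.31.0 || echo "violation: kubelet newer than API server"
```

Run it against `kubectl get nodes -o wide` output before and after each node upgrade.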
War Story: A team at a fintech company started a 1.28-to-1.29 upgrade on a Friday afternoon. They upgraded the control plane, then started draining workers. Mid-drain, the third worker's kubelet crashed during the upgrade — a known bug in their specific kernel version. They needed to roll back the control plane, but they had no etcd backup. The "rollback" became a 14-hour rebuild-from-scratch, restoring workloads from Helm releases and GitOps state. They lost all Secrets that weren't in their GitOps repo, including database credentials for three services. Monday morning was not fun. The root cause wasn't the kubelet bug — it was the missing etcd snapshot.
### Step 1: Upgrade the first control plane node
# On the first control plane node (cp-1):
# Update package repos
sudo apt-get update
# Check available kubeadm versions
apt-cache madison kubeadm | head -5
# Install the new kubeadm
sudo apt-get install -y --allow-change-held-packages kubeadm=1.30.4-1.1
# Verify kubeadm version
kubeadm version
# See what the upgrade will do (dry run)
sudo kubeadm upgrade plan
# Apply the upgrade to the control plane
sudo kubeadm upgrade apply v1.30.4
What `kubeadm upgrade apply` does:

- Validates the cluster is healthy
- Downloads new component images
- Upgrades the static pod manifests in /etc/kubernetes/manifests/
- Applies new RBAC rules and API migrations
- Upgrades the kube-proxy and CoreDNS addons

The kubelet restarts each control plane component as its manifest is updated.
# Now upgrade kubelet and kubectl on cp-1
sudo apt-get install -y --allow-change-held-packages \
kubelet=1.30.4-1.1 kubectl=1.30.4-1.1
sudo systemctl daemon-reload
sudo systemctl restart kubelet
# Verify
kubectl get nodes
# cp-1 should show v1.30.4
### Step 2: Upgrade additional control plane nodes
# On cp-2 and cp-3 (note: upgrade node, NOT apply)
sudo apt-get install -y --allow-change-held-packages kubeadm=1.30.4-1.1
sudo kubeadm upgrade node
# Then upgrade kubelet and kubectl
sudo apt-get install -y --allow-change-held-packages \
kubelet=1.30.4-1.1 kubectl=1.30.4-1.1
sudo systemctl daemon-reload
sudo systemctl restart kubelet
Gotcha: On the first control plane node, you run `kubeadm upgrade apply`. On subsequent control plane nodes, you run `kubeadm upgrade node`. Running `apply` on the second node won't break anything, but it'll redo work that was already done. `node` is faster and correct.
### Step 3: Upgrade workers — one at a time
This is where zero downtime lives or dies.
# From a machine with kubectl access:
# Cordon the worker (no new pods)
kubectl cordon worker-1
# Drain the worker (evict existing pods)
kubectl drain worker-1 \
--ignore-daemonsets \
--delete-emptydir-data \
--timeout=300s
# SSH to the worker:
sudo apt-get update
sudo apt-get install -y --allow-change-held-packages kubeadm=1.30.4-1.1
sudo kubeadm upgrade node
sudo apt-get install -y --allow-change-held-packages \
kubelet=1.30.4-1.1 kubectl=1.30.4-1.1
sudo systemctl daemon-reload
sudo systemctl restart kubelet
# Back on the kubectl machine:
kubectl uncordon worker-1
# Verify before moving to the next worker
kubectl get nodes -o wide
kubectl get pods -A | grep -v Running | grep -v Completed
Repeat for each worker. For 12 workers, this takes 2-4 hours depending on drain time and how fast your pods reschedule.
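Once you trust the per-worker cycle, it is worth wrapping in a function. A sketch, assuming passwordless SSH and a hypothetical `upgrade-node.sh` that wraps the apt/kubeadm/kubelet steps above (both assumptions — adapt to your environment):

```shell
# One worker per call; the && chain aborts at the first failed step so a
# broken node is never uncordoned blindly.
upgrade_worker() {
  local w=$1
  kubectl cordon "$w" &&
  kubectl drain "$w" --ignore-daemonsets --delete-emptydir-data --timeout=300s &&
  ssh "$w" "sudo ./upgrade-node.sh" &&
  kubectl uncordon "$w" &&
  # Gate: refuse to move on until the node reports Ready again.
  kubectl wait --for=condition=Ready "node/$w" --timeout=120s
}

# for w in worker-{1..12}; do upgrade_worker "$w" || break; done
```

The `|| break` in the commented driver loop is the important part: stop the rollout at the first unhealthy node instead of marching on.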
Remember: Version skew mnemonic: "Control plane leads, workers follow." Upgrade order: etcd, API server, controller-manager/scheduler, then workers. Kubelets may trail the API server (by up to 3 minor versions on recent releases) but must never be newer. And never skip minor versions when upgrading the control plane — 1.28 straight to 1.30 is not supported.
## Flashcard Check #2
| Question | Answer (cover this column) |
|---|---|
| What must you back up before a cluster upgrade? | etcd snapshot AND /etc/kubernetes/pki/ certificates |
| What command upgrades the first control plane node? | kubeadm upgrade apply v1.30.4 |
| What command upgrades additional control plane nodes? | kubeadm upgrade node |
| What's the worker upgrade sequence? | Cordon, drain, upgrade kubeadm/kubelet/kubectl, uncordon |
| Can kubelet 1.29 work with API server 1.30? | Yes — kubelet may trail the API server (up to 3 minors on recent releases) but must never be newer |
## Part 4: The Certificate Time Bomb
Your cluster is 18 months old, and kubeadm-issued certificates last one year. On paper, the certs generated at init should have expired 6 months ago — yet the cluster is still running. How?

kubeadm renews certificates automatically during `kubeadm upgrade`. Since you just upgraded to 1.30, the certs were renewed for another year. But a cluster that sits for over a year with no upgrade and no manual renewal dies silently the day its certs expire.
### Checking certificate expiration

sudo kubeadm certs check-expiration

CERTIFICATE                EXPIRES                  RESIDUAL TIME
admin.conf                 Mar 22, 2027 00:00 UTC   364d
apiserver                  Mar 22, 2027 00:00 UTC   364d
apiserver-etcd-client      Mar 22, 2027 00:00 UTC   364d
apiserver-kubelet-client   Mar 22, 2027 00:00 UTC   364d
controller-manager.conf    Mar 22, 2027 00:00 UTC   364d
etcd-healthcheck-client    Mar 22, 2027 00:00 UTC   364d
etcd-peer                  Mar 22, 2027 00:00 UTC   364d
etcd-server                Mar 22, 2027 00:00 UTC   364d
front-proxy-client         Mar 22, 2027 00:00 UTC   364d
scheduler.conf             Mar 22, 2027 00:00 UTC   364d

CERTIFICATE AUTHORITY      EXPIRES                  RESIDUAL TIME
ca                         Mar 19, 2036 00:00 UTC   9y
etcd-ca                    Mar 19, 2036 00:00 UTC   9y
front-proxy-ca             Mar 19, 2036 00:00 UTC   9y
Two critical things to notice:

1. Component certs expire in 1 year. This is the trap. If you don't upgrade or manually renew before then, the API server can't talk to etcd, kubelets can't talk to the API server, and your cluster goes dark.
2. CA certs expire in 10 years. These are the root certificates. When THEY expire, the only fix is a full cluster rebuild. Mark this date in a calendar.
### Manual certificate renewal
If you can't upgrade (maybe you're on the latest version), renew manually:
# Renew all certificates
sudo kubeadm certs renew all
# Restart the control plane components (they need to pick up new certs)
# On kubeadm clusters, restart kubelet which restarts static pods:
sudo systemctl restart kubelet
# Verify the new admin kubeconfig works
kubectl get nodes
Gotcha: After renewing certs, you must also update the kubeconfig files (`/etc/kubernetes/admin.conf`, etc.). `kubeadm certs renew all` does this automatically, but if you're renewing individual certs, you need to regenerate the corresponding kubeconfigs manually. If your `~/.kube/config` is a copy of `admin.conf`, recopy it.
### Automating cert monitoring
Set up a cron job or monitoring alert:
# Check cert expiry in days (add to your monitoring)
CERT_EXPIRY=$(openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -enddate | cut -d= -f2)
EXPIRY_EPOCH=$(date -d "$CERT_EXPIRY" +%s)
NOW_EPOCH=$(date +%s)
DAYS_LEFT=$(( (EXPIRY_EPOCH - NOW_EPOCH) / 86400 ))
echo "API server cert expires in $DAYS_LEFT days"
# Alert if < 30 days
if [ "$DAYS_LEFT" -lt 30 ]; then
  echo "WARNING: API server certificate expires in $DAYS_LEFT days" >&2
  exit 1   # nonzero exit so cron/monitoring can flag it
fi
## Part 5: etcd Operations for Cluster Lifecycle
etcd is covered in depth in the etcd lesson. Here's what you need specifically for cluster lifecycle operations.
### Backup before everything
This is non-negotiable. Before upgrades, before cert rotation, before adding or removing control plane nodes — back up etcd.
# The backup command you should have memorized
etcdctl snapshot save /backup/etcd-$(date +%Y%m%d-%H%M%S).db \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# Verify it's valid
etcdctl snapshot status /backup/etcd-*.db --write-out=table
### Restore — the nuclear option
If an upgrade goes sideways and you need to roll back to a known-good state:
# 1. Stop the control plane on ALL nodes
# Move manifests out of the static pod directory
sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
sudo mv /etc/kubernetes/manifests/etcd.yaml /tmp/
# Wait for pods to stop
sudo crictl ps | grep -E "etcd|apiserver"
# 2. Restore the snapshot on each etcd member
# On cp-1:
etcdctl snapshot restore /backup/etcd-20260323-090000.db \
--data-dir=/var/lib/etcd-restored \
--name=cp-1 \
--initial-cluster="cp-1=https://10.0.1.10:2380,cp-2=https://10.0.1.11:2380,cp-3=https://10.0.1.12:2380" \
--initial-advertise-peer-urls=https://10.0.1.10:2380
# 3. Update etcd manifest to use new data directory
# Edit /tmp/etcd.yaml: change --data-dir to /var/lib/etcd-restored
# 4. Restore manifests
sudo mv /tmp/etcd.yaml /etc/kubernetes/manifests/
sudo mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/
# 5. Verify
kubectl get nodes
etcdctl endpoint health --cluster $ETCD_CERTS
Gotcha: The restore must be done on EVERY etcd member, each with its own `--name` and `--initial-advertise-peer-urls`. Restoring on only one member creates a split-state cluster where members disagree on data. This is a common and catastrophic mistake.
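One way to avoid mismatched per-member flags is to generate every member's restore command from a single cluster definition. A sketch using this lesson's example names and IPs (it only prints the commands; run each on its own node):

```shell
# Emit per-member restore commands so each member gets its own --name and
# --initial-advertise-peer-urls, while --initial-cluster stays identical.
SNAPSHOT=/backup/etcd-20260323-090000.db
CLUSTER="cp-1=https://10.0.1.10:2380,cp-2=https://10.0.1.11:2380,cp-3=https://10.0.1.12:2380"

for member in ${CLUSTER//,/ }; do
  name=${member%%=*}    # e.g. cp-2
  peer=${member#*=}     # e.g. https://10.0.1.11:2380
  echo "# run on $name:"
  echo "etcdctl snapshot restore $SNAPSHOT \\"
  echo "  --data-dir=/var/lib/etcd-restored --name=$name \\"
  echo "  --initial-cluster=$CLUSTER \\"
  echo "  --initial-advertise-peer-urls=$peer"
done
```

Generating all three commands from one `CLUSTER` string makes it impossible for the members to disagree about the cluster topology.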
### Member management
Adding or replacing etcd members (relevant when scaling control plane nodes):
# List current members
etcdctl member list --write-out=table $ETCD_CERTS
# Remove a failed member
etcdctl member remove <MEMBER_ID> $ETCD_CERTS
# Add a replacement (always add before removing the next one)
etcdctl member add cp-4 --peer-urls=https://10.0.1.13:2380 $ETCD_CERTS
Gotcha: Never remove two etcd members from a 3-member cluster simultaneously. Quorum requires 2 of 3. Remove one, add its replacement, verify health, then proceed to the next. Violating this kills the cluster.
## Part 6: EKS — The Cloud Side
Your company also runs EKS for newer microservices. EKS abstracts away the control plane entirely — AWS manages the API server, etcd, controller-manager, and scheduler. You manage the worker nodes (or let AWS manage those too with managed node groups or Fargate).
### Creating an EKS cluster
# Install eksctl
curl -sLO "https://github.com/eksctl-io/eksctl/releases/latest/download/eksctl_Linux_amd64.tar.gz"
sudo tar xzf eksctl_Linux_amd64.tar.gz -C /usr/local/bin
# Create a cluster with managed node groups
eksctl create cluster \
--name prod-eks \
--region us-east-1 \
--version 1.30 \
--nodegroup-name workers \
--node-type m6i.xlarge \
--nodes 6 \
--nodes-min 3 \
--nodes-max 12 \
--managed \
--with-oidc \
--ssh-access \
--ssh-public-key my-key
This creates: VPC, subnets, security groups, IAM roles, the EKS control plane, a managed node group, and the aws-auth ConfigMap.
### Node group types
| Type | Who manages nodes | Scaling | Use case |
|---|---|---|---|
| Managed node groups | AWS manages AMI updates, draining | ASG + node group update | Default for most workloads |
| Self-managed | You manage everything | ASG only | Custom AMIs, GPU, specialized hardware |
| Fargate profiles | AWS manages everything (serverless) | Per-pod, no nodes visible | Batch jobs, burstable workloads |
Mental Model: EKS managed node groups are like a "kubeadm with a concierge." AWS handles the OS patching, AMI rotation, and node draining during upgrades. Self-managed node groups are like running your own kubeadm workers — full control, full responsibility. Fargate is a different paradigm entirely: no nodes, just pods. You pay per pod, and AWS handles everything underneath.
### EKS upgrades
EKS control plane upgrades are a single API call:
# Upgrade the control plane (AWS handles this — takes 20-40 minutes)
eksctl upgrade cluster --name prod-eks --version 1.30 --approve
# Check add-on compatibility BEFORE upgrading node groups
eksctl get addons --cluster prod-eks
# Common add-ons: vpc-cni, kube-proxy, coredns, ebs-csi-driver
# Update add-ons to compatible versions
eksctl update addon --name vpc-cni --cluster prod-eks --version latest
eksctl update addon --name coredns --cluster prod-eks --version latest
eksctl update addon --name kube-proxy --cluster prod-eks --version latest
# Upgrade node groups (rolling update — replaces nodes one by one)
eksctl upgrade nodegroup \
--name workers \
--cluster prod-eks \
--kubernetes-version 1.30
Gotcha: EKS add-on versions must be compatible with the cluster version. Upgrading the control plane without updating add-ons can leave you with a kube-proxy that doesn't match the API server, causing subtle networking issues. Always check add-on versions after a control plane upgrade.
## Flashcard Check #3
| Question | Answer (cover this column) |
|---|---|
| Who manages the EKS control plane? | AWS — you never SSH into it, never back up etcd |
| What's the difference between managed and self-managed node groups? | Managed: AWS handles AMI updates, draining during upgrades. Self-managed: you do everything |
| What must you update after an EKS control plane upgrade? | Add-ons (vpc-cni, coredns, kube-proxy) and node groups |
| Can you run etcdctl on EKS? | No — etcd is fully managed and inaccessible |
## Part 7: kubeadm vs EKS vs k3s vs RKE2
You'll encounter all of these. Here's when to use which.
| kubeadm | EKS | k3s | RKE2 | |
|---|---|---|---|---|
| Control plane | You manage everything | AWS manages | Built-in, single binary | Built-in, RKE-managed |
| etcd | You manage (backup, certs, monitor) | AWS manages (invisible) | SQLite default, etcd optional | etcd built-in |
| Networking | BYO CNI | vpc-cni default | Flannel default, swap to Cilium | Canal (Calico+Flannel) default |
| Certs | kubeadm manages, 1-year default | AWS manages | k3s auto-rotates | RKE2 auto-rotates |
| Upgrades | Manual, node-by-node | API call + rolling node update | Binary replacement | RKE2 CLI upgrade |
| Best for | Bare metal, full control, learning | AWS production, managed experience | Edge, IoT, homelab, CI | Air-gapped, FIPS, government |
| Operational burden | High | Low | Very low | Medium |
| Cost | Hardware only | $0.10/hr control plane + nodes | Hardware only | Hardware only |
Trivia: k3s gets its name from being "half of k8s": Kubernetes is a 10-letter word stylized as K8s, so something half its size would be a 5-letter word, stylized as K3s. Rancher Labs (now SUSE) designed it to run on hardware as small as a Raspberry Pi. Despite the lightweight reputation, k3s is a CNCF-certified conformant Kubernetes distribution.
## Part 8: CNI Selection for Bare Metal
On EKS, you get vpc-cni and don't think about it. On bare metal, CNI choice is one of the first and most consequential decisions you make.
| Calico | Cilium | Flannel | |
|---|---|---|---|
| Dataplane | iptables or eBPF | eBPF | VXLAN overlay |
| NetworkPolicy | Full support | Full + extended (L7, DNS) | None (pair with Calico) |
| Performance | Good | Best (bypasses iptables) | Adequate |
| BGP support | Native | Native (BGP control plane) | No |
| kube-proxy replacement | In eBPF mode | Yes | No |
| Observability | Basic | Hubble (flow logs, service map) | None |
| Encryption | WireGuard | WireGuard | No |
| Complexity | Medium | Medium-high | Low |
| Best for bare metal | Enterprise, needs BGP peering | Modern clusters, needs L7 policy | Dev/test clusters, simplicity |
### Recommendation for enterprise bare metal
Calico if you need BGP peering with your physical network fabric (common in datacenter environments with ToR switches). Calico routes pod traffic via BGP, which means your physical network sees pod IPs directly — no overlay, no encapsulation overhead.
Cilium if you want the fastest dataplane, deep observability via Hubble, and L7 network policies (e.g., "only allow GET /api/v1/public from frontend pods"). Cilium can also replace kube-proxy entirely, eliminating iptables for service routing.
Flannel if you want the simplest possible setup for a non-production cluster. It works, it's stable, but it has no NetworkPolicy support and limited observability.
# Install Calico (BGP mode for bare metal)
kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.27.0/manifests/calico.yaml
# Install Cilium (with Hubble and kube-proxy replacement)
cilium install --version 1.15.0 \
--set kubeProxyReplacement=true \
--set hubble.enabled=true \
--set hubble.relay.enabled=true
# Verify CNI is healthy
kubectl get pods -n kube-system -l k8s-app=calico-node # Calico
kubectl -n kube-system exec ds/cilium -- cilium status # Cilium
Under the Hood: Cilium assigns a numeric identity to each unique set of pod labels. Network policy enforcement happens against these identities, not IP addresses. When a pod restarts with a new IP, Cilium's identity-based rules remain valid. This is why Cilium handles large-scale pod churn better than iptables-based CNIs, where every IP change triggers a rule update.
## Part 9: Production Hardening
A running cluster is not a production-ready cluster. Here's the hardening checklist that separates "it works" from "it won't get us fired."
### RBAC — Lock It Down
# Audit: who has cluster-admin?
kubectl get clusterrolebindings -o json | \
jq -r '.items[] | select(.roleRef.name=="cluster-admin") | .subjects[]? | "\(.kind)/\(.name)"'
# Create a namespace-scoped deployer role
kubectl create role deployer \
--verb=get,list,create,update,patch \
--resource=deployments,services,configmaps \
-n production
# Bind it to a service account
kubectl create rolebinding ci-deployer-binding \
--role=deployer \
--serviceaccount=ci:ci-deployer \
-n production
# Test it
kubectl auth can-i create deployments -n production \
--as=system:serviceaccount:ci:ci-deployer
# yes
kubectl auth can-i delete nodes \
--as=system:serviceaccount:ci:ci-deployer
# no
Remember: RBAC mnemonic: 2R-2B — two Roles (Role, ClusterRole), two Bindings (RoleBinding, ClusterRoleBinding). The binding type determines scope, not the role type. A ClusterRole + RoleBinding = scoped to one namespace.
### Network Policies — Default Deny
# Start with deny-all in every production namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-all
namespace: production
spec:
podSelector: {}
policyTypes:
- Ingress
- Egress
Then explicitly allow what's needed:
# Allow frontend to reach backend on port 8080, plus DNS
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: backend-ingress
namespace: production
spec:
podSelector:
matchLabels:
app: backend
policyTypes:
- Ingress
ingress:
- from:
- podSelector:
matchLabels:
app: frontend
ports:
- protocol: TCP
port: 8080
Gotcha: When you add an egress NetworkPolicy, you MUST explicitly allow DNS (UDP/TCP port 53 to kube-system). Otherwise, service discovery silently breaks — pods can't resolve service names, and the failure looks like a networking problem, not a policy problem. This catches almost everyone the first time.
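A companion allow-DNS policy might look like the sketch below. The selectors are assumptions: kubeadm's CoreDNS pods carry the `k8s-app=kube-dns` label, and modern clusters auto-label namespaces with `kubernetes.io/metadata.name` — verify both against your cluster before applying.

```shell
# Write an allow-DNS egress policy to pair with the default-deny above.
# Label selectors are assumptions; match your actual CoreDNS labels.
cat > /tmp/allow-dns.yaml <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
EOF
# kubectl apply -f /tmp/allow-dns.yaml
```

Apply this in every namespace that carries a default-deny egress policy, before anyone files the inevitable "DNS is broken" ticket.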
### Pod Security Standards
Kubernetes 1.25+ uses Pod Security Admission (replacing the deprecated PodSecurityPolicy):
# Label namespace to enforce restricted standard
kubectl label namespace production \
pod-security.kubernetes.io/enforce=restricted \
pod-security.kubernetes.io/warn=restricted \
pod-security.kubernetes.io/audit=restricted
This blocks: privileged containers, host networking, host PID/IPC, root users, privilege escalation, and writable root filesystems.
### Audit Logging
Enable API server audit logs to see who did what:
# /etc/kubernetes/audit-policy.yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
- level: Metadata
resources:
- group: ""
resources: ["secrets", "configmaps"]
- level: RequestResponse
resources:
- group: ""
resources: ["pods"]
- group: "apps"
resources: ["deployments"]
- level: None
resources:
- group: ""
resources: ["events"]
Add to the API server manifest:
--audit-policy-file=/etc/kubernetes/audit-policy.yaml
--audit-log-path=/var/log/kubernetes/audit.log
--audit-log-maxage=30
--audit-log-maxbackup=10
### Encrypt Secrets at rest
By default, Kubernetes Secrets are stored as base64 in etcd — not encrypted. Anyone with etcd access can read them.
# /etc/kubernetes/encryption-config.yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
- resources:
- secrets
providers:
- aescbc:
keys:
- name: key1
secret: <base64-encoded-32-byte-key>
- identity: {}
Add --encryption-provider-config=/etc/kubernetes/encryption-config.yaml to the API
server flags.
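Generating the key itself is a one-liner; a sketch (store the result somewhere safer than your shell history):

```shell
# Generate the base64-encoded 32-byte random key the aescbc provider expects.
key=$(head -c 32 /dev/urandom | base64 | tr -d '\n')
echo "$key"
# Sanity check: the key must decode back to exactly 32 bytes.
echo "$key" | base64 -d | wc -c
```

Paste the value into the `secret:` field of the EncryptionConfiguration. Note that existing Secrets stay unencrypted until rewritten; a bulk `kubectl get secrets -A -o json | kubectl replace -f -` re-persists them through the new provider.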
## Part 10: Day-2 Operations
### Adding nodes
# Generate a new join token (existing tokens expire after 24h)
kubeadm token create --print-join-command
# On the new node:
sudo kubeadm join k8s-api.internal.example.com:6443 \
--token <new-token> \
--discovery-token-ca-cert-hash sha256:<hash>
# Verify
kubectl get nodes
kubectl label node worker-13 topology.kubernetes.io/zone=rack-3
### Removing nodes
# Cordon and drain
kubectl cordon worker-5
kubectl drain worker-5 --ignore-daemonsets --delete-emptydir-data --timeout=300s
# Delete the node object
kubectl delete node worker-5
# On the node itself: reset kubeadm state
sudo kubeadm reset
sudo rm -rf /etc/kubernetes/ /var/lib/kubelet/ /var/lib/etcd/
### Capacity planning
# Current resource usage at a glance
kubectl top nodes
# Allocated vs allocatable
kubectl describe nodes | grep -A 5 "Allocated resources"
# Find overcommitted nodes
kubectl describe nodes | grep -E "Name:|cpu.*%|memory.*%"
# Pods per node (default limit: 110)
kubectl get pods -A -o wide --no-headers | awk '{print $8}' | sort | uniq -c | sort -rn
Mental Model: Capacity planning is about three thresholds: (1) where you are now, (2) where you'll be at peak, and (3) where things break. If peak is 80% of break, you don't have enough headroom. Target 60-70% utilization at peak to absorb surprises.
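Those three thresholds turn into a trivial check. A sketch with made-up numbers (all three values in the same unit, e.g. total cores in use):

```shell
# Headroom check: is projected peak too close to the breaking point?
current=60       # where you are now
peak=85          # projected peak load
break_point=100  # where things fall over

peak_pct=$(( peak * 100 / break_point ))
echo "peak utilization: ${peak_pct}% of capacity"
if [ "$peak_pct" -gt 70 ]; then
  echo "insufficient headroom: add capacity before peak season"
fi
```

With these numbers, peak sits at 85% of the breaking point, well past the 60-70% target, so the check flags it.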
## Part 11: Disaster Recovery
### The disaster recovery hierarchy
| Scenario | Recovery method | Time to recover |
|---|---|---|
| Single worker node failure | Pods reschedule automatically | 5-10 minutes |
| Single control plane node failure | Cluster operates on remaining nodes | Immediate (if HA) |
| etcd member failure | Replace member, cluster self-heals | 15-30 minutes |
| etcd quorum loss | Restore from snapshot | 30-60 minutes |
| Total cluster loss | Rebuild + restore etcd + redeploy | 2-8 hours |
| Certificate expiry | kubeadm certs renew all + restart | 15 minutes |
| CA certificate expiry | Full cluster rebuild | 4-12 hours |
### The minimum backup set
# 1. etcd snapshot (cluster state)
etcdctl snapshot save /backup/etcd-$(date +%Y%m%d-%H%M%S).db \
--endpoints=https://127.0.0.1:2379 $ETCD_CERTS
# 2. PKI certificates (identity)
sudo tar czf /backup/pki-$(date +%Y%m%d).tar.gz /etc/kubernetes/pki/
# 3. kubeadm config (cluster parameters)
kubectl -n kube-system get configmap kubeadm-config -o yaml > /backup/kubeadm-config.yaml
# Store all three off-cluster (S3, NFS, different datacenter)
Automated backup script¶
#!/usr/bin/env bash
# /usr/local/bin/k8s-backup.sh — run hourly via cron
set -euo pipefail
BACKUP_DIR="/backup/k8s/$(date +%Y%m%d)"
mkdir -p "$BACKUP_DIR"
ETCD_CERTS="--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key"
# etcd snapshot — capture the filename once so save and verify always
# agree (two separate $(date +%H%M) calls can straddle a minute boundary)
SNAP="$BACKUP_DIR/etcd-$(date +%H%M).db"
ETCDCTL_API=3 etcdctl snapshot save "$SNAP" \
  --endpoints=https://127.0.0.1:2379 $ETCD_CERTS
# Verify snapshot
ETCDCTL_API=3 etcdctl snapshot status "$SNAP" \
  --write-out=json > /dev/null 2>&1 || {
    echo "CRITICAL: etcd snapshot verification failed" >&2
    exit 1
}
# PKI backup (once daily)
if [[ "$(date +%H)" == "02" ]]; then
tar czf "$BACKUP_DIR/pki.tar.gz" /etc/kubernetes/pki/
fi
# Retention: keep 7 days
find /backup/k8s/ -maxdepth 1 -type d -mtime +7 -exec rm -rf {} +
echo "Backup completed: $BACKUP_DIR"
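To schedule it, a cron entry like this works (the script path matches the header comment above; the log path is an example):

```
# /etc/cron.d/k8s-backup — run the backup hourly as root
0 * * * * root /usr/local/bin/k8s-backup.sh >> /var/log/k8s-backup.log 2>&1
```

Pair this with an alert on non-zero exit status or a stale log file — cron silently discards failures, which is exactly how broken backups go unnoticed for months.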
War Story: An untested backup is not a backup. A team had etcd snapshots running hourly for a year. When they needed to restore after a failed upgrade, the snapshots were zero-byte files — the backup script was writing to a full disk and silently producing empty files. They had 365 days of "backups" and zero usable data. The lesson: the `snapshot status` verification step in the script above isn't optional paranoia. It's the difference between "we restored in 30 minutes" and "we rebuilt for 12 hours."
Flashcard Check #4¶
| Question | Answer (cover this column) |
|---|---|
| What three things should you back up for disaster recovery? | etcd snapshot, PKI certificates, kubeadm config |
| How long do kubeadm-generated certificates last by default? | 1 year (CA certs last 10 years) |
| What happens if etcd loses quorum? | Writes fail (cluster is effectively read-only); running workloads keep serving. Restore from snapshot or force-new-cluster |
| How do you verify an etcd backup is valid? | etcdctl snapshot status <file> — check it's not zero-byte |
| Can a cluster survive losing one control plane node in a 3-node HA setup? | Yes — etcd quorum needs 2 of 3, API server runs on remaining nodes |
Exercises¶
Exercise 1: Health Check (5 minutes)¶
Run these commands against any Kubernetes cluster (minikube, kind, or production) and interpret the output:
kubectl get nodes -o wide
kubectl get pods -n kube-system
kubectl get --raw '/readyz?verbose'  # componentstatuses is deprecated/removed; /readyz shows per-check health
What to look for

- All nodes should be `Ready` and on the same minor version
- Control plane pods (apiserver, controller-manager, scheduler, etcd) should be `Running`
- CoreDNS pods should be `Running` (DNS is the most common failure point)
- If any pods show `CrashLoopBackOff` or `Error`, dig deeper with `kubectl describe pod`

Exercise 2: Certificate Audit (10 minutes)¶
On a kubeadm cluster, run kubeadm certs check-expiration and answer:
- Which certificates expire soonest?
- How many days until they expire?
- What's the CA cert expiration date?
If you don't have a kubeadm cluster

Use kind to create one: `kind create cluster`. Then exec into the control plane container and run `kubeadm certs check-expiration`.

Exercise 3: etcd Backup and Restore (20 minutes)¶
- Create an etcd snapshot
- Create some test resources (`kubectl create namespace test-backup`)
- Delete them (`kubectl delete namespace test-backup`)
- Restore the snapshot
- Verify the namespace is back
Hint

The tricky part is stopping the control plane before restore. On a kubeadm cluster, move the static pod manifests out of `/etc/kubernetes/manifests/`, restore, move them back. On kind, you can stop the kubelet container.

Exercise 4: Upgrade Planning (15 minutes)¶
Look at your current cluster version. Plan an upgrade to the next minor version:
- What version are you currently running?
- What's the target version?
- Read the changelog for breaking changes
- What PDBs exist that might block drains?
- Write out the exact command sequence you'd use
Cheat Sheet¶
kubeadm lifecycle¶
| Task | Command |
|---|---|
| Initialize cluster | kubeadm init --control-plane-endpoint <lb> --pod-network-cidr <cidr> |
| Join worker | kubeadm join <api-endpoint> --token <token> --discovery-token-ca-cert-hash <hash> |
| Join control plane | Add --control-plane --certificate-key <key> to join |
| New join token | kubeadm token create --print-join-command |
| Check cert expiry | kubeadm certs check-expiration |
| Renew all certs | kubeadm certs renew all |
| Upgrade plan | kubeadm upgrade plan |
| Upgrade first CP | kubeadm upgrade apply v1.30.x |
| Upgrade other CP/workers | kubeadm upgrade node |
| Reset node | kubeadm reset |
Worker upgrade sequence¶
cordon → drain → apt install kubeadm → kubeadm upgrade node →
apt install kubelet kubectl → systemctl restart kubelet → uncordon
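Expanded into a function for Debian/Ubuntu workers — a sketch, assuming SSH access to the node and that the version string is an example (if the packages are apt-mark held, unhold them first):

```shell
#!/usr/bin/env bash
# Sketch of the worker upgrade sequence above. kubectl steps run from an
# admin machine; the apt/systemctl steps run on the worker over SSH.
set -euo pipefail

upgrade_worker() {
  local node="$1" ver="$2"   # e.g. upgrade_worker worker-3 1.30.2-1.1
  kubectl cordon "$node"
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=300s
  # (run `apt-mark unhold kubeadm kubelet kubectl` first if packages are held)
  ssh "$node" "sudo apt-get install -y kubeadm=${ver} && \
               sudo kubeadm upgrade node && \
               sudo apt-get install -y kubelet=${ver} kubectl=${ver} && \
               sudo systemctl restart kubelet"
  kubectl uncordon "$node"
  kubectl get node "$node"   # confirm Ready and the new version
}
```

Run it one node at a time and wait for `Ready` before moving to the next.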
etcd essentials¶
| Task | Command |
|---|---|
| Health check | etcdctl endpoint health --cluster |
| Status + leader | etcdctl endpoint status --write-out=table --cluster |
| Backup | etcdctl snapshot save /backup/snap.db |
| Verify backup | etcdctl snapshot status /backup/snap.db |
| Restore | etcdctl snapshot restore /backup/snap.db --data-dir=/var/lib/etcd-new |
| Member list | etcdctl member list --write-out=table |
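The "Restore" row compresses several steps. On a single kubeadm control plane the full sequence looks roughly like this (a sketch with default paths; the stopped-manifests directory name is arbitrary, and TLS flags are omitted):

```shell
#!/usr/bin/env bash
# Sketch: restoring an etcd snapshot on a kubeadm control plane.
set -euo pipefail

restore_etcd_snapshot() {
  local snap="$1"
  # 1. Stop the control plane: kubelet stops static pods whose manifests
  #    leave /etc/kubernetes/manifests/
  sudo mkdir -p /etc/kubernetes/manifests-stopped
  sudo mv /etc/kubernetes/manifests/*.yaml /etc/kubernetes/manifests-stopped/
  # 2. Restore into a fresh data dir, then swap it into place
  sudo ETCDCTL_API=3 etcdctl snapshot restore "$snap" --data-dir=/var/lib/etcd-new
  sudo mv /var/lib/etcd /var/lib/etcd-old
  sudo mv /var/lib/etcd-new /var/lib/etcd
  # 3. Restart the control plane by putting the manifests back
  sudo mv /etc/kubernetes/manifests-stopped/*.yaml /etc/kubernetes/manifests/
}

# Example: restore_etcd_snapshot /backup/snap.db
```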
EKS essentials¶
| Task | Command |
|---|---|
| Create cluster | eksctl create cluster --name <name> --version <ver> --managed |
| Upgrade control plane | eksctl upgrade cluster --name <name> --version <ver> --approve |
| Upgrade node group | eksctl upgrade nodegroup --name <ng> --cluster <name> |
| Update add-on | eksctl update addon --name <addon> --cluster <name> |
| List add-ons | eksctl get addons --cluster <name> |
Version skew rules¶
API server 1.30:
kubelet: 1.29 or 1.30 (official skew policy allows up to 3 minors behind since 1.28; stay within 1 for simple rolling upgrades)
kubectl: 1.29, 1.30, or 1.31 (±1 minor of the API server)
etcd: 3.5.x (check release notes)
Upgrade order: etcd → API server → controller-manager → scheduler → workers
Never skip minor versions.
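The skew rule can be checked mechanically. A sketch (function names are illustrative; it encodes this lesson's conservative at-most-1-minor-behind rule, noting upstream allows more):

```shell
#!/usr/bin/env bash
# skew-check.sh — sketch: does a kubelet version satisfy the skew rule
# against the API server? Conservative rule: equal or 1 minor behind.
set -euo pipefail

# Extract the minor number from "1.30", "v1.29.4", etc.
minor() { local v="${1#v}"; v="${v#*.}"; echo "${v%%.*}"; }

kubelet_skew_ok() {
  local api kubelet diff
  api=$(minor "$1"); kubelet=$(minor "$2")
  diff=$(( api - kubelet ))
  # kubelet must never be newer, and at most 1 minor older here
  (( diff >= 0 && diff <= 1 ))
}

# Feed it real versions, e.g.:
#   kubelet_skew_ok "$(kubectl version -o json | jq -r .serverVersion.gitVersion)" \
#     "$(kubectl get node worker-1 -o jsonpath='{.status.nodeInfo.kubeletVersion}')"
```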
Takeaways¶
- etcd backup before upgrades is non-negotiable. Everything else is recoverable. Lost etcd data is not. Make it the first step, not an afterthought.
- Certificates are a silent killer. kubeadm certs expire in 1 year. If you don't upgrade or renew, the cluster dies without warning. Monitor expiry dates like you monitor disk space.
- Control plane first, workers second, one at a time. Version skew rules exist for a reason. The kubelet can run at least one minor version behind the API server — use this to your advantage for rolling upgrades.
- EKS trades control for convenience. AWS manages the hard parts (etcd, certs, control plane upgrades) but you lose direct access. For bare metal, you ARE the managed service.
- CNI choice is permanent (practically). Switching CNIs on a running cluster is possible but painful. Choose Calico for BGP environments, Cilium for modern eBPF clusters, Flannel only for non-production.
- Default deny network policies + RBAC = production baseline. Without both, your cluster is one compromised pod away from a full breach. Add them before you add workloads, not after.
Related Lessons¶
- etcd — The Database That Runs Kubernetes — Deep dive on Raft consensus, etcd performance, and disaster recovery
- Kubernetes Debugging — When Pods Won't Behave — CrashLoopBackOff, OOMKilled, and probe failures
- Kubernetes Services — How Traffic Finds Your Pod — ClusterIP, NodePort, LoadBalancer, and DNS
- What Happens When You kubectl apply — End-to-end trace from YAML to running pod
- Linux Hardening — Closing the Doors — OS-level security that complements Kubernetes hardening