
Kubernetes: From Scratch to Production Upgrade


Topics: kubeadm, EKS, etcd, certificates, control plane architecture, CNI, RBAC, cluster upgrades, disaster recovery
Level: L1–L3 (Foundations through Advanced Ops)
Time: 90–120 minutes
Strategy: Build-up + incident-driven


The Mission

You inherited a 12-node bare-metal Kubernetes cluster running 1.29.4. It powers an e-commerce platform handling 8,000 requests per second. The cluster was built with kubeadm 18 months ago, and nobody has touched the control plane since. Your job: upgrade it to 1.30 with zero downtime, fix the certificate time bomb that is 6 months from detonating, and document the process so the next person doesn't start from scratch.

Along the way, you'll also set up a parallel EKS cluster for the company's new microservices, because your team is going multi-environment. By the end of this lesson, you'll understand how Kubernetes clusters are born, how they grow, and how they die when nobody maintains them.

Let's build a cluster from nothing, then upgrade one under pressure.


Part 1: What kubeadm Actually Does

Before you upgrade, you need to understand what kubeadm built. Most people run kubeadm init once and never think about it again. That's how clusters die silently.

The init sequence

sudo kubeadm init \
  --control-plane-endpoint "k8s-api.internal.example.com:6443" \
  --pod-network-cidr "10.244.0.0/16" \
  --service-cidr "10.96.0.0/12" \
  --upload-certs

That single command does an enormous amount of work:

| Step | What happens | Why it matters |
|------|--------------|----------------|
| 1 | Generates a Certificate Authority (CA) | Every component authenticates via TLS signed by this CA |
| 2 | Creates certificates for API server, etcd, kubelet, front-proxy | These expire in 1 year by default |
| 3 | Generates kubeconfig files | Admin, controller-manager, scheduler each get their own |
| 4 | Writes static pod manifests to /etc/kubernetes/manifests/ | kubelet watches this directory and runs them directly |
| 5 | Starts etcd, API server, controller-manager, scheduler as static pods | The control plane is running |
| 6 | Applies RBAC rules and the bootstrap token | Workers can join using the token |
| 7 | Marks the node as a control plane node (taints it) | No workloads land here by default |

Under the Hood: kubeadm doesn't use Deployments or DaemonSets for the control plane. It writes YAML manifests directly to /etc/kubernetes/manifests/, and the kubelet's static pod watcher picks them up. This means kubectl get pods -n kube-system shows pods like kube-apiserver-cp-1 and etcd-cp-1, but they aren't managed by any controller. If you delete the manifest file, the pod disappears. If you edit the file, the kubelet restarts the pod with the new config. This is how kubeadm upgrades work — it rewrites the manifest files.
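To make the static pod mechanism concrete, here is a heavily trimmed sketch of what one of those manifests looks like — illustrative only; a real kubeadm-generated kube-apiserver manifest carries dozens of flags, volume mounts, and health probes:

```yaml
# /etc/kubernetes/manifests/kube-apiserver.yaml (trimmed sketch, not a
# complete manifest — real ones are generated by kubeadm)
apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver
  namespace: kube-system
spec:
  hostNetwork: true            # binds directly to the node's network
  priorityClassName: system-node-critical
  containers:
    - name: kube-apiserver
      image: registry.k8s.io/kube-apiserver:v1.29.4
      command:
        - kube-apiserver
        - --etcd-servers=https://127.0.0.1:2379
        - --secure-port=6443
```

Edit this file and the kubelet restarts the pod with the new flags; delete it and the API server on that node is gone.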

The join sequence

After init, you get a kubeadm join command with a token and a CA cert hash:

# Worker node join
sudo kubeadm join k8s-api.internal.example.com:6443 \
  --token abcdef.0123456789abcdef \
  --discovery-token-ca-cert-hash sha256:e3b0c44298fc1c149afbf4c8996fb924...

# Additional control plane node join
sudo kubeadm join k8s-api.internal.example.com:6443 \
  --token abcdef.0123456789abcdef \
  --discovery-token-ca-cert-hash sha256:e3b0c44298fc1c149afbf4c8996fb924... \
  --control-plane --certificate-key <key-from-upload-certs>

The --control-plane flag is the difference between adding a worker and adding a control plane node. With it, kubeadm copies the CA certs, generates node-specific certs, and writes control plane static pod manifests on the new node.

Gotcha: The bootstrap token expires after 24 hours by default. If you need to add nodes later, generate a new one: kubeadm token create --print-join-command. This is the most common "why can't my new node join" question.


Flashcard Check #1

| Question | Answer (cover this column) |
|---|---|
| What directory does kubeadm write control plane manifests to? | /etc/kubernetes/manifests/ — kubelet watches it as static pods |
| How does a worker join the cluster? | kubeadm join with a bootstrap token and CA cert hash |
| What's the default bootstrap token lifetime? | 24 hours |
| Why does kubeadm init need --control-plane-endpoint? | For HA — all nodes must reach the API through a single DNS name or load balancer |

Part 2: The Four Control Plane Components

Your cluster has a brain, and it has four parts. When something goes wrong, you need to know which part is misfiring.

kube-apiserver — The Front Door

Every interaction with the cluster goes through the API server. kubectl, the scheduler, the controller-manager, kubelets, CI/CD pipelines — all of them talk to the API server and nothing else. The API server is the only component that talks to etcd.

# Check if the API server is responding
kubectl cluster-info
# Kubernetes control plane is running at https://k8s-api.internal.example.com:6443

# Check API server pod health
kubectl get pods -n kube-system -l component=kube-apiserver

# View API server logs (useful when kubectl itself is flaky)
sudo crictl logs $(sudo crictl ps --name kube-apiserver -q)

Trivia: The Kubernetes API server is completely stateless. It stores nothing locally — every read comes from etcd, every write goes to etcd. You can run 1, 3, or 10 API server instances behind a load balancer, and they're all identical. This statelessness is what makes the API server horizontally scalable, and why etcd performance directly determines API server responsiveness.

kube-controller-manager — The Reconciliation Engine

The controller-manager runs dozens of control loops that watch the cluster's actual state and push it toward the desired state. The ReplicaSet controller creates pods. The Node controller marks nodes as NotReady. The Endpoints controller populates service endpoints.

# Check controller-manager health
kubectl get pods -n kube-system -l component=kube-controller-manager

# Check leader election (only one controller-manager is active in HA setups;
# the leader is recorded in a Lease object, not Endpoints, in modern clusters)
kubectl get lease kube-controller-manager -n kube-system -o yaml

Mental Model: Think of the controller-manager as the "desired state enforcement department." You say "I want 3 replicas." The ReplicaSet controller sees 2 running, creates 1. You drain a node. The node controller marks it NotReady. The endpoint controller removes its pods from Service endpoints. Every controller runs a simple loop: observe → compare → act. That's reconciliation. That's all of Kubernetes.
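That observe → compare → act loop is small enough to sketch in a few lines of bash — a toy simulation, of course; real controllers watch the API server, not shell variables:

```shell
# Toy reconciliation loop: observe actual state, compare to desired, act.
# Every Kubernetes controller runs this same shape against the API server.
desired=3
actual=1   # pretend only one replica survived a node failure

while [ "$actual" -ne "$desired" ]; do
  if [ "$actual" -lt "$desired" ]; then
    actual=$(( actual + 1 ))   # "create a pod"
    echo "observed $((actual - 1))/$desired, created one pod"
  else
    actual=$(( actual - 1 ))   # "delete a pod"
    echo "observed $((actual + 1))/$desired, deleted one pod"
  fi
done
echo "reconciled at $actual/$desired"
```

The loop never asks "what went wrong?" — it only asks "what is the gap?" and closes it one step at a time. That indifference to history is what makes reconciliation robust.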

kube-scheduler — The Matchmaker

The scheduler watches for pods with no nodeName (unscheduled) and assigns them to nodes. It runs a two-phase algorithm: filter (which nodes CAN run this pod?) then score (which of those nodes is BEST?).

# Check scheduler health
kubectl get pods -n kube-system -l component=kube-scheduler

# See why a pod isn't scheduled
kubectl describe pod <pending-pod> -n <namespace> | grep -A 10 Events

etcd — The Brain

etcd stores every Kubernetes object. We covered etcd in depth in the etcd lesson — here we focus on what you need to know for cluster operations.

# Quick health check (run on a control plane node)
export ETCDCTL_API=3
ETCD_CERTS="--cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key"

etcdctl endpoint health --cluster $ETCD_CERTS
etcdctl endpoint status --write-out=table --cluster $ETCD_CERTS
etcdctl member list --write-out=table $ETCD_CERTS

Name Origin: etcd = /etc (Unix configuration directory) + "d" (distributed). Created by CoreOS in 2013, it uses the Raft consensus algorithm — chosen specifically because it's understandable at 3 AM. The original Raft paper's title was literally "In Search of an Understandable Consensus Algorithm."

Stacked vs External etcd

kubeadm supports two topologies:

STACKED (default):                    EXTERNAL:
┌──────────────────┐                  ┌──────────────────┐
│  Control Plane 1 │                  │  Control Plane 1 │
│  ┌─────────────┐ │                  │  (no etcd)       │
│  │ API Server  │ │                  └──────────────────┘
│  │ etcd        │ │                        │
│  │ scheduler   │ │                  ┌─────┴─────────────┐
│  │ ctrl-mgr    │ │                  │  etcd cluster     │
│  └─────────────┘ │                  │  (3 dedicated     │
└──────────────────┘                  │   nodes)          │
                                      └───────────────────┘

Stacked: etcd runs on the same nodes as the control plane. Simpler to set up, fewer machines. But losing a control plane node also loses an etcd member.

External: etcd runs on its own dedicated nodes. Harder to set up, more machines, but etcd failures and control plane failures are decoupled. For enterprise bare-metal clusters, external etcd is the production-grade choice.

Remember: Quorum rule for etcd: you need (N/2) + 1 members alive. 3-member cluster tolerates 1 failure. 5-member cluster tolerates 2. Always odd numbers — a 4-member cluster has the same quorum as 5 (needs 3) but tolerates fewer failures.
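The quorum arithmetic is worth internalizing, and a few lines of bash make the odd-number rule concrete:

```shell
# Quorum = floor(N/2) + 1; failures tolerated = N - quorum.
# Note how 4 members need the same quorum as 5 but tolerate fewer failures.
for n in 1 3 4 5 7; do
  quorum=$(( n / 2 + 1 ))
  tolerated=$(( n - quorum ))
  echo "members=$n  quorum=$quorum  tolerates=$tolerated failure(s)"
done
```

Running this shows why even member counts buy you nothing: going from 3 to 4 members raises the quorum from 2 to 3 without raising fault tolerance at all.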


Part 3: The Upgrade — 1.29 to 1.30

This is the mission. Your cluster is on 1.29.4 and needs to reach 1.30.x. Kubernetes only supports upgrading one minor version at a time — no skipping.

Pre-flight: The Checklist That Saves You

Before touching anything:

# 1. Read the release notes (non-negotiable)
# https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.30.md

# 2. Check current versions
kubectl get nodes -o wide
kubectl version
# (--short was removed in kubectl 1.28; concise output is now the default)

# 3. Back up etcd (THE MOST IMPORTANT STEP)
etcdctl snapshot save /backup/etcd-pre-upgrade-$(date +%Y%m%d-%H%M%S).db \
  --endpoints=https://127.0.0.1:2379 $ETCD_CERTS

# Verify the backup
etcdctl snapshot status /backup/etcd-pre-upgrade-*.db --write-out=table

# 4. Back up PKI certificates
sudo cp -r /etc/kubernetes/pki /backup/pki-pre-upgrade-$(date +%Y%m%d)

# 5. Check PDBs — any that will block drains?
kubectl get pdb -A

# 6. Check version skew rules
# kubelet may trail the API server by up to 3 minor versions (since 1.28),
#   but must never be newer than it
# kubectl must be within 1 minor version of the API server
# Upgrade order: control plane first, workers second
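The skew rules can be sanity-checked with plain arithmetic. A toy helper — `skew_ok` is a name invented here, not a real kubeadm command, and the allowed gap is a parameter because the official policy has widened over time (up to n-3 since Kubernetes 1.28):

```shell
# Toy kubelet/API-server skew check (skew_ok is hypothetical, not a real
# CLI). max_gap=1 is a conservative rule of thumb; the official policy
# since Kubernetes 1.28 allows kubelets up to 3 minors older.
skew_ok() {
  local api_minor=$1 kubelet_minor=$2 max_gap=${3:-1}
  local gap=$(( api_minor - kubelet_minor ))
  [ "$gap" -ge 0 ] && [ "$gap" -le "$max_gap" ]  # kubelet may trail, never lead
}

skew_ok 30 30 && echo "kubelet 1.30 with API 1.30: ok"
skew_ok 30 29 && echo "kubelet 1.29 with API 1.30: ok"
skew_ok 30 31 || echo "kubelet 1.31 with API 1.30: never supported"
```

The asymmetry is the point: a kubelet may lag behind the API server, but a kubelet newer than its API server is unsupported in every Kubernetes release.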

War Story: A team at a fintech company started a 1.28-to-1.29 upgrade on a Friday afternoon. They upgraded the control plane, then started draining workers. Mid-drain, the third worker's kubelet crashed during the upgrade — a known bug in their specific kernel version. They needed to roll back the control plane, but they had no etcd backup. The "rollback" became a 14-hour rebuild-from-scratch, restoring workloads from Helm releases and GitOps state. They lost all Secrets that weren't in their GitOps repo, including database credentials for three services. Monday morning was not fun. The root cause wasn't the kubelet bug — it was the missing etcd snapshot.

Step 1: Upgrade the first control plane node

# On the first control plane node (cp-1):

# Update package repos
sudo apt-get update

# Check available kubeadm versions
apt-cache madison kubeadm | head -5

# Install the new kubeadm
sudo apt-get install -y --allow-change-held-packages kubeadm=1.30.4-1.1

# Verify kubeadm version
kubeadm version

# See what the upgrade will do (dry run)
sudo kubeadm upgrade plan

# Apply the upgrade to the control plane
sudo kubeadm upgrade apply v1.30.4

What kubeadm upgrade apply does:

  1. Validates the cluster is healthy
  2. Downloads new component images
  3. Upgrades the static pod manifests in /etc/kubernetes/manifests/
  4. Applies new RBAC rules and API migrations
  5. Upgrades the kube-proxy and CoreDNS addons

The kubelet restarts each control plane component as its manifest is updated.

# Now upgrade kubelet and kubectl on cp-1
sudo apt-get install -y --allow-change-held-packages \
  kubelet=1.30.4-1.1 kubectl=1.30.4-1.1
sudo systemctl daemon-reload
sudo systemctl restart kubelet

# Verify
kubectl get nodes
# cp-1 should show v1.30.4

Step 2: Upgrade additional control plane nodes

# On cp-2 and cp-3 (note: upgrade node, NOT apply)
sudo apt-get install -y --allow-change-held-packages kubeadm=1.30.4-1.1
sudo kubeadm upgrade node

# Then upgrade kubelet and kubectl
sudo apt-get install -y --allow-change-held-packages \
  kubelet=1.30.4-1.1 kubectl=1.30.4-1.1
sudo systemctl daemon-reload
sudo systemctl restart kubelet

Gotcha: On the first control plane node, you run kubeadm upgrade apply. On subsequent control plane nodes, you run kubeadm upgrade node. Running apply on the second node won't break anything, but it'll redo work that was already done. node is faster and correct.

Step 3: Upgrade workers — one at a time

This is where zero downtime lives or dies.

# From a machine with kubectl access:

# Cordon the worker (no new pods)
kubectl cordon worker-1

# Drain the worker (evict existing pods)
kubectl drain worker-1 \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --timeout=300s

# SSH to the worker:
sudo apt-get update
sudo apt-get install -y --allow-change-held-packages kubeadm=1.30.4-1.1
sudo kubeadm upgrade node

sudo apt-get install -y --allow-change-held-packages \
  kubelet=1.30.4-1.1 kubectl=1.30.4-1.1
sudo systemctl daemon-reload
sudo systemctl restart kubelet

# Back on the kubectl machine:
kubectl uncordon worker-1

# Verify before moving to the next worker
kubectl get nodes -o wide
kubectl get pods -A | grep -v Running | grep -v Completed

Repeat for each worker. For 12 workers, this takes 2-4 hours depending on drain time and how fast your pods reschedule.

Remember: Version skew mnemonic: "Control plane leads, workers follow — never the reverse." kubelets may trail the API server by up to three minor versions (since Kubernetes 1.28), but keeping the gap to one minor keeps upgrades and rollbacks simple. Upgrade order: etcd, API server, controller-manager/scheduler, then workers. Never skip minor versions on the control plane — jumping 1.28 to 1.30 in a single upgrade is not supported.


Flashcard Check #2

| Question | Answer (cover this column) |
|---|---|
| What must you back up before a cluster upgrade? | etcd snapshot AND /etc/kubernetes/pki/ certificates |
| What command upgrades the first control plane node? | kubeadm upgrade apply v1.30.4 |
| What command upgrades additional control plane nodes? | kubeadm upgrade node |
| What's the worker upgrade sequence? | Cordon, drain, upgrade kubeadm/kubelet/kubectl, uncordon |
| Can kubelet 1.29 work with API server 1.30? | Yes — kubelets may trail the API server (up to three minors since 1.28), but never lead |

Part 4: The Certificate Time Bomb

Your cluster is 18 months old. The certificates kubeadm generated at init expired 6 months ago. Wait — the cluster is still running. How?

kubeadm auto-rotates certificates during kubeadm upgrade. Since you just upgraded to 1.30, the certs were renewed. But if you hadn't upgraded for 2+ years on the same minor version, the cluster would have died silently when certs expired.

Checking certificate expiration

sudo kubeadm certs check-expiration
CERTIFICATE                EXPIRES                  RESIDUAL TIME
admin.conf                 Mar 22, 2027 00:00 UTC   364d
apiserver                  Mar 22, 2027 00:00 UTC   364d
apiserver-etcd-client      Mar 22, 2027 00:00 UTC   364d
apiserver-kubelet-client   Mar 22, 2027 00:00 UTC   364d
controller-manager.conf    Mar 22, 2027 00:00 UTC   364d
etcd-healthcheck-client    Mar 22, 2027 00:00 UTC   364d
etcd-peer                  Mar 22, 2027 00:00 UTC   364d
etcd-server                Mar 22, 2027 00:00 UTC   364d
front-proxy-client         Mar 22, 2027 00:00 UTC   364d
scheduler.conf             Mar 22, 2027 00:00 UTC   364d

CERTIFICATE AUTHORITY      EXPIRES                  RESIDUAL TIME
ca                         Mar 19, 2036 00:00 UTC   9y
etcd-ca                    Mar 19, 2036 00:00 UTC   9y
front-proxy-ca             Mar 19, 2036 00:00 UTC   9y

Two critical things to notice:

  1. Component certs expire in 1 year. This is the trap. If you don't upgrade or manually renew before then, the API server can't talk to etcd, kubelets can't talk to the API server, and your cluster goes dark.

  2. CA certs expire in 10 years. These are the root certificates. When THEY expire, the only fix is a full cluster rebuild. Mark this date in a calendar.

Manual certificate renewal

If you can't upgrade (maybe you're on the latest version), renew manually:

# Renew all certificates
sudo kubeadm certs renew all

# Restart the control plane so components pick up the new certs.
# Restarting the kubelet alone is NOT enough — static pods only restart
# when their manifests change. Move the manifests out and back:
sudo mkdir -p /tmp/manifests-paused
sudo mv /etc/kubernetes/manifests/*.yaml /tmp/manifests-paused/
sleep 20   # give the kubelet time to stop the pods
sudo mv /tmp/manifests-paused/*.yaml /etc/kubernetes/manifests/

# Verify the new admin kubeconfig works
kubectl get nodes

Gotcha: After renewing certs, you must also update the kubeconfig files (/etc/kubernetes/admin.conf, etc.). kubeadm certs renew all does this automatically, but if you're renewing individual certs, you need to regenerate the corresponding kubeconfigs manually. If your ~/.kube/config is a copy of admin.conf, recopy it.

Automating cert monitoring

Set up a cron job or monitoring alert:

# Check cert expiry in days (add to your monitoring)
CERT_EXPIRY=$(openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -enddate | cut -d= -f2)
EXPIRY_EPOCH=$(date -d "$CERT_EXPIRY" +%s)
NOW_EPOCH=$(date +%s)
DAYS_LEFT=$(( (EXPIRY_EPOCH - NOW_EPOCH) / 86400 ))
echo "API server cert expires in $DAYS_LEFT days"
# Alert if < 30 days

Part 5: etcd Operations for Cluster Lifecycle

etcd is covered in depth in the etcd lesson. Here's what you need specifically for cluster lifecycle operations.

Backup before everything

This is non-negotiable. Before upgrades, before cert rotation, before adding or removing control plane nodes — back up etcd.

# The backup command you should have memorized
etcdctl snapshot save /backup/etcd-$(date +%Y%m%d-%H%M%S).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Verify it's valid
etcdctl snapshot status /backup/etcd-*.db --write-out=table

Restore — the nuclear option

If an upgrade goes sideways and you need to roll back to a known-good state:

# 1. Stop the control plane on ALL nodes
# Move manifests out of the static pod directory
sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
sudo mv /etc/kubernetes/manifests/etcd.yaml /tmp/
# Wait for pods to stop
sudo crictl ps | grep -E "etcd|apiserver"

# 2. Restore the snapshot on each etcd member
# On cp-1:
etcdctl snapshot restore /backup/etcd-20260323-090000.db \
  --data-dir=/var/lib/etcd-restored \
  --name=cp-1 \
  --initial-cluster="cp-1=https://10.0.1.10:2380,cp-2=https://10.0.1.11:2380,cp-3=https://10.0.1.12:2380" \
  --initial-advertise-peer-urls=https://10.0.1.10:2380

# 3. Update etcd manifest to use new data directory
# Edit /tmp/etcd.yaml: change --data-dir to /var/lib/etcd-restored

# 4. Restore manifests
sudo mv /tmp/etcd.yaml /etc/kubernetes/manifests/
sudo mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/

# 5. Verify
kubectl get nodes
etcdctl endpoint health --cluster $ETCD_CERTS

Gotcha: The restore must be done on EVERY etcd member, each with its own --name and --initial-advertise-peer-urls. Restoring on only one member creates a split-state cluster where members disagree on data. This is a common and catastrophic mistake.

Member management

Adding or replacing etcd members (relevant when scaling control plane nodes):

# List current members
etcdctl member list --write-out=table $ETCD_CERTS

# Remove a failed member
etcdctl member remove <MEMBER_ID> $ETCD_CERTS

# Add a replacement (always add before removing the next one)
etcdctl member add cp-4 --peer-urls=https://10.0.1.13:2380 $ETCD_CERTS

Gotcha: Never remove two etcd members from a 3-member cluster simultaneously. Quorum requires 2 of 3. Remove one, add its replacement, verify health, then proceed to the next. Violating this kills the cluster.


Part 6: EKS — The Cloud Side

Your company also runs EKS for newer microservices. EKS abstracts away the control plane entirely — AWS manages the API server, etcd, controller-manager, and scheduler. You manage the worker nodes (or let AWS manage those too with managed node groups or Fargate).

Creating an EKS cluster

# Install eksctl
curl -sLO "https://github.com/eksctl-io/eksctl/releases/latest/download/eksctl_Linux_amd64.tar.gz"
sudo tar xzf eksctl_Linux_amd64.tar.gz -C /usr/local/bin

# Create a cluster with managed node groups
eksctl create cluster \
  --name prod-eks \
  --region us-east-1 \
  --version 1.30 \
  --nodegroup-name workers \
  --node-type m6i.xlarge \
  --nodes 6 \
  --nodes-min 3 \
  --nodes-max 12 \
  --managed \
  --with-oidc \
  --ssh-access \
  --ssh-public-key my-key

This creates: VPC, subnets, security groups, IAM roles, the EKS control plane, a managed node group, and the aws-auth ConfigMap.

Node group types

| Type | Who manages nodes | Scaling | Use case |
|---|---|---|---|
| Managed node groups | AWS manages AMI updates, draining | ASG + node group update | Default for most workloads |
| Self-managed | You manage everything | ASG only | Custom AMIs, GPU, specialized hardware |
| Fargate profiles | AWS manages everything (serverless) | Per-pod, no nodes visible | Batch jobs, burstable workloads |

Mental Model: EKS managed node groups are like a "kubeadm with a concierge." AWS handles the OS patching, AMI rotation, and node draining during upgrades. Self-managed node groups are like running your own kubeadm workers — full control, full responsibility. Fargate is a different paradigm entirely: no nodes, just pods. You pay per pod, and AWS handles everything underneath.

EKS upgrades

EKS control plane upgrades are a single API call:

# Upgrade the control plane (AWS handles this — takes 20-40 minutes)
eksctl upgrade cluster --name prod-eks --version 1.30 --approve

# Check add-on compatibility BEFORE upgrading node groups
eksctl get addons --cluster prod-eks
# Common add-ons: vpc-cni, kube-proxy, coredns, ebs-csi-driver

# Update add-ons to compatible versions
eksctl update addon --name vpc-cni --cluster prod-eks --version latest
eksctl update addon --name coredns --cluster prod-eks --version latest
eksctl update addon --name kube-proxy --cluster prod-eks --version latest

# Upgrade node groups (rolling update — replaces nodes one by one)
eksctl upgrade nodegroup \
  --name workers \
  --cluster prod-eks \
  --kubernetes-version 1.30

Gotcha: EKS add-on versions must be compatible with the cluster version. Upgrading the control plane without updating add-ons can leave you with a kube-proxy that doesn't match the API server, causing subtle networking issues. Always check add-on versions after a control plane upgrade.


Flashcard Check #3

| Question | Answer (cover this column) |
|---|---|
| Who manages the EKS control plane? | AWS — you never SSH into it, never back up etcd |
| What's the difference between managed and self-managed node groups? | Managed: AWS handles AMI updates, draining during upgrades. Self-managed: you do everything |
| What must you update after an EKS control plane upgrade? | Add-ons (vpc-cni, coredns, kube-proxy) and node groups |
| Can you run etcdctl on EKS? | No — etcd is fully managed and inaccessible |

Part 7: kubeadm vs EKS vs k3s vs RKE2

You'll encounter all of these. Here's when to use which.

| | kubeadm | EKS | k3s | RKE2 |
|---|---|---|---|---|
| Control plane | You manage everything | AWS manages | Built-in, single binary | Built-in, RKE-managed |
| etcd | You manage (backup, certs, monitor) | AWS manages (invisible) | SQLite default, etcd optional | etcd built-in |
| Networking | BYO CNI | vpc-cni default | Flannel default, swap to Cilium | Canal (Calico+Flannel) default |
| Certs | kubeadm manages, 1-year default | AWS manages | k3s auto-rotates | RKE2 auto-rotates |
| Upgrades | Manual, node-by-node | API call + rolling node update | Binary replacement | RKE2 CLI upgrade |
| Best for | Bare metal, full control, learning | AWS production, managed experience | Edge, IoT, homelab, CI | Air-gapped, FIPS, government |
| Operational burden | High | Low | Very low | Medium |
| Cost | Hardware only | $0.10/hr control plane + nodes | Hardware only | Hardware only |

Trivia: k3s gets its name from being "half of k8s": Kubernetes is a 10-letter word abbreviated K8s, so a distribution half its size would be a 5-letter word — hence K3s. Rancher Labs (now SUSE) designed it to run on hardware as small as a Raspberry Pi. Despite the lightweight reputation, k3s is a CNCF-certified conformant Kubernetes distribution.


Part 8: CNI Selection for Bare Metal

On EKS, you get vpc-cni and don't think about it. On bare metal, CNI choice is one of the first and most consequential decisions you make.

| | Calico | Cilium | Flannel |
|---|---|---|---|
| Dataplane | iptables or eBPF | eBPF | VXLAN overlay |
| NetworkPolicy | Full support | Full + extended (L7, DNS) | None (pair with Calico) |
| Performance | Good | Best (bypasses iptables) | Adequate |
| BGP support | Native | Via MetalLB integration | No |
| kube-proxy replacement | No | Yes | No |
| Observability | Basic | Hubble (flow logs, service map) | None |
| Encryption | WireGuard | WireGuard | No |
| Complexity | Medium | Medium-high | Low |
| Best for bare metal | Enterprise, needs BGP peering | Modern clusters, needs L7 policy | Dev/test clusters, simplicity |

Recommendation for enterprise bare metal

Calico if you need BGP peering with your physical network fabric (common in datacenter environments with ToR switches). Calico routes pod traffic via BGP, which means your physical network sees pod IPs directly — no overlay, no encapsulation overhead.

Cilium if you want the fastest dataplane, deep observability via Hubble, and L7 network policies (e.g., "only allow GET /api/v1/public from frontend pods"). Cilium can also replace kube-proxy entirely, eliminating iptables for service routing.

Flannel if you want the simplest possible setup for a non-production cluster. It works, it's stable, but it has no NetworkPolicy support and limited observability.
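The L7 rule described for Cilium might look like the sketch below, using Cilium's CiliumNetworkPolicy CRD. Names like backend-l7 and the production namespace are invented for illustration; check the schema against your installed Cilium version before relying on it:

```yaml
# Sketch: allow only GET /api/v1/public from frontend pods to backend pods
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: backend-l7           # hypothetical name
  namespace: production
spec:
  endpointSelector:
    matchLabels:
      app: backend
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: frontend
      toPorts:
        - ports:
            - port: "8080"
              protocol: TCP
          rules:
            http:                      # L7 enforcement — impossible with
              - method: GET            # plain Kubernetes NetworkPolicy
                path: "/api/v1/public"
```

A standard NetworkPolicy stops at "frontend may reach backend:8080"; this policy additionally rejects any request on that connection that isn't a GET to that exact path.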

# Install Calico (BGP mode for bare metal)
kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.27.0/manifests/calico.yaml

# Install Cilium (with Hubble and kube-proxy replacement)
cilium install --version 1.15.0 \
  --set kubeProxyReplacement=true \
  --set hubble.enabled=true \
  --set hubble.relay.enabled=true

# Verify CNI is healthy
kubectl get pods -n kube-system -l k8s-app=calico-node       # Calico
kubectl -n kube-system exec ds/cilium -- cilium status        # Cilium

Under the Hood: Cilium assigns a numeric identity to each unique set of pod labels. Network policy enforcement happens against these identities, not IP addresses. When a pod restarts with a new IP, Cilium's identity-based rules remain valid. This is why Cilium handles large-scale pod churn better than iptables-based CNIs, where every IP change triggers a rule update.


Part 9: Production Hardening

A running cluster is not a production-ready cluster. Here's the hardening checklist that separates "it works" from "it won't get us fired."

RBAC — Lock It Down

# Audit: who has cluster-admin?
kubectl get clusterrolebindings -o json | \
  jq -r '.items[] | select(.roleRef.name=="cluster-admin") | .subjects[]? | "\(.kind)/\(.name)"'

# Create a namespace-scoped deployer role
kubectl create role deployer \
  --verb=get,list,create,update,patch \
  --resource=deployments,services,configmaps \
  -n production

# Bind it to a service account
kubectl create rolebinding ci-deployer-binding \
  --role=deployer \
  --serviceaccount=ci:ci-deployer \
  -n production

# Test it
kubectl auth can-i create deployments -n production \
  --as=system:serviceaccount:ci:ci-deployer
# yes

kubectl auth can-i delete nodes \
  --as=system:serviceaccount:ci:ci-deployer
# no
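If you prefer GitOps-friendly manifests, the imperative commands above map to declarative objects roughly like this — a sketch whose names match the commands above:

```yaml
# Declarative equivalent of `kubectl create role` / `create rolebinding`
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: deployer
  namespace: production
rules:
  - apiGroups: ["apps"]              # Deployments live in the apps group
    resources: ["deployments"]
    verbs: ["get", "list", "create", "update", "patch"]
  - apiGroups: [""]                  # Services and ConfigMaps are core API
    resources: ["services", "configmaps"]
    verbs: ["get", "list", "create", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ci-deployer-binding
  namespace: production
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: deployer
subjects:
  - kind: ServiceAccount
    name: ci-deployer
    namespace: ci                    # the SA lives in ci, the grant in production
```

Keeping these in a repo makes `kubectl auth can-i` audits reproducible: the granted permissions are exactly what's in version control.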

Remember: RBAC mnemonic: 2R-2B — two Roles (Role, ClusterRole), two Bindings (RoleBinding, ClusterRoleBinding). The binding type determines scope, not the role type. A ClusterRole + RoleBinding = scoped to one namespace.

Network Policies — Default Deny

# Start with deny-all in every production namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress

Then explicitly allow what's needed:

# Allow frontend to reach backend on port 8080, plus DNS
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: backend-ingress
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080

Gotcha: When you add an egress NetworkPolicy, you MUST explicitly allow DNS (UDP/TCP port 53 to kube-system). Otherwise, service discovery silently breaks — pods can't resolve service names, and the failure looks like a networking problem, not a policy problem. This catches almost everyone the first time.
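A companion policy that keeps DNS working under default-deny egress might look like this sketch — it assumes your CoreDNS pods carry the conventional k8s-app: kube-dns label and that namespaces have the auto-applied kubernetes.io/metadata.name label (standard since Kubernetes 1.21):

```yaml
# Sketch: allow DNS egress to CoreDNS alongside a default-deny egress policy
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns            # hypothetical name
  namespace: production
spec:
  podSelector: {}            # applies to every pod in the namespace
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP      # DNS uses UDP primarily...
          port: 53
        - protocol: TCP      # ...but falls back to TCP for large responses
          port: 53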

Pod Security Standards

Kubernetes 1.25+ uses Pod Security Admission (PodSecurityPolicy was deprecated in 1.21 and removed in 1.25):

# Label namespace to enforce restricted standard
kubectl label namespace production \
  pod-security.kubernetes.io/enforce=restricted \
  pod-security.kubernetes.io/warn=restricted \
  pod-security.kubernetes.io/audit=restricted

This blocks: privileged containers, host networking, host PID/IPC, root users, privilege escalation, and writable root filesystems.

Audit Logging

Enable API server audit logs to see who did what:

# /etc/kubernetes/audit-policy.yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  - level: Metadata
    resources:
      - group: ""
        resources: ["secrets", "configmaps"]
  - level: RequestResponse
    resources:
      - group: ""
        resources: ["pods"]
      - group: "apps"
        resources: ["deployments"]
  - level: None
    resources:
      - group: ""
        resources: ["events"]

Add to the API server manifest:

--audit-policy-file=/etc/kubernetes/audit-policy.yaml
--audit-log-path=/var/log/kubernetes/audit.log
--audit-log-maxage=30
--audit-log-maxbackup=10

Encrypt Secrets at rest

By default, Kubernetes Secrets are stored as base64 in etcd — not encrypted. Anyone with etcd access can read them.
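You can see how thin that protection is from any shell — base64 reverses instantly (the password below is a made-up example value):

```shell
# base64 is an encoding, not encryption: anyone who can read etcd can decode it
encoded=$(printf 'db-password-123' | base64)
echo "stored form: $encoded"
printf '%s' "$encoded" | base64 -d; echo   # the plaintext comes right back
```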

# /etc/kubernetes/encryption-config.yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - aescbc:
          keys:
            - name: key1
              secret: <base64-encoded-32-byte-key>
      - identity: {}

Add --encryption-provider-config=/etc/kubernetes/encryption-config.yaml to the API server flags.
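The <base64-encoded-32-byte-key> placeholder must be exactly 32 random bytes, base64-encoded — one way to generate and sanity-check such a key:

```shell
# Generate a random 32-byte key, base64-encoded, for the aescbc provider
key=$(head -c 32 /dev/urandom | base64 | tr -d '\n')
echo "$key"
# sanity check: the decoded key must be exactly 32 bytes
printf '%s' "$key" | base64 -d | wc -c
```

Treat the generated key like any other secret: it decrypts every Secret in etcd, so it belongs in a KMS or secure store, never in Git.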


Part 10: Day-2 Operations

Adding nodes

# Generate a new join token (existing tokens expire after 24h)
kubeadm token create --print-join-command

# On the new node:
sudo kubeadm join k8s-api.internal.example.com:6443 \
  --token <new-token> \
  --discovery-token-ca-cert-hash sha256:<hash>

# Verify
kubectl get nodes
kubectl label node worker-13 topology.kubernetes.io/zone=rack-3

Removing nodes

# Cordon and drain
kubectl cordon worker-5
kubectl drain worker-5 --ignore-daemonsets --delete-emptydir-data --timeout=300s

# Delete the node object
kubectl delete node worker-5

# On the node itself: reset kubeadm state
sudo kubeadm reset
sudo rm -rf /etc/kubernetes/ /var/lib/kubelet/ /var/lib/etcd/

Capacity planning

# Current resource usage at a glance
kubectl top nodes

# Allocated vs allocatable
kubectl describe nodes | grep -A 5 "Allocated resources"

# Find overcommitted nodes
kubectl describe nodes | grep -E "Name:|cpu.*%|memory.*%"

# Pods per node (default limit: 110)
kubectl get pods -A -o wide --no-headers | awk '{print $8}' | sort | uniq -c | sort -rn

Mental Model: Capacity planning is about three thresholds: (1) where you are now, (2) where you'll be at peak, and (3) where things break. If peak is 80% of break, you don't have enough headroom. Target 60-70% utilization at peak to absorb surprises.
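Those three thresholds reduce to one line of arithmetic. The numbers below are illustrative placeholders, not measurements from the cluster:

```shell
# Headroom check: does expected peak stay within ~70% of the breaking point?
peak_rps=8000      # expected peak load (illustrative)
break_rps=10000    # estimated breaking point (illustrative)
pct=$(( peak_rps * 100 / break_rps ))
echo "peak would use ${pct}% of capacity"
if [ "$pct" -gt 70 ]; then
  echo "insufficient headroom: add capacity before peak"
fi
```

With these numbers peak sits at 80% of the breaking point, so the check flags the cluster as under-provisioned.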


Part 11: Disaster Recovery

The disaster recovery hierarchy

| Scenario | Recovery method | Time to recover |
|---|---|---|
| Single worker node failure | Pods reschedule automatically | 5-10 minutes |
| Single control plane node failure | Cluster operates on remaining nodes | Immediate (if HA) |
| etcd member failure | Replace member, cluster self-heals | 15-30 minutes |
| etcd quorum loss | Restore from snapshot | 30-60 minutes |
| Total cluster loss | Rebuild + restore etcd + redeploy | 2-8 hours |
| Certificate expiry | kubeadm certs renew all + restart | 15 minutes |
| CA certificate expiry | Full cluster rebuild | 4-12 hours |
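The "replace member" row deserves a sketch, since it's the step people fumble at 3 a.m. — the member ID, name, and IP below are illustrative, not from a real cluster:

```shell
# Find the failed member's ID
etcdctl member list --write-out=table

# Remove the dead member, then register its replacement
etcdctl member remove 8e9e05c52164694d
etcdctl member add etcd-cp2 --peer-urls=https://10.0.0.12:2380

# Start etcd on the replacement node with
# --initial-cluster-state=existing so it joins the cluster
# instead of trying to bootstrap a new one.
```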

The minimum backup set

# 1. etcd snapshot (cluster state) — $ETCD_CERTS holds the --cacert/--cert/--key flags
etcdctl snapshot save /backup/etcd-$(date +%Y%m%d-%H%M%S).db \
  --endpoints=https://127.0.0.1:2379 $ETCD_CERTS

# 2. PKI certificates (identity)
sudo tar czf /backup/pki-$(date +%Y%m%d).tar.gz /etc/kubernetes/pki/

# 3. kubeadm config (cluster parameters)
kubectl -n kube-system get configmap kubeadm-config -o yaml > /backup/kubeadm-config.yaml

# Store all three off-cluster (S3, NFS, different datacenter)

Automated backup script

#!/usr/bin/env bash
# /usr/local/bin/k8s-backup.sh — run hourly via cron
set -euo pipefail

BACKUP_DIR="/backup/k8s/$(date +%Y%m%d)"
mkdir -p "$BACKUP_DIR"

ETCD_CERTS=(
  --cacert=/etc/kubernetes/pki/etcd/ca.crt
  --cert=/etc/kubernetes/pki/etcd/server.crt
  --key=/etc/kubernetes/pki/etcd/server.key
)

# Compute the snapshot filename once, so save and verify agree
# even if the job straddles a minute boundary
SNAPSHOT="$BACKUP_DIR/etcd-$(date +%H%M).db"

# etcd snapshot
ETCDCTL_API=3 etcdctl snapshot save "$SNAPSHOT" \
  --endpoints=https://127.0.0.1:2379 "${ETCD_CERTS[@]}"

# Verify snapshot
ETCDCTL_API=3 etcdctl snapshot status "$SNAPSHOT" \
  --write-out=json > /dev/null 2>&1 || {
    echo "CRITICAL: etcd snapshot verification failed" >&2
    exit 1
}

# PKI backup (once daily)
if [[ "$(date +%H)" == "02" ]]; then
  tar czf "$BACKUP_DIR/pki.tar.gz" /etc/kubernetes/pki/
fi

# Retention: keep 7 days
find /backup/k8s/ -mindepth 1 -maxdepth 1 -type d -mtime +7 -exec rm -rf {} +

echo "Backup completed: $BACKUP_DIR"
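To actually run it hourly, something like this crontab entry works (an illustrative fragment — the log path is an assumption; adjust to your environment):

```shell
# /etc/cron.d/k8s-backup — run the backup script at the top of every hour
0 * * * * root /usr/local/bin/k8s-backup.sh >> /var/log/k8s-backup.log 2>&1
```

Pair it with an alert on the CRITICAL log line — a backup job that fails silently defeats the whole point.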

War Story: An untested backup is not a backup. A team had etcd snapshots running hourly for a year. When they needed to restore after a failed upgrade, the snapshots were zero-byte files — the backup script was writing to a full disk and silently producing empty files. They had 365 days of "backups" and zero usable data. The lesson: the snapshot status verification step in the script above isn't optional paranoia. It's the difference between "we restored in 30 minutes" and "we rebuilt for 12 hours."


Flashcard Check #4

| Question | Answer (cover this column) |
|---|---|
| What three things should you back up for disaster recovery? | etcd snapshot, PKI certificates, kubeadm config |
| How long do kubeadm-generated certificates last by default? | 1 year (CA certs last 10 years) |
| What happens if etcd loses quorum? | Writes fail, so the cluster is effectively read-only. Restore from snapshot or force-new-cluster |
| How do you verify an etcd backup is valid? | etcdctl snapshot status <file> — check it's not zero-byte |
| Can a cluster survive losing one control plane node in a 3-node HA setup? | Yes — etcd quorum needs 2 of 3, API server runs on remaining nodes |

Exercises

Exercise 1: Health Check (5 minutes)

Run these commands against any Kubernetes cluster (minikube, kind, or production) and interpret the output:

kubectl get nodes -o wide
kubectl get pods -n kube-system
kubectl get --raw /healthz      # componentstatuses is deprecated; the raw /healthz, /livez, and /readyz endpoints replace it

What to look for:
  - All nodes should be `Ready` and on the same minor version
  - Control plane pods (apiserver, controller-manager, scheduler, etcd) should be `Running`
  - CoreDNS pods should be `Running` (DNS is the most common failure point)
  - If any pods show `CrashLoopBackOff` or `Error`, dig deeper with `kubectl describe pod`

Exercise 2: Certificate Audit (10 minutes)

On a kubeadm cluster, run kubeadm certs check-expiration and answer:

  1. Which certificates expire soonest?
  2. How many days until they expire?
  3. What's the CA cert expiration date?
If you don't have a kubeadm cluster: create one with kind (`kind create cluster`), then exec into the control plane container and run `kubeadm certs check-expiration`.

Exercise 3: etcd Backup and Restore (20 minutes)

  1. Create an etcd snapshot
  2. Create some test resources (kubectl create namespace test-backup)
  3. Delete them (kubectl delete namespace test-backup)
  4. Restore the snapshot
  5. Verify the namespace is back
Hint: the tricky part is stopping the control plane before the restore. On a kubeadm cluster, move the static pod manifests out of `/etc/kubernetes/manifests/`, restore, then move them back. On kind, stop the kubelet inside the node container (`systemctl stop kubelet`).
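On a single-node kubeadm control plane, that restore dance looks roughly like this — a sketch with default paths and an illustrative snapshot name; adapt it before running anywhere real:

```shell
# 1. Stop the control plane: the kubelet tears down static pods
#    whose manifests disappear from the manifests directory.
sudo mkdir -p /etc/kubernetes/manifests.bak
sudo mv /etc/kubernetes/manifests/*.yaml /etc/kubernetes/manifests.bak/
sleep 30   # give the kubelet time; watch progress with: sudo crictl ps

# 2. Restore the snapshot into a fresh data directory.
sudo ETCDCTL_API=3 etcdctl snapshot restore /backup/snap.db \
  --data-dir=/var/lib/etcd-restored

# 3. Swap the data directories, keeping the old one as a fallback.
sudo mv /var/lib/etcd /var/lib/etcd.old
sudo mv /var/lib/etcd-restored /var/lib/etcd

# 4. Bring the control plane back.
sudo mv /etc/kubernetes/manifests.bak/*.yaml /etc/kubernetes/manifests/

# 5. Verify the deleted namespace came back.
kubectl get namespace test-backup
```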

Exercise 4: Upgrade Planning (15 minutes)

Look at your current cluster version. Plan an upgrade to the next minor version:

  1. What version are you currently running?
  2. What's the target version?
  3. Read the changelog for breaking changes
  4. What PDBs exist that might block drains?
  5. Write out the exact command sequence you'd use
Key commands
kubectl version   # note: the --short flag was removed in kubectl 1.28
kubectl get pdb -A
kubeadm upgrade plan  # on a kubeadm cluster

Cheat Sheet

kubeadm lifecycle

| Task | Command |
|---|---|
| Initialize cluster | kubeadm init --control-plane-endpoint <lb> --pod-network-cidr <cidr> |
| Join worker | kubeadm join <api-endpoint> --token <token> --discovery-token-ca-cert-hash <hash> |
| Join control plane | Add --control-plane --certificate-key <key> to the join command |
| New join token | kubeadm token create --print-join-command |
| Check cert expiry | kubeadm certs check-expiration |
| Renew all certs | kubeadm certs renew all |
| Upgrade plan | kubeadm upgrade plan |
| Upgrade first control plane | kubeadm upgrade apply v1.30.x |
| Upgrade other control planes / workers | kubeadm upgrade node |
| Reset node | kubeadm reset |

Worker upgrade sequence

cordon → drain → apt install kubeadm → kubeadm upgrade node →
apt install kubelet kubectl → systemctl restart kubelet → uncordon

etcd essentials

| Task | Command |
|---|---|
| Health check | etcdctl endpoint health --cluster |
| Status + leader | etcdctl endpoint status --write-out=table --cluster |
| Backup | etcdctl snapshot save /backup/snap.db |
| Verify backup | etcdctl snapshot status /backup/snap.db |
| Restore | etcdctl snapshot restore /backup/snap.db --data-dir=/var/lib/etcd-new |
| Member list | etcdctl member list --write-out=table |

EKS essentials

| Task | Command |
|---|---|
| Create cluster | eksctl create cluster --name <name> --version <ver> --managed |
| Upgrade control plane | eksctl upgrade cluster --name <name> --version <ver> --approve |
| Upgrade node group | eksctl upgrade nodegroup --name <ng> --cluster <name> |
| Update add-on | eksctl update addon --name <addon> --cluster <name> |
| List add-ons | eksctl get addons --cluster <name> |

Version skew rules

API server 1.30:
  kubelet:  1.27–1.30 (up to three minors behind since Kubernetes 1.28)
  kubectl:  1.29, 1.30, or 1.31 (±1 minor of the API server)
  etcd:     3.5.x (check release notes)

Upgrade order: etcd → API server → controller-manager → scheduler → workers
Never skip minor versions.
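Before an upgrade, the skew rules above can be checked in one pass — a sketch; the jsonpath expressions use standard node status fields:

```shell
# API server version
kubectl version | grep -i 'server version'

# One line per node: name<TAB>kubelet version. Anything further behind
# the server than the skew policy allows should be upgraded first.
kubectl get nodes \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.kubeletVersion}{"\n"}{end}' \
  | sort -t. -k2 -n
```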

Takeaways

  1. etcd backup before upgrades is non-negotiable. Everything else is recoverable. Lost etcd data is not. Make it the first step, not an afterthought.

  2. Certificates are a silent killer. kubeadm certs expire in 1 year. If you don't upgrade or renew, the cluster dies without warning. Monitor expiry dates like you monitor disk space.

  3. Control plane first, workers second, one at a time. Version skew rules exist for a reason. The kubelet can run at least one minor version behind the API server (up to three since 1.28) — use this to your advantage for rolling upgrades.

  4. EKS trades control for convenience. AWS manages the hard parts (etcd, certs, control plane upgrades) but you lose direct access. For bare metal, you ARE the managed service.

  5. CNI choice is permanent (practically). Switching CNIs on a running cluster is possible but painful. Choose Calico for BGP environments, Cilium for modern eBPF clusters, Flannel only for non-production.

  6. Default deny network policies + RBAC = production baseline. Without both, your cluster is one compromised pod away from a full breach. Add them before you add workloads, not after.