# Kubernetes: From Scratch to Production Upgrade

Tags: lesson, kubeadm, eks, etcd, certificates, control-plane-architecture, cni, rbac, cluster-upgrades, disaster-recovery

Topics: kubeadm, EKS, etcd, certificates, control plane architecture, CNI, RBAC, cluster upgrades, disaster recovery
Level: L1–L3 (Foundations through Advanced Ops)
Time: 90–120 minutes
Strategy: Build-up + incident-driven
## The Mission
You inherited a 12-node bare-metal Kubernetes cluster running 1.29.4. It powers an e-commerce platform handling 8,000 requests per second. The cluster was built with kubeadm 18 months ago, and nobody has touched the control plane since. Your job: upgrade it to 1.30 with zero downtime, defuse the certificate time bomb ticking in /etc/kubernetes/pki, and document the process so the next person doesn't start from scratch.
Along the way, you'll also set up a parallel EKS cluster for the company's new microservices, because your team is going multi-environment. By the end of this lesson, you'll understand how Kubernetes clusters are born, how they grow, and how they die when nobody maintains them.
Let's build a cluster from nothing, then upgrade one under pressure.
## Part 1: What kubeadm Actually Does

Before you upgrade, you need to understand what kubeadm built. Most people run `kubeadm init` once and never think about it again. That's how clusters die silently.
### The init sequence
sudo kubeadm init \
--control-plane-endpoint "k8s-api.internal.example.com:6443" \
--pod-network-cidr "10.244.0.0/16" \
--service-cidr "10.96.0.0/12" \
--upload-certs
That single command does an enormous amount of work:
| Step | What happens | Why it matters |
|---|---|---|
| 1 | Generates a Certificate Authority (CA) | Every component authenticates via TLS signed by this CA |
| 2 | Creates certificates for API server, etcd, kubelet, front-proxy | These expire in 1 year by default |
| 3 | Generates kubeconfig files | Admin, controller-manager, scheduler each get their own |
| 4 | Writes static pod manifests to /etc/kubernetes/manifests/ | kubelet watches this directory and runs them directly |
| 5 | Starts etcd, API server, controller-manager, scheduler as static pods | The control plane is running |
| 6 | Applies RBAC rules and the bootstrap token | Workers can join using the token |
| 7 | Marks the node as a control plane node (taints it) | No workloads land here by default |
Under the Hood: kubeadm doesn't use Deployments or DaemonSets for the control plane. It writes YAML manifests directly to `/etc/kubernetes/manifests/`, and the kubelet's static pod watcher picks them up. This means `kubectl get pods -n kube-system` shows pods like `kube-apiserver-cp-1` and `etcd-cp-1`, but they aren't managed by any controller. If you delete the manifest file, the pod disappears. If you edit the file, the kubelet restarts the pod with the new config. This is how kubeadm upgrades work: it rewrites the manifest files.
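For illustration, here is a skeletal manifest in the style kubeadm writes. This is heavily trimmed and the paths and flags are illustrative: the real kube-apiserver manifest carries dozens of flags, volume mounts, and probes.

```shell
# Illustrative only: a minimal static-pod-style manifest. The kubelet names
# the running pod <manifest-name>-<node-name>, e.g. kube-apiserver-cp-1.
cat > /tmp/kube-apiserver-sketch.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver
  namespace: kube-system
spec:
  hostNetwork: true
  containers:
  - name: kube-apiserver
    image: registry.k8s.io/kube-apiserver:v1.30.4
    command:
    - kube-apiserver
    - --etcd-servers=https://127.0.0.1:2379
EOF
```

Dropping a file like this into `/etc/kubernetes/manifests/` is all it takes for the kubelet to start the pod; there is no controller involved.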
### The join sequence
After init, you get a kubeadm join command with a token and a CA cert hash:
# Worker node join
sudo kubeadm join k8s-api.internal.example.com:6443 \
--token abcdef.0123456789abcdef \
--discovery-token-ca-cert-hash sha256:e3b0c44298fc1c149afbf4c8996fb924...
# Additional control plane node join
sudo kubeadm join k8s-api.internal.example.com:6443 \
--token abcdef.0123456789abcdef \
--discovery-token-ca-cert-hash sha256:e3b0c44298fc1c149afbf4c8996fb924... \
--control-plane --certificate-key <key-from-upload-certs>
The --control-plane flag is the difference between adding a worker and adding a
control plane node. With it, kubeadm copies the CA certs, generates node-specific certs,
and writes control plane static pod manifests on the new node.
Gotcha: The bootstrap token expires after 24 hours by default. If you need to add nodes later, generate a new one: `kubeadm token create --print-join-command`. This is the most common "why can't my new node join" question.
## Flashcard Check #1
| Question | Answer (cover this column) |
|---|---|
| What directory does kubeadm write control plane manifests to? | /etc/kubernetes/manifests/ — kubelet watches it as static pods |
| How does a worker join the cluster? | kubeadm join with a bootstrap token and CA cert hash |
| What's the default bootstrap token lifetime? | 24 hours |
| Why does kubeadm init need --control-plane-endpoint? | For HA — all nodes must reach the API through a single DNS name or load balancer |
## Part 2: The Four Control Plane Components
Your cluster has a brain, and it has four parts. When something goes wrong, you need to know which part is misfiring.
### kube-apiserver — The Front Door
Every interaction with the cluster goes through the API server. kubectl, the scheduler,
the controller-manager, kubelets, CI/CD pipelines — all of them talk to the API server and
nothing else. The API server is the only component that talks to etcd.
# Check if the API server is responding
kubectl cluster-info
# Kubernetes control plane is running at https://k8s-api.internal.example.com:6443
# Check API server pod health
kubectl get pods -n kube-system -l component=kube-apiserver
# View API server logs (useful when kubectl itself is flaky)
sudo crictl logs $(sudo crictl ps --name kube-apiserver -q)
Trivia: The Kubernetes API server is completely stateless. It stores nothing locally — every read comes from etcd, every write goes to etcd. You can run 1, 3, or 10 API server instances behind a load balancer, and they're all identical. This statelessness is what makes the API server horizontally scalable, and why etcd performance directly determines API server responsiveness.
### kube-controller-manager — The Reconciliation Engine
The controller-manager runs dozens of control loops that watch the cluster's actual state and push it toward the desired state. The ReplicaSet controller creates pods. The Node controller marks nodes as NotReady. The Endpoints controller populates service endpoints.
# Check controller-manager health
kubectl get pods -n kube-system -l component=kube-controller-manager
# Check leader election (only one controller-manager is active in HA setups;
# modern releases record the leader in a Lease object)
kubectl get lease kube-controller-manager -n kube-system -o yaml
Mental Model: Think of the controller-manager as the "desired state enforcement department." You say "I want 3 replicas." The ReplicaSet controller sees 2 running, creates 1. You drain a node. The node controller marks it NotReady. The endpoint controller removes its pods from Service endpoints. Every controller runs a simple loop: observe → compare → act. That's reconciliation. That's all of Kubernetes.
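The observe → compare → act loop can be sketched in a few lines of bash. This is a toy model, not real controller code: `desired` and `actual` stand in for API reads, and the "act" step is an echo instead of a pod create/delete.

```shell
# Toy reconciliation loop: converge actual replica count toward desired.
# In a real controller, "observe" is an API watch and "act" is an API write.
desired=3
actual=1
while [ "$actual" -ne "$desired" ]; do
  if [ "$actual" -lt "$desired" ]; then
    actual=$((actual + 1))          # act: create one pod
    echo "created pod ($actual/$desired)"
  else
    actual=$((actual - 1))          # act: delete one pod
    echo "deleted pod ($actual/$desired)"
  fi
done
echo "reconciled: $actual replicas"
```

The key property: the loop never assumes its last action succeeded. It re-observes every iteration, which is why controllers recover from partial failures for free.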
### kube-scheduler — The Matchmaker
The scheduler watches for pods with no nodeName (unscheduled) and assigns them to nodes.
It runs a two-phase algorithm: filter (which nodes CAN run this pod?) then score
(which of those nodes is BEST?).
# Check scheduler health
kubectl get pods -n kube-system -l component=kube-scheduler
# See why a pod isn't scheduled
kubectl describe pod <pending-pod> -n <namespace> | grep -A 10 Events
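A toy model of the two phases: node names and capacities below are made up, and the scoring rule (prefer the most free CPU after placement) is only one of many signals the real scheduler weighs.

```shell
# Toy two-phase scheduling: filter nodes with enough free CPU (millicores),
# then score the survivors. Real scheduler plugins consider memory, affinity,
# taints, topology spread, and more.
pod_request=500
nodes="node-a:300 node-b:900 node-c:1500"   # name:free-cpu (hypothetical)

best="" best_score=-1
for entry in $nodes; do
  node=${entry%%:*}
  free=${entry##*:}
  # Filter: can this node run the pod at all?
  [ "$free" -ge "$pod_request" ] || continue
  # Score: prefer the node that keeps the most CPU free after placement.
  score=$(( free - pod_request ))
  if [ "$score" -gt "$best_score" ]; then
    best_score=$score
    best=$node
  fi
done
echo "scheduled on: $best"
```

Here the filter phase eliminates node-a (300 < 500), and scoring picks node-c over node-b.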
### etcd — The Brain
etcd stores every Kubernetes object. We covered etcd in depth in the etcd lesson — here we focus on what you need to know for cluster operations.
# Quick health check (run on a control plane node)
export ETCDCTL_API=3
ETCD_CERTS="--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key"
etcdctl endpoint health --cluster $ETCD_CERTS
etcdctl endpoint status --write-out=table --cluster $ETCD_CERTS
etcdctl member list --write-out=table $ETCD_CERTS
Name Origin: etcd = `/etc` (Unix configuration directory) + "d" (distributed). Created by CoreOS in 2013, it uses the Raft consensus algorithm — chosen specifically because it's understandable at 3 AM. The original Raft paper's title was literally "In Search of an Understandable Consensus Algorithm."
### Stacked vs External etcd
kubeadm supports two topologies:
STACKED (default): EXTERNAL:
┌──────────────────┐ ┌──────────────────┐
│ Control Plane 1 │ │ Control Plane 1 │
│ ┌─────────────┐ │ │ (no etcd) │
│ │ API Server │ │ └──────────────────┘
│ │ etcd │ │ │
│ │ scheduler │ │ ┌─────┴─────────────┐
│ │ ctrl-mgr │ │ │ etcd cluster │
│ └─────────────┘ │ │ (3 dedicated │
└──────────────────┘ │ nodes) │
└───────────────────┘
Stacked: etcd runs on the same nodes as the control plane. Simpler to set up, fewer machines. But losing a control plane node also loses an etcd member.
External: etcd runs on its own dedicated nodes. Harder to set up, more machines, but etcd failures and control plane failures are decoupled. For enterprise bare-metal clusters, external etcd is the production-grade choice.
Remember: Quorum rule for etcd: you need (N/2) + 1 members alive. 3-member cluster tolerates 1 failure. 5-member cluster tolerates 2. Always odd numbers — a 4-member cluster has the same quorum as 5 (needs 3) but tolerates fewer failures.
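The quorum arithmetic from the Remember box can be computed directly; a quick sketch:

```shell
# Quorum math for an N-member etcd cluster:
#   quorum = floor(N/2) + 1, tolerated failures = N - quorum
quorum()    { echo $(( $1 / 2 + 1 )); }
tolerated() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 3 4 5; do
  echo "members=$n quorum=$(quorum $n) tolerates=$(tolerated $n)"
done
```

The loop makes the odd-number rule visible: a 4-member cluster needs a quorum of 3 yet still only tolerates 1 failure, so the fourth member buys you nothing but risk.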
## Part 3: The Upgrade — 1.29 to 1.30
This is the mission. Your cluster is on 1.29.4 and needs to reach 1.30.x. Kubernetes only supports upgrading one minor version at a time — no skipping.
### Pre-flight: The Checklist That Saves You
Before touching anything:
# 1. Read the release notes (non-negotiable)
# https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.30.md
# 2. Check current versions
kubectl get nodes -o wide
kubectl version   # (--short was removed in recent kubectl releases)
# 3. Back up etcd (THE MOST IMPORTANT STEP)
etcdctl snapshot save /backup/etcd-pre-upgrade-$(date +%Y%m%d-%H%M%S).db \
--endpoints=https://127.0.0.1:2379 $ETCD_CERTS
# Verify the backup
etcdctl snapshot status /backup/etcd-pre-upgrade-*.db --write-out=table
# 4. Back up PKI certificates
sudo cp -r /etc/kubernetes/pki /backup/pki-pre-upgrade-$(date +%Y%m%d)
# 5. Check PDBs — any that will block drains?
kubectl get pdb -A
# 6. Check version skew rules
# kubelet may trail the API server by up to 3 minor versions (never newer)
# kubectl must be within 1 minor version of the API server
# Upgrade order: control plane first, workers second
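The skew check in step 6 can be mechanized. A sketch, encoding the published policy for recent releases (kubelet never newer than the API server, and at most 3 minors older); many teams deliberately keep the gap to 1 during rolling upgrades:

```shell
# Validate kubelet-vs-API-server skew from two "vMAJOR.MINOR.PATCH" strings.
skew_ok() {
  local api_minor kubelet_minor gap
  api_minor=$(echo "$1" | cut -d. -f2)
  kubelet_minor=$(echo "$2" | cut -d. -f2)
  gap=$(( api_minor - kubelet_minor ))
  # kubelet must never lead, and may trail by at most 3 minors (k8s >= 1.28)
  [ "$gap" -ge 0 ] && [ "$gap" -le 3 ]
}

skew_ok v1.30.4 v1.29.4 && echo "ok: kubelet trails by one minor"
skew_ok v1.30.4 v1.31.0 || echo "violation: kubelet newer than API server"
```

Run it against `kubectl get nodes -o wide` output before and after each node upgrade.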
War Story: A team at a fintech company started a 1.28-to-1.29 upgrade on a Friday afternoon. They upgraded the control plane, then started draining workers. Mid-drain, the third worker's kubelet crashed during the upgrade — a known bug in their specific kernel version. They needed to roll back the control plane, but they had no etcd backup. The "rollback" became a 14-hour rebuild-from-scratch, restoring workloads from Helm releases and GitOps state. They lost all Secrets that weren't in their GitOps repo, including database credentials for three services. Monday morning was not fun. The root cause wasn't the kubelet bug — it was the missing etcd snapshot.
### Step 1: Upgrade the first control plane node
# On the first control plane node (cp-1):
# Update package repos
sudo apt-get update
# Check available kubeadm versions
apt-cache madison kubeadm | head -5
# Install the new kubeadm
sudo apt-get install -y --allow-change-held-packages kubeadm=1.30.4-1.1
# Verify kubeadm version
kubeadm version
# See what the upgrade will do (dry run)
sudo kubeadm upgrade plan
# Apply the upgrade to the control plane
sudo kubeadm upgrade apply v1.30.4
What `kubeadm upgrade apply` does:

- Validates the cluster is healthy
- Downloads new component images
- Upgrades the static pod manifests in /etc/kubernetes/manifests/
- Applies new RBAC rules and API migrations
- Upgrades the kube-proxy and CoreDNS addons

The kubelet restarts each control plane component as its manifest is updated.
# Now upgrade kubelet and kubectl on cp-1
sudo apt-get install -y --allow-change-held-packages \
kubelet=1.30.4-1.1 kubectl=1.30.4-1.1
sudo systemctl daemon-reload
sudo systemctl restart kubelet
# Verify
kubectl get nodes
# cp-1 should show v1.30.4
### Step 2: Upgrade additional control plane nodes
# On cp-2 and cp-3 (note: upgrade node, NOT apply)
sudo apt-get install -y --allow-change-held-packages kubeadm=1.30.4-1.1
sudo kubeadm upgrade node
# Then upgrade kubelet and kubectl
sudo apt-get install -y --allow-change-held-packages \
kubelet=1.30.4-1.1 kubectl=1.30.4-1.1
sudo systemctl daemon-reload
sudo systemctl restart kubelet
Gotcha: On the first control plane node, you run `kubeadm upgrade apply`. On subsequent control plane nodes, you run `kubeadm upgrade node`. Running `apply` on the second node won't break anything, but it'll redo work that was already done. `node` is faster and correct.
### Step 3: Upgrade workers — one at a time
This is where zero downtime lives or dies.
# From a machine with kubectl access:
# Cordon the worker (no new pods)
kubectl cordon worker-1
# Drain the worker (evict existing pods)
kubectl drain worker-1 \
--ignore-daemonsets \
--delete-emptydir-data \
--timeout=300s
# SSH to the worker:
sudo apt-get update
sudo apt-get install -y --allow-change-held-packages kubeadm=1.30.4-1.1
sudo kubeadm upgrade node
sudo apt-get install -y --allow-change-held-packages \
kubelet=1.30.4-1.1 kubectl=1.30.4-1.1
sudo systemctl daemon-reload
sudo systemctl restart kubelet
# Back on the kubectl machine:
kubectl uncordon worker-1
# Verify before moving to the next worker
kubectl get nodes -o wide
kubectl get pods -A | grep -v Running | grep -v Completed
Repeat for each worker. For 12 workers, this takes 2-4 hours depending on drain time and how fast your pods reschedule.
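Once you trust the per-worker cycle, it is worth wrapping in a function. A sketch, assuming passwordless SSH and a hypothetical `upgrade-node.sh` that wraps the apt/kubeadm/kubelet steps above (both assumptions — adapt to your environment):

```shell
# One worker per call; the && chain aborts at the first failed step so a
# broken node is never uncordoned blindly.
upgrade_worker() {
  local w=$1
  kubectl cordon "$w" &&
  kubectl drain "$w" --ignore-daemonsets --delete-emptydir-data --timeout=300s &&
  ssh "$w" "sudo ./upgrade-node.sh" &&
  kubectl uncordon "$w" &&
  # Gate: refuse to move on until the node reports Ready again.
  kubectl wait --for=condition=Ready "node/$w" --timeout=120s
}

# for w in worker-{1..12}; do upgrade_worker "$w" || break; done
```

The `|| break` in the commented driver loop is the important part: stop the rollout at the first unhealthy node instead of marching on.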
Remember: Version skew mnemonic: "Control plane leads, workers follow." Upgrade order: etcd, API server, controller-manager/scheduler, then workers. Kubelets may trail the API server (by up to 3 minor versions on recent releases) but must never be newer. And never skip minor versions when upgrading the control plane — 1.28 straight to 1.30 is not supported.
## Flashcard Check #2
| Question | Answer (cover this column) |
|---|---|
| What must you back up before a cluster upgrade? | etcd snapshot AND /etc/kubernetes/pki/ certificates |
| What command upgrades the first control plane node? | kubeadm upgrade apply v1.30.4 |
| What command upgrades additional control plane nodes? | kubeadm upgrade node |
| What's the worker upgrade sequence? | Cordon, drain, upgrade kubeadm/kubelet/kubectl, uncordon |
| Can kubelet 1.29 work with API server 1.30? | Yes — kubelet may trail the API server (up to 3 minors on recent releases) but must never be newer |
## Part 4: The Certificate Time Bomb
Your cluster is 18 months old, and kubeadm-issued certificates last one year. On paper, the certs generated at init should have expired 6 months ago — yet the cluster is still running. How?

kubeadm renews certificates automatically during `kubeadm upgrade`. Since you just upgraded to 1.30, the certs were renewed for another year. But a cluster that sits for over a year with no upgrade and no manual renewal dies silently the day its certs expire.
### Checking certificate expiration

sudo kubeadm certs check-expiration

CERTIFICATE                EXPIRES                  RESIDUAL TIME
admin.conf                 Mar 22, 2027 00:00 UTC   364d
apiserver                  Mar 22, 2027 00:00 UTC   364d
apiserver-etcd-client      Mar 22, 2027 00:00 UTC   364d
apiserver-kubelet-client   Mar 22, 2027 00:00 UTC   364d
controller-manager.conf    Mar 22, 2027 00:00 UTC   364d
etcd-healthcheck-client    Mar 22, 2027 00:00 UTC   364d
etcd-peer                  Mar 22, 2027 00:00 UTC   364d
etcd-server                Mar 22, 2027 00:00 UTC   364d
front-proxy-client         Mar 22, 2027 00:00 UTC   364d
scheduler.conf             Mar 22, 2027 00:00 UTC   364d

CERTIFICATE AUTHORITY      EXPIRES                  RESIDUAL TIME
ca                         Mar 19, 2036 00:00 UTC   9y
etcd-ca                    Mar 19, 2036 00:00 UTC   9y
front-proxy-ca             Mar 19, 2036 00:00 UTC   9y
Two critical things to notice:

1. Component certs expire in 1 year. This is the trap. If you don't upgrade or manually renew before then, the API server can't talk to etcd, kubelets can't talk to the API server, and your cluster goes dark.
2. CA certs expire in 10 years. These are the root certificates. When THEY expire, the only fix is a full cluster rebuild. Mark this date in a calendar.
### Manual certificate renewal
If you can't upgrade (maybe you're on the latest version), renew manually:
# Renew all certificates
sudo kubeadm certs renew all
# Restart the control plane components (they need to pick up new certs)
# On kubeadm clusters, restart kubelet which restarts static pods:
sudo systemctl restart kubelet
# Verify the new admin kubeconfig works
kubectl get nodes
Gotcha: After renewing certs, you must also update the kubeconfig files (`/etc/kubernetes/admin.conf`, etc.). `kubeadm certs renew all` does this automatically, but if you're renewing individual certs, you need to regenerate the corresponding kubeconfigs manually. If your `~/.kube/config` is a copy of `admin.conf`, recopy it.
### Automating cert monitoring
Set up a cron job or monitoring alert:
# Check cert expiry in days (add to your monitoring)
CERT_EXPIRY=$(openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -enddate | cut -d= -f2)
EXPIRY_EPOCH=$(date -d "$CERT_EXPIRY" +%s)
NOW_EPOCH=$(date +%s)
DAYS_LEFT=$(( (EXPIRY_EPOCH - NOW_EPOCH) / 86400 ))
echo "API server cert expires in $DAYS_LEFT days"
# Alert if < 30 days
if [ "$DAYS_LEFT" -lt 30 ]; then
  echo "WARNING: API server certificate expires in $DAYS_LEFT days" >&2
  exit 1   # nonzero exit so cron/monitoring can flag it
fi
## Part 5: etcd Operations for Cluster Lifecycle
etcd is covered in depth in the etcd lesson. Here's what you need specifically for cluster lifecycle operations.
### Backup before everything
This is non-negotiable. Before upgrades, before cert rotation, before adding or removing control plane nodes — back up etcd.
# The backup command you should have memorized
etcdctl snapshot save /backup/etcd-$(date +%Y%m%d-%H%M%S).db \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# Verify it's valid
etcdctl snapshot status /backup/etcd-*.db --write-out=table
### Restore — the nuclear option
If an upgrade goes sideways and you need to roll back to a known-good state:
# 1. Stop the control plane on ALL nodes
# Move manifests out of the static pod directory
sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
sudo mv /etc/kubernetes/manifests/etcd.yaml /tmp/
# Wait for pods to stop
sudo crictl ps | grep -E "etcd|apiserver"
# 2. Restore the snapshot on each etcd member
# On cp-1:
etcdctl snapshot restore /backup/etcd-20260323-090000.db \
--data-dir=/var/lib/etcd-restored \
--name=cp-1 \
--initial-cluster="cp-1=https://10.0.1.10:2380,cp-2=https://10.0.1.11:2380,cp-3=https://10.0.1.12:2380" \
--initial-advertise-peer-urls=https://10.0.1.10:2380
# 3. Update etcd manifest to use new data directory
# Edit /tmp/etcd.yaml: change --data-dir to /var/lib/etcd-restored
# 4. Restore manifests
sudo mv /tmp/etcd.yaml /etc/kubernetes/manifests/
sudo mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/
# 5. Verify
kubectl get nodes
etcdctl endpoint health --cluster $ETCD_CERTS
Gotcha: The restore must be done on EVERY etcd member, each with its own `--name` and `--initial-advertise-peer-urls`. Restoring on only one member creates a split-state cluster where members disagree on data. This is a common and catastrophic mistake.
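One way to avoid mismatched per-member flags is to generate every member's restore command from a single cluster definition. A sketch using this lesson's example names and IPs (it only prints the commands; run each on its own node):

```shell
# Emit per-member restore commands so each member gets its own --name and
# --initial-advertise-peer-urls, while --initial-cluster stays identical.
SNAPSHOT=/backup/etcd-20260323-090000.db
CLUSTER="cp-1=https://10.0.1.10:2380,cp-2=https://10.0.1.11:2380,cp-3=https://10.0.1.12:2380"

for member in ${CLUSTER//,/ }; do
  name=${member%%=*}    # e.g. cp-2
  peer=${member#*=}     # e.g. https://10.0.1.11:2380
  echo "# run on $name:"
  echo "etcdctl snapshot restore $SNAPSHOT \\"
  echo "  --data-dir=/var/lib/etcd-restored --name=$name \\"
  echo "  --initial-cluster=$CLUSTER \\"
  echo "  --initial-advertise-peer-urls=$peer"
done
```

Generating all three commands from one `CLUSTER` string makes it impossible for the members to disagree about the cluster topology.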
### Member management
Adding or replacing etcd members (relevant when scaling control plane nodes):
# List current members
etcdctl member list --write-out=table $ETCD_CERTS
# Remove a failed member
etcdctl member remove <MEMBER_ID> $ETCD_CERTS
# Add a replacement (always add before removing the next one)
etcdctl member add cp-4 --peer-urls=https://10.0.1.13:2380 $ETCD_CERTS
Gotcha: Never remove two etcd members from a 3-member cluster simultaneously. Quorum requires 2 of 3. Remove one, add its replacement, verify health, then proceed to the next. Violating this kills the cluster.
## Part 6: EKS — The Cloud Side
Your company also runs EKS for newer microservices. EKS abstracts away the control plane entirely — AWS manages the API server, etcd, controller-manager, and scheduler. You manage the worker nodes (or let AWS manage those too with managed node groups or Fargate).
### Creating an EKS cluster
# Install eksctl
curl -sLO "https://github.com/eksctl-io/eksctl/releases/latest/download/eksctl_Linux_amd64.tar.gz"
sudo tar xzf eksctl_Linux_amd64.tar.gz -C /usr/local/bin
# Create a cluster with managed node groups
eksctl create cluster \
--name prod-eks \
--region us-east-1 \
--version 1.30 \
--nodegroup-name workers \
--node-type m6i.xlarge \
--nodes 6 \
--nodes-min 3 \
--nodes-max 12 \
--managed \
--with-oidc \
--ssh-access \
--ssh-public-key my-key
This creates: VPC, subnets, security groups, IAM roles, the EKS control plane, a managed node group, and the aws-auth ConfigMap.
### Node group types
| Type | Who manages nodes | Scaling | Use case |
|---|---|---|---|
| Managed node groups | AWS manages AMI updates, draining | ASG + node group update | Default for most workloads |
| Self-managed | You manage everything | ASG only | Custom AMIs, GPU, specialized hardware |
| Fargate profiles | AWS manages everything (serverless) | Per-pod, no nodes visible | Batch jobs, burstable workloads |
Mental Model: EKS managed node groups are like a "kubeadm with a concierge." AWS handles the OS patching, AMI rotation, and node draining during upgrades. Self-managed node groups are like running your own kubeadm workers — full control, full responsibility. Fargate is a different paradigm entirely: no nodes, just pods. You pay per pod, and AWS handles everything underneath.
### EKS upgrades
EKS control plane upgrades are a single API call:
# Upgrade the control plane (AWS handles this — takes 20-40 minutes)
eksctl upgrade cluster --name prod-eks --version 1.30 --approve
# Check add-on compatibility BEFORE upgrading node groups
eksctl get addons --cluster prod-eks
# Common add-ons: vpc-cni, kube-proxy, coredns, ebs-csi-driver
# Update add-ons to compatible versions
eksctl update addon --name vpc-cni --cluster prod-eks --version latest
eksctl update addon --name coredns --cluster prod-eks --version latest
eksctl update addon --name kube-proxy --cluster prod-eks --version latest
# Upgrade node groups (rolling update — replaces nodes one by one)
eksctl upgrade nodegroup \
--name workers \
--cluster prod-eks \
--kubernetes-version 1.30
Gotcha: EKS add-on versions must be compatible with the cluster version. Upgrading the control plane without updating add-ons can leave you with a kube-proxy that doesn't match the API server, causing subtle networking issues. Always check add-on versions after a control plane upgrade.
## Flashcard Check #3
| Question | Answer (cover this column) |
|---|---|
| Who manages the EKS control plane? | AWS — you never SSH into it, never back up etcd |
| What's the difference between managed and self-managed node groups? | Managed: AWS handles AMI updates, draining during upgrades. Self-managed: you do everything |
| What must you update after an EKS control plane upgrade? | Add-ons (vpc-cni, coredns, kube-proxy) and node groups |
| Can you run etcdctl on EKS? | No — etcd is fully managed and inaccessible |
## Part 7: kubeadm vs EKS vs k3s vs RKE2
You'll encounter all of these. Here's when to use which.
| kubeadm | EKS | k3s | RKE2 | |
|---|---|---|---|---|
| Control plane | You manage everything | AWS manages | Built-in, single binary | Built-in, RKE-managed |
| etcd | You manage (backup, certs, monitor) | AWS manages (invisible) | SQLite default, etcd optional | etcd built-in |
| Networking | BYO CNI | vpc-cni default | Flannel default, swap to Cilium | Canal (Calico+Flannel) default |
| Certs | kubeadm manages, 1-year default | AWS manages | k3s auto-rotates | RKE2 auto-rotates |
| Upgrades | Manual, node-by-node | API call + rolling node update | Binary replacement | RKE2 CLI upgrade |
| Best for | Bare metal, full control, learning | AWS production, managed experience | Edge, IoT, homelab, CI | Air-gapped, FIPS, government |
| Operational burden | High | Low | Very low | Medium |
| Cost | Hardware only | $0.10/hr control plane + nodes | Hardware only | Hardware only |
Trivia: k3s gets its name from being "half of k8s": Kubernetes is a 10-letter word stylized as K8s, so something half its size would be a 5-letter word, stylized as K3s. Rancher Labs (now SUSE) designed it to run on hardware as small as a Raspberry Pi. Despite the lightweight reputation, k3s is a CNCF-certified conformant Kubernetes distribution.
## Part 8: CNI Selection for Bare Metal
On EKS, you get vpc-cni and don't think about it. On bare metal, CNI choice is one of the first and most consequential decisions you make.
| Calico | Cilium | Flannel | |
|---|---|---|---|
| Dataplane | iptables or eBPF | eBPF | VXLAN overlay |
| NetworkPolicy | Full support | Full + extended (L7, DNS) | None (pair with Calico) |
| Performance | Good | Best (bypasses iptables) | Adequate |
| BGP support | Native | Native (BGP control plane) | No |
| kube-proxy replacement | In eBPF mode | Yes | No |
| Observability | Basic | Hubble (flow logs, service map) | None |
| Encryption | WireGuard | WireGuard | No |
| Complexity | Medium | Medium-high | Low |
| Best for bare metal | Enterprise, needs BGP peering | Modern clusters, needs L7 policy | Dev/test clusters, simplicity |
### Recommendation for enterprise bare metal
Calico if you need BGP peering with your physical network fabric (common in datacenter environments with ToR switches). Calico routes pod traffic via BGP, which means your physical network sees pod IPs directly — no overlay, no encapsulation overhead.
Cilium if you want the fastest dataplane, deep observability via Hubble, and L7 network policies (e.g., "only allow GET /api/v1/public from frontend pods"). Cilium can also replace kube-proxy entirely, eliminating iptables for service routing.
Flannel if you want the simplest possible setup for a non-production cluster. It works, it's stable, but it has no NetworkPolicy support and limited observability.
# Install Calico (BGP mode for bare metal)
kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.27.0/manifests/calico.yaml
# Install Cilium (with Hubble and kube-proxy replacement)
cilium install --version 1.15.0 \
--set kubeProxyReplacement=true \
--set hubble.enabled=true \
--set hubble.relay.enabled=true
# Verify CNI is healthy
kubectl get pods -n kube-system -l k8s-app=calico-node # Calico
kubectl -n kube-system exec ds/cilium -- cilium status # Cilium
Under the Hood: Cilium assigns a numeric identity to each unique set of pod labels. Network policy enforcement happens against these identities, not IP addresses. When a pod restarts with a new IP, Cilium's identity-based rules remain valid. This is why Cilium handles large-scale pod churn better than iptables-based CNIs, where every IP change triggers a rule update.
## Part 9: Production Hardening
A running cluster is not a production-ready cluster. Here's the hardening checklist that separates "it works" from "it won't get us fired."
### RBAC — Lock It Down
# Audit: who has cluster-admin?
kubectl get clusterrolebindings -o json | \
jq -r '.items[] | select(.roleRef.name=="cluster-admin") | .subjects[]? | "\(.kind)/\(.name)"'
# Create a namespace-scoped deployer role
kubectl create role deployer \
--verb=get,list,create,update,patch \
--resource=deployments,services,configmaps \
-n production
# Bind it to a service account
kubectl create rolebinding ci-deployer-binding \
--role=deployer \
--serviceaccount=ci:ci-deployer \
-n production
# Test it
kubectl auth can-i create deployments -n production \
--as=system:serviceaccount:ci:ci-deployer
# yes
kubectl auth can-i delete nodes \
--as=system:serviceaccount:ci:ci-deployer
# no
Remember: RBAC mnemonic: 2R-2B — two Roles (Role, ClusterRole), two Bindings (RoleBinding, ClusterRoleBinding). The binding type determines scope, not the role type. A ClusterRole + RoleBinding = scoped to one namespace.
### Network Policies — Default Deny
# Start with deny-all in every production namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-all
namespace: production
spec:
podSelector: {}
policyTypes:
- Ingress
- Egress
Then explicitly allow what's needed:
# Allow frontend to reach backend on port 8080, plus DNS
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: backend-ingress
namespace: production
spec:
podSelector:
matchLabels:
app: backend
policyTypes:
- Ingress
ingress:
- from:
- podSelector:
matchLabels:
app: frontend
ports:
- protocol: TCP
port: 8080
Gotcha: When you add an egress NetworkPolicy, you MUST explicitly allow DNS (UDP/TCP port 53 to kube-system). Otherwise, service discovery silently breaks — pods can't resolve service names, and the failure looks like a networking problem, not a policy problem. This catches almost everyone the first time.
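A companion allow-DNS policy might look like the sketch below. The selectors are assumptions: kubeadm's CoreDNS pods carry the `k8s-app=kube-dns` label, and modern clusters auto-label namespaces with `kubernetes.io/metadata.name` — verify both against your cluster before applying.

```shell
# Write an allow-DNS egress policy to pair with the default-deny above.
# Label selectors are assumptions; match your actual CoreDNS labels.
cat > /tmp/allow-dns.yaml <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
EOF
# kubectl apply -f /tmp/allow-dns.yaml
```

Apply this in every namespace that carries a default-deny egress policy, before anyone files the inevitable "DNS is broken" ticket.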
### Pod Security Standards
Kubernetes 1.25+ uses Pod Security Admission (replacing the deprecated PodSecurityPolicy):
# Label namespace to enforce restricted standard
kubectl label namespace production \
pod-security.kubernetes.io/enforce=restricted \
pod-security.kubernetes.io/warn=restricted \
pod-security.kubernetes.io/audit=restricted
This blocks: privileged containers, host networking, host PID/IPC, root users, privilege escalation, and writable root filesystems.
### Audit Logging
Enable API server audit logs to see who did what:
# /etc/kubernetes/audit-policy.yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
- level: Metadata
resources:
- group: ""
resources: ["secrets", "configmaps"]
- level: RequestResponse
resources:
- group: ""
resources: ["pods"]
- group: "apps"
resources: ["deployments"]
- level: None
resources:
- group: ""
resources: ["events"]
Add to the API server manifest:
--audit-policy-file=/etc/kubernetes/audit-policy.yaml
--audit-log-path=/var/log/kubernetes/audit.log
--audit-log-maxage=30
--audit-log-maxbackup=10
### Encrypt Secrets at rest
By default, Kubernetes Secrets are stored as base64 in etcd — not encrypted. Anyone with etcd access can read them.
# /etc/kubernetes/encryption-config.yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
- resources:
- secrets
providers:
- aescbc:
keys:
- name: key1
secret: <base64-encoded-32-byte-key>
- identity: {}
Add --encryption-provider-config=/etc/kubernetes/encryption-config.yaml to the API
server flags.
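Generating the key itself is a one-liner; a sketch (store the result somewhere safer than your shell history):

```shell
# Generate the base64-encoded 32-byte random key the aescbc provider expects.
key=$(head -c 32 /dev/urandom | base64 | tr -d '\n')
echo "$key"
# Sanity check: the key must decode back to exactly 32 bytes.
echo "$key" | base64 -d | wc -c
```

Paste the value into the `secret:` field of the EncryptionConfiguration. Note that existing Secrets stay unencrypted until rewritten; a bulk `kubectl get secrets -A -o json | kubectl replace -f -` re-persists them through the new provider.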
## Part 10: Day-2 Operations
### Adding nodes
# Generate a new join token (existing tokens expire after 24h)
kubeadm token create --print-join-command
# On the new node:
sudo kubeadm join k8s-api.internal.example.com:6443 \
--token <new-token> \
--discovery-token-ca-cert-hash sha256:<hash>
# Verify
kubectl get nodes
kubectl label node worker-13 topology.kubernetes.io/zone=rack-3
### Removing nodes
# Cordon and drain
kubectl cordon worker-5
kubectl drain worker-5 --ignore-daemonsets --delete-emptydir-data --timeout=300s
# Delete the node object
kubectl delete node worker-5
# On the node itself: reset kubeadm state
sudo kubeadm reset
sudo rm -rf /etc/kubernetes/ /var/lib/kubelet/ /var/lib/etcd/
### Capacity planning
# Current resource usage at a glance
kubectl top nodes
# Allocated vs allocatable
kubectl describe nodes | grep -A 5 "Allocated resources"
# Find overcommitted nodes
kubectl describe nodes | grep -E "Name:|cpu.*%|memory.*%"
# Pods per node (default limit: 110)
kubectl get pods -A -o wide --no-headers | awk '{print $8}' | sort | uniq -c | sort -rn
Mental Model: Capacity planning is about three thresholds: (1) where you are now, (2) where you'll be at peak, and (3) where things break. If peak is 80% of break, you don't have enough headroom. Target 60-70% utilization at peak to absorb surprises.
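Those three thresholds turn into a trivial check. A sketch with made-up numbers (all three values in the same unit, e.g. total cores in use):

```shell
# Headroom check: is projected peak too close to the breaking point?
current=60       # where you are now
peak=85          # projected peak load
break_point=100  # where things fall over

peak_pct=$(( peak * 100 / break_point ))
echo "peak utilization: ${peak_pct}% of capacity"
if [ "$peak_pct" -gt 70 ]; then
  echo "insufficient headroom: add capacity before peak season"
fi
```

With these numbers, peak sits at 85% of the breaking point, well past the 60-70% target, so the check flags it.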
## Part 11: Disaster Recovery
### The disaster recovery hierarchy
| Scenario | Recovery method | Time to recover |
|---|---|---|
| Single worker node failure | Pods reschedule automatically | 5-10 minutes |
| Single control plane node failure | Cluster operates on remaining nodes | Immediate (if HA) |
| etcd member failure | Replace member, cluster self-heals | 15-30 minutes |
| etcd quorum loss | Restore from snapshot | 30-60 minutes |
| Total cluster loss | Rebuild + restore etcd + redeploy | 2-8 hours |
| Certificate expiry | kubeadm certs renew all + restart | 15 minutes |
| CA certificate expiry | Full cluster rebuild | 4-12 hours |
### The minimum backup set
# 1. etcd snapshot (cluster state)
etcdctl snapshot save /backup/etcd-$(date +%Y%m%d-%H%M%S).db \
--endpoints=https://127.0.0.1:2379 $ETCD_CERTS
# 2. PKI certificates (identity)
sudo tar czf /backup/pki-$(date +%Y%m%d).tar.gz /etc/kubernetes/pki/
# 3. kubeadm config (cluster parameters)
kubectl -n kube-system get configmap kubeadm-config -o yaml > /backup/kubeadm-config.yaml
# Store all three off-cluster (S3, NFS, different datacenter)
Automated backup script¶
#!/usr/bin/env bash
# /usr/local/bin/k8s-backup.sh — run hourly via cron
set -euo pipefail
BACKUP_DIR="/backup/k8s/$(date +%Y%m%d)"
mkdir -p "$BACKUP_DIR"
ETCD_CERTS="--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key"
# etcd snapshot — capture the filename once so save and verify always
# agree (two separate $(date +%H%M) calls can straddle a minute boundary)
SNAP="$BACKUP_DIR/etcd-$(date +%H%M).db"
ETCDCTL_API=3 etcdctl snapshot save "$SNAP" \
  --endpoints=https://127.0.0.1:2379 $ETCD_CERTS
# Verify snapshot
ETCDCTL_API=3 etcdctl snapshot status "$SNAP" \
  --write-out=json > /dev/null 2>&1 || {
    echo "CRITICAL: etcd snapshot verification failed" >&2
    exit 1
}
# PKI backup (once daily)
if [[ "$(date +%H)" == "02" ]]; then
tar czf "$BACKUP_DIR/pki.tar.gz" /etc/kubernetes/pki/
fi
# Retention: keep 7 days
find /backup/k8s/ -maxdepth 1 -type d -mtime +7 -exec rm -rf {} +
echo "Backup completed: $BACKUP_DIR"
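To schedule it, a cron entry like this works (the script path matches the header comment above; the log path is an example):

```
# /etc/cron.d/k8s-backup — run the backup hourly as root
0 * * * * root /usr/local/bin/k8s-backup.sh >> /var/log/k8s-backup.log 2>&1
```

Pair this with an alert on non-zero exit status or a stale log file — cron silently discards failures, which is exactly how broken backups go unnoticed for months.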
War Story: An untested backup is not a backup. A team had etcd snapshots running hourly for a year. When they needed to restore after a failed upgrade, the snapshots were zero-byte files — the backup script was writing to a full disk and silently producing empty files. They had 365 days of "backups" and zero usable data. The lesson: the `snapshot status` verification step in the script above isn't optional paranoia. It's the difference between "we restored in 30 minutes" and "we rebuilt for 12 hours."
Flashcard Check #4¶
| Question | Answer (cover this column) |
|---|---|
| What three things should you back up for disaster recovery? | etcd snapshot, PKI certificates, kubeadm config |
| How long do kubeadm-generated certificates last by default? | 1 year (CA certs last 10 years) |
| What happens if etcd loses quorum? | Writes fail (cluster is effectively read-only); running workloads keep serving. Restore from snapshot or force-new-cluster |
| How do you verify an etcd backup is valid? | etcdctl snapshot status <file> — check it's not zero-byte |
| Can a cluster survive losing one control plane node in a 3-node HA setup? | Yes — etcd quorum needs 2 of 3, API server runs on remaining nodes |
Exercises¶
Exercise 1: Health Check (5 minutes)¶
Run these commands against any Kubernetes cluster (minikube, kind, or production) and interpret the output:
kubectl get nodes -o wide
kubectl get pods -n kube-system
kubectl get --raw '/readyz?verbose'  # componentstatuses is deprecated/removed; /readyz shows per-check health
What to look for

- All nodes should be `Ready` and on the same minor version
- Control plane pods (apiserver, controller-manager, scheduler, etcd) should be `Running`
- CoreDNS pods should be `Running` (DNS is the most common failure point)
- If any pods show `CrashLoopBackOff` or `Error`, dig deeper with `kubectl describe pod`

Exercise 2: Certificate Audit (10 minutes)¶
On a kubeadm cluster, run kubeadm certs check-expiration and answer:
- Which certificates expire soonest?
- How many days until they expire?
- What's the CA cert expiration date?
If you don't have a kubeadm cluster

Use kind to create one: `kind create cluster`. Then exec into the control plane container and run `kubeadm certs check-expiration`.

Exercise 3: etcd Backup and Restore (20 minutes)¶
- Create an etcd snapshot
- Create some test resources (`kubectl create namespace test-backup`)
- Delete them (`kubectl delete namespace test-backup`)
- Restore the snapshot
- Verify the namespace is back
Hint

The tricky part is stopping the control plane before restore. On a kubeadm cluster, move the static pod manifests out of `/etc/kubernetes/manifests/`, restore, move them back. On kind, you can stop the kubelet container.

Exercise 4: Upgrade Planning (15 minutes)¶
Look at your current cluster version. Plan an upgrade to the next minor version:
- What version are you currently running?
- What's the target version?
- Read the changelog for breaking changes
- What PDBs exist that might block drains?
- Write out the exact command sequence you'd use
Cheat Sheet¶
kubeadm lifecycle¶
| Task | Command |
|---|---|
| Initialize cluster | kubeadm init --control-plane-endpoint <lb> --pod-network-cidr <cidr> |
| Join worker | kubeadm join <api-endpoint> --token <token> --discovery-token-ca-cert-hash <hash> |
| Join control plane | Add --control-plane --certificate-key <key> to join |
| New join token | kubeadm token create --print-join-command |
| Check cert expiry | kubeadm certs check-expiration |
| Renew all certs | kubeadm certs renew all |
| Upgrade plan | kubeadm upgrade plan |
| Upgrade first CP | kubeadm upgrade apply v1.30.x |
| Upgrade other CP/workers | kubeadm upgrade node |
| Reset node | kubeadm reset |
Worker upgrade sequence¶
cordon → drain → apt install kubeadm → kubeadm upgrade node →
apt install kubelet kubectl → systemctl restart kubelet → uncordon
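Expanded into a function for Debian/Ubuntu workers — a sketch, assuming SSH access to the node and that the version string is an example (if the packages are apt-mark held, unhold them first):

```shell
#!/usr/bin/env bash
# Sketch of the worker upgrade sequence above. kubectl steps run from an
# admin machine; the apt/systemctl steps run on the worker over SSH.
set -euo pipefail

upgrade_worker() {
  local node="$1" ver="$2"   # e.g. upgrade_worker worker-3 1.30.2-1.1
  kubectl cordon "$node"
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=300s
  # (run `apt-mark unhold kubeadm kubelet kubectl` first if packages are held)
  ssh "$node" "sudo apt-get install -y kubeadm=${ver} && \
               sudo kubeadm upgrade node && \
               sudo apt-get install -y kubelet=${ver} kubectl=${ver} && \
               sudo systemctl restart kubelet"
  kubectl uncordon "$node"
  kubectl get node "$node"   # confirm Ready and the new version
}
```

Run it one node at a time and wait for `Ready` before moving to the next.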
etcd essentials¶
| Task | Command |
|---|---|
| Health check | etcdctl endpoint health --cluster |
| Status + leader | etcdctl endpoint status --write-out=table --cluster |
| Backup | etcdctl snapshot save /backup/snap.db |
| Verify backup | etcdctl snapshot status /backup/snap.db |
| Restore | etcdctl snapshot restore /backup/snap.db --data-dir=/var/lib/etcd-new |
| Member list | etcdctl member list --write-out=table |
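The "Restore" row compresses several steps. On a single kubeadm control plane the full sequence looks roughly like this (a sketch with default paths; the stopped-manifests directory name is arbitrary, and TLS flags are omitted):

```shell
#!/usr/bin/env bash
# Sketch: restoring an etcd snapshot on a kubeadm control plane.
set -euo pipefail

restore_etcd_snapshot() {
  local snap="$1"
  # 1. Stop the control plane: kubelet stops static pods whose manifests
  #    leave /etc/kubernetes/manifests/
  sudo mkdir -p /etc/kubernetes/manifests-stopped
  sudo mv /etc/kubernetes/manifests/*.yaml /etc/kubernetes/manifests-stopped/
  # 2. Restore into a fresh data dir, then swap it into place
  sudo ETCDCTL_API=3 etcdctl snapshot restore "$snap" --data-dir=/var/lib/etcd-new
  sudo mv /var/lib/etcd /var/lib/etcd-old
  sudo mv /var/lib/etcd-new /var/lib/etcd
  # 3. Restart the control plane by putting the manifests back
  sudo mv /etc/kubernetes/manifests-stopped/*.yaml /etc/kubernetes/manifests/
}

# Example: restore_etcd_snapshot /backup/snap.db
```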
EKS essentials¶
| Task | Command |
|---|---|
| Create cluster | eksctl create cluster --name <name> --version <ver> --managed |
| Upgrade control plane | eksctl upgrade cluster --name <name> --version <ver> --approve |
| Upgrade node group | eksctl upgrade nodegroup --name <ng> --cluster <name> |
| Update add-on | eksctl update addon --name <addon> --cluster <name> |
| List add-ons | eksctl get addons --cluster <name> |
Version skew rules¶
API server 1.30:
kubelet: 1.29 or 1.30 (official skew policy allows up to 3 minors behind since 1.28; stay within 1 for simple rolling upgrades)
kubectl: 1.29, 1.30, or 1.31 (±1 minor of the API server)
etcd: 3.5.x (check release notes)
Upgrade order: etcd → API server → controller-manager → scheduler → workers
Never skip minor versions.
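The skew rule can be checked mechanically. A sketch (function names are illustrative; it encodes this lesson's conservative at-most-1-minor-behind rule, noting upstream allows more):

```shell
#!/usr/bin/env bash
# skew-check.sh — sketch: does a kubelet version satisfy the skew rule
# against the API server? Conservative rule: equal or 1 minor behind.
set -euo pipefail

# Extract the minor number from "1.30", "v1.29.4", etc.
minor() { local v="${1#v}"; v="${v#*.}"; echo "${v%%.*}"; }

kubelet_skew_ok() {
  local api kubelet diff
  api=$(minor "$1"); kubelet=$(minor "$2")
  diff=$(( api - kubelet ))
  # kubelet must never be newer, and at most 1 minor older here
  (( diff >= 0 && diff <= 1 ))
}

# Feed it real versions, e.g.:
#   kubelet_skew_ok "$(kubectl version -o json | jq -r .serverVersion.gitVersion)" \
#     "$(kubectl get node worker-1 -o jsonpath='{.status.nodeInfo.kubeletVersion}')"
```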
Takeaways¶
- etcd backup before upgrades is non-negotiable. Everything else is recoverable. Lost etcd data is not. Make it the first step, not an afterthought.
- Certificates are a silent killer. kubeadm certs expire in 1 year. If you don't upgrade or renew, the cluster dies without warning. Monitor expiry dates like you monitor disk space.
- Control plane first, workers second, one at a time. Version skew rules exist for a reason. The kubelet can run at least one minor version behind the API server — use this to your advantage for rolling upgrades.
- EKS trades control for convenience. AWS manages the hard parts (etcd, certs, control plane upgrades) but you lose direct access. For bare metal, you ARE the managed service.
- CNI choice is permanent (practically). Switching CNIs on a running cluster is possible but painful. Choose Calico for BGP environments, Cilium for modern eBPF clusters, Flannel only for non-production.
- Default deny network policies + RBAC = production baseline. Without both, your cluster is one compromised pod away from a full breach. Add them before you add workloads, not after.
Related Lessons¶
- etcd — The Database That Runs Kubernetes — Deep dive on Raft consensus, etcd performance, and disaster recovery
- Kubernetes Debugging — When Pods Won't Behave — CrashLoopBackOff, OOMKilled, and probe failures
- Kubernetes Services — How Traffic Finds Your Pod — ClusterIP, NodePort, LoadBalancer, and DNS
- What Happens When You kubectl apply — End-to-end trace from YAML to running pod
- Linux Hardening — Closing the Doors — OS-level security that complements Kubernetes hardening