Portal | Level: L2: Operations | Topics: Kubernetes Core, Kubernetes Networking, Node Lifecycle & Maintenance | Domain: Kubernetes
Kubernetes Under the Covers¶
A practical, internals-first learning guide (with diagrams, “what actually happens”, and drills).
Scope / versions: This describes how Kubernetes works in mainstream, production clusters today. Exact behavior varies by distribution, CNI, CSI, and Kubernetes version. Where it varies, this guide says so explicitly (no fairy tales).
How to use this guide (fast)¶
- Skim Part A once to get the “mental model”.
- Then pick one operation from Part C per session and do: 1) Read “Under the covers” 2) Run the “Watch it live” commands 3) Do the “Drill” 4) Run the “Cleanup”
Table of contents¶
- Part A - The mental model
- Part B - Core components and data flows
- Part C - What happens when you do common operations
- Part D - Troubleshooting map (symptom → likely layer)
- Part E - Unknown unknowns (stuff that surprises smart sysadmins)
- Appendix - Minimal lab manifests
Part A - The mental model¶
A1) Kubernetes is a distributed control system¶
Kubernetes is not “a container launcher”. It’s a declarative control plane that: 1. Accepts desired state via the API. 2. Stores it durably (etcd). 3. Runs control loops that continuously push reality toward that desired state. 4. Delegates the actual “Linux work” to agents on each node (kubelet + container runtime + CNI/CSI).
The core loop (visual)¶
You (kubectl / API client)
|
v
[ kube-apiserver ]
authn/authz/admission
|
v
[ etcd ] <-- the committed desired state
|
v
watchers / informers (caches)
|
v
controllers + scheduler decide actions
|
v
[ kubelet on node ]
runtime (CRI) + CNI + CSI
|
v
Linux primitives (namespaces/cgroups, routes, mounts, processes)
A2) Desired state vs observed state¶
- Spec: what you want (desired state).
- Status: what Kubernetes observes (current state).
Most “why is it stuck?” problems are spec says X but status can’t reach X.
A3) One sentence summary you should memorize¶
“Kubernetes is a set of controllers watching etcd and reconciling the world.”
Part B - Core components and data flows¶
B1) API server pipeline (what happens to every create/update/delete)¶
When you kubectl apply/create/delete, the request typically goes:
- Authentication: who are you?
- Authorization: are you allowed? (RBAC, etc.)
- Admission (before persistence):
- Mutating admission runs first (may change the object).
- Validating admission runs after (may reject).
- Persistence: the object is stored in etcd with metadata like `uid` and `resourceVersion`.
- Watch events: clients (controllers, scheduler, kubectl -w) get notified.
Key takeaway: Kubernetes “does stuff” after the API write lands in etcd.
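You can see the persisted metadata and the resulting watch traffic directly; a minimal sketch (the Pod name and `demo` namespace are assumptions, matching the Appendix manifests):

```shell
# Every persisted object carries a unique uid and a resourceVersion,
# both assigned by the API machinery at write time.
kubectl get pod hello-pod -n demo \
  -o jsonpath='{.metadata.uid}{"\n"}{.metadata.resourceVersion}{"\n"}'

# Watch events flow as controllers react to writes:
kubectl get events -n demo --watch
```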
Admission phases (visual)¶
request -> authn -> authz -> mutating admission -> validating admission -> persist to etcd
B2) etcd: the source of truth¶
- etcd is a consistent KV store used by the control plane to store cluster state.
- Controllers and scheduler watch the API server (which reads/writes etcd).
Practical implication: if etcd/API is unhealthy, nothing converges.
B3) Controllers: the engine of “make it so”¶
Controllers are loops that:
- watch certain object types
- compute what should exist
- create/update/delete other objects accordingly
Examples:
- Deployment controller creates ReplicaSets.
- ReplicaSet controller creates Pods.
- EndpointSlice controller creates/updates EndpointSlices for Services.
- Node controller reacts to node health.
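The Deployment → ReplicaSet → Pod chain is visible on any object via `ownerReferences`; a sketch (object names are assumptions from the Appendix manifests):

```shell
# Pod -> ReplicaSet: each Pod records who owns it
kubectl get pods -n demo -o jsonpath='{range .items[*]}{.metadata.name}{" owned by "}{.metadata.ownerReferences[0].kind}{"/"}{.metadata.ownerReferences[0].name}{"\n"}{end}'

# ReplicaSet -> Deployment: same pattern one level up
kubectl get rs -n demo -o jsonpath='{range .items[*]}{.metadata.name}{" owned by "}{.metadata.ownerReferences[0].kind}{"/"}{.metadata.ownerReferences[0].name}{"\n"}{end}'
```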
B4) Scheduler: “where should this Pod run?”¶
The scheduler:
- watches for Pods without a node assignment
- picks a node using plugins (filtering + scoring + additional phases)
- writes the binding (spec.nodeName) back through the API
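You can watch the binding the scheduler writes back (the Pod name is an assumption):

```shell
# Empty while the Pod is unscheduled; the chosen node name after binding.
kubectl get pod hello-pod -n demo -o jsonpath='{.spec.nodeName}{"\n"}'

# The scheduler also records its decision as an event:
kubectl get events -n demo --field-selector reason=Scheduled
```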
B5) kubelet: “make Pods on this node real”¶
kubelet:
- watches Pods bound to its node
- calls the container runtime via CRI
- coordinates volume setup (with CSI / host volume types)
- triggers networking setup through CNI (via the runtime)
- updates Pod status and posts events
B6) CRI / container runtime (containerd, CRI-O, etc.)¶
- Kubernetes talks to runtimes through the Container Runtime Interface (CRI).
- kubelet typically asks the runtime to:
- create a Pod sandbox (network namespace and related setup)
- then create/start containers within that sandbox
Many setups use a “pause” (sandbox) image, but the exact mechanics can vary by runtime/config.
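On a containerd node you can see the sandbox/container split directly with crictl; a sketch (requires node access, and typically root/sudo):

```shell
# One sandbox per Pod; the sandbox holds the network namespace
crictl pods

# The app containers created inside those sandboxes
crictl ps

# Inspect a sandbox (ID from `crictl pods`) to see its IP and namespaces
crictl inspectp <sandbox-id>
```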
B7) CNI: Pod networking¶
CNI plugins commonly:
- create interfaces (veth), assign Pod IPs
- program routes
- implement NetworkPolicy (depends on plugin)
- optionally handle Service routing via eBPF (plugin-dependent)
B8) kube-proxy / eBPF dataplanes: Service traffic¶
- Traditional Kubernetes uses kube-proxy on each node to program rules to route Service traffic to backends.
- Some CNIs replace kube-proxy behavior with eBPF.
EndpointSlices are the common “source of truth” for backend sets.
B9) CSI: storage¶
CSI involves two broad sides:
- Controller side: provisioning, attach/detach (often control-plane pods)
- Node side: mount/unmount on the node
Part C - What happens when you do common operations¶
C0) Baseline: What to watch live (use this constantly)¶
Open 2-4 terminals:
T1: watch the object
T2: watch events (the story)
T3: describe the object when stuck
T4 (node-level): kubelet logs on the node (or via SSH); if you use containerd, crictl helps too
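A minimal set of commands for those terminals (object names such as `hello-deploy`/`demo` are assumptions from the Appendix; adjust to your workload):

```shell
# T1: watch the object
kubectl get deploy hello-deploy -n demo -o wide --watch

# T2: watch events, the "story" of what each component did
kubectl get events -n demo --watch

# T3: describe when stuck (the Events section at the bottom is gold)
kubectl describe pod <pod-name> -n demo

# T4: kubelet logs on the node (systemd-based nodes)
journalctl -u kubelet -f

# If you use containerd:
crictl ps
```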
If you only remember one debugging rule: Events + kubelet logs.
C1) Operation: Create a Pod (kubectl apply -f pod.yaml)¶
Under the covers (sequence)¶
kubectl
|
| 1) POST/PUT Pod to API server
v
kube-apiserver
| 2) authn/authz
| 3) admission (mutating -> validating)
| 4) persist to etcd
v
etcd (Pod now exists as desired state)
|
| 5) scheduler sees unscheduled Pod
v
kube-scheduler
| 6) filter/score nodes
| 7) bind Pod to node (spec.nodeName)
v
kube-apiserver/etcd
|
| 8) kubelet on chosen node sees Pod
v
kubelet
| 9) prepare volumes
| 10) CRI RunPodSandbox (network namespace)
| 11) CNI ADD (Pod IP + routes)
| 12) pull images
| 13) run init containers (if any)
| 14) start app containers
| 15) update PodStatus + events
v
Pod becomes Running; readiness gates decide Ready/NotReady
The “why it gets stuck” hotspots¶
- Pending: scheduler can’t find a node (resources, taints, affinities, volumes).
- ContainerCreating: volume mount or CNI problem.
- ImagePullBackOff: registry/auth/DNS/network.
- CrashLoopBackOff: app exits, liveness fails, bad command/args/env.
Drill (do this with a simple Pod)¶
- Apply the Pod.
- Watch events and identify which component emitted each event:
- Scheduled → scheduler
- Pulling/Started → kubelet
- Explain the difference between `spec.nodeName` and the Pod IP.
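The drill above as commands (Pod name and manifest filename are assumptions):

```shell
kubectl apply -f pod.yaml

# Watch the story unfold; note which component emits each event:
# Scheduled -> scheduler; Pulling/Pulled/Created/Started -> kubelet
kubectl get events -n demo --watch

# Both spec.nodeName and the Pod IP in one view
kubectl get pod hello-pod -n demo -o wide
```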
Cleanup¶
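A one-line cleanup (name from the Appendix Pod):

```shell
kubectl delete pod hello-pod -n demo
```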
C2) Operation: Create a Deployment (Pods appear “by magic”)¶
Under the covers¶
When you apply a Deployment, you are creating a controller input, not Pods directly.
Deployment created
|
v
Deployment controller -> creates ReplicaSet
|
v
ReplicaSet controller -> creates Pods
|
v
Scheduler -> binds each Pod
|
v
Kubelet -> runs each Pod
Why this matters¶
- If you delete a Pod under a Deployment, it comes back (ReplicaSet notices desired replicas).
- Scaling is just changing the desired replica count; controller does the rest.
Drill¶
- Create a Deployment with 2 replicas.
- Delete one Pod.
- Observe that the ReplicaSet creates a replacement.
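The steps above as commands (filename and names are assumptions from the Appendix):

```shell
kubectl apply -f deployment.yaml

# Terminal 1: watch Pods come and go
kubectl get pods -n demo --watch

# Terminal 2: delete one Pod; the ReplicaSet creates a replacement
kubectl delete pod <one-pod-name> -n demo
```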
Cleanup¶
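A one-line cleanup (name from the Appendix Deployment):

```shell
kubectl delete deploy hello-deploy -n demo
```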
C3) Operation: Scale a Deployment (kubectl scale)¶
Under the covers¶
- You update `spec.replicas` on the Deployment (API write).
- Deployment controller updates ReplicaSet desired replicas.
- ReplicaSet controller creates/deletes Pods to match.
- Scheduler/kubelet do their normal job for new Pods.
Drill¶
- Scale from 1 → 5.
- In events, count how many Pods were created and scheduled.
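The drill as commands (Deployment name is an assumption):

```shell
kubectl scale deploy hello-deploy -n demo --replicas=5

# One Scheduled event per new Pod
kubectl get events -n demo --field-selector reason=Scheduled
```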
Cleanup¶
Scale back down or delete the Deployment.
C4) Operation: Rolling update (kubectl set image / apply new template)¶
Under the covers¶
A rolling update is a controlled replacement of Pods:
- Deployment template changes (new image tag, env, etc.)
- Deployment controller creates a new ReplicaSet
- It increases new RS replicas and decreases old RS replicas according to:
- maxUnavailable
- maxSurge
- Readiness gates determine when a new Pod counts as “available”
Drill¶
- Change the image to a new version.
- Watch ReplicaSets:
- Explain why you temporarily have extra Pods (surge) or fewer (unavailable).
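The drill as commands (image tag and names are assumptions):

```shell
# Trigger a rollout by changing the Pod template
kubectl set image deploy/hello-deploy web=nginx:1.27 -n demo

# Watch old and new ReplicaSets trade replicas (surge/unavailable in action)
kubectl get rs -n demo --watch

kubectl rollout status deploy/hello-deploy -n demo
```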
Cleanup¶
Rollback or delete Deployment:
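Either of these works (names are assumptions):

```shell
# Roll back to the previous ReplicaSet's template
kubectl rollout undo deploy/hello-deploy -n demo

# ...or remove the whole thing
kubectl delete deploy hello-deploy -n demo
```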
C5) Operation: Delete a Pod / Deployment (kubectl delete)¶
Under the covers¶
Delete is not always “kill immediately”.
- API server marks the object with a `deletionTimestamp` (and increments `generation`/`resourceVersion`).
- Finalizers (if any) must clear before actual removal.
- For Pods:
- kubelet receives a “stop” directive
- sends SIGTERM to containers, waits `terminationGracePeriodSeconds`
- then SIGKILL if needed
- tears down sandbox + CNI DEL for networking
- unmounts volumes as appropriate
Drill¶
- Delete a Pod with a sleep loop.
- Observe termination grace: does it stop instantly or wait?
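The drill as commands (Pod name and image are assumptions):

```shell
# Run a Pod whose main process does not exit promptly on SIGTERM
kubectl run sleeper -n demo --image=busybox:stable --restart=Never -- sleep 3600

# Time the delete: sleep runs as PID 1 with no SIGTERM handler, so this
# typically waits out terminationGracePeriodSeconds (default 30s) before SIGKILL.
time kubectl delete pod sleeper -n demo
```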
Cleanup¶
Already deleted.
C6) Operation: Exec into a container (kubectl exec)¶
Under the covers (high level)¶
- kubectl asks the API server for an exec session.
- API server upgrades to a streaming connection (SPDY/WebSocket depending on setup).
- API server proxies the stream to the kubelet on the node.
- kubelet asks the runtime to create an exec process in the container’s namespaces/cgroups.
Key implication: exec is a control plane → kubelet → runtime path, not “SSH”.
Drill¶
- Exec into a Pod and run `ps`.
- Explain why you see only processes in that container/Pod's namespaces.
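The drill as commands; note `ps` may be missing in minimal images, in which case `/proc` shows the visible PIDs (Pod name is an assumption):

```shell
# If ps exists in the image:
kubectl exec -it hello-pod -n demo -- ps aux

# Fallback that works in any Linux image: only this PID namespace's
# processes appear, not the node's full process table.
kubectl exec hello-pod -n demo -- ls /proc
```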
Cleanup¶
Exit the shell.
C7) Operation: Stream logs (kubectl logs -f)¶
Under the covers¶
- kubelet exposes container logs (runtime-dependent storage path).
- API server proxies the `kubectl logs` request to kubelet.
- kubelet streams logs back.
Important: log retention depends on node disk + runtime log rotation settings.
Drill¶
- Run a Pod that prints a counter.
- Stream logs and kill/restart the container.
- Observe how “previous” logs work:
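The drill as commands (Pod name and image are assumptions; the script exits after 60 iterations so the container restarts in place, since `kubectl run`'s default restartPolicy is Always):

```shell
kubectl run counter -n demo --image=busybox:stable -- \
  sh -c 'i=0; while [ $i -lt 60 ]; do echo "count=$i"; i=$((i+1)); sleep 1; done'

# Stream the current instance
kubectl logs -f counter -n demo

# After it exits and restarts, the prior instance's logs:
kubectl logs counter -n demo --previous
```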
Cleanup¶
Delete Pod.
C8) Operation: Create a Service (ClusterIP) and route traffic¶
Under the covers¶
When you create a Service:
- API stores the Service object.
- EndpointSlice controller creates/updates EndpointSlices matching the Service selector.
- kube-proxy (or eBPF dataplane) uses EndpointSlices to program routing rules.
- CoreDNS provides name resolution for the Service.
Visual: Service traffic path (classic kube-proxy model)¶
Client Pod -> Service ClusterIP:port
|
v
node routing (kube-proxy rules)
|
v
one backend PodIP:targetPort (selected from EndpointSlices)
Drill¶
- Create Deployment + Service.
- Verify EndpointSlices:
- Curl the Service and then curl a specific Pod IP. Explain the difference.
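The drill as commands (names come from the Appendix manifests; the throwaway curl Pod image is an assumption):

```shell
# Backends the dataplane will route to; EndpointSlices carry the standard
# kubernetes.io/service-name label linking them to their Service
kubectl get endpointslices -n demo -l kubernetes.io/service-name=hello-svc

# Curl the Service VIP (load-balanced), then a specific Pod IP (direct)
kubectl run curl -n demo --image=curlimages/curl --restart=Never --rm -it -- \
  curl -s http://hello-svc.demo.svc.cluster.local
```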
Cleanup¶
C9) Operation: Update a ConfigMap used by a Pod¶
Under the covers (gotcha-heavy)¶
Two common consumption patterns:
1) Env var from ConfigMap
- ConfigMap change does not automatically update running container env.
- You need a restart/rollout.
2) Volume mount from ConfigMap
- kubelet updates projected volume content (eventually) on the node.
- The app must re-read files to see changes.
Drill¶
- Mount a ConfigMap as a file and watch it change.
- Then use env-from and notice it doesn’t update without restart.
Cleanup¶
Delete the ConfigMap and workload.
C10) Operation: Create a PVC (storage)¶
Under the covers (typical dynamic provisioning)¶
- You create a PVC.
- A controller provisions a PV (via CSI provisioner) if StorageClass supports dynamic provisioning.
- PV binds to PVC.
- When a Pod uses the PVC:
- controller side may attach a volume to the node (cloud/provider dependent)
- node plugin mounts it into the Pod
Drill¶
- Create PVC, then a Pod that mounts it.
- Observe events for attach/mount.
Cleanup¶
Delete Pod, then PVC (and possibly PV depending on reclaim policy):
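A sketch (object names are placeholders):

```shell
kubectl delete pod <pod-name> -n demo
kubectl delete pvc <pvc-name> -n demo

# Whether the PV (and backing volume) survives depends on the reclaim policy:
kubectl get storageclass -o custom-columns=NAME:.metadata.name,RECLAIM:.reclaimPolicy
```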
C11) Operation: Cordon/Drain a Node (eviction + reschedule)¶
Under the covers¶
- cordon marks node unschedulable.
- drain:
- evicts Pods (through API eviction subresource)
- respects PodDisruptionBudgets
- deletes Pods (controllers recreate elsewhere)
Drill¶
- Cordon a node.
- Drain it.
- Watch Pods reschedule.
Cleanup¶
Uncordon:
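```shell
kubectl uncordon <node-name>

# Confirm it can take Pods again (empty output means schedulable)
kubectl get node <node-name> -o jsonpath='{.spec.unschedulable}{"\n"}'
```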
C12) Operation: Apply vs Replace vs Patch (why “apply” is special)¶
Under the covers¶
- `kubectl apply` uses a three-way merge concept (client-side and/or server-side apply, depending on usage).
- It tracks field ownership and tries to change only what you “own”.
- This is why apply plays better with other controllers editing objects.
Drill¶
- Apply a manifest.
- Edit the live object with `kubectl edit`.
- Apply again and see what changes, and what gets preserved.
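The drill as commands (filename and names are assumptions):

```shell
kubectl apply -f deployment.yaml
kubectl edit deploy hello-deploy -n demo   # hand-edit a field

# Preview what a re-apply would change before doing it
kubectl diff -f deployment.yaml
kubectl apply -f deployment.yaml

# With server-side apply, field ownership is visible in managedFields:
kubectl get deploy hello-deploy -n demo --show-managed-fields -o yaml
```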
Cleanup¶
Delete the objects you created.
Part D - Troubleshooting map (symptom → likely layer)¶
D1) Pod stuck in Pending¶
Most common layers:
- Scheduler constraints (resources/taints/affinity)
- Volume constraints (PV node affinity/zone)
Check:
- kubectl describe pod ... → Events
- kubectl get nodes -o wide
D2) Pod stuck in ContainerCreating¶
Most common layers:
- CNI networking failing
- CSI mount failing
Check:
- Pod events for FailedMount / CNI errors
- kubelet logs on the node
D3) ImagePullBackOff¶
Layers:
- Registry auth
- DNS/network path to registry
- Wrong image name/tag
Check:
- Events show the exact pull error
- node network/DNS
D4) CrashLoopBackOff¶
Layers:
- Your app exits
- probe failure
- wrong command/args/env
Check:
- kubectl logs
- kubectl describe pod (probe failures)
D5) Service exists but no traffic¶
Layers:
- selector mismatch → no endpoints
- readiness not satisfied → endpoints not “ready”
- NetworkPolicy blocking
- kube-proxy / dataplane issue
Check:
- EndpointSlices
- readiness conditions
- policy rules
Part E - Unknown unknowns (the “why is this weird?” list)¶
- A Pod is not a process. It’s a bundle of namespaces + cgroups + containers.
- Most things are eventually consistent (controllers catch up via watch).
- Status is not spec. You can “ask” for 3 replicas and only have 1 running.
- ConfigMap env vars don’t live-update (needs restart).
- Delete is often two-phase (deletionTimestamp + finalizers).
- Services don’t “own” traffic routing; kube-proxy/eBPF does.
- “Ready” is an application contract (readiness probes gate traffic).
- A Deployment is a policy object controlling ReplicaSets, not Pods.
- NetworkPolicy behavior depends on the CNI (some enforce, some don’t).
- kubectl is a client, not a “cluster command runner”.
Appendix - Minimal lab manifests¶
Put these in a directory and apply them. They’re intentionally tiny.
A) namespace¶
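The namespace used by every other lab object; a minimal manifest (the name `demo` matches the rest of the Appendix):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: demo
```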
B) pod (simple)¶
apiVersion: v1
kind: Pod
metadata:
  name: hello-pod
  namespace: demo
spec:
  containers:
    - name: web
      image: nginx:stable
      ports:
        - containerPort: 80
C) deployment + service¶
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-deploy
  namespace: demo
spec:
  replicas: 2
  selector:
    matchLabels:
      app: hello
  template:
    metadata:
      labels:
        app: hello
    spec:
      containers:
        - name: web
          image: nginx:stable
          ports:
            - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: hello-svc
  namespace: demo
spec:
  selector:
    app: hello
  ports:
    - port: 80
      targetPort: 80
D) cleanup-all for the demo namespace¶
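Deleting the namespace removes everything created in it (Pods, Deployments, Services, PVCs):

```shell
kubectl delete namespace demo
```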
Sources (primary, official)¶
(These are here so you can verify details and avoid cargo-culting.)
Kubernetes Admission Control:
- https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/
- https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/
EndpointSlices:
- https://kubernetes.io/docs/concepts/services-networking/endpoint-slices/
Scheduling framework & scheduler phases:
- https://kubernetes.io/docs/concepts/scheduling-eviction/scheduling-framework/
- https://kubernetes.io/docs/reference/scheduling/config/
- https://kubernetes.io/docs/reference/config-api/kube-scheduler-config.v1/
CRI and Pod sandbox concept:
- https://kubernetes.io/blog/2016/12/container-runtime-interface-cri-in-kubernetes/