Kubernetes Pods & Scheduling - Primer¶
Why This Matters¶
The pod is the fundamental unit of execution in Kubernetes. Everything you deploy — web servers, batch jobs, databases — runs inside a pod. Understanding pod anatomy, lifecycle, and scheduling is the foundation for operating anything in Kubernetes. Get this wrong and you'll spend hours debugging Pending pods, OOMKilled containers, and mysterious scheduling failures.
Pod Anatomy¶
A pod is a group of one or more containers that share networking and storage. Every container in a pod gets the same IP address and can communicate over localhost.
Container Types¶
A pod can contain three types of containers:
Application containers — your main workload. Most pods have exactly one.
Init containers — run to completion before any app containers start. They run sequentially (init-1 must succeed before init-2 starts). Common uses: database migrations, config file generation, waiting for a dependency.
apiVersion: v1
kind: Pod
metadata:
name: myapp
spec:
initContainers:
- name: wait-for-db
image: busybox:1.36
command: ['sh', '-c', 'until nslookup postgres.default.svc.cluster.local; do sleep 2; done']
- name: run-migrations
image: myapp:v2.1.0
command: ['python', 'manage.py', 'migrate']
containers:
- name: app
image: myapp:v2.1.0
ports:
- containerPort: 8000
Sidecar containers — as of Kubernetes 1.28+, native sidecar containers are specified as init containers with restartPolicy: Always. They start before app containers, run alongside them, and are terminated after the main containers exit. Before 1.28, sidecars were just regular containers in the pod spec — they had no guaranteed startup ordering and the pod would exit when all containers stopped.
spec:
initContainers:
- name: log-shipper
image: fluent-bit:3.0
restartPolicy: Always # This makes it a native sidecar
volumeMounts:
- name: logs
mountPath: /var/log/app
containers:
- name: app
image: myapp:v2.1.0
volumeMounts:
- name: logs
mountPath: /var/log/app
volumes:
- name: logs
emptyDir: {}
Volumes¶
Pods can mount several volume types:
- emptyDir — ephemeral, created when pod is assigned to a node, deleted when pod is removed. Useful for scratch space or sharing files between containers in the same pod.
- hostPath — mounts a path from the host node's filesystem. Use with extreme caution — it breaks portability and is a security risk.
- configMap / secret — project ConfigMap or Secret data as files.
- persistentVolumeClaim — attach durable storage that survives pod restarts.
- projected — combine multiple sources (configMap, secret, serviceAccountToken, downwardAPI) into a single mount.
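A sketch combining two of these volume types in one pod — an emptyDir for scratch space plus a projected ConfigMap (the ConfigMap name app-config and mount paths are illustrative, not from a real cluster):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: volume-demo
spec:
  containers:
  - name: app
    image: myapp:v2.1.0
    volumeMounts:
    - name: scratch
      mountPath: /tmp/work       # ephemeral scratch space, deleted with the pod
    - name: config
      mountPath: /etc/app        # each ConfigMap key appears as a file here
      readOnly: true
  volumes:
  - name: scratch
    emptyDir: {}
  - name: config
    configMap:
      name: app-config           # hypothetical ConfigMap
```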
Service Account¶
Every pod runs with a service account. If you don't specify one, it uses the default service account in the namespace. The service account determines what Kubernetes API permissions the pod has (via RBAC).
spec:
serviceAccountName: my-app-sa
automountServiceAccountToken: false # Disable if you don't need API access
Setting automountServiceAccountToken: false is a security best practice for pods that don't need to talk to the Kubernetes API.
Pod Lifecycle¶
A pod moves through these phases:
| Phase | Meaning |
|---|---|
| Pending | Pod accepted by the cluster but not yet running. Waiting for scheduling, image pull, or init containers. |
| Running | At least one container is running, starting, or restarting. |
| Succeeded | All containers terminated successfully (exit code 0). Won't restart. |
| Failed | All containers terminated, at least one with a non-zero exit code. |
| Unknown | Pod status can't be obtained — usually a node communication issue. |
Container States¶
Each container within a pod has its own state:
- Waiting — not yet running. Reasons include ContainerCreating, ImagePullBackOff, CrashLoopBackOff, CreateContainerConfigError.
- Running — executing normally. The startedAt timestamp tells you when it started.
- Terminated — finished execution. Check exitCode, reason (Completed, Error, OOMKilled), and signal.
# See container states
kubectl get pod myapp -o jsonpath='{.status.containerStatuses[*].state}'
# Detailed view
kubectl describe pod myapp
Restart Policies¶
The restartPolicy field controls what happens when a container exits:
| Policy | Behavior | Use Case |
|---|---|---|
| Always (default) | Restart on any exit, with exponential backoff | Long-running services (Deployments) |
| OnFailure | Restart only on non-zero exit code | Batch jobs that should retry on failure |
| Never | Never restart | One-shot tasks, debugging |
The backoff sequence is: 10s, 20s, 40s, 80s, 160s, capped at 5 minutes. It resets after 10 minutes of successful running.
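For example, a one-shot task that should retry on failure and then stop once it succeeds might look like this (a sketch; the pod name and command are hypothetical, the image tag is reused from the examples above):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: db-backup
spec:
  restartPolicy: OnFailure       # retry with backoff on non-zero exit; no restart after exit 0
  containers:
  - name: backup
    image: myapp:v2.1.0
    command: ['python', 'backup.py']   # hypothetical one-shot command
```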
Resource Requests and Limits¶
Every production container should declare resource requests and limits.
containers:
- name: app
image: myapp:v2.1.0
resources:
requests:
cpu: 250m # 0.25 CPU cores — used for scheduling
memory: 256Mi # used for scheduling and OOM scoring
limits:
cpu: 1000m # 1 CPU core — container is throttled above this
memory: 512Mi # container is OOMKilled above this
Requests vs Limits¶
Requests determine scheduling. The scheduler finds a node with enough unrequested resources for the pod. The request is a guarantee — the container will always have at least this much available.
Limits determine enforcement. CPU limits cause throttling (the container is slowed down). Memory limits cause OOMKill (the container is killed).
Key differences:
| Resource | Over-request | Over-limit |
|---|---|---|
| CPU | Wastes capacity (node appears full but isn't) | Throttled — processes slow down but survive |
| Memory | Wastes capacity | OOMKilled — container is terminated immediately |
CPU is compressible (you can take it away and give it back). Memory is incompressible (once allocated, you can only reclaim it by killing the process).
Debug clue: If your application is slow but not crashing, check for CPU throttling:
Run cat /sys/fs/cgroup/cpu.stat inside the container and look at nr_throttled and throttled_usec. High throttle counts with low CPU limits mean the container is being starved, not that the application is slow.
CPU Units¶
- 1 = 1 vCPU/core
- 100m = 100 millicores = 0.1 CPU
- 250m = a quarter core
Memory Units¶
- 128Mi = 128 mebibytes (1 MiB = 1,048,576 bytes)
- 1Gi = 1 gibibyte
- 128M = 128 megabytes (1 MB = 1,000,000 bytes) — note: Mi vs M matters
QoS Classes¶
Kubernetes assigns a Quality of Service class to each pod based on its resource configuration. QoS determines eviction priority when a node is under memory pressure.
Remember: The QoS eviction order is BEG — BestEffort dies first, then Burstable (ordered by how far usage exceeds requests), then Guaranteed last. Mnemonic: "Best Effort Goes first."
Guaranteed¶
All containers in the pod have both requests and limits set for both CPU and memory, and requests == limits.
Guaranteed pods are the last to be evicted. Use for critical workloads (databases, payment processing).
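A minimal sketch of a Guaranteed pod — requests and limits are both set and equal for both CPU and memory (pod name and values are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-demo
spec:
  containers:
  - name: app
    image: myapp:v2.1.0
    resources:
      requests:
        cpu: 500m
        memory: 512Mi
      limits:
        cpu: 500m          # equal to the request
        memory: 512Mi      # equal to the request
```

You can confirm the assigned class with kubectl get pod guaranteed-demo -o jsonpath='{.status.qosClass}'.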
Burstable¶
At least one container has a resource request or limit set, but it doesn't meet the Guaranteed criteria (requests != limits, or not all resources specified).
Burstable pods are evicted before BestEffort but after Guaranteed. Most application workloads land here.
BestEffort¶
No container has any resource requests or limits set. These pods are evicted first. Never run BestEffort in production.
Pod Priority and Preemption¶
Pod priority lets you rank pods. When the cluster is full, higher-priority pods can preempt (evict) lower-priority ones.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: critical-apps
value: 1000000
globalDefault: false
preemptionPolicy: PreemptLowerPriority
description: "For critical production services"
---
apiVersion: v1
kind: Pod
metadata:
name: critical-api
spec:
priorityClassName: critical-apps
containers:
- name: api
image: myapi:v3.0.0
Built-in priority classes:
- system-cluster-critical (2000000000) — core cluster components (kube-apiserver, etcd)
- system-node-critical (2000001000) — node-essential components (kube-proxy, CNI)
Priority values range from -2,147,483,648 to 1,000,000,000 (values above 1B are reserved for system use).
Scheduling: Controlling Where Pods Run¶
Node Selectors¶
The simplest form of scheduling constraint. Pods are only scheduled on nodes matching all specified labels.
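A sketch using a hypothetical disktype label (any node label works; here the label would be applied with kubectl label nodes):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ssd-app
spec:
  nodeSelector:
    disktype: ssd        # hypothetical label, e.g. kubectl label nodes node-1 disktype=ssd
  containers:
  - name: app
    image: myapp:v2.1.0
```

If no node carries all the requested labels, the pod stays Pending.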
Node Affinity¶
More expressive than nodeSelector. Supports In, NotIn, Exists, DoesNotExist, Gt, Lt operators.
requiredDuringSchedulingIgnoredDuringExecution — hard requirement. Pod won't schedule if no matching node exists.
preferredDuringSchedulingIgnoredDuringExecution — soft preference. Scheduler tries to match but will schedule elsewhere if needed.
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: topology.kubernetes.io/zone
operator: In
values:
- us-east-1a
- us-east-1b
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 80
preference:
matchExpressions:
- key: node-type
operator: In
values:
- high-memory
Pod Affinity and Anti-Affinity¶
Schedule pods relative to other pods, not nodes.
Pod affinity — "schedule me near pods with label X." Useful for co-locating related services for low-latency communication.
Pod anti-affinity — "schedule me away from pods with label X." Essential for HA — spread replicas across nodes or zones.
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- api-server
topologyKey: kubernetes.io/hostname
podAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: app
operator: In
values:
- cache
topologyKey: topology.kubernetes.io/zone
The topologyKey defines what "near" or "away" means:
- kubernetes.io/hostname — same/different node
- topology.kubernetes.io/zone — same/different availability zone
- topology.kubernetes.io/region — same/different region
Taints and Tolerations¶
Taints are applied to nodes. They repel pods unless the pod has a matching toleration.
# Taint a node
kubectl taint nodes gpu-node-1 gpu=true:NoSchedule
# Remove a taint
kubectl taint nodes gpu-node-1 gpu=true:NoSchedule-
Taint effects:
- NoSchedule — new pods without a matching toleration won't be scheduled here
- PreferNoSchedule — soft version, scheduler tries to avoid but doesn't guarantee
- NoExecute — existing pods without tolerations are evicted, new ones aren't scheduled
spec:
tolerations:
- key: "gpu"
operator: "Equal"
value: "true"
effect: "NoSchedule"
- key: "node.kubernetes.io/not-ready"
operator: "Exists"
effect: "NoExecute"
tolerationSeconds: 300 # Tolerate for 5 minutes, then evict
The operator: Exists matches any value for the key (useful for built-in taints).
Topology Spread Constraints¶
Distribute pods evenly across failure domains. More fine-grained than pod anti-affinity.
spec:
topologySpreadConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
app: api-server
- maxSkew: 2
topologyKey: kubernetes.io/hostname
whenUnsatisfiable: ScheduleAnyway
labelSelector:
matchLabels:
app: api-server
- maxSkew — maximum difference in pod count between any two topology domains
- whenUnsatisfiable: DoNotSchedule — hard constraint (pod stays Pending)
- whenUnsatisfiable: ScheduleAnyway — soft constraint (best-effort spread)
PodDisruptionBudget¶
A PDB declares how many pods of a workload must remain available during voluntary disruptions (node drains, cluster upgrades, rolling updates).
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: api-pdb
namespace: production
spec:
minAvailable: 2 # At least 2 pods must be available
# OR
# maxUnavailable: 1 # At most 1 pod can be down
selector:
matchLabels:
app: api-server
PDBs only protect against voluntary disruptions. They cannot prevent involuntary disruptions like node crashes or OOMKills.
Pod Overhead and Runtime Class¶
Pod overhead accounts for resources consumed by the pod sandbox itself (not your containers). This matters for VM-based runtimes like Kata Containers or gVisor where the sandbox has non-trivial resource usage.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
name: kata
handler: kata-qemu
overhead:
podFixed:
cpu: 250m
memory: 160Mi
---
apiVersion: v1
kind: Pod
metadata:
name: secure-workload
spec:
runtimeClassName: kata
containers:
- name: app
image: myapp:v2.1.0
resources:
requests:
cpu: 500m
memory: 256Mi
The scheduler adds the overhead (250m CPU, 160Mi memory) to the container requests when finding a suitable node.
Pod Security Context¶
Security context defines privilege and access control for a pod or individual container.
Pod-Level Security Context¶
spec:
securityContext:
runAsNonRoot: true
runAsUser: 1000
runAsGroup: 3000
fsGroup: 2000
seccompProfile:
type: RuntimeDefault
Container-Level Security Context¶
containers:
- name: app
image: myapp:v2.1.0
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities:
drop:
- ALL
add:
- NET_BIND_SERVICE
Pod Security Standards¶
Kubernetes defines three security profiles enforced at the namespace level via Pod Security Admission:
| Standard | What It Allows |
|---|---|
| Privileged | Unrestricted. No restrictions applied. |
| Baseline | Prevents known privilege escalations. Allows most workloads without modification. |
| Restricted | Heavily restricted. Requires running as non-root, dropping all capabilities, read-only root filesystem. |
# Enforce restricted standard on a namespace
kubectl label namespace production \
pod-security.kubernetes.io/enforce=restricted \
pod-security.kubernetes.io/warn=restricted \
pod-security.kubernetes.io/audit=restricted
The three enforcement modes:
- enforce — reject pods that violate the standard
- warn — allow but display a warning to the user
- audit — allow but log the violation in the audit log
Quick Reference: Pod Spec Fields That Matter¶
| Field | What It Controls |
|---|---|
| restartPolicy | What happens when a container exits |
| terminationGracePeriodSeconds | How long to wait between SIGTERM and SIGKILL (default: 30) |
| serviceAccountName | Kubernetes API identity for the pod |
| automountServiceAccountToken | Whether to mount the SA token (disable if unused) |
| nodeSelector | Simple label-based node selection |
| affinity | Complex scheduling rules (node, pod affinity/anti-affinity) |
| tolerations | Allow scheduling on tainted nodes |
| topologySpreadConstraints | Even distribution across failure domains |
| priorityClassName | Pod priority for preemption decisions |
| runtimeClassName | Container runtime selection (runc, kata, gvisor) |
| securityContext | Privilege and access control |
| dnsPolicy | DNS resolution behavior (ClusterFirst, Default, None) |
| hostNetwork | Share the host's network namespace |
| shareProcessNamespace | Containers can see each other's processes |
Wiki Navigation¶
Prerequisites¶
- Kubernetes Ops (Production) (Topic Pack, L2)
Related Content¶
- Adversarial Interview Gauntlet (30 sequences) (Scenario, L2) — Kubernetes Core
- Case Study: Alert Storm — Flapping Health Checks (Case Study, L2) — Kubernetes Core
- Case Study: Canary Deploy Routing to Wrong Backend — Ingress Misconfigured (Case Study, L2) — Kubernetes Core
- Case Study: CrashLoopBackOff No Logs (Case Study, L1) — Kubernetes Core
- Case Study: DNS Looks Broken — TLS Expired, Fix Is Cert-Manager (Case Study, L2) — Kubernetes Core
- Case Study: DaemonSet Blocks Eviction (Case Study, L2) — Kubernetes Core
- Case Study: Deployment Stuck — ImagePull Auth Failure, Vault Secret Rotation (Case Study, L2) — Kubernetes Core
- Case Study: Drain Blocked by PDB (Case Study, L2) — Kubernetes Core
- Case Study: HPA Flapping — Metrics Server Clock Skew, Fix Is NTP (Case Study, L2) — Kubernetes Core
- Case Study: ImagePullBackOff Registry Auth (Case Study, L1) — Kubernetes Core
Pages that link here¶
- Anti-Primer: Kubernetes Pods And Scheduling
- Certification Prep: CKA — Certified Kubernetes Administrator
- Certification Prep: CKAD — Certified Kubernetes Application Developer
- Chaos Engineering & Fault Injection
- Incident Replay: DaemonSet Blocks Node Eviction
- Incident Replay: Resource Quota Blocking Deployment
- Kubernetes Pods & Scheduling
- Production Readiness Review: Answer Key
- Production Readiness Review: Study Plans
- Symptoms