Portal | Level: L1: Foundations | Topics: Kubernetes Pods & Scheduling, Kubernetes Core | Domain: Kubernetes

Kubernetes Pods & Scheduling - Primer

Why This Matters

The pod is the fundamental unit of execution in Kubernetes. Everything you deploy — web servers, batch jobs, databases — runs inside a pod. Understanding pod anatomy, lifecycle, and scheduling is the foundation for operating anything in Kubernetes. Get this wrong and you'll spend hours debugging Pending pods, OOMKilled containers, and mysterious scheduling failures.


Pod Anatomy

A pod is a group of one or more containers that share networking and storage. Every container in a pod gets the same IP address and can communicate over localhost.

Container Types

A pod can contain three types of containers:

Application containers — your main workload. Most pods have exactly one.

Init containers — run to completion before any app containers start. They run sequentially (init-1 must succeed before init-2 starts). Common uses: database migrations, config file generation, waiting for a dependency.

apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  initContainers:
    - name: wait-for-db
      image: busybox:1.36
      command: ['sh', '-c', 'until nslookup postgres.default.svc.cluster.local; do sleep 2; done']
    - name: run-migrations
      image: myapp:v2.1.0
      command: ['python', 'manage.py', 'migrate']
  containers:
    - name: app
      image: myapp:v2.1.0
      ports:
        - containerPort: 8000

Sidecar containers — native sidecar support was introduced in Kubernetes 1.28 (enabled by default since 1.29). A sidecar is declared as an init container with restartPolicy: Always. It starts before the app containers, runs alongside them, and is terminated only after the main containers exit. Before native sidecars, a sidecar was just another regular container in the pod spec: there was no guaranteed startup ordering, and the pod could not complete until every container, including the sidecar, stopped.

spec:
  initContainers:
    - name: log-shipper
      image: fluent-bit:3.0
      restartPolicy: Always    # This makes it a native sidecar
      volumeMounts:
        - name: logs
          mountPath: /var/log/app
  containers:
    - name: app
      image: myapp:v2.1.0
      volumeMounts:
        - name: logs
          mountPath: /var/log/app
  volumes:
    - name: logs
      emptyDir: {}

Volumes

Pods can mount several volume types:

  • emptyDir — ephemeral, created when pod is assigned to a node, deleted when pod is removed. Useful for scratch space or sharing files between containers in the same pod.
  • hostPath — mounts a path from the host node's filesystem. Use with extreme caution — it breaks portability and is a security risk.
  • configMap / secret — project ConfigMap or Secret data as files.
  • persistentVolumeClaim — attach durable storage that survives pod restarts.
  • projected — combine multiple sources (configMap, secret, serviceAccountToken, downwardAPI) into a single mount.
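As a sketch, a pod can combine several of these types in one spec. Here an emptyDir provides scratch space and a ConfigMap is projected as read-only files (the ConfigMap name app-config is illustrative, not from the examples above):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  containers:
    - name: app
      image: myapp:v2.1.0
      volumeMounts:
        - name: scratch
          mountPath: /tmp/work        # ephemeral scratch space, gone when the pod is removed
        - name: config
          mountPath: /etc/myapp       # each ConfigMap key appears as a file here
          readOnly: true
  volumes:
    - name: scratch
      emptyDir: {}
    - name: config
      configMap:
        name: app-config              # hypothetical ConfigMap in the same namespace
```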

Service Account

Every pod runs with a service account. If you don't specify one, it uses the default service account in the namespace. The service account determines what Kubernetes API permissions the pod has (via RBAC).

spec:
  serviceAccountName: my-app-sa
  automountServiceAccountToken: false  # Disable if you don't need API access

Setting automountServiceAccountToken: false is a security best practice for pods that don't need to talk to the Kubernetes API.


Pod Lifecycle

A pod moves through these phases:

  • Pending: accepted by the cluster but not yet running. Waiting for scheduling, image pull, or init containers.
  • Running: the pod is bound to a node and at least one container is running, starting, or restarting.
  • Succeeded: all containers terminated successfully (exit code 0) and won't restart.
  • Failed: all containers terminated, at least one with a non-zero exit code.
  • Unknown: the pod's status can't be obtained, usually a node communication issue.

Container States

Each container within a pod has its own state:

  • Waiting — not yet running. Reasons include ContainerCreating, ImagePullBackOff, CrashLoopBackOff, CreateContainerConfigError.
  • Running — executing normally. The startedAt timestamp tells you when it started.
  • Terminated — finished execution. Check exitCode, reason (Completed, Error, OOMKilled), and signal.

# See container states
kubectl get pod myapp -o jsonpath='{.status.containerStatuses[*].state}'

# Detailed view
kubectl describe pod myapp

Restart Policies

The restartPolicy field controls what happens when a container exits:

  • Always (default): restart on any exit, with exponential backoff. Use for long-running services (Deployments).
  • OnFailure: restart only on a non-zero exit code. Use for batch jobs that should retry on failure.
  • Never: never restart. Use for one-shot tasks and debugging.

The backoff sequence is: 10s, 20s, 40s, 80s, 160s, capped at 5 minutes. It resets after 10 minutes of successful running.
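A minimal sketch of the policy in use, reusing the hypothetical migration image from the init container example: a one-shot pod that retries with backoff until the command exits 0, then stays Succeeded.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: db-migrate
spec:
  restartPolicy: OnFailure           # retry (with backoff) only on non-zero exit
  containers:
    - name: migrate
      image: myapp:v2.1.0            # hypothetical image from the examples above
      command: ['python', 'manage.py', 'migrate']
```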


Resource Requests and Limits

Every production container should declare resource requests and limits.

containers:
  - name: app
    image: myapp:v2.1.0
    resources:
      requests:
        cpu: 250m        # 0.25 CPU cores — used for scheduling
        memory: 256Mi    # used for scheduling and OOM scoring
      limits:
        cpu: 1000m       # 1 CPU core — container is throttled above this
        memory: 512Mi    # container is OOMKilled above this

Requests vs Limits

Requests determine scheduling. The scheduler finds a node with enough unrequested resources for the pod. The request is a guarantee — the container will always have at least this much available.

Limits determine enforcement. CPU limits cause throttling (the container is slowed down). Memory limits cause OOMKill (the container is killed).

Key differences:

  • CPU: over-requesting wastes capacity (the node appears full but isn't); exceeding the limit causes throttling, so processes slow down but survive.
  • Memory: over-requesting wastes capacity; exceeding the limit gets the container OOMKilled and terminated immediately.

CPU is compressible (you can take it away and give it back). Memory is incompressible (once allocated, you can only reclaim it by killing the process).

Debug clue: If your application is slow but not crashing, check for CPU throttling: cat /sys/fs/cgroup/cpu.stat inside the container and look at nr_throttled and throttled_usec. High throttle counts with low CPU limits mean the container is being starved, not that the application is slow.

CPU Units

  • 1 = 1 vCPU/core
  • 100m = 100 millicores = 0.1 CPU
  • 250m = a quarter core

Memory Units

  • 128Mi = 128 mebibytes (1 MiB = 1,048,576 bytes)
  • 1Gi = 1 gibibyte
  • 128M = 128 megabytes (1 MB = 1,000,000 bytes) — note: Mi vs M matters
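The suffix difference is easy to verify with arithmetic, sketched here as comments on a resources stanza:

```yaml
resources:
  limits:
    memory: 128Mi    # 128 x 1,048,576 = 134,217,728 bytes
    # memory: 128M   # would be 128 x 1,000,000 = 128,000,000 bytes,
    #                # roughly 6 MB less than 128Mi — an easy way to
    #                # set a tighter limit than you intended
```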

QoS Classes

Remember: Under memory pressure, the QoS eviction order is BestEffort dies first, then Burstable (ranked by how far usage exceeds requests), then Guaranteed last. Mnemonic: "Best Effort Goes first."

Kubernetes assigns a Quality of Service class to each pod based on its resource configuration. QoS determines eviction priority when a node is under memory pressure.

Guaranteed

All containers in the pod have both requests and limits set for both CPU and memory, and requests == limits.

resources:
  requests:
    cpu: 500m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 256Mi

Guaranteed pods are the last to be evicted. Use for critical workloads (databases, payment processing).

Burstable

At least one container has a resource request or limit set, but it doesn't meet the Guaranteed criteria (requests != limits, or not all resources specified).

resources:
  requests:
    cpu: 250m
    memory: 128Mi
  limits:
    cpu: 1000m
    memory: 512Mi

Burstable pods are evicted before BestEffort but after Guaranteed. Most application workloads land here.

BestEffort

No container has any resource requests or limits set. These pods are evicted first. Never run BestEffort in production.

# Check QoS class
kubectl get pod myapp -o jsonpath='{.status.qosClass}'

Pod Priority and Preemption

Pod priority lets you rank pods. When the cluster is full, higher-priority pods can preempt (evict) lower-priority ones.

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-apps
value: 1000000
globalDefault: false
preemptionPolicy: PreemptLowerPriority
description: "For critical production services"
---
apiVersion: v1
kind: Pod
metadata:
  name: critical-api
spec:
  priorityClassName: critical-apps
  containers:
    - name: api
      image: myapi:v3.0.0

Built-in priority classes:

  • system-cluster-critical (2000000000): core cluster components (kube-apiserver, etcd)
  • system-node-critical (2000001000): node-essential components (kube-proxy, CNI)

Priority values range from -2,147,483,648 to 1,000,000,000 (values above 1B are reserved for system use).


Scheduling: Controlling Where Pods Run

Node Selectors

The simplest form of scheduling constraint. Pods are only scheduled on nodes matching all specified labels.

spec:
  nodeSelector:
    disktype: ssd
    gpu: "true"

# Label a node
kubectl label node worker-3 disktype=ssd gpu=true

Node Affinity

More expressive than nodeSelector. Supports In, NotIn, Exists, DoesNotExist, Gt, Lt operators.

requiredDuringSchedulingIgnoredDuringExecution — hard requirement. Pod won't schedule if no matching node exists.

preferredDuringSchedulingIgnoredDuringExecution — soft preference. Scheduler tries to match but will schedule elsewhere if needed.

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                  - us-east-1a
                  - us-east-1b
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 80
          preference:
            matchExpressions:
              - key: node-type
                operator: In
                values:
                  - high-memory

Pod Affinity and Anti-Affinity

Schedule pods relative to other pods, not nodes.

Pod affinity — "schedule me near pods with label X." Useful for co-locating related services for low-latency communication.

Pod anti-affinity — "schedule me away from pods with label X." Essential for HA — spread replicas across nodes or zones.

spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values:
                  - api-server
          topologyKey: kubernetes.io/hostname
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchExpressions:
                - key: app
                  operator: In
                  values:
                    - cache
            topologyKey: topology.kubernetes.io/zone

The topologyKey defines what "near" or "away" means:

  • kubernetes.io/hostname: same/different node
  • topology.kubernetes.io/zone: same/different availability zone
  • topology.kubernetes.io/region: same/different region

Taints and Tolerations

Taints are applied to nodes. They repel pods unless the pod has a matching toleration.

# Taint a node
kubectl taint nodes gpu-node-1 gpu=true:NoSchedule

# Remove a taint
kubectl taint nodes gpu-node-1 gpu=true:NoSchedule-

Taint effects:

  • NoSchedule: new pods without a matching toleration won't be scheduled here
  • PreferNoSchedule: soft version; the scheduler tries to avoid the node but doesn't guarantee it
  • NoExecute: existing pods without a matching toleration are evicted, and new ones aren't scheduled

spec:
  tolerations:
    - key: "gpu"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
    - key: "node.kubernetes.io/not-ready"
      operator: "Exists"
      effect: "NoExecute"
      tolerationSeconds: 300    # Tolerate for 5 minutes, then evict

The operator: Exists matches any value for the key (useful for built-in taints).

Topology Spread Constraints

Distribute pods evenly across failure domains. More fine-grained than pod anti-affinity.

spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: api-server
    - maxSkew: 2
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: ScheduleAnyway
      labelSelector:
        matchLabels:
          app: api-server

  • maxSkew: the maximum allowed difference in pod count between any two topology domains
  • whenUnsatisfiable: DoNotSchedule is a hard constraint (the pod stays Pending)
  • whenUnsatisfiable: ScheduleAnyway is a soft constraint (best-effort spread)

PodDisruptionBudget

A PDB declares how many pods of a workload must remain available during voluntary disruptions (node drains, cluster upgrades, rolling updates).

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
  namespace: production
spec:
  minAvailable: 2           # At least 2 pods must be available
  # OR
  # maxUnavailable: 1       # At most 1 pod can be down
  selector:
    matchLabels:
      app: api-server

PDBs only protect against voluntary disruptions. They cannot prevent involuntary disruptions like node crashes or OOMKills.

# Check PDB status
kubectl get pdb -A
kubectl describe pdb api-pdb -n production

Pod Overhead and Runtime Class

Pod overhead accounts for resources consumed by the pod sandbox itself (not your containers). This matters for VM-based runtimes like Kata Containers or gVisor where the sandbox has non-trivial resource usage.

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata
handler: kata-qemu
overhead:
  podFixed:
    cpu: 250m
    memory: 160Mi
---
apiVersion: v1
kind: Pod
metadata:
  name: secure-workload
spec:
  runtimeClassName: kata
  containers:
    - name: app
      image: myapp:v2.1.0
      resources:
        requests:
          cpu: 500m
          memory: 256Mi

The scheduler adds the overhead to the container requests when finding a suitable node: here it looks for a node with 750m CPU (500m + 250m) and 416Mi memory (256Mi + 160Mi) free.


Pod Security Context

Security context defines privilege and access control for a pod or individual container.

Pod-Level Security Context

spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    runAsGroup: 3000
    fsGroup: 2000
    seccompProfile:
      type: RuntimeDefault

Container-Level Security Context

containers:
  - name: app
    image: myapp:v2.1.0
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop:
          - ALL
        add:
          - NET_BIND_SERVICE

Pod Security Standards

Kubernetes defines three security profiles enforced at the namespace level via Pod Security Admission:

  • Privileged: unrestricted. No restrictions applied.
  • Baseline: prevents known privilege escalations. Allows most workloads without modification.
  • Restricted: heavily restricted. Requires running as non-root, dropping all capabilities, and a read-only root filesystem.

# Enforce the restricted standard on a namespace
kubectl label namespace production \
  pod-security.kubernetes.io/enforce=restricted \
  pod-security.kubernetes.io/warn=restricted \
  pod-security.kubernetes.io/audit=restricted

The three enforcement modes:

  • enforce: reject pods that violate the standard
  • warn: allow, but display a warning to the user
  • audit: allow, but record the violation in the audit log
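For a declarative setup, the same labels can live in the Namespace manifest itself instead of being applied imperatively; a sketch:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
```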


Quick Reference: Pod Spec Fields That Matter

  • restartPolicy: what happens when a container exits
  • terminationGracePeriodSeconds: how long to wait between SIGTERM and SIGKILL (default: 30)
  • serviceAccountName: Kubernetes API identity for the pod
  • automountServiceAccountToken: whether to mount the SA token (disable if unused)
  • nodeSelector: simple label-based node selection
  • affinity: complex scheduling rules (node and pod affinity/anti-affinity)
  • tolerations: allow scheduling on tainted nodes
  • topologySpreadConstraints: even distribution across failure domains
  • priorityClassName: pod priority for preemption decisions
  • runtimeClassName: container runtime selection (runc, kata, gvisor)
  • securityContext: privilege and access control
  • dnsPolicy: DNS resolution behavior (ClusterFirst, Default, None)
  • hostNetwork: share the host's network namespace
  • shareProcessNamespace: let containers see each other's processes
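Pulling several of these fields together, a sketch of a hardened pod spec (the service account, PriorityClass, and node label reuse the hypothetical names from earlier examples):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  restartPolicy: Always
  terminationGracePeriodSeconds: 60      # give the app a full minute to drain
  serviceAccountName: my-app-sa
  automountServiceAccountToken: false    # no Kubernetes API access needed
  priorityClassName: critical-apps       # hypothetical PriorityClass from above
  nodeSelector:
    disktype: ssd
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      image: myapp:v2.1.0
      resources:
        requests:
          cpu: 250m
          memory: 256Mi
        limits:
          cpu: 1000m
          memory: 512Mi
```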
