Portal | Level: L1: Foundations | Topics: Kubernetes Pods & Scheduling, Kubernetes Core | Domain: Kubernetes

Kubernetes Pods & Scheduling - Primer

Why This Matters

The pod is the fundamental unit of execution in Kubernetes. Everything you deploy — web servers, batch jobs, databases — runs inside a pod. Understanding pod anatomy, lifecycle, and scheduling is the foundation for operating anything in Kubernetes. Get this wrong and you'll spend hours debugging Pending pods, OOMKilled containers, and mysterious scheduling failures.


Pod Anatomy

A pod is a group of one or more containers that share networking and storage. Every container in a pod gets the same IP address and can communicate over localhost.

Container Types

A pod can contain three types of containers:

Application containers — your main workload. Most pods have exactly one.

Init containers — run to completion before any app containers start. They run sequentially (init-1 must succeed before init-2 starts). Common uses: database migrations, config file generation, waiting for a dependency.

apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  initContainers:
    - name: wait-for-db
      image: busybox:1.36
      command: ['sh', '-c', 'until nslookup postgres.default.svc.cluster.local; do sleep 2; done']
    - name: run-migrations
      image: myapp:v2.1.0
      command: ['python', 'manage.py', 'migrate']
  containers:
    - name: app
      image: myapp:v2.1.0
      ports:
        - containerPort: 8000

Sidecar containers — native sidecar support was introduced in Kubernetes 1.28 (enabled by default since 1.29). A sidecar is declared as an init container with restartPolicy: Always. It starts before the app containers, runs alongside them, and is terminated only after the main containers exit. Before native sidecars, a sidecar was just another regular container in the pod spec: there was no guaranteed startup ordering, and the pod could not complete until every container, including the sidecar, stopped.

spec:
  initContainers:
    - name: log-shipper
      image: fluent-bit:3.0
      restartPolicy: Always    # This makes it a native sidecar
      volumeMounts:
        - name: logs
          mountPath: /var/log/app
  containers:
    - name: app
      image: myapp:v2.1.0
      volumeMounts:
        - name: logs
          mountPath: /var/log/app
  volumes:
    - name: logs
      emptyDir: {}

Volumes

Pods can mount several volume types:

  • emptyDir — ephemeral, created when pod is assigned to a node, deleted when pod is removed. Useful for scratch space or sharing files between containers in the same pod.
  • hostPath — mounts a path from the host node's filesystem. Use with extreme caution — it breaks portability and is a security risk.
  • configMap / secret — project ConfigMap or Secret data as files.
  • persistentVolumeClaim — attach durable storage that survives pod restarts.
  • projected — combine multiple sources (configMap, secret, serviceAccountToken, downwardAPI) into a single mount.
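As a sketch, a pod can combine several of these types in one spec. Here an emptyDir provides scratch space and a ConfigMap is projected as read-only files (the ConfigMap name app-config is illustrative, not from the examples above):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  containers:
    - name: app
      image: myapp:v2.1.0
      volumeMounts:
        - name: scratch
          mountPath: /tmp/work        # ephemeral scratch space, gone when the pod is removed
        - name: config
          mountPath: /etc/myapp       # each ConfigMap key appears as a file here
          readOnly: true
  volumes:
    - name: scratch
      emptyDir: {}
    - name: config
      configMap:
        name: app-config              # hypothetical ConfigMap in the same namespace
```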

Service Account

Every pod runs with a service account. If you don't specify one, it uses the default service account in the namespace. The service account determines what Kubernetes API permissions the pod has (via RBAC).

spec:
  serviceAccountName: my-app-sa
  automountServiceAccountToken: false  # Disable if you don't need API access

Setting automountServiceAccountToken: false is a security best practice for pods that don't need to talk to the Kubernetes API.


Pod Lifecycle

A pod moves through these phases:

  • Pending: accepted by the cluster but not yet running. Waiting for scheduling, image pull, or init containers.
  • Running: the pod is bound to a node and at least one container is running, starting, or restarting.
  • Succeeded: all containers terminated successfully (exit code 0) and won't restart.
  • Failed: all containers terminated, at least one with a non-zero exit code.
  • Unknown: the pod's status can't be obtained, usually a node communication issue.

Container States

Each container within a pod has its own state:

  • Waiting — not yet running. Reasons include ContainerCreating, ImagePullBackOff, CrashLoopBackOff, CreateContainerConfigError.
  • Running — executing normally. The startedAt timestamp tells you when it started.
  • Terminated — finished execution. Check exitCode, reason (Completed, Error, OOMKilled), and signal.

# See container states
kubectl get pod myapp -o jsonpath='{.status.containerStatuses[*].state}'

# Detailed view
kubectl describe pod myapp

Restart Policies

The restartPolicy field controls what happens when a container exits:

  • Always (default): restart on any exit, with exponential backoff. Use for long-running services (Deployments).
  • OnFailure: restart only on a non-zero exit code. Use for batch jobs that should retry on failure.
  • Never: never restart. Use for one-shot tasks and debugging.

The backoff sequence is: 10s, 20s, 40s, 80s, 160s, capped at 5 minutes. It resets after 10 minutes of successful running.
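A minimal sketch of the policy in use, reusing the hypothetical migration image from the init container example: a one-shot pod that retries with backoff until the command exits 0, then stays Succeeded.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: db-migrate
spec:
  restartPolicy: OnFailure           # retry (with backoff) only on non-zero exit
  containers:
    - name: migrate
      image: myapp:v2.1.0            # hypothetical image from the examples above
      command: ['python', 'manage.py', 'migrate']
```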


Resource Requests and Limits

Every production container should declare resource requests and limits.

containers:
  - name: app
    image: myapp:v2.1.0
    resources:
      requests:
        cpu: 250m        # 0.25 CPU cores — used for scheduling
        memory: 256Mi    # used for scheduling and OOM scoring
      limits:
        cpu: 1000m       # 1 CPU core — container is throttled above this
        memory: 512Mi    # container is OOMKilled above this

Requests vs Limits

Requests determine scheduling. The scheduler finds a node with enough unrequested resources for the pod. The request is a guarantee — the container will always have at least this much available.

Limits determine enforcement. CPU limits cause throttling (the container is slowed down). Memory limits cause OOMKill (the container is killed).

Key differences:

  • CPU: over-requesting wastes capacity (the node appears full but isn't); exceeding the limit causes throttling, so processes slow down but survive.
  • Memory: over-requesting wastes capacity; exceeding the limit gets the container OOMKilled and terminated immediately.

CPU is compressible (you can take it away and give it back). Memory is incompressible (once allocated, you can only reclaim it by killing the process).

Debug clue: If your application is slow but not crashing, check for CPU throttling: cat /sys/fs/cgroup/cpu.stat inside the container and look at nr_throttled and throttled_usec. High throttle counts with low CPU limits mean the container is being starved, not that the application is slow.

CPU Units

  • 1 = 1 vCPU/core
  • 100m = 100 millicores = 0.1 CPU
  • 250m = a quarter core

Memory Units

  • 128Mi = 128 mebibytes (1 MiB = 1,048,576 bytes)
  • 1Gi = 1 gibibyte
  • 128M = 128 megabytes (1 MB = 1,000,000 bytes) — note: Mi vs M matters
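The suffix difference is easy to verify with arithmetic, sketched here as comments on a resources stanza:

```yaml
resources:
  limits:
    memory: 128Mi    # 128 x 1,048,576 = 134,217,728 bytes
    # memory: 128M   # would be 128 x 1,000,000 = 128,000,000 bytes,
    #                # roughly 6 MB less than 128Mi — an easy way to
    #                # set a tighter limit than you intended
```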

QoS Classes

Remember: Under memory pressure, the QoS eviction order is BestEffort dies first, then Burstable (ranked by how far usage exceeds requests), then Guaranteed last. Mnemonic: "Best Effort Goes first."

Kubernetes assigns a Quality of Service class to each pod based on its resource configuration. QoS determines eviction priority when a node is under memory pressure.

Guaranteed

All containers in the pod have both requests and limits set for both CPU and memory, and requests == limits.

resources:
  requests:
    cpu: 500m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 256Mi

Guaranteed pods are the last to be evicted. Use for critical workloads (databases, payment processing).

Burstable

At least one container has a resource request or limit set, but it doesn't meet the Guaranteed criteria (requests != limits, or not all resources specified).

resources:
  requests:
    cpu: 250m
    memory: 128Mi
  limits:
    cpu: 1000m
    memory: 512Mi

Burstable pods are evicted before BestEffort but after Guaranteed. Most application workloads land here.

BestEffort

No container has any resource requests or limits set. These pods are evicted first. Never run BestEffort in production.

# Check QoS class
kubectl get pod myapp -o jsonpath='{.status.qosClass}'

Pod Priority and Preemption

Pod priority lets you rank pods. When the cluster is full, higher-priority pods can preempt (evict) lower-priority ones.

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-apps
value: 1000000
globalDefault: false
preemptionPolicy: PreemptLowerPriority
description: "For critical production services"
---
apiVersion: v1
kind: Pod
metadata:
  name: critical-api
spec:
  priorityClassName: critical-apps
  containers:
    - name: api
      image: myapi:v3.0.0

Built-in priority classes:

  • system-cluster-critical (2000000000): core cluster components (kube-apiserver, etcd)
  • system-node-critical (2000001000): node-essential components (kube-proxy, CNI)

Priority values range from -2,147,483,648 to 1,000,000,000 (values above 1B are reserved for system use).


Scheduling: Controlling Where Pods Run

Node Selectors

The simplest form of scheduling constraint. Pods are only scheduled on nodes matching all specified labels.

spec:
  nodeSelector:
    disktype: ssd
    gpu: "true"

# Label a node
kubectl label node worker-3 disktype=ssd gpu=true

Node Affinity

More expressive than nodeSelector. Supports In, NotIn, Exists, DoesNotExist, Gt, Lt operators.

requiredDuringSchedulingIgnoredDuringExecution — hard requirement. Pod won't schedule if no matching node exists.

preferredDuringSchedulingIgnoredDuringExecution — soft preference. Scheduler tries to match but will schedule elsewhere if needed.

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                  - us-east-1a
                  - us-east-1b
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 80
          preference:
            matchExpressions:
              - key: node-type
                operator: In
                values:
                  - high-memory

Pod Affinity and Anti-Affinity

Schedule pods relative to other pods, not nodes.

Pod affinity — "schedule me near pods with label X." Useful for co-locating related services for low-latency communication.

Pod anti-affinity — "schedule me away from pods with label X." Essential for HA — spread replicas across nodes or zones.

spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values:
                  - api-server
          topologyKey: kubernetes.io/hostname
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchExpressions:
                - key: app
                  operator: In
                  values:
                    - cache
            topologyKey: topology.kubernetes.io/zone

The topologyKey defines what "near" or "away" means:

  • kubernetes.io/hostname: same/different node
  • topology.kubernetes.io/zone: same/different availability zone
  • topology.kubernetes.io/region: same/different region

Taints and Tolerations

Taints are applied to nodes. They repel pods unless the pod has a matching toleration.

# Taint a node
kubectl taint nodes gpu-node-1 gpu=true:NoSchedule

# Remove a taint
kubectl taint nodes gpu-node-1 gpu=true:NoSchedule-

Taint effects:

  • NoSchedule: new pods without a matching toleration won't be scheduled here
  • PreferNoSchedule: soft version; the scheduler tries to avoid the node but doesn't guarantee it
  • NoExecute: existing pods without a matching toleration are evicted, and new ones aren't scheduled

spec:
  tolerations:
    - key: "gpu"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
    - key: "node.kubernetes.io/not-ready"
      operator: "Exists"
      effect: "NoExecute"
      tolerationSeconds: 300    # Tolerate for 5 minutes, then evict

The operator: Exists matches any value for the key (useful for built-in taints).

Topology Spread Constraints

Distribute pods evenly across failure domains. More fine-grained than pod anti-affinity.

spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: api-server
    - maxSkew: 2
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: ScheduleAnyway
      labelSelector:
        matchLabels:
          app: api-server

  • maxSkew: the maximum allowed difference in pod count between any two topology domains
  • whenUnsatisfiable: DoNotSchedule is a hard constraint (the pod stays Pending)
  • whenUnsatisfiable: ScheduleAnyway is a soft constraint (best-effort spread)

PodDisruptionBudget

A PDB declares how many pods of a workload must remain available during voluntary disruptions (node drains, cluster upgrades, rolling updates).

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
  namespace: production
spec:
  minAvailable: 2           # At least 2 pods must be available
  # OR
  # maxUnavailable: 1       # At most 1 pod can be down
  selector:
    matchLabels:
      app: api-server

PDBs only protect against voluntary disruptions. They cannot prevent involuntary disruptions like node crashes or OOMKills.

# Check PDB status
kubectl get pdb -A
kubectl describe pdb api-pdb -n production

Pod Overhead and Runtime Class

Pod overhead accounts for resources consumed by the pod sandbox itself (not your containers). This matters for VM-based runtimes like Kata Containers or gVisor where the sandbox has non-trivial resource usage.

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata
handler: kata-qemu
overhead:
  podFixed:
    cpu: 250m
    memory: 160Mi
---
apiVersion: v1
kind: Pod
metadata:
  name: secure-workload
spec:
  runtimeClassName: kata
  containers:
    - name: app
      image: myapp:v2.1.0
      resources:
        requests:
          cpu: 500m
          memory: 256Mi

The scheduler adds the overhead to the container requests when finding a suitable node: here it looks for a node with 750m CPU (500m + 250m) and 416Mi memory (256Mi + 160Mi) free.


Pod Security Context

Security context defines privilege and access control for a pod or individual container.

Pod-Level Security Context

spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    runAsGroup: 3000
    fsGroup: 2000
    seccompProfile:
      type: RuntimeDefault

Container-Level Security Context

containers:
  - name: app
    image: myapp:v2.1.0
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop:
          - ALL
        add:
          - NET_BIND_SERVICE

Pod Security Standards

Kubernetes defines three security profiles enforced at the namespace level via Pod Security Admission:

  • Privileged: unrestricted. No restrictions applied.
  • Baseline: prevents known privilege escalations. Allows most workloads without modification.
  • Restricted: heavily restricted. Requires running as non-root, dropping all capabilities, and a read-only root filesystem.

# Enforce the restricted standard on a namespace
kubectl label namespace production \
  pod-security.kubernetes.io/enforce=restricted \
  pod-security.kubernetes.io/warn=restricted \
  pod-security.kubernetes.io/audit=restricted

The three enforcement modes:

  • enforce: reject pods that violate the standard
  • warn: allow, but display a warning to the user
  • audit: allow, but record the violation in the audit log
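For a declarative setup, the same labels can live in the Namespace manifest itself instead of being applied imperatively; a sketch:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
```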


Quick Reference: Pod Spec Fields That Matter

  • restartPolicy: what happens when a container exits
  • terminationGracePeriodSeconds: how long to wait between SIGTERM and SIGKILL (default: 30)
  • serviceAccountName: Kubernetes API identity for the pod
  • automountServiceAccountToken: whether to mount the SA token (disable if unused)
  • nodeSelector: simple label-based node selection
  • affinity: complex scheduling rules (node and pod affinity/anti-affinity)
  • tolerations: allow scheduling on tainted nodes
  • topologySpreadConstraints: even distribution across failure domains
  • priorityClassName: pod priority for preemption decisions
  • runtimeClassName: container runtime selection (runc, kata, gvisor)
  • securityContext: privilege and access control
  • dnsPolicy: DNS resolution behavior (ClusterFirst, Default, None)
  • hostNetwork: share the host's network namespace
  • shareProcessNamespace: let containers see each other's processes
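Pulling several of these fields together, a sketch of a hardened pod spec (the service account, PriorityClass, and node label reuse the hypothetical names from earlier examples):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  restartPolicy: Always
  terminationGracePeriodSeconds: 60      # give the app a full minute to drain
  serviceAccountName: my-app-sa
  automountServiceAccountToken: false    # no Kubernetes API access needed
  priorityClassName: critical-apps       # hypothetical PriorityClass from above
  nodeSelector:
    disktype: ssd
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      image: myapp:v2.1.0
      resources:
        requests:
          cpu: 250m
          memory: 256Mi
        limits:
          cpu: 1000m
          memory: 512Mi
```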
