
Kubernetes Pods & Scheduling Footguns

[!WARNING] These will bite you in production. Every item here has caused real incidents.


1. No Resource Limits (Noisy Neighbor)

A single container without resource limits can consume an entire node's CPU and memory, starving every other pod on that node. The scheduler has no idea how much the container actually needs because you didn't tell it.

A development pod running a memory-leaking test suite eats 32GB of RAM. Every production pod on that node gets OOMKilled.

Fix: Always set both requests and limits. At minimum, set requests so the scheduler can make informed decisions. Use LimitRange to enforce defaults at the namespace level:

apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
    - default:
        cpu: 500m
        memory: 512Mi
      defaultRequest:
        cpu: 100m
        memory: 128Mi
      type: Container

2. Memory Limits Too Tight (OOMKilled)

You profile your app, see 200Mi at steady state, and set limits.memory: 200Mi. A traffic spike causes a brief allocation to 210Mi. The container gets OOMKilled. The restart causes a connection storm, which uses even more memory. Now you're in a CrashLoopBackOff caused by your own limit.

Fix: Set memory limits 50-100% above the request. Memory is incompressible — going over the limit means instant death, not graceful degradation.

resources:
  requests:
    memory: 200Mi
  limits:
    memory: 400Mi    # 2x headroom for spikes
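To find containers that have already hit this, kube-state-metrics (if installed) exposes the last termination reason, which you can query in Prometheus:

```
# Containers whose most recent termination was an OOM kill
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}
```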

3. CPU Limits Too Tight (Throttling)

CPU limits cause CFS throttling. Your app has 100ms bursts of CPU activity but the limit caps it at 250m. CFS throttles the process mid-burst, adding latency to every request. Your P99 latency goes from 50ms to 800ms and there's no obvious reason in the logs.

CPU throttling is invisible unless you look for it:

# On the node, check cgroup throttling (cgroup v1 path shown; layout differs under cgroup v2)
cat /sys/fs/cgroup/cpu/kubepods/pod<uid>/cpu.stat
# Look for nr_throttled and throttled_time

# Or via Prometheus
rate(container_cpu_cfs_throttled_seconds_total[5m])

Fix: Many production operators remove CPU limits entirely and rely only on CPU requests. CPU is compressible — a throttled process is slower, but an OOMKilled process is dead. If you keep CPU limits, set them 4-10x above requests.
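A resources stanza following that advice — memory limit kept, CPU limit omitted (values illustrative):

```yaml
resources:
  requests:
    cpu: 250m          # used for scheduling and CFS share weighting
    memory: 256Mi
  limits:
    memory: 512Mi      # keep the memory limit; no cpu limit means no CFS throttling
```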


4. Missing PDB (Drain Evicts All Replicas)

You have 3 replicas of your API server. An admin runs kubectl drain on a node. All 3 replicas happen to be on that node (or on nodes being drained in sequence). All pods are evicted simultaneously. Your service has zero replicas for the time it takes to reschedule.

Fix: Always create PDBs for production workloads:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: api-server

Combine with pod anti-affinity to ensure replicas are on different nodes in the first place.


5. restartPolicy: Always for Batch Jobs

You run a one-off task as a bare pod and forget that the default restart policy for pods is Always. The task completes successfully, exits with code 0, and Kubernetes restarts it. It runs again, inserts duplicate records into the database, and keeps doing this forever.

Fix: Use a Job, which requires restartPolicy: OnFailure or restartPolicy: Never — the API rejects Always. The Job controller handles retries; the pod-level restart policy should only catch unexpected crashes, not re-run completed work.

apiVersion: batch/v1
kind: Job
metadata:
  name: db-migration
spec:
  template:
    spec:
      restartPolicy: OnFailure    # NOT Always
      containers:
        - name: migrate
          image: myapp:v2.1.0
          command: ['python', 'manage.py', 'migrate']

6. hostNetwork/hostPID Without Realizing Security Implications

You set hostNetwork: true because port-forward was too slow for debugging. Now your pod shares the host's network namespace — it can see all traffic on the node, bind to any port, and access services listening on localhost on the host (including the kubelet API at port 10250).

hostPID: true is even worse — your container can see and signal every process on the host, including other pods' containers.

Fix: Never use hostNetwork or hostPID in production unless you're writing a CNI plugin or node-level monitoring agent. If you need to debug network issues, use kubectl debug node/worker-1 which creates a proper debug pod. Enforce this with Pod Security Admission:

kubectl label namespace production pod-security.kubernetes.io/enforce=baseline

The baseline standard blocks hostNetwork, hostPID, and hostIPC.


7. Pod Anti-Affinity Without Enough Nodes

You set requiredDuringSchedulingIgnoredDuringExecution pod anti-affinity with topologyKey: kubernetes.io/hostname. You have 3 replicas and 3 nodes. It works. Then a node goes down. The replacement pod can't schedule — the only available nodes already have a replica. Pod stays Pending, you're now at 2 replicas with no way to get back to 3 until the node returns.

Fix: Use preferredDuringSchedulingIgnoredDuringExecution for anti-affinity in most cases. Hard anti-affinity should only be used when you have significantly more nodes than replicas:

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values: ["api-server"]
          topologyKey: kubernetes.io/hostname

8. Taints Without Matching Tolerations

You taint your GPU nodes with gpu=true:NoSchedule. A month later someone adds a new deployment that should run on GPU nodes but forgets the toleration. The pod goes Pending. They stare at kubectl describe pod for an hour not realizing the message 1 node(s) had untolerated taint {gpu: true} is telling them exactly what's wrong.

This also bites during cluster upgrades when cloud providers add temporary taints to nodes being upgraded.

Fix: Document all taints. Use a policy engine (Kyverno, OPA/Gatekeeper) to automatically inject tolerations for matching workloads. Always check taints when debugging Pending pods:

kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
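For the gpu=true:NoSchedule taint above, the workload's pod template needs a matching toleration — and, since tolerations only permit scheduling rather than attract it, a nodeSelector to actually land the pod on GPU nodes (assumes the nodes also carry a gpu=true label):

```yaml
spec:
  tolerations:
    - key: gpu
      operator: Equal
      value: "true"
      effect: NoSchedule
  nodeSelector:
    gpu: "true"       # label assumed to exist on the GPU nodes
```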

9. terminationGracePeriodSeconds Too Short

Default is 30 seconds. Your app takes 45 seconds to drain connections, finish in-flight requests, and flush buffers. At second 30, Kubernetes sends SIGKILL. Every deployment drops in-flight requests.

You don't see this in staging because staging has no real traffic.

Fix: Set terminationGracePeriodSeconds to at least 2x your expected drain time. And actually handle SIGTERM in your application:

spec:
  terminationGracePeriodSeconds: 90
  containers:
    - name: app
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 5"]  # Wait for endpoints to deregister

The preStop sleep gives the Endpoints controller time to remove the pod from service before it starts shutting down. Without this, new requests arrive at a pod that's already draining.


10. No preStop Hook for Graceful Shutdown

You handle SIGTERM in your app. Good. But when Kubernetes sends SIGTERM, it also simultaneously updates the Endpoints object to remove the pod. The problem: kube-proxy and ingress controllers need time to propagate this update. For a few seconds after SIGTERM, traffic is still being routed to your pod while it's shutting down.

Fix: Add a preStop hook that sleeps for a few seconds. This gives the load balancer time to stop sending traffic before the app starts its shutdown sequence:

lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 5"]

Kubernetes sends SIGTERM only after the preStop hook completes, so the sleep delays your app's shutdown sequence while endpoint updates propagate — no manual kill is needed inside the hook.


11. Security Context Not Set (Running as Root)

Most container images default to running as root. Your app doesn't need root. But because you never set securityContext, a container escape vulnerability means the attacker has root on the node.

Fix: Always set security context. The restricted Pod Security Standard enforces all of these:

securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  runAsGroup: 3000
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  seccompProfile:
    type: RuntimeDefault
  capabilities:
    drop:
      - ALL

If your app needs to write to the filesystem, use an emptyDir volume mounted at the write path instead of disabling readOnlyRootFilesystem.
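A sketch of that pattern, assuming the app only writes to /tmp:

```yaml
containers:
  - name: app
    securityContext:
      readOnlyRootFilesystem: true
    volumeMounts:
      - name: tmp
        mountPath: /tmp      # writable scratch space
volumes:
  - name: tmp
    emptyDir: {}             # node-local, wiped when the pod is deleted
```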