Tags: k8s, l3, deep-dive, k8s-core
Portal | Level: L3: Advanced | Topics: Kubernetes Core | Domain: Kubernetes
Kubernetes Scheduler¶
Scope¶
This document explains the Kubernetes scheduler as a real control-plane component, not just "the thing that picks a node."
It covers:
- scheduling queue
- filtering and scoring
- requests/limits relevance
- affinity/anti-affinity
- taints and tolerations
- preemption
- plugins/profiles
- common scheduling failures
Reference anchors:
- https://kubernetes.io/docs/concepts/scheduling-eviction/kube-scheduler/
- https://kubernetes.io/docs/concepts/scheduling-eviction/scheduling-framework/
- https://kubernetes.io/docs/reference/scheduling/config/
Big Picture¶
The scheduler's job is: for each Pod that has no assigned node, pick a suitable node and bind the Pod to it.
That sounds simple. It is not.
The scheduler must account for:
- resource requests
- node constraints
- topology
- affinity rules
- taints/tolerations
- priority
- preemption
- policy plugins
What the Scheduler Actually Sees¶
The scheduler does not run your containers.
The kubelet and container runtime do that on the chosen node.
The scheduler's domain is placement.
Its question is: "Given this pod spec and current cluster state, where is this pod allowed and preferred to run?"
Basic Scheduling Flow¶
Very roughly:
- a Pod enters the scheduling queue
- the scheduler picks it up
- candidate nodes are filtered
- remaining nodes are scored
- the best node is selected
- the Pod is bound to that node
- the kubelet on that node takes over runtime execution
This is often described as filter then score.
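You can watch the binding step of this flow from the outside: each successful placement emits a `Scheduled` event. A quick way to observe it (assuming a running cluster and kubectl access):

```shell
# Watch scheduling decisions as they happen; each binding produces a
# "Scheduled" event of the form "Successfully assigned <ns>/<pod> to <node>".
kubectl get events --field-selector reason=Scheduled --watch
```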
Filtering Stage¶
Filtering removes nodes that are not valid.
Reasons a node may be filtered out:
- insufficient CPU or memory based on requests
- node selector mismatch
- affinity rules not satisfied
- taint not tolerated
- volume constraints
- port conflicts / topology constraints
- node marked unschedulable
This stage answers: "Can it run here at all?"
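Several of the filter inputs above come straight out of the Pod spec. A minimal sketch (names and values here are illustrative, not from this document):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  nodeSelector:              # filtered out if no node carries this label
    disktype: ssd
  containers:
    - name: app
      image: nginx
      resources:
        requests:            # filtered out if no node has this much free allocatable capacity
          cpu: "500m"
          memory: 256Mi
      ports:
        - containerPort: 8080
          hostPort: 8080     # filtered out on nodes where this host port is already taken
```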
Scoring Stage¶
Scoring ranks the nodes that survived filtering.
Scoring can consider:
- resource balance
- spreading
- affinity preferences
- image locality
- topology preferences
- custom plugin behavior
This stage answers: "Among valid nodes, which looks best?"
Requests and Limits¶
Scheduler decisions are primarily driven by requests, not limits.
That is a critical interview point.
If you set:
- requests too low -> bin-packing may overcommit reality
- requests too high -> pods stay Pending unnecessarily
The scheduler is making a placement bet using declared demand. Garbage declarations produce garbage placement.
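The distinction shows up directly in the container spec. A sketch of a container `resources` fragment (the numbers are illustrative):

```yaml
resources:
  requests:         # what the scheduler counts against node allocatable for placement
    cpu: "250m"
    memory: 128Mi
  limits:           # enforced at runtime by the kubelet/runtime; not used for placement
    cpu: "1"
    memory: 512Mi
```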
Affinity and Anti-Affinity¶
Node affinity¶
Expresses desired or required node labels.
Pod affinity¶
Place near certain other pods.
Pod anti-affinity¶
Avoid colocation with certain other pods.
These rules are powerful but can create fragile scheduling if overused.
A cluster can become "policy rich, schedulability poor."
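The "required" vs "preferred" distinction maps onto filter vs score. A hedged sketch of a Pod's `affinity` stanza combining a hard node rule with a soft anti-affinity preference (labels and zone names are illustrative):

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:    # hard rule: acts at the filter stage
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.kubernetes.io/zone
              operator: In
              values: ["us-east-1a", "us-east-1b"]
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:   # soft rule: acts at the score stage
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: web
          topologyKey: kubernetes.io/hostname          # spread replicas across nodes
```

Every `required` rule you add shrinks the set of valid nodes, which is exactly how a cluster becomes policy rich, schedulability poor.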
Taints and Tolerations¶
Taints repel pods. Tolerations allow pods to stay eligible.
Use cases:
- dedicated nodes
- special hardware
- quarantine/failure management
- control-plane isolation
This is one of the cleaner ways to express "not everything belongs everywhere."
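As a sketch of the dedicated-node case: suppose a node has been tainted with `kubectl taint nodes <node> dedicated=gpu:NoSchedule` (the key/value here are illustrative). A Pod opts back in with a matching toleration:

```yaml
# Pod-spec fragment: tolerates the illustrative taint dedicated=gpu:NoSchedule,
# making the tainted node eligible again for this pod (but not required).
tolerations:
  - key: dedicated
    operator: Equal
    value: gpu
    effect: NoSchedule
```

Note that a toleration only removes the repulsion; to actually attract the pod to those nodes you still need node affinity or a node selector.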
Preemption and Priority¶
Higher-priority pods can trigger preemption of lower-priority pods if that is the only path to placement.
That does not mean "scheduler is evil." It means the cluster has policy saying some workloads matter more.
You need to understand the blast radius: preemption can fix one pending pod by disrupting several others.
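Priority is declared through a PriorityClass that pods reference by name. A sketch (the class name and value are illustrative):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-batch              # illustrative name
value: 1000000                      # higher value wins; may preempt lower-priority pods
preemptionPolicy: PreemptLowerPriority
globalDefault: false
description: "Workloads that may evict lower-priority pods to get placed."
```

Pods opt in with `spec.priorityClassName: critical-batch`; setting `preemptionPolicy: Never` instead gives you priority ordering in the queue without the eviction blast radius.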
Scheduling Framework / Plugins¶
Modern kube-scheduler is plugin-oriented.
Extension points allow behavior in stages such as:
- queue sorting
- pre-filter
- filter
- post-filter
- score
- reserve
- permit
- bind
This matters because "scheduler behavior" is not one giant monolith anymore. It is structured pipeline logic.
Multiple Profiles / Custom Schedulers¶
You can configure different scheduling profiles and even run multiple schedulers.
Why this matters:
- special workloads
- experimental placement policies
- custom platform behavior
But most environments are better served by understanding the default scheduler before getting cute.
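For reference, profiles live in the scheduler's own configuration. A hedged sketch of a KubeSchedulerConfiguration that adds a second, bin-packing-flavored profile alongside the default (the profile name is illustrative):

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler    # untouched default behavior
  - schedulerName: bin-packing          # illustrative second profile
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated         # pack nodes tightly instead of spreading load
```

A pod selects the second profile with `spec.schedulerName: bin-packing`; pods that specify nothing keep using the default profile.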
Common Reasons Pods Stay Pending¶
- no nodes satisfy requests
- affinity impossible
- taints not tolerated
- PVC/volume constraints
- topology spread constraints too strict
- node selectors wrong
- cluster simply too small
The scheduler often gets blamed for what is really bad resource declarations or contradictory policy.
Useful Commands¶
Look for:
- events
- taints
- allocatable resources
- request totals
- affinity/toleration rules
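The standard kubectl commands for each of these (placeholders in angle brackets need real names from your cluster):

```shell
# Why is this pod Pending? The Events section usually names the failed filter.
kubectl describe pod <pod-name>

# What does the node offer vs. what is already claimed?
# Shows Taints, Allocatable, and the "Allocated resources" request totals.
kubectl describe node <node-name>

# Recent scheduling failures across the current namespace.
kubectl get events --field-selector reason=FailedScheduling

# Taints on every node at a glance.
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
```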
Interview-Level Things to Explain¶
You should be able to explain:
- filter vs score
- why requests matter more than limits for placement
- what taints/tolerations do
- what affinity/anti-affinity do
- what preemption is
- why a pod can remain Pending forever without any scheduler bug
Fast Mental Model¶
The Kubernetes scheduler is a policy engine that filters impossible nodes, scores viable ones, and binds each unscheduled Pod to the best currently acceptable placement based on declared constraints and cluster state.
Wiki Navigation¶
Prerequisites¶
- Kubernetes Ops (Production) (Topic Pack, L2)
Related Content¶
- Adversarial Interview Gauntlet (30 sequences) (Scenario, L2) — Kubernetes Core
- Case Study: Alert Storm — Flapping Health Checks (Case Study, L2) — Kubernetes Core
- Case Study: Canary Deploy Routing to Wrong Backend — Ingress Misconfigured (Case Study, L2) — Kubernetes Core
- Case Study: CrashLoopBackOff No Logs (Case Study, L1) — Kubernetes Core
- Case Study: DNS Looks Broken — TLS Expired, Fix Is Cert-Manager (Case Study, L2) — Kubernetes Core
- Case Study: DaemonSet Blocks Eviction (Case Study, L2) — Kubernetes Core
- Case Study: Deployment Stuck — ImagePull Auth Failure, Vault Secret Rotation (Case Study, L2) — Kubernetes Core
- Case Study: Drain Blocked by PDB (Case Study, L2) — Kubernetes Core
- Case Study: HPA Flapping — Metrics Server Clock Skew, Fix Is NTP (Case Study, L2) — Kubernetes Core
- Case Study: ImagePullBackOff Registry Auth (Case Study, L1) — Kubernetes Core
Pages that link here¶
- Chaos Engineering & Fault Injection
- Practical Kubernetes Ops - Street Ops
- Primer
- Symptoms
- Symptoms: Alert Storm, Caused by Flapping Health Checks, Fix Is Probe Tuning
- Symptoms: Canary Deploy Looks Healthy, Actually Routing to Wrong Backend, Ingress Misconfigured
- Symptoms: Deployment Stuck, ImagePull Auth Failure, Fix Is Vault Secret Rotation
- Symptoms: HPA Flapping, Metrics Server Clock Skew, Fix Is NTP Config