Tags: k8s, l3, deep-dive, k8s-core
Portal | Level: L3: Advanced | Topics: Kubernetes Core | Domain: Kubernetes
Kubernetes Scheduler¶
Scope¶
This document explains the Kubernetes scheduler as a real control-plane component, not just "the thing that picks a node."
It covers:
- scheduling queue
- filtering and scoring
- requests/limits relevance
- affinity/anti-affinity
- taints and tolerations
- preemption
- plugins/profiles
- common scheduling failures
Reference anchors:
- https://kubernetes.io/docs/concepts/scheduling-eviction/kube-scheduler/
- https://kubernetes.io/docs/concepts/scheduling-eviction/scheduling-framework/
- https://kubernetes.io/docs/reference/scheduling/config/
Big Picture¶
The scheduler's job is: for each Pod that has no assigned node, pick a suitable node and bind the Pod to it.
That sounds simple. It is not.
The scheduler must account for:
- resource requests
- node constraints
- topology
- affinity rules
- taints/tolerations
- priority
- preemption
- policy plugins
What the Scheduler Actually Sees¶
The scheduler does not run your containers.
The kubelet and container runtime do that on the chosen node.
The scheduler's domain is placement.
Its question is: "Given this pod spec and current cluster state, where is this pod allowed and preferred to run?"
Basic Scheduling Flow¶
Very roughly:
- a Pod enters the scheduling queue
- the scheduler picks it up
- candidate nodes are filtered
- remaining nodes are scored
- the best node is selected
- the Pod is bound to that node
- the kubelet on that node takes over runtime execution
This is often described as filter then score.
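You can watch the binding step of this flow from the outside: each successful placement emits a `Scheduled` event. A quick way to observe it (assuming a running cluster and kubectl access):

```shell
# Watch scheduling decisions as they happen; each binding produces a
# "Scheduled" event of the form "Successfully assigned <ns>/<pod> to <node>".
kubectl get events --field-selector reason=Scheduled --watch
```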
Filtering Stage¶
Filtering removes nodes that are not valid.
Reasons a node may be filtered out:
- insufficient CPU or memory based on requests
- node selector mismatch
- affinity rules not satisfied
- taint not tolerated
- volume constraints
- port conflicts / topology constraints
- node marked unschedulable
This stage answers: "Can it run here at all?"
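Several of the filter inputs above come straight out of the Pod spec. A minimal sketch (names and values here are illustrative, not from this document):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  nodeSelector:              # filtered out if no node carries this label
    disktype: ssd
  containers:
    - name: app
      image: nginx
      resources:
        requests:            # filtered out if no node has this much free allocatable capacity
          cpu: "500m"
          memory: 256Mi
      ports:
        - containerPort: 8080
          hostPort: 8080     # filtered out on nodes where this host port is already taken
```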
Scoring Stage¶
Scoring ranks the nodes that survived filtering.
Scoring can consider:
- resource balance
- spreading
- affinity preferences
- image locality
- topology preferences
- custom plugin behavior
This stage answers: "Among valid nodes, which looks best?"
Requests and Limits¶
Scheduler decisions are primarily driven by requests, not limits.
That is a critical interview point.
If you set:
- requests too low -> bin-packing may overcommit reality
- requests too high -> pods stay Pending unnecessarily
The scheduler is making a placement bet using declared demand. Garbage declarations produce garbage placement.
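The distinction shows up directly in the container spec. A sketch of a container `resources` fragment (the numbers are illustrative):

```yaml
resources:
  requests:         # what the scheduler counts against node allocatable for placement
    cpu: "250m"
    memory: 128Mi
  limits:           # enforced at runtime by the kubelet/runtime; not used for placement
    cpu: "1"
    memory: 512Mi
```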
Affinity and Anti-Affinity¶
Node affinity¶
Expresses desired or required node labels.
Pod affinity¶
Place near certain other pods.
Pod anti-affinity¶
Avoid colocation with certain other pods.
These rules are powerful but can create fragile scheduling if overused.
A cluster can become "policy rich, schedulability poor."
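The "required" vs "preferred" distinction maps onto filter vs score. A hedged sketch of a Pod's `affinity` stanza combining a hard node rule with a soft anti-affinity preference (labels and zone names are illustrative):

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:    # hard rule: acts at the filter stage
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.kubernetes.io/zone
              operator: In
              values: ["us-east-1a", "us-east-1b"]
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:   # soft rule: acts at the score stage
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: web
          topologyKey: kubernetes.io/hostname          # spread replicas across nodes
```

Every `required` rule you add shrinks the set of valid nodes, which is exactly how a cluster becomes policy rich, schedulability poor.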
Taints and Tolerations¶
Taints repel pods. Tolerations allow pods to stay eligible.
Use cases:
- dedicated nodes
- special hardware
- quarantine/failure management
- control-plane isolation
This is one of the cleaner ways to express "not everything belongs everywhere."
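As a sketch of the dedicated-node case: suppose a node has been tainted with `kubectl taint nodes <node> dedicated=gpu:NoSchedule` (the key/value here are illustrative). A Pod opts back in with a matching toleration:

```yaml
# Pod-spec fragment: tolerates the illustrative taint dedicated=gpu:NoSchedule,
# making the tainted node eligible again for this pod (but not required).
tolerations:
  - key: dedicated
    operator: Equal
    value: gpu
    effect: NoSchedule
```

Note that a toleration only removes the repulsion; to actually attract the pod to those nodes you still need node affinity or a node selector.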
Preemption and Priority¶
Higher-priority pods can trigger preemption of lower-priority pods if that is the only path to placement.
That does not mean "scheduler is evil." It means the cluster has policy saying some workloads matter more.
You need to understand the blast radius: preemption can fix one pending pod by disrupting several others.
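Priority is declared through a PriorityClass that pods reference by name. A sketch (the class name and value are illustrative):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-batch              # illustrative name
value: 1000000                      # higher value wins; may preempt lower-priority pods
preemptionPolicy: PreemptLowerPriority
globalDefault: false
description: "Workloads that may evict lower-priority pods to get placed."
```

Pods opt in with `spec.priorityClassName: critical-batch`; setting `preemptionPolicy: Never` instead gives you priority ordering in the queue without the eviction blast radius.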
Scheduling Framework / Plugins¶
Modern kube-scheduler is plugin-oriented.
Extension points allow behavior in stages such as:
- queue sorting
- pre-filter
- filter
- post-filter
- score
- reserve
- permit
- bind
This matters because "scheduler behavior" is not one giant monolith anymore. It is structured pipeline logic.
Multiple Profiles / Custom Schedulers¶
You can configure different scheduling profiles and even run multiple schedulers.
Why this matters:
- special workloads
- experimental placement policies
- custom platform behavior
But most environments are better served by understanding the default scheduler before getting cute.
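For reference, profiles live in the scheduler's own configuration. A hedged sketch of a KubeSchedulerConfiguration that adds a second, bin-packing-flavored profile alongside the default (the profile name is illustrative):

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler    # untouched default behavior
  - schedulerName: bin-packing          # illustrative second profile
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated         # pack nodes tightly instead of spreading load
```

A pod selects the second profile with `spec.schedulerName: bin-packing`; pods that specify nothing keep using the default profile.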
Common Reasons Pods Stay Pending¶
- no nodes satisfy requests
- affinity impossible
- taints not tolerated
- PVC/volume constraints
- topology spread constraints too strict
- node selectors wrong
- cluster simply too small
The scheduler often gets blamed for what is really bad resource declarations or contradictory policy.
Useful Commands¶
Look for:
- events
- taints
- allocatable resources
- request totals
- affinity/toleration rules
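The standard kubectl commands for each of these (placeholders in angle brackets need real names from your cluster):

```shell
# Why is this pod Pending? The Events section usually names the failed filter.
kubectl describe pod <pod-name>

# What does the node offer vs. what is already claimed?
# Shows Taints, Allocatable, and the "Allocated resources" request totals.
kubectl describe node <node-name>

# Recent scheduling failures across the current namespace.
kubectl get events --field-selector reason=FailedScheduling

# Taints on every node at a glance.
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
```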
Interview-Level Things to Explain¶
You should be able to explain:
- filter vs score
- why requests matter more than limits for placement
- what taints/tolerations do
- what affinity/anti-affinity do
- what preemption is
- why a pod can remain Pending forever without any scheduler bug
Fast Mental Model¶
The Kubernetes scheduler is a policy engine that filters impossible nodes, scores viable ones, and binds each unscheduled Pod to the best currently acceptable placement based on declared constraints and cluster state.
Wiki Navigation¶
Prerequisites¶
- Kubernetes Ops (Production) (Topic Pack, L2)
Related Content¶
- Adversarial Interview Gauntlet (30 sequences) (Scenario, L2) — Kubernetes Core
- Case Study: Alert Storm — Flapping Health Checks (Case Study, L2) — Kubernetes Core
- Case Study: Canary Deploy Routing to Wrong Backend — Ingress Misconfigured (Case Study, L2) — Kubernetes Core
- Case Study: CrashLoopBackOff No Logs (Case Study, L1) — Kubernetes Core
- Case Study: DNS Looks Broken — TLS Expired, Fix Is Cert-Manager (Case Study, L2) — Kubernetes Core
- Case Study: DaemonSet Blocks Eviction (Case Study, L2) — Kubernetes Core
- Case Study: Deployment Stuck — ImagePull Auth Failure, Vault Secret Rotation (Case Study, L2) — Kubernetes Core
- Case Study: Drain Blocked by PDB (Case Study, L2) — Kubernetes Core
- Case Study: HPA Flapping — Metrics Server Clock Skew, Fix Is NTP (Case Study, L2) — Kubernetes Core
- Case Study: ImagePullBackOff Registry Auth (Case Study, L1) — Kubernetes Core
Pages that link here¶
- Chaos Engineering & Fault Injection
- Practical Kubernetes Ops - Street Ops
- Primer
- Symptoms
- Symptoms: Alert Storm, Caused by Flapping Health Checks, Fix Is Probe Tuning
- Symptoms: Canary Deploy Looks Healthy, Actually Routing to Wrong Backend, Ingress Misconfigured
- Symptoms: Deployment Stuck, ImagePull Auth Failure, Fix Is Vault Secret Rotation
- Symptoms: HPA Flapping, Metrics Server Clock Skew, Fix Is NTP Config