Tags: linux, l3, deep-dive, linux-fundamentals
Portal | Level: L3: Advanced | Topics: Linux Fundamentals | Domain: Linux
Linux Process Scheduler¶
Scope¶
This document explains Linux scheduling from the viewpoint of operations, performance, and interview readiness.
It covers:
- runnable tasks and run queues
- scheduler classes
- CFS/EEVDF-era intuition
- fairness
- priorities and nice values
- realtime scheduling
- CPU affinity
- load balancing
- cgroup CPU control
- scheduling pathologies
Reference anchors:
- https://docs.kernel.org/scheduler/index.html
- https://docs.kernel.org/scheduler/sched-design-CFS.html
- https://docs.kernel.org/admin-guide/cgroup-v2.html
Big Picture¶
The scheduler answers one relentless question: which runnable task should run on this CPU right now?
That is it.
Everything else is policy:
- fairness
- latency
- throughput
- realtime guarantees
- power efficiency
- CPU locality
- cgroup bandwidth
Key Distinction: Runnable vs Running vs Sleeping¶
A process can exist without consuming CPU.
Important states conceptually:
- running - currently executing on a CPU
- runnable - ready to run, waiting for CPU
- sleeping/blocking - waiting on IO, lock, timer, event, etc.
High load average often means many tasks are runnable or uninterruptibly blocked, not necessarily that CPUs are 100% busy with useful work.
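The runnable-vs-sleeping distinction is directly observable on a live box. A minimal sketch using standard procps/coreutils tools (Linux-only paths assumed):

```shell
# Load average fields: 1/5/15-minute averages of tasks that are
# runnable or in uninterruptible sleep, then running/total entities
# and the most recently used PID.
cat /proc/loadavg

# Count tasks by state: R = running/runnable, S = interruptible sleep,
# D = uninterruptible sleep (often IO), Z = zombie.
ps -eo state= | sort | uniq -c | sort -rn
```

Many `D` states with idle CPUs is the classic "high load average, low CPU usage" pattern: the load number counts waiters, not just workers.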
Run Queues¶
Each CPU has scheduler state and runnable entities queued for selection.
That matters because a multicore system is not one giant single-file line. The kernel tries to place and balance tasks sensibly across CPUs.
Bad outcomes include:
- one CPU hot, others cool
- cache locality losses
- migration overhead
- NUMA pain
- tail latency spikes
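One low-tech way to see placement is ps's `psr` field, which reports the CPU a task last ran on (procps assumed):

```shell
# How many CPUs does this machine have?
nproc

# Tally tasks per last-run CPU (psr = last-run processor). A wildly
# uneven tally can hint at imbalance or deliberate pinning.
ps -eo psr= | sort -n | uniq -c
```

This is only a snapshot of last placement, not a run-queue depth measurement, but it makes "not one giant single-file line" concrete.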
Scheduler Classes¶
Linux has multiple scheduling classes.
At a high level:
- normal/fair scheduling
- realtime scheduling
- deadline scheduling
For most admins, normal/fair and realtime are the big ones to understand.
Fair Scheduling Intuition¶
The fair scheduler is trying to share CPU over time in a way that approximates fairness while preserving responsiveness.
Conceptual model:
- every runnable task "deserves" CPU time
- tasks that have had less CPU recently should get preference
- interactive behavior and weighting matter
- the system is trying to avoid both starvation and awful latency
Nice values influence scheduler weights; they are not direct "take X% of the CPU" dials.
Nice Values¶
nice changes relative priority within the normal scheduler class.
Important point: nice is not a magic "take exactly X% CPU" knob. It changes weight relative to competitors.
So:
- low nice value -> stronger claim on CPU
- high nice value -> weaker claim
This only matters when there is contention.
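A small sketch of both points. The second half assumes the fair-class weight table, where nice 0 maps to weight 1024 and each nice step scales the weight by roughly 1.25x:

```shell
# A child launched at nice 10 should report nice value 10.
nice -n 10 sh -c 'ps -o ni= -p $$'

# Weight intuition: two competing tasks at nice 0 vs nice 5 split a
# contended CPU roughly 3:1 (1.25^5 ~ 3.05) -- a relative share, not
# a fixed percentage. With no contention, nice changes nothing.
awk 'BEGIN { printf "share ratio ~ %.2f : 1\n", 1.25^5 }'
```

This is why renicing a batch job helps an interactive neighbor only when they are actually fighting for the same cores.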
Realtime Scheduling¶
Linux also supports realtime policies: SCHED_FIFO (first-in, first-out) and SCHED_RR (round-robin).
These are not toys. Misuse can starve ordinary work badly.
Realtime scheduling is appropriate when:
- deterministic response is more important than fairness
- workloads are designed carefully
- priority inversion and starvation risks are understood
A box full of badly designed RT tasks can become a very expensive brick.
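The util-linux `chrt` tool is the usual way to inspect and set these policies. A minimal sketch (the application path in the comment is illustrative):

```shell
# Print the static priority ranges each policy accepts on this kernel;
# SCHED_FIFO and SCHED_RR typically report 1-99, SCHED_OTHER 0/0.
chrt -m

# Actually starting a realtime task needs CAP_SYS_NICE/root, e.g.:
#   sudo chrt -f 50 ./latency_sensitive_app
```

A runaway SCHED_FIFO task at high priority can monopolize a CPU indefinitely, which is exactly the "expensive brick" failure mode.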
Deadline Scheduling¶
There is also deadline-oriented scheduling machinery (SCHED_DEADLINE) for workloads needing temporal guarantees.
For many ops interviews it is enough to know:
- it exists
- it is not the same as normal fair scheduling
- it targets explicit timing constraints rather than generic fairness
CPU Affinity and Placement¶
Tools like taskset, cpusets, and orchestrator policies can constrain where tasks run.
Reasons:
- cache locality
- NUMA locality
- licensing weirdness
- isolating noisy workloads
- dedicating cores for latency-sensitive work
But manual pinning has costs:
- imbalance
- underutilization
- operational complexity
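The basic affinity workflow with util-linux `taskset` looks like this:

```shell
# Show this shell's allowed-CPU list.
taskset -cp $$

# Restrict a child to CPU 0 and have it report its own affinity;
# it should print an affinity list of just "0".
taskset -c 0 sh -c 'taskset -cp $$'
```

Affinity is inherited across fork/exec, which is how a pinned supervisor can accidentally pin an entire service tree onto one core.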
Scheduler Load Balancing¶
The kernel periodically balances work across CPUs and scheduling domains.
It must trade off:
- locality
- fairness
- migration cost
- power model
- asymmetric CPU capacity on some systems
This is why "just move the task to another core" is conceptually simple and operationally messy.
Cgroups and CPU Control¶
Cgroup v2 makes CPU control workload-aware.
Concepts include:
- weighted CPU distribution
- quotas/limits
- hierarchical control
This matters for:
- containers
- systemd slices
- noisy-neighbor control
- service-level fairness
Again: Linux scheduling is now "which task runs next?" plus "what policy domain does this workload belong to?"
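As a hedged sketch of the main cgroup v2 knobs (the `app.slice` path is illustrative; writes require root and a cgroup v2 hierarchy mounted at /sys/fs/cgroup):

```shell
# cpu.weight: relative share under contention (default 100, range 1-10000).
echo 200 > /sys/fs/cgroup/app.slice/cpu.weight

# cpu.max: "<quota> <period>" in microseconds. "200000 100000" caps the
# group at two CPUs' worth of time; "max 100000" removes the cap.
echo "200000 100000" > /sys/fs/cgroup/app.slice/cpu.max

# cpu.stat shows whether the quota is biting (nr_throttled, throttled_usec).
cat /sys/fs/cgroup/app.slice/cpu.stat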
Common Scheduler Pathologies¶
CPU saturation¶
Too many runnable tasks, not enough cores.
Run-queue latency¶
Tasks are runnable but wait too long to get CPU.
Lock contention¶
Looks like CPU trouble but is really threads fighting over shared resources.
Interrupt/softirq pressure¶
CPU time consumed by networking/storage/kernel work, not just user processes.
Bad cgroup quota settings¶
Artificial throttling that looks like mysterious slowness.
Affinity mistakes¶
Pinned workloads bottleneck one part of the machine.
Useful Commands¶
For deeper work:
- perf sched
- ftrace
- eBPF sched tracing
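Before reaching for the deep tools, per-task run-queue wait is readable directly from procfs (requires a kernel built with scheduler statistics, which is standard on distro kernels):

```shell
# /proc/<pid>/schedstat: on-CPU time (ns), run-queue wait time (ns),
# and timeslice count. A large second field relative to the first
# means the task spends a lot of time runnable but waiting for CPU.
cat /proc/self/schedstat

# System-wide equivalents (usually need root):
#   perf sched record -- sleep 5 && perf sched latency
```

Sampling schedstat twice and diffing the second field gives a cheap run-queue latency signal without any tracing infrastructure.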
How to Think About "High CPU"¶
Ask:
1. who is burning CPU?
2. user, system, irq, softirq, or steal?
3. are tasks runnable or blocked?
4. is there contention or throttling?
5. is scheduler behavior the cause or just the messenger?
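Question 2 can be answered straight from /proc/stat (Linux assumed; fields are cumulative jiffies since boot):

```shell
# First line of /proc/stat: aggregate jiffies per bucket since boot.
# Printing each bucket's share answers "user, system, irq, softirq,
# or steal?" in one shot.
awk '/^cpu / {
  total = 0
  for (i = 2; i <= 9; i++) total += $i
  split("user nice system idle iowait irq softirq steal", name, " ")
  for (i = 2; i <= 9; i++)
    printf "%-8s %5.1f%%\n", name[i-1], 100 * $i / total
  exit
}' /proc/stat
```

These are since-boot averages; for a "right now" picture, sample the line twice a second apart and compute shares from the deltas (which is what top and mpstat do).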
A lot of "scheduler problems" are actually:
- lock contention
- bad query plans
- garbage collection
- interrupt storms
- cgroup throttling
- virtualization steal time
Interview-Level Things to Explain¶
You should be able to explain:
- runnable vs running vs sleeping
- what a run queue is
- what nice values actually do
- difference between fair and realtime scheduling
- why affinity exists
- how cgroups affect CPU scheduling
- why high load average is not the same as "CPU is 100%"
Fast Mental Model¶
The scheduler is the kernel's traffic cop for CPU time: it selects which runnable task gets a core, under policies shaped by fairness, latency, realtime rules, and cgroup isolation.
Wiki Navigation¶
Prerequisites¶
- Linux Ops (Topic Pack, L0)
Related Content¶
- /proc Filesystem (Topic Pack, L2) — Linux Fundamentals
- Advanced Bash for Ops (Topic Pack, L1) — Linux Fundamentals
- Adversarial Interview Gauntlet (30 sequences) (Scenario, L2) — Linux Fundamentals
- Bash Exercises (Quest Ladder) (CLI) (Exercise Set, L0) — Linux Fundamentals
- Case Study: CI Pipeline Fails — Docker Layer Cache Corruption (Case Study, L2) — Linux Fundamentals
- Case Study: Container Vuln Scanner False Positive Blocks Deploy (Case Study, L2) — Linux Fundamentals
- Case Study: Disk Full Root Services Down (Case Study, L1) — Linux Fundamentals
- Case Study: Disk Full — Runaway Logs, Fix Is Loki Retention (Case Study, L2) — Linux Fundamentals
- Case Study: HPA Flapping — Metrics Server Clock Skew, Fix Is NTP (Case Study, L2) — Linux Fundamentals
- Case Study: Inode Exhaustion (Case Study, L1) — Linux Fundamentals