
Linux Process Scheduler

Scope

This document explains Linux scheduling from the viewpoint of operations, performance, and interview readiness.

It covers:

  • runnable tasks and run queues
  • scheduler classes
  • CFS/EEVDF-era intuition
  • fairness
  • priorities and nice values
  • realtime scheduling
  • CPU affinity
  • load balancing
  • cgroup CPU control
  • scheduling pathologies

Reference anchors:

  • https://docs.kernel.org/scheduler/index.html
  • https://docs.kernel.org/scheduler/sched-design-CFS.html
  • https://docs.kernel.org/admin-guide/cgroup-v2.html


Big Picture

The scheduler answers one relentless question:

Which runnable task should run on which CPU right now?

That is it.

Everything else is policy:

  • fairness
  • latency
  • throughput
  • realtime guarantees
  • power efficiency
  • CPU locality
  • cgroup bandwidth


Key Distinction: Runnable vs Running vs Sleeping

A process can exist without consuming CPU.

Important states conceptually:

  • running - currently executing on a CPU
  • runnable - ready to run, waiting for CPU
  • sleeping/blocking - waiting on IO, lock, timer, event, etc.

High load average often means many tasks are runnable or uninterruptibly blocked, not necessarily that CPUs are 100% busy with useful work.
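You can see this distinction directly on a live box with standard procps tools and /proc (no assumptions beyond a normal Linux userland):

```shell
# Count tasks by state: R = running/runnable, S = interruptible sleep,
# D = uninterruptible sleep (R and D count toward load average on Linux)
ps -eo state= | sort | uniq -c

# 1-, 5-, and 15-minute load averages
cut -d' ' -f1-3 /proc/loadavg
```

A machine with load 20 but mostly D-state tasks is waiting on IO, not burning CPU.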


Run Queues

Each CPU has scheduler state and runnable entities queued for selection.

That matters because a multicore system is not one giant single-file line. The kernel tries to place and balance tasks sensibly across CPUs.

Bad outcomes include:

  • one CPU hot, others cool
  • cache locality losses
  • migration overhead
  • NUMA pain
  • tail latency spikes
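A quick machine-wide sanity check uses counters the kernel already exports in /proc/stat:

```shell
# procs_running: tasks currently runnable (including those on-CPU)
# procs_blocked: tasks stuck in uninterruptible (D) sleep
grep -E '^procs_(running|blocked)' /proc/stat
```

If procs_running stays well above the CPU count, tasks are queueing for cores.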


Scheduler Classes

Linux has multiple scheduling classes.

At a high level:

  • normal/fair scheduling (SCHED_OTHER, plus SCHED_BATCH and SCHED_IDLE)
  • realtime scheduling (SCHED_FIFO, SCHED_RR)
  • deadline scheduling (SCHED_DEADLINE)

For most admins, normal/fair and realtime are the big ones to understand.
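You can check which class a task belongs to with ps (the cls column) or chrt; an ordinary shell will show TS / SCHED_OTHER:

```shell
# cls codes in ps: TS = SCHED_OTHER, FF = SCHED_FIFO,
# RR = SCHED_RR, DLN = SCHED_DEADLINE
ps -o pid,cls,pri,ni,comm -p $$

# Same information, by policy name
chrt -p $$
```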


Fair Scheduling Intuition

The fair scheduler is trying to share CPU over time in a way that approximates fairness while preserving responsiveness.

Conceptual model:

  • every runnable task "deserves" CPU time
  • tasks that have had less CPU recently should get preference
  • interactive behavior and weighting matter
  • the system is trying to avoid both starvation and awful latency

Nice values influence scheduling weights, not direct CPU percentages: each nice step changes a task's weight by roughly a factor of 1.25 relative to its competitors.


Nice Values

nice changes relative priority within the normal scheduler class.

Important point: nice is not a magic "take exactly X% CPU" knob. It changes weight relative to competitors.

So:

  • low nice value -> stronger claim on CPU
  • high nice value -> weaker claim

This only matters when there is contention.
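A minimal demonstration: raising your own niceness needs no privilege, and the child's nice value is visible via ps. Starting from the default niceness of 0, this prints 10:

```shell
# Start a child at nice 10 and print its nice value from inside it
# (lowering nice below 0 would require root/CAP_SYS_NICE)
nice -n 10 sh -c 'ps -o ni= -p $$'
```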


Realtime Scheduling

Linux also supports realtime policies: SCHED_FIFO (runs until it blocks, yields, or is preempted by a higher priority) and SCHED_RR (the same, plus round-robin time slicing among tasks of equal priority).

These are not toys. Misuse can starve ordinary work badly.

Realtime scheduling is appropriate when:

  • deterministic response is more important than fairness
  • workloads are designed carefully
  • priority inversion and starvation risks are understood

A box full of badly designed RT tasks can become a very expensive brick.
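chrt is the usual tool here; querying priority ranges is unprivileged, while actually granting an RT policy needs root or CAP_SYS_NICE. The daemon name below is a placeholder, not a real service:

```shell
# Valid static priority ranges per policy on this kernel
chrt -m

# Sketch: start a command under SCHED_FIFO priority 10 (root required)
# chrt -f 10 some_latency_sensitive_daemon
```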


Deadline Scheduling

There is also deadline-oriented scheduling machinery (SCHED_DEADLINE) for workloads needing explicit temporal guarantees.

For many ops interviews it is enough to know:

  • it exists
  • it is not the same as normal fair scheduling
  • it targets explicit timing constraints rather than generic fairness
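For completeness, util-linux chrt can launch a task under SCHED_DEADLINE; the parameters are nanoseconds, and the command below is a hypothetical sketch (root required, and the priority argument must be 0 for this policy):

```shell
# Sketch: 5 ms of guaranteed runtime every 10 ms period
# chrt -d --sched-runtime 5000000 --sched-deadline 10000000 \
#      --sched-period 10000000 0 my_periodic_task

# Unprivileged check that this chrt build knows the policy
chrt -m | grep -i deadline
```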


CPU Affinity and Placement

Tools like taskset, cpusets, and orchestrator policies can constrain where tasks run.

Reasons:

  • cache locality
  • NUMA locality
  • licensing weirdness
  • isolating noisy workloads
  • dedicating cores for latency-sensitive work

But manual pinning has costs:

  • imbalance
  • underutilization
  • operational complexity
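taskset handles both querying and setting, and restricting your own affinity is unprivileged. CPU 0 is used here only because it exists on any machine:

```shell
# Read the calling shell's allowed-CPU list
taskset -cp $$

# Run a child pinned to CPU 0 and confirm from /proc
taskset -c 0 sh -c 'grep Cpus_allowed_list /proc/self/status'
```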


Scheduler Load Balancing

The kernel periodically balances work across CPUs and scheduling domains.

It must trade off:

  • locality
  • fairness
  • migration cost
  • power model
  • asymmetric CPU capacity on some systems

This is why "just move the task to another core" is conceptually simple and operationally messy.


Cgroups and CPU Control

Cgroup v2 makes CPU control workload-aware.

Concepts include:

  • weighted CPU distribution
  • quotas/limits
  • hierarchical control

This matters for:

  • containers
  • systemd slices
  • noisy-neighbor control
  • service-level fairness

Put differently: Linux scheduling is no longer just "which task runs next?" but also "which policy domain does this workload belong to?"
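Assuming the unified cgroup v2 hierarchy mounted at /sys/fs/cgroup (the common layout on modern systemd distros), you can inspect the CPU controls applied to your own cgroup:

```shell
# Path of this process's cgroup within the v2 hierarchy
cg=$(cut -d: -f3- /proc/self/cgroup | head -1)

# Relative weight (default 100; higher = larger share under contention)
cat "/sys/fs/cgroup$cg/cpu.weight" 2>/dev/null

# Hard limit: "max" = unlimited, else "<quota_us> <period_us>"
cat "/sys/fs/cgroup$cg/cpu.max" 2>/dev/null
```

On cgroup v1 hosts these files live elsewhere, hence the hedged 2>/dev/null.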


Common Scheduler Pathologies

CPU saturation

Too many runnable tasks, not enough cores.

Run-queue latency

Tasks are runnable but wait too long to get CPU.

Lock contention

Looks like CPU trouble but is really threads fighting over shared resources.

Interrupt/softirq pressure

CPU time consumed by networking/storage/kernel work, not just user processes.

Bad cgroup quota settings

Artificial throttling that looks like mysterious slowness.

Affinity mistakes

Pinned workloads bottleneck one part of the machine.
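The cgroup quota pathology in particular is easy to confirm: cpu.stat in cgroup v2 counts how often the quota kicked in (path assumes the v2 hierarchy at /sys/fs/cgroup):

```shell
# Nonzero nr_throttled / throttled_usec means the cpu.max quota is biting
cg=$(cut -d: -f3- /proc/self/cgroup | head -1)
grep -E 'nr_throttled|throttled_usec' "/sys/fs/cgroup$cg/cpu.stat" 2>/dev/null
```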


Useful Commands

uptime
top
htop
ps -eo pid,comm,ni,pri,cls,psr,%cpu --sort=-%cpu | head
vmstat 1
pidstat -u -t 1
mpstat -P ALL 1
taskset -cp <pid>
chrt -p <pid>
cat /sys/kernel/debug/sched/debug   (on older kernels: cat /proc/sched_debug)

For deeper work:

  • perf sched
  • ftrace
  • eBPF sched tracing


How to Think About "High CPU"

Ask:

  1. Who is burning CPU?
  2. User, system, irq, softirq, or steal?
  3. Are tasks runnable or blocked?
  4. Is there contention or throttling?
  5. Is scheduler behavior the cause or just the messenger?
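The user/system/irq/softirq/steal split can be read straight from /proc/stat (cumulative jiffies since boot; mpstat gives the same split per second):

```shell
# Columns after "cpu": user nice system idle iowait irq softirq steal guest guest_nice
grep '^cpu ' /proc/stat
```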

A lot of "scheduler problems" are actually:

  • lock contention
  • bad query plans
  • garbage collection
  • interrupt storms
  • cgroup throttling
  • virtualization steal time


Interview-Level Things to Explain

You should be able to explain:

  • runnable vs running vs sleeping
  • what a run queue is
  • what nice values actually do
  • difference between fair and realtime scheduling
  • why affinity exists
  • how cgroups affect CPU scheduling
  • why high load average is not the same as "CPU is 100%"

Fast Mental Model

The scheduler is the kernel's traffic cop for CPU time: it selects which runnable task gets a core, under policies shaped by fairness, latency, realtime rules, and cgroup isolation.
