
cgroups & Linux Namespaces - Primer

Why This Matters

Every container is built on two kernel features: namespaces for isolation, cgroups for resource control. When a pod gets OOMKilled, that's cgroups. When a container has its own PID 1 and network stack, that's namespaces. Understanding these primitives lets you debug containers at the kernel level instead of treating them as black boxes.


Linux Namespaces

Namespaces partition kernel resources so that one set of processes sees one set of resources and another sees a different set.

Name origin: Linux namespaces were first introduced in kernel 2.4.19 (2002) with the mount namespace — the flag CLONE_NEWNS stands for "new namespace," not "new mount namespace," because at the time nobody anticipated there would be more than one type. Every subsequent namespace type got a more specific flag name (CLONE_NEWPID, CLONE_NEWNET, etc.), but the mount namespace is stuck with the generic name as a historical artifact.

The Eight Namespace Types

Namespace   Flag              Isolates
pid         CLONE_NEWPID      Process IDs — container sees its own PID tree starting at 1
net         CLONE_NEWNET      Network stack — interfaces, routing, iptables, sockets
mnt         CLONE_NEWNS       Mount points — own filesystem view
uts         CLONE_NEWUTS      Hostname and NIS domain name
ipc         CLONE_NEWIPC      System V IPC, POSIX message queues
user        CLONE_NEWUSER     User/group IDs — UID 0 inside maps to unprivileged UID outside
cgroup      CLONE_NEWCGROUP   cgroup root directory — own cgroup hierarchy view
time        CLONE_NEWTIME     CLOCK_MONOTONIC and CLOCK_BOOTTIME offsets

Every process has namespace references in /proc/PID/ns/:

$ ls -la /proc/self/ns/
lrwxrwxrwx 1 root root 0 Mar 19 10:00 cgroup -> 'cgroup:[4026531835]'
lrwxrwxrwx 1 root root 0 Mar 19 10:00 ipc -> 'ipc:[4026531839]'
lrwxrwxrwx 1 root root 0 Mar 19 10:00 mnt -> 'mnt:[4026531841]'
lrwxrwxrwx 1 root root 0 Mar 19 10:00 net -> 'net:[4026531840]'
lrwxrwxrwx 1 root root 0 Mar 19 10:00 pid -> 'pid:[4026531836]'
lrwxrwxrwx 1 root root 0 Mar 19 10:00 user -> 'user:[4026531837]'
lrwxrwxrwx 1 root root 0 Mar 19 10:00 uts -> 'uts:[4026531838]'

The inode numbers are namespace identifiers. Two processes sharing the same inode are in the same namespace.
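
To check whether two processes share a namespace, compare those symlink targets directly. A minimal sketch (same_ns is a hypothetical helper, not a standard tool):

```shell
# Compare the namespace inodes of two PIDs for a given namespace type.
# Usage: same_ns <pid1> <pid2> <type>, e.g. same_ns 1 $$ net
same_ns() {
  a=$(readlink "/proc/$1/ns/$3") || return 2
  b=$(readlink "/proc/$2/ns/$3") || return 2
  [ "$a" = "$b" ] && echo "same $3 namespace" || echo "different $3 namespace"
}

same_ns $$ $$ net    # trivially the same process: prints "same net namespace"
```

Reading another user's /proc/PID/ns/ links requires root (or ptrace access), which is why container-inspection tooling typically runs privileged.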

Creating Namespaces with unshare

# New PID + mount namespace — bash becomes PID 1
sudo unshare --pid --mount --fork bash
mount -t proc proc /proc    # remount procfs for correct ps output
ps aux                       # only shows processes in this namespace

# New network namespace — isolated network stack
sudo unshare --net bash
ip link    # only loopback, no external connectivity

# Full container-like isolation
sudo unshare --pid --net --mount --uts --ipc --fork bash
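
User namespaces also make this possible without root, since creating a user namespace grants full capabilities inside it. A sketch, assuming the kernel permits unprivileged user namespaces (most modern distros do):

```shell
# No sudo: create a user namespace first, mapping our UID to root inside it,
# then stack PID and mount namespaces on top
unshare --user --map-root-user --pid --mount --fork bash -c 'id -u'
# prints 0: we are "root" inside the namespace while unprivileged outside
```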

Entering Namespaces with nsenter

# Find container's PID on host
PID=$(docker inspect --format '{{.State.Pid}}' my_container)

# Enter all namespaces
sudo nsenter -t $PID -m -u -i -n -p -- /bin/bash

# Enter just network namespace (most common for debugging)
sudo nsenter -t $PID -n -- tcpdump -i eth0 -nn port 80

# Enter PID + mount namespace to see container's filesystem
sudo nsenter -t $PID -m -p -- ls /app
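
When you don't know the PID up front, util-linux's lsns enumerates namespaces system-wide:

```shell
# All network namespaces on the host, with one member process each (needs root
# to see other users' namespaces)
sudo lsns -t net

# Namespaces of a single process — works unprivileged for your own processes
lsns -p $$ -t pid,net,mnt
```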

How Containers Use Namespaces

When an OCI runtime starts a container, it creates namespaces per the config. Kubernetes pods share the network, IPC, and UTS namespaces across all containers in the pod — this is why containers in a pod reach each other on localhost and see the same hostname. Each container still gets its own PID and MNT namespace.

Key sharing patterns:

  • docker run --network=container:X — shares X's network namespace
  • docker run --pid=container:X — shares X's PID namespace (debug sidecars)
  • hostNetwork: true in k8s — skips NET namespace, uses host's network stack


cgroups v1

cgroups limit, account for, and isolate resource usage. v1 uses separate hierarchies per controller.

v1 Layout and Controllers

/sys/fs/cgroup/
├── cpu/           ← cpu.shares, cpu.cfs_quota_us, cpu.cfs_period_us, cpu.stat
├── memory/        ← memory.limit_in_bytes, memory.usage_in_bytes, memory.stat
├── blkio/         ← blkio.throttle.read_bps_device, blkio.throttle.write_bps_device
├── devices/       ← devices.allow, devices.deny
├── freezer/       ← freezer.state (FROZEN/THAWED)
├── pids/          ← pids.max, pids.current
├── cpuacct/       ← cpuacct.usage, cpuacct.stat
└── cpuset/        ← cpuset.cpus, cpuset.mems

Creating and Using v1 cgroups

# Create a memory-limited cgroup
sudo mkdir /sys/fs/cgroup/memory/myapp
echo 268435456 | sudo tee /sys/fs/cgroup/memory/myapp/memory.limit_in_bytes  # 256MB
echo $$ | sudo tee /sys/fs/cgroup/memory/myapp/cgroup.procs

# CPU limit: 50% of one core
sudo mkdir /sys/fs/cgroup/cpu/myapp
echo 50000 | sudo tee /sys/fs/cgroup/cpu/myapp/cpu.cfs_quota_us
echo 100000 | sudo tee /sys/fs/cgroup/cpu/myapp/cpu.cfs_period_us

# Using libcgroup tools
sudo cgcreate -g cpu,memory:myapp
sudo cgexec -g cpu,memory:myapp stress --cpu 2 --vm 1 --vm-bytes 128M
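
The raw byte values are easy to mistype; a tiny helper keeps them readable (mb is a hypothetical name used here for illustration):

```shell
# Convert MiB to the raw byte counts cgroup files expect
mb() { echo $(( $1 * 1024 * 1024 )); }

mb 256    # 268435456, the value written to memory.limit_in_bytes above
# echo "$(mb 256)" | sudo tee /sys/fs/cgroup/memory/myapp/memory.limit_in_bytes
```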

cgroups v2

Who made it: cgroups (control groups) were originally developed by Paul Menage and Rohit Seth at Google in 2006 under the name "process containers." The name was changed to "control groups" to avoid confusion with OS-level containers. Google used them internally to manage workloads on their fleet years before Docker existed. cgroups v2 was a complete rewrite by Tejun Heo, merged in kernel 4.5 (2016), addressing the design limitations of v1's separate-hierarchy-per-controller approach.

v2 replaces the per-controller hierarchies with a single unified hierarchy. It is cleaner and more consistent, and adds Pressure Stall Information (PSI).

v2 Unified Layout

/sys/fs/cgroup/
├── cgroup.controllers          ← available controllers
├── cgroup.subtree_control      ← controllers enabled for children
├── system.slice/
│   └── myapp.service/
│       ├── cpu.max             ← "QUOTA PERIOD" (e.g., "50000 100000")
│       ├── cpu.weight          ← 1-10000 (replaces cpu.shares)
│       ├── memory.max          ← hard limit (replaces memory.limit_in_bytes)
│       ├── memory.high         ← throttling threshold (new in v2)
│       ├── memory.current      ← current usage
│       ├── memory.pressure     ← PSI metrics
│       ├── io.max              ← per-device I/O limits
│       ├── pids.max            ← PID limit
│       └── cgroup.procs        ← PIDs in this cgroup

Enabling Controllers and Setting Limits

In v2, controllers must be explicitly enabled for children:

# Enable controllers for child cgroups
echo "+cpu +memory" | sudo tee /sys/fs/cgroup/cgroup.subtree_control

# Create a cgroup and set limits
sudo mkdir /sys/fs/cgroup/myapp
echo "50000 100000" | sudo tee /sys/fs/cgroup/myapp/cpu.max        # 50% of one core
echo 536870912 | sudo tee /sys/fs/cgroup/myapp/memory.max           # 512MB hard limit
echo 402653184 | sudo tee /sys/fs/cgroup/myapp/memory.high          # 384MB throttle point
echo 100 | sudo tee /sys/fs/cgroup/myapp/pids.max                   # max 100 processes
echo "8:0 rbps=52428800 wbps=20971520" | sudo tee /sys/fs/cgroup/myapp/io.max  # I/O limits
echo $$ | sudo tee /sys/fs/cgroup/myapp/cgroup.procs

Important v2 rule: processes must live in leaf cgroups. A cgroup with children and active controllers cannot directly contain processes.
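
Since cpu.max packs quota and period into one file, a small parser helps when eyeballing limits. A sketch (cpu_max_pct is a hypothetical helper; the demo reads a sample file rather than a live cgroup):

```shell
# Interpret a v2 cpu.max value ("QUOTA PERIOD", or "max PERIOD" when unlimited)
# as a percentage of one core
cpu_max_pct() {
  read -r quota period < "$1"
  if [ "$quota" = "max" ]; then
    echo "unlimited"
  else
    echo "$(( quota * 100 / period ))%"
  fi
}

printf '50000 100000\n' > /tmp/cpu.max.demo
cpu_max_pct /tmp/cpu.max.demo    # 50%
```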

Pressure Stall Information (PSI)

PSI quantifies how much time processes spend waiting for resources:

cat /sys/fs/cgroup/myapp/cpu.pressure
# some avg10=4.50 avg60=2.30 avg300=1.15 total=123456789
# full avg10=0.00 avg60=0.00 avg300=0.00 total=0
  • some: percentage of time at least one task was stalled
  • full: percentage of time all tasks were stalled
  • avg10/60/300: exponential moving averages over 10s, 60s, 300s

If memory.pressure some avg60 stays above 10-20%, the workload needs more memory. If cpu.pressure full avg60 is above 0, you're losing CPU capacity across the board.
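
For alerting, it helps to reduce a pressure file to a single number. A sketch parsing the some avg60 field (psi_some_avg60 is a hypothetical helper; the demo parses the sample output above written to a temp file):

```shell
# Extract the "some avg60" value from any PSI file
# (cpu.pressure, memory.pressure, io.pressure, or /proc/pressure/*)
psi_some_avg60() {
  awk '/^some/ {
    for (i = 1; i <= NF; i++)
      if ($i ~ /^avg60=/) { sub("avg60=", "", $i); print $i }
  }' "$1"
}

printf 'some avg10=4.50 avg60=2.30 avg300=1.15 total=123456789\nfull avg10=0.00 avg60=0.00 avg300=0.00 total=0\n' > /tmp/psi.demo
psi_some_avg60 /tmp/psi.demo    # 2.30
```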


How systemd Uses cgroups

systemd is the primary cgroup manager on modern Linux. Every service, session, and scope gets its own cgroup automatically.

/sys/fs/cgroup/
├── init.scope/            PID 1
├── system.slice/          system services (sshd, nginx, docker)
├── user.slice/            user sessions
└── machine.slice/         VMs and containers (systemd-nspawn)

Viewing and Monitoring

systemd-cgls                    # cgroup hierarchy as a tree
systemd-cgtop                   # real-time resource usage per cgroup
systemctl show nginx.service | grep -E '(Memory|CPU|Tasks)' # current settings

Resource Control Directives

# /etc/systemd/system/myapp.service.d/override.conf
# Note: systemd unit files do not allow inline comments after values
[Service]
# Limit to 2 cores
CPUQuota=200%
# Hard memory limit
MemoryMax=1G
# Throttle threshold
MemoryHigh=768M
# Relative I/O priority
IOWeight=50
# Max processes/threads
TasksMax=512
# Delegate cgroup control (needed for container runtimes)
Delegate=yes

Apply at runtime: sudo systemctl set-property nginx.service MemoryMax=512M. Drop-in file changes take effect only after sudo systemctl daemon-reload and a service restart.

Transient Units with systemd-run

# Run a one-off command with resource limits
sudo systemd-run --scope -p MemoryMax=256M -p CPUQuota=100% ./my_benchmark

How Kubernetes Uses cgroups

Pod QoS Classes

QoS         Condition                                                         OOM Priority
Guaranteed  Every container: requests == limits for CPU and memory            Lowest (last killed)
Burstable   At least one container has requests or limits, but not all equal  Medium
BestEffort  No requests or limits set                                         Highest (first killed)

Requests vs Limits in cgroup Terms

  • CPU requests → cpu.weight/cpu.shares — proportional, not a hard limit. If no contention, can use more.
  • CPU limits → cpu.max/cpu.cfs_quota_us — hard throttle regardless of available CPU.
  • Memory requests → scheduling decision + memory.low (v2) — guaranteed minimum.
  • Memory limits → memory.max/memory.limit_in_bytes — exceed and get OOM-killed.
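
The conversions are mechanical. A sketch of the arithmetic (illustrative only, not kubelet source): a 250m CPU request becomes cpu.shares = 256 on v1 at 1024 shares per core, and a 500m limit with the default 100ms period becomes a 50000µs quota:

```shell
millicores_request=250    # e.g. resources.requests.cpu: 250m
millicores_limit=500      # e.g. resources.limits.cpu: 500m
period_us=100000          # default CFS period, 100ms

# v1: requests map to cpu.shares, 1024 shares per full core
echo "cpu.shares=$(( millicores_request * 1024 / 1000 ))"           # cpu.shares=256

# limits map to a CFS quota per period (v1 cpu.cfs_quota_us, v2 cpu.max)
echo "cpu.cfs_quota_us=$(( millicores_limit * period_us / 1000 ))"  # cpu.cfs_quota_us=50000
```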

cgroup Driver

Kubelet and container runtime must use the same driver. On modern systems, both should use systemd:

# kubelet config
cgroupDriver: systemd

Mismatching drivers (kubelet=systemd, runtime=cgroupfs) causes cryptic pod creation failures.

Debug clue: If pods fail to start with errors like "failed to create pod sandbox" or "cgroup not found," the most common cause is a cgroup driver mismatch between kubelet and the container runtime. Check both: kubelet --cgroup-driver and crictl info | grep cgroupDriver. They must match — both systemd (recommended) or both cgroupfs.

Gotcha: CPU limits in Kubernetes use CFS (Completely Fair Scheduler) quota, which enforces throttling in fixed periods (default 100ms). A process that needs 200ms of CPU in a burst but has a 100m (10%) CPU limit gets throttled every 100ms period even if the node has idle cores. This causes latency spikes that look like application slowness but are actually kernel-level throttling. Check with cat /sys/fs/cgroup/cpu/*/cpu.stat — a high nr_throttled count confirms this.
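
Those counters are easy to turn into a throttling ratio. A sketch (throttle_pct is a hypothetical helper; the demo parses sample cpu.stat content rather than a live cgroup):

```shell
# Fraction of CFS periods in which a cgroup was throttled
# (field names match both v1 and v2 cpu.stat)
throttle_pct() {
  awk '/^nr_periods/   { p = $2 }
       /^nr_throttled/ { t = $2 }
       END { if (p > 0) printf "%.1f%%\n", 100 * t / p; else print "0.0%" }' "$1"
}

printf 'nr_periods 1000\nnr_throttled 250\nthrottled_usec 12000000\n' > /tmp/cpu.stat.demo
throttle_pct /tmp/cpu.stat.demo    # 25.0%
```

Anything in the double digits means the limit is actively shaping latency, not just capping steady-state usage.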

Kubelet cgroup Hierarchy

/sys/fs/cgroup/kubepods.slice/
├── kubepods-burstable.slice/pod<uid>.slice/    ← Burstable pods
├── kubepods-besteffort.slice/pod<uid>.slice/   ← BestEffort pods
└── kubepods-pod<uid>.slice/                    ← Guaranteed pods

The kubelet reserves resources via kubeReserved and systemReserved to prevent pods from starving system services.


v1 vs v2 Quick Reference

Aspect        v1                          v2
Hierarchy     One per controller          Single unified
CPU limit     cpu.cfs_quota_us            cpu.max ("QUOTA PERIOD")
CPU weight    cpu.shares (2-262144)       cpu.weight (1-10000)
Memory limit  memory.limit_in_bytes       memory.max
Memory soft   memory.soft_limit_in_bytes  memory.high
I/O limit     blkio.throttle.*            io.max
PSI           N/A                         Built-in (*.pressure)

Check your version: stat -f --format=%T /sys/fs/cgroup/ — returns cgroup2fs (v2) or tmpfs (v1).
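
Wrapped in a conditional, the same check can drive version-specific tooling (a minimal sketch):

```shell
# Branch on the filesystem type mounted at the cgroup root
fstype=$(stat -f --format=%T /sys/fs/cgroup/)
case "$fstype" in
  cgroup2fs) echo "cgroups v2 (unified hierarchy)" ;;
  tmpfs)     echo "cgroups v1 (per-controller hierarchies)" ;;
  *)         echo "unrecognized: $fstype" ;;
esac
```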

