cgroups & Linux Namespaces - Primer¶
Why This Matters¶
Every container is built on two kernel features: namespaces for isolation, cgroups for resource control. When a pod gets OOMKilled, that's cgroups. When a container has its own PID 1 and network stack, that's namespaces. Understanding these primitives lets you debug containers at the kernel level instead of treating them as black boxes.
Linux Namespaces¶
Namespaces partition kernel resources so that one set of processes sees one set of resources and another sees a different set.
Name origin: Linux namespaces were first introduced in kernel 2.4.19 (2002) with the mount namespace. The flag `CLONE_NEWNS` stands for "new namespace," not "new mount namespace," because at the time nobody anticipated there would be more than one type. Every subsequent namespace type got a more specific flag name (`CLONE_NEWPID`, `CLONE_NEWNET`, etc.), but the mount namespace is stuck with the generic name as a historical artifact.
The Eight Namespace Types¶
| Namespace | Flag | Isolates |
|---|---|---|
| pid | `CLONE_NEWPID` | Process IDs — container sees its own PID tree starting at 1 |
| net | `CLONE_NEWNET` | Network stack — interfaces, routing, iptables, sockets |
| mnt | `CLONE_NEWNS` | Mount points — own filesystem view |
| uts | `CLONE_NEWUTS` | Hostname and NIS domain name |
| ipc | `CLONE_NEWIPC` | System V IPC, POSIX message queues |
| user | `CLONE_NEWUSER` | User/group IDs — UID 0 inside maps to unprivileged UID outside |
| cgroup | `CLONE_NEWCGROUP` | cgroup root directory — own cgroup hierarchy view |
| time | `CLONE_NEWTIME` | `CLOCK_MONOTONIC` and `CLOCK_BOOTTIME` offsets |
Every process has namespace references in `/proc/PID/ns/`:

```
$ ls -la /proc/self/ns/
lrwxrwxrwx 1 root root 0 Mar 19 10:00 cgroup -> 'cgroup:[4026531835]'
lrwxrwxrwx 1 root root 0 Mar 19 10:00 ipc -> 'ipc:[4026531839]'
lrwxrwxrwx 1 root root 0 Mar 19 10:00 mnt -> 'mnt:[4026531841]'
lrwxrwxrwx 1 root root 0 Mar 19 10:00 net -> 'net:[4026531840]'
lrwxrwxrwx 1 root root 0 Mar 19 10:00 pid -> 'pid:[4026531836]'
lrwxrwxrwx 1 root root 0 Mar 19 10:00 user -> 'user:[4026531837]'
lrwxrwxrwx 1 root root 0 Mar 19 10:00 uts -> 'uts:[4026531838]'
```

The inode numbers are namespace identifiers. Two processes sharing the same inode are in the same namespace. (A `time` entry also appears on kernels 5.6 and later, which added the time namespace.)
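This check works without privileges: compare the symlink targets for any two processes you own. A minimal sketch comparing the current shell with its parent (normally in the same namespaces):

```shell
# Compare the uts namespace of this shell and its parent process.
# Identical symlink targets (same inode) mean the same namespace.
self_ns=$(readlink /proc/$$/ns/uts)
parent_ns=$(readlink "/proc/$PPID/ns/uts")
if [ "$self_ns" = "$parent_ns" ]; then
    echo "same uts namespace"
else
    echo "different uts namespace"
fi
```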
Creating Namespaces with unshare¶
```shell
# New PID + mount namespace — bash becomes PID 1
sudo unshare --pid --mount --fork bash
mount -t proc proc /proc   # remount procfs for correct ps output
ps aux                     # only shows processes in this namespace

# New network namespace — isolated network stack
sudo unshare --net bash
ip link   # only loopback, no external connectivity

# Full container-like isolation
sudo unshare --pid --net --mount --uts --ipc --fork bash
```
Entering Namespaces with nsenter¶
```shell
# Find the container's PID on the host
PID=$(docker inspect --format '{{.State.Pid}}' my_container)

# Enter all namespaces
sudo nsenter -t $PID -m -u -i -n -p -- /bin/bash

# Enter just the network namespace (most common for debugging)
sudo nsenter -t $PID -n -- tcpdump -i eth0 -nn port 80

# Enter PID + mount namespace to see the container's filesystem
sudo nsenter -t $PID -m -p -- ls /app
```
How Containers Use Namespaces¶
When an OCI runtime starts a container, it creates namespaces per the config. Kubernetes pods share NET and IPC namespaces across all containers in the pod — this is why containers in a pod reach each other on localhost. Each container still gets its own PID and MNT namespace.
Key sharing patterns:
- `docker run --network=container:X` — shares X's network namespace
- `docker run --pid=container:X` — shares X's PID namespace (debug sidecars)
- `hostNetwork: true` in k8s — skips the NET namespace, uses the host's network stack
cgroups v1¶
cgroups limit, account for, and isolate resource usage. v1 uses separate hierarchies per controller.
v1 Layout and Controllers¶
```
/sys/fs/cgroup/
├── cpu/      ← cpu.shares, cpu.cfs_quota_us, cpu.cfs_period_us, cpu.stat
├── memory/   ← memory.limit_in_bytes, memory.usage_in_bytes, memory.stat
├── blkio/    ← blkio.throttle.read_bps_device, blkio.throttle.write_bps_device
├── devices/  ← devices.allow, devices.deny
├── freezer/  ← freezer.state (FROZEN/THAWED)
├── pids/     ← pids.max, pids.current
├── cpuacct/  ← cpuacct.usage, cpuacct.stat
└── cpuset/   ← cpuset.cpus, cpuset.mems
```
Creating and Using v1 cgroups¶
```shell
# Create a memory-limited cgroup (256MB)
sudo mkdir /sys/fs/cgroup/memory/myapp
echo 268435456 | sudo tee /sys/fs/cgroup/memory/myapp/memory.limit_in_bytes
echo $$ | sudo tee /sys/fs/cgroup/memory/myapp/cgroup.procs

# CPU limit: 50% of one core
sudo mkdir /sys/fs/cgroup/cpu/myapp
echo 50000 | sudo tee /sys/fs/cgroup/cpu/myapp/cpu.cfs_quota_us
echo 100000 | sudo tee /sys/fs/cgroup/cpu/myapp/cpu.cfs_period_us

# Using libcgroup tools
sudo cgcreate -g cpu,memory:myapp
sudo cgexec -g cpu,memory:myapp stress --cpu 2 --vm 1 --vm-bytes 128M
```
cgroups v2¶
Who made it: cgroups (control groups) were originally developed by Paul Menage and Rohit Seth at Google in 2006 under the name "process containers." The name was changed to "control groups" to avoid confusion with OS-level containers. Google used them internally to manage workloads on their fleet years before Docker existed. cgroups v2 was a complete rewrite by Tejun Heo, merged in kernel 4.5 (2016), addressing the design limitations of v1's separate-hierarchy-per-controller approach.
v2 replaces per-controller hierarchies with a single unified hierarchy. Cleaner, more consistent, adds Pressure Stall Information (PSI).
v2 Unified Layout¶
```
/sys/fs/cgroup/
├── cgroup.controllers       ← available controllers
├── cgroup.subtree_control   ← controllers enabled for children
├── system.slice/
│   └── myapp.service/
│       ├── cpu.max          ← "QUOTA PERIOD" (e.g., "50000 100000")
│       ├── cpu.weight       ← 1-10000 (replaces cpu.shares)
│       ├── memory.max       ← hard limit (replaces memory.limit_in_bytes)
│       ├── memory.high      ← throttling threshold (new in v2)
│       ├── memory.current   ← current usage
│       ├── memory.pressure  ← PSI metrics
│       ├── io.max           ← per-device I/O limits
│       ├── pids.max         ← PID limit
│       └── cgroup.procs     ← PIDs in this cgroup
```
Enabling Controllers and Setting Limits¶
In v2, controllers must be explicitly enabled for children:
```shell
# Enable controllers for child cgroups
echo "+cpu +memory" | sudo tee /sys/fs/cgroup/cgroup.subtree_control

# Create a cgroup and set limits
sudo mkdir /sys/fs/cgroup/myapp
echo "50000 100000" | sudo tee /sys/fs/cgroup/myapp/cpu.max      # 50% of one core
echo 536870912 | sudo tee /sys/fs/cgroup/myapp/memory.max        # 512MB hard limit
echo 402653184 | sudo tee /sys/fs/cgroup/myapp/memory.high       # 384MB throttle point
echo 100 | sudo tee /sys/fs/cgroup/myapp/pids.max                # max 100 processes
echo "8:0 rbps=52428800 wbps=20971520" | sudo tee /sys/fs/cgroup/myapp/io.max  # I/O limits
echo $$ | sudo tee /sys/fs/cgroup/myapp/cgroup.procs
```
Important v2 rule: processes must live in leaf cgroups. A cgroup with children and active controllers cannot directly contain processes.
Pressure Stall Information (PSI)¶
PSI quantifies how much time processes spend waiting for resources:
```shell
cat /sys/fs/cgroup/myapp/cpu.pressure
# some avg10=4.50 avg60=2.30 avg300=1.15 total=123456789
# full avg10=0.00 avg60=0.00 avg300=0.00 total=0
```
- `some`: percentage of time at least one task was stalled
- `full`: percentage of time all tasks were stalled
- `avg10`/`avg60`/`avg300`: exponential moving averages over 10s, 60s, 300s

If `memory.pressure` `some avg60` stays above 10-20%, the workload needs more memory. If `cpu.pressure` `full avg60` is above 0, you're losing CPU capacity across the board.
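Acting on these thresholds usually means parsing the PSI line in a script. A minimal sketch (the sample line is made up; in practice pipe in the cgroup's `memory.pressure`):

```shell
# Extract avg60 from a PSI "some" line and flag sustained pressure.
psi_line='some avg10=4.50 avg60=12.30 avg300=8.15 total=123456789'
avg60=$(echo "$psi_line" | grep -o 'avg60=[0-9.]*' | cut -d= -f2)
# Shell arithmetic is integer-only, so compare the float in awk.
echo "$avg60" | awk '{ if ($1 > 10) print "sustained pressure: add memory"; else print "pressure ok" }'
```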
How systemd Uses cgroups¶
systemd is the primary cgroup manager on modern Linux. Every service, session, and scope gets its own cgroup automatically.
```
/sys/fs/cgroup/
├── init.scope/      ← PID 1
├── system.slice/    ← system services (sshd, nginx, docker)
├── user.slice/      ← user sessions
└── machine.slice/   ← VMs and containers (systemd-nspawn)
```
Viewing and Monitoring¶
```shell
systemd-cgls    # cgroup hierarchy as a tree
systemd-cgtop   # real-time resource usage per cgroup
systemctl show nginx.service | grep -E '(Memory|CPU|Tasks)'   # current settings
```
Resource Control Directives¶
```ini
# /etc/systemd/system/myapp.service.d/override.conf
# Note: systemd does not allow trailing comments on directive lines.
[Service]
# Limit to 2 cores
CPUQuota=200%
# Hard memory limit
MemoryMax=1G
# Throttle threshold
MemoryHigh=768M
# Relative I/O priority
IOWeight=50
# Max processes/threads
TasksMax=512
# Delegate cgroup control (needed for container runtimes)
Delegate=yes
```
Apply at runtime: `sudo systemctl set-property nginx.service MemoryMax=512M`
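CPUQuota maps directly onto the v2 `cpu.max` file. A sketch of the arithmetic, assuming systemd's default 100ms CPUQuotaPeriodSec:

```shell
# CPUQuota=N% becomes "QUOTA PERIOD" in cpu.max: N * 1000 us of runtime
# per 100000 us period (default period of 100ms).
quota_pct=200
period_us=100000
quota_us=$(( quota_pct * period_us / 100 ))
echo "cpu.max: $quota_us $period_us"
```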
Transient Units with systemd-run¶
```shell
# Run a one-off command with resource limits
sudo systemd-run --scope -p MemoryMax=256M -p CPUQuota=100% ./my_benchmark
```
How Kubernetes Uses cgroups¶
Pod QoS Classes¶
| QoS | Condition | OOM Priority |
|---|---|---|
| Guaranteed | Every container: requests == limits for CPU and memory | Lowest (last killed) |
| Burstable | At least one container has requests or limits, but not all equal | Medium |
| BestEffort | No requests or limits set | Highest (first killed) |
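For reference, an illustrative pod spec that lands in the Guaranteed class (names and values are placeholders): requests equal limits for every resource in every container.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-demo
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: "500m"
        memory: "256Mi"
      limits:
        cpu: "500m"
        memory: "256Mi"
```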
Requests vs Limits in cgroup Terms¶
- CPU requests → `cpu.weight`/`cpu.shares` — proportional, not a hard limit. If no contention, can use more.
- CPU limits → `cpu.max`/`cpu.cfs_quota_us` — hard throttle regardless of available CPU.
- Memory requests → scheduling decision + `memory.low` (v2) — guaranteed minimum.
- Memory limits → `memory.max`/`memory.limit_in_bytes` — exceed and get OOM-killed.
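To make the CPU-request mapping concrete, here is a sketch of the commonly documented conversion from millicores to `cpu.shares` and then to `cpu.weight` (treat the constants as assumptions; the authoritative formula lives in the kubelet source):

```shell
# 500m CPU request -> v1 cpu.shares -> v2 cpu.weight
milli_cpu=500
shares=$(( milli_cpu * 1024 / 1000 ))            # 512
weight=$(( (shares - 2) * 9999 / 262142 + 1 ))   # maps [2,262144] onto [1,10000]
echo "cpu.shares=$shares cpu.weight=$weight"
```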
cgroup Driver¶
The kubelet and the container runtime must use the same cgroup driver; on modern systems both should use `systemd`. Mismatched drivers (kubelet=`systemd`, runtime=`cgroupfs`) cause cryptic pod creation failures.
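As an example of keeping the two aligned, the containerd side is set in `/etc/containerd/config.toml`. The section path below is the containerd 1.x CRI plugin layout; verify it against your containerd version.

```toml
# Tell runc to drive cgroups through systemd instead of cgroupfs.
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = true
```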
Debug clue: If pods fail to start with errors like "failed to create pod sandbox" or "cgroup not found," the most common cause is a cgroup driver mismatch between the kubelet and the container runtime. Check both: `kubelet --cgroup-driver` and `crictl info | grep cgroupDriver`. They must match — both `systemd` (recommended) or both `cgroupfs`.

Gotcha: CPU limits in Kubernetes use CFS (Completely Fair Scheduler) quota, which enforces throttling in fixed periods (default 100ms). A process that needs 200ms of CPU in a burst but has a 100m (10%) CPU limit gets throttled every 100ms period even if the node has idle cores. This causes latency spikes that look like application slowness but are actually kernel-level throttling. Check with `cat /sys/fs/cgroup/cpu/*/cpu.stat` — a high `nr_throttled` count confirms this.
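A quick way to turn `cpu.stat` counters into a signal is the throttled-period ratio. Sample values below; in practice read `nr_periods` and `nr_throttled` from the container's cgroup:

```shell
# Fraction of CFS periods in which the cgroup was throttled.
nr_periods=20000
nr_throttled=5000
awk -v p="$nr_periods" -v t="$nr_throttled" \
    'BEGIN { printf "throttled in %.1f%% of periods\n", 100 * t / p }'
```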
Kubelet cgroup Hierarchy¶
```
/sys/fs/cgroup/kubepods.slice/
├── kubepods-burstable.slice/pod<uid>.slice/    ← Burstable pods
├── kubepods-besteffort.slice/pod<uid>.slice/   ← BestEffort pods
└── kubepods-pod<uid>.slice/                    ← Guaranteed pods
```
The kubelet reserves resources via kubeReserved and systemReserved to prevent pods from starving system services.
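These reservations live in the kubelet configuration. An illustrative fragment (the values are placeholders, not recommendations):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
kubeReserved:
  cpu: "500m"
  memory: "512Mi"
systemReserved:
  cpu: "500m"
  memory: "1Gi"
```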
v1 vs v2 Quick Reference¶
| Aspect | v1 | v2 |
|---|---|---|
| Hierarchy | One per controller | Single unified |
| CPU limit | `cpu.cfs_quota_us` | `cpu.max` ("QUOTA PERIOD") |
| CPU weight | `cpu.shares` (2-262144) | `cpu.weight` (1-10000) |
| Memory limit | `memory.limit_in_bytes` | `memory.max` |
| Memory soft | `memory.soft_limit_in_bytes` | `memory.high` |
| I/O limit | `blkio.throttle.*` | `io.max` |
| PSI | N/A | Built-in (`*.pressure`) |

Check your version: `stat -f --format=%T /sys/fs/cgroup/` — returns `cgroup2fs` (v2) or `tmpfs` (v1).