
cgroups & Linux Namespaces - Primer

Why This Matters

Every container is built on two kernel features: namespaces for isolation, cgroups for resource control. When a pod gets OOMKilled, that's cgroups. When a container has its own PID 1 and network stack, that's namespaces. Understanding these primitives lets you debug containers at the kernel level instead of treating them as black boxes.


Linux Namespaces

Namespaces partition kernel resources so that one set of processes sees one set of resources and another sees a different set.

Name origin: Linux namespaces were first introduced in kernel 2.4.19 (2002) with the mount namespace — the flag CLONE_NEWNS stands for "new namespace," not "new mount namespace," because at the time nobody anticipated there would be more than one type. Every subsequent namespace type got a more specific flag name (CLONE_NEWPID, CLONE_NEWNET, etc.), but the mount namespace is stuck with the generic name as a historical artifact.

The Eight Namespace Types

Namespace   Flag              Isolates
pid         CLONE_NEWPID      Process IDs — container sees its own PID tree starting at 1
net         CLONE_NEWNET      Network stack — interfaces, routing, iptables, sockets
mnt         CLONE_NEWNS       Mount points — own filesystem view
uts         CLONE_NEWUTS      Hostname and NIS domain name
ipc         CLONE_NEWIPC      System V IPC, POSIX message queues
user        CLONE_NEWUSER     User/group IDs — UID 0 inside maps to unprivileged UID outside
cgroup      CLONE_NEWCGROUP   cgroup root directory — own cgroup hierarchy view
time        CLONE_NEWTIME     CLOCK_MONOTONIC and CLOCK_BOOTTIME offsets

Every process has namespace references in /proc/PID/ns/:

$ ls -la /proc/self/ns/
lrwxrwxrwx 1 root root 0 Mar 19 10:00 cgroup -> 'cgroup:[4026531835]'
lrwxrwxrwx 1 root root 0 Mar 19 10:00 ipc -> 'ipc:[4026531839]'
lrwxrwxrwx 1 root root 0 Mar 19 10:00 mnt -> 'mnt:[4026531841]'
lrwxrwxrwx 1 root root 0 Mar 19 10:00 net -> 'net:[4026531840]'
lrwxrwxrwx 1 root root 0 Mar 19 10:00 pid -> 'pid:[4026531836]'
lrwxrwxrwx 1 root root 0 Mar 19 10:00 user -> 'user:[4026531837]'
lrwxrwxrwx 1 root root 0 Mar 19 10:00 uts -> 'uts:[4026531838]'

The inode numbers are namespace identifiers. Two processes sharing the same inode are in the same namespace.
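
To check whether two processes share a namespace, compare those symlink targets directly. A minimal sketch (same_ns is a hypothetical helper, not a standard tool):

```shell
# Compare the namespace inodes of two PIDs for a given namespace type.
# Usage: same_ns <pid1> <pid2> <type>, e.g. same_ns 1 $$ net
same_ns() {
  a=$(readlink "/proc/$1/ns/$3") || return 2
  b=$(readlink "/proc/$2/ns/$3") || return 2
  [ "$a" = "$b" ] && echo "same $3 namespace" || echo "different $3 namespace"
}

same_ns $$ $$ net    # trivially the same process: prints "same net namespace"
```

Reading another user's /proc/PID/ns/ links requires root (or ptrace access), which is why container-inspection tooling typically runs privileged.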

Creating Namespaces with unshare

# New PID + mount namespace — bash becomes PID 1
sudo unshare --pid --mount --fork bash
mount -t proc proc /proc    # remount procfs for correct ps output
ps aux                       # only shows processes in this namespace

# New network namespace — isolated network stack
sudo unshare --net bash
ip link    # only loopback, no external connectivity

# Full container-like isolation
sudo unshare --pid --net --mount --uts --ipc --fork bash
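
User namespaces also make this possible without root, since creating a user namespace grants full capabilities inside it. A sketch, assuming the kernel permits unprivileged user namespaces (most modern distros do):

```shell
# No sudo: create a user namespace first, mapping our UID to root inside it,
# then stack PID and mount namespaces on top
unshare --user --map-root-user --pid --mount --fork bash -c 'id -u'
# prints 0: we are "root" inside the namespace while unprivileged outside
```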

Entering Namespaces with nsenter

# Find container's PID on host
PID=$(docker inspect --format '{{.State.Pid}}' my_container)

# Enter all namespaces
sudo nsenter -t $PID -m -u -i -n -p -- /bin/bash

# Enter just network namespace (most common for debugging)
sudo nsenter -t $PID -n -- tcpdump -i eth0 -nn port 80

# Enter PID + mount namespace to see container's filesystem
sudo nsenter -t $PID -m -p -- ls /app
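
When you don't know the PID up front, util-linux's lsns enumerates namespaces system-wide:

```shell
# All network namespaces on the host, with one member process each (needs root
# to see other users' namespaces)
sudo lsns -t net

# Namespaces of a single process — works unprivileged for your own processes
lsns -p $$ -t pid,net,mnt
```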

How Containers Use Namespaces

When an OCI runtime starts a container, it creates namespaces per the config. Kubernetes pods share the network, IPC, and UTS namespaces across all containers in the pod — this is why containers in a pod reach each other on localhost and see the same hostname. Each container still gets its own PID and MNT namespace.

Key sharing patterns:

  • docker run --network=container:X — shares X's network namespace
  • docker run --pid=container:X — shares X's PID namespace (debug sidecars)
  • hostNetwork: true in k8s — skips NET namespace, uses host's network stack


cgroups v1

cgroups limit, account for, and isolate resource usage. v1 uses separate hierarchies per controller.

v1 Layout and Controllers

/sys/fs/cgroup/
├── cpu/           ← cpu.shares, cpu.cfs_quota_us, cpu.cfs_period_us, cpu.stat
├── memory/        ← memory.limit_in_bytes, memory.usage_in_bytes, memory.stat
├── blkio/         ← blkio.throttle.read_bps_device, blkio.throttle.write_bps_device
├── devices/       ← devices.allow, devices.deny
├── freezer/       ← freezer.state (FROZEN/THAWED)
├── pids/          ← pids.max, pids.current
├── cpuacct/       ← cpuacct.usage, cpuacct.stat
└── cpuset/        ← cpuset.cpus, cpuset.mems

Creating and Using v1 cgroups

# Create a memory-limited cgroup
sudo mkdir /sys/fs/cgroup/memory/myapp
echo 268435456 | sudo tee /sys/fs/cgroup/memory/myapp/memory.limit_in_bytes  # 256MB
echo $$ | sudo tee /sys/fs/cgroup/memory/myapp/cgroup.procs

# CPU limit: 50% of one core
sudo mkdir /sys/fs/cgroup/cpu/myapp
echo 50000 | sudo tee /sys/fs/cgroup/cpu/myapp/cpu.cfs_quota_us
echo 100000 | sudo tee /sys/fs/cgroup/cpu/myapp/cpu.cfs_period_us

# Using libcgroup tools
sudo cgcreate -g cpu,memory:myapp
sudo cgexec -g cpu,memory:myapp stress --cpu 2 --vm 1 --vm-bytes 128M
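
The raw byte values are easy to mistype; a tiny helper keeps them readable (mb is a hypothetical name used here for illustration):

```shell
# Convert MiB to the raw byte counts cgroup files expect
mb() { echo $(( $1 * 1024 * 1024 )); }

mb 256    # 268435456, the value written to memory.limit_in_bytes above
# echo "$(mb 256)" | sudo tee /sys/fs/cgroup/memory/myapp/memory.limit_in_bytes
```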

cgroups v2

Who made it: cgroups (control groups) were originally developed by Paul Menage and Rohit Seth at Google in 2006 under the name "process containers." The name was changed to "control groups" to avoid confusion with OS-level containers. Google used them internally to manage workloads on their fleet years before Docker existed. cgroups v2 was a complete rewrite by Tejun Heo, merged in kernel 4.5 (2016), addressing the design limitations of v1's separate-hierarchy-per-controller approach.

v2 replaces the per-controller hierarchies with a single unified hierarchy. It is cleaner and more consistent, and adds Pressure Stall Information (PSI).

v2 Unified Layout

/sys/fs/cgroup/
├── cgroup.controllers          ← available controllers
├── cgroup.subtree_control      ← controllers enabled for children
├── system.slice/
│   └── myapp.service/
│       ├── cpu.max             ← "QUOTA PERIOD" (e.g., "50000 100000")
│       ├── cpu.weight          ← 1-10000 (replaces cpu.shares)
│       ├── memory.max          ← hard limit (replaces memory.limit_in_bytes)
│       ├── memory.high         ← throttling threshold (new in v2)
│       ├── memory.current      ← current usage
│       ├── memory.pressure     ← PSI metrics
│       ├── io.max              ← per-device I/O limits
│       ├── pids.max            ← PID limit
│       └── cgroup.procs        ← PIDs in this cgroup

Enabling Controllers and Setting Limits

In v2, controllers must be explicitly enabled for children:

# Enable controllers for child cgroups
echo "+cpu +memory" | sudo tee /sys/fs/cgroup/cgroup.subtree_control

# Create a cgroup and set limits
sudo mkdir /sys/fs/cgroup/myapp
echo "50000 100000" | sudo tee /sys/fs/cgroup/myapp/cpu.max        # 50% of one core
echo 536870912 | sudo tee /sys/fs/cgroup/myapp/memory.max           # 512MB hard limit
echo 402653184 | sudo tee /sys/fs/cgroup/myapp/memory.high          # 384MB throttle point
echo 100 | sudo tee /sys/fs/cgroup/myapp/pids.max                   # max 100 processes
echo "8:0 rbps=52428800 wbps=20971520" | sudo tee /sys/fs/cgroup/myapp/io.max  # I/O limits
echo $$ | sudo tee /sys/fs/cgroup/myapp/cgroup.procs

Important v2 rule: processes must live in leaf cgroups. A cgroup with children and active controllers cannot directly contain processes.
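
Since cpu.max packs quota and period into one file, a small parser helps when eyeballing limits. A sketch (cpu_max_pct is a hypothetical helper; the demo reads a sample file rather than a live cgroup):

```shell
# Interpret a v2 cpu.max value ("QUOTA PERIOD", or "max PERIOD" when unlimited)
# as a percentage of one core
cpu_max_pct() {
  read -r quota period < "$1"
  if [ "$quota" = "max" ]; then
    echo "unlimited"
  else
    echo "$(( quota * 100 / period ))%"
  fi
}

printf '50000 100000\n' > /tmp/cpu.max.demo
cpu_max_pct /tmp/cpu.max.demo    # 50%
```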

Pressure Stall Information (PSI)

PSI quantifies how much time processes spend waiting for resources:

cat /sys/fs/cgroup/myapp/cpu.pressure
# some avg10=4.50 avg60=2.30 avg300=1.15 total=123456789
# full avg10=0.00 avg60=0.00 avg300=0.00 total=0
  • some: percentage of time at least one task was stalled
  • full: percentage of time all tasks were stalled
  • avg10/60/300: exponential moving averages over 10s, 60s, 300s

If memory.pressure some avg60 stays above 10-20%, the workload needs more memory. If cpu.pressure full avg60 is above 0, you're losing CPU capacity across the board.
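
For alerting, it helps to reduce a pressure file to a single number. A sketch parsing the some avg60 field (psi_some_avg60 is a hypothetical helper; the demo parses the sample output above written to a temp file):

```shell
# Extract the "some avg60" value from any PSI file
# (cpu.pressure, memory.pressure, io.pressure, or /proc/pressure/*)
psi_some_avg60() {
  awk '/^some/ {
    for (i = 1; i <= NF; i++)
      if ($i ~ /^avg60=/) { sub("avg60=", "", $i); print $i }
  }' "$1"
}

printf 'some avg10=4.50 avg60=2.30 avg300=1.15 total=123456789\nfull avg10=0.00 avg60=0.00 avg300=0.00 total=0\n' > /tmp/psi.demo
psi_some_avg60 /tmp/psi.demo    # 2.30
```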


How systemd Uses cgroups

systemd is the primary cgroup manager on modern Linux. Every service, session, and scope gets its own cgroup automatically.

/sys/fs/cgroup/
├── init.scope/            PID 1
├── system.slice/          system services (sshd, nginx, docker)
├── user.slice/            user sessions
└── machine.slice/         VMs and containers (systemd-nspawn)

Viewing and Monitoring

systemd-cgls                    # cgroup hierarchy as a tree
systemd-cgtop                   # real-time resource usage per cgroup
systemctl show nginx.service | grep -E '(Memory|CPU|Tasks)' # current settings

Resource Control Directives

# /etc/systemd/system/myapp.service.d/override.conf
# Note: systemd unit files do not allow inline comments after values
[Service]
# Limit to 2 cores
CPUQuota=200%
# Hard memory limit
MemoryMax=1G
# Throttle threshold
MemoryHigh=768M
# Relative I/O priority
IOWeight=50
# Max processes/threads
TasksMax=512
# Delegate cgroup control (needed for container runtimes)
Delegate=yes

Apply at runtime: sudo systemctl set-property nginx.service MemoryMax=512M. Drop-in file changes take effect only after sudo systemctl daemon-reload and a service restart.

Transient Units with systemd-run

# Run a one-off command with resource limits
sudo systemd-run --scope -p MemoryMax=256M -p CPUQuota=100% ./my_benchmark

How Kubernetes Uses cgroups

Pod QoS Classes

QoS         Condition                                                         OOM Priority
Guaranteed  Every container: requests == limits for CPU and memory            Lowest (last killed)
Burstable   At least one container has requests or limits, but not all equal  Medium
BestEffort  No requests or limits set                                         Highest (first killed)

Requests vs Limits in cgroup Terms

  • CPU requests → cpu.weight/cpu.shares — proportional, not a hard limit. If no contention, can use more.
  • CPU limits → cpu.max/cpu.cfs_quota_us — hard throttle regardless of available CPU.
  • Memory requests → scheduling decision + memory.low (v2) — guaranteed minimum.
  • Memory limits → memory.max/memory.limit_in_bytes — exceed and get OOM-killed.
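
The conversions are mechanical. A sketch of the arithmetic (illustrative only, not kubelet source): a 250m CPU request becomes cpu.shares = 256 on v1 at 1024 shares per core, and a 500m limit with the default 100ms period becomes a 50000µs quota:

```shell
millicores_request=250    # e.g. resources.requests.cpu: 250m
millicores_limit=500      # e.g. resources.limits.cpu: 500m
period_us=100000          # default CFS period, 100ms

# v1: requests map to cpu.shares, 1024 shares per full core
echo "cpu.shares=$(( millicores_request * 1024 / 1000 ))"           # cpu.shares=256

# limits map to a CFS quota per period (v1 cpu.cfs_quota_us, v2 cpu.max)
echo "cpu.cfs_quota_us=$(( millicores_limit * period_us / 1000 ))"  # cpu.cfs_quota_us=50000
```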

cgroup Driver

Kubelet and container runtime must use the same driver. On modern systems, both should use systemd:

# kubelet config
cgroupDriver: systemd

Mismatching drivers (kubelet=systemd, runtime=cgroupfs) causes cryptic pod creation failures.

Debug clue: If pods fail to start with errors like "failed to create pod sandbox" or "cgroup not found," the most common cause is a cgroup driver mismatch between kubelet and the container runtime. Check both: kubelet --cgroup-driver and crictl info | grep cgroupDriver. They must match — both systemd (recommended) or both cgroupfs.

Gotcha: CPU limits in Kubernetes use CFS (Completely Fair Scheduler) quota, which enforces throttling in fixed periods (default 100ms). A process that needs 200ms of CPU in a burst but has a 100m (10%) CPU limit gets throttled every 100ms period even if the node has idle cores. This causes latency spikes that look like application slowness but are actually kernel-level throttling. Check with cat /sys/fs/cgroup/cpu/*/cpu.stat — a high nr_throttled count confirms this.
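
Those counters are easy to turn into a throttling ratio. A sketch (throttle_pct is a hypothetical helper; the demo parses sample cpu.stat content rather than a live cgroup):

```shell
# Fraction of CFS periods in which a cgroup was throttled
# (field names match both v1 and v2 cpu.stat)
throttle_pct() {
  awk '/^nr_periods/   { p = $2 }
       /^nr_throttled/ { t = $2 }
       END { if (p > 0) printf "%.1f%%\n", 100 * t / p; else print "0.0%" }' "$1"
}

printf 'nr_periods 1000\nnr_throttled 250\nthrottled_usec 12000000\n' > /tmp/cpu.stat.demo
throttle_pct /tmp/cpu.stat.demo    # 25.0%
```

Anything in the double digits means the limit is actively shaping latency, not just capping steady-state usage.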

Kubelet cgroup Hierarchy

/sys/fs/cgroup/kubepods.slice/
├── kubepods-burstable.slice/pod<uid>.slice/    ← Burstable pods
├── kubepods-besteffort.slice/pod<uid>.slice/   ← BestEffort pods
└── kubepods-pod<uid>.slice/                    ← Guaranteed pods

The kubelet reserves resources via kubeReserved and systemReserved to prevent pods from starving system services.


v1 vs v2 Quick Reference

Aspect        v1                          v2
Hierarchy     One per controller          Single unified
CPU limit     cpu.cfs_quota_us            cpu.max ("QUOTA PERIOD")
CPU weight    cpu.shares (2-262144)       cpu.weight (1-10000)
Memory limit  memory.limit_in_bytes       memory.max
Memory soft   memory.soft_limit_in_bytes  memory.high
I/O limit     blkio.throttle.*            io.max
PSI           N/A                         Built-in (*.pressure)

Check your version: stat -f --format=%T /sys/fs/cgroup/ — returns cgroup2fs (v2) or tmpfs (v1).
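
Wrapped in a conditional, the same check can drive version-specific tooling (a minimal sketch):

```shell
# Branch on the filesystem type mounted at the cgroup root
fstype=$(stat -f --format=%T /sys/fs/cgroup/)
case "$fstype" in
  cgroup2fs) echo "cgroups v2 (unified hierarchy)" ;;
  tmpfs)     echo "cgroups v1 (per-controller hierarchies)" ;;
  *)         echo "unrecognized: $fstype" ;;
esac
```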

