# Cgroups and Namespaces — Containers Are a Lie

- lesson
- linux-namespaces
- cgroups-v1/v2
- container-internals
- docker-architecture
- security
- l2

Topics: Linux namespaces, cgroups v1/v2, container internals, Docker architecture, security (capabilities, seccomp), rootless containers, /proc, /sys/fs/cgroup
Level: L2 (Operations)
Time: 75–90 minutes
Prerequisites: None — we start from scratch
The Mission¶
Build a container from scratch. No Docker. No containerd. No runc. Just you, a Linux kernel, and a handful of syscalls.
By the end you'll understand that a "container" is marketing — it's a regular Linux process wearing a disguise. Docker's genius wasn't inventing new technology. It was wrapping twenty years of kernel features in a CLI that made them feel like magic.
Here's the proof: after this lesson you'll create an isolated process with its own PID tree,
its own filesystem, its own network stack, and resource limits — using nothing but unshare,
mount, and writing numbers to files. Then you'll break out of it, because understanding
the escape is how you understand the wall.
Part 1: The Proof — Containers Are Just Processes¶
Before we build anything, let's settle the argument.
Start any Docker container:
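For example (any image works — the name `proof` matches the commands that follow):

```shell
# Start a long-running container named "proof"
docker run -d --name proof nginx:1.25
```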
Now look at it from the host:
# Get the container's PID on the host
PID=$(docker inspect --format '{{.State.Pid}}' proof)
echo "Container's host PID: $PID"
# It's a regular process in the host's process table
ps aux | grep $PID | grep -v grep
That nginx process lives in /proc/$PID like every other process. It has file descriptors,
memory maps, a cgroup path, and namespace links. The only thing that makes it "a container"
is what the kernel restricts it from seeing.
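The namespace links live in /proc/$PID/ns/ — listing them produces output like the snippet below:

```shell
# Each entry is a symlink whose target encodes the namespace type and inode
sudo ls -la /proc/$PID/ns/
```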
lrwxrwxrwx 1 root root 0 ... cgroup -> 'cgroup:[4026532419]'
lrwxrwxrwx 1 root root 0 ... ipc -> 'ipc:[4026532417]'
lrwxrwxrwx 1 root root 0 ... mnt -> 'mnt:[4026532415]'
lrwxrwxrwx 1 root root 0 ... net -> 'net:[4026532420]'
lrwxrwxrwx 1 root root 0 ... pid -> 'pid:[4026532418]'
lrwxrwxrwx 1 root root 0 ... user -> 'user:[4026531837]'
lrwxrwxrwx 1 root root 0 ... uts -> 'uts:[4026532416]'
Those inode numbers are namespace IDs. Compare them with PID 1 on the host:
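One way to compare, reusing the $PID from the docker inspect step above:

```shell
# Host init's namespaces vs. the container's — most inodes will differ
sudo readlink /proc/1/ns/pid /proc/$PID/ns/pid
sudo readlink /proc/1/ns/net /proc/$PID/ns/net
```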
Different inodes = different namespaces = different view of the system. Same kernel.
Mental Model: A container is a process with a restricted viewport. Namespaces control what it can see. Cgroups control how much it can use. Everything else — images, registries, orchestrators — is convenience tooling built on top.
Clean up before we build our own:
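Remove the demonstration container:

```shell
docker rm -f proof
```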
Part 2: Namespaces — Controlling What a Process Can See¶
The Eight Namespace Types¶
Linux has eight namespace types. Each one isolates a different kernel resource:
| Namespace | Flag | Isolates | Since |
|---|---|---|---|
| mnt | CLONE_NEWNS | Mount points — own filesystem view | 2.4.19 (2002) |
| pid | CLONE_NEWPID | Process IDs — own PID tree starting at 1 | 2.6.24 (2008) |
| net | CLONE_NEWNET | Network stack — interfaces, routes, iptables | 2.6.29 (2009) |
| uts | CLONE_NEWUTS | Hostname and NIS domain name | 2.6.19 (2006) |
| ipc | CLONE_NEWIPC | System V IPC, POSIX message queues | 2.6.19 (2006) |
| user | CLONE_NEWUSER | User/group IDs — UID 0 inside maps to unprivileged UID | 3.8 (2013) |
| cgroup | CLONE_NEWCGROUP | cgroup root directory — own hierarchy view | 4.6 (2016) |
| time | CLONE_NEWTIME | CLOCK_MONOTONIC and CLOCK_BOOTTIME offsets | 5.6 (2020) |
Name Origin: The mount namespace flag is CLONE_NEWNS — "new namespace" — because in 2002 it was the only namespace type. Nobody anticipated there'd be seven more. Every subsequent type got a descriptive flag (CLONE_NEWPID, CLONE_NEWNET), but the mount namespace is stuck with the generic name as a historical artifact. (Source: kernel commit history, man clone(2))
Creating Namespaces with unshare¶
unshare runs a program in new namespaces. Here's your first "container":
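A minimal invocation using the flags described below:

```shell
# New PID + mount namespaces; --fork makes bash a child so it can be PID 1
sudo unshare --pid --mount --fork bash
```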
| Flag | What it does |
|---|---|
| --pid | New PID namespace — this shell becomes PID 1 |
| --mount | New mount namespace — mounts won't affect the host |
| --fork | Fork before exec — required for PID namespaces |
Now remount /proc so ps reflects the new PID namespace:
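Inside the unshared shell:

```shell
# Give /proc a fresh mount that reflects the new PID namespace, then look around
mount -t proc proc /proc
ps aux
```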
You should see only a handful of processes. Your shell is PID 1. The host's thousands of processes are invisible.
Under the Hood: ps reads /proc to enumerate processes. Without remounting, /proc still shows the host's process list because the mount namespace inherited it. Remounting /proc inside the new PID namespace makes it reflect only the processes visible in that namespace.
Type exit to leave. You're back on the host.
The Network Namespace — Total Isolation¶
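Create one and inspect its interfaces — this produces the listing below:

```shell
sudo unshare --net bash
# Inside the new network namespace:
ip link
```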
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
Only loopback. No eth0. No connectivity. This is what every container starts with before
Docker creates a veth pair and plugs it into a bridge.
Exit and move on.
The UTS Namespace — Your Own Hostname¶
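A quick demonstration — the hostname `container-demo` is an arbitrary example:

```shell
sudo unshare --uts bash
# Inside: change the hostname — only this namespace sees it
hostname container-demo   # arbitrary example name
hostname
```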
Back on the host, hostname still shows the original name. The UTS namespace isolated it.
Trivia: UTS stands for "UNIX Time-sharing System." The name comes from the utsname struct in the kernel (returned by uname(2)), which dates back to the original UNIX. It has nothing to do with time zones — it's the hostname and domain name.
Flashcard Check — Namespaces¶
Cover the answers and test yourself:
| Question | Answer |
|---|---|
| How many namespace types does Linux have? | Eight: mnt, pid, net, uts, ipc, user, cgroup, time |
| Why is the mount namespace flag called CLONE_NEWNS? | It was the first and only namespace type in 2002 |
| What command creates new namespaces for a process? | unshare |
| What command enters an existing process's namespaces? | nsenter |
| Where are a process's namespace links stored? | /proc/PID/ns/ |
| How do you tell if two processes share a namespace? | Same inode number in /proc/PID/ns/<type> |
Part 3: Building a Container from Scratch¶
Time to build a real container. No Docker. Just Linux primitives.
Step 1: Create a Root Filesystem¶
Every container needs a filesystem. We'll use Alpine Linux — it's 7 MB:
mkdir -p /tmp/mycontainer/rootfs
cd /tmp/mycontainer
# Download and extract Alpine's minimal rootfs
curl -L -o alpine.tar.gz \
https://dl-cdn.alpinelinux.org/alpine/v3.19/releases/x86_64/alpine-minirootfs-3.19.1-x86_64.tar.gz
tar -xzf alpine.tar.gz -C rootfs
Step 2: Unshare All the Things¶
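Create all five namespaces in one shot (the same invocation appears in the cheat sheet at the end):

```shell
sudo unshare --pid --mount --net --uts --ipc --fork bash
```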
You're now in a new set of namespaces. But you're still looking at the host's filesystem.
Step 3: Mount the New Root¶
# Bind mount the rootfs to itself (required for pivot_root)
mount --bind /tmp/mycontainer/rootfs /tmp/mycontainer/rootfs
# Create a directory for the old root
mkdir -p /tmp/mycontainer/rootfs/oldroot
# Pivot the root filesystem
cd /tmp/mycontainer/rootfs
pivot_root . oldroot
| Command | What it does |
|---|---|
| mount --bind ... ... | Makes the directory a mount point (pivot_root requires this) |
| pivot_root . oldroot | Swaps the root filesystem — new root is ., old root moves to oldroot/ |
Under the Hood: pivot_root(2) is the syscall that container runtimes use (not chroot). Unlike chroot, which only changes the pathname resolution root, pivot_root actually moves the root mount. The old root can then be unmounted entirely, which chroot cannot do. This is why containers don't see the host filesystem — it's been unmounted, not just hidden. (Source: man 2 pivot_root)
Step 4: Set Up Essential Mounts¶
# Mount /proc for process visibility
mount -t proc proc /proc
# Mount /sys for kernel interfaces
mount -t sysfs sysfs /sys
# Mount /dev/pts for terminal support
mkdir -p /dev/pts
mount -t devpts devpts /dev/pts
# Unmount the old root — we don't need it anymore
umount -l /oldroot
rmdir /oldroot
Step 5: Set the Hostname¶
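The name `mycontainer` (matching the directory we created) is arbitrary:

```shell
# The UTS namespace means this doesn't touch the host's hostname
hostname mycontainer
```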
Step 6: Verify the Isolation¶
# Only our processes
ps aux
# Our own hostname
hostname
# Only loopback networking
ip link
# We're "root" inside
whoami
You just built a container. No Docker daemon. No containerd. No runc. The process thinks it has its own machine — its own PID tree, its own filesystem, its own hostname, its own network stack.
Interview Bridge: "Explain what happens when you run docker run" is a common interview question. The answer is: Docker asks runc to call clone() with namespace flags, then pivot_root() to swap the filesystem, then sets up cgroups and seccomp filters. Everything we just did by hand, plus resource limits and security profiles.
Exit with exit or Ctrl+D.
Flashcard Check — Building Containers¶
| Question | Answer |
|---|---|
| What syscall do container runtimes use to swap the root filesystem? | pivot_root(2) — not chroot |
| Why must you remount /proc after creating a new PID namespace? | ps reads /proc — without remounting, it shows the host's processes |
| What is the minimum set of namespaces for basic container isolation? | PID + mount + UTS + IPC + net (5 of 8) |
| Why does unshare --pid require --fork? | The calling process stays in the old PID namespace; only children enter the new one |
Part 4: Cgroups — Controlling How Much a Process Can Use¶
Namespaces control visibility. Cgroups control consumption. Without cgroups, your "container" from Part 3 could eat all the CPU, all the RAM, and all the disk I/O on the host.
cgroups v1 vs v2 — Why It Matters¶
Name Origin: cgroups were originally developed by Paul Menage and Rohit Seth at Google in 2006 under the name "process containers." The name was changed to "control groups" to avoid confusion with OS-level containers. Google used them to manage workloads on their fleet years before Docker existed. (Source: Linux kernel documentation,
Documentation/admin-guide/cgroup-v2.rst)
There are two versions, and the differences matter:
| Aspect | v1 | v2 |
|---|---|---|
| Hierarchy | One tree per controller (cpu, memory, blkio...) | Single unified tree |
| CPU limit file | cpu.cfs_quota_us / cpu.cfs_period_us | cpu.max ("quota period") |
| CPU weight file | cpu.shares (2–262144) | cpu.weight (1–10000) |
| Memory hard limit | memory.limit_in_bytes | memory.max |
| Memory soft limit | memory.soft_limit_in_bytes | memory.high (throttles, doesn't kill) |
| I/O limits | blkio.throttle.* | io.max |
| PSI (pressure info) | Not available | Built-in (*.pressure) |
| Default distros | RHEL 8, Ubuntu 20.04, Amazon Linux 2 | RHEL 9, Ubuntu 22.04+, Fedora 31+ |
Check which version you're running:
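The filesystem type of /sys/fs/cgroup tells you (the same check appears in the flashcards):

```shell
# cgroup2fs = v2 (unified hierarchy); tmpfs = v1 (per-controller trees)
stat -f --format=%T /sys/fs/cgroup/
```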
Remember: v1 = many trees, v2 = one tree. In v1, CPU lives in /sys/fs/cgroup/cpu/, memory in /sys/fs/cgroup/memory/, and a process can be in different places in each tree. In v2, everything is in one unified hierarchy under /sys/fs/cgroup/. This is why v2 is cleaner — you can't have a process limited to 512MB in the memory tree but unlimited in the cpu tree because someone forgot to add it there.
v2 in Practice — Setting Limits by Writing Files¶
cgroups are controlled by writing to files in /sys/fs/cgroup/. That's it. No daemon, no
API, no special tools. Just the filesystem.
# Create a cgroup
sudo mkdir /sys/fs/cgroup/demo
# Enable controllers for it
echo "+cpu +memory +pids" | sudo tee /sys/fs/cgroup/cgroup.subtree_control
# Set a memory limit: 64MB hard, 48MB soft (throttle point)
echo 67108864 | sudo tee /sys/fs/cgroup/demo/memory.max
echo 50331648 | sudo tee /sys/fs/cgroup/demo/memory.high
# Set a CPU limit: 25% of one core (25000µs per 100000µs period)
echo "25000 100000" | sudo tee /sys/fs/cgroup/demo/cpu.max
# Limit to 20 processes (fork bomb protection)
echo 20 | sudo tee /sys/fs/cgroup/demo/pids.max
# Move the current shell into this cgroup
echo $$ | sudo tee /sys/fs/cgroup/demo/cgroup.procs
| File | Format | Example | Effect |
|---|---|---|---|
| memory.max | bytes | 67108864 | Hard limit — OOM kill above this |
| memory.high | bytes | 50331648 | Throttle threshold — reclaim pressure starts |
| cpu.max | "quota period" (µs) | "25000 100000" | 25% of one core |
| cpu.weight | 1–10000 | 100 | Relative weight when competing |
| pids.max | count | 20 | Max processes in cgroup |
| io.max | "major:minor rbps=N wbps=N" | "8:0 rbps=10485760" | 10 MB/s read on device 8:0 |
Now test the memory limit:
# This will be killed by the cgroup OOM killer
python3 -c "x = bytearray(100 * 1024 * 1024)" # Try to allocate 100MB in a 64MB cgroup
The process gets killed. Check what happened:
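The counters live in the cgroup's memory.events file:

```shell
cat /sys/fs/cgroup/demo/memory.events
# e.g.:
# high 42
# oom_kill 1
```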
That's the audit trail. high 42 means memory.high was exceeded 42 times (throttling).
oom_kill 1 means one process was killed for exceeding memory.max.
Gotcha: memory.high is your friend. In v1, the only option was memory.limit_in_bytes — exceed it and you die. v2 added memory.high as a throttling threshold: the kernel reclaims memory aggressively and slows the process down before it hits the hard limit. Always set both. In Kubernetes on cgroups v2, limits sets memory.max; with the MemoryQoS feature enabled, requests also influence memory.min and memory.high.
Clean up:
# Move shell out of the cgroup first
echo $$ | sudo tee /sys/fs/cgroup/cgroup.procs
sudo rmdir /sys/fs/cgroup/demo
Pressure Stall Information (PSI) — v2 Only¶
PSI tells you how much a cgroup is suffering, not just how much it's using:
cat /proc/pressure/cpu
# some avg10=2.50 avg60=1.80 avg300=1.20 total=1234567890
cat /proc/pressure/memory
# some avg10=0.30 avg60=0.10 avg300=0.05 total=12345678
- some: at least one task was stalled waiting for this resource
- full: all tasks were stalled (severe)
- avg10/60/300: exponential moving averages over 10s, 60s, 300s
| Metric | Healthy | Warning | Action needed |
|---|---|---|---|
| cpu some avg60 | < 5% | 5–25% | > 25%: add CPU |
| memory some avg60 | < 5% | 5–20% | > 20%: add RAM |
| memory full avg60 | 0% | > 0% | > 5%: thrashing, urgent |
| io some avg60 | < 10% | 10–30% | > 30%: upgrade storage |
Mental Model: Traditional monitoring tells you utilization (80% CPU used). PSI tells you saturation (20% of the time, work is stalled waiting for CPU). A system can be at 90% utilization and 0% saturation (healthy) or at 60% utilization and 40% saturation (in trouble because of lock contention or throttling).
Flashcard Check — Cgroups¶
| Question | Answer |
|---|---|
| How do you check if your system runs cgroups v1 or v2? | stat -f --format=%T /sys/fs/cgroup/ — cgroup2fs = v2, tmpfs = v1 |
| What's the difference between memory.max and memory.high in v2? | memory.max = hard limit (OOM kill). memory.high = throttle threshold (slow down) |
| What does cpu.max "50000 100000" mean? | 50% of one core (50,000µs quota per 100,000µs period) |
| How do you add a process to a cgroup? | echo PID > /sys/fs/cgroup/<path>/cgroup.procs |
| What file shows cgroup OOM kill history? | memory.events |
| What does PSI measure that utilization doesn't? | Saturation — how much time work is stalled waiting for a resource |
Part 5: How Docker Assembles the Lie¶
When you run docker run -d -p 8080:80 --memory=512m --name web nginx:1.25, here's
what actually happens — mapped to the primitives you just learned:
docker CLI → REST API → dockerd → gRPC → containerd → containerd-shim → runc
runc does:
1. clone() with CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET | CLONE_NEWUTS | CLONE_NEWIPC
→ Creates 5+ new namespaces (what you did with unshare)
2. Assembles OverlayFS from image layers
→ Stacks read-only layers + writable upper layer into one filesystem view
3. pivot_root() to the assembled filesystem
→ Swaps the root (what you did with pivot_root)
4. Creates a cgroup and writes limits
→ echo 536870912 > memory.max (512MB)
→ echo $PID > cgroup.procs
5. Drops capabilities (removes CAP_SYS_ADMIN, etc.)
6. Applies seccomp filter (blocks ~44 dangerous syscalls)
7. Applies AppArmor/SELinux profile
8. exec() the entrypoint (nginx) as PID 1 in the new namespace set
Docker networking adds one more step:
9. Creates a veth pair
→ One end in the container's net namespace (becomes eth0)
→ Other end on the docker0 bridge on the host
10. Adds iptables DNAT rule
→ Host:8080 → Container-IP:80
Every single step maps to something you built by hand in Parts 2–4. Docker is automation, not magic.
The OverlayFS Layer Cake¶
Docker images are not disk images. They're stacked filesystem diffs:
┌─────────────────────────┐
│ Writable container layer│ ← Your runtime changes go here
├─────────────────────────┤
│ nginx config layer │ ← Read-only
├─────────────────────────┤
│ nginx binary layer │ ← Read-only
├─────────────────────────┤
│ Debian base layer │ ← Read-only
└─────────────────────────┘
↓
[Merged view via OverlayFS] ← What the container process sees
When the container modifies a file from a lower layer, OverlayFS copies the entire file
to the writable layer first (copy-on-write). Deleting a file creates a whiteout marker.
This is why RUN apt-get install && apt-get clean must be in the same Dockerfile layer.
# See the overlay mount for a running container
docker inspect --format '{{.GraphDriver.Data.MergedDir}}' web
# /var/lib/docker/overlay2/<hash>/merged
Gotcha:
/proc/meminfo and /proc/cpuinfo inside a container show the host's values, not the container's cgroup limits. A JVM reading /proc/meminfo sees 64GB of host RAM and sets its heap accordingly — inside a 512MB container. Modern JVMs (10+) enable -XX:+UseContainerSupport by default to read cgroup limits instead. For Go, use go.uber.org/automaxprocs. Python's os.cpu_count() still returns the host CPU count as of 3.12. (Source: JDK-8146115, Go issue #33803)
Part 6: Container Escapes — Breaking the Walls¶
Understanding escapes is understanding the walls. If you can't explain how a container is escaped, you don't understand how it's contained.
Escape 1: The /proc/self/exe Overwrite (CVE-2019-5736)¶
This is the most famous container escape. It affected runc < 1.0.0-rc6.
The vulnerability: A malicious container could overwrite the host's runc binary by
exploiting how Linux handles /proc/self/exe.
How it worked:
- When docker exec runs, runc enters the container's namespaces and then exec()s the requested command
- During a brief window, runc's process is inside the container's namespaces but /proc/self/exe still points to the host's runc binary
- The attacker opens /proc/self/exe (which resolves to the host's /usr/bin/runc) with O_WRONLY and overwrites it with a malicious binary
- Next time anyone runs docker exec or docker run, the malicious runc executes as root on the host
The fix: runc now re-executes itself into a cloned binary (memfd), so /proc/self/exe
never points to the on-disk binary during container operations.
War Story: CVE-2019-5736 was disclosed in February 2019 by Adam Iwaniuk and Borys Popławski. It affected Docker, Kubernetes, and any system using runc. The severity was scored CVSS 8.6. It demonstrated that the boundary between "inside the container" and "on the host" is thinner than most people assume — it's maintained by kernel namespace isolation, and any process that crosses that boundary (like runc during exec) is a potential attack vector. (Source: NVD CVE-2019-5736, runc security advisory GHSA-f3fp-gc8g-vw66)
Escape 2: Privileged Mode — The Door You Left Open¶
--privileged disables almost every isolation mechanism:
- All capabilities granted (including CAP_SYS_ADMIN)
- Seccomp disabled
- AppArmor disabled
- Full access to host /dev
- Can mount the host filesystem
# Inside a --privileged container — full host access
mount /dev/sda1 /mnt
ls /mnt # host's root filesystem
chroot /mnt
# You are now effectively root on the host
Remember: --privileged is not "give the container more permissions." It's "remove all container isolation." The process is still in namespaces, but with all capabilities and no seccomp/AppArmor, it can escape trivially. There is almost never a legitimate reason to use --privileged in production.
Escape 3: The Docker Socket Mount¶
# Commonly seen in CI tools and monitoring agents
docker run -v /var/run/docker.sock:/var/run/docker.sock alpine sh
If a container has access to the Docker socket, it can:
# From inside the container — create a privileged container
apk add docker-cli
docker run --privileged -v /:/host alpine chroot /host
# Full host access via a container it launched
The Docker socket grants the same power as root on the host.
The Defense Stack¶
Each layer blocks different attacks:
| Defense | What it prevents |
|---|---|
| Namespaces | Seeing host processes, files, network |
| Cgroups | Consuming all host resources |
| Capabilities | Performing privileged kernel operations |
| Seccomp | Making dangerous syscalls (mount, reboot, kexec) |
| AppArmor/SELinux | Accessing specific files and resources |
| User namespaces | Being actual root on the host even if root in container |
| Read-only rootfs | Writing malicious binaries |
Part 7: Capabilities and Seccomp — Fine-Grained Controls¶
Linux Capabilities¶
Root privileges aren't all-or-nothing. Since Linux 2.2, they're split into ~40 capabilities:
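Inspect the current process's capability sets — this is where the output below comes from:

```shell
# The five capability bitmasks, as hex, from the process status file
grep Cap /proc/self/status
```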
CapInh: 0000000000000000
CapPrm: 00000000a80425fb
CapEff: 00000000a80425fb
CapBnd: 00000000a80425fb
CapAmb: 0000000000000000
Decode those hex values:
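capsh (from the libcap tools, e.g. the libcap2-bin package on Debian/Ubuntu) turns the bitmask into capability names:

```shell
# Prints the named capabilities encoded in the hex mask
capsh --decode=00000000a80425fb
```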
Docker's default set keeps ~14 capabilities and drops the rest. The dangerous ones that Docker drops by default:
| Capability | What it allows | Why it's dropped |
|---|---|---|
CAP_SYS_ADMIN |
Mount filesystems, set hostname, access many kernel interfaces | Near-root power — the "too broad" capability |
CAP_NET_ADMIN |
Configure network interfaces, iptables, routing | Could break host networking |
CAP_SYS_PTRACE |
Trace and debug other processes | Could read other containers' memory |
CAP_SYS_MODULE |
Load kernel modules | Game over — arbitrary kernel code |
CAP_SYS_RAWIO |
Raw I/O access | Direct hardware access |
Best practice — drop everything, add only what you need:
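For example, a web server that only needs to bind a low port:

```shell
# Start from zero capabilities, add back only NET_BIND_SERVICE
docker run -d --cap-drop=ALL --cap-add=NET_BIND_SERVICE nginx:1.25
```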
Trivia: CAP_SYS_ADMIN is sometimes called "the new root" because it grants such a broad set of privileges that having it effectively makes you root for most practical purposes. There have been repeated proposals to split it into smaller capabilities, but the kernel ABI stability guarantee makes this difficult. (Source: man 7 capabilities, lwn.net articles on capability splitting)
Seccomp Profiles¶
Seccomp filters restrict which system calls a process can make. Docker's default profile blocks approximately 44 of ~330 available syscalls:
# Run with default seccomp (automatic)
docker run nginx:1.25
# Run with NO seccomp (dangerous — for debugging only)
docker run --security-opt seccomp=unconfined nginx:1.25
# See what's blocked by default
docker run --rm -it alpine sh -c 'mount -t proc proc /tmp 2>&1'
# "mount: permission denied (are you root?)"
# The mount syscall is blocked by seccomp, even though you're root inside
Key blocked syscalls:
| Syscall | Why it's blocked |
|---|---|
mount, umount2 |
Could mount host filesystems |
reboot |
Could reboot the host |
kexec_load |
Could load a new kernel |
bpf |
Could attach eBPF programs to kernel |
ptrace |
Could trace other processes |
add_key, keyctl |
Could access kernel keyring |
Rootless Containers — The Best Defense¶
Rootless containers run the entire runtime stack as an unprivileged user:
# The Docker daemon itself runs without root
dockerd-rootless-setuptool.sh install
# Or use Podman, which is rootless by default
podman run -d nginx:1.25
How it works: User namespaces map UID 0 inside the container to your regular UID (e.g., 1000) on the host. Even if an attacker escapes the container, they land as an unprivileged user.
# Check the UID mapping
cat /proc/<container-pid>/uid_map
# 0 1000 65536
# Container UID 0 maps to host UID 1000
Limitations of rootless mode:
- Cannot bind to ports below 1024 (without lowering net.ipv4.ip_unprivileged_port_start)
- OverlayFS requires kernel 5.11+ in rootless mode (older kernels use FUSE-OverlayFS,
which is slower)
- Some storage drivers have restrictions
Under the Hood: Rootless Podman uses slirp4netns or pasta for networking instead of creating veth pairs (which requires CAP_NET_ADMIN). slirp4netns runs a userspace TCP/IP stack, routing container traffic through a TAP device without any kernel privilege. Since kernel 5.11, native OverlayFS supports user namespaces, eliminating the FUSE overhead. (Source: man slirp4netns, rootlesscontaine.rs)
Part 8: The /proc and /sys/fs/cgroup Interfaces¶
These two pseudo-filesystems are your debugging interface into namespaces and cgroups.
/proc/PID/ns/ — Namespace Membership¶
# List all namespaces on the system
lsns
# Check which namespaces a specific process belongs to
ls -la /proc/$(pgrep nginx | head -1)/ns/
# Compare two processes — same namespace?
readlink /proc/1/ns/net
readlink /proc/$(pgrep nginx | head -1)/ns/net
# Same inode = same namespace
/proc/PID/cgroup — Cgroup Membership¶
# What cgroup is this process in?
cat /proc/$(pgrep nginx | head -1)/cgroup
# v2: 0::/system.slice/docker-abc123.scope
# v1: 12:memory:/docker/abc123 (multiple lines)
/sys/fs/cgroup/ — Reading and Setting Limits¶
# Find a container's cgroup path
CGPATH=$(cat /proc/$(pgrep nginx | head -1)/cgroup | cut -d: -f3)
# Read current resource usage
cat /sys/fs/cgroup${CGPATH}/memory.current # bytes used now
cat /sys/fs/cgroup${CGPATH}/memory.max # hard limit
cat /sys/fs/cgroup${CGPATH}/memory.high # throttle point
cat /sys/fs/cgroup${CGPATH}/cpu.max # CPU quota
cat /sys/fs/cgroup${CGPATH}/cpu.stat # throttle stats
cat /sys/fs/cgroup${CGPATH}/pids.current # process count
cat /sys/fs/cgroup${CGPATH}/memory.events # OOM kill history
nsenter — The Debugging Swiss Army Knife¶
nsenter enters an existing process's namespaces — use host tools against a container's
view:
PID=$(docker inspect --format '{{.State.Pid}}' myapp)
# Enter network namespace — tcpdump without installing it in the container
sudo nsenter -t $PID -n tcpdump -i eth0 -nn -c 50 port 80
# Enter mount namespace — browse the container's filesystem
sudo nsenter -t $PID -m ls -la /app/
# Enter all namespaces — full "exec" equivalent from the host
sudo nsenter -t $PID -m -u -i -n -p -- /bin/sh
| nsenter flag | Namespace | Most common use |
|---|---|---|
| -n | Network | tcpdump, ss, ip addr, curl (95% of debugging) |
| -m | Mount | Browse container filesystem, check configs |
| -p | PID | See container's process tree |
| -u | UTS | Check hostname |
| -i | IPC | Debug shared memory issues |
Remember: nsenter flags map to namespace types: Network, Mount, PID. These three cover 95% of container debugging. You keep the host's tools but see the container's view.
Part 9: The Hands-On Exercises¶
Exercise 1: Prove Containers Are Processes (2 minutes)¶
Run a Docker container and find it in the host's process table:
Find the PID, check its namespace links, compare with PID 1.
Solution
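One possible solution — the container name `ex1` is illustrative:

```shell
docker run -d --name ex1 nginx:1.25
PID=$(docker inspect --format '{{.State.Pid}}' ex1)
ps -p "$PID" -o pid,cmd                           # a normal host process
sudo readlink /proc/1/ns/pid /proc/$PID/ns/pid    # compare namespace inodes
docker rm -f ex1
```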
The PID namespace inodes will be different — proving the container is isolated.
Exercise 2: Create a Memory-Limited Cgroup (5 minutes)¶
Create a cgroup that limits memory to 32MB. Run a process that tries to allocate 64MB.
Observe the OOM kill in memory.events.
Hints
1. `sudo mkdir /sys/fs/cgroup/exercise2`
2. Enable controllers on the parent: `echo "+memory" | sudo tee /sys/fs/cgroup/cgroup.subtree_control`
3. Write to `memory.max`
4. Move your shell to the cgroup with `echo $$ > cgroup.procs`
5. Try to allocate: `python3 -c "x = bytearray(64 * 1024 * 1024)"`
6. Check `memory.events` for the `oom_kill` count

Solution
echo "+memory" | sudo tee /sys/fs/cgroup/cgroup.subtree_control
sudo mkdir /sys/fs/cgroup/exercise2
echo 33554432 | sudo tee /sys/fs/cgroup/exercise2/memory.max
echo $$ | sudo tee /sys/fs/cgroup/exercise2/cgroup.procs
python3 -c "x = bytearray(64 * 1024 * 1024)"
# Killed
cat /sys/fs/cgroup/exercise2/memory.events
# oom_kill should be >= 1
# Clean up: move shell out first
echo $$ | sudo tee /sys/fs/cgroup/cgroup.procs
sudo rmdir /sys/fs/cgroup/exercise2
Exercise 3: Debug a Container with nsenter (10 minutes)¶
A container is running but you can't exec into it (imagine it's a distroless image with
no shell). Use nsenter from the host to:
- Check what ports it's listening on
- Capture 10 packets on its network interface
- Read its
/etc/resolv.conf
Solution
docker run -d --name ex3 nginx:1.25
PID=$(docker inspect --format '{{.State.Pid}}' ex3)
# 1. Listening ports (network namespace only)
sudo nsenter -t $PID -n ss -tlnp
# 2. Packet capture (network namespace only)
sudo nsenter -t $PID -n tcpdump -i eth0 -nn -c 10
# 3. Read resolv.conf (mount namespace)
sudo nsenter -t $PID -m cat /etc/resolv.conf
docker rm -f ex3
Exercise 4: Spot the Escape Vector (judgment call)¶
This docker-compose.yml is used by a CI tool. Find the security problem:
services:
ci-runner:
image: myorg/ci-runner:latest
volumes:
- /var/run/docker.sock:/var/run/docker.sock
- ./workspace:/workspace
privileged: false
cap_drop:
- ALL
Answer
The Docker socket mount (`/var/run/docker.sock`) gives the container full control of the Docker daemon. Even with `cap_drop: ALL`, the container can `docker run --privileged` to launch a new container with full host access. The socket mount is equivalent to root. Fix: Use a dedicated CI runner that doesn't need Docker socket access, or use Kaniko/Buildah for in-container image building.
Cheat Sheet¶
Namespace Commands¶
| Task | Command |
|---|---|
| Create namespaces | sudo unshare --pid --mount --net --uts --ipc --fork bash |
| Enter container namespaces | sudo nsenter -t $PID -m -u -i -n -p -- /bin/sh |
| Enter network ns only | sudo nsenter -t $PID -n <command> |
| List all namespaces | lsns |
| Check namespace membership | ls -la /proc/$PID/ns/ |
| Compare namespaces | readlink /proc/$PID1/ns/net vs readlink /proc/$PID2/ns/net |
Cgroup Commands (v2)¶
| Task | Command |
|---|---|
| Check v1 or v2 | stat -f --format=%T /sys/fs/cgroup/ |
| Find process cgroup | cat /proc/$PID/cgroup |
| Read memory limit | cat /sys/fs/cgroup/<path>/memory.max |
| Read memory usage | cat /sys/fs/cgroup/<path>/memory.current |
| Read CPU limit | cat /sys/fs/cgroup/<path>/cpu.max |
| Check throttle count | cat /sys/fs/cgroup/<path>/cpu.stat (nr_throttled) |
| Check OOM history | cat /sys/fs/cgroup/<path>/memory.events |
| System-wide pressure | cat /proc/pressure/{cpu,memory,io} |
| Per-service pressure | cat /sys/fs/cgroup/system.slice/<service>/cpu.pressure |
Docker → Linux Primitives Mapping¶
| Docker concept | Linux primitive |
|---|---|
| Container isolation | Namespaces (pid, net, mnt, uts, ipc) |
| Resource limits (--memory, --cpus) | cgroups (memory.max, cpu.max) |
| Image layers | OverlayFS (lower/upper/merged dirs) |
| Port mapping (-p 8080:80) | iptables DNAT rule |
| Container networking | veth pair + bridge (docker0) |
| --cap-drop / --cap-add | Linux capabilities |
| Seccomp profile | seccomp BPF filter |
| User namespace remap | CLONE_NEWUSER + /proc/PID/uid_map |
Security Hardening Quick Reference¶
| Hardening | Command / Config |
|---|---|
| Drop all capabilities | --cap-drop=ALL --cap-add=<needed> |
| Read-only rootfs | --read-only --tmpfs /tmp |
| No new privileges | --security-opt=no-new-privileges |
| Custom seccomp profile | --security-opt seccomp=profile.json |
| Non-root user | USER appuser in Dockerfile |
| Rootless mode | dockerd-rootless-setuptool.sh install |
| Never mount docker socket | Remove -v /var/run/docker.sock:... |
Takeaways¶
- Containers are processes, not VMs. They share the host kernel. The isolation comes from namespaces (visibility) and cgroups (resource limits), not from hardware virtualization. You can prove this with ps and /proc.
- You can build a container with five commands. unshare + mount + pivot_root + writing to cgroup files. Docker automates this, but the primitives are simple enough to use by hand.
- cgroups v2 is the future. Unified hierarchy, memory.high for graceful throttling, PSI for saturation monitoring. If you're still on v1, plan the migration.
- The security stack is layered and each layer matters. Namespaces, cgroups, capabilities, seccomp, AppArmor/SELinux, user namespaces. Removing any one layer (especially via --privileged) opens real attack paths.
- /proc/meminfo lies to containers. Applications that auto-tune from /proc will see host resources, not cgroup limits. Modern runtimes handle this; older apps need environment variables (OMP_NUM_THREADS, GOMAXPROCS).
- nsenter is the most powerful container debugging tool. It lets you use host tools against a container's namespace view. Learn -n (network), -m (mount), -p (PID).
Related Lessons¶
- The Container Escape — deep dive into container security and exploitation techniques
- The Hanging Deploy — PID namespaces, signals, and why containers won't stop gracefully
- The Proc Filesystem — everything /proc exposes about processes and the kernel
- What Happens When You Docker Build — image layers, build cache, and OverlayFS in detail
- Out of Memory — OOM killer, cgroup memory limits, and debugging memory pressure
- From Init Scripts to Systemd — systemd's role as cgroup manager and service supervisor
- Strace: Reading the Matrix — syscall tracing, which is how you see namespaces and cgroups being created