# Cgroups and Namespaces — Containers Are a Lie

- lesson
- linux-namespaces
- cgroups-v1/v2
- container-internals
- docker-architecture
- security
- l2

Topics: Linux namespaces, cgroups v1/v2, container internals, Docker architecture, security (capabilities, seccomp), rootless containers, /proc, /sys/fs/cgroup
Level: L2 (Operations)
Time: 75–90 minutes
Prerequisites: None — we start from scratch
The Mission¶
Build a container from scratch. No Docker. No containerd. No runc. Just you, a Linux kernel, and a handful of syscalls.
By the end you'll understand that a "container" is marketing — it's a regular Linux process wearing a disguise. Docker's genius wasn't inventing new technology. It was wrapping twenty years of kernel features in a CLI that made them feel like magic.
Here's the proof: after this lesson you'll create an isolated process with its own PID tree,
its own filesystem, its own network stack, and resource limits — using nothing but unshare,
mount, and writing numbers to files. Then you'll break out of it, because understanding
the escape is how you understand the wall.
Part 1: The Proof — Containers Are Just Processes¶
Before we build anything, let's settle the argument.
Start any Docker container:
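For example (any image works — the name `proof` matches the commands that follow):

```shell
# Start a long-running container named "proof"
docker run -d --name proof nginx:1.25
```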
Now look at it from the host:
# Get the container's PID on the host
PID=$(docker inspect --format '{{.State.Pid}}' proof)
echo "Container's host PID: $PID"
# It's a regular process in the host's process table
ps aux | grep $PID | grep -v grep
That nginx process lives in /proc/$PID like every other process. It has file descriptors,
memory maps, a cgroup path, and namespace links. The only thing that makes it "a container"
is what the kernel restricts it from seeing.
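The namespace links live in /proc/$PID/ns/ — listing them produces output like the snippet below:

```shell
# Each entry is a symlink whose target encodes the namespace type and inode
sudo ls -la /proc/$PID/ns/
```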
lrwxrwxrwx 1 root root 0 ... cgroup -> 'cgroup:[4026532419]'
lrwxrwxrwx 1 root root 0 ... ipc -> 'ipc:[4026532417]'
lrwxrwxrwx 1 root root 0 ... mnt -> 'mnt:[4026532415]'
lrwxrwxrwx 1 root root 0 ... net -> 'net:[4026532420]'
lrwxrwxrwx 1 root root 0 ... pid -> 'pid:[4026532418]'
lrwxrwxrwx 1 root root 0 ... user -> 'user:[4026531837]'
lrwxrwxrwx 1 root root 0 ... uts -> 'uts:[4026532416]'
Those inode numbers are namespace IDs. Compare them with PID 1 on the host:
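One way to compare, reusing the $PID from the docker inspect step above:

```shell
# Host init's namespaces vs. the container's — most inodes will differ
sudo readlink /proc/1/ns/pid /proc/$PID/ns/pid
sudo readlink /proc/1/ns/net /proc/$PID/ns/net
```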
Different inodes = different namespaces = different view of the system. Same kernel.
Mental Model: A container is a process with a restricted viewport. Namespaces control what it can see. Cgroups control how much it can use. Everything else — images, registries, orchestrators — is convenience tooling built on top.
Clean up before we build our own:
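Remove the demonstration container:

```shell
docker rm -f proof
```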
Part 2: Namespaces — Controlling What a Process Can See¶
The Eight Namespace Types¶
Linux has eight namespace types. Each one isolates a different kernel resource:
| Namespace | Flag | Isolates | Since |
|---|---|---|---|
| mnt | CLONE_NEWNS | Mount points — own filesystem view | 2.4.19 (2002) |
| pid | CLONE_NEWPID | Process IDs — own PID tree starting at 1 | 2.6.24 (2008) |
| net | CLONE_NEWNET | Network stack — interfaces, routes, iptables | 2.6.29 (2009) |
| uts | CLONE_NEWUTS | Hostname and NIS domain name | 2.6.19 (2006) |
| ipc | CLONE_NEWIPC | System V IPC, POSIX message queues | 2.6.19 (2006) |
| user | CLONE_NEWUSER | User/group IDs — UID 0 inside maps to unprivileged UID | 3.8 (2013) |
| cgroup | CLONE_NEWCGROUP | cgroup root directory — own hierarchy view | 4.6 (2016) |
| time | CLONE_NEWTIME | CLOCK_MONOTONIC and CLOCK_BOOTTIME offsets | 5.6 (2020) |
Name Origin: The mount namespace flag is CLONE_NEWNS — "new namespace" — because in 2002 it was the only namespace type. Nobody anticipated there'd be seven more. Every subsequent type got a descriptive flag (CLONE_NEWPID, CLONE_NEWNET), but the mount namespace is stuck with the generic name as a historical artifact. (Source: kernel commit history, man clone(2))
Creating Namespaces with unshare¶
unshare runs a program in new namespaces. Here's your first "container":
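A minimal invocation using the flags described below:

```shell
# New PID + mount namespaces; --fork makes bash a child so it can be PID 1
sudo unshare --pid --mount --fork bash
```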
| Flag | What it does |
|---|---|
| --pid | New PID namespace — this shell becomes PID 1 |
| --mount | New mount namespace — mounts won't affect the host |
| --fork | Fork before exec — required for PID namespaces |
Now remount /proc so ps reflects the new PID namespace:
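Inside the unshared shell:

```shell
# Give /proc a fresh mount that reflects the new PID namespace, then look around
mount -t proc proc /proc
ps aux
```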
You should see only a handful of processes. Your shell is PID 1. The host's thousands of processes are invisible.
Under the Hood: ps reads /proc to enumerate processes. Without remounting, /proc still shows the host's process list because the mount namespace inherited it. Remounting /proc inside the new PID namespace makes it reflect only the processes visible in that namespace.
Type exit to leave. You're back on the host.
The Network Namespace — Total Isolation¶
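Create one and inspect its interfaces — this produces the listing below:

```shell
sudo unshare --net bash
# Inside the new network namespace:
ip link
```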
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
Only loopback. No eth0. No connectivity. This is what every container starts with before
Docker creates a veth pair and plugs it into a bridge.
Exit and move on.
The UTS Namespace — Your Own Hostname¶
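A quick demonstration — the hostname `container-demo` is an arbitrary example:

```shell
sudo unshare --uts bash
# Inside: change the hostname — only this namespace sees it
hostname container-demo   # arbitrary example name
hostname
```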
Back on the host, hostname still shows the original name. The UTS namespace isolated it.
Trivia: UTS stands for "UNIX Time-sharing System." The name comes from the utsname struct in the kernel (returned by uname(2)), which dates back to the original UNIX. It has nothing to do with time zones — it's the hostname and domain name.
Flashcard Check — Namespaces¶
Cover the answers and test yourself:
| Question | Answer |
|---|---|
| How many namespace types does Linux have? | Eight: mnt, pid, net, uts, ipc, user, cgroup, time |
| Why is the mount namespace flag called CLONE_NEWNS? | It was the first and only namespace type in 2002 |
| What command creates new namespaces for a process? | unshare |
| What command enters an existing process's namespaces? | nsenter |
| Where are a process's namespace links stored? | /proc/PID/ns/ |
| How do you tell if two processes share a namespace? | Same inode number in /proc/PID/ns/<type> |
Part 3: Building a Container from Scratch¶
Time to build a real container. No Docker. Just Linux primitives.
Step 1: Create a Root Filesystem¶
Every container needs a filesystem. We'll use Alpine Linux — it's 7 MB:
mkdir -p /tmp/mycontainer/rootfs
cd /tmp/mycontainer
# Download and extract Alpine's minimal rootfs
curl -L -o alpine.tar.gz \
https://dl-cdn.alpinelinux.org/alpine/v3.19/releases/x86_64/alpine-minirootfs-3.19.1-x86_64.tar.gz
tar -xzf alpine.tar.gz -C rootfs
Step 2: Unshare All the Things¶
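Create all five namespaces in one shot (the same invocation appears in the cheat sheet at the end):

```shell
sudo unshare --pid --mount --net --uts --ipc --fork bash
```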
You're now in a new set of namespaces. But you're still looking at the host's filesystem.
Step 3: Mount the New Root¶
# Bind mount the rootfs to itself (required for pivot_root)
mount --bind /tmp/mycontainer/rootfs /tmp/mycontainer/rootfs
# Create a directory for the old root
mkdir -p /tmp/mycontainer/rootfs/oldroot
# Pivot the root filesystem
cd /tmp/mycontainer/rootfs
pivot_root . oldroot
| Command | What it does |
|---|---|
| mount --bind ... ... | Makes the directory a mount point (pivot_root requires this) |
| pivot_root . oldroot | Swaps the root filesystem — new root is ., old root moves to oldroot/ |
Under the Hood: pivot_root(2) is the syscall that container runtimes use (not chroot). Unlike chroot, which only changes the pathname resolution root, pivot_root actually moves the root mount. The old root can then be unmounted entirely, which chroot cannot do. This is why containers don't see the host filesystem — it's been unmounted, not just hidden. (Source: man 2 pivot_root)
Step 4: Set Up Essential Mounts¶
# Mount /proc for process visibility
mount -t proc proc /proc
# Mount /sys for kernel interfaces
mount -t sysfs sysfs /sys
# Mount /dev/pts for terminal support
mkdir -p /dev/pts
mount -t devpts devpts /dev/pts
# Unmount the old root — we don't need it anymore
umount -l /oldroot
rmdir /oldroot
Step 5: Set the Hostname¶
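The name `mycontainer` (matching the directory we created) is arbitrary:

```shell
# The UTS namespace means this doesn't touch the host's hostname
hostname mycontainer
```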
Step 6: Verify the Isolation¶
# Only our processes
ps aux
# Our own hostname
hostname
# Only loopback networking
ip link
# We're "root" inside
whoami
You just built a container. No Docker daemon. No containerd. No runc. The process thinks it has its own machine — its own PID tree, its own filesystem, its own hostname, its own network stack.
Interview Bridge: "Explain what happens when you run docker run" is a common interview question. The answer is: Docker asks runc to call clone() with namespace flags, then pivot_root() to swap the filesystem, then sets up cgroups and seccomp filters. Everything we just did by hand, plus resource limits and security profiles.
Exit with exit or Ctrl+D.
Flashcard Check — Building Containers¶
| Question | Answer |
|---|---|
| What syscall do container runtimes use to swap the root filesystem? | pivot_root(2) — not chroot |
| Why must you remount /proc after creating a new PID namespace? | ps reads /proc — without remounting, it shows the host's processes |
| What is the minimum set of namespaces for basic container isolation? | PID + mount + UTS + IPC + net (5 of 8) |
| Why does unshare --pid require --fork? | The calling process stays in the old PID namespace; only children enter the new one |
Part 4: Cgroups — Controlling How Much a Process Can Use¶
Namespaces control visibility. Cgroups control consumption. Without cgroups, your "container" from Part 3 could eat all the CPU, all the RAM, and all the disk I/O on the host.
cgroups v1 vs v2 — Why It Matters¶
Name Origin: cgroups were originally developed by Paul Menage and Rohit Seth at Google in 2006 under the name "process containers." The name was changed to "control groups" to avoid confusion with OS-level containers. Google used them to manage workloads on their fleet years before Docker existed. (Source: Linux kernel documentation,
Documentation/admin-guide/cgroup-v2.rst)
There are two versions, and the differences matter:
| Aspect | v1 | v2 |
|---|---|---|
| Hierarchy | One tree per controller (cpu, memory, blkio...) | Single unified tree |
| CPU limit file | cpu.cfs_quota_us / cpu.cfs_period_us | cpu.max ("quota period") |
| CPU weight file | cpu.shares (2–262144) | cpu.weight (1–10000) |
| Memory hard limit | memory.limit_in_bytes | memory.max |
| Memory soft limit | memory.soft_limit_in_bytes | memory.high (throttles, doesn't kill) |
| I/O limits | blkio.throttle.* | io.max |
| PSI (pressure info) | Not available | Built-in (*.pressure) |
| Default distros | RHEL 8, Ubuntu 20.04, Amazon Linux 2 | RHEL 9, Ubuntu 22.04+, Fedora 31+ |
Check which version you're running:
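The filesystem type of /sys/fs/cgroup tells you (the same check appears in the flashcards):

```shell
# cgroup2fs = v2 (unified hierarchy); tmpfs = v1 (per-controller trees)
stat -f --format=%T /sys/fs/cgroup/
```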
Remember: v1 = many trees, v2 = one tree. In v1, CPU lives in /sys/fs/cgroup/cpu/, memory in /sys/fs/cgroup/memory/, and a process can be in different places in each tree. In v2, everything is in one unified hierarchy under /sys/fs/cgroup/. This is why v2 is cleaner — you can't have a process limited to 512MB in the memory tree but unlimited in the cpu tree because someone forgot to add it there.
v2 in Practice — Setting Limits by Writing Files¶
cgroups are controlled by writing to files in /sys/fs/cgroup/. That's it. No daemon, no
API, no special tools. Just the filesystem.
# Create a cgroup
sudo mkdir /sys/fs/cgroup/demo
# Enable controllers for it
echo "+cpu +memory +pids" | sudo tee /sys/fs/cgroup/cgroup.subtree_control
# Set a memory limit: 64MB hard, 48MB soft (throttle point)
echo 67108864 | sudo tee /sys/fs/cgroup/demo/memory.max
echo 50331648 | sudo tee /sys/fs/cgroup/demo/memory.high
# Set a CPU limit: 25% of one core (25000µs per 100000µs period)
echo "25000 100000" | sudo tee /sys/fs/cgroup/demo/cpu.max
# Limit to 20 processes (fork bomb protection)
echo 20 | sudo tee /sys/fs/cgroup/demo/pids.max
# Move the current shell into this cgroup
echo $$ | sudo tee /sys/fs/cgroup/demo/cgroup.procs
| File | Format | Example | Effect |
|---|---|---|---|
| memory.max | bytes | 67108864 | Hard limit — OOM kill above this |
| memory.high | bytes | 50331648 | Throttle threshold — reclaim pressure starts |
| cpu.max | "quota period" (µs) | "25000 100000" | 25% of one core |
| cpu.weight | 1–10000 | 100 | Relative weight when competing |
| pids.max | count | 20 | Max processes in cgroup |
| io.max | "major:minor rbps=N wbps=N" | "8:0 rbps=10485760" | 10 MB/s read on device 8:0 |
Now test the memory limit:
# This will be killed by the cgroup OOM killer
python3 -c "x = bytearray(100 * 1024 * 1024)" # Try to allocate 100MB in a 64MB cgroup
The process gets killed. Check what happened:
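The counters live in the cgroup's memory.events file:

```shell
cat /sys/fs/cgroup/demo/memory.events
# e.g.:
# high 42
# oom_kill 1
```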
That's the audit trail. high 42 means memory.high was exceeded 42 times (throttling).
oom_kill 1 means one process was killed for exceeding memory.max.
Gotcha: memory.high is your friend. In v1, the only option was memory.limit_in_bytes — exceed it and you die. v2 added memory.high as a throttling threshold: the kernel reclaims memory aggressively and slows the process down before it hits the hard limit. Always set both. In Kubernetes on cgroups v2, limits sets memory.max; with the MemoryQoS feature enabled, requests also influence memory.min and memory.high.
Clean up:
# Move shell out of the cgroup first
echo $$ | sudo tee /sys/fs/cgroup/cgroup.procs
sudo rmdir /sys/fs/cgroup/demo
Pressure Stall Information (PSI) — v2 Only¶
PSI tells you how much a cgroup is suffering, not just how much it's using:
cat /proc/pressure/cpu
# some avg10=2.50 avg60=1.80 avg300=1.20 total=1234567890
cat /proc/pressure/memory
# some avg10=0.30 avg60=0.10 avg300=0.05 total=12345678
- some: at least one task was stalled waiting for this resource
- full: all tasks were stalled (severe)
- avg10/60/300: exponential moving averages over 10s, 60s, 300s
| Metric | Healthy | Warning | Action needed |
|---|---|---|---|
| cpu some avg60 | < 5% | 5–25% | > 25%: add CPU |
| memory some avg60 | < 5% | 5–20% | > 20%: add RAM |
| memory full avg60 | 0% | > 0% | > 5%: thrashing, urgent |
| io some avg60 | < 10% | 10–30% | > 30%: upgrade storage |
Mental Model: Traditional monitoring tells you utilization (80% CPU used). PSI tells you saturation (20% of the time, work is stalled waiting for CPU). A system can be at 90% utilization and 0% saturation (healthy) or at 60% utilization and 40% saturation (in trouble because of lock contention or throttling).
Flashcard Check — Cgroups¶
| Question | Answer |
|---|---|
| How do you check if your system runs cgroups v1 or v2? | stat -f --format=%T /sys/fs/cgroup/ — cgroup2fs = v2, tmpfs = v1 |
| What's the difference between memory.max and memory.high in v2? | memory.max = hard limit (OOM kill). memory.high = throttle threshold (slow down) |
| What does cpu.max "50000 100000" mean? | 50% of one core (50,000µs quota per 100,000µs period) |
| How do you add a process to a cgroup? | echo PID > /sys/fs/cgroup/<path>/cgroup.procs |
| What file shows cgroup OOM kill history? | memory.events |
| What does PSI measure that utilization doesn't? | Saturation — how much time work is stalled waiting for a resource |
Part 5: How Docker Assembles the Lie¶
When you run docker run -d -p 8080:80 --memory=512m --name web nginx:1.25, here's
what actually happens — mapped to the primitives you just learned:
docker CLI → REST API → dockerd → gRPC → containerd → containerd-shim → runc
runc does:
1. clone() with CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET | CLONE_NEWUTS | CLONE_NEWIPC
→ Creates 5+ new namespaces (what you did with unshare)
2. Assembles OverlayFS from image layers
→ Stacks read-only layers + writable upper layer into one filesystem view
3. pivot_root() to the assembled filesystem
→ Swaps the root (what you did with pivot_root)
4. Creates a cgroup and writes limits
→ echo 536870912 > memory.max (512MB)
→ echo $PID > cgroup.procs
5. Drops capabilities (removes CAP_SYS_ADMIN, etc.)
6. Applies seccomp filter (blocks ~44 dangerous syscalls)
7. Applies AppArmor/SELinux profile
8. exec() the entrypoint (nginx) as PID 1 in the new namespace set
Docker networking adds one more step:
9. Creates a veth pair
→ One end in the container's net namespace (becomes eth0)
→ Other end on the docker0 bridge on the host
10. Adds iptables DNAT rule
→ Host:8080 → Container-IP:80
Every single step maps to something you built by hand in Parts 2–4. Docker is automation, not magic.
The OverlayFS Layer Cake¶
Docker images are not disk images. They're stacked filesystem diffs:
┌─────────────────────────┐
│ Writable container layer│ ← Your runtime changes go here
├─────────────────────────┤
│ nginx config layer │ ← Read-only
├─────────────────────────┤
│ nginx binary layer │ ← Read-only
├─────────────────────────┤
│ Debian base layer │ ← Read-only
└─────────────────────────┘
↓
[Merged view via OverlayFS] ← What the container process sees
When the container modifies a file from a lower layer, OverlayFS copies the entire file
to the writable layer first (copy-on-write). Deleting a file creates a whiteout marker.
This is why RUN apt-get install && apt-get clean must be in the same Dockerfile layer.
# See the overlay mount for a running container
docker inspect --format '{{.GraphDriver.Data.MergedDir}}' web
# /var/lib/docker/overlay2/<hash>/merged
Gotcha:
/proc/meminfo and /proc/cpuinfo inside a container show the host's values, not the container's cgroup limits. A JVM reading /proc/meminfo sees 64GB of host RAM and sets its heap accordingly — inside a 512MB container. Modern JVMs (10+) enable -XX:+UseContainerSupport by default to read cgroup limits instead. For Go, use go.uber.org/automaxprocs. Python's os.cpu_count() still returns the host CPU count as of 3.12. (Source: JDK-8146115, Go issue #33803)
Part 6: Container Escapes — Breaking the Walls¶
Understanding escapes is understanding the walls. If you can't explain how a container is escaped, you don't understand how it's contained.
Escape 1: The /proc/self/exe Overwrite (CVE-2019-5736)¶
This is the most famous container escape. It affected runc < 1.0.0-rc6.
The vulnerability: A malicious container could overwrite the host's runc binary by
exploiting how Linux handles /proc/self/exe.
How it worked:
- When docker exec runs, runc enters the container's namespaces and then exec()s the requested command
- During a brief window, runc's process is inside the container's namespaces but /proc/self/exe still points to the host's runc binary
- The attacker opens /proc/self/exe (which resolves to the host's /usr/bin/runc) with O_WRONLY and overwrites it with a malicious binary
- Next time anyone runs docker exec or docker run, the malicious runc executes as root on the host
The fix: runc now re-executes itself into a cloned binary (memfd), so /proc/self/exe
never points to the on-disk binary during container operations.
War Story: CVE-2019-5736 was disclosed in February 2019 by Adam Iwaniuk and Borys Popławski. It affected Docker, Kubernetes, and any system using runc. The severity was scored CVSS 8.6. It demonstrated that the boundary between "inside the container" and "on the host" is thinner than most people assume — it's maintained by kernel namespace isolation, and any process that crosses that boundary (like runc during exec) is a potential attack vector. (Source: NVD CVE-2019-5736, runc security advisory GHSA-f3fp-gc8g-vw66)
Escape 2: Privileged Mode — The Door You Left Open¶
--privileged disables almost every isolation mechanism:
- All capabilities granted (including CAP_SYS_ADMIN)
- Seccomp disabled
- AppArmor disabled
- Full access to host /dev
- Can mount the host filesystem
# Inside a --privileged container — full host access
mount /dev/sda1 /mnt
ls /mnt # host's root filesystem
chroot /mnt
# You are now effectively root on the host
Remember: --privileged is not "give the container more permissions." It's "remove all container isolation." The process is still in namespaces, but with all capabilities and no seccomp/AppArmor, it can escape trivially. There is almost never a legitimate reason to use --privileged in production.
Escape 3: The Docker Socket Mount¶
# Commonly seen in CI tools and monitoring agents
docker run -v /var/run/docker.sock:/var/run/docker.sock alpine sh
If a container has access to the Docker socket, it can:
# From inside the container — create a privileged container
apk add docker-cli
docker run --privileged -v /:/host alpine chroot /host
# Full host access via a container it launched
The Docker socket grants the same power as root on the host.
The Defense Stack¶
Each layer blocks different attacks:
| Defense | What it prevents |
|---|---|
| Namespaces | Seeing host processes, files, network |
| Cgroups | Consuming all host resources |
| Capabilities | Performing privileged kernel operations |
| Seccomp | Making dangerous syscalls (mount, reboot, kexec) |
| AppArmor/SELinux | Accessing specific files and resources |
| User namespaces | Being actual root on the host even if root in container |
| Read-only rootfs | Writing malicious binaries |
Part 7: Capabilities and Seccomp — Fine-Grained Controls¶
Linux Capabilities¶
Root privileges aren't all-or-nothing. Since Linux 2.2, they're split into ~40 capabilities:
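Inspect the current process's capability sets — this is where the output below comes from:

```shell
# The five capability bitmasks, as hex, from the process status file
grep Cap /proc/self/status
```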
CapInh: 0000000000000000
CapPrm: 00000000a80425fb
CapEff: 00000000a80425fb
CapBnd: 00000000a80425fb
CapAmb: 0000000000000000
Decode those hex values:
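capsh (from the libcap tools, e.g. the libcap2-bin package on Debian/Ubuntu) turns the bitmask into capability names:

```shell
# Prints the named capabilities encoded in the hex mask
capsh --decode=00000000a80425fb
```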
Docker's default set keeps ~14 capabilities and drops the rest. The dangerous ones that Docker drops by default:
| Capability | What it allows | Why it's dropped |
|---|---|---|
CAP_SYS_ADMIN |
Mount filesystems, set hostname, access many kernel interfaces | Near-root power — the "too broad" capability |
CAP_NET_ADMIN |
Configure network interfaces, iptables, routing | Could break host networking |
CAP_SYS_PTRACE |
Trace and debug other processes | Could read other containers' memory |
CAP_SYS_MODULE |
Load kernel modules | Game over — arbitrary kernel code |
CAP_SYS_RAWIO |
Raw I/O access | Direct hardware access |
Best practice — drop everything, add only what you need:
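For example, a web server that only needs to bind a low port:

```shell
# Start from zero capabilities, add back only NET_BIND_SERVICE
docker run -d --cap-drop=ALL --cap-add=NET_BIND_SERVICE nginx:1.25
```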
Trivia: CAP_SYS_ADMIN is sometimes called "the new root" because it grants such a broad set of privileges that having it effectively makes you root for most practical purposes. There have been repeated proposals to split it into smaller capabilities, but the kernel ABI stability guarantee makes this difficult. (Source: man 7 capabilities, lwn.net articles on capability splitting)
Seccomp Profiles¶
Seccomp filters restrict which system calls a process can make. Docker's default profile blocks approximately 44 of ~330 available syscalls:
# Run with default seccomp (automatic)
docker run nginx:1.25
# Run with NO seccomp (dangerous — for debugging only)
docker run --security-opt seccomp=unconfined nginx:1.25
# See what's blocked by default
docker run --rm -it alpine sh -c 'mount -t proc proc /tmp 2>&1'
# "mount: permission denied (are you root?)"
# The mount syscall is blocked by seccomp, even though you're root inside
Key blocked syscalls:
| Syscall | Why it's blocked |
|---|---|
mount, umount2 |
Could mount host filesystems |
reboot |
Could reboot the host |
kexec_load |
Could load a new kernel |
bpf |
Could attach eBPF programs to kernel |
ptrace |
Could trace other processes |
add_key, keyctl |
Could access kernel keyring |
Rootless Containers — The Best Defense¶
Rootless containers run the entire runtime stack as an unprivileged user:
# The Docker daemon itself runs without root
dockerd-rootless-setuptool.sh install
# Or use Podman, which is rootless by default
podman run -d nginx:1.25
How it works: User namespaces map UID 0 inside the container to your regular UID (e.g., 1000) on the host. Even if an attacker escapes the container, they land as an unprivileged user.
# Check the UID mapping
cat /proc/<container-pid>/uid_map
# 0 1000 65536
# Container UID 0 maps to host UID 1000
Limitations of rootless mode:
- Cannot bind to ports below 1024 (without lowering net.ipv4.ip_unprivileged_port_start)
- OverlayFS requires kernel 5.11+ in rootless mode (older kernels use FUSE-OverlayFS,
which is slower)
- Some storage drivers have restrictions
Under the Hood: Rootless Podman uses slirp4netns or pasta for networking instead of creating veth pairs (which requires CAP_NET_ADMIN). slirp4netns runs a userspace TCP/IP stack, routing container traffic through a TAP device without any kernel privilege. Since kernel 5.11, native OverlayFS supports user namespaces, eliminating the FUSE overhead. (Source: man slirp4netns, rootlesscontaine.rs)
Part 8: The /proc and /sys/fs/cgroup Interfaces¶
These two pseudo-filesystems are your debugging interface into namespaces and cgroups.
/proc/PID/ns/ — Namespace Membership¶
# List all namespaces on the system
lsns
# Check which namespaces a specific process belongs to
ls -la /proc/$(pgrep nginx | head -1)/ns/
# Compare two processes — same namespace?
readlink /proc/1/ns/net
readlink /proc/$(pgrep nginx | head -1)/ns/net
# Same inode = same namespace
/proc/PID/cgroup — Cgroup Membership¶
# What cgroup is this process in?
cat /proc/$(pgrep nginx | head -1)/cgroup
# v2: 0::/system.slice/docker-abc123.scope
# v1: 12:memory:/docker/abc123 (multiple lines)
/sys/fs/cgroup/ — Reading and Setting Limits¶
# Find a container's cgroup path
CGPATH=$(cat /proc/$(pgrep nginx | head -1)/cgroup | cut -d: -f3)
# Read current resource usage
cat /sys/fs/cgroup${CGPATH}/memory.current # bytes used now
cat /sys/fs/cgroup${CGPATH}/memory.max # hard limit
cat /sys/fs/cgroup${CGPATH}/memory.high # throttle point
cat /sys/fs/cgroup${CGPATH}/cpu.max # CPU quota
cat /sys/fs/cgroup${CGPATH}/cpu.stat # throttle stats
cat /sys/fs/cgroup${CGPATH}/pids.current # process count
cat /sys/fs/cgroup${CGPATH}/memory.events # OOM kill history
nsenter — The Debugging Swiss Army Knife¶
nsenter enters an existing process's namespaces — use host tools against a container's
view:
PID=$(docker inspect --format '{{.State.Pid}}' myapp)
# Enter network namespace — tcpdump without installing it in the container
sudo nsenter -t $PID -n tcpdump -i eth0 -nn -c 50 port 80
# Enter mount namespace — browse the container's filesystem
sudo nsenter -t $PID -m ls -la /app/
# Enter all namespaces — full "exec" equivalent from the host
sudo nsenter -t $PID -m -u -i -n -p -- /bin/sh
| nsenter flag | Namespace | Most common use |
|---|---|---|
| -n | Network | tcpdump, ss, ip addr, curl (95% of debugging) |
| -m | Mount | Browse container filesystem, check configs |
| -p | PID | See container's process tree |
| -u | UTS | Check hostname |
| -i | IPC | Debug shared memory issues |
Remember: nsenter flags map to namespace types: Network, Mount, PID. These three cover 95% of container debugging. You keep the host's tools but see the container's view.
Part 9: The Hands-On Exercises¶
Exercise 1: Prove Containers Are Processes (2 minutes)¶
Run a Docker container and find it in the host's process table:
Find the PID, check its namespace links, compare with PID 1.
Solution
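One possible solution — the container name `ex1` is illustrative:

```shell
docker run -d --name ex1 nginx:1.25
PID=$(docker inspect --format '{{.State.Pid}}' ex1)
ps -p "$PID" -o pid,cmd                           # a normal host process
sudo readlink /proc/1/ns/pid /proc/$PID/ns/pid    # compare namespace inodes
docker rm -f ex1
```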
The PID namespace inodes will be different — proving the container is isolated.
Exercise 2: Create a Memory-Limited Cgroup (5 minutes)¶
Create a cgroup that limits memory to 32MB. Run a process that tries to allocate 64MB.
Observe the OOM kill in memory.events.
Hints
1. `sudo mkdir /sys/fs/cgroup/exercise2`
2. Enable controllers on the parent: `echo "+memory" | sudo tee /sys/fs/cgroup/cgroup.subtree_control`
3. Write to `memory.max`
4. Move your shell to the cgroup with `echo $$ > cgroup.procs`
5. Try to allocate: `python3 -c "x = bytearray(64 * 1024 * 1024)"`
6. Check `memory.events` for the `oom_kill` count

Solution
echo "+memory" | sudo tee /sys/fs/cgroup/cgroup.subtree_control
sudo mkdir /sys/fs/cgroup/exercise2
echo 33554432 | sudo tee /sys/fs/cgroup/exercise2/memory.max
echo $$ | sudo tee /sys/fs/cgroup/exercise2/cgroup.procs
python3 -c "x = bytearray(64 * 1024 * 1024)"
# Killed
cat /sys/fs/cgroup/exercise2/memory.events
# oom_kill should be >= 1
# Clean up: move shell out first
echo $$ | sudo tee /sys/fs/cgroup/cgroup.procs
sudo rmdir /sys/fs/cgroup/exercise2
Exercise 3: Debug a Container with nsenter (10 minutes)¶
A container is running but you can't exec into it (imagine it's a distroless image with
no shell). Use nsenter from the host to:
- Check what ports it's listening on
- Capture 10 packets on its network interface
- Read its
/etc/resolv.conf
Solution
docker run -d --name ex3 nginx:1.25
PID=$(docker inspect --format '{{.State.Pid}}' ex3)
# 1. Listening ports (network namespace only)
sudo nsenter -t $PID -n ss -tlnp
# 2. Packet capture (network namespace only)
sudo nsenter -t $PID -n tcpdump -i eth0 -nn -c 10
# 3. Read resolv.conf (mount namespace)
sudo nsenter -t $PID -m cat /etc/resolv.conf
docker rm -f ex3
Exercise 4: Spot the Escape Vector (judgment call)¶
This docker-compose.yml is used by a CI tool. Find the security problem:
services:
ci-runner:
image: myorg/ci-runner:latest
volumes:
- /var/run/docker.sock:/var/run/docker.sock
- ./workspace:/workspace
privileged: false
cap_drop:
- ALL
Answer
The Docker socket mount (`/var/run/docker.sock`) gives the container full control of the Docker daemon. Even with `cap_drop: ALL`, the container can `docker run --privileged` to launch a new container with full host access. The socket mount is equivalent to root. Fix: Use a dedicated CI runner that doesn't need Docker socket access, or use Kaniko/Buildah for in-container image building.
Cheat Sheet¶
Namespace Commands¶
| Task | Command |
|---|---|
| Create namespaces | sudo unshare --pid --mount --net --uts --ipc --fork bash |
| Enter container namespaces | sudo nsenter -t $PID -m -u -i -n -p -- /bin/sh |
| Enter network ns only | sudo nsenter -t $PID -n <command> |
| List all namespaces | lsns |
| Check namespace membership | ls -la /proc/$PID/ns/ |
| Compare namespaces | readlink /proc/$PID1/ns/net vs readlink /proc/$PID2/ns/net |
Cgroup Commands (v2)¶
| Task | Command |
|---|---|
| Check v1 or v2 | stat -f --format=%T /sys/fs/cgroup/ |
| Find process cgroup | cat /proc/$PID/cgroup |
| Read memory limit | cat /sys/fs/cgroup/<path>/memory.max |
| Read memory usage | cat /sys/fs/cgroup/<path>/memory.current |
| Read CPU limit | cat /sys/fs/cgroup/<path>/cpu.max |
| Check throttle count | cat /sys/fs/cgroup/<path>/cpu.stat (nr_throttled) |
| Check OOM history | cat /sys/fs/cgroup/<path>/memory.events |
| System-wide pressure | cat /proc/pressure/{cpu,memory,io} |
| Per-service pressure | cat /sys/fs/cgroup/system.slice/<service>/cpu.pressure |
Docker → Linux Primitives Mapping¶
| Docker concept | Linux primitive |
|---|---|
| Container isolation | Namespaces (pid, net, mnt, uts, ipc) |
| Resource limits (--memory, --cpus) | cgroups (memory.max, cpu.max) |
| Image layers | OverlayFS (lower/upper/merged dirs) |
| Port mapping (-p 8080:80) | iptables DNAT rule |
| Container networking | veth pair + bridge (docker0) |
| --cap-drop / --cap-add | Linux capabilities |
| Seccomp profile | seccomp BPF filter |
| User namespace remap | CLONE_NEWUSER + /proc/PID/uid_map |
Security Hardening Quick Reference¶
| Hardening | Command / Config |
|---|---|
| Drop all capabilities | --cap-drop=ALL --cap-add=<needed> |
| Read-only rootfs | --read-only --tmpfs /tmp |
| No new privileges | --security-opt=no-new-privileges |
| Custom seccomp profile | --security-opt seccomp=profile.json |
| Non-root user | USER appuser in Dockerfile |
| Rootless mode | dockerd-rootless-setuptool.sh install |
| Never mount docker socket | Remove -v /var/run/docker.sock:... |
Takeaways¶
- Containers are processes, not VMs. They share the host kernel. The isolation comes from namespaces (visibility) and cgroups (resource limits), not from hardware virtualization. You can prove this with ps and /proc.
- You can build a container with five commands. unshare + mount + pivot_root + writing to cgroup files. Docker automates this, but the primitives are simple enough to use by hand.
- cgroups v2 is the future. Unified hierarchy, memory.high for graceful throttling, PSI for saturation monitoring. If you're still on v1, plan the migration.
- The security stack is layered and each layer matters. Namespaces, cgroups, capabilities, seccomp, AppArmor/SELinux, user namespaces. Removing any one layer (especially via --privileged) opens real attack paths.
- /proc/meminfo lies to containers. Applications that auto-tune from /proc will see host resources, not cgroup limits. Modern runtimes handle this; older apps need environment variables (OMP_NUM_THREADS, GOMAXPROCS).
- nsenter is the most powerful container debugging tool. It lets you use host tools against a container's namespace view. Learn -n (network), -m (mount), -p (PID).
Related Lessons¶
- The Container Escape — deep dive into container security and exploitation techniques
- The Hanging Deploy — PID namespaces, signals, and why containers won't stop gracefully
- The Proc Filesystem — everything /proc exposes about processes and the kernel
- What Happens When You Docker Build — image layers, build cache, and OverlayFS in detail
- Out of Memory — OOM killer, cgroup memory limits, and debugging memory pressure
- From Init Scripts to Systemd — systemd's role as cgroup manager and service supervisor
- Strace: Reading the Matrix — syscall tracing, which is how you see namespaces and cgroups being created