
cgroups and Namespaces: Containers Are a Lie


Topics: Linux namespaces, cgroups v1/v2, container internals, Docker architecture, security (capabilities, seccomp), rootless containers, /proc, /sys/fs/cgroup
Level: L2 (Operations)
Time: 75–90 minutes
Prerequisites: None — we start from scratch


The Mission

Build a container from scratch. No Docker. No containerd. No runc. Just you, a Linux kernel, and a handful of syscalls.

By the end you'll understand that a "container" is marketing — it's a regular Linux process wearing a disguise. Docker's genius wasn't inventing new technology. It was wrapping twenty years of kernel features in a CLI that made them feel like magic.

Here's the proof: after this lesson you'll create an isolated process with its own PID tree, its own filesystem, its own network stack, and resource limits — using nothing but unshare, mount, and writing numbers to files. Then you'll break out of it, because understanding the escape is how you understand the wall.


Part 1: The Proof — Containers Are Just Processes

Before we build anything, let's settle the argument.

Start any Docker container:

docker run -d --name proof nginx:1.25

Now look at it from the host:

# Get the container's PID on the host
PID=$(docker inspect --format '{{.State.Pid}}' proof)
echo "Container's host PID: $PID"

# It's a regular process in the host's process table
ps aux | grep $PID | grep -v grep

That nginx process lives in /proc/$PID like every other process. It has file descriptors, memory maps, a cgroup path, and namespace links. The only thing that makes it "a container" is what the kernel restricts it from seeing.

# The namespace trick — every process has these
ls -la /proc/$PID/ns/
lrwxrwxrwx 1 root root 0 ... cgroup -> 'cgroup:[4026532419]'
lrwxrwxrwx 1 root root 0 ... ipc    -> 'ipc:[4026532417]'
lrwxrwxrwx 1 root root 0 ... mnt    -> 'mnt:[4026532415]'
lrwxrwxrwx 1 root root 0 ... net    -> 'net:[4026532420]'
lrwxrwxrwx 1 root root 0 ... pid    -> 'pid:[4026532418]'
lrwxrwxrwx 1 root root 0 ... user   -> 'user:[4026531837]'
lrwxrwxrwx 1 root root 0 ... uts    -> 'uts:[4026532416]'

Those inode numbers are namespace IDs. Compare them with PID 1 on the host:

ls -la /proc/1/ns/ | awk '{print $NF}'
ls -la /proc/$PID/ns/ | awk '{print $NF}'

Different inodes = different namespaces = different view of the system. Same kernel.
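That comparison is scriptable. A minimal helper (a sketch; `same_ns` is a name invented here) that answers "do these two PIDs share a namespace?" by comparing the symlink targets:

```shell
# same_ns: do two PIDs share a namespace of the given type?
# Same symlink target under /proc/PID/ns/ means same namespace.
same_ns() {  # usage: same_ns <pid1> <pid2> <mnt|pid|net|uts|ipc|user|cgroup|time>
  [ "$(readlink "/proc/$1/ns/$3")" = "$(readlink "/proc/$2/ns/$3")" ]
}

# A shell trivially shares every namespace with itself:
same_ns $$ $$ net && echo "same net namespace"
```

Run it with the container PID from above and PID 1 and it returns false for every type except user (Docker doesn't create a user namespace by default).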

Mental Model: A container is a process with a restricted viewport. Namespaces control what it can see. Cgroups control how much it can use. Everything else — images, registries, orchestrators — is convenience tooling built on top.

Clean up before we build our own:

docker rm -f proof

Part 2: Namespaces — Controlling What a Process Can See

The Eight Namespace Types

Linux has eight namespace types. Each one isolates a different kernel resource:

| Namespace | Flag | Isolates | Since |
|---|---|---|---|
| mnt | CLONE_NEWNS | Mount points — own filesystem view | 2.4.19 (2002) |
| pid | CLONE_NEWPID | Process IDs — own PID tree starting at 1 | 2.6.24 (2008) |
| net | CLONE_NEWNET | Network stack — interfaces, routes, iptables | 2.6.29 (2009) |
| uts | CLONE_NEWUTS | Hostname and NIS domain name | 2.6.19 (2006) |
| ipc | CLONE_NEWIPC | System V IPC, POSIX message queues | 2.6.19 (2006) |
| user | CLONE_NEWUSER | User/group IDs — UID 0 inside maps to unprivileged UID | 3.8 (2013) |
| cgroup | CLONE_NEWCGROUP | cgroup root directory — own hierarchy view | 4.6 (2016) |
| time | CLONE_NEWTIME | CLOCK_MONOTONIC and CLOCK_BOOTTIME offsets | 5.6 (2020) |

Name Origin: The mount namespace flag is CLONE_NEWNS — "new namespace" — because in 2002 it was the only namespace type. Nobody anticipated there'd be seven more. Every subsequent type got a descriptive flag (CLONE_NEWPID, CLONE_NEWNET), but the mount namespace is stuck with the generic name as a historical artifact. (Source: kernel commit history, man clone(2))

Creating Namespaces with unshare

unshare runs a program in new namespaces. Here's your first "container":

# New PID namespace + new mount namespace
sudo unshare --pid --mount --fork bash

| Flag | What it does |
|---|---|
| --pid | New PID namespace — this shell becomes PID 1 |
| --mount | New mount namespace — mounts won't affect the host |
| --fork | Fork before exec — required for PID namespaces |

Now remount /proc so ps reflects the new PID namespace:

mount -t proc proc /proc
ps aux

You should see only a handful of processes. Your shell is PID 1. The host's thousands of processes are invisible.

Under the Hood: ps reads /proc to enumerate processes. Without remounting, /proc still shows the host's process list because the mount namespace inherited it. Remounting /proc inside the new PID namespace makes it reflect only the processes visible in that namespace.
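You can see this without ps at all. Process enumeration really is just a directory listing (a quick sanity check, not part of the build):

```shell
# ps is a readdir in disguise: every numeric entry in /proc is a PID.
# Inside the new PID namespace (with /proc remounted), this count drops
# to the handful of processes you saw above.
ls /proc | grep -c '^[0-9][0-9]*$'
```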

Type exit to leave. You're back on the host.

The Network Namespace — Total Isolation

sudo unshare --net bash
ip link
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

Only loopback. No eth0. No connectivity. This is what every container starts with before Docker creates a veth pair and plugs it into a bridge.

Exit and move on.

The UTS Namespace — Your Own Hostname

sudo unshare --uts bash
hostname container-from-scratch
hostname
container-from-scratch

Back on the host, hostname still shows the original name. The UTS namespace isolated it.

Trivia: UTS stands for "UNIX Time-sharing System." The name comes from the utsname struct in the kernel (returned by uname(2)), which dates back to the original UNIX. It has nothing to do with time zones — it's the hostname and domain name.


Flashcard Check — Namespaces

Cover the answers and test yourself:

| Question | Answer |
|---|---|
| How many namespace types does Linux have? | Eight: mnt, pid, net, uts, ipc, user, cgroup, time |
| Why is the mount namespace flag called CLONE_NEWNS? | It was the first and only namespace type in 2002 |
| What command creates new namespaces for a process? | unshare |
| What command enters an existing process's namespaces? | nsenter |
| Where are a process's namespace links stored? | /proc/PID/ns/ |
| How do you tell if two processes share a namespace? | Same inode number in /proc/PID/ns/<type> |

Part 3: Building a Container from Scratch

Time to build a real container. No Docker. Just Linux primitives.

Step 1: Create a Root Filesystem

Every container needs a filesystem. We'll use Alpine Linux — it's 7 MB:

mkdir -p /tmp/mycontainer/rootfs
cd /tmp/mycontainer

# Download and extract Alpine's minimal rootfs
curl -L -o alpine.tar.gz \
  https://dl-cdn.alpinelinux.org/alpine/v3.19/releases/x86_64/alpine-minirootfs-3.19.1-x86_64.tar.gz
tar -xzf alpine.tar.gz -C rootfs

Step 2: Unshare All the Things

sudo unshare \
  --pid \
  --mount \
  --net \
  --uts \
  --ipc \
  --fork \
  /bin/bash

You're now in a new set of namespaces. But you're still looking at the host's filesystem.

Step 3: Mount the New Root

# Bind mount the rootfs to itself (required for pivot_root)
mount --bind /tmp/mycontainer/rootfs /tmp/mycontainer/rootfs

# Create a directory for the old root
mkdir -p /tmp/mycontainer/rootfs/oldroot

# Pivot the root filesystem
cd /tmp/mycontainer/rootfs
pivot_root . oldroot

| Command | What it does |
|---|---|
| mount --bind ... ... | Makes the directory a mount point (pivot_root requires this) |
| pivot_root . oldroot | Swaps the root filesystem — new root is `.`, old root moves to oldroot/ |

Under the Hood: pivot_root(2) is the syscall that container runtimes use (not chroot). Unlike chroot, which only changes the pathname resolution root, pivot_root actually moves the root mount. The old root can then be unmounted entirely, which chroot cannot do. This is why containers don't see the host filesystem — it's been unmounted, not just hidden. (Source: man 2 pivot_root)

Step 4: Set Up Essential Mounts

# Mount /proc for process visibility
mount -t proc proc /proc

# Mount /sys for kernel interfaces
mount -t sysfs sysfs /sys

# Mount /dev/pts for terminal support
mkdir -p /dev/pts
mount -t devpts devpts /dev/pts

# Unmount the old root — we don't need it anymore
umount -l /oldroot
rmdir /oldroot

Step 5: Set the Hostname

hostname my-container

Step 6: Verify the Isolation

# Only our processes
ps aux

# Our own hostname
hostname

# Only loopback networking
ip link

# We're "root" inside
whoami

You just built a container. No Docker daemon. No containerd. No runc. The process thinks it has its own machine — its own PID tree, its own filesystem, its own hostname, its own network stack.

Interview Bridge: "Explain what happens when you run docker run" is a common interview question. The answer is: Docker asks runc to call clone() with namespace flags, then pivot_root() to swap the filesystem, then sets up cgroups and seccomp filters. Everything we just did by hand, plus resource limits and security profiles.

Exit with exit or Ctrl+D.


Flashcard Check — Building Containers

| Question | Answer |
|---|---|
| What syscall do container runtimes use to swap the root filesystem? | pivot_root(2) — not chroot |
| Why must you remount /proc after creating a new PID namespace? | ps reads /proc — without remounting, it shows the host's processes |
| What is the minimum set of namespaces for basic container isolation? | PID + mount + UTS + IPC + net (5 of 8) |
| Why does unshare --pid require --fork? | The calling process stays in the old PID namespace; only children enter the new one |

Part 4: Cgroups — Controlling How Much a Process Can Use

Namespaces control visibility. Cgroups control consumption. Without cgroups, your "container" from Part 3 could eat all the CPU, all the RAM, and all the disk I/O on the host.

cgroups v1 vs v2 — Why It Matters

Name Origin: cgroups were originally developed by Paul Menage and Rohit Seth at Google in 2006 under the name "process containers." The name was changed to "control groups" to avoid confusion with OS-level containers. Google used them to manage workloads on their fleet years before Docker existed. (Source: Linux kernel documentation, Documentation/admin-guide/cgroup-v2.rst)

There are two versions, and the differences matter:

| Aspect | v1 | v2 |
|---|---|---|
| Hierarchy | One tree per controller (cpu, memory, blkio...) | Single unified tree |
| CPU limit file | cpu.cfs_quota_us / cpu.cfs_period_us | cpu.max ("quota period") |
| CPU weight file | cpu.shares (2–262144) | cpu.weight (1–10000) |
| Memory hard limit | memory.limit_in_bytes | memory.max |
| Memory soft limit | memory.soft_limit_in_bytes | memory.high (throttles, doesn't kill) |
| I/O limits | blkio.throttle.* | io.max |
| PSI (pressure info) | Not available | Built-in (*.pressure) |
| Default distros | RHEL 8, Ubuntu 20.04, Amazon Linux 2 | RHEL 9, Ubuntu 22.04+, Fedora 31+ |

Check which version you're running:

stat -f --format=%T /sys/fs/cgroup/
# "cgroup2fs" = v2
# "tmpfs"     = v1

Remember: v1 = many trees, v2 = one tree. In v1, CPU lives in /sys/fs/cgroup/cpu/, memory in /sys/fs/cgroup/memory/, and a process can be in different places in each tree. In v2, everything is in one unified hierarchy under /sys/fs/cgroup/. This is why v2 is cleaner — you can't have a process limited to 512MB in the memory tree but unlimited in the cpu tree because someone forgot to add it there.

v2 in Practice — Setting Limits by Writing Files

cgroups are controlled by writing to files in /sys/fs/cgroup/. That's it. No daemon, no API, no special tools. Just the filesystem.

# Create a cgroup
sudo mkdir /sys/fs/cgroup/demo

# Enable controllers for it
echo "+cpu +memory +pids" | sudo tee /sys/fs/cgroup/cgroup.subtree_control

# Set a memory limit: 64MB hard, 48MB soft (throttle point)
echo 67108864 | sudo tee /sys/fs/cgroup/demo/memory.max
echo 50331648 | sudo tee /sys/fs/cgroup/demo/memory.high

# Set a CPU limit: 25% of one core (25000µs per 100000µs period)
echo "25000 100000" | sudo tee /sys/fs/cgroup/demo/cpu.max

# Limit to 20 processes (fork bomb protection)
echo 20 | sudo tee /sys/fs/cgroup/demo/pids.max

# Move the current shell into this cgroup
echo $$ | sudo tee /sys/fs/cgroup/demo/cgroup.procs

| File | Format | Example | Effect |
|---|---|---|---|
| memory.max | bytes | 67108864 | Hard limit — OOM kill above this |
| memory.high | bytes | 50331648 | Throttle threshold — reclaim pressure starts |
| cpu.max | "quota period" (µs) | "25000 100000" | 25% of one core |
| cpu.weight | 1–10000 | 100 | Relative weight when competing |
| pids.max | count | 20 | Max processes in cgroup |
| io.max | "major:minor rbps=N wbps=N" | "8:0 rbps=10485760" | 10 MB/s read on device 8:0 |
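The units are raw (bytes, microseconds), so it's easy to write the wrong number. A small sketch that converts friendlier inputs into the exact strings these files expect (`mem_max_bytes` and `cpu_max_quota` are helper names invented here; the 100000µs period is the kernel's default):

```shell
# mem_max_bytes: turn "64M"/"2G" into the byte count memory.max expects.
mem_max_bytes() {
  case "$1" in
    *G) echo $(( ${1%G} * 1024 * 1024 * 1024 )) ;;
    *M) echo $(( ${1%M} * 1024 * 1024 )) ;;
    *K) echo $(( ${1%K} * 1024 )) ;;
    *)  echo "$1" ;;  # already bytes
  esac
}

# cpu_max_quota: turn a percentage of one core into the "quota period"
# pair cpu.max expects, assuming the default 100000us period.
cpu_max_quota() {
  echo "$(( $1 * 1000 )) 100000"
}

echo "memory.max <- $(mem_max_bytes 64M)"   # 67108864, as used above
echo "cpu.max    <- $(cpu_max_quota 25)"    # "25000 100000"
```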

Now test the memory limit:

# This will be killed by the cgroup OOM killer
python3 -c "x = bytearray(100 * 1024 * 1024)"  # Try to allocate 100MB in a 64MB cgroup

The process gets killed. Check what happened:

cat /sys/fs/cgroup/demo/memory.events
low 0
high 42
max 3
oom 1
oom_kill 1

That's the audit trail. high 42 means memory.high was exceeded 42 times (throttling). oom_kill 1 means one process was killed for exceeding memory.max.
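If you're scraping this in a health check, the counter you care about is oom_kill. A tiny extractor (a sketch; here it's fed a sample of the output above rather than a live cgroup):

```shell
# oom_kills: pull the oom_kill counter out of memory.events content on stdin.
oom_kills() {
  awk '$1 == "oom_kill" { print $2 }'
}

printf 'low 0\nhigh 42\nmax 3\noom 1\noom_kill 1\n' | oom_kills
```

Against a real cgroup: `oom_kills < /sys/fs/cgroup/demo/memory.events`.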

Gotcha: memory.high is your friend. In v1, the only option was memory.limit_in_bytes — exceed it and you die. v2 added memory.high as a throttling threshold: the kernel reclaims memory aggressively and slows the process down before it hits the hard limit. Always set both. In Kubernetes on cgroup v2, limits set memory.max; with the MemoryQoS feature enabled, requests additionally map to memory.min and a memory.high throttle point is derived between them.

Clean up:

# Move shell out of the cgroup first
echo $$ | sudo tee /sys/fs/cgroup/cgroup.procs
sudo rmdir /sys/fs/cgroup/demo

Pressure Stall Information (PSI) — v2 Only

PSI tells you how much a cgroup is suffering, not just how much it's using:

cat /proc/pressure/cpu
# some avg10=2.50 avg60=1.80 avg300=1.20 total=1234567890

cat /proc/pressure/memory
# some avg10=0.30 avg60=0.10 avg300=0.05 total=12345678
  • some: at least one task was stalled waiting for this resource
  • full: all tasks were stalled (severe)
  • avg10/60/300: exponential moving averages over 10s, 60s, 300s

| Metric | Healthy | Warning | Action needed |
|---|---|---|---|
| cpu some avg60 | < 5% | 5–25% | > 25%: add CPU |
| memory some avg60 | < 5% | 5–20% | > 20%: add RAM |
| memory full avg60 | 0% | > 0% | > 5%: thrashing, urgent |
| io some avg60 | < 10% | 10–30% | > 30%: upgrade storage |
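Those thresholds are easy to script. A sketch that classifies a PSI line by its avg60 value, using the CPU thresholds from the table (`psi_verdict` is a name invented here; shell has no floats, so it compares in hundredths via awk):

```shell
# psi_verdict: classify a PSI "some" line by avg60 against the 5%/25%
# CPU thresholds above. Expects one line like:
#   some avg10=2.50 avg60=1.80 avg300=1.20 total=1234567890
psi_verdict() {
  avg60=$(printf '%s\n' "$1" | tr ' ' '\n' | sed -n 's/^avg60=//p')
  hundredths=$(awk -v v="$avg60" 'BEGIN { printf "%d", v * 100 }')
  if   [ "$hundredths" -lt 500 ];  then echo "healthy ($avg60%)"
  elif [ "$hundredths" -le 2500 ]; then echo "warning ($avg60%)"
  else                                  echo "add CPU ($avg60%)"
  fi
}

psi_verdict "$(grep '^some' /proc/pressure/cpu)"
```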

Mental Model: Traditional monitoring tells you utilization (80% CPU used). PSI tells you saturation (20% of the time, work is stalled waiting for CPU). A system can be at 90% utilization and 0% saturation (healthy) or at 60% utilization and 40% saturation (in trouble because of lock contention or throttling).


Flashcard Check — Cgroups

| Question | Answer |
|---|---|
| How do you check if your system runs cgroups v1 or v2? | stat -f --format=%T /sys/fs/cgroup/ — cgroup2fs = v2, tmpfs = v1 |
| What's the difference between memory.max and memory.high in v2? | memory.max = hard limit (OOM kill). memory.high = throttle threshold (slow down) |
| What does cpu.max "50000 100000" mean? | 50% of one core (50,000µs quota per 100,000µs period) |
| How do you add a process to a cgroup? | echo PID > /sys/fs/cgroup/<path>/cgroup.procs |
| What file shows cgroup OOM kill history? | memory.events |
| What does PSI measure that utilization doesn't? | Saturation — how much time work is stalled waiting for a resource |

Part 5: How Docker Assembles the Lie

When you run docker run -d -p 8080:80 --memory=512m --name web nginx:1.25, here's what actually happens — mapped to the primitives you just learned:

docker CLI → REST API → dockerd → gRPC → containerd → containerd-shim → runc

runc does:
1. clone() with CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET | CLONE_NEWUTS | CLONE_NEWIPC
   → Creates 5+ new namespaces (what you did with unshare)

2. Assembles OverlayFS from image layers
   → Stacks read-only layers + writable upper layer into one filesystem view

3. pivot_root() to the assembled filesystem
   → Swaps the root (what you did with pivot_root)

4. Creates a cgroup and writes limits
   → echo 536870912 > memory.max  (512MB)
   → echo $PID > cgroup.procs

5. Drops capabilities (removes CAP_SYS_ADMIN, etc.)

6. Applies seccomp filter (blocks ~44 dangerous syscalls)

7. Applies AppArmor/SELinux profile

8. exec() the entrypoint (nginx) as PID 1 in the new namespace set

Docker networking adds one more step:

9. Creates a veth pair
   → One end in the container's net namespace (becomes eth0)
   → Other end on the docker0 bridge on the host

10. Adds iptables DNAT rule
    → Host:8080 → Container-IP:80

Every single step maps to something you built by hand in Parts 2–4. Docker is automation, not magic.

The OverlayFS Layer Cake

Docker images are not disk images. They're stacked filesystem diffs:

┌───────────────────────────┐
│ Writable container layer  │  ← Your runtime changes go here
├───────────────────────────┤
│ nginx config layer        │  ← Read-only
├───────────────────────────┤
│ nginx binary layer        │  ← Read-only
├───────────────────────────┤
│ Debian base layer         │  ← Read-only
└───────────────────────────┘
  [Merged view via OverlayFS]  ← What the container process sees

When the container modifies a file from a lower layer, OverlayFS first copies the entire file up to the writable layer (copy-on-write). Deleting a file creates a whiteout marker; the lower-layer copy still exists and still takes space. This is why RUN apt-get install && apt-get clean must happen in the same Dockerfile RUN instruction: cleanup in a later layer cannot shrink an earlier one.

# See the overlay mount for a running container
docker inspect --format '{{.GraphDriver.Data.MergedDir}}' web
# /var/lib/docker/overlay2/<hash>/merged

Gotcha: /proc/meminfo and /proc/cpuinfo inside a container show the host's values, not the container's cgroup limits. A JVM reading /proc/meminfo sees 64GB of host RAM and sets its heap accordingly — inside a 512MB container. Modern JVMs (10+) use -XX:+UseContainerSupport to read cgroup limits instead. For Go, use go.uber.org/automaxprocs. Python's os.cpu_count() still returns the host CPU count as of 3.12. (Source: JDK-8146115, Go issue #33803)


Part 6: Container Escapes — Breaking the Walls

Understanding escapes is understanding the walls. If you can't explain how a container is escaped, you don't understand how it's contained.

Escape 1: The /proc/self/exe Overwrite (CVE-2019-5736)

This is the most famous container escape. It affected runc < 1.0.0-rc6.

The vulnerability: A malicious container could overwrite the host's runc binary by exploiting how Linux handles /proc/self/exe.

How it worked:

  1. When docker exec runs, runc enters the container's namespaces and then exec()s the requested command
  2. During a brief window, runc's process is inside the container's namespaces but /proc/self/exe still points to the host's runc binary
  3. The attacker grabs a file descriptor to /proc/self/exe (which resolves to the host's /usr/bin/runc). The binary can't be written while runc is still executing it (ETXTBSY), so the exploit waits for runc's exec to complete, then reopens the saved descriptor for writing and overwrites the host binary with a malicious one
  4. Next time anyone runs docker exec or docker run, the malicious runc executes as root on the host

The fix: runc now re-executes itself into a cloned binary (memfd), so /proc/self/exe never points to the on-disk binary during container operations.

War Story: CVE-2019-5736 was disclosed in February 2019 by Adam Iwaniuk and Borys Popławski. It affected Docker, Kubernetes, and any system using runc. The severity was scored CVSS 8.6. It demonstrated that the boundary between "inside the container" and "on the host" is thinner than most people assume — it's maintained by kernel namespace isolation, and any process that crosses that boundary (like runc during exec) is a potential attack vector. (Source: NVD CVE-2019-5736, runc security advisory GHSA-f3fp-gc8g-vw66)

Escape 2: Privileged Mode — The Door You Left Open

# Never do this in production
docker run --privileged -it alpine sh

--privileged disables almost every isolation mechanism:

  • All capabilities granted (including CAP_SYS_ADMIN)
  • Seccomp disabled
  • AppArmor disabled
  • Full access to host /dev
  • Can mount the host filesystem
# Inside a --privileged container — full host access
mount /dev/sda1 /mnt
ls /mnt   # host's root filesystem
chroot /mnt
# You are now effectively root on the host

Remember: --privileged is not "give the container more permissions." It's "remove all container isolation." The process is still in namespaces, but with all capabilities and no seccomp/AppArmor, it can escape trivially. There is almost never a legitimate reason to use --privileged in production.

Escape 3: The Docker Socket Mount

# Commonly seen in CI tools and monitoring agents
docker run -v /var/run/docker.sock:/var/run/docker.sock alpine sh

If a container has access to the Docker socket, it can:

# From inside the container — create a privileged container
apk add docker-cli
docker run --privileged -v /:/host alpine chroot /host
# Full host access via a container it launched

The Docker socket grants the same power as root on the host.

The Defense Stack

Each layer blocks different attacks:

| Defense | What it prevents |
|---|---|
| Namespaces | Seeing host processes, files, network |
| Cgroups | Consuming all host resources |
| Capabilities | Performing privileged kernel operations |
| Seccomp | Making dangerous syscalls (mount, reboot, kexec) |
| AppArmor/SELinux | Accessing specific files and resources |
| User namespaces | Being actual root on the host even if root in container |
| Read-only rootfs | Writing malicious binaries |

Part 7: Capabilities and Seccomp — Fine-Grained Controls

Linux Capabilities

Root privileges aren't all-or-nothing. Since Linux 2.2, they're split into ~40 capabilities:

# What capabilities does a container have?
docker run --rm alpine cat /proc/1/status | grep Cap
CapInh: 0000000000000000
CapPrm: 00000000a80425fb
CapEff: 00000000a80425fb
CapBnd: 00000000a80425fb
CapAmb: 0000000000000000

Decode those hex values:

capsh --decode=00000000a80425fb

Docker's default set keeps ~14 capabilities and drops the rest. The dangerous ones that Docker drops by default:

| Capability | What it allows | Why it's dropped |
|---|---|---|
| CAP_SYS_ADMIN | Mount filesystems, set hostname, access many kernel interfaces | Near-root power — the "too broad" capability |
| CAP_NET_ADMIN | Configure network interfaces, iptables, routing | Could break host networking |
| CAP_SYS_PTRACE | Trace and debug other processes | Could read other containers' memory |
| CAP_SYS_MODULE | Load kernel modules | Game over — arbitrary kernel code |
| CAP_SYS_RAWIO | Raw I/O access | Direct hardware access |
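capsh does the decoding for you, but the hex value is just a bitmask over capability numbers. A decoder sketch covering the first 32 capabilities (names and numbering per linux/capability.h; that range includes everything in Docker's default set):

```shell
# decode_caps: print the capability names set in a hex mask like the
# CapEff value from /proc/PID/status. Covers capabilities 0-31.
decode_caps() {
  mask=$(( 0x$1 ))
  names="chown dac_override dac_read_search fowner fsetid kill setgid setuid \
setpcap linux_immutable net_bind_service net_broadcast net_admin net_raw \
ipc_lock ipc_owner sys_module sys_rawio sys_chroot sys_ptrace sys_pacct \
sys_admin sys_boot sys_nice sys_resource sys_time sys_tty_config mknod \
lease audit_write audit_control setfcap"
  bit=0
  for name in $names; do
    [ $(( (mask >> bit) & 1 )) -eq 1 ] && echo "cap_$name"
    bit=$(( bit + 1 ))
  done
}

decode_caps 00000000a80425fb   # Docker's default set: 14 capabilities
```

Note that cap_sys_admin, cap_net_admin, and cap_sys_module are absent from the output, exactly as the table predicts.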

Best practice — drop everything, add only what you need:

docker run --cap-drop=ALL --cap-add=NET_BIND_SERVICE nginx:1.25

Trivia: CAP_SYS_ADMIN is sometimes called "the new root" because it grants such a broad set of privileges that having it effectively makes you root for most practical purposes. There have been repeated proposals to split it into smaller capabilities, but the kernel ABI stability guarantee makes this difficult. (Source: man 7 capabilities, lwn.net articles on capability splitting)

Seccomp Profiles

Seccomp filters restrict which system calls a process can make. Docker's default profile blocks approximately 44 of ~330 available syscalls:

# Run with default seccomp (automatic)
docker run nginx:1.25

# Run with NO seccomp (dangerous — for debugging only)
docker run --security-opt seccomp=unconfined nginx:1.25

# See what's blocked by default
docker run --rm -it alpine sh -c 'mount -t proc proc /tmp 2>&1'
# "mount: permission denied (are you root?)"
# The mount syscall is blocked by seccomp, even though you're root inside

Key blocked syscalls:

| Syscall | Why it's blocked |
|---|---|
| mount, umount2 | Could mount host filesystems |
| reboot | Could reboot the host |
| kexec_load | Could load a new kernel |
| bpf | Could attach eBPF programs to kernel |
| ptrace | Could trace other processes |
| add_key, keyctl | Could access kernel keyring |
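Every process reports whether a filter applies to it, so you can check from inside a container (or anywhere else) without docker at all:

```shell
# The Seccomp field of /proc/PID/status shows the seccomp mode:
# 0 = disabled, 1 = strict, 2 = filter (a BPF profile is attached,
# which is what Docker's default profile shows up as).
grep '^Seccomp:' /proc/self/status
```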

Rootless Containers — The Best Defense

Rootless containers run the entire runtime stack as an unprivileged user:

# The Docker daemon itself runs without root
dockerd-rootless-setuptool.sh install

# Or use Podman, which is rootless by default
podman run -d nginx:1.25

How it works: User namespaces map UID 0 inside the container to your regular UID (e.g., 1000) on the host. Even if an attacker escapes the container, they land as an unprivileged user.

# Check the UID mapping
cat /proc/<container-pid>/uid_map
# 0  1000  65536
# Container UID 0 maps to host UID 1000

Limitations of rootless mode:

  • Cannot bind to ports below 1024 (without net.ipv4.ip_unprivileged_port_start)
  • OverlayFS requires kernel 5.11+ in rootless mode (older kernels use FUSE-OverlayFS, which is slower)
  • Some storage drivers have restrictions

Under the Hood: Rootless Podman uses slirp4netns or pasta for networking instead of creating veth pairs (which requires CAP_NET_ADMIN). slirp4netns runs a userspace TCP/IP stack, routing container traffic through a TAP device without any kernel privilege. Since kernel 5.11, native OverlayFS supports user namespaces, eliminating the FUSE overhead. (Source: man slirp4netns, rootlesscontaine.rs)


Part 8: The /proc and /sys/fs/cgroup Interfaces

These two pseudo-filesystems are your debugging interface into namespaces and cgroups.

/proc/PID/ns/ — Namespace Membership

# List all namespaces on the system
lsns

# Check which namespaces a specific process belongs to
ls -la /proc/$(pgrep nginx | head -1)/ns/

# Compare two processes — same namespace?
readlink /proc/1/ns/net
readlink /proc/$(pgrep nginx | head -1)/ns/net
# Same inode = same namespace

/proc/PID/cgroup — Cgroup Membership

# What cgroup is this process in?
cat /proc/$(pgrep nginx | head -1)/cgroup
# v2: 0::/system.slice/docker-abc123.scope
# v1: 12:memory:/docker/abc123 (multiple lines)

/sys/fs/cgroup/ — Reading and Setting Limits

# Find a container's cgroup path
CGPATH=$(cat /proc/$(pgrep nginx | head -1)/cgroup | cut -d: -f3)

# Read current resource usage
cat /sys/fs/cgroup${CGPATH}/memory.current     # bytes used now
cat /sys/fs/cgroup${CGPATH}/memory.max         # hard limit
cat /sys/fs/cgroup${CGPATH}/memory.high        # throttle point
cat /sys/fs/cgroup${CGPATH}/cpu.max            # CPU quota
cat /sys/fs/cgroup${CGPATH}/cpu.stat           # throttle stats
cat /sys/fs/cgroup${CGPATH}/pids.current       # process count
cat /sys/fs/cgroup${CGPATH}/memory.events      # OOM kill history
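The same trick works for your own shell: /proc tells you where a process lives, /sys/fs/cgroup is where you read it. A helper sketch (`my_cgroup_dir` is a name invented here; assumes cgroups v2, where /proc/self/cgroup is a single 0:: line):

```shell
# my_cgroup_dir: resolve the calling process's own cgroup v2 directory.
# /proc/self/cgroup looks like "0::/user.slice/..." on v2; append that
# path to the cgroupfs mount point.
my_cgroup_dir() {
  echo "/sys/fs/cgroup$(cut -d: -f3- /proc/self/cgroup)"
}

my_cgroup_dir
# Then read any file from the list above, e.g.:
# cat "$(my_cgroup_dir)/memory.current"
```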

nsenter — The Debugging Swiss Army Knife

nsenter enters an existing process's namespaces — use host tools against a container's view:

PID=$(docker inspect --format '{{.State.Pid}}' myapp)

# Enter network namespace — tcpdump without installing it in the container
sudo nsenter -t $PID -n tcpdump -i eth0 -nn -c 50 port 80

# Enter mount namespace — browse the container's filesystem
sudo nsenter -t $PID -m ls -la /app/

# Enter all namespaces — full "exec" equivalent from the host
sudo nsenter -t $PID -m -u -i -n -p -- /bin/sh

| nsenter flag | Namespace | Most common use |
|---|---|---|
| -n | Network | tcpdump, ss, ip addr, curl (95% of debugging) |
| -m | Mount | Browse container filesystem, check configs |
| -p | PID | See container's process tree |
| -u | UTS | Check hostname |
| -i | IPC | Debug shared memory issues |
Remember: nsenter flags map to namespace types: Network, Mount, PID. These three cover 95% of container debugging. You keep the host's tools but see the container's view.


Part 9: The Hands-On Exercises

Exercise 1: Prove Containers Are Processes (2 minutes)

Run a Docker container and find it in the host's process table:

docker run -d --name ex1 nginx:1.25

Find the PID, check its namespace links, compare with PID 1.

Solution
PID=$(docker inspect --format '{{.State.Pid}}' ex1)
echo "Container PID on host: $PID"
ps aux | grep $PID | grep -v grep
ls -la /proc/$PID/ns/
readlink /proc/1/ns/pid
readlink /proc/$PID/ns/pid
docker rm -f ex1
The PID namespace inodes will be different — proving the container is isolated.

Exercise 2: Create a Memory-Limited Cgroup (5 minutes)

Create a cgroup that limits memory to 32MB. Run a process that tries to allocate 64MB. Observe the OOM kill in memory.events.

Hints

  1. `sudo mkdir /sys/fs/cgroup/exercise2`
  2. Enable controllers on the parent: `echo "+memory" | sudo tee /sys/fs/cgroup/cgroup.subtree_control`
  3. Write to `memory.max`
  4. Move your shell to the cgroup with `echo $$ > cgroup.procs`
  5. Try to allocate: `python3 -c "x = bytearray(64 * 1024 * 1024)"`
  6. Check `memory.events` for the `oom_kill` count
Solution
echo "+memory" | sudo tee /sys/fs/cgroup/cgroup.subtree_control
sudo mkdir /sys/fs/cgroup/exercise2
echo 33554432 | sudo tee /sys/fs/cgroup/exercise2/memory.max
echo $$ | sudo tee /sys/fs/cgroup/exercise2/cgroup.procs
python3 -c "x = bytearray(64 * 1024 * 1024)"
# Killed
cat /sys/fs/cgroup/exercise2/memory.events
# oom_kill should be >= 1

# Clean up: move shell out first
echo $$ | sudo tee /sys/fs/cgroup/cgroup.procs
sudo rmdir /sys/fs/cgroup/exercise2

Exercise 3: Debug a Container with nsenter (10 minutes)

A container is running but you can't exec into it (imagine it's a distroless image with no shell). Use nsenter from the host to:

  1. Check what ports it's listening on
  2. Capture 10 packets on its network interface
  3. Read its /etc/resolv.conf
Solution
docker run -d --name ex3 nginx:1.25
PID=$(docker inspect --format '{{.State.Pid}}' ex3)

# 1. Listening ports (network namespace only)
sudo nsenter -t $PID -n ss -tlnp

# 2. Packet capture (network namespace only)
sudo nsenter -t $PID -n tcpdump -i eth0 -nn -c 10

# 3. Read resolv.conf (mount namespace)
sudo nsenter -t $PID -m cat /etc/resolv.conf

docker rm -f ex3

Exercise 4: Spot the Escape Vector (judgment call)

This docker-compose.yml is used by a CI tool. Find the security problem:

services:
  ci-runner:
    image: myorg/ci-runner:latest
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - ./workspace:/workspace
    privileged: false
    cap_drop:
      - ALL
Answer

The Docker socket mount (`/var/run/docker.sock`) gives the container full control of the Docker daemon. Even with `cap_drop: ALL`, the container can `docker run --privileged` to launch a new container with full host access. The socket mount is equivalent to root.

Fix: Use a dedicated CI runner that doesn't need Docker socket access, or use Kaniko/Buildah for in-container image building.

Cheat Sheet

Namespace Commands

| Task | Command |
|---|---|
| Create namespaces | sudo unshare --pid --mount --net --uts --ipc --fork bash |
| Enter container namespaces | sudo nsenter -t $PID -m -u -i -n -p -- /bin/sh |
| Enter network ns only | sudo nsenter -t $PID -n <command> |
| List all namespaces | lsns |
| Check namespace membership | ls -la /proc/$PID/ns/ |
| Compare namespaces | readlink /proc/$PID1/ns/net vs readlink /proc/$PID2/ns/net |

Cgroup Commands (v2)

| Task | Command |
|---|---|
| Check v1 or v2 | stat -f --format=%T /sys/fs/cgroup/ |
| Find process cgroup | cat /proc/$PID/cgroup |
| Read memory limit | cat /sys/fs/cgroup/<path>/memory.max |
| Read memory usage | cat /sys/fs/cgroup/<path>/memory.current |
| Read CPU limit | cat /sys/fs/cgroup/<path>/cpu.max |
| Check throttle count | cat /sys/fs/cgroup/<path>/cpu.stat (nr_throttled) |
| Check OOM history | cat /sys/fs/cgroup/<path>/memory.events |
| System-wide pressure | cat /proc/pressure/{cpu,memory,io} |
| Per-service pressure | cat /sys/fs/cgroup/system.slice/<service>/cpu.pressure |

Docker → Linux Primitives Mapping

| Docker concept | Linux primitive |
|---|---|
| Container isolation | Namespaces (pid, net, mnt, uts, ipc) |
| Resource limits (--memory, --cpus) | cgroups (memory.max, cpu.max) |
| Image layers | OverlayFS (lower/upper/merged dirs) |
| Port mapping (-p 8080:80) | iptables DNAT rule |
| Container networking | veth pair + bridge (docker0) |
| --cap-drop / --cap-add | Linux capabilities |
| Seccomp profile | seccomp BPF filter |
| User namespace remap | CLONE_NEWUSER + /proc/PID/uid_map |

Security Hardening Quick Reference

| Hardening | Command / Config |
|---|---|
| Drop all capabilities | --cap-drop=ALL --cap-add=<needed> |
| Read-only rootfs | --read-only --tmpfs /tmp |
| No new privileges | --security-opt=no-new-privileges |
| Custom seccomp profile | --security-opt seccomp=profile.json |
| Non-root user | USER appuser in Dockerfile |
| Rootless mode | dockerd-rootless-setuptool.sh install |
| Never mount docker socket | Remove -v /var/run/docker.sock:... |

Takeaways

  • Containers are processes, not VMs. They share the host kernel. The isolation comes from namespaces (visibility) and cgroups (resource limits), not from hardware virtualization. You can prove this with ps and /proc.

  • You can build a container with five commands. unshare + mount + pivot_root + writing to cgroup files. Docker automates this, but the primitives are simple enough to use by hand.

  • cgroups v2 is the future. Unified hierarchy, memory.high for graceful throttling, PSI for saturation monitoring. If you're still on v1, plan the migration.

  • The security stack is layered and each layer matters. Namespaces, cgroups, capabilities, seccomp, AppArmor/SELinux, user namespaces. Removing any one layer (especially via --privileged) opens real attack paths.

  • /proc/meminfo lies to containers. Applications that auto-tune from /proc will see host resources, not cgroup limits. Modern runtimes handle this; older apps need environment variables (OMP_NUM_THREADS, GOMAXPROCS).

  • nsenter is the most powerful container debugging tool. It lets you use host tools against a container's namespace view. Learn -n (network), -m (mount), -p (PID).