The Container Escape

Topics: namespaces, capabilities, seccomp, Docker security, container isolation, attack surface
Level: L2 (Operations)
Time: 60–75 minutes
Prerequisites: Basic container understanding helpful but not required

The Mission

Your security team asks: "If an attacker gets code execution inside a container, what can they reach?" The honest answer, for most default Docker configurations, is: more than you'd like.

Containers are not VMs. They share the host kernel. The isolation comes from Linux kernel features — namespaces, cgroups, capabilities, seccomp — and each one has limits, edge cases, and well-known bypasses.

This lesson breaks container security by examining each isolation layer, understanding what it prevents, and seeing what happens when it's misconfigured. You'll understand container isolation deeply enough to harden it — or to explain to your CISO exactly what the risk is.

The Isolation Stack

A container is a process with restrictions applied by the kernel:

Normal process   →   Container process
     ↓                      ↓
Full filesystem  →   Mount namespace (restricted view)
All PIDs visible →   PID namespace (only sees its own processes)
Host network     →   Network namespace (isolated network stack)
All capabilities →   Dropped capabilities (reduced privileges)
All syscalls     →   Seccomp filter (blocked dangerous syscalls)
No cgroup limits →   Cgroup limits (CPU, memory, I/O)
Root UID = root  →   User namespace, if enabled (UID 0 in container ≠ UID 0 on host)

Each layer is a defense. Remove any one, and you open an attack path.

Layer 1: Namespaces — What the Container Can See

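Linux has 8 namespace types. The seven below are the ones containers normally rely on (the eighth, the time namespace, arrived in kernel 5.6 in 2020). Each one isolates a different kernel resource: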

Namespace	Isolates	Kernel version
PID	Process IDs	2.6.24 (2008)
NET	Network stack (interfaces, routing, ports)	2.6.29 (2009)
MNT	Filesystem mount points	2.4.19 (2002)
UTS	Hostname and domain name	2.6.19 (2006)
IPC	Inter-process communication (shared memory, semaphores)	2.6.19 (2006)
USER	User and group IDs	3.8 (2013)
CGROUP	Cgroup hierarchy view	4.6 (2016)
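
To see which of these namespaces two processes actually share, you can compare the inode numbers behind the symlinks in /proc/<pid>/ns. The following is a minimal sketch, assuming a Linux host; the inode values in the example are made up, and the helper names (parse_ns_link, read_namespaces, shared_namespaces) are illustrative, not part of any tool.

```python
import os
import re

# Symlink targets under /proc/<pid>/ns look like "pid:[4026531836]".
NS_LINK = re.compile(r"^([a-z_]+):\[(\d+)\]$")


def parse_ns_link(target: str) -> tuple[str, int]:
    """Split a namespace symlink target into (type, inode number)."""
    match = NS_LINK.match(target)
    if match is None:
        raise ValueError(f"not a namespace link: {target!r}")
    return match.group(1), int(match.group(2))


def read_namespaces(pid: str = "self") -> dict[str, int]:
    """Return {entry name: namespace inode} for one process (Linux only)."""
    ns_dir = f"/proc/{pid}/ns"
    return {
        name: parse_ns_link(os.readlink(os.path.join(ns_dir, name)))[1]
        for name in os.listdir(ns_dir)
    }


def shared_namespaces(a: dict[str, int], b: dict[str, int]) -> list[str]:
    """Names of namespaces that two processes share (same inode = no isolation)."""
    return sorted(name for name in a.keys() & b.keys() if a[name] == b[name])


# Hypothetical inode values for a host shell and a container started with --net=host.
host = {"pid": 4026531836, "net": 4026531840, "mnt": 4026531841}
container = {"pid": 4026532201, "net": 4026531840, "mnt": 4026532199}
print(shared_namespaces(host, container))  # → ['net']
```

On a real host, run read_namespaces() for your shell and read_namespaces(str(pid)) for a container's PID (as root) and pass both results to shared_namespaces.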

What namespaces DON'T isolate

The container still shares the host's:

- Kernel: same kernel, same vulnerabilities. A kernel exploit inside the container = host compromise.
- /proc and /sys contents: /proc/meminfo and /proc/cpuinfo show HOST values, not container limits. This has caused countless JVM misconfigurations where the JVM reads host RAM (64GB) and sets its heap accordingly, while the container limit is 512MB.
- Time: containers share the host clock. Changing the system time inside a container (if you have the capability) affects the host.
- Kernel parameters: most /proc/sys/ values are global. A container that can write to them affects all containers on the host.

# Inside a container — these show HOST values
cat /proc/meminfo | head -3
# → MemTotal:  65536000 kB   ← this is the HOST's RAM, not the container's limit

# The container's actual limit is in cgroups (cgroup v2 path shown;
# on cgroup v1 it is /sys/fs/cgroup/memory/memory.limit_in_bytes)
cat /sys/fs/cgroup/memory.max
# → 536870912  ← 512MB (the real limit)

Gotcha: JVMs before Java 10 (and Java 8 builds before 8u191) sized their heap from host memory and ignored cgroup limits. A JVM in a 512MB container on a 64GB host would default its maximum heap to ~16GB (one quarter of host RAM), grow past the container limit, and get OOM killed with an unhelpful error message. Java 10+ added container awareness via cgroup detection. Always use -XX:MaxRAMPercentage=75.0 or explicit -Xmx in containers.
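
The arithmetic behind this gotcha can be sketched in a few lines. This is an illustration only, using the sample numbers from above (64GB host, 512MB limit) and the cgroup v2 file format; the function names are hypothetical, and a real JVM's sizing logic has more inputs than this.

```python
def parse_cgroup_limit(raw: str):
    """Parse cgroup v2 memory.max content: a byte count, or "max" for unlimited."""
    raw = raw.strip()
    return None if raw == "max" else int(raw)


def parse_mem_total(meminfo: str) -> int:
    """Extract MemTotal from /proc/meminfo text and convert kB to bytes."""
    for line in meminfo.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1]) * 1024
    raise ValueError("MemTotal not found")


def effective_memory(meminfo: str, cgroup_max: str) -> int:
    """Memory the process can really use: the smaller of host RAM and cgroup limit."""
    host = parse_mem_total(meminfo)
    limit = parse_cgroup_limit(cgroup_max)
    return host if limit is None else min(host, limit)


# Sample file contents matching the lesson's example values.
meminfo = "MemTotal:       65536000 kB\nMemFree:        1234 kB\n"
host_bytes = parse_mem_total(meminfo)
real_bytes = effective_memory(meminfo, "536870912\n")

# A JVM without container awareness defaults max heap to 1/4 of host RAM.
naive_heap = host_bytes // 4
# With cgroup awareness and -XX:MaxRAMPercentage=75.0, heap is 75% of the limit.
aware_heap = real_bytes * 75 // 100

print(naive_heap // 2**20, "MiB vs", aware_heap // 2**20, "MiB")
```

Inside a real container, the two inputs would come from reading /proc/meminfo and /sys/fs/cgroup/memory.max.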

Layer 2: Capabilities — What the Container Can Do

Root on a traditional Linux system can do anything. Capabilities split root's power into 41 individual privileges (as of Linux 5.9). Docker drops most of them by default.

What Docker keeps (a selection of the 14 default capabilities):

Capability	What it allows	Risk if exploited
`CHOWN`	Change file ownership	Moderate
`NET_BIND_SERVICE`	Bind to ports < 1024	Low
`KILL`	Send signals to processes	Low (within container)
`NET_RAW`	Raw sockets (ping)	Medium — can sniff traffic on shared network
`SETUID`/`SETGID`	Change user/group ID	High — privilege escalation inside container

What Docker drops (among others):

Capability	What it allows	Why it's dropped
`SYS_ADMIN`	Mount filesystems, manage namespaces, and dozens of other admin operations	Effectively root on host
`SYS_PTRACE`	Trace other processes	Can inspect/modify other processes' memory (other containers' too if PID namespaces are shared)
`NET_ADMIN`	Configure networking	Can modify routing, sniff traffic
`SYS_MODULE`	Load kernel modules	Direct kernel access
`SYS_RAWIO`	Raw I/O port access	Direct hardware access
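
The capability set of a process appears in /proc/<pid>/status as a hex bitmask (CapEff and friends), one bit per capability. A small decoder makes the two tables above concrete. This is a sketch: the name list follows the bit order in linux/capability.h, and the two masks are values typically seen for a default and a privileged Docker container on a kernel with 41 capabilities, so treat them as examples rather than guarantees.

```python
# Capability names indexed by bit position, from linux/capability.h.
CAP_NAMES = [
    "CHOWN", "DAC_OVERRIDE", "DAC_READ_SEARCH", "FOWNER", "FSETID", "KILL",
    "SETGID", "SETUID", "SETPCAP", "LINUX_IMMUTABLE", "NET_BIND_SERVICE",
    "NET_BROADCAST", "NET_ADMIN", "NET_RAW", "IPC_LOCK", "IPC_OWNER",
    "SYS_MODULE", "SYS_RAWIO", "SYS_CHROOT", "SYS_PTRACE", "SYS_PACCT",
    "SYS_ADMIN", "SYS_BOOT", "SYS_NICE", "SYS_RESOURCE", "SYS_TIME",
    "SYS_TTY_CONFIG", "MKNOD", "LEASE", "AUDIT_WRITE", "AUDIT_CONTROL",
    "SETFCAP", "MAC_OVERRIDE", "MAC_ADMIN", "SYSLOG", "WAKE_ALARM",
    "BLOCK_SUSPEND", "AUDIT_READ", "PERFMON", "BPF", "CHECKPOINT_RESTORE",
]


def decode_caps(hex_mask: str) -> list[str]:
    """Turn a CapEff-style hex string into the list of capability names it grants."""
    mask = int(hex_mask, 16)
    return [name for bit, name in enumerate(CAP_NAMES) if mask & (1 << bit)]


# Example masks (assumed typical values, check your own with /proc/1/status).
default_caps = decode_caps("00000000a80425fb")
privileged_caps = decode_caps("000001ffffffffff")

print(len(default_caps), "capabilities by default:", ", ".join(default_caps))
print("SYS_ADMIN in default set:", "SYS_ADMIN" in default_caps)
print("SYS_ADMIN in privileged set:", "SYS_ADMIN" in privileged_caps)
```

This does the same job as capsh --decode, without needing libcap installed in the container.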

The `--privileged` flag

# This gives the container ALL capabilities, disables seccomp, and grants device access
docker run --privileged myimage

--privileged is the container equivalent of chmod 777. The container can:

- Mount the host filesystem
- Load kernel modules
- Access all devices
- Modify kernel parameters
- Essentially do anything root on the host can do

Mental Model: A privileged container is a process running as host root with a fancy filesystem. The namespace boundaries still exist technically, but with all capabilities and device access, breaking out is trivial.

# From inside a privileged container: mount the host's root partition
# (/dev/sda1 is an example; the device name depends on the host)
mount /dev/sda1 /mnt
# You now have full read/write access to the host's root filesystem
# Add your SSH key, modify /etc/shadow, install a backdoor

War Story: Tesla's Kubernetes cluster (2018) had an unauthenticated Kubernetes dashboard with cluster-admin access. Attackers deployed cryptocurrency miners. But the real damage was that credentials exposed in a pod gave access to an S3 bucket containing sensitive data such as telemetry. The fix was simple: authentication on the dashboard plus RBAC. But the exposure lasted until external researchers reported it.

Layer 3: Seccomp — What Syscalls the Container Can Make

Even without special capabilities, the container's process can make system calls. Seccomp (secure computing) filters which syscalls are allowed.

Docker's default seccomp profile blocks ~44 dangerous syscalls:

Blocked syscall	What it does	Why blocked
`reboot`	Reboots the system	Host reboot
`kexec_load`	Loads a new kernel	Kernel replacement
`mount`	Mounts filesystems	Filesystem escape
`umount`	Unmounts filesystems	Filesystem manipulation
`swapon`/`swapoff`	Manages swap	Resource manipulation
`init_module`/`delete_module`	Loads/unloads kernel modules	Kernel access
`clock_settime`	Changes system clock	Time manipulation
`settimeofday`	Changes system time	Time manipulation
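
Docker seccomp profiles are JSON files with a default action and a list of per-syscall rules; the default profile is written as an allowlist, so anything not listed is denied. The sketch below models that lookup in a deliberately simplified way: the three-rule profile is hypothetical, and real profiles also carry argument filters and capability-conditional rules that this ignores.

```python
import json

# Tiny hypothetical profile in Docker's seccomp JSON format:
# deny by default, allow only what is listed.
PROFILE_JSON = """
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "syscalls": [
    {"names": ["read", "write", "exit_group"], "action": "SCMP_ACT_ALLOW"},
    {"names": ["reboot"], "action": "SCMP_ACT_ERRNO"}
  ]
}
"""


def action_for(profile: dict, syscall: str) -> str:
    """Return the action of the first rule naming the syscall, else the default.

    Simplification: ignores argument filters and conditional includes/excludes.
    """
    for rule in profile.get("syscalls", []):
        if syscall in rule.get("names", []):
            return rule["action"]
    return profile["defaultAction"]


profile = json.loads(PROFILE_JSON)
for name in ("read", "mount", "reboot"):
    print(name, "->", action_for(profile, name))
```

A custom profile like this is passed to a container with docker run --security-opt seccomp=profile.json.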

War Story: A Go service worked in dev but crashed in production with a cryptic error. Same image SHA everywhere. Root cause: production had a custom seccomp profile that blocked mmap with PROT_EXEC flags — something the default Docker profile allows but the hardened production profile didn't. The Go runtime needed executable memory mappings for its garbage collector. "Same image" does NOT mean "same runtime environment" — security policies, seccomp profiles, and capability sets differ between environments.

Layer 4: User Namespaces — The UID Question

Inside the container, you're root (UID 0). On the host, what UID are you?

Without user namespaces (default Docker):

- Container root = host root (UID 0)
- If you escape the container, you have host root privileges

With user namespaces (rootless Docker/Podman, or Docker's userns-remap option):

- Container root (UID 0) maps to an unprivileged host UID (e.g., 100000)
- Even if you escape, you're an unprivileged user on the host

# Check UID mapping from inside container
cat /proc/1/uid_map
# Without user namespaces:  0  0  4294967295  (container 0 = host 0)
# With user namespaces:     0  100000  65536   (container 0 = host 100000)

Rootless containers (Podman run as a non-root user, Docker rootless mode) are among the strongest isolation options available without a VM or a sandboxed runtime. The trade-off: by default they can't bind ports below 1024, and some operations are slower (FUSE-based OverlayFS before kernel 5.11).
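
Each line of uid_map reads "first UID inside, first UID outside, range length", so translating a container UID to its host UID is simple range arithmetic. A minimal sketch, using the two example mappings shown above (the function names are illustrative):

```python
def parse_uid_map(text: str) -> list[tuple[int, int, int]]:
    """Parse uid_map text into (inside_start, outside_start, length) entries."""
    entries = []
    for line in text.splitlines():
        if line.strip():
            inside, outside, length = (int(x) for x in line.split())
            entries.append((inside, outside, length))
    return entries


def host_uid(container_uid: int, entries):
    """Translate a UID inside the namespace to the host UID, or None if unmapped."""
    for inside, outside, length in entries:
        if inside <= container_uid < inside + length:
            return outside + (container_uid - inside)
    return None


no_userns = parse_uid_map("0 0 4294967295")
with_userns = parse_uid_map("         0     100000      65536\n")

print(host_uid(0, no_userns))       # → 0 (container root is host root)
print(host_uid(0, with_userns))     # → 100000
print(host_uid(1000, with_userns))  # → 101000
```

A UID outside every range (70000 in the second mapping, for example) has no host equivalent, and host_uid returns None for it.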

Layer 5: The Docker Socket — The Master Key

# This is the most dangerous volume mount in all of Docker
docker run -v /var/run/docker.sock:/var/run/docker.sock myimage

The Docker socket gives the container full control over the Docker daemon. From inside, you can:

# Create a new privileged container that mounts the host filesystem
docker run -it --privileged -v /:/host alpine chroot /host
# You now have a host root shell

Mounting the Docker socket is equivalent to giving the container root access to the host. Period. CI/CD tools, monitoring agents, and log collectors that "need Docker access" should use alternatives (rootless Docker-in-Docker, Kaniko for builds, read-only APIs).
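
From inside a container, one quick way to check for this exposure is to look for a docker.sock mount point in /proc/self/mountinfo. The sketch below is a heuristic, not a complete detector: it assumes the socket keeps its usual file name, and the sample mountinfo lines are made up for illustration.

```python
def find_docker_sockets(mountinfo: str) -> list[str]:
    """Return mount points from mountinfo text whose path ends in docker.sock."""
    hits = []
    for line in mountinfo.splitlines():
        fields = line.split()
        # In /proc/<pid>/mountinfo the fifth field (index 4) is the mount point.
        if len(fields) > 4 and fields[4].endswith("docker.sock"):
            hits.append(fields[4])
    return hits


# Hypothetical excerpt of /proc/self/mountinfo inside a container.
SAMPLE = (
    "1130 1100 0:52 / / rw,relatime - overlay overlay rw\n"
    "1145 1130 0:25 /docker.sock /var/run/docker.sock rw,nosuid,nodev - tmpfs tmpfs rw\n"
)

print(find_docker_sockets(SAMPLE))  # → ['/var/run/docker.sock']
```

In a live container, pass the contents of /proc/self/mountinfo instead of SAMPLE. An empty result does not prove safety, since the socket can be mounted under any name.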

The Hardened Container Checklist

# 1. Non-root user
FROM python:3.11-slim
RUN useradd -r -u 1000 appuser
USER appuser

# 2. Minimal base image (less attack surface)
FROM gcr.io/distroless/static:nonroot

# 3. Drop ALL capabilities, add only what's needed
docker run --cap-drop=ALL --cap-add=NET_BIND_SERVICE myimage

# 4. Read-only filesystem
docker run --read-only --tmpfs /tmp myimage

# 5. No new privileges (prevent setuid binaries)
docker run --security-opt=no-new-privileges myimage

# 6. PID 1 init (signal handling + zombie reaping)
docker run --init myimage

# 7. Memory limits (prevent host OOM)
docker run --memory=512m myimage

# 8. Never mount the Docker socket
# 9. Never use --privileged
# 10. Scan images for vulnerabilities
trivy image --severity HIGH,CRITICAL myimage:v1

In Kubernetes:

securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]
    add: ["NET_BIND_SERVICE"]
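
A check like the following could run in CI against parsed pod specs to catch containers that miss these settings. It is a sketch under assumptions: the input is one container's securityContext already loaded as a dict (for example from a YAML parser), and the finding messages are illustrative wording, not output of any Kubernetes tool.

```python
def audit_security_context(ctx: dict) -> list[str]:
    """Compare one container's securityContext against the hardened settings above."""
    findings = []
    if ctx.get("privileged"):
        findings.append("privileged is true")
    if not ctx.get("runAsNonRoot"):
        findings.append("runAsNonRoot is not set")
    if not ctx.get("readOnlyRootFilesystem"):
        findings.append("root filesystem is writable")
    # Treated as risky unless explicitly set to false.
    if ctx.get("allowPrivilegeEscalation", True):
        findings.append("allowPrivilegeEscalation is not false")
    caps = ctx.get("capabilities") or {}
    if "ALL" not in caps.get("drop", []):
        findings.append("capabilities.drop does not include ALL")
    return findings


hardened = {
    "runAsNonRoot": True,
    "runAsUser": 1000,
    "readOnlyRootFilesystem": True,
    "allowPrivilegeEscalation": False,
    "capabilities": {"drop": ["ALL"], "add": ["NET_BIND_SERVICE"]},
}

print(audit_security_context(hardened))  # → []
print(audit_security_context({}))        # four findings for an empty context
```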

Flashcard Check

Q1: Containers share _____ with the host that VMs don't.

The kernel. A kernel exploit inside a container compromises the host. VMs have their own kernel, isolated by the hypervisor.

Q2: /proc/meminfo inside a container shows 64GB but the limit is 512MB. Why?

/proc/meminfo shows host values, not container limits. The actual limit is in /sys/fs/cgroup/memory.max. JVMs before Java 10 misread this and OOM'd.

Q3: What does --privileged do?

Grants ALL capabilities, disables seccomp, gives access to all host devices. The container can mount the host filesystem, load kernel modules, and effectively operate as host root.

Q4: Why is mounting /var/run/docker.sock dangerous?

It gives the container full control over the Docker daemon. From inside, you can create a new privileged container that mounts the host filesystem — instant host root.

Q5: Container root (UID 0) without user namespaces = what on the host?

Also UID 0 (root). Container escape = host root. With user namespaces, container UID 0 maps to an unprivileged host UID.

Q6: What does Docker's default seccomp profile block?

~44 dangerous syscalls including reboot, mount, kexec_load, init_module. Custom profiles can restrict further based on your app's needs.

Exercises

Exercise 1: Inspect your container's isolation (hands-on)

# Run a container and inspect its capabilities
docker run --rm alpine cat /proc/1/status | grep Cap

# Decode the capability bitmask
docker run --rm alpine sh -c 'apk add -q libcap && capsh --decode=$(cat /proc/1/status | grep CapEff | awk "{print \$2}")'

# Compare with a privileged container
docker run --rm --privileged alpine cat /proc/1/status | grep Cap

Exercise 2: Demonstrate the Docker socket escape (lab only)

# Run container with socket mounted
docker run --rm -it -v /var/run/docker.sock:/var/run/docker.sock \
    docker:cli sh

# From inside, create a host-root container
docker run --rm -it --privileged -v /:/hostfs alpine sh
ls /hostfs/etc/shadow  # You now have host root access

Exercise 3: Harden a Dockerfile (refactor)

This Dockerfile has multiple security issues. Fix them all:

FROM ubuntu:latest
RUN apt-get update && apt-get install -y python3 curl vim
COPY . /app
WORKDIR /app
ENV DB_PASSWORD=supersecret
CMD ["python3", "app.py"]

Solution

FROM python:3.11-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir --user -r requirements.txt
COPY . .

FROM python:3.11-slim
WORKDIR /app
RUN useradd -r -u 1000 appuser
COPY --from=builder --chown=appuser /root/.local /home/appuser/.local
COPY --from=builder --chown=appuser /app .
USER appuser
ENV PATH=/home/appuser/.local/bin:$PATH
# DB_PASSWORD should come from secrets mount, not ENV
CMD ["python3", "app.py"]

Fixes: pinned base (not `:latest`), multi-stage (no vim/curl in prod), non-root user, no hardcoded secrets, `.dockerignore` needed for COPY.
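
Some of these fixes can be checked mechanically. The linter below is a rough sketch that flags three of the issues from this exercise (unpinned base image, secret-looking ENV keys, no non-root USER in the final stage). It is heuristic by design: it does not handle line continuations, ARG expansion, stage aliases, or base images that already default to a non-root user.

```python
import re

SECRET_RE = re.compile(r"PASSWORD|SECRET|TOKEN|API_KEY", re.IGNORECASE)


def lint_dockerfile(text: str) -> list[str]:
    """Flag unpinned bases, secret-like ENV keys, and a root final stage."""
    findings = []
    non_root = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        instruction, _, rest = line.partition(" ")
        instruction = instruction.upper()
        if instruction == "FROM":
            non_root = False  # assume each new stage starts as root
            image = next(tok for tok in rest.split() if not tok.startswith("--"))
            if image.endswith(":latest") or (":" not in image and "@" not in image):
                findings.append(f"unpinned base image: {image}")
        elif instruction == "USER":
            non_root = rest.split(":")[0].strip() not in ("root", "0")
        elif instruction == "ENV":
            key = re.split(r"[=\s]", rest.strip(), maxsplit=1)[0]
            if SECRET_RE.search(key):
                findings.append(f"possible secret in ENV: {key}")
    if not non_root:
        findings.append("final stage has no non-root USER")
    return findings


# The insecure Dockerfile from this exercise.
BAD = """\
FROM ubuntu:latest
RUN apt-get update && apt-get install -y python3 curl vim
COPY . /app
WORKDIR /app
ENV DB_PASSWORD=supersecret
CMD ["python3", "app.py"]
"""

for finding in lint_dockerfile(BAD):
    print(finding)
```

Run against the solution Dockerfile above, the same function returns an empty list.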

Cheat Sheet

Container Security Quick Check

Check	Command
What user is PID 1?	`docker exec ctr id`
Capabilities	`docker exec ctr cat /proc/1/status | grep Cap`
Seccomp active?	`docker inspect ctr | jq '.[0].HostConfig.SecurityOpt'`
Privileged?	`docker inspect ctr | jq '.[0].HostConfig.Privileged'`
Read-only FS?	`docker inspect ctr | jq '.[0].HostConfig.ReadonlyRootfs'`
UID mapping	`docker exec ctr cat /proc/1/uid_map`
Image vulns	`trivy image --severity HIGH,CRITICAL image:tag`
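
Several of these checks read fields from the same docker inspect output, so they can be rolled into one pass. The sketch below assumes the JSON array that docker inspect prints and uses only HostConfig fields (Privileged, Binds, SecurityOpt, ReadonlyRootfs, CapAdd); the sample document is trimmed and hypothetical, and the list of risky capabilities is just the set this lesson calls out.

```python
import json

# Capabilities the lesson singles out as dangerous when added back.
RISKY_CAPS = {"SYS_ADMIN", "SYS_PTRACE", "NET_ADMIN", "SYS_MODULE", "SYS_RAWIO"}


def audit_inspect(inspect_json: str) -> list[str]:
    """Flag risky settings in the JSON array printed by `docker inspect`."""
    findings = []
    for ctr in json.loads(inspect_json):
        name = ctr.get("Name", "?")
        host = ctr.get("HostConfig") or {}
        if host.get("Privileged"):
            findings.append(f"{name}: privileged")
        # Binds entries look like "host_path:container_path[:options]".
        for bind in host.get("Binds") or []:
            if bind.split(":")[0].endswith("docker.sock"):
                findings.append(f"{name}: Docker socket mounted")
        if "seccomp=unconfined" in (host.get("SecurityOpt") or []):
            findings.append(f"{name}: seccomp disabled")
        if not host.get("ReadonlyRootfs"):
            findings.append(f"{name}: writable root filesystem")
        for cap in host.get("CapAdd") or []:
            if cap.upper().removeprefix("CAP_") in RISKY_CAPS:
                findings.append(f"{name}: added capability {cap}")
    return findings


# Trimmed, hypothetical inspect output for a container named "ctr".
SAMPLE = json.dumps([{
    "Name": "/ctr",
    "HostConfig": {
        "Privileged": False,
        "Binds": ["/var/run/docker.sock:/var/run/docker.sock"],
        "SecurityOpt": None,
        "ReadonlyRootfs": True,
        "CapAdd": ["SYS_ADMIN"],
    },
}])

print(audit_inspect(SAMPLE))
```

To audit a live container, feed the function the captured output of docker inspect ctr instead of SAMPLE.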

Hardening Flags

docker run \
    --cap-drop=ALL \
    --cap-add=NET_BIND_SERVICE \
    --read-only \
    --tmpfs /tmp \
    --security-opt=no-new-privileges \
    --init \
    --memory=512m \
    --user 1000:1000 \
    myimage

Takeaways

Containers share the host kernel. Kernel exploit = game over. This is the fundamental difference from VMs.
--privileged and Docker socket = host root. Never use either in production without documented, reviewed justification.
Defaults are decent but not hardened. Docker drops many capabilities and applies seccomp, but still runs as root and allows privilege escalation by default.
User namespaces are among the strongest isolation layers. Rootless containers (Podman, Docker rootless mode) map UID 0 to an unprivileged host UID. Even a container escape gives limited access.
/proc lies inside containers. It shows host values for memory, CPU, etc. Use cgroup-aware tooling and -XX:MaxRAMPercentage for JVMs.

Related Lessons

Permission Denied: container security contexts and how isolation causes access errors
What Happens When You docker build: how images are constructed
The Hanging Deploy: PID 1 in containers and signal handling