The Container Escape
Topics: namespaces, capabilities, seccomp, Docker security, container isolation, attack surface
Level: L2 (Operations)
Time: 60–75 minutes
Prerequisites: Basic container understanding helpful but not required
The Mission¶
Your security team asks: "If an attacker gets code execution inside a container, what can they reach?" The honest answer, for most default Docker configurations, is: more than you'd like.
Containers are not VMs. They share the host kernel. The isolation comes from Linux kernel features — namespaces, cgroups, capabilities, seccomp — and each one has limits, edge cases, and well-known bypasses.
This lesson breaks container security by examining each isolation layer, understanding what it prevents, and seeing what happens when it's misconfigured. You'll understand container isolation deeply enough to harden it — or to explain to your CISO exactly what the risk is.
The Isolation Stack¶
A container is a process with restrictions applied by the kernel:
Normal process → Container process
↓ ↓
Full filesystem → Mount namespace (restricted view)
All PIDs visible → PID namespace (only sees its own processes)
Host network → Network namespace (isolated network stack)
All capabilities → Dropped capabilities (reduced privileges)
All syscalls → Seccomp filter (blocked dangerous syscalls)
No cgroup limits → Cgroup limits (CPU, memory, I/O)
Root UID = root → User namespace (UID 0 in container ≠ UID 0 on host)
Each layer is a defense. Remove any one, and you open an attack path.
Layer 1: Namespaces — What the Container Can See¶
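Linux has 8 namespace types. The table lists the seven that container runtimes rely on, each isolating a different kernel resource (the eighth, the time namespace added in 5.6, virtualizes only the monotonic and boot-time clocks):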
| Namespace | Isolates | Kernel version |
|---|---|---|
| PID | Process IDs | 2.6.24 (2008) |
| NET | Network stack (interfaces, routing, ports) | 2.6.29 (2009) |
| MNT | Filesystem mount points | 2.4.19 (2002) |
| UTS | Hostname and domain name | 2.6.19 (2006) |
| IPC | Inter-process communication (shared memory, semaphores) | 2.6.19 (2006) |
| USER | User and group IDs | 3.8 (2013) |
| CGROUP | Cgroup hierarchy view | 4.6 (2016) |
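Namespace membership can be checked directly: every process exposes its namespaces as symlinks under /proc/<pid>/ns/, and two processes share a namespace exactly when the link targets match. The following is a minimal sketch, assuming a POSIX shell. The function names ns_id and same_ns and the PROC_ROOT override are illustrative, not part of any existing tool.

```shell
#!/bin/sh
# PROC_ROOT is a hypothetical override so the functions can be pointed at a
# fake tree for testing; on a real system it stays at /proc.
PROC_ROOT="${PROC_ROOT:-/proc}"

# ns_id PID TYPE: print the namespace identifier, e.g. "pid:[4026531836]".
ns_id() {
    readlink "$PROC_ROOT/$1/ns/$2"
}

# same_ns PID_A PID_B TYPE: succeed if both processes share the namespace.
# Returns 2 if either link cannot be read (missing PID or no permission).
same_ns() {
    a=$(ns_id "$1" "$3") || return 2
    b=$(ns_id "$2" "$3") || return 2
    [ "$a" = "$b" ]
}
```

On a host, comparing PID 1 against a container's main process (its PID is available from docker inspect -f '{{.State.Pid}}' ctr) for the types pid, net, mnt, uts, ipc, user, and cgroup shows which boundaries that container actually has. Reading another user's /proc/<pid>/ns entries usually requires root.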
What namespaces DON'T isolate¶
The container still shares the host's:
- Kernel — same kernel, same vulnerabilities. A kernel exploit inside the container
= host compromise.
- /proc and /sys contents — /proc/meminfo, /proc/cpuinfo show HOST values,
not container limits. This has caused countless JVM misconfigurations where the JVM
reads host RAM (64GB) and sets its heap accordingly, while the container limit is 512MB.
- Time — containers share the host clock. Changing the system time inside a container
(if you have the capability) affects the host.
- Kernel parameters — most /proc/sys/ values are global. A container that can write
to them affects all containers on the host.
# Inside a container — these show HOST values
cat /proc/meminfo | head -3
# → MemTotal: 65536000 kB ← this is the HOST's RAM, not the container's limit
# The container's actual limit is in cgroups
cat /sys/fs/cgroup/memory.max
# → 536870912 ← 512MB (the real limit)
Gotcha: JVMs before Java 10 read /proc/meminfo for memory sizing. A JVM in a 512MB container on a 64GB host would size its default max heap at ~16GB (a quarter of host RAM) and get OOM-killed as soon as the heap grew past the container limit, with an unhelpful error message. Java 10+ added container awareness via cgroup detection. Always use -XX:MaxRAMPercentage=75.0 or an explicit -Xmx in containers.
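For runtimes that are not cgroup-aware, an entrypoint script can read the limit itself and size the process accordingly. The sketch below assumes cgroup v2 (the memory.max file shown above); cgroup v1 hosts expose the limit under a different path and are not handled. The names container_mem_limit, heap_mb, and CGROUP_ROOT are illustrative.

```shell
#!/bin/sh
# CGROUP_ROOT is a hypothetical override for testing; containers on
# cgroup v2 normally see their own cgroup at /sys/fs/cgroup.
CGROUP_ROOT="${CGROUP_ROOT:-/sys/fs/cgroup}"

# container_mem_limit: print the memory limit in bytes, or "unlimited".
container_mem_limit() {
    limit=unlimited
    if [ -r "$CGROUP_ROOT/memory.max" ]; then
        # cgroup v2: the file holds a byte count or the literal "max"
        limit=$(cat "$CGROUP_ROOT/memory.max")
        if [ "$limit" = "max" ]; then
            limit=unlimited
        fi
    fi
    echo "$limit"
}

# heap_mb PERCENT: print PERCENT of the limit in MiB; fail if unlimited.
heap_mb() {
    limit=$(container_mem_limit)
    if [ "$limit" = "unlimited" ]; then
        return 1
    fi
    echo $(( $limit * $1 / 100 / 1048576 ))
}
```

With a 512MB limit, heap_mb 75 prints 384, which could feed a flag such as -Xmx"$(heap_mb 75)"m. On a modern JVM, -XX:MaxRAMPercentage does the same job without a wrapper script.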
Layer 2: Capabilities — What the Container Can Do¶
Root on a traditional Linux system can do anything. Capabilities split root's power into 41 individual privileges. Docker drops most of them by default.
What Docker keeps (a selection of its 14 default capabilities):
| Capability | What it allows | Risk if exploited |
|---|---|---|
| CHOWN | Change file ownership | Moderate |
| NET_BIND_SERVICE | Bind to ports < 1024 | Low |
| KILL | Send signals to processes | Low (within container) |
| NET_RAW | Raw sockets (ping) | Medium: can sniff traffic on shared network |
| SETUID/SETGID | Change user/group ID | High: privilege escalation inside container |
What Docker drops (among others):
| Capability | What it allows | Why it's dropped |
|---|---|---|
| SYS_ADMIN | Mount filesystems, set up namespaces, and dozens of other admin operations | Effectively root on host |
| SYS_PTRACE | Trace other processes | Can inspect/modify other containers' memory |
| NET_ADMIN | Configure networking | Can modify routing, sniff traffic |
| SYS_MODULE | Load kernel modules | Direct kernel access |
| SYS_RAWIO | Raw I/O port access | Direct hardware access |
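The Cap* fields in /proc/<pid>/status are hex bitmasks in which bit N corresponds to capability number N from the kernel's linux/capability.h. As a rough sketch of how such a mask decodes (the function name decode_caps is made up for illustration; capsh --decode from libcap does the same job where it is installed):

```shell
#!/bin/sh
# Capability names in bit order (bit 0 = CHOWN ... bit 40 = CHECKPOINT_RESTORE).
CAP_NAMES="CHOWN DAC_OVERRIDE DAC_READ_SEARCH FOWNER FSETID KILL SETGID SETUID
SETPCAP LINUX_IMMUTABLE NET_BIND_SERVICE NET_BROADCAST NET_ADMIN NET_RAW
IPC_LOCK IPC_OWNER SYS_MODULE SYS_RAWIO SYS_CHROOT SYS_PTRACE SYS_PACCT
SYS_ADMIN SYS_BOOT SYS_NICE SYS_RESOURCE SYS_TIME SYS_TTY_CONFIG MKNOD LEASE
AUDIT_WRITE AUDIT_CONTROL SETFCAP MAC_OVERRIDE MAC_ADMIN SYSLOG WAKE_ALARM
BLOCK_SUSPEND AUDIT_READ PERFMON BPF CHECKPOINT_RESTORE"

# decode_caps HEXMASK: print one capability name per line for each set bit.
# HEXMASK is the value of a Cap* field without a 0x prefix.
decode_caps() {
    mask=$((0x$1))
    bit=0
    for name in $CAP_NAMES; do
        if [ $(( ($mask >> $bit) & 1 )) -eq 1 ]; then
            echo "$name"
        fi
        bit=$((bit + 1))
    done
}
```

Calling decode_caps 00000000a80425fb (the CapEff value a default Docker container typically reports) lists 14 names and omits SYS_ADMIN. A mask of 000001ffffffffff, as seen in a privileged container on a recent kernel, lists all 41.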
The --privileged flag¶
# This gives the container ALL capabilities, disables seccomp, and grants device access
docker run --privileged myimage
--privileged is the container equivalent of chmod 777. The container can:
- Mount the host filesystem
- Load kernel modules
- Access all devices
- Modify kernel parameters
- Essentially do anything root on the host can do
Mental Model: A privileged container is a process running as host root with a fancy filesystem. The namespace boundaries still exist technically, but with all capabilities and device access, breaking out is trivial.
# From inside a privileged container — mount the host filesystem
# (the device name varies by host; list candidates with lsblk or fdisk -l)
mount /dev/sda1 /mnt
# You now have full read/write access to the host's root filesystem
# Add your SSH key, modify /etc/shadow, install a backdoor
War Story: Tesla's Kubernetes cluster (2018) had an unauthenticated Kubernetes dashboard with cluster-admin access. Attackers deployed cryptocurrency miners. But the real damage was that the pods had access to an S3 bucket containing proprietary telemetry data. The fix was simple: RBAC. But the exposure lasted until external researchers reported it.
Layer 3: Seccomp — What Syscalls the Container Can Make¶
Even without special capabilities, the container's process can make system calls. Seccomp (secure computing) filters which syscalls are allowed.
Docker's default seccomp profile blocks ~44 dangerous syscalls:
| Blocked syscall | What it does | Why blocked |
|---|---|---|
| reboot | Reboots the system | Host reboot |
| kexec_load | Loads a new kernel | Kernel replacement |
| mount | Mounts filesystems | Filesystem escape |
| umount | Unmounts filesystems | Filesystem manipulation |
| swapon/swapoff | Manages swap | Resource manipulation |
| init_module/delete_module | Loads/unloads kernel modules | Kernel access |
| clock_settime | Changes system clock | Time manipulation |
| settimeofday | Changes system time | Time manipulation |
War Story: A Go service worked in dev but crashed in production with a cryptic error. Same image SHA everywhere. Root cause: production had a custom seccomp profile that blocked mmap with PROT_EXEC flags, something the default Docker profile allows but the hardened production profile didn't. The Go runtime needed executable memory mappings for its garbage collector. "Same image" does NOT mean "same runtime environment": security policies, seccomp profiles, and capability sets differ between environments.
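A quick way to compare environments is to check whether a seccomp filter is attached at all. The kernel reports this in the Seccomp: field of /proc/<pid>/status (0 = disabled, 1 = strict, 2 = filter). A small sketch, with seccomp_mode as an illustrative helper name:

```shell
#!/bin/sh
# seccomp_mode [STATUS_FILE]: print disabled, strict, or filter for the
# process described by STATUS_FILE (defaults to the current process).
seccomp_mode() {
    # Match the field name exactly so "Seccomp_filters:" is not picked up.
    mode=$(awk '$1 == "Seccomp:" { print $2 }' "${1:-/proc/self/status}")
    case "$mode" in
        0) echo disabled ;;
        1) echo strict ;;
        2) echo filter ;;
        *) echo unknown; return 1 ;;
    esac
}
```

Run inside a container, this should report filter under Docker's default profile and disabled when the container was started with --security-opt seccomp=unconfined. It shows only whether a filter is active, not which profile, so two environments that both report filter can still block different syscalls.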
Layer 4: User Namespaces — The UID Question¶
Inside the container, you're root (UID 0). On the host, what UID are you?
Without user namespaces (default Docker):
- Container root = host root (UID 0)
- If you escape the container, you have host root privileges
With user namespaces (rootless Docker/Podman):
- Container root (UID 0) maps to an unprivileged host UID (e.g., 100000)
- Even if you escape, you're an unprivileged user on the host
# Check UID mapping from inside container
cat /proc/1/uid_map
# Without user namespaces: 0 0 4294967295 (container 0 = host 0)
# With user namespaces: 0 100000 65536 (container 0 = host 100000)
Rootless containers (Podman default, Docker rootless mode) are the strongest isolation available without a VM. The trade-off: can't bind ports below 1024 by default (the net.ipv4.ip_unprivileged_port_start sysctl controls this), and some operations are slower (FUSE-based OverlayFS before kernel 5.11).
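Each uid_map line has three columns: the first UID inside the namespace, the first UID outside it, and the length of the range. Translating a container UID to its host UID is then simple arithmetic. A hedged sketch, where host_uid is a hypothetical helper name:

```shell
#!/bin/sh
# host_uid CONTAINER_UID [MAP_FILE]: print the host UID that CONTAINER_UID
# maps to, or fail if no range in the map covers it.
host_uid() {
    awk -v uid="$1" '
        # $1 = first UID inside, $2 = first UID outside, $3 = range length
        uid >= $1 && uid < $1 + $3 { print $2 + (uid - $1); found = 1; exit }
        END { if (!found) exit 1 }
    ' "${2:-/proc/self/uid_map}"
}
```

Against the rootless mapping shown above (0 100000 65536), host_uid 0 prints 100000 and host_uid 1000 prints 101000. Against the default mapping (0 0 4294967295), host_uid 0 prints 0, which is the case to worry about.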
Layer 5: The Docker Socket — The Master Key¶
# This is the most dangerous volume mount in all of Docker
docker run -v /var/run/docker.sock:/var/run/docker.sock myimage
The Docker socket gives the container full control over the Docker daemon. From inside, you can:
# Create a new privileged container that mounts the host filesystem
docker run -it --privileged -v /:/host alpine chroot /host
# You now have a host root shell
Mounting the Docker socket is equivalent to giving the container root access to the host. Period. CI/CD tools, monitoring agents, and log collectors that "need Docker access" should use alternatives (rootless Docker-in-Docker, Kaniko for builds, read-only APIs).
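From inside a container, one rough way to notice this exposure is to look for a mount point ending in docker.sock in /proc/self/mountinfo, where the fifth field of each line is the mount point. The sketch below uses an illustrative function name, find_docker_socket_mounts. It only catches sockets bind-mounted as a single file, so a socket that arrives via a mounted parent directory (for example -v /var/run:/var/run) would not be reported.

```shell
#!/bin/sh
# find_docker_socket_mounts [MOUNTINFO]: print every mount point whose path
# ends in docker.sock; exit non-zero if none is found.
find_docker_socket_mounts() {
    awk '
        $5 ~ /(^|\/)docker\.sock$/ { print $5; found = 1 }
        END { exit !found }
    ' "${1:-/proc/self/mountinfo}"
}
```

A non-empty result in a workload that has no documented need for Docker access is worth treating as a finding during review.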
The Hardened Container Checklist¶
# 1. Non-root user
FROM python:3.11-slim
RUN useradd -r -u 1000 appuser
USER appuser
# 2. Minimal base image (less attack surface)
FROM gcr.io/distroless/static:nonroot
# 3. Drop ALL capabilities, add only what's needed
docker run --cap-drop=ALL --cap-add=NET_BIND_SERVICE myimage
# 4. Read-only filesystem
docker run --read-only --tmpfs /tmp myimage
# 5. No new privileges (prevent setuid binaries)
docker run --security-opt=no-new-privileges myimage
# 6. PID 1 init (signal handling + zombie reaping)
docker run --init myimage
# 7. Memory limits (prevent host OOM)
docker run --memory=512m myimage
# 8. Never mount the Docker socket
# 9. Never use --privileged
# 10. Scan images for vulnerabilities
trivy image --severity HIGH,CRITICAL myimage:v1
In Kubernetes:
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]
    add: ["NET_BIND_SERVICE"]
Flashcard Check¶
Q1: Containers share _____ with the host that VMs don't.
The kernel. A kernel exploit inside a container compromises the host. VMs have their own kernel, isolated by the hypervisor.
Q2: /proc/meminfo inside a container shows 64GB but the limit is 512MB. Why?
/proc/meminfo shows host values, not container limits. The actual limit is in /sys/fs/cgroup/memory.max. JVMs before Java 10 misread this and OOM'd.
Q3: What does --privileged do?
Grants ALL capabilities, disables seccomp, gives access to all host devices. The container can mount the host filesystem, load kernel modules, and effectively operate as host root.
Q4: Why is mounting /var/run/docker.sock dangerous?
It gives the container full control over the Docker daemon. From inside, you can create a new privileged container that mounts the host filesystem — instant host root.
Q5: Container root (UID 0) without user namespaces = what on the host?
Also UID 0 (root). Container escape = host root. With user namespaces, container UID 0 maps to an unprivileged host UID.
Q6: What does Docker's default seccomp profile block?
~44 dangerous syscalls including reboot, mount, kexec_load, init_module. Custom profiles can restrict further based on your app's needs.
Exercises¶
Exercise 1: Inspect your container's isolation (hands-on)¶
# Run a container and inspect its capabilities
docker run --rm alpine cat /proc/1/status | grep Cap
# Decode the capability bitmask
docker run --rm alpine sh -c 'apk add -q libcap && capsh --decode=$(cat /proc/1/status | grep CapEff | awk "{print \$2}")'
# Compare with a privileged container
docker run --rm --privileged alpine cat /proc/1/status | grep Cap
Exercise 2: Demonstrate the Docker socket escape (lab only)¶
# Run container with socket mounted
docker run --rm -it -v /var/run/docker.sock:/var/run/docker.sock \
docker:cli sh
# From inside, create a host-root container
docker run --rm -it --privileged -v /:/hostfs alpine sh
ls /hostfs/etc/shadow # You now have host root access
Exercise 3: Harden a Dockerfile (refactor)¶
This Dockerfile has multiple security issues. Fix them all:
FROM ubuntu:latest
RUN apt-get update && apt-get install -y python3 curl vim
COPY . /app
WORKDIR /app
ENV DB_PASSWORD=supersecret
CMD ["python3", "app.py"]
Solution
FROM python:3.11-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir --user -r requirements.txt
COPY . .
FROM python:3.11-slim
WORKDIR /app
RUN useradd -r -u 1000 appuser
COPY --from=builder --chown=appuser /root/.local /home/appuser/.local
COPY --from=builder --chown=appuser /app .
USER appuser
ENV PATH=/home/appuser/.local/bin:$PATH
# DB_PASSWORD should come from secrets mount, not ENV
CMD ["python3", "app.py"]
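The comment about secrets can be made concrete with a small entrypoint helper. Docker Compose and Swarm mount secrets as files under /run/secrets/<name>. The sketch below assumes a secret named db_password has been configured that way; the read_secret function and the SECRETS_DIR override are illustrative names.

```shell
#!/bin/sh
# SECRETS_DIR defaults to /run/secrets; the override exists only so the
# function can be tested against a temporary directory.
SECRETS_DIR="${SECRETS_DIR:-/run/secrets}"

# read_secret NAME: print the secret's contents, or fail if it is missing.
read_secret() {
    file="$SECRETS_DIR/$1"
    if [ ! -r "$file" ]; then
        echo "secret '$1' not found in $SECRETS_DIR" >&2
        return 1
    fi
    cat "$file"
}
```

An entrypoint could then run DB_PASSWORD=$(read_secret db_password) || exit 1 before starting the app. This keeps the value out of the image layers and out of docker history. Having the application read the file directly is stricter still, since the value then never enters the process environment.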
Cheat Sheet¶
Container Security Quick Check¶
| Check | Command |
|---|---|
| What user is PID 1? | docker exec ctr id |
| Capabilities | docker exec ctr cat /proc/1/status \| grep Cap |
| Seccomp active? | docker inspect ctr \| jq '.[0].HostConfig.SecurityOpt' |
| Privileged? | docker inspect ctr \| jq '.[0].HostConfig.Privileged' |
| Read-only FS? | docker inspect ctr \| jq '.[0].HostConfig.ReadonlyRootfs' |
| UID mapping | docker exec ctr cat /proc/1/uid_map |
| Image vulns | trivy image --severity HIGH,CRITICAL image:tag |
Hardening Flags¶
docker run \
--cap-drop=ALL \
--cap-add=NET_BIND_SERVICE \
--read-only \
--tmpfs /tmp \
--security-opt=no-new-privileges \
--init \
--memory=512m \
--user 1000:1000 \
myimage
Takeaways¶
- Containers share the host kernel. Kernel exploit = game over. This is the fundamental difference from VMs.
- --privileged and Docker socket = host root. Never use either in production without documented, reviewed justification.
- Defaults are decent but not hardened. Docker drops many capabilities and applies seccomp, but still runs as root and allows privilege escalation by default.
- User namespaces are the strongest isolation. Rootless containers (Podman) map UID 0 to an unprivileged host UID. Even container escape gives limited access.
- /proc lies inside containers. It shows host values for memory, CPU, etc. Use cgroup-aware tooling and -XX:MaxRAMPercentage for JVMs.
Related Lessons¶
- Permission Denied: container security contexts and how isolation causes access errors
- What Happens When You docker build: how images are constructed
- The Hanging Deploy: PID 1 in containers and signal handling