Portal | Level: L2: Operations | Topics: Docker / Containers, Container Runtimes | Domain: DevOps & Tooling
Containers - How They Really Work¶
Scope¶
This document explains containers as Linux mechanisms, not as marketing. It covers:
- namespaces
- cgroups
- capabilities
- seccomp
- root filesystem layering
- OCI images and runtimes
- Docker, containerd, and runc roles
- common misconceptions and failure modes
This is the core knowledge behind Docker and most modern container platforms.
Big picture¶
A container is not a tiny VM. It is usually:
- a process or process tree
- running on the host kernel
- inside isolated namespaces
- with resource controls from cgroups
- with a prepared root filesystem
- with reduced privileges and policy restrictions
Minimum mental model¶
container image
-> unpacked / mounted as root filesystem
-> runtime creates namespaces
-> runtime configures cgroups
-> runtime applies capabilities / seccomp / uid mappings
-> runtime execs process inside that environment
-> process runs on the host kernel
The host kernel is shared. That is the first thing people gloss over and the first thing that matters.
```mermaid
graph TD
  A[Docker CLI] --> B[dockerd]
  B --> C[containerd]
  C --> D[containerd-shim]
  D --> E[runc]
  E --> F[Container Process]
```
What a container is not¶
A container is not:
- a separate kernel
- a hypervisor-managed VM by default
- a guarantee of strong security isolation
- magic packaging
- just a tarball
- just a Docker image
A container is the running execution environment created from several Linux primitives plus some image/runtime conventions.
The main components¶
1. Namespaces¶
Namespaces isolate what a process can see.
Common namespace types:
- mount namespace - filesystem mount view
- PID namespace - process tree view
- network namespace - interfaces, routes, sockets, firewall context
- UTS namespace - hostname/domainname
- IPC namespace - SysV IPC and POSIX message queues
- user namespace - UID/GID mappings and privilege rebasing
- cgroup namespace - cgroup path visibility
- time namespace - boot-time/monotonic clock offsets for the namespace's processes
Why namespaces matter¶
Without namespaces, a "container" would just be a normal process with no illusion of separation.
Examples:
- PID namespace makes PID 1 inside the container not be PID 1 on the host
- network namespace gives the container its own interfaces and routes
- mount namespace gives it a different filesystem view
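These boundaries are visible from any Linux shell, because every process already runs inside a set of namespaces. A quick read-only look (no root required; inode numbers will differ per system):

```shell
# Every Linux process has one symlink per namespace type under /proc/<pid>/ns/.
# The inode number in the link target identifies the namespace instance.
ls -l /proc/self/ns/
readlink /proc/self/ns/mnt   # prints something like mnt:[4026531841]
# Two processes share a namespace exactly when these inode numbers match.
```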
Important limitation¶
A namespace changes visibility and scope. It does not automatically make things safe. If you still grant too much privilege, the process can do real damage.
2. Cgroups¶
Control groups manage and account for resources.
Typical resource controls:
- CPU shares / quotas / weights
- memory limits
- IO throttling
- PID count limits
- device access control in some setups
Why cgroups matter¶
Without cgroups, one noisy container can:
- eat all CPU
- consume all memory
- spawn huge numbers of processes
- starve the node
cgroup v2 model¶
Modern Linux prefers cgroup v2, which provides a unified hierarchy and cleaner controller behavior than the old v1 multi-hierarchy mess.
Typical controllers:
- cpu
- memory
- io
- pids
- cpuset
Important subtlety¶
A container memory limit is not "reserved memory." It is usually a ceiling. Under pressure, the kernel still decides what to reclaim and what to kill.
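You can inspect all of this from an unprivileged shell. The read-only commands below assume a cgroup v2 (unified) hierarchy mounted at /sys/fs/cgroup; the limit-setting step is shown commented because it needs root:

```shell
# Which cgroup does this shell belong to? cgroup v2 shows a single "0::" entry.
cat /proc/self/cgroup
# Which controllers are available at the root of the hierarchy?
cat /sys/fs/cgroup/cgroup.controllers 2>/dev/null || echo "cgroup v2 not mounted here"
# Setting a memory ceiling for a group (requires root):
#   mkdir /sys/fs/cgroup/demo
#   echo 512M > /sys/fs/cgroup/demo/memory.max
```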
3. Root filesystem¶
The container process needs a root filesystem view.
This typically comes from:
- image layers
- a writable upper layer
- bind mounts
- volumes
- tmpfs mounts
- special filesystems like /proc, /sys, /dev
Overlay-based mental model¶
lower image layers (read-only)
+ upper writable layer
+ workdir
= merged mount presented as container rootfs
This is why image layers are reusable while each running container still has writable state.
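The merged view can be reproduced by hand with a plain overlay mount. A minimal sketch; the mount itself needs root and overlayfs kernel support, so it is shown commented:

```shell
# Prepare the three ingredient directories plus the mount point.
mkdir -p /tmp/ovl/lower /tmp/ovl/upper /tmp/ovl/work /tmp/ovl/merged
echo "from the read-only image layer" > /tmp/ovl/lower/file.txt
# mount -t overlay overlay \
#   -o lowerdir=/tmp/ovl/lower,upperdir=/tmp/ovl/upper,workdir=/tmp/ovl/work \
#   /tmp/ovl/merged
# After mounting, edits under /tmp/ovl/merged land in upper/; lower/ is untouched.
```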
Important implication¶
Deleting a container often deletes only the writable layer, not the shared image layers. Volumes live elsewhere.
4. Capabilities¶
Traditional Unix privilege is coarse: root vs not-root. Linux capabilities split many root powers into smaller units.
Examples:
- CAP_NET_ADMIN
- CAP_SYS_ADMIN
- CAP_SYS_PTRACE
- CAP_NET_BIND_SERVICE
A container can run as UID 0 inside its namespace yet still lack many dangerous capabilities.
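The capability sets of a process are visible in /proc, so you can check what "root" actually holds (no root needed to read your own):

```shell
# Effective capabilities of the current process, as a hex bitmask.
grep CapEff /proc/self/status
# If libcap's capsh tool is installed, decode the mask into capability names:
# capsh --decode=$(awk '/CapEff/ {print $2}' /proc/self/status)
```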
Why this matters¶
"Running as root in a container" is bad, but the exact blast radius depends on:
- user namespace usage
- dropped capabilities
- device access
- seccomp
- mount permissions
- escape vulnerabilities
CAP_SYS_ADMIN¶
This is the junk drawer of power. If you see it granted casually, raise an eyebrow.
5. Seccomp¶
Seccomp filters system calls.
A runtime can say:
- allow normal syscalls
- deny dangerous or rarely needed syscalls
- kill or error on prohibited calls
Why it matters:
- the kernel syscall surface is the real attack surface of a containerized process
- seccomp reduces reachable kernel behavior
- many breakages under "hardened" settings are seccomp-denied syscalls
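Docker-style seccomp profiles are JSON. A deliberately tiny, illustrative fragment; a real default profile allows hundreds of syscalls, so the names here are examples, not a usable allowlist:

```json
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "syscalls": [
    {
      "names": ["read", "write", "openat", "close", "exit_group"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
```

A profile like this is loaded with something like docker run --security-opt seccomp=profile.json; denied syscalls then fail with an errno instead of reaching the kernel.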
6. User namespaces¶
User namespaces remap UIDs/GIDs between container and host.
Example mental model:
- container thinks process is UID 0
- host maps that to an unprivileged high UID
This greatly reduces risk because "container root" is not automatically "host root."
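util-linux's unshare can demonstrate the remapping from an unprivileged shell on distros that allow unprivileged user namespaces (guarded here, because many hardened hosts and CI sandboxes disable them):

```shell
# Become "root" in a fresh user namespace; uid_map shows the real mapping.
unshare --map-root-user sh -c 'id -u; cat /proc/self/uid_map' 2>/dev/null \
  || echo "unprivileged user namespaces unavailable here"
# Inside, id -u prints 0, while uid_map maps that 0 back to your host UID.
```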
Why people avoid them sometimes¶
- compatibility issues
- volume permission friction
- tooling assumptions
- older environments not built around them
But from a security perspective they matter a lot.
OCI stack and who does what¶
The container world is layered.
OCI image spec¶
Defines how images are structured:
- layers
- config
- metadata
- manifests
OCI runtime spec¶
Defines how to run a container:
- process args
- env
- mounts
- namespaces
- cgroups
- capabilities
- seccomp
- hooks
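All of these settings land in the runtime spec's config.json. A heavily trimmed fragment showing the shape; field names follow the OCI runtime spec, values are examples:

```json
{
  "ociVersion": "1.0.2",
  "process": {
    "args": ["/usr/sbin/nginx", "-g", "daemon off;"],
    "env": ["PATH=/usr/sbin:/usr/bin:/sbin:/bin"],
    "cwd": "/"
  },
  "root": { "path": "rootfs", "readonly": false },
  "linux": {
    "namespaces": [
      { "type": "pid" }, { "type": "network" }, { "type": "mount" }
    ],
    "resources": { "memory": { "limit": 268435456 } }
  }
}
```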
Runtime stack example¶
With Docker on Linux, a common path is:
Docker CLI -> dockerd -> containerd -> containerd-shim -> runc -> container process
Roles¶
- Docker CLI: user interface
- dockerd: Docker engine management
- containerd: container lifecycle and image management
- runc: low-level OCI runtime that actually creates the container execution environment
- host kernel: does the real isolation and scheduling work
The kernel is still the final authority.
The actual creation flow¶
Here is the simplified lifecycle for starting a container.
1. Image lookup¶
The runtime resolves the image reference, for example:
- a tag like nginx:latest
- digest-pinned image
- local image ID
It fetches metadata and layers if needed.
2. Layer unpack / snapshot prep¶
The runtime prepares the rootfs view using a snapshotter or storage driver such as overlayfs-backed storage.
3. OCI config generation¶
The higher-level runtime builds a spec containing:
- command
- environment
- cwd
- mounts
- namespace settings
- cgroup settings
- Linux capabilities
- seccomp profile
- user mappings
4. Namespace creation¶
The runtime creates the selected namespaces, often via clone() / unshare() style primitives.
5. Cgroup placement¶
The process is attached to cgroups that enforce resource accounting and limits.
6. Filesystem setup¶
The merged rootfs is mounted. Additional mounts are attached:
- bind mounts
- volumes
- /proc, /sys, /dev
- secrets/config mounts
7. Network setup¶
If using a separate network namespace:
- a veth pair may be created
- one end stays in host namespace
- one moves into container namespace
- IP, routes, and DNS config are applied
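The same wiring can be done manually with iproute2. A sketch, shown entirely commented because every step needs root; the names and addresses are arbitrary examples:

```shell
# ip netns add demo                                   # new network namespace
# ip link add veth-host type veth peer name veth-ctr  # create the veth pair
# ip link set veth-ctr netns demo                     # move one end inside
# ip addr add 10.200.0.1/24 dev veth-host
# ip link set veth-host up
# ip netns exec demo ip addr add 10.200.0.2/24 dev veth-ctr
# ip netns exec demo ip link set veth-ctr up
# ip netns exec demo ip link set lo up
# ip netns exec demo ping -c1 10.200.0.1              # reach the host end
```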
8. Privilege tightening¶
The runtime applies:
- capability drops
- seccomp filter
- no-new-privileges
- read-only mount options
- masked paths
- device restrictions
9. execve() of container process¶
The target process starts. From then on, the "container" is basically that process tree plus the environment around it.
PID 1 problem¶
Inside a container, the main process often becomes PID 1 in its PID namespace.
PID 1 is special on Linux:
- signal handling semantics differ
- orphan reaping responsibility matters
- badly behaved PID 1 processes can leak zombies
This is why people sometimes use:
- tini
- dumb-init
- proper init wrappers
If the app is not written to act like an init-ish process, container behavior gets weird under shutdown or child-process churn.
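One common fix is to put a tiny init in front of the app. A sketch of a Dockerfile using tini; the /usr/bin/tini path assumes the Debian package, so adjust for your base image:

```dockerfile
FROM debian:stable-slim
RUN apt-get update && apt-get install -y --no-install-recommends tini \
    && rm -rf /var/lib/apt/lists/*
# tini becomes PID 1: it forwards signals and reaps orphaned children,
# then runs the real application as a child process.
ENTRYPOINT ["/usr/bin/tini", "--"]
CMD ["myapp"]
```

Docker's own docker run --init flag achieves the same thing by injecting a bundled tini without changing the image.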
Networking in containers¶
A standalone container commonly uses a private network namespace wired to the host through a veth pair, a Linux bridge (docker0 in Docker's default setup), and NAT for outbound traffic.
Alternative modes:
- host network namespace
- macvlan/ipvlan
- overlay networks
- CNI-based multi-plugin setups in Kubernetes
Important reality: most container networking problems are regular Linux networking problems wearing a fake moustache.
Storage semantics people get wrong¶
Image layers are not the same as runtime data¶
Image layers are build-time, mostly immutable components. Runtime writes go to:
- writable container layer
- mounted volume
- bind mount
- tmpfs
Copy-on-write cost is real¶
Overlay copy-up behavior means modifying a file from a lower layer can trigger extra work. Some workloads suffer badly.
Ephemeral means ephemeral¶
If you write into the container writable layer and delete the container, that data is usually gone. This is not a moral lesson. It is just the architecture.
Security model realities¶
Containers improve isolation and packaging, but do not equal VMs.
Containers are usually weaker isolation than VMs¶
Why:
- shared host kernel
- kernel attack surface still exposed
- misconfigurations are common
- over-broad capabilities are common
- host mounts and Docker socket exposure are catastrophic
Huge footguns¶
- privileged containers
- mounting /var/run/docker.sock
- host network mode without reason
- host PID namespace exposure
- broad device access
- writable host path mounts into sensitive directories
- granting CAP_SYS_ADMIN
- running as root without userns and with broad mounts
Better patterns¶
- non-root user
- drop capabilities
- seccomp enabled
- read-only rootfs where possible
- no-new-privileges
- user namespaces where feasible
- minimal images
- avoid Docker socket in containers
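Put together, a hardened invocation might look like this. All flags are standard docker run flags; the image name and UID are placeholders:

```shell
# Run as a non-root UID, drop all capabilities, forbid privilege escalation,
# keep the rootfs read-only, and cap memory and process count.
docker run --rm \
  --user 10001:10001 \
  --cap-drop ALL \
  --security-opt no-new-privileges \
  --read-only --tmpfs /tmp \
  --memory 256m --pids-limit 100 \
  example/myapp:latest
```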
Debugging mental model¶
When a container fails, separate the layers.
Layer 1 - process problem¶
Questions:
- did the process start?
- wrong command / entrypoint?
- missing env?
- crashed immediately?
Tools:
- logs
- exit code
- ps
- runtime inspect output
Layer 2 - filesystem problem¶
Questions:
- missing file?
- wrong mount?
- read-only rootfs?
- permissions mismatch?
- volume path wrong?
Tools:
- mount / findmnt
- runtime inspect
- host path inspection
Layer 3 - namespace problem¶
Questions:
- wrong PID namespace assumption?
- host cannot see process the same way?
- network namespace isolated?
Tools:
- lsns
- nsenter
- ip netns-style techniques where relevant
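A quick way to answer these questions from the host, using only /proc plus nsenter for deeper inspection. The target PID here is the current shell, as a stand-in for a real container PID:

```shell
# Compare a process's namespaces against your own: matching inode numbers
# mean shared namespaces, differing numbers mean isolation.
pid=$$   # substitute the container's host PID (from ps or runtime inspect)
for ns in pid net mnt; do
  printf '%s: ' "$ns"
  readlink "/proc/$pid/ns/$ns"
done
# nsenter --target "$pid" --net ip addr   # run a command inside its netns (root)
```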
Layer 4 - cgroup/resource problem¶
Questions:
- OOM killed?
- CPU throttled?
- PID limit hit?
- IO throttled?
Tools:
- cgroup files
- runtime inspect
- kernel logs
- systemd-cgls / systemd-cgtop on systemd hosts
Layer 5 - security policy problem¶
Questions:
- seccomp denied?
- capability missing?
- SELinux/AppArmor policy blocked?
- device access denied?
This is where many "but I’m root in the container" tantrums go to die.
Common misconceptions¶
"A container is just a process"¶
Almost true, but incomplete. It is a process plus constrained kernel context: namespaces, cgroups, mounts, capabilities, seccomp, and often image-backed rootfs semantics.
"Containers are lightweight VMs"¶
Wrong in an architectural sense. Similar use case sometimes, different isolation mechanism.
"If it runs in Docker, it will run the same everywhere"¶
Only if:
- kernel features line up
- architecture matches
- cgroup mode is compatible
- filesystem and mount behavior line up
- runtime settings match
- security modules do not differ materially
"Root in container is fake root so it is safe"¶
Sometimes less dangerous than host root, but still dangerous enough to wreck your week if configured badly.
Interview angles¶
Strong topics hidden inside this doc:
- difference between namespaces and cgroups
- why containers are not VMs
- what runc does
- what an OCI runtime spec is
- why PID 1 matters
- how container networking usually works
- why privileged containers are dangerous
- why overlayfs matters for images
- what seccomp and capabilities do
A strong answer is mechanical, not mystical.
Mental model to keep¶
A container is:
- a normal Linux process tree
- started with a specially prepared rootfs
- shown a constrained view of the system through namespaces
- limited by cgroups
- granted a reduced privilege set
- run by a higher-level runtime that automates all that plumbing
The host kernel is still king.
References¶
- Linux cgroup v2 documentation
- man 7 cgroups
- Docker engine storage drivers
- OverlayFS storage driver
- Docker networking
- OCI runtime concepts in Docker/containerd ecosystem
- man 2 bpf
Practice¶
- Topic primer: Container Runtime Debug
- Drills: Docker Drills, Container Runtime Drills
- Skillcheck: Docker
Wiki Navigation¶
Prerequisites¶
- Linux Ops (Topic Pack, L0)
Related Content¶
- Containers Deep Dive (Topic Pack, L1) — Container Runtimes, Docker / Containers
- Interview: Docker Container Debugging (Scenario, L1) — Container Runtimes, Docker / Containers
- AWS ECS (Topic Pack, L2) — Docker / Containers
- Case Study: CI Pipeline Fails — Docker Layer Cache Corruption (Case Study, L2) — Docker / Containers
- Case Study: Container Vuln Scanner False Positive Blocks Deploy (Case Study, L2) — Docker / Containers
- Case Study: ImagePullBackOff Registry Auth (Case Study, L1) — Docker / Containers
- Container Images (Topic Pack, L1) — Docker / Containers
- Container Runtime Drills (Drill, L2) — Container Runtimes
- Container Runtime Flashcards (CLI) (flashcard_deck, L1) — Container Runtimes
- Deep Dive: Docker Image Internals (deep_dive, L2) — Docker / Containers