
Portal | Level: L2: Operations | Topics: Docker / Containers, Container Runtimes | Domain: DevOps & Tooling

Containers - How They Really Work

Scope

This document explains containers as Linux mechanisms, not as marketing. It covers:

  • namespaces
  • cgroups
  • capabilities
  • seccomp
  • root filesystem layering
  • OCI images and runtimes
  • Docker, containerd, and runc roles
  • common misconceptions and failure modes

This is the core knowledge behind Docker and most modern container platforms.


Big picture

A container is not a tiny VM. It is usually:

  • a process or process tree
  • running on the host kernel
  • inside isolated namespaces
  • with resource controls from cgroups
  • with a prepared root filesystem
  • with reduced privileges and policy restrictions

Minimum mental model

container image
  -> unpacked / mounted as root filesystem
  -> runtime creates namespaces
  -> runtime configures cgroups
  -> runtime applies capabilities / seccomp / uid mappings
  -> runtime execs process inside that environment
  -> process runs on the host kernel

The host kernel is shared. That is the first thing people gloss over and the first thing that matters.

Docker's runtime stack at a glance:

Docker CLI
  -> dockerd
  -> containerd
  -> containerd-shim
  -> runc
  -> container process

What a container is not

A container is not:

  • a separate kernel
  • a hypervisor-managed VM by default
  • a guarantee of strong security isolation
  • magic packaging
  • just a tarball
  • just a Docker image

A container is the running execution environment created from several Linux primitives plus some image/runtime conventions.


The main components

1. Namespaces

Namespaces isolate what a process can see.

Common namespace types:

  • mount namespace - filesystem mount view
  • PID namespace - process tree view
  • network namespace - interfaces, routes, sockets, firewall context
  • UTS namespace - hostname/domainname
  • IPC namespace - SysV IPC and POSIX message queues
  • user namespace - UID/GID mappings and privilege rebasing
  • cgroup namespace - cgroup path visibility
  • time namespace - per-namespace boot/monotonic clock offsets on newer kernels

Why namespaces matter

Without namespaces, a "container" would just be a normal process with no illusion of separation.

Examples:

  • PID namespace gives the container its own PID 1, which is an ordinary, different PID on the host
  • network namespace gives the container its own interfaces and routes
  • mount namespace gives it a different filesystem view

Important limitation

A namespace changes visibility and scope. It does not automatically make things safe. If you still grant too much privilege, the process can do real damage.
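You can see namespace membership directly in /proc. A minimal sketch in Python (the parsing helper is illustrative; /proc/<pid>/ns is the real Linux interface):

```python
# Sketch: inspect namespace membership via /proc (Linux-only paths).
# Each entry in /proc/<pid>/ns is a symlink to a target like "net:[4026531992]";
# two processes share a namespace exactly when the inode numbers match.
import os
import re

def parse_ns_link(link_target):
    """Parse a target like 'net:[4026531992]' into ('net', 4026531992)."""
    m = re.fullmatch(r"(\w+):\[(\d+)\]", link_target)
    if m is None:
        raise ValueError(f"unexpected ns link format: {link_target!r}")
    return m.group(1), int(m.group(2))

def own_namespaces():
    """Return {namespace_type: inode} for the current process."""
    ns_dir = "/proc/self/ns"
    return dict(
        parse_ns_link(os.readlink(os.path.join(ns_dir, entry)))
        for entry in os.listdir(ns_dir)
    )

print(parse_ns_link("pid:[4026531836]"))   # ('pid', 4026531836)
if os.path.isdir("/proc/self/ns"):         # only meaningful on Linux
    print(sorted(own_namespaces()))
```

Comparing these inode numbers between a containerized process and PID 1 on the host is the quickest way to confirm which namespaces were actually created.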


2. Cgroups

Control groups manage and account for resources.

Typical resource controls:

  • CPU shares / quotas / weights
  • memory limits
  • IO throttling
  • PID count limits
  • device access control in some setups

Why cgroups matter

Without cgroups, one noisy container can:

  • eat all CPU
  • consume all memory
  • spawn huge numbers of processes
  • starve the node

cgroup v2 model

Modern Linux prefers cgroup v2, which provides a unified hierarchy and cleaner controller behavior than the old v1 multi-hierarchy mess.

Typical controllers:

  • cpu
  • memory
  • io
  • pids
  • cpuset

Important subtlety

A container memory limit is not "reserved memory." It is usually a ceiling. Under pressure, the kernel still decides what to reclaim and what to kill.
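cgroup v2 exposes these limits as plain text files. A sketch of parsing the two most common ones (memory.max and cpu.max are real cgroup v2 interface files; the sample strings below are illustrative values, not read from a host):

```python
# Sketch: parse cgroup v2 control-file contents into usable numbers.

def parse_memory_max(text):
    """memory.max holds either 'max' (no limit) or a byte ceiling."""
    text = text.strip()
    return None if text == "max" else int(text)

def parse_cpu_max(text):
    """cpu.max holds '<quota> <period>' in microseconds; quota may be 'max'."""
    quota, period = text.split()
    return (None if quota == "max" else int(quota)), int(period)

print(parse_memory_max("536870912"))   # 536870912 -> a 512 MiB ceiling
print(parse_memory_max("max"))         # None -> no limit configured
print(parse_cpu_max("100000 100000"))  # (100000, 100000) -> ~1 full CPU
```

The quota/period pair is also why "CPU limit" really means throttling: a container with quota 100000 and period 100000 gets at most one CPU's worth of time per period, then waits.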


3. Root filesystem

The container process needs a root filesystem view.

This typically comes from:

  • image layers
  • a writable upper layer
  • bind mounts
  • volumes
  • tmpfs mounts
  • special filesystems like /proc, /sys, /dev

Overlay-based mental model

lower image layers (read-only)
  + upper writable layer
  + workdir
  = merged mount presented as container rootfs

This is why image layers are reusable while each running container still has writable state.

Important implication

Deleting a container often deletes only the writable layer, not the shared image layers. Volumes live elsewhere.
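The overlay semantics above can be modeled in a few lines. This is a toy illustration of merge/shadow/whiteout behavior only, not how the kernel stores anything:

```python
# Toy model of overlay merging: lower layers are read-only, writes land in the
# upper layer, and a whiteout entry in the upper layer hides a lower file.
WHITEOUT = object()

def merged_view(lower_layers, upper):
    """lower_layers: list of {path: content}, bottom first; upper may whiteout."""
    view = {}
    for layer in lower_layers:      # later (higher) layers shadow earlier ones
        view.update(layer)
    for path, content in upper.items():
        if content is WHITEOUT:
            view.pop(path, None)    # deletion recorded in the writable layer
        else:
            view[path] = content    # new write or copied-up modification
    return view

lower = [{"/etc/os-release": "distro", "/bin/sh": "shell"}]
upper = {"/app/data": "runtime write", "/bin/sh": WHITEOUT}
print(merged_view(lower, upper))
# {'/etc/os-release': 'distro', '/app/data': 'runtime write'}
```

Note that "deleting" /bin/sh never touched the lower layer; it only added a marker in the upper one. That is exactly why removing a container discards its writes while the image layers stay shared.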


4. Capabilities

Traditional Unix privilege is coarse: root vs not-root. Linux capabilities split many root powers into smaller units.

Examples:

  • CAP_NET_ADMIN
  • CAP_SYS_ADMIN
  • CAP_SYS_PTRACE
  • CAP_NET_BIND_SERVICE

A container can run as UID 0 inside its namespace yet still lack many dangerous capabilities.

Why this matters

"Running as root in a container" is bad, but the exact blast radius depends on:

  • user namespace usage
  • dropped capabilities
  • device access
  • seccomp
  • mount permissions
  • escape vulnerabilities

CAP_SYS_ADMIN

This is the junk drawer of power. If you see it granted casually, raise an eyebrow.
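Capabilities show up as hex bitmasks, e.g. the CapEff line in /proc/self/status. A sketch of decoding one (only a handful of bit positions are listed here; they match the kernel's capability numbering, but the full table is much longer):

```python
# Sketch: decode a capability bitmask such as the CapEff line in
# /proc/<pid>/status. Bit positions follow the kernel's numbering.
CAP_BITS = {
    0: "CAP_CHOWN",
    10: "CAP_NET_BIND_SERVICE",
    12: "CAP_NET_ADMIN",
    19: "CAP_SYS_PTRACE",
    21: "CAP_SYS_ADMIN",
}

def decode_caps(hex_mask):
    """Return the known capability names whose bits are set in the hex mask."""
    mask = int(hex_mask, 16)
    return [name for bit, name in sorted(CAP_BITS.items()) if mask & (1 << bit)]

# 00000000a80425fb is the CapEff typically seen in a default Docker container:
print(decode_caps("00000000a80425fb"))
# ['CAP_CHOWN', 'CAP_NET_BIND_SERVICE'] -- no CAP_NET_ADMIN, no CAP_SYS_ADMIN
```

This is the concrete form of "root inside the container but with a trimmed power set": UID 0, yet the dangerous bits are simply not in the mask.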


5. Seccomp

Seccomp filters system calls.

A runtime can say:

  • allow normal syscalls
  • deny dangerous or rarely needed syscalls
  • kill or error on prohibited calls

Why it matters:

  • the kernel syscall surface is the real attack surface of a containerized process
  • seccomp reduces reachable kernel behavior
  • many breakages under "hardened" settings are seccomp-denied syscalls
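You can check which mode a process runs under from its status file. A sketch (the sample text below is illustrative; the Seccomp field itself is what Linux reports, with 0 = no filtering, 1 = strict, 2 = BPF filter):

```python
# Sketch: read the seccomp mode a process reports in /proc/<pid>/status.
# Mode 2 (filter) is what container runtimes use for their default profiles.
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}

def seccomp_mode(status_text):
    """Pull the Seccomp mode out of /proc/<pid>/status-style text."""
    for line in status_text.splitlines():
        if line.startswith("Seccomp:"):
            return SECCOMP_MODES[int(line.split(":")[1])]
    raise ValueError("no Seccomp line found")

sample = "Name:\tnginx\nSeccomp:\t2\nSeccomp_filters:\t1\n"
print(seccomp_mode(sample))  # filter
```

When a "hardened" container fails with a mysterious EPERM, checking that this reads "filter" is a good first step before blaming the application.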

6. User namespaces

User namespaces remap UIDs/GIDs between container and host.

Example mental model:

  • container thinks process is UID 0
  • host maps that to an unprivileged high UID

This greatly reduces risk because "container root" is not automatically "host root."

Why people avoid them sometimes

  • compatibility issues
  • volume permission friction
  • tooling assumptions
  • older environments not built around them

But from a security perspective they matter a lot.
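The mapping lives in /proc/<pid>/uid_map. A sketch of applying it (the mapping string is an illustrative rootless-style example, not read from a host):

```python
# Sketch: apply a /proc/<pid>/uid_map mapping. Each line reads
# "<inside-start> <outside-start> <count>"; a UID inside the user namespace
# maps to a host UID only if it falls inside one of the listed ranges.

def map_uid(uid_map, container_uid):
    """Translate a UID inside the namespace to the host UID, or None if unmapped."""
    for line in uid_map.strip().splitlines():
        inside, outside, count = (int(field) for field in line.split())
        if inside <= container_uid < inside + count:
            return outside + (container_uid - inside)
    return None  # unmapped IDs appear as the overflow UID (usually 65534)

# Illustrative rootless-style mapping: container root -> unprivileged host UID.
print(map_uid("0 100000 65536", 0))      # 100000
print(map_uid("0 100000 65536", 1000))   # 101000
print(map_uid("0 100000 65536", 70000))  # None (outside the mapped range)
```

This arithmetic is also the source of the volume-permission friction mentioned above: files a "root" container writes land on the host owned by UID 100000, not 0.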


OCI stack and who does what

The container world is layered.

OCI image spec

Defines how images are structured:

  • layers
  • config
  • metadata
  • manifests

OCI runtime spec

Defines how to run a container:

  • process args
  • env
  • mounts
  • namespaces
  • cgroups
  • capabilities
  • seccomp
  • hooks
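The fields above land in a config.json that the low-level runtime consumes. A minimal sketch of its shape (the field names follow the published OCI runtime-spec; the values are illustrative, not a complete or validated configuration):

```python
# Sketch: the shape of an OCI runtime spec bundle's config.json.
import json

spec = {
    "ociVersion": "1.0.2",
    "process": {
        "args": ["/bin/sh"],
        "env": ["PATH=/usr/sbin:/usr/bin:/sbin:/bin"],
        "cwd": "/",
        "capabilities": {"bounding": [], "effective": []},
        "noNewPrivileges": True,
    },
    "root": {"path": "rootfs", "readonly": False},
    "linux": {
        "namespaces": [
            {"type": "pid"}, {"type": "mount"}, {"type": "network"},
            {"type": "ipc"}, {"type": "uts"},
        ],
        "resources": {"memory": {"limit": 536870912}},
    },
}

# runc consumes exactly this kind of JSON document as <bundle>/config.json.
print(json.dumps(spec, indent=2))
```

Everything discussed so far (namespaces, cgroup limits, capabilities, the rootfs path) appears as a plain field here, which is what makes runtimes interchangeable.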

Runtime stack example

With Docker on Linux, a common path is:

docker CLI
  -> dockerd
  -> containerd
  -> containerd-shim
  -> runc
  -> container process

Roles

  • Docker CLI: user interface
  • dockerd: Docker engine management
  • containerd: container lifecycle/image management
  • runc: low-level OCI runtime that actually creates the container execution environment
  • host kernel: does the real isolation and scheduling work

The kernel is still the final authority.


The actual creation flow

Here is the simplified lifecycle for starting a container.

1. Image lookup

The runtime resolves the image reference, for example:

  • nginx:latest
  • digest-pinned image
  • local image ID

It fetches metadata and layers if needed.

2. Layer unpack / snapshot prep

The runtime prepares the rootfs view using a snapshotter or storage driver such as overlayfs-backed storage.

3. OCI config generation

The higher-level runtime builds a spec containing:

  • command
  • environment
  • cwd
  • mounts
  • namespace settings
  • cgroup settings
  • Linux capabilities
  • seccomp profile
  • user mappings

4. Namespace creation

The runtime creates the selected namespaces, often via clone() / unshare() style primitives.

5. Cgroup placement

The process is attached to cgroups that enforce resource accounting and limits.

6. Filesystem setup

The merged rootfs is mounted. Additional mounts are attached:

  • bind mounts
  • volumes
  • /proc
  • /sys
  • /dev
  • secrets/config mounts

7. Network setup

If using a separate network namespace:

  • a veth pair may be created
  • one end stays in host namespace
  • one moves into container namespace
  • IP, routes, and DNS config are applied

8. Privilege tightening

The runtime applies:

  • capability drops
  • seccomp filter
  • no-new-privileges
  • read-only mount options
  • masked paths
  • device restrictions

9. execve() of container process

The target process starts. From then on, the "container" is basically that process tree plus the environment around it.


PID 1 problem

Inside a container, the main process often becomes PID 1 in its PID namespace.

PID 1 is special on Linux:

  • signal handling semantics differ
  • orphan reaping responsibility matters
  • badly behaved PID 1 processes can leak zombies

This is why people sometimes use:

  • tini
  • dumb-init
  • proper init wrappers

If the app is not written to act like an init-ish process, container behavior gets weird under shutdown or child-process churn.
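The reaping half of the PID 1 job is small enough to sketch. Run as a normal process this only reaps its own children; inside a PID namespace, PID 1 also inherits every orphaned descendant, which is why skipping this loop is how zombies pile up:

```python
# Sketch: a minimal child-reaping loop, the core of what tini/dumb-init do.
import os

def reap_children():
    """Block until every child has been reaped; return how many were reaped."""
    reaped = 0
    while True:
        try:
            os.wait()               # collects one exited child per call
        except ChildProcessError:   # no children left to reap
            return reaped
        reaped += 1

# Demo: fork two children that exit immediately, then reap them both.
for _ in range(2):
    if os.fork() == 0:
        os._exit(0)                 # child: exit without running the rest
print(reap_children())              # 2
```

A real init reaps with WNOHANG from a SIGCHLD handler instead of blocking, and also forwards signals like SIGTERM to the main child; both jobs are exactly what the wrapper binaries above provide.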


Networking in containers

A standalone container commonly uses:

container eth0
  -> veth
  -> host bridge
  -> host routing / NAT
  -> host NIC

Alternative modes:

  • host network namespace
  • macvlan/ipvlan
  • overlay networks
  • CNI-based multi-plugin setups in Kubernetes

Important reality: most container networking problems are regular Linux networking problems wearing a fake moustache.
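One way to make "its own interfaces" concrete: socket.if_nameindex() (a real stdlib call) lists whatever the current network namespace exposes. On the host that includes bridges and veth ends; inside a container it is typically just lo and eth0:

```python
# Sketch: list the interfaces visible in the current network namespace.
# The output depends entirely on which namespace the process runs in.
import socket

for index, name in socket.if_nameindex():
    print(index, name)
```

Running the same snippet on the host and inside a container is a quick, tool-free demonstration that the two processes live in different network namespaces.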


Storage semantics people get wrong

Image layers are not the same as runtime data

Image layers are build-time, mostly immutable components. Runtime writes go to:

  • writable container layer
  • mounted volume
  • bind mount
  • tmpfs

Copy-on-write cost is real

Overlay copy-up behavior means modifying a file from a lower layer can trigger extra work. Some workloads suffer badly.

Ephemeral means ephemeral

If you write into the container writable layer and delete the container, that data is usually gone. This is not a moral lesson. It is just the architecture.


Security model realities

Containers improve isolation and packaging, but they are not equivalent to VMs.

Containers usually provide weaker isolation than VMs

Why:

  • shared host kernel
  • kernel attack surface still exposed
  • misconfigurations are common
  • over-broad capabilities are common
  • host mounts and Docker socket exposure are catastrophic

Huge footguns

  • privileged containers
  • mounting /var/run/docker.sock
  • host network mode without reason
  • host PID namespace exposure
  • broad device access
  • writable host path mounts into sensitive directories
  • CAP_SYS_ADMIN
  • running as root without userns and with broad mounts

Better patterns

  • non-root user
  • drop capabilities
  • seccomp enabled
  • read-only rootfs where possible
  • no-new-privileges
  • user namespaces where feasible
  • minimal images
  • avoid Docker socket in containers

Debugging mental model

When a container fails, separate the layers.

Layer 1 - process problem

Questions:

  • did the process start?
  • wrong command / entrypoint?
  • missing env?
  • crashed immediately?

Tools:

  • logs
  • exit code
  • ps
  • runtime inspect output

Layer 2 - filesystem problem

Questions:

  • missing file?
  • wrong mount?
  • read-only rootfs?
  • permissions mismatch?
  • volume path wrong?

Tools:

  • mount
  • findmnt
  • runtime inspect
  • host path inspection

Layer 3 - namespace problem

Questions:

  • wrong PID namespace assumption?
  • does the host see the process differently than you expect?
  • network namespace isolated?

Tools:

  • lsns
  • nsenter
  • ip netns-style techniques where relevant

Layer 4 - cgroup/resource problem

Questions:

  • OOM killed?
  • CPU throttled?
  • PID limit hit?
  • IO throttled?

Tools:

  • cgroup files
  • runtime inspect
  • kernel logs
  • systemd-cgls / systemd-cgtop on systemd hosts
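OOM kills in particular are visible in cgroup v2's memory.events file. A sketch of parsing it (the key names such as oom and oom_kill are the real cgroup v2 ones; the sample counts below are illustrative):

```python
# Sketch: detect OOM kills by parsing a cgroup v2 memory.events file.

def parse_memory_events(text):
    """Turn 'key value' lines (the memory.events format) into a dict."""
    events = {}
    for line in text.strip().splitlines():
        key, value = line.split()
        events[key] = int(value)
    return events

sample = "low 0\nhigh 14\nmax 3\noom 1\noom_kill 1\n"
events = parse_memory_events(sample)
print(events["oom_kill"])  # 1 -> the kernel killed a task in this cgroup
```

A nonzero oom_kill here, paired with a "Memory cgroup out of memory" line in the kernel log, settles the "did my container get OOM killed?" question definitively.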

Layer 5 - security policy problem

Questions:

  • seccomp denied?
  • capability missing?
  • SELinux/AppArmor policy blocked?
  • device access denied?

This is where many "but I’m root in the container" tantrums go to die.


Common misconceptions

"A container is just a process"

Almost true, but incomplete. It is a process plus constrained kernel context: namespaces, cgroups, mounts, capabilities, seccomp, and often image-backed rootfs semantics.

"Containers are lightweight VMs"

Wrong in an architectural sense. Similar use case sometimes, different isolation mechanism.

"If it runs in Docker, it will run the same everywhere"

Only if:

  • kernel features line up
  • architecture matches
  • cgroup mode is compatible
  • filesystem and mount behavior line up
  • runtime settings match
  • security modules do not differ materially

"Root in container is fake root so it is safe"

Sometimes less dangerous than host root, but still dangerous enough to wreck your week if configured badly.


Interview angles

Strong topics hidden inside this doc:

  • difference between namespaces and cgroups
  • why containers are not VMs
  • what runc does
  • what an OCI runtime spec is
  • why PID 1 matters
  • how container networking usually works
  • why privileged containers are dangerous
  • why overlayfs matters for images
  • what seccomp and capabilities do

A strong answer is mechanical, not mystical.


Mental model to keep

A container is:

  • a normal Linux process tree
  • started with a specially prepared rootfs
  • shown a constrained view of the system through namespaces
  • limited by cgroups
  • granted a reduced privilege set
  • run by a higher-level runtime that automates all that plumbing

The host kernel is still king.

