
Portal | Level: L2: Operations | Topics: Docker / Containers, Container Runtimes | Domain: DevOps & Tooling

Containers - How They Really Work

Scope

This document explains containers as Linux mechanisms, not as marketing. It covers:

  • namespaces
  • cgroups
  • capabilities
  • seccomp
  • root filesystem layering
  • OCI images and runtimes
  • Docker, containerd, and runc roles
  • common misconceptions and failure modes

This is the core knowledge behind Docker and most modern container platforms.


Big picture

A container is not a tiny VM. It is usually:

  • a process or process tree
  • running on the host kernel
  • inside isolated namespaces
  • with resource controls from cgroups
  • with a prepared root filesystem
  • with reduced privileges and policy restrictions

Minimum mental model

container image
  -> unpacked / mounted as root filesystem
  -> runtime creates namespaces
  -> runtime configures cgroups
  -> runtime applies capabilities / seccomp / uid mappings
  -> runtime execs process inside that environment
  -> process runs on the host kernel

The host kernel is shared. That is the first thing people gloss over and the first thing that matters.

Docker's runtime stack at a glance:

Docker CLI
  -> dockerd
  -> containerd
  -> containerd-shim
  -> runc
  -> container process

What a container is not

A container is not:

  • a separate kernel
  • a hypervisor-managed VM by default
  • a guarantee of strong security isolation
  • magic packaging
  • just a tarball
  • just a Docker image

A container is the running execution environment created from several Linux primitives plus some image/runtime conventions.


The main components

1. Namespaces

Namespaces isolate what a process can see.

Common namespace types:

  • mount namespace - filesystem mount view
  • PID namespace - process tree view
  • network namespace - interfaces, routes, sockets, firewall context
  • UTS namespace - hostname/domainname
  • IPC namespace - SysV IPC and POSIX message queues
  • user namespace - UID/GID mappings and privilege rebasing
  • cgroup namespace - cgroup path visibility
  • time namespace - per-namespace boot/monotonic clock offsets on newer kernels

Why namespaces matter

Without namespaces, a "container" would just be a normal process with no illusion of separation.

Examples:

  • PID namespace gives the container its own PID 1, which is an ordinary, different PID on the host
  • network namespace gives the container its own interfaces and routes
  • mount namespace gives it a different filesystem view

Important limitation

A namespace changes visibility and scope. It does not automatically make things safe. If you still grant too much privilege, the process can do real damage.
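You can see namespace membership directly in /proc. A minimal sketch in Python (the parsing helper is illustrative; /proc/<pid>/ns is the real Linux interface):

```python
# Sketch: inspect namespace membership via /proc (Linux-only paths).
# Each entry in /proc/<pid>/ns is a symlink to a target like "net:[4026531992]";
# two processes share a namespace exactly when the inode numbers match.
import os
import re

def parse_ns_link(link_target):
    """Parse a target like 'net:[4026531992]' into ('net', 4026531992)."""
    m = re.fullmatch(r"(\w+):\[(\d+)\]", link_target)
    if m is None:
        raise ValueError(f"unexpected ns link format: {link_target!r}")
    return m.group(1), int(m.group(2))

def own_namespaces():
    """Return {namespace_type: inode} for the current process."""
    ns_dir = "/proc/self/ns"
    return dict(
        parse_ns_link(os.readlink(os.path.join(ns_dir, entry)))
        for entry in os.listdir(ns_dir)
    )

print(parse_ns_link("pid:[4026531836]"))   # ('pid', 4026531836)
if os.path.isdir("/proc/self/ns"):         # only meaningful on Linux
    print(sorted(own_namespaces()))
```

Comparing these inode numbers between a containerized process and PID 1 on the host is the quickest way to confirm which namespaces were actually created.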


2. Cgroups

Control groups manage and account for resources.

Typical resource controls:

  • CPU shares / quotas / weights
  • memory limits
  • IO throttling
  • PID count limits
  • device access control in some setups

Why cgroups matter

Without cgroups, one noisy container can:

  • eat all CPU
  • consume all memory
  • spawn huge numbers of processes
  • starve the node

cgroup v2 model

Modern Linux prefers cgroup v2, which provides a unified hierarchy and cleaner controller behavior than the old v1 multi-hierarchy mess.

Typical controllers:

  • cpu
  • memory
  • io
  • pids
  • cpuset

Important subtlety

A container memory limit is not "reserved memory." It is usually a ceiling. Under pressure, the kernel still decides what to reclaim and what to kill.
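cgroup v2 exposes these limits as plain text files. A sketch of parsing the two most common ones (memory.max and cpu.max are real cgroup v2 interface files; the sample strings below are illustrative values, not read from a host):

```python
# Sketch: parse cgroup v2 control-file contents into usable numbers.

def parse_memory_max(text):
    """memory.max holds either 'max' (no limit) or a byte ceiling."""
    text = text.strip()
    return None if text == "max" else int(text)

def parse_cpu_max(text):
    """cpu.max holds '<quota> <period>' in microseconds; quota may be 'max'."""
    quota, period = text.split()
    return (None if quota == "max" else int(quota)), int(period)

print(parse_memory_max("536870912"))   # 536870912 -> a 512 MiB ceiling
print(parse_memory_max("max"))         # None -> no limit configured
print(parse_cpu_max("100000 100000"))  # (100000, 100000) -> ~1 full CPU
```

The quota/period pair is also why "CPU limit" really means throttling: a container with quota 100000 and period 100000 gets at most one CPU's worth of time per period, then waits.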


3. Root filesystem

The container process needs a root filesystem view.

This typically comes from:

  • image layers
  • a writable upper layer
  • bind mounts
  • volumes
  • tmpfs mounts
  • special filesystems like /proc, /sys, /dev

Overlay-based mental model

lower image layers (read-only)
  + upper writable layer
  + workdir
  = merged mount presented as container rootfs

This is why image layers are reusable while each running container still has writable state.

Important implication

Deleting a container often deletes only the writable layer, not the shared image layers. Volumes live elsewhere.
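The overlay semantics above can be modeled in a few lines. This is a toy illustration of merge/shadow/whiteout behavior only, not how the kernel stores anything:

```python
# Toy model of overlay merging: lower layers are read-only, writes land in the
# upper layer, and a whiteout entry in the upper layer hides a lower file.
WHITEOUT = object()

def merged_view(lower_layers, upper):
    """lower_layers: list of {path: content}, bottom first; upper may whiteout."""
    view = {}
    for layer in lower_layers:      # later (higher) layers shadow earlier ones
        view.update(layer)
    for path, content in upper.items():
        if content is WHITEOUT:
            view.pop(path, None)    # deletion recorded in the writable layer
        else:
            view[path] = content    # new write or copied-up modification
    return view

lower = [{"/etc/os-release": "distro", "/bin/sh": "shell"}]
upper = {"/app/data": "runtime write", "/bin/sh": WHITEOUT}
print(merged_view(lower, upper))
# {'/etc/os-release': 'distro', '/app/data': 'runtime write'}
```

Note that "deleting" /bin/sh never touched the lower layer; it only added a marker in the upper one. That is exactly why removing a container discards its writes while the image layers stay shared.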


4. Capabilities

Traditional Unix privilege is coarse: root vs not-root. Linux capabilities split many root powers into smaller units.

Examples:

  • CAP_NET_ADMIN
  • CAP_SYS_ADMIN
  • CAP_SYS_PTRACE
  • CAP_NET_BIND_SERVICE

A container can run as UID 0 inside its namespace yet still lack many dangerous capabilities.

Why this matters

"Running as root in a container" is bad, but the exact blast radius depends on:

  • user namespace usage
  • dropped capabilities
  • device access
  • seccomp
  • mount permissions
  • escape vulnerabilities

CAP_SYS_ADMIN

This is the junk drawer of power. If you see it granted casually, raise an eyebrow.
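Capabilities show up as hex bitmasks, e.g. the CapEff line in /proc/self/status. A sketch of decoding one (only a handful of bit positions are listed here; they match the kernel's capability numbering, but the full table is much longer):

```python
# Sketch: decode a capability bitmask such as the CapEff line in
# /proc/<pid>/status. Bit positions follow the kernel's numbering.
CAP_BITS = {
    0: "CAP_CHOWN",
    10: "CAP_NET_BIND_SERVICE",
    12: "CAP_NET_ADMIN",
    19: "CAP_SYS_PTRACE",
    21: "CAP_SYS_ADMIN",
}

def decode_caps(hex_mask):
    """Return the known capability names whose bits are set in the hex mask."""
    mask = int(hex_mask, 16)
    return [name for bit, name in sorted(CAP_BITS.items()) if mask & (1 << bit)]

# 00000000a80425fb is the CapEff typically seen in a default Docker container:
print(decode_caps("00000000a80425fb"))
# ['CAP_CHOWN', 'CAP_NET_BIND_SERVICE'] -- no CAP_NET_ADMIN, no CAP_SYS_ADMIN
```

This is the concrete form of "root inside the container but with a trimmed power set": UID 0, yet the dangerous bits are simply not in the mask.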


5. Seccomp

Seccomp filters system calls.

A runtime can say:

  • allow normal syscalls
  • deny dangerous or rarely needed syscalls
  • kill or error on prohibited calls

Why it matters:

  • the kernel syscall surface is the real attack surface of a containerized process
  • seccomp reduces reachable kernel behavior
  • many breakages under "hardened" settings are seccomp-denied syscalls
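You can check which mode a process runs under from its status file. A sketch (the sample text below is illustrative; the Seccomp field itself is what Linux reports, with 0 = no filtering, 1 = strict, 2 = BPF filter):

```python
# Sketch: read the seccomp mode a process reports in /proc/<pid>/status.
# Mode 2 (filter) is what container runtimes use for their default profiles.
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}

def seccomp_mode(status_text):
    """Pull the Seccomp mode out of /proc/<pid>/status-style text."""
    for line in status_text.splitlines():
        if line.startswith("Seccomp:"):
            return SECCOMP_MODES[int(line.split(":")[1])]
    raise ValueError("no Seccomp line found")

sample = "Name:\tnginx\nSeccomp:\t2\nSeccomp_filters:\t1\n"
print(seccomp_mode(sample))  # filter
```

When a "hardened" container fails with a mysterious EPERM, checking that this reads "filter" is a good first step before blaming the application.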

6. User namespaces

User namespaces remap UIDs/GIDs between container and host.

Example mental model:

  • container thinks process is UID 0
  • host maps that to an unprivileged high UID

This greatly reduces risk because "container root" is not automatically "host root."

Why people avoid them sometimes

  • compatibility issues
  • volume permission friction
  • tooling assumptions
  • older environments not built around them

But from a security perspective they matter a lot.
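The mapping lives in /proc/<pid>/uid_map. A sketch of applying it (the mapping string is an illustrative rootless-style example, not read from a host):

```python
# Sketch: apply a /proc/<pid>/uid_map mapping. Each line reads
# "<inside-start> <outside-start> <count>"; a UID inside the user namespace
# maps to a host UID only if it falls inside one of the listed ranges.

def map_uid(uid_map, container_uid):
    """Translate a UID inside the namespace to the host UID, or None if unmapped."""
    for line in uid_map.strip().splitlines():
        inside, outside, count = (int(field) for field in line.split())
        if inside <= container_uid < inside + count:
            return outside + (container_uid - inside)
    return None  # unmapped IDs appear as the overflow UID (usually 65534)

# Illustrative rootless-style mapping: container root -> unprivileged host UID.
print(map_uid("0 100000 65536", 0))      # 100000
print(map_uid("0 100000 65536", 1000))   # 101000
print(map_uid("0 100000 65536", 70000))  # None (outside the mapped range)
```

This arithmetic is also the source of the volume-permission friction mentioned above: files a "root" container writes land on the host owned by UID 100000, not 0.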


OCI stack and who does what

The container world is layered.

OCI image spec

Defines how images are structured:

  • layers
  • config
  • metadata
  • manifests

OCI runtime spec

Defines how to run a container:

  • process args
  • env
  • mounts
  • namespaces
  • cgroups
  • capabilities
  • seccomp
  • hooks
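The fields above land in a config.json that the low-level runtime consumes. A minimal sketch of its shape (the field names follow the published OCI runtime-spec; the values are illustrative, not a complete or validated configuration):

```python
# Sketch: the shape of an OCI runtime spec bundle's config.json.
import json

spec = {
    "ociVersion": "1.0.2",
    "process": {
        "args": ["/bin/sh"],
        "env": ["PATH=/usr/sbin:/usr/bin:/sbin:/bin"],
        "cwd": "/",
        "capabilities": {"bounding": [], "effective": []},
        "noNewPrivileges": True,
    },
    "root": {"path": "rootfs", "readonly": False},
    "linux": {
        "namespaces": [
            {"type": "pid"}, {"type": "mount"}, {"type": "network"},
            {"type": "ipc"}, {"type": "uts"},
        ],
        "resources": {"memory": {"limit": 536870912}},
    },
}

# runc consumes exactly this kind of JSON document as <bundle>/config.json.
print(json.dumps(spec, indent=2))
```

Everything discussed so far (namespaces, cgroup limits, capabilities, the rootfs path) appears as a plain field here, which is what makes runtimes interchangeable.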

Runtime stack example

With Docker on Linux, a common path is:

docker CLI
  -> dockerd
  -> containerd
  -> containerd-shim
  -> runc
  -> container process

Roles

  • Docker CLI: user interface
  • dockerd: Docker engine management
  • containerd: container lifecycle/image management
  • runc: low-level OCI runtime that actually creates the container execution environment
  • host kernel: does the real isolation and scheduling work

The kernel is still the final authority.


The actual creation flow

Here is the simplified lifecycle for starting a container.

1. Image lookup

The runtime resolves the image reference, for example:

  • nginx:latest
  • digest-pinned image
  • local image ID

It fetches metadata and layers if needed.

2. Layer unpack / snapshot prep

The runtime prepares the rootfs view using a snapshotter or storage driver such as overlayfs-backed storage.

3. OCI config generation

The higher-level runtime builds a spec containing:

  • command
  • environment
  • cwd
  • mounts
  • namespace settings
  • cgroup settings
  • Linux capabilities
  • seccomp profile
  • user mappings

4. Namespace creation

The runtime creates the selected namespaces, often via clone() / unshare() style primitives.

5. Cgroup placement

The process is attached to cgroups that enforce resource accounting and limits.

6. Filesystem setup

The merged rootfs is mounted. Additional mounts are attached:

  • bind mounts
  • volumes
  • /proc
  • /sys
  • /dev
  • secrets/config mounts

7. Network setup

If using a separate network namespace:

  • a veth pair may be created
  • one end stays in host namespace
  • one moves into container namespace
  • IP, routes, and DNS config are applied

8. Privilege tightening

The runtime applies:

  • capability drops
  • seccomp filter
  • no-new-privileges
  • read-only mount options
  • masked paths
  • device restrictions

9. execve() of container process

The target process starts. From then on, the "container" is basically that process tree plus the environment around it.


PID 1 problem

Inside a container, the main process often becomes PID 1 in its PID namespace.

PID 1 is special on Linux:

  • signal handling semantics differ
  • orphan reaping responsibility matters
  • badly behaved PID 1 processes can leak zombies

This is why people sometimes use:

  • tini
  • dumb-init
  • proper init wrappers

If the app is not written to act like an init-ish process, container behavior gets weird under shutdown or child-process churn.
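The reaping half of the PID 1 job is small enough to sketch. Run as a normal process this only reaps its own children; inside a PID namespace, PID 1 also inherits every orphaned descendant, which is why skipping this loop is how zombies pile up:

```python
# Sketch: a minimal child-reaping loop, the core of what tini/dumb-init do.
import os

def reap_children():
    """Block until every child has been reaped; return how many were reaped."""
    reaped = 0
    while True:
        try:
            os.wait()               # collects one exited child per call
        except ChildProcessError:   # no children left to reap
            return reaped
        reaped += 1

# Demo: fork two children that exit immediately, then reap them both.
for _ in range(2):
    if os.fork() == 0:
        os._exit(0)                 # child: exit without running the rest
print(reap_children())              # 2
```

A real init reaps with WNOHANG from a SIGCHLD handler instead of blocking, and also forwards signals like SIGTERM to the main child; both jobs are exactly what the wrapper binaries above provide.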


Networking in containers

A standalone container commonly uses:

container eth0
  -> veth
  -> host bridge
  -> host routing / NAT
  -> host NIC

Alternative modes:

  • host network namespace
  • macvlan/ipvlan
  • overlay networks
  • CNI-based multi-plugin setups in Kubernetes

Important reality: most container networking problems are regular Linux networking problems wearing a fake moustache.
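One way to make "its own interfaces" concrete: socket.if_nameindex() (a real stdlib call) lists whatever the current network namespace exposes. On the host that includes bridges and veth ends; inside a container it is typically just lo and eth0:

```python
# Sketch: list the interfaces visible in the current network namespace.
# The output depends entirely on which namespace the process runs in.
import socket

for index, name in socket.if_nameindex():
    print(index, name)
```

Running the same snippet on the host and inside a container is a quick, tool-free demonstration that the two processes live in different network namespaces.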


Storage semantics people get wrong

Image layers are not the same as runtime data

Image layers are build-time, mostly immutable components. Runtime writes go to:

  • writable container layer
  • mounted volume
  • bind mount
  • tmpfs

Copy-on-write cost is real

Overlay copy-up behavior means modifying a file from a lower layer can trigger extra work. Some workloads suffer badly.

Ephemeral means ephemeral

If you write into the container writable layer and delete the container, that data is usually gone. This is not a moral lesson. It is just the architecture.


Security model realities

Containers improve isolation and packaging, but they are not equivalent to VMs.

Containers usually provide weaker isolation than VMs

Why:

  • shared host kernel
  • kernel attack surface still exposed
  • misconfigurations are common
  • over-broad capabilities are common
  • host mounts and Docker socket exposure are catastrophic

Huge footguns

  • privileged containers
  • mounting /var/run/docker.sock
  • host network mode without reason
  • host PID namespace exposure
  • broad device access
  • writable host path mounts into sensitive directories
  • CAP_SYS_ADMIN
  • running as root without userns and with broad mounts

Better patterns

  • non-root user
  • drop capabilities
  • seccomp enabled
  • read-only rootfs where possible
  • no-new-privileges
  • user namespaces where feasible
  • minimal images
  • avoid Docker socket in containers

Debugging mental model

When a container fails, separate the layers.

Layer 1 - process problem

Questions:

  • did the process start?
  • wrong command / entrypoint?
  • missing env?
  • crashed immediately?

Tools:

  • logs
  • exit code
  • ps
  • runtime inspect output

Layer 2 - filesystem problem

Questions:

  • missing file?
  • wrong mount?
  • read-only rootfs?
  • permissions mismatch?
  • volume path wrong?

Tools:

  • mount
  • findmnt
  • runtime inspect
  • host path inspection

Layer 3 - namespace problem

Questions:

  • wrong PID namespace assumption?
  • does the host see the process differently than you expect?
  • network namespace isolated?

Tools:

  • lsns
  • nsenter
  • ip netns-style techniques where relevant

Layer 4 - cgroup/resource problem

Questions:

  • OOM killed?
  • CPU throttled?
  • PID limit hit?
  • IO throttled?

Tools:

  • cgroup files
  • runtime inspect
  • kernel logs
  • systemd-cgls / systemd-cgtop on systemd hosts
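OOM kills in particular are visible in cgroup v2's memory.events file. A sketch of parsing it (the key names such as oom and oom_kill are the real cgroup v2 ones; the sample counts below are illustrative):

```python
# Sketch: detect OOM kills by parsing a cgroup v2 memory.events file.

def parse_memory_events(text):
    """Turn 'key value' lines (the memory.events format) into a dict."""
    events = {}
    for line in text.strip().splitlines():
        key, value = line.split()
        events[key] = int(value)
    return events

sample = "low 0\nhigh 14\nmax 3\noom 1\noom_kill 1\n"
events = parse_memory_events(sample)
print(events["oom_kill"])  # 1 -> the kernel killed a task in this cgroup
```

A nonzero oom_kill here, paired with a "Memory cgroup out of memory" line in the kernel log, settles the "did my container get OOM killed?" question definitively.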

Layer 5 - security policy problem

Questions:

  • seccomp denied?
  • capability missing?
  • SELinux/AppArmor policy blocked?
  • device access denied?

This is where many "but I’m root in the container" tantrums go to die.


Common misconceptions

"A container is just a process"

Almost true, but incomplete. It is a process plus constrained kernel context: namespaces, cgroups, mounts, capabilities, seccomp, and often image-backed rootfs semantics.

"Containers are lightweight VMs"

Wrong in an architectural sense. Similar use case sometimes, different isolation mechanism.

"If it runs in Docker, it will run the same everywhere"

Only if:

  • kernel features line up
  • architecture matches
  • cgroup mode is compatible
  • filesystem and mount behavior line up
  • runtime settings match
  • security modules do not differ materially

"Root in container is fake root so it is safe"

Sometimes less dangerous than host root, but still dangerous enough to wreck your week if configured badly.


Interview angles

Strong topics hidden inside this doc:

  • difference between namespaces and cgroups
  • why containers are not VMs
  • what runc does
  • what an OCI runtime spec is
  • why PID 1 matters
  • how container networking usually works
  • why privileged containers are dangerous
  • why overlayfs matters for images
  • what seccomp and capabilities do

A strong answer is mechanical, not mystical.


Mental model to keep

A container is:

  • a normal Linux process tree
  • started with a specially prepared rootfs
  • shown a constrained view of the system through namespaces
  • limited by cgroups
  • granted a reduced privilege set
  • run by a higher-level runtime that automates all that plumbing

The host kernel is still king.

