Portal | Level: L1: Foundations | Topics: Containers Deep Dive, Docker / Containers, Container Runtimes | Domain: DevOps & Tooling
Containers Deep Dive - Primer¶
Why This Matters¶
Containers are not VMs. They are not "lightweight virtual machines." They are processes running on a shared Linux kernel, isolated by kernel primitives that have existed for over a decade. If you only know docker run, you are driving a car without understanding the engine, transmission, or brakes. When something breaks — and it will — you will be helpless.
This primer covers the Linux primitives that make containers work, the runtime stack that orchestrates them, the image format that packages them, and the networking and storage models that connect them. Everything here is foundational. Every debugging session, every performance investigation, every security audit comes back to these building blocks.
Linux Namespaces¶
Namespaces are the isolation mechanism. Each namespace type restricts what a process can see of a particular system resource. A container is a process (or group of processes) running inside a set of namespaces.
The Seven Namespace Types¶
| Namespace | Flag | What it isolates | Kernel version |
|---|---|---|---|
| PID | CLONE_NEWPID | Process IDs — container sees its own PID tree, PID 1 is the entrypoint | 2.6.24 (2008) |
| NET | CLONE_NEWNET | Network stack — interfaces, routes, iptables, sockets | 2.6.29 (2009) |
| MNT | CLONE_NEWNS | Mount points — container has its own filesystem view | 2.4.19 (2002, the first namespace!) |
| UTS | CLONE_NEWUTS | Hostname and NIS domain name | 2.6.19 (2006) |
| IPC | CLONE_NEWIPC | System V IPC, POSIX message queues | 2.6.19 (2006) |
| USER | CLONE_NEWUSER | User and group IDs — root inside container maps to unprivileged user outside | 3.8 (2013) |
| CGROUP | CLONE_NEWCGROUP | Cgroup root directory — container sees only its own cgroup hierarchy | 4.6 (2016) |
Inspecting namespaces¶
# List namespaces for a process
ls -la /proc/<pid>/ns/
# lrwxrwxrwx 1 root root 0 ... cgroup -> 'cgroup:[4026531835]'
# lrwxrwxrwx 1 root root 0 ... ipc -> 'ipc:[4026532456]'
# lrwxrwxrwx 1 root root 0 ... mnt -> 'mnt:[4026532454]'
# lrwxrwxrwx 1 root root 0 ... net -> 'net:[4026532459]'
# lrwxrwxrwx 1 root root 0 ... pid -> 'pid:[4026532457]'
# lrwxrwxrwx 1 root root 0 ... user -> 'user:[4026531837]'
# lrwxrwxrwx 1 root root 0 ... uts -> 'uts:[4026532455]'
# Enter a container's network namespace
nsenter -t <pid> -n ip addr
# Enter all namespaces of a container process
nsenter -t <pid> -m -u -i -n -p -- /bin/sh
# Create a new namespace manually (for learning)
unshare --pid --fork --mount-proc /bin/bash
# You're now in a new PID namespace. `ps aux` shows only your shell.
PID namespace details¶
PID 1 inside the container is special. It is the init process. If PID 1 exits, the entire container stops. If PID 1 does not handle SIGCHLD, zombie processes accumulate. This is why tini and dumb-init exist — they act as proper init processes that reap children.
# Inside a container
ps aux
# PID 1 is your ENTRYPOINT
# From the host, the same process has a different, higher PID
# Host view
ps aux | grep <container-process>
# PID is something like 28473 on the host
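The reaping loop that tini and dumb-init perform is short enough to sketch. Below is a minimal Python model (Linux-only; reap_all is a made-up name, and a real init would run this loop from a SIGCHLD handler rather than after a sleep):

```python
import os
import time

def reap_all():
    """Collect every exited child, as an init process does on SIGCHLD.
    Without this, exited children linger as zombies."""
    reaped = []
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break                      # no children remain at all
        if pid == 0:
            break                      # children exist but none has exited yet
        reaped.append((pid, os.waitstatus_to_exitcode(status)))
    return reaped

if __name__ == "__main__":
    child = os.fork()
    if child == 0:
        os._exit(7)                    # child exits immediately
    time.sleep(0.2)                    # child is now a zombie until reaped
    print(reap_all())                  # e.g. [(12345, 7)]
```

Run ps aux inside a container whose PID 1 skips this loop and you will see `<defunct>` entries accumulate.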
NET namespace details¶
Each container gets its own network stack: interfaces, IP addresses, routing table, iptables rules, and socket listings. The container's eth0 is one end of a veth pair; the other end lives in the host namespace and is plugged into a bridge (usually docker0 or cni0).
# From host: find the veth pair for a container
PID=$(docker inspect --format '{{.State.Pid}}' <container>)
nsenter -t $PID -n ip link show eth0
# Note the interface index (e.g., "if4")
# On host: ip link | grep "if4" shows the veth peer
USER namespace details¶
User namespaces allow mapping UID 0 inside the container to an unprivileged UID on the host. This is the foundation of rootless containers. Inside the container, the process thinks it is root. On the host, it is UID 100000 (or similar). If the process escapes the container, it has no privileges.
# Check if a container uses user namespaces
cat /proc/<pid>/uid_map
# Format: <inside-uid> <outside-uid> <range>
# "0 100000 65536" means container root maps to host UID 100000
Cgroups: Resource Limits¶
Namespaces isolate visibility. Cgroups (control groups) limit resources. Without cgroups, a container could consume all CPU, memory, or I/O on the host and starve everything else.
Cgroups v1 vs v2¶
| Aspect | cgroups v1 | cgroups v2 |
|---|---|---|
| Hierarchy | Separate hierarchy per controller (cpu, memory, blkio, pids, etc.) | Single unified hierarchy |
| Mount point | /sys/fs/cgroup/<controller>/ | /sys/fs/cgroup/ (unified) |
| Controller enablement | Per-hierarchy | Per-subtree via cgroup.subtree_control |
| Delegation | Complex, error-prone | Clean delegation to unprivileged processes |
| PSI (pressure stall info) | Not available | Built-in (cpu.pressure, memory.pressure, io.pressure) |
| Memory accounting | Approximate | More accurate, includes kernel memory |
| Default in modern distros | Legacy | Ubuntu 22.04+, Fedora 31+, RHEL 9+ |
Key cgroup controllers¶
# CPU — limit CPU time
# v2: cpu.max = "<quota> <period>" in microseconds
echo "50000 100000" > /sys/fs/cgroup/<group>/cpu.max # 50% of one CPU
# Memory — hard limit (OOM kill) and soft limit (reclaim pressure)
echo 536870912 > /sys/fs/cgroup/<group>/memory.max # 512MB hard limit
echo 268435456 > /sys/fs/cgroup/<group>/memory.high # 256MB soft limit (reclaim starts)
# PIDs — prevent fork bombs
echo 256 > /sys/fs/cgroup/<group>/pids.max
# I/O — limit block device throughput
# Format: "<major>:<minor> rbps=<bytes> wbps=<bytes>"
echo "8:0 rbps=10485760 wbps=10485760" > /sys/fs/cgroup/<group>/io.max # 10MB/s
Checking container cgroup limits¶
# Docker: inspect resource limits
docker inspect --format '{{.HostConfig.Memory}}' <container>
docker inspect --format '{{.HostConfig.NanoCpus}}' <container>
# From inside the container (cgroups v2)
cat /sys/fs/cgroup/memory.max
cat /sys/fs/cgroup/cpu.max
cat /sys/fs/cgroup/pids.max
# Current usage
cat /sys/fs/cgroup/memory.current
cat /sys/fs/cgroup/cpu.stat
Pressure Stall Information (cgroups v2 only)¶
PSI tells you whether a cgroup is resource-starved, not just how much it is using.
cat /sys/fs/cgroup/<group>/memory.pressure
# some avg10=5.23 avg60=2.10 avg300=0.87 total=123456
# "some" = at least one task stalled waiting for memory
# "full" = all tasks stalled (severe)
cat /sys/fs/cgroup/<group>/cpu.pressure
cat /sys/fs/cgroup/<group>/io.pressure
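These lines are easy to parse for alerting. A minimal sketch, with parse_psi as an assumed name:

```python
def parse_psi(line):
    """Parse one PSI line, e.g.
    "some avg10=5.23 avg60=2.10 avg300=0.87 total=123456"."""
    kind, *fields = line.split()
    values = dict(f.split("=") for f in fields)
    return kind, {k: float(v) for k, v in values.items()}

kind, stats = parse_psi("some avg10=5.23 avg60=2.10 avg300=0.87 total=123456")
print(kind, stats["avg10"])  # some 5.23
```

A common pattern is to alert when avg10 on the "full" line stays above a threshold, since that means every task in the cgroup is stalling.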
Union Filesystems and Image Layers¶
Container images are not monolithic disk images. They are stacks of read-only layers, composed into a single filesystem view using a union filesystem (usually OverlayFS on modern Linux).
How OverlayFS works¶
+-------------------+
| Container Layer | ← writable (upperdir), changes go here
+-------------------+
| Image Layer N | ← read-only (lowerdir)
+-------------------+
| Image Layer N-1 | ← read-only (lowerdir)
+-------------------+
| ... |
+-------------------+
| Base Layer | ← read-only (lowerdir)
+-------------------+
|
[merged view] ← what the container process sees (mount point)
Copy-on-write: When a container modifies a file from a lower layer, the file is copied up to the writable layer first. The original in the lower layer is untouched. This is why modifying large files in containers is expensive — the entire file is copied, even for a one-byte change.
Debug clue: If a container shows unexpectedly high disk I/O, check whether it is modifying large files from image layers. Each modification triggers a full file copy-up to the writable layer. A common culprit: application log files written to a path that exists in the image layer, causing multi-megabyte copy-ups on first write.
Whiteout files: Deleting a file from a lower layer creates a whiteout marker in the upper layer. The file still exists in the image layer, consuming space. This is why RUN apt-get install && apt-get clean must be in the same layer.
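The merged-view rules can be modeled in a few lines. This is a toy model of OverlayFS lookup semantics, with dicts standing in for directory layers, not the kernel implementation:

```python
def union_lookup(path, upper, lowers):
    """Resolve a path the way OverlayFS does: the upper layer wins,
    whiteouts hide lower-layer files, otherwise search lower layers top-down."""
    if path in upper:
        entry = upper[path]
        return None if entry == "<whiteout>" else entry
    for layer in lowers:                  # topmost lower layer first
        if path in layer:
            return layer[path]
    return None

base = {"/etc/os-release": "debian", "/usr/bin/curl": "ELF binary"}
upper = {"/app/config": "edited", "/usr/bin/curl": "<whiteout>"}  # curl deleted
print(union_lookup("/app/config", upper, [base]))      # edited (from upperdir)
print(union_lookup("/usr/bin/curl", upper, [base]))    # None — whiteout hides it
print(union_lookup("/etc/os-release", upper, [base]))  # debian (from lowerdir)
```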
# Inspect OverlayFS mount for a running container
docker inspect --format '{{.GraphDriver.Data}}' <container>
# Shows: LowerDir, UpperDir, MergedDir, WorkDir
# See what the container has written (upperdir = container changes)
ls /var/lib/docker/overlay2/<id>/diff/
# Check layer sizes
docker history <image>
# Shows each layer, its command, and its size
Image layer best practices¶
Each Dockerfile instruction that modifies the filesystem creates a new layer. Layer ordering determines cache efficiency:
# BAD: copying source first invalidates all subsequent layers on any code change
COPY . /app
RUN pip install -r requirements.txt
# GOOD: copy requirements first, install, then copy source
COPY requirements.txt /app/
RUN pip install -r /app/requirements.txt
COPY . /app
# Now source changes don't re-trigger pip install
Container Runtimes¶
The "container runtime" is not one thing. It is a stack with clearly defined boundaries.
The runtime stack¶
High-level runtime (container management)
├── containerd — manages container lifecycle, image pulls, storage, networking setup
├── CRI-O — lightweight alternative, built specifically for Kubernetes CRI
│
Low-level runtime (actually creates the container)
├── runc — reference OCI runtime, creates namespaces + cgroups + executes process
├── crun — C implementation, faster startup, lower memory
├── gVisor — sandboxed runtime, intercepts syscalls (security-focused)
├── Kata — micro-VM runtime, each container is a lightweight VM
containerd architecture¶
# containerd is the most common high-level runtime
# Docker has used containerd internally since Docker 1.11
# Interact with containerd directly
ctr containers list
ctr images list
ctr tasks list
# On Kubernetes nodes, use crictl instead
crictl ps
crictl pods
crictl images
runc: what actually happens¶
When you run docker run, here is what runc does at the lowest level:
- Creates a new set of Linux namespaces (PID, NET, MNT, UTS, IPC, optionally USER, CGROUP)
- Sets up cgroups with resource limits
- Pivots the root filesystem to the container's rootfs (prepared from image layers)
- Drops capabilities, applies seccomp filters, sets up AppArmor/SELinux profiles
- Executes the container's entrypoint process as PID 1 in the new namespace set
# You can run runc manually (for learning)
mkdir -p mycontainer/rootfs
docker export $(docker create busybox) | tar -C mycontainer/rootfs -xf -
cd mycontainer
runc spec # generates config.json (OCI runtime spec)
runc run mycontainer
OCI Specification¶
Who made it: The OCI was founded in 2015 by Docker, CoreOS, Google, Microsoft, and others under the Linux Foundation. Docker donated its container image format and runtime (runc) as the starting point. The OCI exists because the industry needed container portability guarantees — an image built with Docker must run on any OCI-compliant runtime.
The Open Container Initiative (OCI) defines two specs that make containers portable across runtimes.
Image Spec¶
An OCI image consists of:
| Component | Purpose |
|---|---|
| Manifest | JSON listing all layers (digests) and the config |
| Config | JSON with environment, entrypoint, exposed ports, labels, architecture |
| Layers | Compressed tar archives (gzip or zstd), each representing a filesystem diff |
| Index (optional) | Multi-architecture manifest list |
# Inspect an image manifest
docker manifest inspect nginx:1.25
# Shows: mediaType, config digest, layer digests, platform info
# Inspect image config
docker inspect nginx:1.25
# Shows: Env, Cmd, Entrypoint, ExposedPorts, Labels, Architecture
# Inspect layers
docker save nginx:1.25 | tar -tf -
# Shows: manifest.json, config JSON, layer tar.gz files
Runtime Spec¶
The OCI runtime spec (config.json) defines:
- Root filesystem path
- Mounts
- Process (args, env, cwd, capabilities, rlimits)
- Linux-specific: namespaces, cgroups path, seccomp profile, sysctl, apparmor profile
- Hooks: prestart, createRuntime, createContainer, startContainer, poststart, poststop
Digests vs tags¶
Tags are mutable pointers. nginx:1.25 can point to different images on different days. Digests are immutable content hashes.
# Pull by tag (mutable — NOT reproducible)
docker pull nginx:1.25
# Pull by digest (immutable — reproducible)
docker pull nginx@sha256:6db391d1c0cfb...
# Get the digest of an image
docker inspect --format '{{.RepoDigests}}' nginx:1.25
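The immutability guarantee comes from content addressing: a digest is simply the SHA-256 of the referenced bytes (a manifest or a layer blob), so identical bytes always yield the identical digest. A minimal illustration in Python:

```python
import hashlib

def content_digest(blob: bytes) -> str:
    """An OCI-style digest: "sha256:" plus the hex SHA-256 of the content."""
    return "sha256:" + hashlib.sha256(blob).hexdigest()

d1 = content_digest(b"layer contents")
d2 = content_digest(b"layer contents")
d3 = content_digest(b"layer contents!")   # one byte changed
print(d1 == d2)  # True  — same content, same digest
print(d1 == d3)  # False — any change produces a new digest
```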
Docker Architecture¶
Docker is a user-facing tool. Under the hood, it delegates to containerd and runc.
docker CLI (client)
|
| REST API (usually /var/run/docker.sock)
|
dockerd (Docker daemon)
|
| gRPC
|
containerd
|
| OCI runtime spec
|
containerd-shim → runc → [container process]
containerd-shim: A per-container process that parents the container process. It allows containerd to restart without killing running containers and reports exit status back.
# See the process tree
pstree -p $(pgrep dockerd)
# dockerd → containerd → containerd-shim → <your process>
# Docker socket location
ls -la /var/run/docker.sock
# Docker storage driver
docker info | grep "Storage Driver"
# Usually: overlay2
Dockerfile Best Practices¶
Layer ordering for cache efficiency¶
# Least-frequently-changing content goes first
FROM python:3.11-slim
# System deps change rarely
RUN apt-get update && \
apt-get install -y --no-install-recommends curl && \
rm -rf /var/lib/apt/lists/*
# App deps change occasionally
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# App code changes frequently — last
COPY . .
Multi-stage builds¶
# Build stage — has compilers, dev tools
FROM golang:1.22 AS builder
WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 go build -o /app/server .
# Runtime stage — minimal image
FROM gcr.io/distroless/static:nonroot
COPY --from=builder /app/server /server
USER nonroot:nonroot
ENTRYPOINT ["/server"]
Result: final image has no compiler, no source code, no package manager. Attack surface is minimal.
COPY vs ADD¶
| Feature | COPY | ADD |
|---|---|---|
| Copy files from build context | Yes | Yes |
| Copy from URL | No | Yes (but don't — use curl in RUN) |
| Auto-extract tar archives | No | Yes |
| Predictable behavior | Yes | Surprising |
Rule: Always use COPY unless you specifically need tar auto-extraction.
RUN consolidation¶
# BAD: three layers, apt cache persists in layer 1
RUN apt-get update
RUN apt-get install -y curl
RUN rm -rf /var/lib/apt/lists/*
# GOOD: one layer, clean up in the same layer
RUN apt-get update && \
apt-get install -y --no-install-recommends curl && \
rm -rf /var/lib/apt/lists/*
ENTRYPOINT vs CMD¶
# ENTRYPOINT = the executable (hard to override, use exec form)
# CMD = default arguments (easy to override)
ENTRYPOINT ["python", "app.py"]
CMD ["--port", "8000"]
# docker run myapp → python app.py --port 8000
# docker run myapp --port 9000 → python app.py --port 9000
# docker run --entrypoint sh myapp → sh (overrides entrypoint)
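The merge rule above can be stated as a one-line function. A sketch of exec-form argument assembly (effective_argv is a made-up name, and this ignores the --entrypoint case, which also clears the image's CMD):

```python
def effective_argv(entrypoint, cmd, run_args=None):
    """Sketch of exec-form merging: ENTRYPOINT is fixed, CMD supplies
    default arguments, and arguments passed to docker run replace CMD."""
    return entrypoint + (run_args if run_args else cmd)

print(effective_argv(["python", "app.py"], ["--port", "8000"]))
# ['python', 'app.py', '--port', '8000']
print(effective_argv(["python", "app.py"], ["--port", "8000"], ["--port", "9000"]))
# ['python', 'app.py', '--port', '9000']
```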
Shell form vs exec form:
# Exec form (preferred) — process is PID 1, receives signals directly
ENTRYPOINT ["python", "app.py"]
# Shell form — wraps in /bin/sh -c, PID 1 is sh, signals don't reach your process
ENTRYPOINT python app.py
Non-root USER¶
RUN groupadd -r appuser && useradd -r -g appuser -d /app -s /sbin/nologin appuser
USER appuser
WORKDIR /app
.dockerignore¶
# .dockerignore
.git
.github
__pycache__
*.pyc
.env
.env.*
node_modules
.vscode
*.md
!README.md
Dockerfile
docker-compose*.yml
Without .dockerignore, COPY . . sends the entire build context (including .git, node_modules, secrets) to the daemon. This slows builds and risks leaking secrets.
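The matching rules can be approximated in a few lines. A simplified sketch (real .dockerignore matching follows Go's filepath.Match semantics plus ** handling, so treat this as a model only):

```python
from fnmatch import fnmatch

def is_ignored(path, patterns):
    """Simplified .dockerignore evaluation: the last matching pattern wins,
    and a leading "!" re-includes a previously excluded path."""
    ignored = False
    for pat in patterns:
        negate = pat.startswith("!")
        if negate:
            pat = pat[1:]
        if fnmatch(path, pat):
            ignored = not negate
    return ignored

patterns = ["*.md", "!README.md", ".env"]
print(is_ignored("NOTES.md", patterns))   # True
print(is_ignored("README.md", patterns))  # False — re-included by !README.md
print(is_ignored("app.py", patterns))     # False
```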
Container Networking¶
Network drivers¶
| Driver | Use case | How it works |
|---|---|---|
| bridge (default) | Single-host, container-to-container | Virtual bridge (docker0), veth pairs, NAT for external access |
| host | Performance-critical, no isolation | Container shares host network namespace directly |
| none | Security-sensitive, custom networking | No networking at all |
| overlay | Multi-host (Swarm, some K8s setups) | VXLAN encapsulation between hosts |
| macvlan | Direct L2 access needed | Container gets its own MAC address on the physical network |
Bridge networking internals¶
# Default bridge
docker network inspect bridge
# Shows: subnet (usually 172.17.0.0/16), gateway, connected containers
# How traffic flows:
# Container eth0 → veth pair → docker0 bridge → iptables NAT → host eth0 → internet
# See the bridge
brctl show docker0 # or: ip link show docker0
# See veth pairs
ip link show type veth
# See NAT rules
iptables -t nat -L -n | grep MASQUERADE
iptables -t nat -L -n | grep DNAT # port mappings
DNS resolution in user-defined networks¶
# Default bridge: no DNS, containers must use --link (deprecated) or IPs
# User-defined bridge: built-in DNS at 127.0.0.11
docker network create mynet
docker run -d --name db --network mynet postgres:16
docker run -it --network mynet busybox nslookup db
# Resolves to the container's IP on mynet
Port mapping¶
# Map host port 8080 to container port 80
docker run -p 8080:80 nginx
# Map to specific host interface
docker run -p 127.0.0.1:8080:80 nginx
# Random host port
docker run -p 80 nginx
docker port <container> # shows the assigned host port
# How it works: iptables DNAT rule
iptables -t nat -L DOCKER -n
# DNAT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:8080 to:172.17.0.2:80
Container Storage¶
Three storage options¶
| Type | Managed by Docker | Persists after container removal | Use case |
|---|---|---|---|
| Volumes | Yes (/var/lib/docker/volumes/) | Yes | Database data, shared config |
| Bind mounts | No (any host path) | Yes (it's the host filesystem) | Development, host config files |
| tmpfs | No (memory-backed) | No | Secrets, temp data, scratch space |
# Named volume
docker volume create pgdata
docker run -v pgdata:/var/lib/postgresql/data postgres:16
# Bind mount
docker run -v /host/path:/container/path:ro nginx
# :ro = read-only inside container
# tmpfs
docker run --tmpfs /tmp:rw,noexec,nosuid,size=100m myapp
# Inspect a volume
docker volume inspect pgdata
# Shows: Mountpoint, Driver, Labels
Volume gotchas¶
# Anonymous volumes are created when Dockerfile has VOLUME directive
# They persist but are easy to lose track of
docker volume ls -f dangling=true # orphaned volumes
# Volume mount vs image data: if the volume is empty, Docker copies
# image data into it (first run only). If the volume has data, it
# shadows the image data completely.
Security¶
Linux capabilities¶
Containers do not run with full root capabilities. Docker drops dangerous capabilities by default and keeps a minimal set.
# Default capabilities (Docker)
# CHOWN, DAC_OVERRIDE, FSETID, FOWNER, MKNOD, NET_RAW, SETGID,
# SETUID, SETFCAP, SETPCAP, NET_BIND_SERVICE, SYS_CHROOT, KILL, AUDIT_WRITE
# Drop all, add only what you need
docker run --cap-drop=ALL --cap-add=NET_BIND_SERVICE myapp
# Check capabilities of a running container
docker inspect --format '{{.HostConfig.CapAdd}}' <container>
docker inspect --format '{{.HostConfig.CapDrop}}' <container>
# Inside the container
cat /proc/1/status | grep Cap
# Decode: capsh --decode=<hex-value>
Seccomp profiles¶
Seccomp filters restrict which system calls a container can make. Docker's default profile blocks ~44 syscalls (including reboot, kexec_load, mount, umount).
# Run with default seccomp profile (automatic)
docker run myapp
# Run with custom seccomp profile
docker run --security-opt seccomp=profile.json myapp
# Run with no seccomp (dangerous, for debugging only)
docker run --security-opt seccomp=unconfined myapp
# Check if seccomp is active
docker inspect --format '{{.HostConfig.SecurityOpt}}' <container>
AppArmor and SELinux¶
# AppArmor (Debian/Ubuntu)
docker run --security-opt apparmor=docker-default myapp
# Check: cat /proc/<pid>/attr/current
# SELinux (RHEL/CentOS/Fedora)
docker run --security-opt label=type:container_t myapp
# Check: ps -eZ | grep <container-process>
Rootless containers¶
Run the entire Docker daemon as an unprivileged user. No root anywhere in the stack.
# Install rootless Docker
dockerd-rootless-setuptool.sh install
# Verify
docker info | grep "rootless"
# "rootless" should appear in Security Options
# Limitations: no binding to ports < 1024, no AppArmor,
# limited storage drivers (overlay2 requires kernel 5.11+)
Read-only root filesystem¶
docker run --read-only --tmpfs /tmp --tmpfs /var/run myapp
# Container cannot write to any path except /tmp and /var/run
# Prevents malware from modifying binaries or writing to disk
BuildKit and Build Cache¶
BuildKit is the modern Docker build engine (default since Docker 23.0). It is faster, supports concurrent layer building, and has better caching.
Key BuildKit features¶
# Enable BuildKit (if not default)
DOCKER_BUILDKIT=1 docker build .
# Build with cache mount (don't re-download pip packages)
# syntax=docker/dockerfile:1
FROM python:3.11-slim
RUN --mount=type=cache,target=/root/.cache/pip \
pip install -r requirements.txt
# Build with secret mount (don't leak secrets in layers)
RUN --mount=type=secret,id=mytoken \
TOKEN=$(cat /run/secrets/mytoken) && \
curl -H "Authorization: Bearer $TOKEN" https://api.example.com/install.sh | sh
# Build command with secrets
docker build --secret id=mytoken,src=./token.txt .
# Cache export/import (for CI)
docker build --cache-to type=local,dest=/tmp/cache .
docker build --cache-from type=local,src=/tmp/cache .
Build cache invalidation¶
Cache is invalidated when:
- The instruction changes (different RUN command)
- A COPY/ADD source file changes (content hash, not timestamp)
- Any parent layer is invalidated (cascade)
# See build cache usage
docker builder prune --all --dry-run
# Clear build cache
docker builder prune --all
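A useful mental model: each layer's cache key covers the instruction text, the parent layer's key, and the content hashes of any copied files, which is why invalidating a parent cascades to every child. This sketch is a conceptual model, not BuildKit's actual keying:

```python
import hashlib

def cache_key(instruction, parent_key, file_hashes=()):
    """Toy layer-cache key: hash of the parent key, the instruction text,
    and (for COPY/ADD) the content hashes of the source files."""
    h = hashlib.sha256()
    h.update(parent_key.encode())
    h.update(instruction.encode())
    for fh in sorted(file_hashes):
        h.update(fh.encode())
    return h.hexdigest()

base = cache_key("FROM python:3.11-slim", "")
deps = cache_key("COPY requirements.txt /app/", base, ["abc123"])
# Changing a copied file's content changes the key, invalidating this
# layer and everything built on top of it:
deps2 = cache_key("COPY requirements.txt /app/", base, ["def456"])
print(deps == deps2)  # False
```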
Podman: Docker Alternative¶
Podman is a daemonless, rootless container engine. It is CLI-compatible with Docker but architecturally different.
Key differences from Docker¶
| Aspect | Docker | Podman |
|---|---|---|
| Daemon | dockerd (always running) | No daemon (fork/exec model) |
| Root requirement | Requires root by default | Rootless by default |
| Socket | /var/run/docker.sock | No socket needed (but can emulate) |
| Systemd integration | Requires extra config | Native (generates systemd units) |
| Pod concept | No native pods | First-class pods (like K8s) |
| OCI compliance | Yes | Yes |
| Compose | docker-compose / docker compose | podman-compose or podman compose |
# Drop-in replacement for Docker CLI
alias docker=podman
# Run a container
podman run -d --name web -p 8080:80 nginx:1.25
# Create a pod (Kubernetes-style)
podman pod create --name mypod -p 8080:80
podman run -d --pod mypod --name web nginx:1.25
podman run -d --pod mypod --name app myapp:latest
# Both containers share the same network namespace (like K8s)
# Generate a systemd unit file
podman generate systemd --new --name web > ~/.config/systemd/user/container-web.service
systemctl --user enable --now container-web
# Generate a Kubernetes YAML
podman generate kube mypod > pod.yaml
Podman rootless internals¶
Podman uses user namespaces to map the container's UID 0 to your unprivileged UID. It uses slirp4netns (or pasta) for network namespace connectivity without root.
# Check subuid/subgid mappings
cat /etc/subuid
# youruser:100000:65536
# Verify rootless mode
podman info | grep rootless
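The subuid format is colon-separated and trivial to parse. A throwaway helper (parse_subuid is not a real tool):

```python
def parse_subuid(line):
    """Parse an /etc/subuid entry "user:start:count" into its fields."""
    user, start, count = line.strip().split(":")
    return user, int(start), int(count)

user, start, count = parse_subuid("youruser:100000:65536")
# This user may map container UIDs onto host UIDs 100000..165535.
print(user, start, start + count - 1)  # youruser 100000 165535
```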
Pulling It All Together¶
When you run docker run -d -p 8080:80 --memory=512m --name web nginx:1.25, here is what happens:
- Docker CLI parses flags, sends request to dockerd via REST API
- dockerd resolves the image, checks local cache, pulls if needed
- containerd prepares the rootfs (assembles overlay layers), creates an OCI runtime spec
- containerd-shim is spawned; it invokes runc
- runc creates namespaces (PID, NET, MNT, UTS, IPC), sets up cgroups (512MB memory limit), pivots root to the overlay mount, drops capabilities, applies seccomp
- runc executes nginx as PID 1 inside the new namespace set
- Docker networking creates a veth pair, plugs one end into the docker0 bridge, assigns an IP, adds an iptables DNAT rule for port 8080 → container:80
- containerd-shim remains as the parent process, reporting status back to containerd
Every debugging technique in the street ops and footguns files traces back to one of these layers. Know which layer you are operating at, and you will know which tools to reach for.
Runtime Debugging¶
When kubectl logs isn't enough, you need to go deeper. Container runtime debugging is the ability to inspect containers at the OS level -- examining namespaces, filesystems, network stacks, and processes.
crictl -- Container Runtime CLI¶
crictl talks directly to the container runtime (containerd/CRI-O), bypassing Kubernetes. Essential when kubelet is down.
crictl ps # list all containers
crictl pods # list pods
crictl inspect <container-id> # inspect a container
crictl logs <container-id> # get logs (bypasses kubelet)
crictl exec -it <container-id> sh # execute in container
crictl stats # container stats
nsenter -- Enter Container Namespaces¶
# Find the container's PID
PID=$(crictl inspect <container-id> | jq '.info.pid')
# Enter network namespace (most common)
nsenter -t $PID -n -- ip addr
nsenter -t $PID -n -- ss -tlnp
nsenter -t $PID -n -- curl localhost:8000/health
# Enter mount namespace to inspect files
nsenter -t $PID -m -- ls -la /app/
nsenter -t $PID -m -- cat /etc/resolv.conf
| Flag | Namespace | Isolates |
|---|---|---|
| -m | Mount | Filesystem view |
| -u | UTS | Hostname |
| -n | Network | Network stack |
| -p | PID | Process tree |
kubectl debug -- Ephemeral Containers¶
# Debug with network tools
kubectl debug -it pod/myapp -n ns --image=nicolaka/netshoot -- bash
# Share process namespace (see app's processes)
kubectl debug -it pod/myapp -n ns --image=busybox:1.36 --target=myapp -- sh
# Debug a node
kubectl debug node/worker-1 -it --image=busybox:1.36
strace -- System Call Tracing¶
When a process crashes or hangs with no useful logs:
strace -p $PID -f -e trace=network # network calls
strace -p $PID -f -e trace=open,read # file operations
| Pattern | Meaning |
|---|---|
| connect() = -1 ECONNREFUSED | Target service down |
| connect() = -1 ETIMEDOUT | Network/firewall issue |
| open("...") = -1 ENOENT | Missing file/mount |
| futex(FUTEX_WAIT) hanging | Deadlock |
Common Runtime Debugging Pitfalls¶
- Distroless images -- No shell. Use kubectl debug with a debug image.
- Read-only filesystems -- Can't write temp files. Use ephemeral containers.
- Dropped capabilities -- Container may lack NET_ADMIN, SYS_PTRACE.
- PID namespace isolation -- Use the --target flag with kubectl debug.
Wiki Navigation¶
Prerequisites¶
- Docker (Topic Pack, L1)
Next Steps¶
- Container Runtime Drills (Drill, L2)
- Skillcheck: Container Runtime Debug (Assessment, L2)
Related Content¶
- Deep Dive: Containers How They Really Work (deep_dive, L2) — Container Runtimes, Docker / Containers
- Interview: Docker Container Debugging (Scenario, L1) — Container Runtimes, Docker / Containers
- AWS ECS (Topic Pack, L2) — Docker / Containers
- Case Study: CI Pipeline Fails — Docker Layer Cache Corruption (Case Study, L2) — Docker / Containers
- Case Study: Container Vuln Scanner False Positive Blocks Deploy (Case Study, L2) — Docker / Containers
- Case Study: ImagePullBackOff Registry Auth (Case Study, L1) — Docker / Containers
- Container Images (Topic Pack, L1) — Docker / Containers
- Container Runtime Drills (Drill, L2) — Container Runtimes
- Container Runtime Flashcards (CLI) (flashcard_deck, L1) — Container Runtimes
- Deep Dive: Docker Image Internals (deep_dive, L2) — Docker / Containers
Pages that link here¶
- Anti-Primer: Containers Deep Dive
- Certification Prep: CKA — Certified Kubernetes Administrator
- Certification Prep: CKS — Certified Kubernetes Security Specialist
- Comparison: Container Orchestrators
- Comparison: Local Dev for Kubernetes
- Container Images
- Container Runtime Debugging - Skill Check
- Container Runtime Debugging Drills
- Containers - How They Really Work
- Containers Deep Dive
- Docker
- Docker Drills
- Docker Image Internals
- Incident Replay: Node Pressure Evictions
- Master Curriculum: 40 Weeks