
cgroups & Namespaces Footguns

Mistakes that cause OOM kills, resource leaks, broken containers, and hours of debugging.


1. cgroups v1 vs v2 path confusion

You write to /sys/fs/cgroup/memory/docker/<id>/memory.limit_in_bytes — nothing happens. Your system runs cgroups v2 and that path doesn't exist. On v2, the file is /sys/fs/cgroup/<path>/memory.max. Different names, different value formats, different semantics.

Fix: Check first: stat -f --format=%T /sys/fs/cgroup/. Returns cgroup2fs for v2, tmpfs for v1. Never assume which version you're on.

Default trap: RHEL 9, Ubuntu 22.04+, Fedora 31+, and Debian 11+ default to cgroups v2. RHEL 8, Ubuntu 20.04, and Amazon Linux 2 default to v1. If your Ansible playbooks or container tooling was written for v1 paths, it will silently fail or write to the wrong files on v2 systems. Kubernetes 1.25+ fully supports v2, but older versions require v1 or a hybrid setup.
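The version check and the path difference can be folded into one helper. A minimal sketch, assuming a v2 unified hierarchy or a v1 per-controller tree under /sys/fs/cgroup (the cgroup name mygroup is a placeholder):

```shell
# Pick the right memory-limit file for the running cgroup version.
# "mygroup" is a placeholder cgroup name, not from the article.
cgroup_mem_limit_file() {
    fstype=$(stat -f --format=%T /sys/fs/cgroup/ 2>/dev/null)
    if [ "$fstype" = "cgroup2fs" ]; then
        echo "/sys/fs/cgroup/$1/memory.max"                    # v2: unified hierarchy
    else
        echo "/sys/fs/cgroup/memory/$1/memory.limit_in_bytes"  # v1: per-controller tree
    fi
}

cgroup_mem_limit_file mygroup
```

Tooling that resolves the path this way keeps working when a host is upgraded from a v1 to a v2 distribution.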


2. Docker cgroupfs vs systemd driver mismatch

Docker's classic default is the cgroupfs cgroup driver, which creates cgroups directly and bypasses systemd (recent releases default to systemd on cgroup v2 hosts, but older installs still have cgroupfs). With cgroupfs, systemd-cgtop shows wrong numbers. And if kubelet uses the systemd driver while the container runtime uses cgroupfs, pods fail with cgroup creation errors.

Fix: Always use the systemd driver on systemd-based systems. Set "exec-opts": ["native.cgroupdriver=systemd"] in Docker's daemon.json or SystemdCgroup = true in containerd config. Kubelet and runtime must match.
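A small sketch that emits the daemon.json fragment from the fix above for review before installing it as /etc/docker/daemon.json (merge with any existing keys rather than overwriting the file):

```shell
# Emit the exec-opts fragment from the fix above; docker_daemon_json is a
# helper name invented here for illustration.
docker_daemon_json() {
    cat <<'EOF'
{
  "exec-opts": ["native.cgroupdriver=systemd"]
}
EOF
}

docker_daemon_json
# After restarting the daemon, verify the active driver:
#   docker info --format '{{.CgroupDriver}}'   # should print "systemd"
```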


3. memory.limit_in_bytes vs memory.max — not just a rename

v2 introduced memory.high — a throttling threshold with no v1 equivalent. On v1, hitting the limit triggers reclaim and then, if reclaim fails, an OOM kill; there is no intermediate throttle. On v2, memory.high triggers reclaim and throttling before memory.max triggers OOM. If you only set memory.max, you lose the soft throttling safety net.

Fix: On v2, set both memory.high (80-90% of max) and memory.max. In Kubernetes, limits map to memory.max; requests influence scheduling, and only set memory.high when the MemoryQoS feature gate is enabled.
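A sketch that derives memory.high at 85% of memory.max, within the 80-90% band suggested above. It only computes the values; the commented writes require root and an existing v2 cgroup, with the path left as a placeholder:

```shell
# Compute a soft/hard limit pair; soft_hard_limits is a helper name
# invented here. Shell arithmetic is integer-only, which is fine for bytes.
soft_hard_limits() {
    max=$1
    high=$(( max * 85 / 100 ))
    echo "memory.high=$high memory.max=$max"
    # echo "$high" > /sys/fs/cgroup/<path>/memory.high   # root only
    # echo "$max"  > /sys/fs/cgroup/<path>/memory.max
}

soft_hard_limits $(( 512 * 1024 * 1024 ))
```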


4. Not accounting for kernel memory in limits

You set 512MB based on heap usage. The app uses 400MB. Then the kernel allocates page tables, socket buffers, dentry caches inside the same cgroup — the container OOM-kills even though the app thinks it's within budget.

cat /sys/fs/cgroup/system.slice/docker-<id>.scope/memory.stat | grep kernel
# kernel 67108864    ← 64MB you didn't budget for
# sock 2097152      ← socket buffers (high on network-heavy services)

Fix: Add 10-20% above expected app usage. For network-heavy services, monitor sock in memory.stat and budget accordingly.


5. OOM killer vs cgroup OOM — different beasts

The system OOM killer fires when the host runs out of memory. The cgroup OOM killer fires when a cgroup exceeds memory.max. Kubernetes OOMKilled = cgroup OOM. Unexplained dmesg OOM kills with no container reporting it = system OOM, likely from containers without limits.

Fix: Always set memory limits. Monitor memory.events (cgroup) and dmesg (system). Set memory.high to throttle before memory.max kills.
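The memory.events check can be mechanized. A sketch, assuming a v2-style memory.events file (oom_kills is a helper name invented here):

```shell
# Extract the cgroup-level kill counter from a memory.events file.
# A nonzero oom_kill here means the cgroup OOM killer fired, regardless
# of what dmesg shows for the system killer.
oom_kills() {
    awk '$1 == "oom_kill" { print $2 }' "$1"
}

# Usage against a real cgroup (path placeholder as in the article):
#   oom_kills /sys/fs/cgroup/<path>/memory.events
```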


6. cpu.shares is relative, not absolute

cpu.shares=512 doesn't mean 512 units of CPU. Shares only matter relative to other cgroups competing at the same time. If nobody else is competing, even the minimum share value gets 100% of available CPU. On v2, cpu.weight (1-10000) has the same relative semantics.

Fix: For a hard CPU ceiling, use cpu.cfs_quota_us (v1) or cpu.max (v2). In Kubernetes, requests.cpu = shares (proportional), limits.cpu = quota (hard ceiling).
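The v2 cpu.max file takes a "<quota> <period>" pair in microseconds ("max" means unlimited). A sketch that builds the string for a whole number of CPUs (cpu_max_for is a helper name invented here):

```shell
# Build a v2 cpu.max value for N full CPUs using the default 100ms period.
# Whole CPUs only: shell arithmetic is integer-only.
cpu_max_for() {
    cpus=$1
    period=100000
    echo "$(( cpus * period )) $period"
}

cpu_max_for 2
# echo "$(cpu_max_for 2)" > /sys/fs/cgroup/<path>/cpu.max   # root only
```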


7. PID 1 in a PID namespace doesn't handle signals

Your app is PID 1 in the container. docker stop sends SIGTERM. But the kernel doesn't deliver signals to PID 1 unless it registered a handler. SIGTERM is silently ignored. Docker waits 10 seconds, then SIGKILL — no graceful shutdown, no connection draining.

time docker stop myapp
# real    0m10.003s    ← SIGTERM ignored, waited for SIGKILL timeout

Fix: Use tini, dumb-init, or docker run --init as PID 1 to forward signals. Or add signal handlers to your application.

Under the hood: The kernel treats PID 1 specially: signals without an explicit handler are silently dropped (except SIGKILL and SIGSTOP). This is by design for the host's init process — you don't want a stray SIGTERM to kill PID 1 and crash the system. But in a PID namespace (container), your application inherits this special treatment. tini registers handlers for all signals and forwards them to your application as a regular child process, where they behave normally.
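When an image can't ship tini, a shell wrapper can do the core of the same job: register an explicit handler so SIGTERM is no longer dropped, and relay it to the real application. A minimal sketch (run_as_init is a name invented here; tini and dumb-init handle the full job, including re-waiting when a signal interrupts wait and reaping orphaned zombies):

```shell
# Minimal signal-forwarding wrapper for use as PID 1.
run_as_init() {
    "$@" &                                     # start the app as a child
    child=$!
    trap 'kill -TERM "$child" 2>/dev/null' TERM INT
    wait "$child"                              # returns the child's exit status
    status=$?
    trap - TERM INT
    return "$status"
}

run_as_init echo "app ran and exited cleanly"
```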


8. User namespace UID mapping confusion

With Docker userns-remap, container root (UID 0) maps to an unprivileged host UID (e.g., 100000). Bind-mounted host directories become inaccessible because the container's mapped UID doesn't own the host files.

Fix: Adjust bind mount ownership to match the remapped UID range, or use named volumes. Not all workloads work with user namespaces — some need real root for privileged ports or network operations.
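The host-side owner for a container UID is just the subuid base plus the container UID. A sketch, assuming the common 100000 base from /etc/subuid (remapped_uid is a helper name invented here):

```shell
# Map a container UID to its host UID under userns-remap.
remapped_uid() {
    base=$1
    container_uid=$2
    echo $(( base + container_uid ))
}

remapped_uid 100000 0    # container root's files appear as this host UID
# chown -R "$(remapped_uid 100000 0)" /srv/appdata  # then bind-mount /srv/appdata
```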


9. /proc inside containers leaking host info

/proc/meminfo shows host RAM, not the container's cgroup limit. /proc/cpuinfo shows all host CPUs. nproc returns the host's core count. Applications that auto-tune from these files (Java heap sizing, Go's GOMAXPROCS) misconfigure themselves.

# Inside a container limited to 256MB and 1 CPU:
cat /proc/meminfo | head -1
# MemTotal: 65536000 kB    ← host's 64GB, not 256MB
nproc
# 32                        ← host's 32 cores, not 1

Fix: Modern JVMs (10+) use -XX:+UseContainerSupport (default on). For Go, use go.uber.org/automaxprocs. Ensure apps read cgroup files, not /proc/meminfo.

Gotcha: Python's os.cpu_count() and Node.js's os.cpus().length both report the host's CPU count, not the container's cgroup limit. Worker pools sized from them (gunicorn's --workers=$(nproc), the Node.js cluster module) will spawn 32 workers on a 32-core host even if the container is limited to 1 CPU. Hardcode worker counts in container environments, or derive them from the cgroup CPU quota.
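Deriving a worker count from the quota is straightforward on v2. A sketch, assuming a cpu.max file in "<quota> <period>" form (cgroup_cpus is a helper name invented here; it falls back to nproc when no quota is set):

```shell
# Effective CPU count from the v2 cpu.max file, rounded up.
cgroup_cpus() {
    file=${1:-/sys/fs/cgroup/cpu.max}
    if [ -r "$file" ]; then
        read -r quota period < "$file"
        if [ "$quota" != "max" ]; then
            echo $(( (quota + period - 1) / period ))   # ceil(quota/period)
            return
        fi
    fi
    nproc 2>/dev/null || echo 1    # no quota (or no file): host view
}

cgroup_cpus    # inside a 1-CPU container: 1, not the host's core count
```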


10. cgroup memory accounting surprises

memory.current shows 200MB, limit is 256MB. Looks safe. Then OOM. The hidden cost: tmpfs mounts (/dev/shm, emptyDir with medium: Memory) count against the memory cgroup. Shared memory segments count. Huge pages may or may not count depending on configuration.

cat /sys/fs/cgroup/<path>/memory.stat
# anon 104857600     ← 100MB heap
# file 52428800      ← 50MB page cache
# shmem 83886080     ← 80MB shared memory/tmpfs ← THE SURPRISE

Fix: Monitor memory.stat breakdowns, not just memory.current. Budget tmpfs and shared memory into limits. In Kubernetes, emptyDir with medium: Memory counts against the container's cgroup.