# Thinking Out Loud: Containers Deep Dive
A senior SRE's internal monologue while working through a real container debugging task. This isn't a tutorial — it's a window into how experienced engineers actually think.
## The Situation
A containerized Python application performs 3x slower inside a Docker container than running directly on the host. The team says "it must be container overhead" and wants to switch back to VM-based deployment. I need to figure out what's actually causing the slowdown before we abandon containers.
## The Monologue
3x slower in a container. Containers add negligible overhead for CPU-bound work (they're just cgroups and namespaces — no hypervisor). 3x is WAY too much to be "container overhead." Something else is going on. Let me prove it.
First, let me establish the baseline. What does "3x slower" actually mean?
# Run the benchmark on the host
python3 /opt/app/benchmark.py
# Run the same benchmark in the container
docker run -v /opt/app:/app myapp:latest python3 /app/benchmark.py
Host: 12.3 seconds. Container: 38.1 seconds. That's actually 3.1x slower. Let me check what kind of work the benchmark does.
It's CPU-bound work (matrix multiplication with numpy) plus some file I/O (writing results to disk). The CPU work should be identical in a container. Let me isolate it.
# CPU-only benchmark
docker run myapp:latest python3 -c "
import numpy as np
import time
start = time.time()
for i in range(100):
    np.random.rand(1000,1000) @ np.random.rand(1000,1000)
print(f'Time: {time.time()-start:.2f}s')
"
CPU benchmark in container: 15.1 seconds. On host: 4.9 seconds. Still 3x slower for pure CPU work. That shouldn't happen unless... let me check the CPU limits on the container.
# The image config won't show runtime CPU limits; those are set per container
docker inspect myapp:latest --format '{{.Config}}' 2>/dev/null
# Ground truth: read the cgroup limit from inside a container
docker run --rm myapp:latest cat /sys/fs/cgroup/cpu.max
100000 100000. That's a CPU quota of 100ms per 100ms period — effectively 1 CPU. The host has 4 cores. The numpy matrix multiplication is multi-threaded and uses all 4 cores on the host, but only gets 1 core's worth of CPU time in the container.
### Mental Model: Container CPU Limits vs Multi-Threaded Apps
Containers don't limit the number of threads — they limit the total CPU time. A container with `--cpus=1` can still spawn 4 threads, but they collectively get only 1 CPU-second per wall-clock second. Multi-threaded workloads that see 4 cores (from `/proc/cpuinfo`) and spawn 4 threads will hit throttling. The threads compete for a single CPU's worth of time, and context switching makes it even worse than running single-threaded.
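The quota arithmetic in that box is worth making concrete. A minimal sketch (`effective_cpus` is my own name, not a real API; the input format is cgroup v2's `cpu.max`):

```python
def effective_cpus(cpu_max: str, host_cpus: int) -> int:
    """CPU budget implied by a cgroup v2 cpu.max line ("<quota> <period>" in µs)."""
    quota, period = cpu_max.split()
    if quota == "max":  # no quota set: the whole host is available
        return host_cpus
    # Ceiling division: a 150000/100000 quota is 1.5 CPUs of time,
    # which is enough to keep 2 threads partially busy
    return max(1, -(-int(quota) // int(period)))

print(effective_cpus("100000 100000", 4))  # this container: 1
print(effective_cpus("max 100000", 4))     # no quota: 4
```

The host in this story has 4 cores but the container's budget works out to 1, which is exactly the 4-threads-fighting-for-1-CPU situation described above.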
This is the issue. Numpy sees 4 CPUs (from the host's /proc/cpuinfo leaked into the container) and spawns 4 threads. But the container only gets 1 CPU's worth of time. The 4 threads fight for 1 CPU, adding context-switch overhead.
Let me verify by checking what numpy thinks it has.
docker run myapp:latest python3 -c "import os; print(f'CPUs visible: {os.cpu_count()}')"
"CPUs visible: 4." Numpy sees 4 CPUs and parallelizes accordingly, but the cgroup limit caps total CPU time to 1. The fix: either increase the CPU limit or tell the application how many CPUs it actually has.
# Option 1: Run with more CPUs allocated
docker run --cpus=4 myapp:latest python3 /app/benchmark.py
# Option 2: Set correct CPU count in the environment
docker run --cpus=1 -e OMP_NUM_THREADS=1 -e OPENBLAS_NUM_THREADS=1 myapp:latest python3 /app/benchmark.py
With --cpus=4: 5.2 seconds. Almost matches the host (4.9 seconds). The remaining 0.3-second difference (about 6%) is actual container overhead — negligible.
With --cpus=1 and OMP_NUM_THREADS=1: 12.8 seconds. Slower than the host (because 1 core vs 4), but not 3x slower anymore. It's actually faster than the original container run because there's no thread contention.
### Mental Model: The /proc/cpuinfo Lie
Inside a container, `/proc/cpuinfo` shows ALL host CPUs regardless of cgroup limits. Applications that read `/proc/cpuinfo` (or CPU-count APIs built on it) to size their thread pools (the JVM, numpy/OpenBLAS, the Go runtime, Node.js cluster setups) will over-parallelize and hit throttling. Some runtimes have caught up: Java 10+ reads cgroup limits by default, and recent Go releases make GOMAXPROCS cgroup-aware (before that, uber-go/automaxprocs was the standard workaround). For everything else, set environment variables (OMP_NUM_THREADS, GOMAXPROCS) or use `--cpuset-cpus`, which pins the container to specific cores: affinity-aware tools like nproc then report the smaller count, even though `/proc/cpuinfo` itself still lists every host CPU.
But wait — the Dockerfile uses FROM python:3.11. Let me check if this base image sets any CPU-aware defaults.
docker run myapp:latest python3 -c "import multiprocessing; print(multiprocessing.cpu_count())"
4. Python's multiprocessing.cpu_count() reads from the OS, which shows all host CPUs. Python 3.13 added os.process_cpu_count(), which respects the process's affinity mask (so it helps with --cpuset-cpus, though not with a --cpus quota). But we're on 3.11.
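For 3.11, a rough backport is possible by combining the affinity mask with the cgroup v2 quota. A sketch under those assumptions; `container_cpu_count` is a hypothetical helper, not a stdlib API:

```python
import os

def container_cpu_count() -> int:
    """Best-effort count of CPUs this process can actually use (Python 3.11).

    Combines the scheduler affinity mask (covers --cpuset-cpus) with the
    cgroup v2 quota in /sys/fs/cgroup/cpu.max (covers --cpus).
    """
    if hasattr(os, "sched_getaffinity"):  # Linux only
        count = len(os.sched_getaffinity(0))
    else:
        count = os.cpu_count() or 1
    try:
        with open("/sys/fs/cgroup/cpu.max") as f:
            quota, period = f.read().split()
        if quota != "max":
            # Round up: a 1.5-CPU quota is enough work for 2 threads
            count = min(count, max(1, -(-int(quota) // int(period))))
    except (OSError, ValueError):
        pass  # no cgroup v2 file (or unexpected format): fall back to affinity
    return count
```

In the container from this story, the affinity mask says 4 but the quota says 1, so the helper would return 1, which is the number OMP_NUM_THREADS should be set to.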
The proper fix for the Dockerfile:
# In the Dockerfile (note: ENV only expands ${...} at build time, from a build ARG)
ARG CPU_LIMIT=1
ENV OMP_NUM_THREADS=${CPU_LIMIT}
ENV OPENBLAS_NUM_THREADS=${CPU_LIMIT}
ENV MKL_NUM_THREADS=${CPU_LIMIT}
But that's static. For Kubernetes, the container should read its own cgroup limits dynamically.
# Entrypoint wrapper that reads the cgroup v2 CPU limit at startup
cat <<'EOF' > /opt/app/set-cpu-env.sh
#!/bin/bash
if [ -f /sys/fs/cgroup/cpu.max ]; then
  quota=$(cut -d' ' -f1 /sys/fs/cgroup/cpu.max)
  period=$(cut -d' ' -f2 /sys/fs/cgroup/cpu.max)
  if [ "$quota" != "max" ]; then
    # Round up so a 1.5-CPU quota gets 2 threads, not 1
    cpus=$(( (quota + period - 1) / period ))
    [ "$cpus" -lt 1 ] && cpus=1
    export OMP_NUM_THREADS=$cpus
    export OPENBLAS_NUM_THREADS=$cpus
  fi
fi
exec "$@"
EOF
chmod +x /opt/app/set-cpu-env.sh
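For Python apps where you can't change the entrypoint, the same logic can run in-process, as long as it executes before numpy is imported (OpenBLAS and MKL read these variables once, at library load time). A sketch assuming the cgroup v2 path; `cap_blas_threads` is my own name:

```python
import os

def cap_blas_threads(cgroup_file: str = "/sys/fs/cgroup/cpu.max") -> None:
    """Set BLAS/OpenMP thread counts from the cgroup v2 CPU quota.

    Must run BEFORE importing numpy: OpenBLAS and MKL read these
    environment variables once, when the library is loaded.
    """
    try:
        with open(cgroup_file) as f:
            quota, period = f.read().split()
    except (OSError, ValueError):
        return  # no cgroup v2 quota file: leave the defaults alone
    if quota == "max":
        return
    n = str(max(1, -(-int(quota) // int(period))))  # ceiling division
    for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
        os.environ.setdefault(var, n)  # don't override an explicit setting

cap_blas_threads()
# import numpy as np  # import only after the environment is set
```

Putting this at the top of the app's main module (or in a sitecustomize.py) gets the same effect as the shell wrapper without touching the container's entrypoint.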
Now let me also check the file I/O part of the benchmark. The original benchmark included disk writes.
docker run -v /tmp/benchmark:/tmp/output myapp:latest dd if=/dev/zero of=/tmp/output/testfile bs=1M count=100 oflag=direct
100MB in 0.34 seconds. That's 294 MB/s, which is fine: the file I/O isn't the bottleneck. One caveat: this writes through a bind mount, so it bypasses overlayfs entirely. Writes to the container's own filesystem go through the overlay, which adds some copy-up overhead, but that's still negligible for sequential I/O like this benchmark's result files.
Let me also check memory.
docker run myapp:latest python3 -c "
import resource
print(f'Max RSS: {resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024:.1f} MB')
"
Memory is fine — no swapping, no cgroup memory pressure.
Summary for the team: the 3x slowdown was caused by numpy spawning 4 threads (seeing 4 host CPUs) but the container only having 1 CPU's worth of quota. The fix is to either allocate more CPU to the container or set OMP_NUM_THREADS to match the CPU limit. Container overhead itself is less than 5%.
The team does NOT need to abandon containers. They need to set CPU-aware environment variables in their Dockerfile.
## What Made This Senior-Level
| Junior Would... | Senior Does... | Why |
|---|---|---|
| Accept "container overhead" as the explanation for 3x slowdown | Know that containers add <5% overhead and investigate the real cause | Containers are cgroups + namespaces, not VMs. 3x overhead is a misconfiguration, not an inherent limitation |
| Not think about CPU limits vs thread count | Check the cgroup CPU quota and how many threads the application spawns | Multi-threaded apps over-parallelize when they see host CPUs but are limited by cgroup quotas |
| Just increase the CPU limit | Also fix the application to be cgroup-aware via environment variables | Increasing the limit works but wastes resources. The app should adapt to its actual CPU allocation |
| Not check /proc/cpuinfo vs cgroup limits | Know that /proc/cpuinfo leaks host info into containers and most runtimes don't compensate | This is the #1 cause of container performance issues with multi-threaded applications |
## Key Heuristics Used
- Containers =/= VMs: Container CPU overhead is <5%. If you see 3x+ slowdown, it's a misconfiguration (usually CPU limits vs thread count), not inherent overhead.
- The /proc/cpuinfo Lie: Containers see all host CPUs in /proc/cpuinfo but are limited by cgroup quotas. Multi-threaded apps over-parallelize and context-switch.
- Set CPU-Aware Environment Variables: For libraries that don't read cgroup limits (numpy/OpenBLAS, older JVMs), set OMP_NUM_THREADS, GOMAXPROCS, etc. to match the container's CPU allocation.
## Cross-References
- Primer — Container internals: namespaces, cgroups, and the difference from VMs
- Street Ops — Container performance debugging, cgroup inspection, and resource limit tuning
- Footguns — /proc/cpuinfo leaking host info, multi-threaded apps in CPU-limited containers, and overlay filesystem write amplification