
Thinking Out Loud: Containers Deep Dive

A senior SRE's internal monologue while working through a real container debugging task. This isn't a tutorial — it's a window into how experienced engineers actually think.

The Situation

A containerized Python application performs 3x slower inside a Docker container than running directly on the host. The team says "it must be container overhead" and wants to switch back to VM-based deployment. I need to figure out what's actually causing the slowdown before we abandon containers.

The Monologue

3x slower in a container. Containers add negligible overhead for CPU-bound work (they're just cgroups and namespaces — no hypervisor). 3x is WAY too much to be "container overhead." Something else is going on. Let me prove it.

First, let me establish the baseline. What does "3x slower" actually mean?

# Run the benchmark on the host
python3 /opt/app/benchmark.py

# Run the same benchmark in the container
docker run -v /opt/app:/app myapp:latest python3 /app/benchmark.py

Host: 12.3 seconds. Container: 38.1 seconds. That's actually 3.1x slower. Let me check what kind of work the benchmark does.

head -20 /opt/app/benchmark.py

It's CPU-bound work (matrix multiplication with numpy) plus some file I/O (writing results to disk). The CPU work should be identical in a container. Let me isolate it.

# CPU-only benchmark
docker run myapp:latest python3 -c "
import numpy as np
import time
start = time.time()
for i in range(100):
    np.random.rand(1000,1000) @ np.random.rand(1000,1000)
print(f'Time: {time.time()-start:.2f}s')
"

CPU benchmark in container: 15.1 seconds. On host: 4.9 seconds. Still 3x slower for pure CPU work. That shouldn't happen unless... let me check the CPU limits on the container.

# CPU limits live on the container (HostConfig), not the image config
docker inspect $(docker ps -lq) --format '{{.HostConfig.NanoCpus}}' 2>/dev/null
# And check the cgroup v2 limit from inside the container
docker run --rm myapp:latest cat /sys/fs/cgroup/cpu.max

100000 100000. That's a CPU quota of 100ms per 100ms period — effectively 1 CPU. The host has 4 cores. The numpy matrix multiplication is multi-threaded and uses all 4 cores on the host, but only gets 1 core's worth of CPU time in the container.

Mental Model: Container CPU Limits vs Multi-Threaded Apps

Containers don't limit the number of threads — they limit the total CPU time. A container with --cpus=1 can still spawn 4 threads, but they collectively get only 1 CPU-second per wall-clock-second. Multi-threaded workloads that see 4 cores (from /proc/cpuinfo) and spawn 4 threads will hit throttling. The threads compete for a single CPU's worth of time, and context switching makes it even worse than running single-threaded.

This is the issue. Numpy sees 4 CPUs (from the host's /proc/cpuinfo leaked into the container) and spawns 4 threads. But the container only gets 1 CPU's worth of time. The 4 threads fight for 1 CPU, adding context-switch overhead.
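That quota-over-period arithmetic is worth capturing in a helper. A minimal sketch — the function name and the cgroup v2 path are my assumptions, and cgroup v1 keeps the same numbers in different files:

```python
import os
from pathlib import Path

def cpus_from_cgroup(path="/sys/fs/cgroup/cpu.max"):
    """Effective CPU budget from a cgroup v2 cpu.max file.

    cpu.max holds '<quota> <period>' in microseconds, or 'max <period>'
    when unlimited. --cpus=1 writes '100000 100000', i.e. 1 CPU's worth.
    """
    p = Path(path)
    if not p.exists():                 # not in a cgroup v2 environment
        return os.cpu_count() or 1
    quota, period = p.read_text().split()
    if quota == "max":                 # no quota: all visible CPUs
        return os.cpu_count() or 1
    # Round up, so a 1.5-CPU quota sizes pools at 2 threads, never 0
    return max(1, -(-int(quota) // int(period)))
```

On this container, `cpus_from_cgroup()` would return 1 — exactly the number numpy should have been told.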

Let me verify by checking what numpy thinks it has.

docker run myapp:latest python3 -c "import os; print(f'CPUs visible: {os.cpu_count()}')"

"CPUs visible: 4." Numpy sees 4 CPUs and parallelizes accordingly, but the cgroup limit caps total CPU time to 1. The fix: either increase the CPU limit or tell the application how many CPUs it actually has.

# Option 1: Run with more CPUs allocated
docker run --cpus=4 myapp:latest python3 /app/benchmark.py

# Option 2: Set correct CPU count in the environment
docker run --cpus=1 -e OMP_NUM_THREADS=1 -e OPENBLAS_NUM_THREADS=1 myapp:latest python3 /app/benchmark.py

With --cpus=4: 5.2 seconds. Almost matches the host (4.9 seconds). The remaining 0.3 second difference is actual container overhead — negligible.

With --cpus=1 and OMP_NUM_THREADS=1: 12.8 seconds. Slower than the host (because 1 core vs 4), but not 3x slower anymore. It's actually faster than the original container run because there's no thread contention.

Mental Model: The /proc/cpuinfo Lie

Inside a container, /proc/cpuinfo shows ALL host CPUs regardless of cgroup limits. Applications that size thread pools from the visible CPU count (the JVM, numpy/OpenBLAS, the Go runtime, Node.js) will over-parallelize and hit throttling. Some runtimes have fixes: Java 10+ (and 8u191+) reads cgroup limits by default, and Go's runtime only gained cgroup-aware GOMAXPROCS in 1.25 — earlier versions need uber-go/automaxprocs or an explicit GOMAXPROCS. For the rest, set environment variables (OMP_NUM_THREADS, GOMAXPROCS) or use --cpuset-cpus so that affinity-aware CPU detection returns the right number.

But wait — the Dockerfile uses FROM python:3.11. Let me check if this base image sets any CPU-aware defaults.

docker run myapp:latest python3 -c "import multiprocessing; print(multiprocessing.cpu_count())"

4. Python's multiprocessing.cpu_count() reads from the OS, which shows all host CPUs. Python 3.13 added os.process_cpu_count(), which respects the CPU affinity mask — that helps under --cpuset-cpus, but it still can't see quota-based limits like --cpus. Either way, we're on 3.11.
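A best-effort fallback chain for 3.11 is easy to sketch (the helper name is mine; note that every API in it respects the affinity mask at most, never the cgroup quota):

```python
import os

def usable_cpu_count():
    """Best-effort count of CPUs this process may actually run on.

    Prefers os.process_cpu_count (3.13+), then the Linux affinity mask,
    then os.cpu_count. All of these see --cpuset-cpus restrictions but
    NOT quota limits like --cpus, which only appear in cgroup cpu.max.
    """
    if hasattr(os, "process_cpu_count"):       # Python 3.13+
        return os.process_cpu_count() or 1
    try:
        return len(os.sched_getaffinity(0))    # Linux only
    except AttributeError:
        return os.cpu_count() or 1
```

For quota-limited containers this still over-reports, which is why reading cpu.max directly (or setting the env vars below) remains necessary.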

The proper fix for the Dockerfile:

# In the Dockerfile (CPU_LIMIT is a build arg, defaulting to 1)
ARG CPU_LIMIT
ENV OMP_NUM_THREADS=${CPU_LIMIT:-1}
ENV OPENBLAS_NUM_THREADS=${CPU_LIMIT:-1}
ENV MKL_NUM_THREADS=${CPU_LIMIT:-1}

But that's static. For Kubernetes, the container should read its own cgroup limits dynamically.

# Script that reads the cgroup CPU limit
cat <<'EOF' > /opt/app/set-cpu-env.sh
#!/bin/bash
# Entrypoint wrapper: size thread pools from the cgroup v2 CPU quota
if [ -f /sys/fs/cgroup/cpu.max ]; then
    quota=$(cut -d' ' -f1 /sys/fs/cgroup/cpu.max)
    period=$(cut -d' ' -f2 /sys/fs/cgroup/cpu.max)
    if [ "$quota" != "max" ]; then
        # Round up, so a 1.5-CPU quota gets 2 threads instead of 1
        cpus=$(( (quota + period - 1) / period ))
        [ "$cpus" -lt 1 ] && cpus=1
        export OMP_NUM_THREADS=$cpus
        export OPENBLAS_NUM_THREADS=$cpus
        export MKL_NUM_THREADS=$cpus
    fi
fi
exec "$@"
EOF

Now let me also check the file I/O part of the benchmark. The original benchmark included disk writes.

docker run -v /tmp/benchmark:/tmp/output myapp:latest dd if=/dev/zero of=/tmp/output/testfile bs=1M count=100 oflag=direct

100MB in 0.34 seconds. That's 294 MB/s — fine. The file I/O isn't the bottleneck. One caveat: this wrote to a bind-mounted volume, which bypasses overlayfs entirely. Writes to the container's own filesystem go through the overlay upper layer, which adds copy-up cost when modifying files from lower layers, but that's still negligible for sequential writes of new files.
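If you'd rather measure from inside the app itself, a rough Python equivalent of the dd check looks like this (a sketch — unlike oflag=direct it goes through the page cache, so the fsync at the end is what keeps the number honest):

```python
import os
import tempfile
import time

def seq_write_mb_s(size_mb=100, block=1 << 20, directory=None):
    """Sequential-write throughput in MB/s, fsync'd so the page
    cache can't hide the real disk (or overlay) cost."""
    buf = b"\0" * block
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(size_mb):
                f.write(buf)
            f.flush()
            os.fsync(f.fileno())
        return size_mb / (time.perf_counter() - start)
    finally:
        os.unlink(path)
```

Run it once against a bind mount and once against a path on the container's own filesystem to see the overlay difference directly.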

Let me also check memory.

docker run myapp:latest python3 -c "
import resource
print(f'Max RSS: {resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024:.1f} MB')
"

Memory is fine — no swapping, no cgroup memory pressure.
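Max RSS alone doesn't show cgroup pressure, though; reading the cgroup v2 memory files directly does. A minimal sketch, assuming cgroup v2 paths (both values come back in bytes):

```python
from pathlib import Path

def cgroup_memory(base="/sys/fs/cgroup"):
    """Return (limit, current) in bytes from cgroup v2 memory files.
    None means the file is absent or the limit is 'max' (unlimited)."""
    def read(name):
        p = Path(base) / name
        if not p.exists():
            return None
        v = p.read_text().strip()
        return None if v == "max" else int(v)
    return read("memory.max"), read("memory.current")
```

If current sits within a few percent of limit, expect reclaim stalls well before the OOM killer ever fires.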

Summary for the team: the 3x slowdown was caused by numpy spawning 4 threads (seeing 4 host CPUs) but the container only having 1 CPU's worth of quota. The fix is to either allocate more CPU to the container or set OMP_NUM_THREADS to match the CPU limit. Container overhead itself is less than 5%.

The team does NOT need to abandon containers. They need to set CPU-aware environment variables in their Dockerfile.

What Made This Senior-Level

Junior would: Accept "container overhead" as the explanation for 3x slowdown.
Senior does: Know that containers add <5% overhead and investigate the real cause.
Why: Containers are cgroups + namespaces, not VMs. 3x overhead is a misconfiguration, not an inherent limitation.

Junior would: Not think about CPU limits vs thread count.
Senior does: Check the cgroup CPU quota and how many threads the application spawns.
Why: Multi-threaded apps over-parallelize when they see host CPUs but are limited by cgroup quotas.

Junior would: Just increase the CPU limit.
Senior does: Also fix the application to be cgroup-aware via environment variables.
Why: Increasing the limit works but wastes resources. The app should adapt to its actual CPU allocation.

Junior would: Not check /proc/cpuinfo vs cgroup limits.
Senior does: Know that /proc/cpuinfo leaks host info into containers and most runtimes don't compensate.
Why: This is the #1 cause of container performance issues with multi-threaded applications.

Key Heuristics Used

  1. Containers ≠ VMs: Container CPU overhead is <5%. If you see 3x+ slowdown, it's a misconfiguration (usually CPU limits vs thread count), not inherent overhead.
  2. The /proc/cpuinfo Lie: Containers see all host CPUs in /proc/cpuinfo but are limited by cgroup quotas. Multi-threaded apps over-parallelize and context-switch.
  3. Set CPU-Aware Environment Variables: For libraries that don't read cgroup limits (numpy/OpenBLAS, older JVMs), set OMP_NUM_THREADS, GOMAXPROCS, etc. to match the container's CPU allocation.

Cross-References

  • Primer — Container internals: namespaces, cgroups, and the difference from VMs
  • Street Ops — Container performance debugging, cgroup inspection, and resource limit tuning
  • Footguns — /proc/cpuinfo leaking host info, multi-threaded apps in CPU-limited containers, and overlay filesystem write amplification