cgroups & Linux Namespaces - Street Ops

What experienced operators know about cgroups and namespaces that the documentation doesn't emphasize enough.


Finding What cgroup a Process Belongs To

Every troubleshooting session starts here:

Remember: cgroups v1 = multiple hierarchies (one file per controller). cgroups v2 = single unified hierarchy. Mnemonic: v1 = many trees, v2 = one tree.

# Direct lookup for any PID
cat /proc/$(pgrep nginx | head -1)/cgroup
# v1: 12:memory:/system.slice/nginx.service (multiple lines, one per controller)
# v2: 0::/system.slice/nginx.service       (single line, unified)

# Docker container
PID=$(docker inspect --format '{{.State.Pid}}' myapp)
cat /proc/$PID/cgroup

# Kubernetes pod (the pid is in crictl inspect output, not in crictl ps)
crictl inspect $(crictl ps --name myapp -q | head -1) | jq '.info.pid'
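The lookup is just field 3 of a colon-separated line. A minimal sketch against a hypothetical v2 line:

```shell
# Hypothetical v2 /proc/PID/cgroup line, to show the parsing
line='0::/system.slice/nginx.service'
# The cgroup-relative path is everything after the second colon
path=$(printf '%s\n' "$line" | cut -d: -f3)
echo "$path"   # /system.slice/nginx.service
```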

Inspecting Container Resource Limits from the Host

Read cgroup files directly — no exec into the container needed:

CID=$(docker inspect --format '{{.Id}}' myapp)   # full container ID

# v2
cat /sys/fs/cgroup/system.slice/docker-${CID}.scope/memory.max
cat /sys/fs/cgroup/system.slice/docker-${CID}.scope/memory.current
cat /sys/fs/cgroup/system.slice/docker-${CID}.scope/cpu.max

# v1
cat /sys/fs/cgroup/memory/docker/${CID}/memory.limit_in_bytes
cat /sys/fs/cgroup/memory/docker/${CID}/memory.usage_in_bytes
cat /sys/fs/cgroup/cpu/docker/${CID}/cpu.cfs_quota_us

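The two path layouts can be folded into one lookup. A hypothetical helper (the paths match the examples above; `mem_limit` is not a standard tool) that tries v2 first and falls back to v1:

```shell
# Hypothetical helper: read a container's memory limit on either hierarchy.
# Takes the full container ID; tries the v2 path, then the v1 path.
mem_limit() {
  cat "/sys/fs/cgroup/system.slice/docker-$1.scope/memory.max" 2>/dev/null \
    || cat "/sys/fs/cgroup/memory/docker/$1/memory.limit_in_bytes" 2>/dev/null \
    || echo "not found"
}
# Usage: mem_limit "$(docker inspect --format '{{.Id}}' myapp)"
```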
Quick dump of all container limits:

for cid in $(docker ps -q); do
  name=$(docker inspect --format '{{.Name}}' $cid | tr -d '/')
  mem=$(docker inspect --format '{{.HostConfig.Memory}}' $cid)
  cpu=$(docker inspect --format '{{.HostConfig.NanoCpus}}' $cid)
  printf "%-30s mem=%-12s cpu=%s\n" "$name" "$mem" "$cpu"
done
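`docker inspect` prints raw bytes (0 means unlimited), which is hard to scan in that dump. A small sketch of a formatter you could pipe the `mem` value through (`human_mem` is a hypothetical helper, not a docker feature):

```shell
# Hypothetical helper: render docker's raw byte values readably.
# 0 means "no limit set" in HostConfig.Memory.
human_mem() {
  awk -v b="$1" 'BEGIN {
    if (b == 0) { print "unlimited"; exit }
    split("B K M G T", u, " ")
    i = 1
    while (b >= 1024 && i < 5) { b /= 1024; i++ }
    printf "%g%s\n", b, u[i]
  }'
}
human_mem 1073741824   # 1G
human_mem 0            # unlimited
```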

Debugging OOMKilled Containers

# 1. Confirm OOM via kernel log
dmesg | grep -i "oom\|killed process" | tail -10

# 2. Check cgroup memory events
PID=$(docker inspect --format '{{.State.Pid}}' myapp)
CGROUP=$(awk -F: '$1 == "0" {print $3}' /proc/$PID/cgroup)   # v2: the single 0:: line
# v1: CGROUP=$(awk -F: '$2 == "memory" {print $3}' /proc/$PID/cgroup)

# v2: event counters
cat /sys/fs/cgroup${CGROUP}/memory.events
# high 1234       ← times memory.high exceeded (throttling)
# max 5           ← times memory.max hit
# oom 3           ← OOM events
# oom_kill 3      ← processes killed

# 3. Memory breakdown — where is it going?
cat /sys/fs/cgroup${CGROUP}/memory.stat
# anon 1073741824     ← heap/stack (usually the culprit)
# file 134217728      ← page cache
# kernel 67108864     ← kernel memory (often forgotten)
# shmem 0             ← shared memory/tmpfs
# sock 4096           ← socket buffers

# v1 equivalent
cat /sys/fs/cgroup/memory${CGROUP}/memory.failcnt    # times limit was hit
cat /sys/fs/cgroup/memory${CGROUP}/memory.oom_control # oom_kill_disable, under_oom, oom_kill count

Investigation pattern: confirm OOM via events, check memory.stat to find where memory went (anon vs file vs kernel vs shmem), then decide if the limit is too low or the app is leaking.

Debug clue: If anon is the dominant consumer in memory.stat, your application heap is growing (likely a leak). If file is large, your app is reading lots of files and the page cache is filling the cgroup. If kernel is unexpectedly high, check for socket buffers or cgroup metadata overhead from many short-lived processes.
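That triage can be scripted. A sketch that picks the dominant consumer out of a memory.stat dump (the sample values below are hypothetical, mirroring the annotated output above):

```shell
# Sketch: find the dominant consumer in a memory.stat dump.
# In production, replace the heredoc with: < /sys/fs/cgroup${CGROUP}/memory.stat
dominant=$(awk '$1 ~ /^(anon|file|kernel|shmem|sock)$/ && $2 > max { max=$2; key=$1 }
                END { print key }' <<'EOF'
anon 1073741824
file 134217728
kernel 67108864
shmem 0
sock 4096
EOF
)
echo "$dominant"   # anon
```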


Setting Up systemd Resource Limits

# Check current limits
systemctl show nginx.service | grep -E '(Memory|CPU|Tasks)(Max|High|Quota|Limit)'

# Set limits via drop-in override (opens an editor; add the lines below and save)
sudo systemctl edit nginx.service

[Service]
MemoryMax=1G
MemoryHigh=768M
CPUQuota=200%
TasksMax=512

# Apply and verify (systemctl edit runs daemon-reload for you)
sudo systemctl restart nginx.service
MAIN_PID=$(systemctl show nginx.service -p MainPID --value)
cat /sys/fs/cgroup/system.slice/nginx.service/memory.max
# 1073741824
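The cgroup file always shows raw bytes, so it helps to know what to expect before you cat it. A sketch of the conversion (`to_bytes` is a hypothetical helper; the arithmetic matches the 1073741824 shown above):

```shell
# Hypothetical helper: convert systemd's human-readable sizes to the raw
# byte values that appear in memory.max / memory.high
to_bytes() {
  case "$1" in
    *G) echo $(( ${1%G} * 1024 * 1024 * 1024 )) ;;
    *M) echo $(( ${1%M} * 1024 * 1024 )) ;;
    *K) echo $(( ${1%K} * 1024 )) ;;
    *)  echo "$1" ;;
  esac
}
to_bytes 1G    # 1073741824 (MemoryMax)
to_bytes 768M  # 805306368  (MemoryHigh)
```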

Investigating v1 vs v2 Migration Issues

# Step 1: What version am I on?
stat -f --format=%T /sys/fs/cgroup/
# "cgroup2fs" = v2, "tmpfs" = v1 (or hybrid; see step 4)

# Step 2: What does the container runtime expect?
docker info 2>/dev/null | grep -i cgroup
# Cgroup Driver: systemd
# Cgroup Version: 2

# Step 3: Check kubelet driver matches
grep cgroupDriver /var/lib/kubelet/config.yaml

# Step 4: Check for hybrid mode (v1+v2 coexisting — causes the most confusion)
mount | grep cgroup
# Both "type cgroup" AND "type cgroup2" = hybrid mode

Common failure: OS upgraded to v2, Docker still configured for v1. Error: "failed to start daemon: Devices cgroup isn't mounted". Fix: switch Docker to systemd driver and ensure daemon.json has "exec-opts": ["native.cgroupdriver=systemd"].
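The daemon.json fix, sketched against a throwaway file so you can sanity-check it before touching the real one (the real path is /etc/docker/daemon.json; merge the key into any existing config rather than overwriting):

```shell
# Fragment for /etc/docker/daemon.json -- written to a temp file here for
# illustration; after editing the real file: sudo systemctl restart docker
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
{
  "exec-opts": ["native.cgroupdriver=systemd"]
}
EOF
cat "$tmp"
```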


Monitoring PSI Metrics for Capacity Planning

PSI (pressure stall information, kernel 4.20+; the per-cgroup pressure files require cgroups v2) is the best signal for resource pressure:

# System-wide
cat /proc/pressure/cpu
# some avg10=2.50 avg60=1.80 avg300=1.20 total=1234567890
cat /proc/pressure/memory
# some avg10=0.30 avg60=0.10 avg300=0.05 total=12345678
cat /proc/pressure/io
# some avg10=15.00 avg60=10.50 avg300=8.20 total=9876543210

# Per-service
cat /sys/fs/cgroup/system.slice/myapp.service/cpu.pressure
Metric              Healthy   Warning   Action
cpu some avg60      < 5%      5-25%     > 25%: add CPU
memory some avg60   < 5%      5-20%     > 20%: add RAM
memory full avg60   0%        > 0%      > 5%: thrashing, urgent
io some avg60       < 10%     10-30%    > 30%: upgrade storage
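Extracting avg60 and scoring it against those thresholds is a one-pipe job. A sketch using a hypothetical cpu.pressure line (the thresholds are the cpu row above):

```shell
# Sketch: pull avg60 out of a PSI "some" line and classify it.
# Sample line is hypothetical; in production read cpu.pressure instead.
psi_line='some avg10=2.50 avg60=1.80 avg300=1.20 total=1234567890'
avg60=$(printf '%s\n' "$psi_line" | grep -o 'avg60=[0-9.]*' | cut -d= -f2)
verdict=$(awk -v v="$avg60" 'BEGIN {
  printf "%s\n", (v < 5 ? "healthy" : v <= 25 ? "warning" : "add CPU")
}')
echo "$avg60 -> $verdict"   # 1.80 -> healthy
```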

Debugging CPU Throttling

CPU throttling is invisible unless you look for it:

# v2
cat /sys/fs/cgroup/system.slice/docker-<id>.scope/cpu.stat
# nr_periods 50000         ← total scheduling periods
# nr_throttled 8500        ← periods where throttling occurred
# throttled_usec 4250000   ← total time throttled (microseconds)

# Throttle rate: nr_throttled / nr_periods = 8500/50000 = 17%
# If > 5-10%, raise CPU limit or optimize the workload

# Check current limit
cat /sys/fs/cgroup/system.slice/docker-<id>.scope/cpu.max
# 50000 100000    ← 50% of one core

Common mistake: setting tight CPU limits on bursty workloads (web servers). They need brief spikes above average. Consider removing CPU limits and using only requests (proportional weight) — Google's recommendation for most workloads.

Default trap: Kubernetes expresses CPU limits in cpu.max's quota/period format (e.g., 50000 100000). A 100m CPU limit becomes a 10ms quota per 100ms period. A web server handling a burst burns through its 10ms, then sits throttled for the remaining 90ms of every period, no matter how idle the host is. This is why tail latency spikes under CPU limits.

One-liner: Throttle rate = nr_throttled / nr_periods. Above 5% means your CPU limit is actively hurting performance.
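The one-liner as an actual one-pipe computation, fed a hypothetical cpu.stat dump with the same numbers as the example above:

```shell
# Sketch: throttle rate from cpu.stat (sample numbers match the text above).
# In production, replace the heredoc with: < /sys/fs/cgroup/.../cpu.stat
rate=$(awk '/^nr_periods/ { p = $2 } /^nr_throttled/ { t = $2 }
            END { printf "%.0f%%\n", 100 * t / p }' <<'EOF'
nr_periods 50000
nr_throttled 8500
throttled_usec 4250000
EOF
)
echo "$rate"   # 17%
```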


Using nsenter to Debug Container Networking

nsenter runs host tools in a container's namespace — no need to install anything in the container:

PID=$(docker inspect --format '{{.State.Pid}}' myapp)

# tcpdump in container's network (using host's tcpdump binary)
sudo nsenter -t $PID -n tcpdump -i eth0 -nn -c 50 port 80

# Check routing table
sudo nsenter -t $PID -n ip route

# Test connectivity from container's perspective
sudo nsenter -t $PID -n curl -s http://other-service:8080/health

# DNS resolution
sudo nsenter -t $PID -n dig kubernetes.default.svc.cluster.local

# Listening ports
sudo nsenter -t $PID -n ss -tlnp

The key: -n enters only the network namespace. You keep the host's mount namespace, so you have all host debugging tools but see the container's network stack.

Remember: nsenter flags map to namespace types: -n = network, -m = mount, -p = PID, -u = UTS (hostname), -i = IPC. Mnemonic: Network, Mount, PID — the three you use 95% of the time.
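Namespace membership is visible in /proc: two processes share a namespace exactly when their /proc/&lt;pid&gt;/ns/&lt;type&gt; symlinks point at the same inode. A sketch comparing the current shell against a child process (both inherit the same netns, so the inodes match):

```shell
# Sketch: compare namespace inodes to confirm two PIDs share a netns.
# Useful after nsenter to verify you landed where you think you did.
ns_child=$(readlink /proc/self/ns/net)   # readlink runs as a child process
ns_shell=$(readlink /proc/$$/ns/net)     # the shell itself
[ "$ns_child" = "$ns_shell" ] && echo "same netns"
```

Against a container, compare `/proc/$PID/ns/net` (from `docker inspect`) with `/proc/1/ns/net` on the host; differing inodes confirm the container has its own network namespace.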