Investigation: Job Queue Backlog, Worker Pod CPU Throttled, Fix Is cgroup Config

Phase 1: Observability Investigation (Dead End)

Check worker processing rates:

# Prometheus query: rate(jobs_processed_total{service="email-worker"}[5m])
# Result: 8.2 jobs/s per worker (41/s total across 5 workers)

Check external dependencies:

# SMTP server response time
$ kubectl exec email-worker-6b5d8c9f-x2k4j -n prod -- \
    curl -s -w "%{time_total}\n" -o /dev/null smtp://smtp.internal:25
0.003

# Database response time
$ kubectl exec email-worker-6b5d8c9f-x2k4j -n prod -- \
    psql -h db.prod -U app -c "SELECT 1" -t 2>/dev/null
 1
# (< 5ms)

SMTP and database are fast. The bottleneck is not in external dependencies. Check the worker's internal timing:

$ kubectl logs email-worker-6b5d8c9f-x2k4j -n prod --tail=10
2026-03-19T10:45:12Z INFO  Job 89247: template_render=95ms smtp_send=3ms db_update=2ms total=100ms
2026-03-19T10:45:12Z INFO  Job 89248: template_render=97ms smtp_send=4ms db_update=2ms total=103ms
2026-03-19T10:45:12Z INFO  Job 89249: template_render=94ms smtp_send=3ms db_update=3ms total=100ms

Template rendering takes ~95ms per job, and it is the CPU-bound part: one thread can process ~10 jobs/second. With 4 threads, a worker should handle ~40 jobs/second, and 5 workers should handle ~200 jobs/second fleet-wide. But the observed rate is only 8 per worker (40 total), a 5x shortfall.

The timing shows 100ms per job, but the throughput is 8/s, so only ~0.8 jobs are in flight at any moment: effectively one job at a time despite 4 threads. Check whether the threads are blocked:
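The capacity math can be checked mechanically. The numbers below are copied from the worker logs and the Prometheus query above; nothing here touches the cluster:

```shell
# Throughput sanity check using the observed per-job latency.
ms_per_job=100   # total= latency from the worker logs
threads=4        # --threads=4 on the worker command line
workers=5
per_thread=$((1000 / ms_per_job))             # ~10 jobs/s per busy thread
expected=$((per_thread * threads * workers))  # fleet-wide capacity
observed=41                                   # from the rate() query
echo "expected=${expected}/s observed=${observed}/s"
```

The 5x gap between expected and observed throughput is what points at a concurrency problem rather than slow dependencies.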

$ kubectl exec email-worker-6b5d8c9f-x2k4j -n prod -- ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
app          1  98.5  1.2 1234568 48192 ?       Ssl  09:00  105:00 /usr/bin/email-worker --threads=4

ps reports 98.5% CPU: the process is CPU-bound and hungry for cycles. Yet kubectl top shows only 100m of usage. Something is capping the CPU.

The Pivot

Check CPU throttling:

$ kubectl exec email-worker-6b5d8c9f-x2k4j -n prod -- cat /sys/fs/cgroup/cpu/cpu.stat
nr_periods 84291
nr_throttled 79847
throttled_time 4182947291847

79,847 of 84,291 CFS periods were throttled, a 94.7% throttle rate. The container is being heavily CPU-throttled even though kubectl top reports only 100m. The CPU metrics dashboard shows 20% utilization (100m of the 500m limit); the reality is 94.7% throttling.
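The throttle rate comes straight from the two counters in cpu.stat:

```shell
# Recompute the throttle rate from the cpu.stat counters above.
nr_periods=84291
nr_throttled=79847
awk -v t="$nr_throttled" -v p="$nr_periods" \
    'BEGIN { printf "throttle_rate=%.1f%%\n", 100 * t / p }'
```

Note that `throttled_time` is in nanoseconds, so the ~4.18e12 figure above is roughly 4,183 seconds of accumulated stall time.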

Phase 2: Kubernetes Investigation (Root Cause)

Check the container's CPU limits:

$ kubectl get pod email-worker-6b5d8c9f-x2k4j -n prod \
    -o jsonpath='{.spec.containers[0].resources}'
{"limits":{"cpu":"500m","memory":"256Mi"},"requests":{"cpu":"100m","memory":"128Mi"}}

CPU limit is 500m (half a core), but the worker runs 4 threads doing CPU-intensive template rendering, so each thread averages ~125m. A single render needs ~100ms of CPU time, which means a thread rendering back-to-back wants a full core (1000m): eight times its share of the budget.
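The per-thread budget implied by the pod spec works out as follows (values copied from the `resources` output above):

```shell
# Per-thread CPU budget implied by the pod spec.
limit_mcores=500   # limits.cpu from the pod spec
threads=4
per_thread=$((limit_mcores / threads))  # millicores per thread, on average
demand_mcores=1000                      # 100ms of CPU per 100ms job = 1 core
echo "per-thread budget=${per_thread}m demand=${demand_mcores}m"
```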

But wait — the limit is 500m, and the worker was fine 2 weeks ago with the same limit and same workload. What changed?

$ kubectl get node worker-node-04 -o jsonpath='{.status.capacity.cpu}'
4

$ kubectl exec email-worker-6b5d8c9f-x2k4j -n prod -- cat /sys/fs/cgroup/cpu/cpu.cfs_period_us
100000

$ kubectl exec email-worker-6b5d8c9f-x2k4j -n prod -- cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us
50000

50,000us of quota per 100,000us period = 500m, so the limit is applied exactly as configured. But check the scheduler's bandwidth slice:

$ kubectl exec email-worker-6b5d8c9f-x2k4j -n prod -- \
    cat /proc/sys/kernel/sched_cfs_bandwidth_slice_ns
5000000

The CFS bandwidth slice is 5ms (5,000,000ns). With a 500m limit the container gets 50ms of CPU time per 100ms period, but that global quota is handed out to per-CPU runqueues in 5ms chunks. A thread in a tight CPU loop burns through its local slice quickly and must return to the global pool for more; once the pool is drained, every thread is throttled until the next period.
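Under a deliberately simplified model (all four threads runnable for the entire period, quota split evenly, no slice return), one period looks like this. Real slice accounting is per-CPU and messier, so treat this as a back-of-the-envelope sketch:

```shell
# Simplified model of one CFS period for this container.
period_us=100000   # cpu.cfs_period_us
quota_us=50000     # cpu.cfs_quota_us
threads=4
run_us=$((quota_us / threads))     # wall time each thread actually runs
stall_us=$((period_us - run_us))   # wall time each thread sits throttled
echo "run=${run_us}us stalled=${stall_us}us per 100ms period"
```

Each thread runs for ~12.5ms and then stalls for the rest of the period, which is consistent with 100ms jobs finishing at a fraction of the expected rate.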

The node was recently upgraded from kernel 5.15 to 6.1. The newer kernel changed the default CFS bandwidth slice behavior, and a sysctl override was not applied during the upgrade:

$ ssh worker-node-04 "uname -r"
6.1.0-18-amd64

$ ssh worker-node-04 "sysctl kernel.sched_cfs_bandwidth_slice_us"
kernel.sched_cfs_bandwidth_slice_us = 5000
# Default on 6.1 is 5000 (5ms). On 5.15 it was effectively 8000 (8ms)

The kernel upgrade tightened the CFS bandwidth accounting, causing more aggressive throttling at the same CPU limit. Combined with the multi-threaded worker pattern, each thread now gets even less of the 500m budget per scheduling period.

Domain Bridge: Why This Crossed Domains

Key insight: the symptom was a job queue backlog visible in observability dashboards (observability); the root cause was CPU throttling by CFS bandwidth limits in the container's cgroup (kubernetes_ops); and it was exacerbated by a kernel upgrade that changed CFS scheduling behavior (linux_ops). This crossing is common because Kubernetes CPU limits are implemented with CFS bandwidth control, a Linux kernel feature: kernel upgrades can change scheduling behavior in ways that affect containerized workloads, and CPU throttling is poorly reflected in standard Kubernetes metrics, since kubectl top shows average usage, not throttle rate.
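The observability gap is closable: throttling is exposed by the standard cAdvisor metrics, if the cluster scrapes them. The metric names below are the stock cAdvisor ones; the pod selector is specific to this incident:

```shell
# PromQL for the throttle ratio, as comments (not executed here):
#
#   rate(container_cpu_cfs_throttled_periods_total{pod=~"email-worker-.*"}[5m])
#     / rate(container_cpu_cfs_periods_total{pod=~"email-worker-.*"}[5m])
#
# For this incident the ratio would have read ~0.95 while the CPU usage
# dashboard read 20%. Alerting on a sustained ratio above ~0.25 would have
# surfaced the real bottleneck on day one.
```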

Root Cause

A kernel upgrade from 5.15 to 6.1 on the worker node changed CFS bandwidth slice behavior, causing more aggressive CPU throttling for containers with CPU limits. The email-worker pods have a 500m CPU limit with 4 threads doing CPU-intensive work. The tighter scheduling caused 94.7% of CFS periods to be throttled, effectively limiting the workers to ~100m of usable CPU. Standard Kubernetes metrics showed 20% utilization (100m/500m), masking the throttling.
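The title promises a cgroup-config fix; a sketch of the two knobs involved follows. These commands are illustrative, not a tested runbook, and the sysctl value is the pre-upgrade behavior noted above:

```shell
# Option 1: restore the pre-upgrade bandwidth slice on the node.
# (Persist it via a drop-in under /etc/sysctl.d/ so it survives reboots.)
ssh worker-node-04 "sudo sysctl -w kernel.sched_cfs_bandwidth_slice_us=8000"

# Option 2: stop throttling the worker entirely by removing its CPU limit
# while keeping the request (the pod becomes Burstable; CFS bandwidth
# control no longer applies to it).
kubectl -n prod patch deployment email-worker --type=json -p '[
  {"op": "remove",
   "path": "/spec/template/spec/containers/0/resources/limits/cpu"}
]'
```

Either change removes the 94.7% throttle rate; which one is right depends on whether other workloads on the node rely on the limit for isolation. Raising the limit toward the thread count (up to the node's 4-core capacity) is a middle ground.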