Symptoms: Job Queue Backlog, Worker Pod CPU Throttled, Fix Is cgroup Config

Domains: observability | kubernetes_ops | linux_ops
Level: L2
Estimated time: 30-45 min

Initial Alert

Application monitoring fires at 10:45 UTC:

CRITICAL: job_queue_depth > 10000
  queue: email-notifications
  depth: 14,287 messages
  processing_rate: 42 msg/s (expected: 500 msg/s)
  consumer_count: 5 workers

Follow-up alerts:

WARNING: email-worker — processing latency p99 = 2.4s (baseline: 50ms)
CRITICAL: email delivery SLA breach — 89% of emails delayed > 15 minutes
WARNING: RabbitMQ — email-notifications queue growing at 200 msg/s
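Taken together, the alert numbers are internally consistent and quantify the gap. A back-of-envelope sketch using only figures from the alerts (inflow is derived as the reported growth rate plus the observed processing rate):

```shell
# Numbers from the alerts above.
depth=14287        # current backlog (messages)
processing=42      # msg/s observed across all workers
expected=500       # msg/s expected across all workers
inflow=242         # msg/s implied: 200 msg/s net growth + 42 msg/s processed

growth=$((inflow - processing))
echo "net growth: ${growth} msg/s"
# → net growth: 200 msg/s

# If the workers recovered to their expected rate, the backlog would drain at
# (expected - inflow) msg/s.
drain=$(( depth / (expected - inflow) ))
echo "at expected rate, backlog drains in ~${drain}s"
# → at expected rate, backlog drains in ~55s
```

In other words, the consumers are running at under a tenth of their expected throughput; the producers are behaving normally.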

Observable Symptoms

  • The email notification queue has 14,287 messages and is growing.
  • 5 email-worker pods are running but each is processing only ~8 messages/second (expected: 100 msg/s per worker).
  • Worker pod CPU usage shows 100m consistently (the CPU request is 100m, limit is 500m).
  • The workers are not crashing, not OOMKilled, and not reporting errors.
  • Each email job does: template rendering (CPU-bound), SMTP send, and database status update.
  • The queue depth was normal (< 100) until 09:00 UTC when it started growing steadily.
  • No code changes to the email-worker in the past 2 weeks.
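The "alive but slow, CPU pinned at exactly the request" combination is the classic signature of CFS throttling, which plain utilization panels hide. The throttle counters in the container's cgroup tell the real story. A minimal sketch, assuming cgroup v2; the counter values below are illustrative samples, not data from the incident:

```shell
# On a live pod the counters come from the container's cgroup, e.g.:
#   kubectl exec <email-worker-pod> -- cat /sys/fs/cgroup/cpu.stat
# Sample contents for illustration:
cpu_stat='usage_usec 1200000000
user_usec 900000000
system_usec 300000000
nr_periods 50000
nr_throttled 46000
throttled_usec 4100000000'

# nr_throttled / nr_periods = fraction of CFS scheduling periods in which
# the cgroup exhausted its CPU quota and its tasks were forcibly paused.
nr_periods=$(echo "$cpu_stat" | awk '/^nr_periods/ {print $2}')
nr_throttled=$(echo "$cpu_stat" | awk '/^nr_throttled/ {print $2}')
pct=$((100 * nr_throttled / nr_periods))
echo "throttled in ${pct}% of CFS periods"
# → throttled in 92% of CFS periods
```

A high throttle ratio alongside low apparent CPU utilization is exactly the combination that standard dashboards obscure.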

The Misleading Signal

A growing job queue with healthy but slow workers looks like an observability or application-performance problem: perhaps the SMTP server is slow, the database is under contention, or the worker code has a deadlock. The fact that CPU sits at exactly 100m (the request value, not the limit) while the workers are "alive but slow" pushes engineers toward suspecting an external dependency bottleneck rather than a CPU issue. Meanwhile, the observability dashboards report CPU utilization at 20% of the limit (100m / 500m), which makes it look as if the workers have plenty of headroom.
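One direct way to cut through the misleading dashboard is to compare the limit Kubernetes declares with the quota the kernel actually enforces. A sketch of the conversion, assuming cgroup v2 and the default 100ms CFS period (the pod name in the comments is hypothetical):

```shell
# Convert a Kubernetes CPU quantity (in millicores) into the cgroup v2
# cpu.max quota it should produce, so the enforced quota can be checked
# against the declared limit. 500m is the limit from the symptoms above.
millicores=500
period_usec=100000   # default CFS period: 100ms
quota_usec=$((millicores * period_usec / 1000))
echo "expected cpu.max: ${quota_usec} ${period_usec}"
# → expected cpu.max: 50000 100000

# Compare with the live value:
#   kubectl exec <email-worker-pod> -- cat /sys/fs/cgroup/cpu.max
# If the enforced quota reads "10000 100000" (i.e. 100m, the request)
# instead of "50000 100000", the kernel is capping each worker at the
# request, which would explain CPU pinned at exactly 100m.
```

This check distinguishes "the workers won't use more CPU" from "the workers can't get more CPU", which is the crux of this scenario.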