Grading Checklist

  • Explains what a soft lockup is: a CPU runs in kernel mode without scheduling other tasks for longer than the soft-lockup threshold (2 × kernel.watchdog_thresh, 20 seconds by default).
  • Reads the kernel stack trace from dmesg to identify the stuck function.
  • Identifies the root cause area from the stack trace (e.g., memory compaction, I/O wait, filesystem journaling).
  • Checks whether transparent hugepages (THP) are enabled and recognizes THP as a common cause of soft lockups on database servers.
  • Recommends disabling THP: echo never > /sys/kernel/mm/transparent_hugepage/enabled (run as root; the setting does not persist across reboots).
  • Checks vm.dirty_ratio and vm.dirty_background_ratio for excessive dirty page accumulation.
  • Investigates storage subsystem health (dmesg for I/O errors, smartctl, iostat).
  • Considers kernel upgrade as the fix if a known bug is identified.
  • Suggests tuning kernel.watchdog_thresh only as a diagnostic step, not a fix.
  • Mentions checking /proc/interrupts for interrupt imbalances across CPUs.
  • Notes that PostgreSQL's checkpoint_completion_target and WAL settings can exacerbate dirty page flushes.
  • Recommends monitoring with perf record during lockup windows to capture the hot path.
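
The read-only checks in the list above (THP mode, dirty-page ratios, watchdog threshold) could be gathered in one pass with a small script. This is a minimal sketch, assuming a Linux host with the standard /sys and /proc paths; the function names parse_thp_mode, read_text, and collect_settings are hypothetical helpers introduced here for illustration, not part of any existing tool.

```python
from pathlib import Path

# Standard Linux locations for the settings named in the checklist.
THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")
SYSCTL_FILES = {
    "vm.dirty_ratio": Path("/proc/sys/vm/dirty_ratio"),
    "vm.dirty_background_ratio": Path("/proc/sys/vm/dirty_background_ratio"),
    "kernel.watchdog_thresh": Path("/proc/sys/kernel/watchdog_thresh"),
}


def parse_thp_mode(text):
    """Return the active mode from a line such as 'always [madvise] never'.

    The kernel marks the active choice with square brackets.
    """
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def read_text(path):
    """Return stripped file contents, or None if the file is absent or unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def collect_settings():
    """Gather the THP mode and related sysctl values into one dictionary."""
    thp_raw = read_text(THP_ENABLED)
    settings = {"thp_enabled": parse_thp_mode(thp_raw) if thp_raw else None}
    for name, path in SYSCTL_FILES.items():
        settings[name] = read_text(path)
    return settings


if __name__ == "__main__":
    for name, value in collect_settings().items():
        print(f"{name}: {value if value is not None else 'unavailable'}")
```

The script only reads values and changes nothing, so it is safe to run on a production host before deciding whether to disable THP or adjust the dirty-page ratios. A value reported as unavailable means the file was missing or unreadable, for example on a kernel built without THP support.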