Grading Checklist
- Explains what a soft lockup is: a CPU is stuck in kernel code without yielding for longer than the watchdog threshold.
- Reads the kernel stack trace from dmesg to identify the stuck function.
- Identifies the root cause area from the stack trace (e.g., memory compaction, I/O wait, filesystem journaling).
- Checks whether transparent hugepages (THP) are enabled and recognizes THP as a common cause of soft lockups on database servers.
- Recommends disabling THP: `echo never > /sys/kernel/mm/transparent_hugepage/enabled`.
- Checks vm.dirty_ratio and vm.dirty_background_ratio for excessive dirty page accumulation.
- Investigates storage subsystem health (dmesg for I/O errors, smartctl, iostat).
- Considers kernel upgrade as the fix if a known bug is identified.
- Suggests tuning `kernel.watchdog_thresh` only as a diagnostic step, not a fix.
- Mentions checking `/proc/interrupts` for interrupt imbalances across CPUs.
- Notes that PostgreSQL's `checkpoint_completion_target` and WAL settings can exacerbate dirty page flushes.
- Recommends monitoring with `perf record` during lockup windows to capture the hot path.
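
Several of the read-only checks above can be sketched as small shell helpers. This is a minimal illustration, not a complete diagnostic script. The function names (`thp_mode`, `sysctl_value`, `soft_lockup_lines`) are hypothetical and not part of the checklist. The default paths are the standard Linux sysfs and procfs locations, and the optional path arguments exist only so the helpers can be pointed at sample files.

```shell
#!/bin/sh
# Hypothetical helpers for the read-only checks in the checklist.
# Nothing here changes system state.

# Print the active THP mode: the bracketed word in a line such as
# "always madvise [never]".
thp_mode() {
    sed -n 's/.*\[\([a-z]*\)].*/\1/p' \
        "${1:-/sys/kernel/mm/transparent_hugepage/enabled}"
}

# Print a sysctl value by reading its file under /proc/sys,
# e.g. vm.dirty_ratio -> /proc/sys/vm/dirty_ratio.
# The optional second argument overrides the root directory.
sysctl_value() {
    root="${2:-/proc/sys}"
    cat "$root/$(printf '%s' "$1" | tr '.' '/')"
}

# Filter kernel log text on stdin down to soft lockup reports.
soft_lockup_lines() {
    grep 'soft lockup'
}
```

On an affected host these might be run as `dmesg | soft_lockup_lines`, `thp_mode`, `sysctl_value vm.dirty_ratio`, `sysctl_value vm.dirty_background_ratio`, and `sysctl_value kernel.watchdog_thresh`. Reading `dmesg` may require root, depending on the `kernel.dmesg_restrict` setting.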