Questions to Determine
- What kernel function is shown in the stack trace of the soft lockup?
- Is the lockup related to I/O, memory management, or a specific kernel driver?
- Is the kernel version known to have bugs that cause soft lockups?
- What does the stack trace in dmesg reveal about where the CPU is stuck?
- Are the lockups correlated with specific workloads (writes, fsync, memory allocation)?
- Is the storage subsystem (RAID controller, NVMe, SAN) experiencing issues?
- Are any kernel parameters (vm.dirty_ratio, vm.dirty_background_ratio) misconfigured?
- Are transparent hugepages (THP) enabled, and could memory compaction be causing stalls?
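
Several of the questions above can be answered mechanically from the kernel log and a few procfs/sysfs files. The following is a minimal sketch, not a definitive tool. It assumes the modern x86 log format, in which a soft lockup line looks like `BUG: soft lockup - CPU#N stuck for Ns! [task:pid]` and the stuck function appears on a following `RIP: 0010:function+0x.../0x...` line; older kernels and other architectures format these lines differently. The helper names (`parse_soft_lockups`, `active_bracket_option`, `read_tunable`) are hypothetical and exist only for illustration.

```python
import re
from pathlib import Path

# Assumed message formats (modern x86 kernels); adjust for other kernels.
LOCKUP_RE = re.compile(r"soft lockup - CPU#(\d+) stuck for (\d+)s! \[(.+):(\d+)\]")
RIP_RE = re.compile(r"RIP: [0-9a-f]{4}:([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+")

# Real procfs/sysfs locations of the tunables named in the questions.
TUNABLES = {
    "vm.dirty_ratio": "/proc/sys/vm/dirty_ratio",
    "vm.dirty_background_ratio": "/proc/sys/vm/dirty_background_ratio",
    "thp.enabled": "/sys/kernel/mm/transparent_hugepage/enabled",
    "thp.defrag": "/sys/kernel/mm/transparent_hugepage/defrag",
}


def parse_soft_lockups(log_text):
    """Return one dict per soft lockup report found in kernel log text."""
    events = []
    current = None
    for line in log_text.splitlines():
        match = LOCKUP_RE.search(line)
        if match:
            current = {
                "cpu": int(match.group(1)),
                "seconds": int(match.group(2)),
                "task": match.group(3),
                "pid": int(match.group(4)),
                "function": None,  # filled in from the first RIP line that follows
            }
            events.append(current)
            continue
        if current is not None and current["function"] is None:
            rip = RIP_RE.search(line)
            if rip:
                current["function"] = rip.group(1)
    return events


def active_bracket_option(text):
    """Return the bracketed choice in sysfs text such as 'always [madvise] never'."""
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None


def read_tunable(path):
    """Read a procfs/sysfs value, returning None if it is absent or unreadable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None
```

To use it, pass the saved output of `dmesg` to `parse_soft_lockups` and group the results by `function` and `task` to see whether the lockups cluster in writeback, memory compaction, or a particular driver. Reading each path in `TUNABLES` with `read_tunable`, and passing the THP values through `active_bracket_option`, shows the current writeback thresholds and THP mode.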