If a Kernel Panic Occurs Due to a BMC Restart Analysis
If a Kernel Panic Occurs Due to a BMC Restart Analysis
Yes, your conclusion is very plausible — the kernel crash you observed appears to be directly linked to the BMC (or BMI, likely a typo for BMC) restarting, especially given the presence of USB-related stalls and a subsequent RCU stall. Here’s a breakdown of what’s happening:
🧠 Understanding the Crash
🔹 Key Indicators in Your Logs
-
RCU Preempt Stall
- The kernel’s RCU (Read-Copy-Update) subsystem timed out waiting for grace periods to complete. This happens when some CPUs/tasks are unable to make forward progress.
- In your case, the
rcu_preemptkthread was starved for ~10 minutes (59518 jiffies ~ 9.9 minutes assuming HZ=1000), which is extreme.
-
CPU 1 Idling
- An NMI (non-maskable interrupt) was sent to CPU 1, which was found in
intel_idle_irq, meaning it was deep in a C-state and unresponsive to softirqs.
- An NMI (non-maskable interrupt) was sent to CPU 1, which was found in
-
Kworker Blocked on USB
- The kernel thread
kworker/1:1was blocked inusb_start_wait_urb()→usb_control_msg()→usbnet_cdc_update_filter(). - This strongly suggests a USB management/network interface (CDC-ETHER class) stopped responding.
- The kernel thread
-
Kernel Panic
- Triggered by
RCU Stall, which is often a symptom of a broader system stall, not the root cause.
- Triggered by
🔌 How the BMC Reboot Can Trigger This
Modern servers often expose USB devices to the host via the BMC, for purposes like:
- Virtual KVM
- Virtual media (USB mass storage over IPMI)
- Management NICs exposed as USB CDC Ethernet devices (like
usbnet/cdc_ether)
When the BMC resets, these virtual USB devices disappear temporarily, possibly without proper detachment, which leads to:
- URBs (USB Request Blocks) being sent to a non-responsive USB controller
- usbnet and
kworkerthreads waiting indefinitely for completions that will never arrive - RCU threads not progressing due to blocked workers or softirq starvation
- CPU stuck in idle C-state, not handling softirqs → leads to RCU panic
This is consistent with your backtrace, where:
kworker/1:1is stuck inusbnet_cdc_update_filter(), waiting onusb_control_msg()- USB timeouts do not always cleanly abort, especially if underlying hardware interface is suddenly gone (e.g. BMC USB detach)
✅ Conclusion
Yes, a BMC reboot in your system configuration can very realistically:
- Remove/disable a virtual USB device
- Cause kernel threads using that USB (e.g. usbnet) to block indefinitely
- Trigger an RCU stall due to downstream starvation and softirq failures
- Lead to a kernel panic
🛠️ Recommendations
1. Mitigate USB Dependency on BMC
- Avoid using BMC-exposed virtual USB devices in critical host kernel operations (e.g. networking).
- If possible, use physical interfaces instead of USB CDC-based virtual ones for system management.
2. Harden USB Stack
- Set USB URB timeouts explicitly and conservatively (
usb_control_msg()has a timeout argument). - Apply kernel patches related to
usbnetandcdc_etherrobustness if available (especially for EL8 kernels).
3. Update Kernel & BMC Firmware
- This is a known issue on older kernels (you are on
4.18.0-372.32.1.rt7.189.el8_6), where USB URB handling is brittle. - BMC firmware upgrades can improve hotplug detection and stability.
4. Monitor with NMI Watchdog
- Enable NMI watchdog (
nmi_watchdog=1) if not already active to capture CPU lockups earlier.