Kernel panic - not syncing: hung_task: blocked tasks

https://access.redhat.com/solutions/146183

Solution Verified - Updated August 6 2024 at 7:50 AM - English Environment Red Hat Enterprise Linux 5 Kernel-2.6.18-308.8.2.el5PAE Issue System crashed with following Call Traces. Raw INFO: task kjournald:437 blocked for more than 120 seconds. “echo 0 > /proc/sys/kernel/hung_task_timeout_secs” disables this message. kjournald D 000009B5 2992 437 19 462 416 (L-TLB) cf7fded4 00000046 a53cb5f8 000009b5 c04e19df 00000046 cf699b60 0000000a f7fbbaa0 a53cc8ef 000009b5 000012f7 00000000 f7fbbbac cf392600 f4894040 0146c25c cf392fa0 cf49fe2c cf49fe3c 0146c25c c042d81f cf49fe1c ffffffff Call Trace: [] elv_next_request+0x2c/0x134 [] getnstimeofday+0x30/0xb6 [] io_schedule+0x36/0x59 [] sync_buffer+0x30/0x33 [] __wait_on_bit+0x33/0x58 [] sync_buffer+0x0/0x33 [] sync_buffer+0x0/0x33 [] out_of_line_wait_on_bit+0x62/0x6a [] wake_bit_function+0x0/0x3c [] __wait_on_buffer+0x1c/0x1f [] journal_commit_transaction+0x4d6/0xf60 [jbd] [] lock_timer_base+0x15/0x2f [] try_to_del_timer_sync+0x65/0x6c [] kjournald+0xa1/0x1c2 [jbd] [] autoremove_wake_function+0x0/0x2d [] kjournald+0x0/0x1c2 [jbd] [] kthread+0xc0/0xee [] kthread+0x0/0xee [] kernel_thread_helper+0x7/0x10 ======================= Kernel panic - not syncing: hung_task: blocked tasks Resolution Upgrade to the latest firmware from HP to ensure all corrected firmware issues related to stuck or lost io are covered. Upgrade to the latest version of kernel available. Root Cause A number of I/O requests seem "stuck" or "lost" within the storage adapter for long periods of time and never complete. Diagnostic Steps Collect sosreport of the system.

Collect vmcore at the time of crash.

Vmcore Analysis

Raw KERNEL: /cores/20120619070936/work/vmlinux DUMPFILE: /cores/20120619070936/work/vmcore-654238 CPUS: 4 DATE: Mon Jun 18 08:47:37 2012 UPTIME: 03:02:01 LOAD AVERAGE: 64.76, 25.04, 9.55 TASKS: 409 NODENAME: server1 RELEASE: 2.6.18-308.8.2.el5PAE VERSION: #1 SMP Tue May 29 12:17:25 EDT 2012 MACHINE: i686 (3200 Mhz) MEMORY: 12.4 GB PANIC: “Kernel panic - not syncing: hung_task: blocked tasks” <=== PID: 197 COMMAND: “khungtaskd” TASK: f7f71550 [THREAD_INFO: f7f6f000] CPU: 1 STATE: TASK_RUNNING (PANIC) Raw crash > log … INFO: task kjournald:437 blocked for more than 120 seconds. “echo 0 > /proc/sys/kernel/hung_task_timeout_secs” disables this message. kjournald D 000009B5 2992 437 19 462 416 (L-TLB) cf7fded4 00000046 a53cb5f8 000009b5 c04e19df 00000046 cf699b60 0000000a f7fbbaa0 a53cc8ef 000009b5 000012f7 00000000 f7fbbbac cf392600 f4894040 0146c25c cf392fa0 cf49fe2c cf49fe3c 0146c25c c042d81f cf49fe1c ffffffff Call Trace: [] elv_next_request+0x2c/0x134 [] getnstimeofday+0x30/0xb6 [] io_schedule+0x36/0x59 [] sync_buffer+0x30/0x33 [] __wait_on_bit+0x33/0x58 [] sync_buffer+0x0/0x33 [] sync_buffer+0x0/0x33 [] out_of_line_wait_on_bit+0x62/0x6a [] wake_bit_function+0x0/0x3c [] __wait_on_buffer+0x1c/0x1f [] journal_commit_transaction+0x4d6/0xf60 [jbd] [] lock_timer_base+0x15/0x2f [] try_to_del_timer_sync+0x65/0x6c [] kjournald+0xa1/0x1c2 [jbd] [] autoremove_wake_function+0x0/0x2d [] kjournald+0x0/0x1c2 [jbd] [] kthread+0xc0/0xee [] kthread+0x0/0xee [] kernel_thread_helper+0x7/0x10 ======================= Kernel panic - not syncing: hung_task: blocked tasks <== One "kjournald" process is in the UN-interruptible sleep (D) state. Raw crash> ps|grep kjournald 437 19 0 f7fbbaa0 UN 0.0 0 0 [kjournald] <== 1517 19 1 f7a18aa0 IN 0.0 0 0 [kjournald] 1519 19 1 f7909aa0 IN 0.0 0 0 [kjournald]

crash> bt 437 PID: 437 TASK: f7fbbaa0 CPU: 0 COMMAND: “kjournald” #0 [cf7fde60] schedule at c0622e8a #1 [cf7fded8] io_schedule at c062352b #2 [cf7fdee4] sync_buffer at c0479526 #3 [cf7fdee8] __wait_on_bit at c0623704 #4 [cf7fdf00] out_of_line_wait_on_bit at c0623789 #5 [cf7fdf38] __wait_on_buffer at c04794a3 #6 [cf7fdf44] journal_commit_transaction at f88534b5 [jbd] #7 [cf7fdfa8] kjournald at f8856d27 [jbd] #8 [cf7fdfcc] kthread at c04372e0 #9 [cf7fdfe4] kernel_thread_helper at c0405c85 Following are the tasks which are waiting for a long time. Raw [ 0 00:02:54.358] UN+ PID: 27575 TASK: ddac8000 CPU: 3 COMMAND: “mysqld” [ 0 00:02:56.949] UN+ PID: 5664 TASK: f531caa0 CPU: 3 COMMAND: “mysqld” [ 0 00:03:05.371] UN+ PID: 6755 TASK: f7f61aa0 CPU: 3 COMMAND: “mysqld” [ 0 00:03:10.666] UN+ PID: 28175 TASK: dc354550 CPU: 2 COMMAND: “mysqld” [ 0 00:03:25.446] UN+ PID: 5843 TASK: f4ff8aa0 CPU: 1 COMMAND: “mysqld” [ 0 00:03:27.752] UN+ PID: 4186 TASK: f5ce9550 CPU: 0 COMMAND: “cmaidad” [ 0 00:03:30.949] UN+ PID: 4080 TASK: f5ceb000 CPU: 1 COMMAND: “cmaeventd” [ 0 00:03:33.514] UN+ PID: 437 TASK: f7fbbaa0 CPU: 0 COMMAND: “kjournald” [ 0 00:03:34.206] UN+ PID: 5639 TASK: f6ef7550 CPU: 1 COMMAND: “mysqld” [ 0 00:03:35.007] UN+ PID: 7930 TASK: d7b74000 CPU: 1 COMMAND: “sendmail” [ 0 00:03:35.149] UN+ PID: 2026 TASK: f745c550 CPU: 2 COMMAND: “syslogd” [ 0 00:03:35.154] UN+ PID: 2004 TASK: f7a8d000 CPU: 0 COMMAND: “auditd” Raw crash> bt 2004 PID: 2004 TASK: f7a8d000 CPU: 0 COMMAND: “auditd” #0 [f7b3fe5c] schedule at c0622e8a #1 [f7b3fed4] log_wait_commit at f88562e0 [jbd] #2 [f7b3fefc] journal_stop at f8851674 [jbd] #3 [f7b3ff18] __writeback_single_inode at c0495f83 #4 [f7b3ff54] sync_inode at c049661f #5 [f7b3ff60] ext3_sync_file at f8889014 [ext3] #6 [f7b3ff94] do_fsync at c0479062 #7 [f7b3ffac] __do_fsync at c04790bf #8 [f7b3ffb8] system_call at c0404f44 There are 77 processes waiting on I/O. Raw crash> foreach bt | grep io_schedule | wc -l 77 Raw crash> ctlr_info_t 0xcf4f0000 struct ctlr_info_t { ctlr = 0, devname = “cciss0\000”, product_name = 0xf88ce60c “Smart Array 6i”, firm_ver = “2.32”, pdev = 0xf7f46000, board_id = 1083248145, vaddr = 0xf8806000, paddr = 4261347328, nr_cmds = 12, cfgtable = 0xf881a000, interrupts_enabled = 1, major = 104, max_commands = 1024, commands_outstanding = 14, <— max_outstanding = 14, num_luns = 2, highest_lun = 1, usage_count = 7, scatter_list = 0xcf529000, maxsgentries = 543, chainsize = 513, max_cmd_sgentries = 32, cmd_sg_list = 0xf7fad9c0, intr = {201, 0, 0, 0}, msix_vector = 0, msi_vector = 0, intr_mode = 0, cciss_read = 40 ‘(‘, cciss_write = 42 ‘*’, cciss_max_sectors = 8192,

reqQ = { next = 0xcf4f0898, prev = 0xcf4f0898 }, cmpQ = { next = 0xcf6da248, prev = 0xd7bce248 }, Load 32 bit ext-cciss module in vmcore and verify. Raw crash> extend ext-cciss-56-i386.so ./ext-cciss-56-i386.so: shared object loaded

crash> cciss -v V0.01 cmd starting… switch = ‘v’ DBG: jiffies = 0000000000A211F2 DBG: hba[00] 0x00000000CF4F0000 DBG: cciss0.ctlr = ‘0’ DBG: cciss0.devname = ‘cciss0’ DBG: cciss0.firm_ver = ‘2.32’ DBG: cciss0.product_name = ‘00000000F88CE60C’ (Smart Array 6i) DBG: cciss0.board_id = ‘40910E11’ DBG: cciss0.pci_dev = ‘00000000F7F46000’ DBG: cciss0.major = ‘104’ DBG: cciss0.max_commands = ‘1024’ DBG: cciss0.nr_cmds = ‘12’ DBG: cciss0.commands_outstanding = ‘14’ DBG: cciss0.max_outstanding = ‘14’ DBG: cciss0.num_luns = ‘2’ DBG: cciss0.highest_lun = ‘1’ DBG: cciss0.cciss_read = ‘0’ DBG: cciss0.cciss_write = ‘0’ DBG: cciss0.drv{list} = ‘00000000BFD97270’ DBG: cciss0.reqQ = ‘00000000F88C9316’ DBG: cciss0.cmpQ = ‘00000000CF4F0898’ [00000000CF4F0650] DBG: cciss0.maxQsinceinit = ‘00000000CF6DA248’ DBG: cciss0.maxSG = ‘-675487160’ DBG: cciss0.cmd_pool = ‘000000000000000C’ DBG: cciss0.errinfo_pool = ‘00000000CF6DA000’ DBG: cciss0.cmd_pool_bits = ‘00000000CF681000’ DBG: cciss0.nr_allocs = ‘258478080’ DBG: cciss0.nr_frees = ‘0’ (258478080) ccissQ request CDB started —————- —————- ————————————- ———— [ 0] 00000000CF4F0650 00000000CF6DA248 00h ??? 02:57:01.424 <=== Waiting for almost 3 hours [ 1] 00000000CF4F0650 00000000CF6DA248 00h ??? 02:57:01.424 [ 2] 00000000CF4F0650 00000000CF6DA248 00h ??? 02:57:01.424 [ 3] 00000000CF4F0650 00000000CF6DA248 00h ??? 02:57:01.424 [ 4] 00000000CF4F0650 00000000CF6DA248 00h ??? 02:57:01.424 [ 5] 00000000CF4F0650 00000000CF6DA248 00h ??? 02:57:01.424 [ 6] 00000000CF4F0650 00000000CF6DA248 00h ??? 02:57:01.424 [ 7] 00000000CF4F0650 00000000CF6DA248 00h ??? 02:57:01.424 [ 8] 00000000CF4F0650 00000000CF6DA248 00h ??? 02:57:01.424 [ 9] 00000000CF4F0650 00000000CF6DA248 00h ??? 02:57:01.424 [ 10] 00000000CF4F0650 00000000CF6DA248 00h ??? 02:57:01.424 [ 11] 00000000CF4F0650 00000000CF6DA248 00h ??? 02:57:01.424 [ 12] 00000000CF4F0650 00000000CF6DA248 00h ??? 02:57:01.424 [ 13] 00000000CF4F0650 00000000CF6DA248 00h ??? 02:57:01.424

  UPTIME: 03:02:01 The above report indicates that the requests are being waiting for some time. Raw crash> ctlr_info_t 0xcf4f0000|grep -A 2 cmpQ   cmpQ = {
next = 0xcf6da248, <===
prev = 0xd7bce248  Raw crash> request 0xcf6da248|grep start   start_time = 2,  Raw crash> pd jiffies0xcf6da248 jiffies = $14 = 10621426 crash> pd 10621426-2 $15 = 10621424 Product(s) Red Hat Enterprise LinuxComponent kernelCategory TroubleshootTags panic rhel_5 This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.

Updated: