Details
-
Bug
-
Resolution: Fixed
-
Minor
-
Lustre 2.5.3
-
MDS installed with Bull 2.5.3 version
-
3
Description
The MDS crashed in kiblnd_pool_alloc_node, with the message "unable to handle kernel NULL pointer dereference at 0000000000000010".
This looks exactly the same as LU-5678, but since patch http://review.whamcloud.com/12852 was already applied, I am opening this new ticket.
crash> sys
      SYSTEM MAP: /dumps/lib/kernel-debuginfo/2.6.32-504.8.1.el6.Bull.70.x86_64/boot/System.map-2.6.32-504.8.1.el6.Bull.70.x86_64
    DEBUG KERNEL: /dumps/lib/kernel-debuginfo/2.6.32-504.8.1.el6.Bull.70.x86_64/modules/vmlinux (2.6.32-504.8.1.el6.Bull.70.x86_64)
        DUMPFILE: vmcore  [PARTIAL DUMP]
            CPUS: 48 [OFFLINE: 24]
            DATE: Wed Apr 1 16:58:18 2015
          UPTIME: 00:54:42
    LOAD AVERAGE: 0.64, 5.04, 7.86
           TASKS: 682
        NODENAME: taurusmds6
         RELEASE: 2.6.32-504.8.1.el6.Bull.70.x86_64
         VERSION: #1 SMP Tue Feb 10 14:51:21 CET 2015
         MACHINE: x86_64 (2399 Mhz)
          MEMORY: 128 GB
           PANIC: "BUG: unable to handle kernel NULL pointer dereference at 0000000000000010"

crash> bt
PID: 9622  TASK: ffff881066c50080  CPU: 1  COMMAND: "kiblnd_sd_00_02"
 #0 [ffff880f23ee3630] machine_kexec at ffffffff8103b71b
 #1 [ffff880f23ee3690] crash_kexec at ffffffff810c9852
 #2 [ffff880f23ee3760] oops_end at ffffffff8152ec30
 #3 [ffff880f23ee3790] no_context at ffffffff8104c80b
 #4 [ffff880f23ee37e0] __bad_area_nosemaphore at ffffffff8104ca95
 #5 [ffff880f23ee3830] bad_area_nosemaphore at ffffffff8104cb63
 #6 [ffff880f23ee3840] __do_page_fault at ffffffff8104d2bf
 #7 [ffff880f23ee3960] do_page_fault at ffffffff81530b7e
 #8 [ffff880f23ee3990] page_fault at ffffffff8152df35
    [exception RIP: kiblnd_pool_alloc_node+73]
    RIP: ffffffffa0b77439  RSP: ffff880f23ee3a40  RFLAGS: 00010207
    RAX: 0000000000000000  RBX: ffff880fec59ce40  RCX: 000000000000003f
    RDX: 0000000000000010  RSI: 0000000000000002  RDI: ffff880fec59ce40
    RBP: ffff880f23ee3a80   R8: 72f8000000000000   R9: 97c0000000000000
    R10: 0000000000000000  R11: 0000000000000000  R12: ffff880fec59ce70
    R13: ffff880f23ee3a48  R14: ffff880fec59ce50  R15: 0000000000000012
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #9 [ffff880f23ee3a88] kiblnd_get_idle_tx at ffffffffa0b81fa9 [ko2iblnd]
#10 [ffff880f23ee3aa8] kiblnd_check_sends at ffffffffa0b857b5 [ko2iblnd]
#11 [ffff880f23ee3b08] kiblnd_post_rx at ffffffffa0b87dd8 [ko2iblnd]
#12 [ffff880f23ee3b58] kiblnd_recv at ffffffffa0b882c6 [ko2iblnd]
#13 [ffff880f23ee3be8] lnet_ni_recv at ffffffffa05f9ecb [lnet]
#14 [ffff880f23ee3c38] lnet_drop_message at ffffffffa05facf1 [lnet]
#15 [ffff880f23ee3c78] lnet_parse at ffffffffa05ff672 [lnet]
#16 [ffff880f23ee3d58] kiblnd_handle_rx at ffffffffa0b889db [ko2iblnd]
#17 [ffff880f23ee3da8] kiblnd_rx_complete at ffffffffa0b896c3 [ko2iblnd]
#18 [ffff880f23ee3df8] kiblnd_complete at ffffffffa0b89872 [ko2iblnd]
#19 [ffff880f23ee3e08] kiblnd_scheduler at ffffffffa0b89c2a [ko2iblnd]
#20 [ffff880f23ee3ee8] kthread at ffffffff8109e66e
#21 [ffff880f23ee3f48] kernel_thread at ffffffff8100c20a

crash> struct kib_poolset_t ffff880fec59ce40
struct kib_poolset_t {
  ps_lock = {
    raw_lock = {
      slock = 131072
    }
  },
  ps_net = 0x0,
  ps_name = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000",
  ps_pool_list = {
    next = 0x0,
    prev = 0x0
  },
  ps_failed_pool_list = {
    next = 0x0,
    prev = 0x0
  },
  ps_next_retry = 0,
  ps_increasing = 0,
  ps_pool_size = 0,
  ps_cpt = 0,
  ps_pool_create = 0x0,
  ps_pool_destroy = 0x0,
  ps_node_init = 0x0,
  ps_node_fini = 0x0
}
I will upload the dump shortly for analysis.
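For readers following the dump, here is a minimal user-space sketch of how a zeroed ps_pool_list could lead to a fault address of 0x10. It is an illustration under assumptions, not the ko2iblnd source: pool_sketch and poolset_sketch are hypothetical stand-ins, and the layout (a list link as the first member, a free list as the second) is assumed rather than taken from this ticket. The helper first_read_address is likewise invented for the example. It computes the address that a container_of-style list walk would read first when the list head's next pointer is NULL.

```c
#include <stddef.h>
#include <stdint.h>

/* Same shape as the kernel's doubly linked list head. */
struct list_head {
    struct list_head *next;
    struct list_head *prev;
};

/* Hypothetical stand-in for a pool: the field order is an ASSUMPTION. */
struct pool_sketch {
    struct list_head po_list;      /* link on the poolset's pool list */
    struct list_head po_free_list; /* pre-allocated free nodes */
};

/* Hypothetical stand-in for the poolset; only the list head matters here. */
struct poolset_sketch {
    struct list_head ps_pool_list;
};

/*
 * Address of the first memory read that a walk over ps_pool_list would
 * make when it checks whether the first pool's free list is empty:
 *   pool = container_of(ps->ps_pool_list.next, struct pool_sketch, po_list);
 *   read pool->po_free_list.next
 * Nothing is dereferenced here; only the address is computed.
 */
static uintptr_t first_read_address(const struct poolset_sketch *ps)
{
    uintptr_t pool = (uintptr_t)ps->ps_pool_list.next
                     - offsetof(struct pool_sketch, po_list);

    return pool
           + offsetof(struct pool_sketch, po_free_list)
           + offsetof(struct list_head, next);
}
```

With a poolset whose ps_pool_list.next is NULL, as in the struct dump above, first_read_address returns 0x10 on a platform with 8-byte pointers, which is consistent with the reported fault address and with RDX in the register dump. A correctly initialised empty list head points at itself, so a walk would end immediately instead of reading through NULL; this suggests the poolset was either never initialised or already torn down when the scheduler thread used it, although the dump alone does not say which.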
Issue Links
- is related to LU-5678 kernel crash due to NULL pointer dereference in kiblnd_pool_alloc_node() (Resolved)