Details
- Type: Bug
- Resolution: Fixed
- Priority: Minor
- Affects Version/s: Lustre 2.5.3
- Environment: MDS installed with Bull 2.5.3 version
- Severity: 3
Description
The MDS crashed in kiblnd_pool_alloc_node, with the message "unable to handle kernel NULL pointer dereference at 0000000000000010".
This looks exactly the same as LU-5678, but since patch http://review.whamcloud.com/12852 was already applied, I am opening this new ticket.
crash> sys
SYSTEM MAP: /dumps/lib/kernel-debuginfo/2.6.32-504.8.1.el6.Bull.70.x86_64/boot/System.map-2.6.32-504.8.1.el6.Bull.70.x86_64
DEBUG KERNEL: /dumps/lib/kernel-debuginfo/2.6.32-504.8.1.el6.Bull.70.x86_64/modules/vmlinux (2.6.32-504.8.1.el6.Bull.70.x86_64)
DUMPFILE: vmcore [PARTIAL DUMP]
CPUS: 48 [OFFLINE: 24]
DATE: Wed Apr 1 16:58:18 2015
UPTIME: 00:54:42
LOAD AVERAGE: 0.64, 5.04, 7.86
TASKS: 682
NODENAME: taurusmds6
RELEASE: 2.6.32-504.8.1.el6.Bull.70.x86_64
VERSION: #1 SMP Tue Feb 10 14:51:21 CET 2015
MACHINE: x86_64 (2399 Mhz)
MEMORY: 128 GB
PANIC: "BUG: unable to handle kernel NULL pointer dereference at 0000000000000010"
crash> bt
PID: 9622 TASK: ffff881066c50080 CPU: 1 COMMAND: "kiblnd_sd_00_02"
#0 [ffff880f23ee3630] machine_kexec at ffffffff8103b71b
#1 [ffff880f23ee3690] crash_kexec at ffffffff810c9852
#2 [ffff880f23ee3760] oops_end at ffffffff8152ec30
#3 [ffff880f23ee3790] no_context at ffffffff8104c80b
#4 [ffff880f23ee37e0] __bad_area_nosemaphore at ffffffff8104ca95
#5 [ffff880f23ee3830] bad_area_nosemaphore at ffffffff8104cb63
#6 [ffff880f23ee3840] __do_page_fault at ffffffff8104d2bf
#7 [ffff880f23ee3960] do_page_fault at ffffffff81530b7e
#8 [ffff880f23ee3990] page_fault at ffffffff8152df35
[exception RIP: kiblnd_pool_alloc_node+73]
RIP: ffffffffa0b77439 RSP: ffff880f23ee3a40 RFLAGS: 00010207
RAX: 0000000000000000 RBX: ffff880fec59ce40 RCX: 000000000000003f
RDX: 0000000000000010 RSI: 0000000000000002 RDI: ffff880fec59ce40
RBP: ffff880f23ee3a80 R8: 72f8000000000000 R9: 97c0000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff880fec59ce70
R13: ffff880f23ee3a48 R14: ffff880fec59ce50 R15: 0000000000000012
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#9 [ffff880f23ee3a88] kiblnd_get_idle_tx at ffffffffa0b81fa9 [ko2iblnd]
#10 [ffff880f23ee3aa8] kiblnd_check_sends at ffffffffa0b857b5 [ko2iblnd]
#11 [ffff880f23ee3b08] kiblnd_post_rx at ffffffffa0b87dd8 [ko2iblnd]
#12 [ffff880f23ee3b58] kiblnd_recv at ffffffffa0b882c6 [ko2iblnd]
#13 [ffff880f23ee3be8] lnet_ni_recv at ffffffffa05f9ecb [lnet]
#14 [ffff880f23ee3c38] lnet_drop_message at ffffffffa05facf1 [lnet]
#15 [ffff880f23ee3c78] lnet_parse at ffffffffa05ff672 [lnet]
#16 [ffff880f23ee3d58] kiblnd_handle_rx at ffffffffa0b889db [ko2iblnd]
#17 [ffff880f23ee3da8] kiblnd_rx_complete at ffffffffa0b896c3 [ko2iblnd]
#18 [ffff880f23ee3df8] kiblnd_complete at ffffffffa0b89872 [ko2iblnd]
#19 [ffff880f23ee3e08] kiblnd_scheduler at ffffffffa0b89c2a [ko2iblnd]
#20 [ffff880f23ee3ee8] kthread at ffffffff8109e66e
#21 [ffff880f23ee3f48] kernel_thread at ffffffff8100c20a
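The register values in the exception frame can be cross-checked with a little arithmetic. The following is a minimal sketch, not part of the original report: it copies a few registers from the frame above and prints which of them point just above the address held in RBX (the same address that is passed to `struct kib_poolset_t` in the next command). The helper name `offsets_from` and the 0x1000-byte window are illustrative choices. Reading R12 as a pointer to `ps_pool_list` is an assumption about the structure layout that this ticket does not confirm.

```python
# Register values copied verbatim from the exception frame in the backtrace.
REGS = {
    "RAX": 0x0000000000000000,
    "RBX": 0xFFFF880FEC59CE40,
    "RDX": 0x0000000000000010,
    "RDI": 0xFFFF880FEC59CE40,
    "R12": 0xFFFF880FEC59CE70,
    "R14": 0xFFFF880FEC59CE50,
}

# Faulting address from the PANIC line.
FAULT_ADDR = 0x0000000000000010


def offsets_from(base_reg, regs=REGS, window=0x1000):
    """Return {register: offset} for registers pointing within `window` bytes at or above the base."""
    base = regs[base_reg]
    return {
        name: value - base
        for name, value in regs.items()
        if name != base_reg and 0 <= value - base < window
    }


rel = offsets_from("RBX")
for name, off in sorted(rel.items()):
    print(f"{name} = RBX + {off:#x}")

# RAX is 0 and RDX matches the faulting address, which is consistent with
# a read at offset 0x10 from a NULL base pointer.
print("RDX equals fault address:", REGS["RDX"] == FAULT_ADDR)
```

If the poolset layout on this kernel places `ps_pool_list` at offset 0x30, then R12 would be the address of that list head, which the structure dump shows as zero-filled.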
crash> struct kib_poolset_t ffff880fec59ce40
struct kib_poolset_t {
ps_lock = {
raw_lock = {
slock = 131072
}
},
ps_net = 0x0,
ps_name = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000",
ps_pool_list = {
next = 0x0,
prev = 0x0
},
ps_failed_pool_list = {
next = 0x0,
prev = 0x0
},
ps_next_retry = 0,
ps_increasing = 0,
ps_pool_size = 0,
ps_cpt = 0,
ps_pool_create = 0x0,
ps_pool_destroy = 0x0,
ps_node_init = 0x0,
ps_node_fini = 0x0
}
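Every field in this dump is zero, including both list heads. An initialized, empty kernel `list_head` points back at itself, so `next = 0x0` suggests the poolset was used before it was initialized or after it was cleared. The sketch below is a hedged illustration, not the actual ko2iblnd code: it models a `list_for_each_entry()`-style walk and shows which address the first iteration would read. The two offsets describe a hypothetical pool entry layout (list linkage at 0x0, free-list head at 0x10) and are assumptions, as is treating R12 as the address of `ps_pool_list`.

```python
# Hypothetical layout, NOT taken from this ticket: a pool entry whose list
# linkage sits at offset 0x0 and whose free-list head sits at offset 0x10.
PO_LIST_OFFSET = 0x00
PO_FREE_LIST_OFFSET = 0x10


def first_deref_address(head_addr, head_next):
    """Return the address the first loop iteration would read, or None.

    Mirrors a list_for_each_entry()-style walk: the list counts as empty
    only when head.next points back at the head itself.
    """
    if head_next == head_addr:
        return None  # initialized, empty list: the loop body never runs
    entry = head_next - PO_LIST_OFFSET  # container_of() equivalent
    return entry + PO_FREE_LIST_OFFSET  # first field read in the loop body


# R12 from the exception frame, assumed here to be &ps->ps_pool_list.
HEAD = 0xFFFF880FEC59CE70

initialized = first_deref_address(HEAD, HEAD)  # head pointing at itself
zero_filled = first_deref_address(HEAD, 0x0)   # next = 0x0, as in the dump

print("initialized empty list:", initialized)
print("zero-filled head reads:", hex(zero_filled))
```

Under these assumptions the zero-filled head leads to a read at address 0x10, which matches the faulting address in the PANIC line. The uploaded dump and the ko2iblnd source would be needed to confirm the real layout.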
I will upload the dump shortly for analysis.
Issue Links
- is related to LU-5678 kernel crash due to NULL pointer dereference in kiblnd_pool_alloc_node() (Resolved)