[LU-7099] Crash in kiblnd_pool_alloc_node Created: 03/Sep/15 Updated: 14/Jun/18 Resolved: 12/Jul/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.3 |
| Fix Version/s: | Lustre 2.9.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Sebastien Piechurski | Assignee: | Doug Oucharek (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | p4b | ||
| Environment: |
MDS installed with Bull 2.5.3 version |
||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
The MDS crashed in kiblnd_pool_alloc_node, with the message "unable to handle kernel NULL pointer dereference at 0000000000000010". This looks exactly the same as
crash> sys
SYSTEM MAP: /dumps/lib/kernel-debuginfo/2.6.32-504.8.1.el6.Bull.70.x86_64/boot/System.map-2.6.32-504.8.1.el6.Bull.70.x86_64
DEBUG KERNEL: /dumps/lib/kernel-debuginfo/2.6.32-504.8.1.el6.Bull.70.x86_64/modules/vmlinux (2.6.32-504.8.1.el6.Bull.70.x86_64)
DUMPFILE: vmcore [PARTIAL DUMP]
CPUS: 48 [OFFLINE: 24]
DATE: Wed Apr 1 16:58:18 2015
UPTIME: 00:54:42
LOAD AVERAGE: 0.64, 5.04, 7.86
TASKS: 682
NODENAME: taurusmds6
RELEASE: 2.6.32-504.8.1.el6.Bull.70.x86_64
VERSION: #1 SMP Tue Feb 10 14:51:21 CET 2015
MACHINE: x86_64 (2399 Mhz)
MEMORY: 128 GB
PANIC: "BUG: unable to handle kernel NULL pointer dereference at 0000000000000010"
crash> bt
PID: 9622 TASK: ffff881066c50080 CPU: 1 COMMAND: "kiblnd_sd_00_02"
#0 [ffff880f23ee3630] machine_kexec at ffffffff8103b71b
#1 [ffff880f23ee3690] crash_kexec at ffffffff810c9852
#2 [ffff880f23ee3760] oops_end at ffffffff8152ec30
#3 [ffff880f23ee3790] no_context at ffffffff8104c80b
#4 [ffff880f23ee37e0] __bad_area_nosemaphore at ffffffff8104ca95
#5 [ffff880f23ee3830] bad_area_nosemaphore at ffffffff8104cb63
#6 [ffff880f23ee3840] __do_page_fault at ffffffff8104d2bf
#7 [ffff880f23ee3960] do_page_fault at ffffffff81530b7e
#8 [ffff880f23ee3990] page_fault at ffffffff8152df35
[exception RIP: kiblnd_pool_alloc_node+73]
RIP: ffffffffa0b77439 RSP: ffff880f23ee3a40 RFLAGS: 00010207
RAX: 0000000000000000 RBX: ffff880fec59ce40 RCX: 000000000000003f
RDX: 0000000000000010 RSI: 0000000000000002 RDI: ffff880fec59ce40
RBP: ffff880f23ee3a80 R8: 72f8000000000000 R9: 97c0000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff880fec59ce70
R13: ffff880f23ee3a48 R14: ffff880fec59ce50 R15: 0000000000000012
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#9 [ffff880f23ee3a88] kiblnd_get_idle_tx at ffffffffa0b81fa9 [ko2iblnd]
#10 [ffff880f23ee3aa8] kiblnd_check_sends at ffffffffa0b857b5 [ko2iblnd]
#11 [ffff880f23ee3b08] kiblnd_post_rx at ffffffffa0b87dd8 [ko2iblnd]
#12 [ffff880f23ee3b58] kiblnd_recv at ffffffffa0b882c6 [ko2iblnd]
#13 [ffff880f23ee3be8] lnet_ni_recv at ffffffffa05f9ecb [lnet]
#14 [ffff880f23ee3c38] lnet_drop_message at ffffffffa05facf1 [lnet]
#15 [ffff880f23ee3c78] lnet_parse at ffffffffa05ff672 [lnet]
#16 [ffff880f23ee3d58] kiblnd_handle_rx at ffffffffa0b889db [ko2iblnd]
#17 [ffff880f23ee3da8] kiblnd_rx_complete at ffffffffa0b896c3 [ko2iblnd]
#18 [ffff880f23ee3df8] kiblnd_complete at ffffffffa0b89872 [ko2iblnd]
#19 [ffff880f23ee3e08] kiblnd_scheduler at ffffffffa0b89c2a [ko2iblnd]
#20 [ffff880f23ee3ee8] kthread at ffffffff8109e66e
#21 [ffff880f23ee3f48] kernel_thread at ffffffff8100c20a
crash> struct kib_poolset_t ffff880fec59ce40
struct kib_poolset_t {
ps_lock = {
raw_lock = {
slock = 131072
}
},
ps_net = 0x0,
ps_name = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000",
ps_pool_list = {
next = 0x0,
prev = 0x0
},
ps_failed_pool_list = {
next = 0x0,
prev = 0x0
},
ps_next_retry = 0,
ps_increasing = 0,
ps_pool_size = 0,
ps_cpt = 0,
ps_pool_create = 0x0,
ps_pool_destroy = 0x0,
ps_node_init = 0x0,
ps_node_fini = 0x0
}
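One possible reading of the dump above, not confirmed anywhere in this ticket: every field of the poolset is zero, including ps_pool_list.next, so the list head looks as if it was never initialised (or was already torn down). If the code walked that list without checking it, the first "pool" pointer would be NULL, and reading a member 16 bytes into the pool would fault at 0x10, which matches the reported address. The sketch below only does the address arithmetic; demo_pool, demo_poolset and both helper functions are hypothetical stand-ins, and the real ko2iblnd structure layout may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal stand-in for the kernel's struct list_head. */
struct list_head {
    struct list_head *next, *prev;
};

/* HYPOTHETICAL pool layout (assumption): a list link followed by a
 * free list. The real pool structure in ko2iblnd may be different. */
struct demo_pool {
    struct list_head po_list;      /* link on the poolset's pool list */
    struct list_head po_free_list; /* free nodes owned by this pool */
};

/* HYPOTHETICAL poolset: only the list head matters for this sketch. */
struct demo_poolset {
    struct list_head ps_pool_list;
};

/* A properly initialised list head never has NULL links; an empty
 * list points back at itself. */
static int poolset_is_initialised(const struct demo_poolset *ps)
{
    return ps->ps_pool_list.next != NULL && ps->ps_pool_list.prev != NULL;
}

/* Address that a list walk would read first if it took
 * ps_pool_list.next, converted it to its containing pool, and then
 * looked at pool->po_free_list. Integer arithmetic only, so nothing
 * is actually dereferenced. */
static uintptr_t first_access_address(const struct demo_poolset *ps)
{
    uintptr_t node = (uintptr_t)ps->ps_pool_list.next;
    uintptr_t pool = node - offsetof(struct demo_pool, po_list);
    return pool + offsetof(struct demo_pool, po_free_list);
}

/* Small self-check tying the sketch back to the dump. */
static void demo_check(void)
{
    struct demo_poolset zeroed = {0}; /* same state as the dumped poolset */

    assert(!poolset_is_initialised(&zeroed));
    /* With next == NULL the first read lands at the member offset
     * itself: 0x10 when pointers are 8 bytes wide. */
    assert(first_access_address(&zeroed) ==
           offsetof(struct demo_pool, po_free_list));
}
```

Calling demo_check() from a test program exercises the zeroed case. A guard like poolset_is_initialised() is shown only to illustrate the idea; it is not the fix that was landed for this issue.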
I will upload the dump shortly for analysis. |
| Comments |
| Comment by Sebastien Piechurski [ 03/Sep/15 ] |
|
The dump with all required objects is currently being uploaded to the FTP site under uploads/ |
| Comment by Joseph Gmitter (Inactive) [ 03/Sep/15 ] |
|
Hi Amir, |
| Comment by Chris Horn [ 24/Nov/15 ] |
|
Can you share any details on the problem here and the proposed solution? Is there a corresponding fix for master? |
| Comment by Gerrit Updater [ 18/May/16 ] |
|
Doug Oucharek (doug.s.oucharek@intel.com) uploaded a new patch: http://review.whamcloud.com/20322 |
| Comment by Gerrit Updater [ 11/Jul/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/20322/ |
| Comment by Peter Jones [ 12/Jul/16 ] |
|
Landed for 2.9 |