Details
-
Bug
-
Resolution: Duplicate
-
Minor
-
None
-
Lustre 2.5.3
-
None
-
kernel 2.6.32-431.23.3 + bull fix
lustre 2.5.3 + bull fix
-
3
-
16654
Description
During a restart (umount/mount) of the OSTs, an OSS crashed
with a NULL pointer dereference:
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffffa06b37c0>] lnet_ptl_match_md+0x250/0x870 [lnet]
PGD 0 Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:05.0/0000:05:00.1/host7/rport-7:0-0/target7:0:0/7:0:0:3/state
CPU 12
Modules linked in: osp(U) ofd(U) lfsck(U) ost(U) mgc(U) fsfilt_ldiskfs(U) osd_ldiskfs(U) ldiskfs(U) lustre(U) lov(U) osc(U) mdc(U)
lquota(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) sha512_generic sha256_generic crc32c_intel nfs lockd
fscache auth_rpcgss nfs_acl sunrpc ipmi_devintf cpufreq_ondemand acpi_cpufreq freq_table mperf rdma_ucm(U) rdma_cm(U) iw_cm(U)
ib_addr(U) ib_ipoib(U) ib_cm(U) ipv6 ib_uverbs(U) ib_umad(U) mlx4_ib(U) ib_sa(U) ib_mad(U) ib_core(U) mlx4_core(U) dm_round_robin
scsi_dh_rdac dm_multipath uinput sg lpc_ich mfd_core ioatdma compat(U) igb dca i2c_algo_bit i2c_core ptp pps_core lpfc scsi_transport_fc scsi_tgt
ext4 jbd2 mbcache sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod megaraid_sas [last unloaded: libcfs]
Pid: 25204, comm: kiblnd_sd_00_01 Tainted: G W --------------- 2.6.32-431.23.3.el6.Bull.56.x86_64 #1 BULL bullx super-node
RIP: 0010:[<ffffffffa06b37c0>] [<ffffffffa06b37c0>] lnet_ptl_match_md+0x250/0x870 [lnet]
RSP: 0018:ffff880c70589bf0 EFLAGS: 00010287
RAX: ffffffffd4888cbe RBX: ffff880c70589cf0 RCX: 00000000d4888cbd
RDX: fffffffffffffffe RSI: ffff880c5787b7d0 RDI: 0000000000000003
RBP: ffff880c70589c70 R08: 8980000000000000 R09: 4c00000000000000
R10: 000000000000002c R11: 0000000000000012 R12: ffff880434b24000
R13: ffff880c40941f40 R14: ffff880c40941f40 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff8800282c0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000001a85000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kiblnd_sd_00_01 (pid: 25204, threadinfo ffff880c70588000, task ffff880c77ac9500)
Stack:
ffff8803ab5eb278 ffff880c70589cb8 ffff8803ab5eb140 ffff8804d4888cbd
<d> ffff880c70589c70 ffffffffa06c5c36 ffff880c70589c70 0000000000000246
<d> ffff880c70589c70 ffff8803c94d1580 0000000000000000 ffff880434b24000
Call Trace:
[<ffffffffa06bb05b>] lnet_parse+0xb9b/0x18c0 [lnet]
[<ffffffffa08947fb>] kiblnd_handle_rx+0x2cb/0x640 [ko2iblnd]
[<ffffffffa08954e3>] kiblnd_rx_complete+0x2d3/0x420 [ko2iblnd]
[<ffffffffa0895692>] kiblnd_complete+0x62/0xe0 [ko2iblnd]
[<ffffffffa0895a4a>] kiblnd_scheduler+0x33a/0x7b0 [ko2iblnd]
[<ffffffff81099f56>] kthread+0x96/0xa0
[<ffffffff8100c20a>] child_rip+0xa/0x20
Code: 00 00 00 48 8b 5d d8 4c 8b 65 e0 4c 8b 6d e8 4c 8b 75 f0 4c 8b 7d f8 c9 c3 66 90 49 8b 45 30 4c 8b 38 4d 85 ff 0f 84 39 fe ff ff
<41> 8b 37 48 8b 3d c6 62 02 00 e8 01 6d f9 ff 8b 0d 77 64 02 00
crash> sys
KERNEL: /usr/lib/debug/lib/modules/2.6.32-431.23.3.el6.Bull.56.x86_64/vmlinux
DUMPFILE: vmcore [PARTIAL DUMP]
CPUS: 32
DATE: Mon Nov 3 17:30:13 2014
UPTIME: 28 days, 01:29:06
LOAD AVERAGE: 75.19, 18.37, 8.71
TASKS: 2258
NODENAME: bigfoot27
RELEASE: 2.6.32-431.23.3.el6.Bull.56.x86_64
VERSION: #1 SMP Thu Jul 31 16:27:31 CEST 2014
MACHINE: x86_64 (2266 Mhz)
MEMORY: 64 GB
PANIC: "Oops: 0000 [#1] SMP " (check log for details)
crash>
crash> bt
PID: 25204 TASK: ffff880c77ac9500 CPU: 12 COMMAND: "kiblnd_sd_00_01"
#0 [ffff880c705897e0] machine_kexec at ffffffff8103914b
#1 [ffff880c70589840] crash_kexec at ffffffff810c6042
#2 [ffff880c70589910] oops_end at ffffffff8152d9d0
#3 [ffff880c70589940] no_context at ffffffff8104a19b
#4 [ffff880c70589990] __bad_area_nosemaphore at ffffffff8104a425
#5 [ffff880c705899e0] bad_area_nosemaphore at ffffffff8104a4f3
#6 [ffff880c705899f0] __do_page_fault at ffffffff8104ac4f
#7 [ffff880c70589b10] do_page_fault at ffffffff8152f91e
#8 [ffff880c70589b40] page_fault at ffffffff8152ccd5
[exception RIP: lnet_ptl_match_md+592]
RIP: ffffffffa06b37c0 RSP: ffff880c70589bf0 RFLAGS: 00010287
RAX: ffffffffd4888cbe RBX: ffff880c70589cf0 RCX: 00000000d4888cbd
RDX: fffffffffffffffe RSI: ffff880c5787b7d0 RDI: 0000000000000003
RBP: ffff880c70589c70 R8: 8980000000000000 R9: 4c00000000000000
R10: 000000000000002c R11: 0000000000000012 R12: ffff880434b24000
R13: ffff880c40941f40 R14: ffff880c40941f40 R15: 0000000000000000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#9 [ffff880c70589c78] lnet_parse at ffffffffa06bb05b [lnet]
#10 [ffff880c70589d58] kiblnd_handle_rx at ffffffffa08947fb [ko2iblnd]
#11 [ffff880c70589da8] kiblnd_rx_complete at ffffffffa08954e3 [ko2iblnd]
#12 [ffff880c70589df8] kiblnd_complete at ffffffffa0895692 [ko2iblnd]
#13 [ffff880c70589e08] kiblnd_scheduler at ffffffffa0895a4a [ko2iblnd]
#14 [ffff880c70589ee8] kthread at ffffffff81099f56
#15 [ffff880c70589f48] kernel_thread at ffffffff8100c20a
The ptl variable used by lnet_ptl_match_md() can be recovered from the dump
(its address, 0xffff880c40941f40, appears in R13 and R14).
The crash occurs because ptl_rotor is negative:
crash> struct lnet_portal 0xffff880c40941f40
struct lnet_portal {
ptl_lock = {
raw_lock = {
slock = 409081954
}
},
ptl_index = 28,
ptl_options = 5,
ptl_msg_stealing = {
next = 0xffff880c40941f50,
prev = 0xffff880c40941f50
},
ptl_msg_delayed = {
next = 0xffff880c40941f60,
prev = 0xffff880c40941f60
},
ptl_mtables = 0xffff880c5787b7d0,
ptl_rotor = -729246580,
ptl_mt_nmaps = 4,
ptl_mt_maps = 0xffff880c40941f80
}
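To illustrate why a negative ptl_rotor can be fatal, here is a minimal sketch. It assumes the rotor is reduced with the % operator to select an entry in a table; that is an assumption for illustration and is not shown in the trace above. The helper names slot_signed and slot_unsigned are hypothetical and are not part of LNet.

```c
#include <assert.h>

/*
 * Hypothetical helpers, not Lustre code: pick one of nslots entries
 * from a round-robin counter.
 */

/* Signed variant: in C99 and later, % truncates toward zero, so a
 * negative rotor yields a result in (-nslots, 0], which is out of range
 * as an array index whenever it is non-zero. */
static int slot_signed(int rotor, int nslots)
{
	assert(nslots > 0);
	return rotor % nslots;
}

/* Unsigned variant: the result is always in [0, nslots). */
static unsigned int slot_unsigned(unsigned int rotor, unsigned int nslots)
{
	assert(nslots > 0);
	return rotor % nslots;
}
```

For instance, slot_signed(-3, 4) evaluates to -3, whereas slot_unsigned((unsigned int)-3, 4) evaluates to 1. A signed counter that is never initialized, or that is incremented until it wraps, can therefore produce an index in front of the table.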
Proposed fix:
Nov-04 15:11:34 [root@lascaux0 lustre-2.5.3] # diff -up lnet/lnet/lib-ptl.c lnet/lnet/lib-ptl.c.apr
--- lnet/lnet/lib-ptl.c 2014-09-11 18:04:07.000000000 +0200
+++ lnet/lnet/lib-ptl.c.apr 2014-11-04 15:11:34.935503533 +0100
@@ -773,6 +773,7 @@ lnet_ptl_setup(struct lnet_portal *ptl,
}
ptl->ptl_index = index;
+ ptl->ptl_rotor = 0;
CFS_INIT_LIST_HEAD(&ptl->ptl_msg_delayed);
CFS_INIT_LIST_HEAD(&ptl->ptl_msg_stealing);
#ifdef __KERNEL__
My full analysis trace is attached.
Attachments
Issue Links
- duplicates: LU-5639 Message is hashed to invalid match-table of LNet request portal (Resolved)