Details
- Bug
- Resolution: Fixed
- Blocker
- None
- 3
- 9223372036854775807
Description
The error occurred during soak testing of build '20160413' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160413). DNE is enabled. OSTs are formatted with ZFS, MDTs with ldiskfs. OSS and MDS nodes are configured in an active-active HA failover configuration.
During system boot, an MDS node that had been restarted crashed with the following error during LNet initialization:
Lustre: Lustre: Build Version: 2.8.51_28_gba2ac35
LNetError: 3247:0:(o2iblnd_cb.c:2310:kiblnd_passive_connect()) Can't accept conn from 192.168.1.108@o2ib10 on NA (ib0:0:192.168.1.110): bad dst nid 192.168.1.110@o2ib10
BUG: unable to handle kernel NULL pointer dereference LNet: Added LNI 192.168.1.110@o2ib10 [8/256/0/180] at 0000000000000080
IP: [<ffffffffa0b861e6>] kiblnd_passive_connect+0x466/0x17e0 [ko2iblnd]
PGD 839067067 PUD 8383e1067 PMD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/module/lnet/initstate
CPU 0
Modules linked in: ko2iblnd(U) ptlrpc(+)(U) obdclass(U) lnet(U) sha512_generic crc32c_intel libcfs(U) ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm dm_round_robin dm_multipath microcode iTCO_wdt iTCO_vendor_support zfs(P)(U) zcommon(P)(U) znvpair(P)(U) spl(U) zlib_deflate zavl(P)(U) zunicode(P)(U) sb_edac edac_core joydev lpc_ich mfd_core i2c_i801 ioatdma sg igb dca i2c_algo_bit i2c_core ext3 jbd mbcache sd_mod crc_t10dif ahci wmi isci libsas mpt2sas scsi_transport_sas raid_class mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_en ptp pps_core mlx4_core dm_mirror dm_region_hash dm_log dm_mod scsi_dh_rdac [last unloaded: scsi_wait_scan]
Pid: 3247, comm: ib_cm/0 Tainted: P -- ------------ 2.6.32-573.22.1.el6_lustre.x86_64 #1 Intel Corporation SandyBridge Platform/To be filled by O.E.M.
RIP: 0010:[<ffffffffa0b861e6>] [<ffffffffa0b861e6>] kiblnd_passive_connect+0x466/0x17e0 [ko2iblnd]
RSP: 0018:ffff8804318e7b20 EFLAGS: 00010246
RAX: 0000000000000008 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000012
RBP: ffff8804318e7be0 R08: 000000000001b9c2 R09: 00000000fffffffb
R10: 0000000000000003 R11: 0000000000000000 R12: ffff880835a6dc20
R13: ffffffffa0b92263 R14: ffff880432df7800 R15: ffffffffa06a1020
FS: 0000000000000000(0000) GS:ffff880038600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000080 CR3: 0000000835bd0000 CR4: 00000000000407f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process ib_cm/0 (pid: 3247, threadinfo ffff8804318e4000, task ffff880431015520)
Stack:
ffff880835a6dc20 ffffffffa06a1020 0000000000000004 ffff8808369be5d0
<d> ffff8804318e7b80 00000000814e7731 0005000ac0a8016c ffff880800000012
<d> 000300120be91b91 0000000000000000 0000100000000008 ffffffffa011bcbc
Call Trace:
[<ffffffffa011bcbc>] ? ib_find_cached_gid+0xec/0x110 [ib_core]
[<ffffffffa0b87c3d>] kiblnd_cm_callback+0x6dd/0x20e0 [ko2iblnd]
[<ffffffffa034a011>] cma_req_handler+0x371/0x640 [rdma_cm]
[<ffffffffa011692b>] ? rdma_port_get_link_layer+0x1b/0x60 [ib_core]
[<ffffffffa0322b27>] cm_process_work+0x27/0x110 [ib_cm]
[<ffffffffa0323735>] cm_req_handler+0x6b5/0xac0 [ib_cm]
[<ffffffffa0324140>] ? cm_work_handler+0x0/0x1206 [ib_cm]
[<ffffffffa0324275>] cm_work_handler+0x135/0x1206 [ib_cm]
[<ffffffffa0324140>] ? cm_work_handler+0x0/0x1206 [ib_cm]
[<ffffffff8109ab40>] worker_thread+0x170/0x2a0
[<ffffffff810a1820>] ? autoremove_wake_function+0x0/0x40
[<ffffffff8109a9d0>] ? worker_thread+0x0/0x2a0
[<ffffffff810a138e>] kthread+0x9e/0xc0
[<ffffffff8100c28a>] child_rip+0xa/0x20
[<ffffffff810a12f0>] ? kthread+0x0/0xc0
[<ffffffff8100c280>] ? child_rip+0x0/0x20
Code: e8 90 ff a6 ff 0f b7 95 78 ff ff ff 8b bd 78 ff ff ff 48 89 de 66 89 55 84 e8 27 46 00 00 83 bd 78 ff ff ff 11 66 89 45 90 74 10 <48> 8b 83 80 00 00 00 8b 50 1c 85 d2 89 d0 75 05 b8 00 01 00 00
RIP [<ffffffffa0b861e6>] kiblnd_passive_connect+0x466/0x17e0 [ko2iblnd]
RSP <ffff8804318e7b20>
CR2: 0000000000000080
---[ end trace 01db8c57e9900e3f ]---
Kernel panic - not syncing: Fatal exception
Pid: 3247, comm: ib_cm/0 Tainted: P D -- ------------ 2.6.32-573.22.1.el6_lustre.x86_64 #1
Call Trace:
[<ffffffff815394d1>] ? panic+0xa7/0x16f
[<ffffffff8153e2d4>] ? oops_end+0xe4/0x100
[<ffffffff8104e8cb>] ? no_context+0xfb/0x260
[<ffffffff8104eb55>] ? __bad_area_nosemaphore+0x125/0x1e0
[<ffffffff8104ec23>] ? bad_area_nosemaphore+0x13/0x20
[<ffffffff8104f31c>] ? __do_page_fault+0x30c/0x500
[<ffffffff81336a9f>] ? extract_buf+0x9f/0x130
[<ffffffff815401fe>] ? do_page_fault+0x3e/0xa0
[<ffffffff8153d5a5>] ? page_fault+0x25/0x30
[<ffffffffa0b861e6>] ? kiblnd_passive_connect+0x466/0x17e0 [ko2iblnd]
[<ffffffffa011bcbc>] ? ib_find_cached_gid+0xec/0x110 [ib_core]
[<ffffffffa0b87c3d>] ? kiblnd_cm_callback+0x6dd/0x20e0 [ko2iblnd]
[<ffffffffa034a011>] ? cma_req_handler+0x371/0x640 [rdma_cm]
[<ffffffffa011692b>] ? rdma_port_get_link_layer+0x1b/0x60 [ib_core]
[<ffffffffa0322b27>] ? cm_process_work+0x27/0x110 [ib_cm]
[<ffffffffa0323735>] ? cm_req_handler+0x6b5/0xac0 [ib_cm]
[<ffffffffa0324140>] ? cm_work_handler+0x0/0x1206 [ib_cm]
[<ffffffffa0324275>] ? cm_work_handler+0x135/0x1206 [ib_cm]
[<ffffffffa0324140>] ? cm_work_handler+0x0/0x1206 [ib_cm]
[<ffffffff8109ab40>] ? worker_thread+0x170/0x2a0
[<ffffffff810a1820>] ? autoremove_wake_function+0x0/0x40
[<ffffffff8109a9d0>] ? worker_thread+0x0/0x2a0
[<ffffffff810a138e>] ? kthread+0x9e/0xc0
[<ffffffff8100c28a>] ? child_rip+0xa/0x20
[<ffffffff810a12f0>] ? kthread+0x0/0xc0
[<ffffffff8100c280>] ? child_rip+0x0/0x20
Unfortunately, no crash dump was written. The only error information available was extracted from the console log of the affected node (lola-10).
Therefore only the console log of the MDS has been attached.
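Since no crash dump exists, the register dump in the oops is the main evidence. The sketch below is an illustrative check, not part of the original report. It takes a few lines copied from the console log and tests whether the fault address in CR2 equals RBX plus the displacement encoded in the faulting instruction (the bytes starting at the `<48>` marker). The helper names are hypothetical, and the decoder handles only the single x86-64 pattern seen here (`48 8b /r` with a 32-bit displacement off RBX). A match would be consistent with a field at offset 0x80 being read through a NULL pointer; it does not say which structure was NULL.

```python
import re

# Lines copied from the console log in this ticket (only the fields used here).
OOPS_EXCERPT = """\
RAX: 0000000000000008 RBX: 0000000000000000 RCX: 0000000000000000
CR2: 0000000000000080 CR3: 0000000835bd0000 CR4: 00000000000407f0
Code: e8 90 ff a6 ff 0f b7 95 78 ff ff ff 8b bd 78 ff ff ff 48 89 de 66 89 55 84 e8 27 46 00 00 83 bd 78 ff ff ff 11 66 89 45 90 74 10 <48> 8b 83 80 00 00 00 8b 50 1c 85 d2 89 d0 75 05 b8 00 01 00 00
"""


def parse_registers(text):
    """Return {name: value} for every 'NAME: <16 hex digits>' pair."""
    pairs = re.findall(r"\b([A-Z][A-Z0-9]{1,2}): ([0-9a-f]{16})\b", text)
    return {name: int(value, 16) for name, value in pairs}


def faulting_bytes(text):
    """Return the instruction bytes from the <..> marker to the end of the Code: line."""
    tokens = re.search(r"Code: (.*)", text).group(1).split()
    start = next(i for i, tok in enumerate(tokens) if tok.startswith("<"))
    return bytes(int(tok.strip("<>"), 16) for tok in tokens[start:])


def decode_load_via_rbx(insn):
    """Decode only 'mov r64, [rbx + disp32]' (REX.W, opcode 8B, ModRM mod=10 rm=011).

    Returns the signed displacement, or None if the bytes do not match.
    """
    if len(insn) >= 7 and insn[0] == 0x48 and insn[1] == 0x8B and insn[2] & 0xC7 == 0x83:
        return int.from_bytes(insn[3:7], "little", signed=True)
    return None


regs = parse_registers(OOPS_EXCERPT)
disp = decode_load_via_rbx(faulting_bytes(OOPS_EXCERPT))
consistent = disp is not None and regs["RBX"] + disp == regs["CR2"]

print("RBX =", hex(regs["RBX"]))
print("CR2 =", hex(regs["CR2"]))
print("displacement =", None if disp is None else hex(disp))
print("NULL base + offset matches fault address:", consistent)
```

Identifying the NULL pointer itself would still need the ko2iblnd source for this build or a disassembly with debug information.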
Attachments
Issue Links
- is related to LU-7101 Lnet: Support per NI map-on-demand (Resolved)