Details
- Bug
- Resolution: Duplicate
- Blocker
- Lustre 2.8.0
- None
- Cray routers running Lustre 2.7.61 in an SLES11 SP3 environment.
- 2
Description
In our testing on our medium-sized Cray system, we encountered the following crash while attempting to bring up LNet on the routers.
[2015-10-26 15:51:14][c0-0c0s6n0]Lustre: kgnilnd build version: 2.7.61-DNE2-1.0502.0.2.7-jsimmons-Unknown-2015-10-21-11:16
[2015-10-26 15:51:14][c0-0c0s6n0]LNet: Added LNI 12@gni2 [16/8192/0/0]
[2015-10-26 15:51:15][c0-0c0s6n0]LNetError: 149:0:(o2iblnd_cb.c:2239:kiblnd_passive_connect()) Can't accept conn from 10.36.226.4@o2ib on NA (ib0:0:10.36.223.1): bad dst nid 10.36.223.1@o2ib
[2015-10-26 15:51:15][c0-0c0s6n0]BUG: unable to handle kernel NULL pointer dereference at 0000000000000080
[2015-10-26 15:51:15][c0-0c0s6n0]IP: [<ffffffffa03e3e5b>] kiblnd_passive_connect+0xfb/0x16b0 [ko2iblnd]
[2015-10-26 15:51:15][c0-0c0s6n0]PGD 3dc9ae067 PUD 3ddc0c067 PMD 0
[2015-10-26 15:51:15][c0-0c0s6n0]Oops: 0000 [#1] SMP
[2015-10-26 15:51:15][c0-0c0s6n0]CPU 5
[2015-10-26 15:51:15][c0-0c0s6n0]Modules linked in: ko2iblnd kgnilnd lnet crc32c libcfs binfmt_misc rdma_ucm ib_ucm rdma_cm iw_cm ib_addr ib_ipoib ib_cm ib_uverbs ib_umad mlx5_ib mlx5_core mlx4_en mlx4_ib ib_sa ib_mad ib_core mlx4_core compat nic_compat dm_mod kdreg gpcd_gem ipogif_gem kgni_gem hwerr(P) rca hss_os(P) heartbeat simplex(P) ghal_gem cgm craytrace
[2015-10-26 15:51:15][c0-0c0s6n0]Pid: 149, comm: kworker/5:1 Tainted: P 3.0.101-0.46.1_1.0502.8871-cray_gem_s #1
[2015-10-26 15:51:15][c0-0c0s6n0]RIP: 0010:[<ffffffffa03e3e5b>] [<ffffffffa03e3e5b>] kiblnd_passive_connect+0xfb/0x16b0 [ko2iblnd]
[2015-10-26 15:51:15][c0-0c0s6n0]RSP: 0018:ffff8803f1b0fb10 EFLAGS: 00010246
[2015-10-26 15:51:15][c0-0c0s6n0]RAX: 000000000000003f RBX: ffffffffa03ee513 RCX: ffffffff81368c50
[2015-10-26 15:51:15][c0-0c0s6n0]RDX: 0000000000000100 RSI: 0000000000000000 RDI: ffff8803e8ac7680
[2015-10-26 15:51:15][c0-0c0s6n0]RBP: ffff8803f1b0fbd0 R08: 0000000000000005 R09: 0000000000000005
[2015-10-26 15:51:15][c0-0c0s6n0]R10: 0000000000000003 R11: 00000000ffffffff R12: 0000000000000012
[2015-10-26 15:51:15][c0-0c0s6n0]R13: 0000000000000000 R14: 0000000000000000 R15: ffff8803c0e58620
[2015-10-26 15:51:15][c0-0c0s6n0]FS: 00007f9e343b7700(0000) GS:ffff880407d40000(0000) knlGS:0000000000000000
[2015-10-26 15:51:15][c0-0c0s6n0]CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[2015-10-26 15:51:15][c0-0c0s6n0]CR2: 0000000000000080 CR3: 00000003dc9bd000 CR4: 00000000000007e0
[2015-10-26 15:51:15][c0-0c0s6n0]DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[2015-10-26 15:51:15][c0-0c0s6n0]DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[2015-10-26 15:51:15][c0-0c0s6n0]LNet: Added LNI 10.36.223.1@o2ib [63/2560/0/180]
[2015-10-26 15:51:15][c0-0c0s6n0]Process kworker/5:1 (pid: 149, threadinfo ffff8803f1b0c000, task ffff8803f1b09040)
[2015-10-26 15:51:15][c0-0c0s6n0]Stack:
[2015-10-26 15:51:15][c0-0c0s6n0] ffff8803c0e58620 ffffffffa02f6080 0000000000000000 ffff8803ea8bcc00
[2015-10-26 15:51:15][c0-0c0s6n0] ffffffffa02f6080 000500000a24e204 0000000000000001 ffff8803e4740000
[2015-10-26 15:51:15][c0-0c0s6n0] 000300120be91b91 0000000000000000 000010000000003f ffffffffa017b636
[2015-10-26 15:51:15][c0-0c0s6n0]Call Trace:
[2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffffa03e59bd>] kiblnd_cm_callback+0x5ad/0x2070 [ko2iblnd]
[2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffffa024145b>] cma_req_handler+0x1eb/0x550 [rdma_cm]
[2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffffa016ff57>] cm_process_work+0x27/0x130 [ib_cm]
[2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffffa0171fb0>] cm_req_handler+0x750/0xa00 [ib_cm]
[2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffffa0172385>] cm_work_handler+0x125/0xf4c [ib_cm]
[2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff81060953>] process_one_work+0x163/0x440
[2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff81063473>] worker_thread+0x183/0x400
[2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff81067ace>] kthread+0x9e/0xb0
[2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff81490074>] kernel_thread_helper+0x4/0x10
[2015-10-26 15:51:15][c0-0c0s6n0]Code: 0f 84 da 01 00 00 66 3d 00 11 0f 84 d0 01 00 00 66 c7 45 84 12 00 45 31 f6 48 8b 05 38 03 01 00 ba 00 01 00 00 8b 00 66 89 45 90
[2015-10-26 15:51:15][c0-0c0s6n0] 8b 86 80 00 00 00 85 c0 0f 45 d0 48 8b bd 58 ff ff ff 48 8d
[2015-10-26 15:51:15][c0-0c0s6n0]RIP [<ffffffffa03e3e5b>] kiblnd_passive_connect+0xfb/0x16b0 [ko2iblnd]
[2015-10-26 15:51:15][c0-0c0s6n0] RSP <ffff8803f1b0fb10>
[2015-10-26 15:51:15][c0-0c0s6n0]CR2: 0000000000000080
[2015-10-26 15:51:15][c0-0c0s6n0]---[ end trace 311d9fd8dd61b1cf ]---
[2015-10-26 15:51:15][c0-0c0s6n0]Kernel panic - not syncing: Fatal exception
[2015-10-26 15:51:15][c0-0c0s6n0]Pid: 149, comm: kworker/5:1 Tainted: P D 3.0.101-0.46.1_1.0502.8871-cray_gem_s #1
[2015-10-26 15:51:15][c0-0c0s6n0]Call Trace:
[2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff81006651>] try_stack_unwind+0x161/0x1a0
[2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff81004eb9>] dump_trace+0x89/0x430
[2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff810060bc>] show_trace_log_lvl+0x5c/0x80
[2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff810060f5>] show_trace+0x15/0x20
[2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff8148b31c>] dump_stack+0x79/0x84
[2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff8148b3bb>] panic+0x94/0x1da
[2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff81005ed8>] oops_end+0xa8/0xe0
[2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff81027589>] no_context+0xf9/0x260
[2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff81027855>] __bad_area_nosemaphore+0x165/0x1f0
[2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff810278f3>] bad_area_nosemaphore+0x13/0x20
[2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff81027e4e>] do_page_fault+0x2fe/0x440
[2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff8148e8cf>] page_fault+0x1f/0x30
[2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffffa03e3e5b>] kiblnd_passive_connect+0xfb/0x16b0 [ko2iblnd]
[2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffffa03e59bd>] kiblnd_cm_callback+0x5ad/0x2070 [ko2iblnd]
[2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffffa024145b>] cma_req_handler+0x1eb/0x550 [rdma_cm]
[2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffffa016ff57>] cm_process_work+0x27/0x130 [ib_cm]
[2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffffa0171fb0>] cm_req_handler+0x750/0xa00 [ib_cm]
[2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffffa0172385>] cm_work_handler+0x125/0xf4c [ib_cm]
[2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff81060953>] process_one_work+0x163/0x440
[2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff81063473>] worker_thread+0x183/0x400
[2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff81067ace>] kthread+0x9e/0xb0
[2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff81490074>] kernel_thread_helper+0x4/0x10
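When comparing this trace against other reports, it can help to pull the faulting symbol, offset, and module out of the console lines mechanically. The following is a minimal sketch, not part of the original report: the helper name `parse_frame` and the regular expression are illustrative assumptions, written only against the frame format visible in the log above.

```python
import re

# Matches a stack-frame fragment such as:
#   [<ffffffffa03e3e5b>] kiblnd_passive_connect+0xfb/0x16b0 [ko2iblnd]
# The trailing "[module]" is optional because core-kernel frames omit it.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]{16})>\]\s+"
    r"(?P<symbol>[A-Za-z_][\w.]*)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def parse_frame(line):
    """Return the first stack frame found in a console line, or None.

    parse_frame is a hypothetical helper for illustration only.
    """
    m = FRAME_RE.search(line)
    if m is None:
        return None
    return {
        "addr": int(m["addr"], 16),
        "symbol": m["symbol"],
        "offset": int(m["offset"], 16),  # offset into the function
        "size": int(m["size"], 16),      # total size of the function
        "module": m["module"],           # None for core-kernel symbols
    }


if __name__ == "__main__":
    rip = ("[2015-10-26 15:51:15][c0-0c0s6n0]IP: [<ffffffffa03e3e5b>] "
           "kiblnd_passive_connect+0xfb/0x16b0 [ko2iblnd]")
    print(parse_frame(rip))
```

Applied to the `IP:` line of this crash, the sketch reports `kiblnd_passive_connect` at offset `0xfb` in module `ko2iblnd`, which is the location to look up in a matching debuginfo build.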
Issue Links
- is related to LU-3322 ko2iblnd support for different map_on_demand and peer_credits between systems (Resolved)