[LU-7351] LNet router crash during bring up of infiniband interface. Created: 28/Oct/15  Updated: 20/Nov/15  Resolved: 20/Nov/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Blocker
Reporter: James A Simmons Assignee: Doug Oucharek (Inactive)
Resolution: Duplicate Votes: 0
Labels: None
Environment:

Cray routers running Lustre 2.7.61 in an SLES11 SP3 environment.


Issue Links:
Related
is related to LU-3322 ko2iblnd support for different map_on... Resolved
Epic/Theme: lnet
Severity: 2
Rank (Obsolete): 9223372036854775807

 Description   

In testing on our medium-sized Cray system, we encountered the following crash while attempting to bring up LNet on the routers.

[2015-10-26 15:51:14][c0-0c0s6n0]Lustre: kgnilnd build version: 2.7.61-DNE2-1.0502.0.2.7-jsimmons-Unknown-2015-10-21-11:16
[2015-10-26 15:51:14][c0-0c0s6n0]LNet: Added LNI 12@gni2 [16/8192/0/0]
[2015-10-26 15:51:15][c0-0c0s6n0]LNetError: 149:0:(o2iblnd_cb.c:2239:kiblnd_passive_connect()) Can't accept conn from 10.36.226.4@o2ib on NA (ib0:0:10.36.223.1): bad dst nid 10.36.223.1@o2ib
[2015-10-26 15:51:15][c0-0c0s6n0]BUG: unable to handle kernel NULL pointer dereference at 0000000000000080
[2015-10-26 15:51:15][c0-0c0s6n0]IP: [<ffffffffa03e3e5b>] kiblnd_passive_connect+0xfb/0x16b0 [ko2iblnd]
[2015-10-26 15:51:15][c0-0c0s6n0]PGD 3dc9ae067 PUD 3ddc0c067 PMD 0
[2015-10-26 15:51:15][c0-0c0s6n0]Oops: 0000 [#1] SMP
[2015-10-26 15:51:15][c0-0c0s6n0]CPU 5
[2015-10-26 15:51:15][c0-0c0s6n0]Modules linked in: ko2iblnd kgnilnd lnet crc32c libcfs binfmt_misc rdma_ucm ib_ucm rdma_cm iw_cm ib_addr ib_ipoib ib_cm ib_uverbs ib_umad mlx5_ib mlx5_core mlx4_en mlx4_ib ib_sa ib_mad ib_core mlx4_core compat nic_compat dm_mod kdreg gpcd_gem ipogif_gem kgni_gem hwerr(P) rca hss_os(P) heartbeat simplex(P) ghal_gem cgm craytrace
[2015-10-26 15:51:15][c0-0c0s6n0]Pid: 149, comm: kworker/5:1 Tainted: P 3.0.101-0.46.1_1.0502.8871-cray_gem_s #1
[2015-10-26 15:51:15][c0-0c0s6n0]RIP: 0010:[<ffffffffa03e3e5b>] [<ffffffffa03e3e5b>] kiblnd_passive_connect+0xfb/0x16b0 [ko2iblnd]
[2015-10-26 15:51:15][c0-0c0s6n0]RSP: 0018:ffff8803f1b0fb10 EFLAGS: 00010246
[2015-10-26 15:51:15][c0-0c0s6n0]RAX: 000000000000003f RBX: ffffffffa03ee513 RCX: ffffffff81368c50
[2015-10-26 15:51:15][c0-0c0s6n0]RDX: 0000000000000100 RSI: 0000000000000000 RDI: ffff8803e8ac7680
[2015-10-26 15:51:15][c0-0c0s6n0]RBP: ffff8803f1b0fbd0 R08: 0000000000000005 R09: 0000000000000005
[2015-10-26 15:51:15][c0-0c0s6n0]R10: 0000000000000003 R11: 00000000ffffffff R12: 0000000000000012
[2015-10-26 15:51:15][c0-0c0s6n0]R13: 0000000000000000 R14: 0000000000000000 R15: ffff8803c0e58620
[2015-10-26 15:51:15][c0-0c0s6n0]FS: 00007f9e343b7700(0000) GS:ffff880407d40000(0000) knlGS:0000000000000000
[2015-10-26 15:51:15][c0-0c0s6n0]CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[2015-10-26 15:51:15][c0-0c0s6n0]CR2: 0000000000000080 CR3: 00000003dc9bd000 CR4: 00000000000007e0
[2015-10-26 15:51:15][c0-0c0s6n0]DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[2015-10-26 15:51:15][c0-0c0s6n0]DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[2015-10-26 15:51:15][c0-0c0s6n0]LNet: Added LNI 10.36.223.1@o2ib [63/2560/0/180]
[2015-10-26 15:51:15][c0-0c0s6n0]Process kworker/5:1 (pid: 149, threadinfo ffff8803f1b0c000, task ffff8803f1b09040)
[2015-10-26 15:51:15][c0-0c0s6n0]Stack:
[2015-10-26 15:51:15][c0-0c0s6n0] ffff8803c0e58620 ffffffffa02f6080 0000000000000000 ffff8803ea8bcc00
[2015-10-26 15:51:15][c0-0c0s6n0] ffffffffa02f6080 000500000a24e204 0000000000000001 ffff8803e4740000
[2015-10-26 15:51:15][c0-0c0s6n0] 000300120be91b91 0000000000000000 000010000000003f ffffffffa017b636
[2015-10-26 15:51:15][c0-0c0s6n0]Call Trace:
[2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffffa03e59bd>] kiblnd_cm_callback+0x5ad/0x2070 [ko2iblnd]
[2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffffa024145b>] cma_req_handler+0x1eb/0x550 [rdma_cm]
[2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffffa016ff57>] cm_process_work+0x27/0x130 [ib_cm]
[2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffffa0171fb0>] cm_req_handler+0x750/0xa00 [ib_cm]
[2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffffa0172385>] cm_work_handler+0x125/0xf4c [ib_cm]
[2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff81060953>] process_one_work+0x163/0x440
[2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff81063473>] worker_thread+0x183/0x400
[2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff81067ace>] kthread+0x9e/0xb0
[2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff81490074>] kernel_thread_helper+0x4/0x10
[2015-10-26 15:51:15][c0-0c0s6n0]Code: 0f 84 da 01 00 00 66 3d 00 11 0f 84 d0 01 00 00 66 c7 45 84 12 00 45 31 f6 48 8b 05 38 03 01 00 ba 00 01 00 00 8b 00 66 89 45 90
[2015-10-26 15:51:15][c0-0c0s6n0] 8b 86 80 00 00 00 85 c0 0f 45 d0 48 8b bd 58 ff ff ff 48 8d
[2015-10-26 15:51:15][c0-0c0s6n0]RIP [<ffffffffa03e3e5b>] kiblnd_passive_connect+0xfb/0x16b0 [ko2iblnd]
[2015-10-26 15:51:15][c0-0c0s6n0] RSP <ffff8803f1b0fb10>
[2015-10-26 15:51:15][c0-0c0s6n0]CR2: 0000000000000080
[2015-10-26 15:51:15][c0-0c0s6n0]--[ end trace 311d9fd8dd61b1cf ]--
[2015-10-26 15:51:15][c0-0c0s6n0]Kernel panic - not syncing: Fatal exception
[2015-10-26 15:51:15][c0-0c0s6n0]Pid: 149, comm: kworker/5:1 Tainted: P D 3.0.101-0.46.1_1.0502.8871-cray_gem_s #1
[2015-10-26 15:51:15][c0-0c0s6n0]Call Trace:
[2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff81006651>] try_stack_unwind+0x161/0x1a0
[2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff81004eb9>] dump_trace+0x89/0x430
[2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff810060bc>] show_trace_log_lvl+0x5c/0x80
[2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff810060f5>] show_trace+0x15/0x20
[2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff8148b31c>] dump_stack+0x79/0x84
[2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff8148b3bb>] panic+0x94/0x1da
[2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff81005ed8>] oops_end+0xa8/0xe0
[2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff81027589>] no_context+0xf9/0x260
[2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff81027855>] __bad_area_nosemaphore+0x165/0x1f0
[2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff810278f3>] bad_area_nosemaphore+0x13/0x20
[2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff81027e4e>] do_page_fault+0x2fe/0x440
[2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff8148e8cf>] page_fault+0x1f/0x30
[2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffffa03e3e5b>] kiblnd_passive_connect+0xfb/0x16b0 [ko2iblnd]
[2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffffa03e59bd>] kiblnd_cm_callback+0x5ad/0x2070 [ko2iblnd]
[2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffffa024145b>] cma_req_handler+0x1eb/0x550 [rdma_cm]
[2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffffa016ff57>] cm_process_work+0x27/0x130 [ib_cm]
[2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffffa0171fb0>] cm_req_handler+0x750/0xa00 [ib_cm]
[2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffffa0172385>] cm_work_handler+0x125/0xf4c [ib_cm]
[2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff81060953>] process_one_work+0x163/0x440
[2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff81063473>] worker_thread+0x183/0x400
[2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff81067ace>] kthread+0x9e/0xb0
[2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff81490074>] kernel_thread_helper+0x4/0x10



 Comments   
Comment by Doug Oucharek (Inactive) [ 28/Oct/15 ]

James, what IB card is in the router (FDR, EDR)? Looks like mlx5 is being used. Is this upstream OFED or MOFED?

Comment by James A Simmons [ 28/Oct/15 ]

It's an mlx5 FDR card using the OFED 3.12 stack.

Comment by Doug Oucharek (Inactive) [ 28/Oct/15 ]

Is this a production system or a test system? Should it be sev 1 (highest priority)?

Comment by James A Simmons [ 28/Oct/15 ]

This was tested on a production system. We had to roll back to the 2.5 version to get it working again. I made it a blocker since it prevents Lustre bring-up for anyone attempting to run the latest pre-2.8 clients.

Comment by Doug Oucharek (Inactive) [ 28/Oct/15 ]

I agree this is a blocker. Adding sev 1 means a production system is down and needs immediate attention to get it back up again. If the system in question is back up and running, can we change this to a sev 2?

Comment by James A Simmons [ 28/Oct/15 ]

Sure you can change it to 2.

Comment by Doug Oucharek (Inactive) [ 28/Oct/15 ]

The o2iblnd_cb.c I have in 2.7.61 does not seem to match yours. Can you attach your copy of o2iblnd_cb.c so I can see the differences?

Comment by Amir Shehata (Inactive) [ 28/Oct/15 ]
(gdb) l *kiblnd_cm_callback+0x5ad
0x1661d is in kiblnd_cm_callback (/home/ashehata/lustre-master/lnet/klnds/o2iblnd/o2iblnd_cb.c:2889).
2884                    kiblnd_peer_decref(peer);
2885                    return rc;                      /* rc != 0 destroys cmid */
2886
2887            case RDMA_CM_EVENT_ROUTE_ERROR:
2888                    peer = (kib_peer_t *)cmid->context;
2889                    CNETERR("%s: ROUTE ERROR %d\n",
2890                            libcfs_nid2str(peer->ibp_nid), event->status);
2891                    kiblnd_peer_connect_failed(peer, 1, -EHOSTUNREACH);
2892                    kiblnd_peer_decref(peer);
2893                    return -EHOSTUNREACH;           /* rc != 0 destroys cmid */

Looks like peer might be NULL.
If there is a change in the way OFED works, that might explain this error.
Can you put a few debug statements around this part to see if this is indeed the case? A check to see whether cmid or peer is NULL.
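
A minimal sketch of the kind of check being asked for, written as a userspace mock so it can be compiled and run on its own. The types `mock_peer` and `mock_cm_id` and the helper `check_cmid_context()` are hypothetical stand-ins, not the real ko2iblnd or rdma_cm definitions; in the actual module the same test would sit before the first use of `peer` and log through the module's own error macros rather than `fprintf()`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for the kernel types; not the real definitions. */
struct mock_peer {
	unsigned long long nid;
};

struct mock_cm_id {
	void *context;	/* assumed to hold the peer pointer, as in the listing */
};

/*
 * Report which pointer is missing before anything dereferences it.
 * Returns 0 when both pointers are usable, -1 when cmid is NULL,
 * -2 when cmid->context is NULL.
 */
static int check_cmid_context(const struct mock_cm_id *cmid, const char *where)
{
	assert(where != NULL);

	if (cmid == NULL) {
		fprintf(stderr, "%s: cmid is NULL\n", where);
		return -1;
	}
	if (cmid->context == NULL) {
		fprintf(stderr, "%s: cmid->context (peer) is NULL\n", where);
		return -2;
	}
	return 0;
}
```

Distinct return codes keep the two cases apart in the log, so a single test run shows whether cmid itself or only its context was missing.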

Comment by Doug Oucharek (Inactive) [ 28/Oct/15 ]

Amir: the previous line is a call to kiblnd_peer_connect_failed() where peer is dereferenced. I would have thought the crash would have happened there. Are you looking at the proper binary? The files James sent us are different than 2.7.61.

Comment by Doug Oucharek (Inactive) [ 28/Oct/15 ]

James: we suspect that cmid->context is coming back NULL when we don't expect such a thing to happen. Can you verify the line of the crash with your binary? I'm not sure that the binary Amir used is equivalent to yours. Once we know for sure that a NULL cmid->context is the cause, we can start to figure out how such a thing can happen.
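
Independent of which binary is used, the register dump can be cross-checked against the Code: bytes. The sketch below assumes the faulting instruction begins at the bytes `8b 86 80 00 00 00` (the marker that normally brackets the faulting byte is not preserved in this log) and decodes only that one form, a `8b /r` load with a 32-bit displacement and no SIB byte. The struct and function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct modrm_decode {
	unsigned mod;	/* addressing mode, ModRM bits 7..6 */
	unsigned reg;	/* destination register number, bits 5..3 */
	unsigned rm;	/* base register number, bits 2..0 */
	int32_t disp;	/* 32-bit displacement following the ModRM byte */
};

/*
 * Decode opcode 0x8b (load from memory into a register) with mod == 2
 * (base register + disp32). Returns 0 on success, -1 for any other
 * encoding.
 */
static int decode_mov_disp32(const uint8_t *code, size_t len,
			     struct modrm_decode *out)
{
	assert(code != NULL && out != NULL);

	if (len < 6 || code[0] != 0x8b)
		return -1;

	out->mod = code[1] >> 6;
	out->reg = (code[1] >> 3) & 7;
	out->rm = code[1] & 7;

	/* rm == 4 would mean a SIB byte follows; not handled here */
	if (out->mod != 2 || out->rm == 4)
		return -1;

	/* the displacement is stored little-endian */
	out->disp = (int32_t)((uint32_t)code[2] |
			      (uint32_t)code[3] << 8 |
			      (uint32_t)code[4] << 16 |
			      (uint32_t)code[5] << 24);
	return 0;
}

/* Effective address = base register value + sign-extended displacement */
static uint64_t effective_address(uint64_t base, int32_t disp)
{
	return base + (uint64_t)(int64_t)disp;
}
```

For these bytes the ModRM byte 0x86 gives mod 2, reg 0 (eax) and rm 6 (rsi) with displacement 0x80, i.e. a load from 0x80(%rsi). With RSI at 0 in the dump, the effective address is 0x80, which matches CR2, so the NULL pointer is whatever structure pointer was held in RSI at kiblnd_passive_connect+0xfb.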

Comment by Christopher Morrone [ 30/Oct/15 ]

This sounds like it is a blocker for the 2.8 release as stated. I marked it as such, but you can correct it if I'm wrong.

Comment by James A Simmons [ 09/Nov/15 ]

Nope, the problem is not cmid->context being NULL. I'm going to give it another run to see what it is.

Comment by James A Simmons [ 09/Nov/15 ]

I added the latest LU-3322 patch and I'm not seeing the crashes anymore. This was at the smallest scale at which I saw the problem before, but we don't know whether it resolves it at Titan scale.

Comment by Doug Oucharek (Inactive) [ 12/Nov/15 ]

James: Can I close this linked to LU-3322?

Comment by James A Simmons [ 12/Nov/15 ]

Can we wait until LU-3322 is settled? I noticed that the patch has changed again but is now disliked. I want to wait until an agreed-on solution is presented before I try to test it again. Is that okay?

Comment by Doug Oucharek (Inactive) [ 18/Nov/15 ]

Hi James. I see you successfully tested LU-3322. Does that mean this issue has been resolved?

Comment by Peter Jones [ 20/Nov/15 ]

Seems like this is a duplicate of LU-3322. James, please speak up if you think otherwise.

Comment by James A Simmons [ 20/Nov/15 ]

I agree. The patch from LU-3322 resolves this.

Comment by Peter Jones [ 20/Nov/15 ]

Great, thanks for confirming, James.
