[LU-7351] LNet router crash during bring up of infiniband interface. Created: 28/Oct/15 Updated: 20/Nov/15 Resolved: 20/Nov/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | James A Simmons | Assignee: | Doug Oucharek (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Cray routers running Lustre 2.7.61 in an SLES11 SP3 environment. |
||
| Issue Links: |
|
||||||||
| Epic/Theme: | lnet | ||||||||
| Severity: | 2 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
In our testing on our medium size Cray system we encountered the following crash while attempting to bring up LNet on the routers.
[2015-10-26 15:51:14][c0-0c0s6n0]Lustre: kgnilnd build version: 2.7.61-DNE2-1.0502.0.2.7-jsimmons-Unknown-2015-10-21-11:16 |
| Comments |
| Comment by Doug Oucharek (Inactive) [ 28/Oct/15 ] |
|
James, what IB card is in the router (FDR, EDR)? Looks like mlx5 is being used. Is this upstream OFED or MOFED? |
| Comment by James A Simmons [ 28/Oct/15 ] |
|
It's an mlx5 FDR card using the OFED 3.12 stack. |
| Comment by Doug Oucharek (Inactive) [ 28/Oct/15 ] |
|
Is this a production system or a test system? Should it be sev 1 (highest priority)? |
| Comment by James A Simmons [ 28/Oct/15 ] |
|
This was tested on a production system. We had to roll back to version 2.5 to get it working again. I made it a blocker since it prevents Lustre bring-up for anyone attempting to use the latest pre-2.8 clients. |
| Comment by Doug Oucharek (Inactive) [ 28/Oct/15 ] |
|
I agree this is a blocker. Adding sev 1 means a production system is down and needs immediate attention to get it back up again. If the system in question is back up and running, can we change this to a sev 2? |
| Comment by James A Simmons [ 28/Oct/15 ] |
|
Sure, you can change it to 2. |
| Comment by Doug Oucharek (Inactive) [ 28/Oct/15 ] |
|
The o2iblnd_cb.c I have in 2.7.61 does not seem to match yours. Can you attach your copy of o2iblnd_cb.c so I can see the differences? |
| Comment by Amir Shehata (Inactive) [ 28/Oct/15 ] |
(gdb) l *kiblnd_cm_callback+0x5ad
0x1661d is in kiblnd_cm_callback (/home/ashehata/lustre-master/lnet/klnds/o2iblnd/o2iblnd_cb.c:2889).
2884                 kiblnd_peer_decref(peer);
2885                 return rc; /* rc != 0 destroys cmid */
2886
2887         case RDMA_CM_EVENT_ROUTE_ERROR:
2888                 peer = (kib_peer_t *)cmid->context;
2889                 CNETERR("%s: ROUTE ERROR %d\n",
2890                         libcfs_nid2str(peer->ibp_nid), event->status);
2891                 kiblnd_peer_connect_failed(peer, 1, -EHOSTUNREACH);
2892                 kiblnd_peer_decref(peer);
2893                 return -EHOSTUNREACH; /* rc != 0 destroys cmid */

Looks like peer might be NULL. |
| Comment by Doug Oucharek (Inactive) [ 28/Oct/15 ] |
|
Amir: the previous line is a call to kiblnd_peer_connect_failed() where peer is dereferenced. I would have thought the crash would have happened there. Are you looking at the proper binary? The files James sent us are different than 2.7.61. |
| Comment by Doug Oucharek (Inactive) [ 28/Oct/15 ] |
|
James: we suspect that cmid->context is coming back NULL when we don't expect such a thing to happen. Can you verify the line of the crash with your binary? I'm not sure that the binary Amir used is equivalent to yours. Once we know for sure that a NULL cmid->context is the cause, we can start to figure out how such a thing can happen. |
| Comment by Christopher Morrone [ 30/Oct/15 ] |
|
This sounds like it is a blocker for the 2.8 release as stated. I marked it as such, but you can correct it if I'm wrong. |
| Comment by James A Simmons [ 09/Nov/15 ] |
|
Nope, the problem is not cmid->context being NULL. I'm going to give it another run to see what it is. |
| Comment by James A Simmons [ 09/Nov/15 ] |
|
I added the latest |
| Comment by Doug Oucharek (Inactive) [ 12/Nov/15 ] |
|
James: Can I close this linked to |
| Comment by James A Simmons [ 12/Nov/15 ] |
|
Can we wait until |
| Comment by Doug Oucharek (Inactive) [ 18/Nov/15 ] |
|
Hi James. I see you successfully tested |
| Comment by Peter Jones [ 20/Nov/15 ] |
|
Seems like this is a duplicate |
| Comment by James A Simmons [ 20/Nov/15 ] |
|
I agree. The patch from |
| Comment by Peter Jones [ 20/Nov/15 ] |
|
Great - thanks for confirming James |