[LU-7569] IB leaf switch caused LNet routers to crash Created: 16/Dec/15 Updated: 24/Oct/17 Resolved: 18/Jan/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | James A Simmons | Assignee: | Doug Oucharek (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
During testing we lost one of the IB leaf switches, which caused all of our Lustre routers to crash with the following error: 2015-12-11T10:53:29.539273-05:00 c0-0c0s2n3 LNetError: 4675:0:(o2iblnd.c:399:kiblnd_find_peer_locked()) ASSERTION( peer->ibp_connecting > 0 || peer->ibp_accepting > 0 || !list_empty(&peer->ibp_conns) ) failed: |
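For context, the failed ASSERTION encodes the invariant that a peer still present in the o2iblnd peer table must either have a connection attempt in progress or hold at least one established connection; the crash shows a peer being looked up with none of these. Below is a minimal, self-contained C sketch of that check, with field names taken from the assertion text; the surrounding struct and list helpers are simplified assumptions for illustration, not the actual o2iblnd source. |

```c
#include <stdbool.h>

/* Simplified stand-ins for the kernel list and peer structures. */
struct list_head { struct list_head *next, *prev; };

static bool list_empty(const struct list_head *head)
{
	return head->next == head;
}

struct kib_peer_sketch {
	int              ibp_connecting;  /* active connects in progress */
	int              ibp_accepting;   /* passive connects in progress */
	struct list_head ibp_conns;       /* established connections */
};

/* The invariant asserted in kiblnd_find_peer_locked(): a peer found in
 * the peer table is expected to be connecting, accepting, or to hold at
 * least one live connection.  The crash in this ticket shows a peer
 * being looked up after it has lost all three. */
static bool peer_looks_alive(const struct kib_peer_sketch *peer)
{
	return peer->ibp_connecting > 0 ||
	       peer->ibp_accepting > 0 ||
	       !list_empty(&peer->ibp_conns);
}
```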
| Comments |
| Comment by Jian Yu [ 17/Dec/15 ] |
|
Hi Amir, Could you please advise? Thank you. |
| Comment by Jeremy Filizetti [ 17/Dec/15 ] |
|
I've seen at least a handful of suspect ways these conditional checks for ibp_connecting, ibp_accepting, and ibp_conns can be incorrect if things are slow to respond and connections get rejected. With the inclusion of |
| Comment by Gerrit Updater [ 17/Dec/15 ] |
|
Liang Zhen (liang.zhen@intel.com) uploaded a new patch: http://review.whamcloud.com/17661 |
| Comment by Liang Zhen (Inactive) [ 17/Dec/15 ] |
|
I've submitted a patch which could be helpful, http://review.whamcloud.com/17661 , but I have no environment to test it, so it is only for review for the time being. |
| Comment by James A Simmons [ 17/Dec/15 ] |
|
Liang, I just rebooted a system with 17661 and we rebooted the leaf switch. It worked completely. No more router oopses on us. Thank you. |
| Comment by Doug Oucharek (Inactive) [ 17/Dec/15 ] |
|
James, did all the clients re-connect ok? None of them got stuck on reconnecting? |
| Comment by James A Simmons [ 04/Jan/16 ] |
|
Yes, Doug, they did all reconnect okay. I did find a problem with this patch, though. If I place the following in my modprobe configuration file, I can crash my client nodes: options ko2iblnd timeout=100 credits=2560 ntx=5120 peer_credits=63 concurrent_sends=63 fmr_pool_size=1280 fmr_flush_trigger=1024 map_on_demand=64 and then run modprobe lnet; lctl net up. You will then see the following back trace: |
| Comment by James A Simmons [ 06/Jan/16 ] |
|
Liang, Doug, have you been able to duplicate my crash? |
| Comment by Doug Oucharek (Inactive) [ 06/Jan/16 ] |
|
Just a status update on this patch: there are multiple problems with the reconnection code in o2iblnd which we are trying to address here (see list of related patches). As you can see from Liang's patch, significant changes are being made to the reconnection strategy to address them. At the moment, I know of two issues with the current version of this patch: 1- Reconnections due to different negotiated parameters can cause an LBUG (what you are finding, James). 2- Reconnection to a rebooted router can get stuck and loop forever. I'm working on number 2 with a customer who has run into this. My current theory is that the client tries to reconnect to the router while the router has not completely come up yet. If that connection attempt gets stuck (i.e. the client never hears back from it), it can trigger never-ending reconnects. I'd like the patch for this ticket to address the above 2 issues so we can kill off many problems at once here. |
| Comment by James A Simmons [ 06/Jan/16 ] |
|
I think I know why number 1 happens. The function kiblnd_check_reconnect() returns right away if peer->ibp_connecting != 1, so for the checks to actually happen we need the condition peer->ibp_connecting == 1. But for some of the checks we end up incrementing ibp_connecting again. I think the logic is reversed from what it should be. Since we know peer->ibp_connecting == 1 on critical failure, it should be decremented. I'm testing this change now. |
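To illustrate the accounting James describes (as his next comment notes, this particular theory did not pan out), here is a hedged C sketch of the change he tested; the structure, names, and failure flag are assumptions for illustration, not the actual kiblnd_check_reconnect() source. |

```c
/* Sketch of the ibp_connecting accounting described above; not the real
 * o2iblnd code.  The reconnect check only proceeds when exactly one
 * connect attempt is outstanding (ibp_connecting == 1). */
struct peer_conn_sketch {
	int ibp_connecting;   /* outstanding active connect attempts */
};

static void check_reconnect_sketch(struct peer_conn_sketch *peer,
				   int critical_failure)
{
	if (peer->ibp_connecting != 1)
		return;                 /* another path owns the attempt */

	if (critical_failure) {
		/* Proposed change: account for the failed attempt by
		 * decrementing, instead of incrementing a second time. */
		peer->ibp_connecting--;
		return;
	}

	/* Otherwise schedule another attempt, which takes a new count. */
	peer->ibp_connecting++;
}
```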
| Comment by James A Simmons [ 06/Jan/16 ] |
|
My theory was wrong. Still crashes. |
| Comment by James A Simmons [ 07/Jan/16 ] |
|
Thanks to Jeremy, who pointed out I was using an old patch. With my fix, the latest version of the patch resolves problem 1 that Doug listed. I haven't run into case 2, so no fix for that. |
| Comment by Doug Oucharek (Inactive) [ 07/Jan/16 ] |
|
That is great news! For issue 2, I have a theoretical fix which will be tested today. From the logs, I am seeing this pattern which causes issue 2: 1- LNet router reboots (for any reason). 2- A client tries to reconnect to the router before it has completely come up. 3- That connection attempt gets stuck and the client never hears back from it. Thus, we have an infinite loop caused by what appears to be a stuck connecting connection. The code assumes we will always hear back from connection attempts, so there is no cleanup of the connection by connd. Why the connection is stuck is unknown to me, but the code should be robust enough to detect this and avoid this infinite loop (i.e. self-healing code). Note: I am only seeing this with mlx5-based cards. |
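To make the self-healing idea concrete, here is a hedged C sketch of deadline-based cleanup that a connd-style scan could apply to connection attempts that never hear back; the names, fields, and timeout value are all assumptions for illustration, not existing o2iblnd code. |

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical per-attempt state; the real o2iblnd tracks this differently. */
struct conn_attempt_sketch {
	time_t started;        /* when the connect was issued */
	bool   completed;      /* have we heard back (accept or reject)? */
};

#define CONNECT_DEADLINE_SECS 50   /* illustrative value only */

/* Idea behind the self-healing check: instead of assuming every connection
 * attempt eventually answers, a periodic connd-style scan could time out
 * attempts that never hear back, so a client stuck connecting to a
 * half-rebooted router does not reconnect forever. */
static bool attempt_is_stuck(const struct conn_attempt_sketch *attempt,
			     time_t now)
{
	return !attempt->completed &&
	       difftime(now, attempt->started) > CONNECT_DEADLINE_SECS;
}
```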
| Comment by Liang Zhen (Inactive) [ 08/Jan/16 ] |
|
As the original patch has defects and has been reverted from master, the above patch is invalid now. I reimplemented the patch, which includes the original feature and improvements: http://review.whamcloud.com/#/c/17892/ For the stuck issue described by Doug, I'm not sure whether it is a general issue or just a bug in a particular version of the mlx5 driver, so this patch does not include the self-healing code mentioned by Doug. If you don't have the stuck issue, then you probably don't need the self-healing code at all. |
| Comment by James A Simmons [ 08/Jan/16 ] |
|
Doug, since you don't have a solution just yet for the connection issue, could you create a new patch on top of the new one posted by Liang? |
| Comment by Doug Oucharek (Inactive) [ 08/Jan/16 ] |
|
Ok, will do. I will create a new Jira ticket for that solution so this ticket is free to land Liang's latest patch and close. |
| Comment by James A Simmons [ 08/Jan/16 ] |
|
Perhaps this problem will not exist with Liang's latest patch? It's worth a try. |
| Comment by Doug Oucharek (Inactive) [ 08/Jan/16 ] |
|
Jay: once this has landed to master (inspected, tested, etc.), it can be ported. |
| Comment by James A Simmons [ 11/Jan/16 ] |
|
I'm of the opinion that this should be a blocker. Currently, without this work, it is possible for an IB leaf switch reboot to take down all the LNet routers. Because of this, it needs to be slated for 2.8 landing. |
| Comment by Doug Oucharek (Inactive) [ 11/Jan/16 ] |
|
Is this still true with 14600 reverted? |
| Comment by James A Simmons [ 13/Jan/16 ] |
|
I just tested this patch with 14600 reverted on our Cray system, and the routers crashed again. So this is still a serious problem. If we lose an IB leaf switch, we lose the entire file system. IMHO this should be a blocker. |
| Comment by Liang Zhen (Inactive) [ 13/Jan/16 ] |
|
What does the crash look like after reverting 14600? Is it OOM or another assertion? |
| Comment by James A Simmons [ 13/Jan/16 ] |
|
Hmmm. Doesn't appear to be LNet related: 2016-01-13T15:56:22.464787-05:00 c0-0c0s7n1 LustreError: 167-0: sultan-OST000e-osc-ffff880405221400: This client was evicted by sultan-OST000e; in progress operations using this service will fail. |
| Comment by Gerrit Updater [ 18/Jan/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17892/ |
| Comment by Peter Jones [ 18/Jan/16 ] |
|
Landed for 2.8 |
| Comment by Dmitry Eremin (Inactive) [ 07/Apr/16 ] |
|
Jay, this patch is under review now, so it will soon land on b2_7_fe as well. |
| Comment by Doug Oucharek (Inactive) [ 07/Apr/16 ] |
|
Jay, it has been ported to 2.7 FE as: http://review.whamcloud.com/#/c/18051/. |
| Comment by Jay Lan (Inactive) [ 08/Apr/16 ] |
|
Doug, although the patch is posted at http://review.whamcloud.com/#/c/18051/ , it was probably not generated against b2_7_fe? I encountered non-trivial conflicts at |