[LU-7569] IB leaf switch caused LNet routers to crash Created: 16/Dec/15  Updated: 24/Oct/17  Resolved: 18/Jan/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Major
Reporter: James A Simmons Assignee: Doug Oucharek (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-7390 Router memory leak if we start a new ... Open
is related to LU-7646 Infinite CON RACE Condition after reb... Resolved
is related to LU-5718 RDMA too fragmented with router Resolved
is related to LU-7210 ASSERTION( peer->ibp_connecting == 0 ) Resolved
is related to LU-3322 ko2iblnd support for different map_on... Resolved
is related to LU-7314 In kiblnd_rejected(), NULL pointer 'c... Resolved
is related to LU-7676 OSS Servers stuck in connecting/disco... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

During testing we lost one of the IB leaf switches, which caused all of our Lustre routers to crash with the following error:

2015-12-11T10:53:29.539273-05:00 c0-0c0s2n3 LNetError: 4675:0:(o2iblnd.c:399:kiblnd_find_peer_locked()) ASSERTION( peer->ibp_connecting > 0 || peer->ibp_accepting > 0 || !list_empty(&peer->ibp_conns) ) failed:
2015-12-11T10:53:29.539305-05:00 c0-0c0s2n3 LNetError: 4675:0:(o2iblnd.c:399:kiblnd_find_peer_locked()) LBUG
2015-12-11T10:53:29.539313-05:00 c0-0c0s2n3 Pid: 4675, comm: kgnilnd_sd_02
2015-12-11T10:53:29.539319-05:00 c0-0c0s2n3 Call Trace:
2015-12-11T10:53:29.539324-05:00 c0-0c0s2n3 [<ffffffff81006651>] try_stack_unwind+0x161/0x1a0
2015-12-11T10:53:29.539332-05:00 c0-0c0s2n3 [<ffffffff81004eb9>] dump_trace+0x89/0x430
2015-12-11T10:53:29.539339-05:00 c0-0c0s2n3 [<ffffffffa025bac0>] lbug_with_loc+0x90/0x1d0 [libcfs]
2015-12-11T10:53:29.539348-05:00 c0-0c0s2n3 [<ffffffffa036f32b>] kiblnd_find_peer_locked+0x14b/0x150 [ko2iblnd]
2015-12-11T10:53:29.539358-05:00 c0-0c0s2n3 [<ffffffffa036f379>] kiblnd_query+0x49/0x1c0 [ko2iblnd]
2015-12-11T10:53:29.539364-05:00 c0-0c0s2n3 [<ffffffffa02d5aee>] lnet_post_send_locked+0x2ee/0x740 [lnet]
2015-12-11T10:53:29.539369-05:00 c0-0c0s2n3 [<ffffffffa02d84f0>] lnet_send+0x6a0/0xcf0 [lnet]
2015-12-11T10:53:29.539375-05:00 c0-0c0s2n3 [<ffffffffa02cbe94>] lnet_finalize+0x424/0x800 [lnet]
2015-12-11T10:53:29.539380-05:00 c0-0c0s2n3 [<ffffffffa03d256b>] kgnilnd_recv+0x73b/0xdf0 [kgnilnd]
2015-12-11T10:53:29.539385-05:00 c0-0c0s2n3 [<ffffffffa02d432f>] lnet_ni_recv+0xcf/0x330 [lnet]
2015-12-11T10:53:29.539389-05:00 c0-0c0s2n3 [<ffffffffa02dac26>] lnet_parse+0x3c6/0xe40 [lnet]
2015-12-11T10:53:29.539394-05:00 c0-0c0s2n3 [<ffffffffa03d8111>] kgnilnd_check_fma_rx+0x1af1/0x1f50 [kgnilnd]
2015-12-11T10:53:29.539406-05:00 c0-0c0s2n3 [<ffffffffa03dbbc4>] kgnilnd_process_conns+0x554/0x15d0 [kgnilnd]
2015-12-11T10:53:29.539411-05:00 c0-0c0s2n3 [<ffffffffa03dcf1e>] kgnilnd_scheduler+0x2de/0x5f0 [kgnilnd]
2015-12-11T10:53:29.539416-05:00 c0-0c0s2n3 [<ffffffff81067ace>] kthread+0x9e/0xb0
2015-12-11T10:53:29.539421-05:00 c0-0c0s2n3 [<ffffffff81490074>] kernel_thread_helper+0x4/0x10
2015-12-11T10:53:29.539427-05:00 c0-0c0s2n3 Kernel panic - not syncing: LBUG
2015-12-11T10:53:29.539432-05:00 c0-0c0s2n3 Pid: 4675, comm: kgnilnd_sd_02 Tainted: P 3.0.101-0.46.1_1.0502.8871-cray_gem_s #1
2015-12-11T10:53:29.539437-05:00 c0-0c0s2n3 Call Trace:
2015-12-11T10:53:29.539442-05:00 c0-0c0s2n3 [<ffffffff81006651>] try_stack_unwind+0x161/0x1a0
2015-12-11T10:53:29.539447-05:00 c0-0c0s2n3 [<ffffffff81004eb9>] dump_trace+0x89/0x430
2015-12-11T10:53:29.539452-05:00 c0-0c0s2n3 [<ffffffff810060bc>] show_trace_log_lvl+0x5c/0x80
2015-12-11T10:53:29.539457-05:00 c0-0c0s2n3 [<ffffffff810060f5>] show_trace+0x15/0x20
2015-12-11T10:53:29.539462-05:00 c0-0c0s2n3 [<ffffffff8148b31c>] dump_stack+0x79/0x84
2015-12-11T10:53:29.539467-05:00 c0-0c0s2n3 [<ffffffff8148b3bb>] panic+0x94/0x1da
2015-12-11T10:53:29.539473-05:00 c0-0c0s2n3 [<ffffffffa025bbf1>] lbug_with_loc+0x1c1/0x1d0 [libcfs]
2015-12-11T10:53:29.539479-05:00 c0-0c0s2n3 [<ffffffffa036f32b>] kiblnd_find_peer_locked+0x14b/0x150 [ko2iblnd]
2015-12-11T10:53:29.539484-05:00 c0-0c0s2n3 [<ffffffffa036f379>] kiblnd_query+0x49/0x1c0 [ko2iblnd]
2015-12-11T10:53:29.539489-05:00 c0-0c0s2n3 [<ffffffffa02d5aee>] lnet_post_send_locked+0x2ee/0x740 [lnet]
2015-12-11T10:53:29.539494-05:00 c0-0c0s2n3 [<ffffffffa02d84f0>] lnet_send+0x6a0/0xcf0 [lnet]
2015-12-11T10:53:29.539500-05:00 c0-0c0s2n3 [<ffffffffa02cbe94>] lnet_finalize+0x424/0x800 [lnet]
2015-12-11T10:53:29.539505-05:00 c0-0c0s2n3 [<ffffffffa03d256b>] kgnilnd_recv+0x73b/0xdf0 [kgnilnd]
2015-12-11T10:53:29.539511-05:00 c0-0c0s2n3 [<ffffffffa02d432f>] lnet_ni_recv+0xcf/0x330 [lnet]
2015-12-11T10:53:29.539519-05:00 c0-0c0s2n3 [<ffffffffa02dac26>] lnet_parse+0x3c6/0xe40 [lnet]
2015-12-11T10:53:29.539526-05:00 c0-0c0s2n3 [<ffffffffa03d8111>] kgnilnd_check_fma_rx+0x1af1/0x1f50 [kgnilnd]
2015-12-11T10:53:29.539532-05:00 c0-0c0s2n3 [<ffffffffa03dbbc4>] kgnilnd_process_conns+0x554/0x15d0 [kgnilnd]
2015-12-11T10:53:29.539537-05:00 c0-0c0s2n3 [<ffffffffa03dcf1e>] kgnilnd_scheduler+0x2de/0x5f0 [kgnilnd]
2015-12-11T10:53:29.539544-05:00 c0-0c0s2n3 [<ffffffff81067ace>] kthread+0x9e/0xb0
2015-12-11T10:53:29.539550-05:00 c0-0c0s2n3 [<ffffffff81490074>] kernel_thread_helper+0x4/0x10



 Comments   
Comment by Jian Yu [ 17/Dec/15 ]

Hi Amir,

Could you please advise? Thank you.

Comment by Jeremy Filizetti [ 17/Dec/15 ]

I've seen a handful of suspect ways these conditional checks on ibp_connecting, ibp_accepting and ibp_conns can become incorrect when things are slow to respond and connections get rejected. With the inclusion of LU-3322, rejections can be common, and I think this is exposing some of these problems with the reconnect logic, and even more so with the conn race patch http://review.whamcloud.com/14600/.
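
As context for the assertion in the trace above, here is a minimal illustrative sketch (hypothetical struct and function names, not the actual ko2iblnd source) of the invariant that kiblnd_find_peer_locked() asserts: a peer that is still visible in the peer table must have an active connect in flight, a passive accept in flight, or at least one established connection. If a rejection or cleanup path drops the last of these states without also unlinking the peer, the next lookup hits the LBUG.

#include <linux/list.h>
#include <linux/types.h>

/* Illustrative sketch only (hypothetical names), not the Lustre source. */
struct kib_peer_sketch {
        struct list_head ibp_conns;      /* established connections */
        int              ibp_connecting; /* active connection attempts in flight */
        int              ibp_accepting;  /* passive connection attempts in flight */
};

/* The condition the LBUG above asserts while the peer is still findable. */
static bool kib_peer_state_is_sane(struct kib_peer_sketch *peer)
{
        return peer->ibp_connecting > 0 ||
               peer->ibp_accepting  > 0 ||
               !list_empty(&peer->ibp_conns);
}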

Comment by Gerrit Updater [ 17/Dec/15 ]

Liang Zhen (liang.zhen@intel.com) uploaded a new patch: http://review.whamcloud.com/17661
Subject: LU-7569 o2iblnd: multiple fixes for reconnection
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e4777765fc6e4bdc9a9331e139a5884261245eb0

Comment by Liang Zhen (Inactive) [ 17/Dec/15 ]

I've submitted a patch which could be helpful, http://review.whamcloud.com/17661, but I have no environment to test it, so it is only for review for the time being.

Comment by James A Simmons [ 17/Dec/15 ]

Liang, I just rebooted a system with 17661 and we rebooted the leaf switch. It completely worked; no more router oopses on us. Thank you.

Comment by Doug Oucharek (Inactive) [ 17/Dec/15 ]

James, did all the clients re-connect ok? None of them got stuck on reconnecting?

Comment by James A Simmons [ 04/Jan/16 ]

Yes, Doug, they did all reconnect okay. I did find a problem with this patch, though: if I place the following in my modprobe configuration file, I can crash my client nodes.

options ko2iblnd timeout=100 credits=2560 ntx=5120 peer_credits=63 concurrent_sends=63 fmr_pool_size=1280 fmr_flush_trigger=1024 map_on_demand=64

and then run: modprobe lnet; lctl net up

You will then see the following backtrace:
Dec 30 15:02:05 spoon17.ccs.ornl.gov kernel: [ 337.292250] LNetError: 20000:0:(o2iblnd_cb.c:1309:kiblnd_reconnect_peer()) ASSERTION( peer->ibp_connecting == 1 ) failed:
Dec 30 15:02:05 spoon17.ccs.ornl.gov kernel: [ 337.303363] LNetError: 20000:0:(o2iblnd_cb.c:1309:kiblnd_reconnect_peer()) LBUG
Dec 30 15:02:05 spoon17.ccs.ornl.gov kernel: [ 337.310739] Pid: 20000, comm: kiblnd_connd
Dec 30 15:02:05 spoon17.ccs.ornl.gov kernel: [ 337.314891]
Dec 30 15:02:05 spoon17.ccs.ornl.gov kernel: [ 337.314892] Call Trace:
Dec 30 15:02:05 spoon17.ccs.ornl.gov kernel: [ 337.318939] [<ffffffffa0740875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
Dec 30 15:02:05 spoon17.ccs.ornl.gov kernel: [ 337.325962] [<ffffffffa0740e77>] lbug_with_loc+0x47/0xb0 [libcfs]
Dec 30 15:02:05 spoon17.ccs.ornl.gov kernel: [ 337.332212] [<ffffffffa08431c8>] kiblnd_reconnect_peer+0x118/0x150 [ko2iblnd]
Dec 30 15:02:05 spoon17.ccs.ornl.gov kernel: [ 337.339608] [<ffffffffa083aee0>] kiblnd_destroy_conn+0x4c0/0x810 [ko2iblnd]
Dec 30 15:02:05 spoon17.ccs.ornl.gov kernel: [ 337.346784] [<ffffffffa08486b1>] kiblnd_connd+0xc1/0xbc0 [ko2iblnd]
Dec 30 15:02:05 spoon17.ccs.ornl.gov kernel: [ 337.353248] [<ffffffff81064d00>] ? default_wake_function+0x0/0x20

Comment by James A Simmons [ 06/Jan/16 ]

Liang, Doug, have you been able to duplicate my crash?

Comment by Doug Oucharek (Inactive) [ 06/Jan/16 ]

Just a status update on this patch: there are multiple problems with the reconnection code in o2iblnd which we are trying to address here (see the list of related patches). As you can see from Liang's patch, significant changes are being made to the reconnection strategy to address them.

At the moment, I know of two issues with the current version of this patch:

1- Reconnections due to different negotiated parameters can cause an LBUG (what you are finding, James)
2- When an LNet router reboots, an infinite loop of CONN RACE reconnects can ensue if the LNet router has the larger NID value.

I'm working on number 2 with a customer who has run into this. My current theory is that the client tries to reconnect to the router while the router has not completely come up yet. If that connection attempt gets stuck (i.e. client never hears back from it), it can trigger never-ending reconnects.

I'd like the patch for this ticket to address the above two issues so we can kill off many problems at once here.

Comment by James A Simmons [ 06/Jan/16 ]

I think I know why number 1 happens. The function kiblnd_check_reconnect() returns right away if peer->ibp_connecting != 1, so for the checks to actually run we need peer->ibp_connecting == 1. But for some of the checks we end up incrementing ibp_connecting again. I think the logic is reversed from what it should be: since we know peer->ibp_connecting == 1, on a critical failure it should be decremented. I'm testing this change now.
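
A minimal sketch of the counting question being described here (hypothetical names, not the actual kiblnd_check_reconnect() source): the reconnect checks only run while exactly one active connect is outstanding, and the question is which failure paths should increment the count to schedule a retry and which should decrement it to release the peer.

#include <linux/types.h>

/* Illustrative sketch only (hypothetical names), not the Lustre source. */
struct reconnect_peer_sketch {
        int ibp_connecting;     /* active connection attempts in flight */
};

static void check_reconnect_sketch(struct reconnect_peer_sketch *peer,
                                   bool can_retry)
{
        if (peer->ibp_connecting != 1)
                return;                 /* another attempt already owns the retry */

        if (can_retry)
                peer->ibp_connecting++; /* hold the peer for the reconnect */
        else
                peer->ibp_connecting--; /* critical failure: release the peer */
}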

Comment by James A Simmons [ 06/Jan/16 ]

My theory was wrong. Still crashes.

Comment by James A Simmons [ 07/Jan/16 ]

Thanks to Jeremy, who pointed out I was using an old patch. With my fix, the latest version of the patch resolves problem 1 that Doug listed. I haven't run into case 2, so there is no fix for that.

Comment by Doug Oucharek (Inactive) [ 07/Jan/16 ]

That is great news! For issue 2, I have a theoretical fix which will be tested today. From the logs, I am seeing this pattern which causes issue 2:

1- LNet router reboots (for any reason).
2- A client fails to transmit to the router and fails the connection. This cleans up the connection and peer structure.
3- Attempts to create active connections from the client to the router are continuously made until the router is back up.
4- I then see the router trying to create an active connection to the client.
5- For some unknown reason, the client seems to have a connection to the router stuck in a connecting state.
6- In this scenario, the client has the larger NID value, so the client rejects the router's connection attempt as CONN RACE.
7- The router does a reconnect, goto 5.

Thus, we have an infinite loop caused by what appears to be a connection stuck in the connecting state. The code assumes we will always hear back from connection attempts, so connd never cleans the connection up. Why the connection is stuck is unknown to me, but the code should be robust enough to detect this and avoid the infinite loop (i.e. self-healing code).

Note: I am only seeing this with mlx5-based cards.
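
A minimal illustrative sketch of the tie-break being described (hypothetical names, not the actual passive-connect code path): when both sides dial each other at the same time, the side with the larger NID rejects the incoming request as CONN RACE and relies on its own active attempt; if that active attempt silently stalls, every retry from the other side is rejected again, which is the loop in steps 5-7 above.

#include <linux/types.h>

/* Illustrative sketch only (hypothetical names), not the Lustre source. */
static bool reject_as_conn_race(u64 my_nid, u64 peer_nid,
                                bool my_active_connect_pending)
{
        /* Only a race if this node is also actively connecting to the peer. */
        if (!my_active_connect_pending)
                return false;

        /* The larger NID keeps its own active attempt and rejects the
         * incoming one; the smaller NID is expected to back off and retry.
         * If the winner's attempt never completes and is never cleaned up,
         * the loser's retries are rejected indefinitely. */
        return my_nid > peer_nid;
}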

Comment by Liang Zhen (Inactive) [ 08/Jan/16 ]

As the original patch has defects and has been reverted from master, the patch above is now invalid. I have reimplemented it to include the original feature plus improvements: http://review.whamcloud.com/#/c/17892/

For the stuck-connection issue described by Doug, I'm not sure whether it is a general issue or just a bug in a particular version of the mlx5 driver, so this patch does not include the self-healing code Doug mentioned. If you don't hit the stuck issue, then you probably don't need the self-healing code at all.

Comment by James A Simmons [ 08/Jan/16 ]

Doug, since you don't have a solution just yet for the connection issue, could you create a new patch on top of the new one posted by Liang?

Comment by Doug Oucharek (Inactive) [ 08/Jan/16 ]

Ok, will do. I will create a new Jira ticket for that solution so this ticket is free to land Liang's latest patch and be closed.

Comment by James A Simmons [ 08/Jan/16 ]

Perhaps this problem will not exist with Liang's latest patch? It's worth a try.

Comment by Doug Oucharek (Inactive) [ 08/Jan/16 ]

Jay: once this has landed to master (inspected, tested, etc.), it can be ported.

Comment by James A Simmons [ 11/Jan/16 ]

I'm of the opinion that this should be a blocker. Currently, without this work, an IB leaf switch reboot can take down all the LNet routers. Because of this, it needs to be slated for 2.8 landing.

Comment by Doug Oucharek (Inactive) [ 11/Jan/16 ]

Is this still true with 14600 reverted?

Comment by James A Simmons [ 13/Jan/16 ]

I just tested this patch with 14600 reverted on our Cray system and the routers crashed again, so this is still a serious problem. If we lose an IB leaf switch, we lose the entire file system. IMHO this should be a blocker.

Comment by Liang Zhen (Inactive) [ 13/Jan/16 ]

What does the crash look like after reverting 14600? Is it an OOM or another assertion?

Comment by James A Simmons [ 13/Jan/16 ]

Hmmm. It doesn't appear to be LNet related.

2016-01-13T15:56:22.464787-05:00 c0-0c0s7n1 LustreError: 167-0: sultan-OST000e-osc-ffff880405221400: This client was evicted by sultan-OST000e; in progress operations using this service will fail.
2016-01-13T15:56:22.464804-05:00 c0-0c0s7n1 LustreError: Skipped 11 previous similar messages
2016-01-13T15:56:22.514072-05:00 c0-0c0s7n1 Lustre: 2541:0:(llite_lib.c:2628:ll_dirty_page_discard_warn()) sultan: dirty page discard: 10.37.248.67@o2ib1:/sultan/fid: [0x20000a898:0x21:0x0]//stf008/scratch/jsimmons/test_ior/testfile.out may get corrupted (rc -108)
2016-01-13T15:56:22.756634-05:00 c0-0c0s7n1 Lustre: sultan-OST0025-osc-ffff880405221400: Connection restored to 10.37.248.70@o2ib1 (at 10.37.248.70@o2ib1)

Comment by Gerrit Updater [ 18/Jan/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17892/
Subject: LU-7569 o2iblnd: avoid intensive reconnecting
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 9ab698e4d99103b2fecf19b0fd3f90d28723e9d1

Comment by Peter Jones [ 18/Jan/16 ]

Landed for 2.8

Comment by Dmitry Eremin (Inactive) [ 07/Apr/16 ]

Jay, this patch is under review now, so it will soon be landed to b2_7_fe as well.

Comment by Doug Oucharek (Inactive) [ 07/Apr/16 ]

Jay, it has been ported to 2.7 FE as: http://review.whamcloud.com/#/c/18051/.

Comment by Jay Lan (Inactive) [ 08/Apr/16 ]

Doug, although the patch at http://review.whamcloud.com/#/c/18051/ says "b2_7_fe" under "Branch", there is a red remark saying "Cannot Merge" under "Strategy".

Probably the patch was not generated against b2_7_fe? I encountered non-trivial conflicts in:
both modified: lnet/klnds/o2iblnd/o2iblnd.h
both modified: lnet/klnds/o2iblnd/o2iblnd_cb.c
