[LU-11931] RDMA packets sent from client to MGS are timing out Created: 05/Feb/19  Updated: 22/May/19  Resolved: 21/Apr/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0
Fix Version/s: Lustre 2.13.0, Lustre 2.12.1

Type: Bug Priority: Critical
Reporter: James A Simmons Assignee: Amir Shehata (Inactive)
Resolution: Fixed Votes: 0
Labels: ORNL
Environment:

Cray CLE6 system running 2.11 clients with 2.11 servers.


Issue Links:
Related
is related to LU-10291 remove concurrent_sends tunable Resolved
is related to LU-12279 client got evicted due to network issue. Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

In a production system we have seen the following errors, which are causing clients to be evicted.

[85895.120239] LNetError: 18866:0:(o2iblnd_cb.c:3271:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 8 seconds

[85895.130310] LNetError: 18866:0:(o2iblnd_cb.c:3346:kiblnd_check_conns()) Timed out RDMA with 10.10.32.227@o2ib2 (51): c: 0, oc: 0, rc: 8

[123887.254790] Lustre: MGS: haven't heard from client 51aa0ab0-3f34-cf7e-2fef-01e9ddcd4448 (at 732@gni4) in 227 seconds. I think it's dead, and I am evicting it. exp ffff961d87b9a000, cur 1547261222 expire 1547261072 last 1547260995
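
The counters at the end of the "Timed out RDMA" line are easier to compare across many nodes when extracted programmatically. Below is a minimal Python sketch that assumes the message layout shown above. The function name parse_rdma_timeout and the dictionary keys are illustrative only. Reading c, oc and rc as the connection's credits, outstanding credits and reserved credits is an assumption based on the o2iblnd source, not something stated in this ticket, and the value in parentheses is captured without interpretation.

```python
import re

# Pattern for the kiblnd_check_conns() message quoted above.
RDMA_TIMEOUT_RE = re.compile(
    r"Timed out RDMA with (?P<nid>\S+) \((?P<paren>\d+)\): "
    r"c: (?P<c>\d+), oc: (?P<oc>\d+), rc: (?P<rc>\d+)"
)


def parse_rdma_timeout(line):
    """Return the peer NID and counters from a 'Timed out RDMA' line, or None."""
    m = RDMA_TIMEOUT_RE.search(line)
    if m is None:
        return None
    return {
        "nid": m.group("nid"),
        "paren": int(m.group("paren")),  # value in parentheses, left uninterpreted
        "c": int(m.group("c")),
        "oc": int(m.group("oc")),
        "rc": int(m.group("rc")),
    }


line = (
    "[85895.130310] LNetError: 18866:0:(o2iblnd_cb.c:3346:kiblnd_check_conns()) "
    "Timed out RDMA with 10.10.32.227@o2ib2 (51): c: 0, oc: 0, rc: 8"
)
info = parse_rdma_timeout(line)
print(info)
```

For the line in this report the parsed c value is 0, which under the assumption above would mean no send credits were available for that peer when the timeout fired.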

Our setup has two back-end file systems: F1, which runs a 2.8.2 server stack, and F2, which runs a 2.11 server stack with ZFS (0.7.12). The clients all run the Cray 2.11 client. The LNet configuration is:

F1 file system server backend with 2.8.2 stack, ldiskfs:

    map_on_demand:0

    concurrent_sends:0

    peer_credits:8

F2 file system server 2.11 (ZFS 0.7.12)

    map_on_demand:1

    concurrent_sends:63

    peer_credits:8

C3 (cray 2.11 router)

   map_on_demand:0

   concurrent_sends:16

   peer_credits:8 (o2ib)

   peer_credits:16 (gni).

C4 (cray 2.11 router)

   map_on_demand:0

   concurrent_sends:63

   peer_credits:8 (o2ib)

   peer_credits:16 (gni)

Currently the problems are only seen with 2.11 clients on the 2.11 file system. Since F1 is 2.8 and its peer credits are set to 8, this impacts the rest of the systems.



 Comments   
Comment by Peter Jones [ 06/Feb/19 ]

James

So do I understand correctly that F2 is vanilla 2.11.0 code whereas C3 and C4 have patches applied by Cray (some of which may not have landed to master yet)?

Amir

What do you advise here?

Peter

Comment by Amir Shehata (Inactive) [ 06/Feb/19 ]

James and I investigated this issue. We currently suspect it is due to peer_credits being set to 8. There has been a change in 2.11:
LU-10459 lnd: throttle tx based on queue depth

which throttles based on the queue depth as opposed to concurrent_sends, which has been removed. In 2.8 the concurrent_sends value would get set to 16. We're trying to do two things: 1) check whether we can bump the queue depth to 16 (this will require some interop testing with 2.8 on their test cluster); 2) see if we need to change the throttling code.
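
To make the difference between the two throttles concrete, here is a toy Python model. It is not the o2iblnd code. It assumes a single connection, a burst of txs that are all ready at once, no completions during the burst, and a queue depth that follows peer_credits. The function postable and its parameters are hypothetical names. The values 8 and 16 are the peer_credits setting and the 2.8 concurrent_sends value mentioned above.

```python
def postable(ready_txs, limit):
    """Split a burst into (posted, left_queued) under a cap on in-flight sends."""
    posted = min(ready_txs, limit)
    return posted, ready_txs - posted


burst = 20  # hypothetical number of txs ready at the same time

# 2.11-style throttle: the cap is the queue depth (assumed equal to peer_credits = 8).
print(postable(burst, limit=8))    # (8, 12)

# 2.8-style throttle: the cap is concurrent_sends (16, per the comment above).
print(postable(burst, limit=16))   # (16, 4)
```

In this simplified picture, the queue-depth throttle leaves more txs waiting on the tx queue for the same burst, which matches the suspicion that they then sit long enough to hit the timeout.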

Comment by Gerrit Updater [ 06/Feb/19 ]

Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34200
Subject: LU-11931 lnd: relax throttling limit
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 69f5e8e08a19f6fba884cf571de4b9a1307bed9a

Comment by James A Simmons [ 08/Mar/19 ]

In our testing of the patch we did see this:

[root@f2-util01 14:46:16]#
Message from syslogd@f2-util01.ncrc.gov at Mar 7 15:11:37 ...
kernel:LNetError: 30311:0:(o2iblnd_cb.c:1019:kiblnd_check_sends_locked()) ASSERTION( conn->ibc_nsends_posted <= conn->ibc_queue_depth ) failed:
Message from syslogd@f2-util01.ncrc.gov at Mar 7 15:11:37 ...
kernel:LNetError: 30312:0:(o2iblnd_cb.c:1019:kiblnd_check_sends_locked()) ASSERTION( conn->ibc_nsends_posted <= conn->ibc_queue_depth ) failed:
Message from syslogd@f2-util01.ncrc.gov at Mar 7 15:11:37 ...
kernel:LNetError: 30312:0:(o2iblnd_cb.c:1019:kiblnd_check_sends_locked()) LBUG
Message from syslogd@f2-util01.ncrc.gov at Mar 7 15:11:37 ...
kernel:LNetError: 30305:0:(o2iblnd_cb.c:1019:kiblnd_check_sends_locked()) ASSERTION( conn->ibc_nsends_posted <= conn->ibc_queue_depth ) failed:
Message from syslogd@f2-util01.ncrc.gov at Mar 7 15:11:37 ...
kernel:LNetError: 30305:0:(o2iblnd_cb.c:1019:kiblnd_check_sends_locked()) LBUG
Message from syslogd@f2-util01.ncrc.gov at Mar 7 15:11:37 ...
kernel:LNetError: 30311:0:(o2iblnd_cb.c:1019:kiblnd_check_sends_locked()) LBUG
Message from syslogd@f2-util01.ncrc.gov at Mar 7 15:11:37 ...
kernel:Kernel panic - not syncing: LBUG

Comment by Amir Shehata (Inactive) [ 08/Mar/19 ]

Yes, this assert needs to change. However, I'm now considering that it might be a good idea to bring back concurrent_sends. Initially, I thought it was enough to limit the number of txs by queue depth, but it seems that in order to saturate the link you might want to increase concurrent_sends to more than the queue depth. This will lead to queued txs, but it might be necessary to make sure that we maximize the bandwidth.
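
The interaction between a send cap above the queue depth and the assertion reported earlier can be illustrated with a small Python model. This is a sketch under assumptions, not the kernel logic: Conn, try_post, send_cap and run are hypothetical names, and only the checked condition nsends_posted <= queue_depth is taken from the assertion text in the log.

```python
from dataclasses import dataclass


@dataclass
class Conn:
    queue_depth: int        # stands in for conn->ibc_queue_depth
    send_cap: int           # hypothetical cap on concurrently posted sends
    nsends_posted: int = 0  # stands in for conn->ibc_nsends_posted

    def try_post(self):
        """Post one send if under the cap; return True if it was posted."""
        if self.nsends_posted >= self.send_cap:
            return False
        self.nsends_posted += 1
        return True

    def invariant_holds(self):
        # Same condition as the ASSERTION line in the log.
        return self.nsends_posted <= self.queue_depth


def run(queue_depth, send_cap, attempts):
    """Try to post `attempts` sends and report (posted, invariant still true?)."""
    conn = Conn(queue_depth, send_cap)
    for _ in range(attempts):
        conn.try_post()
    return conn.nsends_posted, conn.invariant_holds()


print(run(queue_depth=8, send_cap=8, attempts=20))   # (8, True)
print(run(queue_depth=8, send_cap=16, attempts=20))  # (16, False)
```

In this model, once the cap is allowed to exceed the queue depth, the asserted condition no longer holds, which is consistent with the remark that the assert needs to change.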

 

Comment by Gerrit Updater [ 11/Mar/19 ]

Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34396
Subject: LU-11931 lnd: bring back concurrent_sends
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2dead91d20b77e6c279aa1ca048b53d6f5617b10

Comment by James A Simmons [ 05/Apr/19 ]

We have released this patch into our production server system, and it has resolved the peer credit starvation issues on the MGS that were causing client evictions. The workaround before the patch was to remove a batch of client nodes until the evictions stopped. Now with the patch in production we have all the clients back in use. Please consider landing this for 2.12 LTS.

Comment by Peter Jones [ 05/Apr/19 ]

We'll land it to b2_12 as soon as it's landed to master. ATM it's the -1 review from you gating that. Are you willing to reconsider that -1 in light of the success of the patch or did you actually revise the patch as you have suggested before applying it in production?

Comment by James A Simmons [ 05/Apr/19 ]

Actually there are two patches. I nacked one patch, but I like the other patch, which restored the concurrent_sends functionality. The other patch is also what we ended up running in production.

Comment by Peter Jones [ 06/Apr/19 ]

Ah, I see. So then we'll press to get 34396 landed, and 34200 should probably move to being tracked under a new Jira reference. That way it'll be simplest to ensure that we get the desired patch into 2.12.1.

Comment by Gerrit Updater [ 11/Apr/19 ]

Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34646
Subject: LU-11931 lnd: bring back concurrent_sends
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: f08724454d919717e8a70555cf1194acddc731ad

Comment by Gerrit Updater [ 21/Apr/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34396/
Subject: LU-11931 lnd: bring back concurrent_sends
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 83e45ead69babfb2909a3157f054fcd8fdf33360

Comment by Gerrit Updater [ 21/Apr/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34646/
Subject: LU-11931 lnd: bring back concurrent_sends
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 056fe83188f0a24de9e27248a7574c8fae867163

Comment by Peter Jones [ 21/Apr/19 ]

OK. Main patch landed for 2.13. 34200 should be tracked under a new Jira ticket.

Generated at Sat Feb 10 02:48:11 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.