[LU-11931] RDMA packets sent from client to MGS are timing out Created: 05/Feb/19 Updated: 22/May/19 Resolved: 21/Apr/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.11.0 |
| Fix Version/s: | Lustre 2.13.0, Lustre 2.12.1 |
| Type: | Bug | Priority: | Critical |
| Reporter: | James A Simmons | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | ORNL |
| Environment: | Cray CLE6 system running 2.11 clients with 2.11 servers. |
| Issue Links: |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
We have seen the following errors on a production system, which are causing clients to be evicted:

[85895.120239] LNetError: 18866:0:(o2iblnd_cb.c:3271:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 8 seconds
[85895.130310] LNetError: 18866:0:(o2iblnd_cb.c:3346:kiblnd_check_conns()) Timed out RDMA with 10.10.32.227@o2ib2 (51): c: 0, oc: 0, rc: 8
[123887.254790] Lustre: MGS: haven't heard from client 51aa0ab0-3f34-cf7e-2fef-01e9ddcd4448 (at 732@gni4) in 227 seconds. I think it's dead, and I am evicting it. exp ffff961d87b9a000, cur 1547261222 expire 1547261072 last 1547260995

Our setup has two back-end file systems: F1, which runs a 2.8.2 server stack with ldiskfs, and F2, which runs a 2.11 server stack with ZFS (0.7.12). The clients are all running 2.11 Cray clients. The LNet configuration is:

F1 (file system server backend, 2.8.2 stack, ldiskfs): map_on_demand:0 concurrent_sends:0 peer_credits:8
F2 (file system server, 2.11, ZFS 0.7.12): map_on_demand:1 concurrent_sends:63 peer_credits:8
C3 (Cray 2.11 router): map_on_demand:0 concurrent_sends:16 peer_credits:8 (o2ib), peer_credits:16 (gni)
C4 (Cray 2.11 router): map_on_demand:0 concurrent_sends:63 peer_credits:8 (o2ib), peer_credits:16 (gni)

Currently the problems are only seen with 2.11 clients against the 2.11 file system. Since F1 is 2.8 and its peer credits are set to 8, this impacts the rest of the systems. |
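For reference, the map_on_demand, concurrent_sends and peer_credits values listed above are o2iblnd module parameters. As a minimal sketch (the file name and placement below are illustrative assumptions, not taken from the actual systems), the F2-style settings would typically live in a modprobe options file along these lines:

    # /etc/modprobe.d/lustre-lnet.conf (illustrative sketch only)
    # o2iblnd tunables matching the F2 server values listed above
    options ko2iblnd map_on_demand=1 concurrent_sends=63 peer_credits=8

The gni-side peer_credits on the C3/C4 routers would be set through the corresponding gni LND module options, which are not shown here.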
| Comments |
| Comment by Peter Jones [ 06/Feb/19 ] |
|
James
So do I understand correctly that F2 is vanilla 2.11.0 code, whereas C3 and C4 have patches applied by Cray (some of which may not have landed to master yet)?

Amir
What do you advise here?

Peter |
| Comment by Amir Shehata (Inactive) [ 06/Feb/19 ] |
|
James and I investigated this issue. We currently suspect it is due to peer_credits being set to 8. There has been a change in 2.11: throttling is now based on the queue depth rather than on concurrent_sends, which has been removed. In 2.8 the concurrent_sends value would get set to 16. We're trying to do two things: 1) check whether we can bump the queue depth to 16 (this will require some interop testing with 2.8 on their test cluster), and 2) see if we need to change the throttling code. |
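As a rough illustration of option 1 (the value here is hypothetical and, as noted, would need the interop testing against the 2.8 servers): as far as I understand, the o2iblnd queue depth that 2.11 now throttles on is derived from peer_credits, so bumping peer_credits is the knob that raises it.

    # Illustrative sketch only: raise peer_credits on the 2.11 side so the
    # negotiated queue depth goes from 8 to 16; whether mixed 2.8/2.11 peers
    # handle this cleanly is exactly what the interop testing needs to confirm.
    options ko2iblnd peer_credits=16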
| Comment by Gerrit Updater [ 06/Feb/19 ] |
|
Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34200 |
| Comment by James A Simmons [ 08/Mar/19 ] |
|
In our testing of the patch we did see this: [root@f2-util01 14:46:16]# |
| Comment by Amir Shehata (Inactive) [ 08/Mar/19 ] |
|
Yes, this assert needs to change. Although I'm now considering that it might be a good idea to bring back concurrent_sends. Initially I thought it was enough to limit the number of txs by the queue depth, but it seems that in order to saturate the link you might want to increase concurrent_sends beyond the queue depth. This will lead to queued txs, but it may be necessary to make sure we maximize the bandwidth.
|
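In tunable terms, a sketch of what that could look like once concurrent_sends is restored (hypothetical values; 16 is simply the value the ticket notes 2.8 used to derive): setting concurrent_sends above the queue depth lets the extra txs wait on the tx queue instead of capping outstanding sends at the queue depth.

    # Hypothetical example for a restored concurrent_sends tunable:
    # allow more outstanding sends than the queue depth implied by peer_credits,
    # so the link stays saturated while the excess txs sit on the tx queue.
    options ko2iblnd peer_credits=8 concurrent_sends=16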
| Comment by Gerrit Updater [ 11/Mar/19 ] |
|
Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34396 |
| Comment by James A Simmons [ 05/Apr/19 ] |
|
We have deployed this patch on our production server system and it has resolved the peer credit starvation issues on the MGS that were causing client evictions. The workaround before the patch was to remove a batch of client nodes until the evictions stopped. Now, with the patch in production, we have all the clients back in use. Please consider landing this for the 2.12 LTS. |
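For anyone hitting the same symptom, a general way to spot this kind of peer credit starvation (a diagnostic suggestion, not a step taken from this ticket) is to watch the per-peer tx credit counters on the MGS node; the path depends on the Lustre/kernel version:

    # Older setups expose the peer table under procfs:
    cat /proc/sys/lnet/peers
    # Newer setups expose the same table under debugfs:
    cat /sys/kernel/debug/lnet/peers
    # A persistently low or negative "min" value in the tx credit columns for a
    # peer NID means sends are being queued for lack of credits.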
| Comment by Peter Jones [ 05/Apr/19 ] |
|
We'll land it to b2_12 as soon as it has landed to master. At the moment it's the -1 review from you that is gating that. Are you willing to reconsider that -1 in light of the success of the patch, or did you actually revise the patch as you had suggested before applying it in production? |
| Comment by James A Simmons [ 05/Apr/19 ] |
|
Actually, there are two patches. I gave the one patch a -1, but I like the other patch, which restored the concurrent_sends functionality. The other patch is also what we ended up running in production. |
| Comment by Peter Jones [ 06/Apr/19 ] |
|
Ah, I see. So then we'll press to get 34396 landed, and 34200 should probably move to being tracked under a new Jira reference. That way it'll be simplest to ensure that we get the desired patch into 2.12.1. |
| Comment by Gerrit Updater [ 11/Apr/19 ] |
|
Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34646 |
| Comment by Gerrit Updater [ 21/Apr/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34396/ |
| Comment by Gerrit Updater [ 21/Apr/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34646/ |
| Comment by Peter Jones [ 21/Apr/19 ] |
|
OK. The main patch landed for 2.13. 34200 should be tracked under a new Jira ticket. |