[LU-12901] Failing to create a properly sized IB queue pair Created: 23/Oct/19 Updated: 20/Dec/21 Resolved: 13/Dec/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.13.0 |
| Fix Version/s: | Lustre 2.14.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | James A Simmons | Assignee: | Serguei Smirnov |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | None | ||
| Environment: |
Seen with newer Mellanox ConnectX-4 devices |
||
| Severity: | 3 | ||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||
| Description |
|
Attempting to bring up a file system in our test bed with the latest Lustre version (2.13), I saw this new error on LNet bring-up:
[ 472.738363] LNet: 8481:0:(o2iblnd_cb.c:3395:kiblnd_check_conns()) Timed out tx for 10.37.248.232@o2ib1: 471 seconds
I found I can lower peer_credits to get around this, but that is not the proper fix.
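For anyone checking whether a node is hitting the same symptom, the following is a minimal sketch that scans kernel log lines for the "Timed out tx" message quoted above and extracts the peer NID and the timeout in seconds. The regular expression is derived only from the single log line in this description; the function name `parse_tx_timeouts` is hypothetical, and other Lustre versions may format the message differently.

```python
import re

# Pattern based solely on the log line quoted in this ticket (an assumption:
# other Lustre releases may word or lay out the message differently).
TIMEOUT_RE = re.compile(
    r"kiblnd_check_conns\(\)\)\s+Timed out tx for\s+"
    r"(?P<nid>\S+):\s+(?P<seconds>\d+)\s+seconds"
)


def parse_tx_timeouts(lines):
    """Yield (nid, seconds) for every 'Timed out tx' line in `lines`."""
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            yield match.group("nid"), int(match.group("seconds"))
```

The function accepts any iterable of strings, so it can be fed the lines of saved `dmesg` output or a log file.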
|
| Comments |
| Comment by Serguei Smirnov [ 23/Oct/19 ] |
|
Hi simmonsja,
A few questions:
Do you see it happen with MLNX 5?
Do you see the issue occur immediately at start-up?
Do you see the issue occur on a server with any number of client connections?
Thanks,
Serguei.
|
| Comment by Nathan Crawford [ 06/Nov/19 ] |
|
Also seeing this with a Lustre 2.12.3 client on CentOS 7.6 (3.10.0-957.27.2.el7.x86_64), with both ConnectX-4 and ConnectX-5 on client nodes. Server nodes are running Lustre 2.10.6 on CentOS 7.5 (3.10.0-862.14.4.el7.x86_64). Servers are also mixed Mellanox EDR and Intel/Qlogic QDR. All nodes are using the in-box RDMA drivers. As most client nodes are still on Intel QDR, we previously set the ib0 peer_credits to 62 across all nodes. Lowering peer_credits to 42 on the problematic clients allows the file system to be mounted. |
| Comment by James A Simmons [ 06/Nov/19 ] |
|
Sorry, we are also having issues with our IB switch, so I'm currently not using our IB network. |
| Comment by Karsten Weiss [ 23/Jan/20 ] |
|
I also still see this on a Lustre client v2_12_3-98-g6db0c4f082 (+ |
| Comment by Michael Neff [ 11/Feb/20 ] |
|
I also see this on a Lustre client with CentOS 7.7 and ConnectX-6 using Mellanox OFED 4.7. |
| Comment by Jeff Johnson [ 28/Apr/20 ] |
|
I see this issue as well. CentOS 7.8, Lustre 2.13.0, MOFED 5.0-2.1.8, ConnectX-6. Setting peer_credits to 128 fails as described. Lowering peer_credits to 48 results in a functioning LNet.
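Since the workaround reported in this thread depends on which peer_credits value is actually in effect, here is a minimal sketch for reading it on a running node. It assumes the ko2iblnd module is loaded and exposes its parameters under `/sys/module/ko2iblnd/parameters`; the function name `read_peer_credits` is hypothetical, and the directory is a parameter so the path can be adjusted if a system differs.

```python
from pathlib import Path


def read_peer_credits(params_dir="/sys/module/ko2iblnd/parameters"):
    """Return the current peer_credits value as an int, or None if unavailable.

    Assumes the o2iblnd module parameters are visible in sysfs at
    `params_dir`; returns None when the module is not loaded or the
    file does not contain an integer.
    """
    path = Path(params_dir) / "peer_credits"
    try:
        return int(path.read_text().strip())
    except (FileNotFoundError, ValueError):
        return None
```

A return value of None most likely means the module is not loaded. Comparing the reported value across clients and servers can help confirm that a lowered setting, such as the 42 or 48 mentioned in the comments here, actually took effect.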
|
| Comment by Karsten Weiss [ 25/Jun/20 ] |
|
May I suggest changing the "Affects Version/s" attribute of this bug from 2.13.0 to 2.12.x (including 2.12.5, which is an LTS release)? See e.g. the comments here or the reports on lustre-discuss. |
| Comment by Gerrit Updater [ 24/Nov/20 ] |
|
Serguei Smirnov (ssmirnov@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40748 |
| Comment by Gerrit Updater [ 13/Dec/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40748/ |
| Comment by Peter Jones [ 13/Dec/20 ] |
|
Landed for 2.14 |
| Comment by Gerrit Updater [ 20/Dec/21 ] |
|
"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/45901 |