[LU-12901] Failing to create a properly sized IB queue pair Created: 23/Oct/19  Updated: 20/Dec/21  Resolved: 13/Dec/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0
Fix Version/s: Lustre 2.14.0

Type: Bug Priority: Major
Reporter: James A Simmons Assignee: Serguei Smirnov
Resolution: Fixed Votes: 1
Labels: None
Environment:

Seen with newer Mellanox ConnectX-4 devices


Issue Links:
Related
is related to LU-10213 o2iblnd: Potential discrepancy when a... Resolved
is related to LU-7124 MLX5: Limit hit in cap.max_send_wr Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

While attempting to bring up a file system in our test bed with the latest Lustre version (2.13), I saw this new error during LNet bring-up.

[ 472.738363] LNet: 8481:0:(o2iblnd_cb.c:3395:kiblnd_check_conns()) Timed out tx for 10.37.248.232@o2ib1: 471 seconds
[ 473.739295] LNetError: 2014:0:(o2iblnd.c:929:kiblnd_create_conn()) Can't create QP: -12, send_wr: 16317, recv_wr: 128, send_sge: 2, recv_sge: 1

I found that lowering peer_credits works around the problem, but that is not the proper fix.
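
Error -12 is -ENOMEM: the HCA refused the requested queue sizes (here send_wr = 16317, a queue depth o2iblnd sizes from peer_credits, which is why lowering peer_credits avoids the failure). For reference, peer_credits is a ko2iblnd module parameter, so the workaround is typically applied via modprobe options; a minimal sketch, with a hypothetical file name and the value 42 taken from the reports in the comments below, purely illustrative:

# /etc/modprobe.d/ko2iblnd.conf (workaround only, not the fix)
options ko2iblnd peer_credits=42
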
 Comments   
Comment by Serguei Smirnov [ 23/Oct/19 ]

Hi simmonsja,

A few questions:

Do you see it happen with MLNX 5?

Do you see the issue occur immediately at start-up?

Do you see the issue occur on a server with any number of client connections?

Thanks,

Serguei.

 

Comment by Nathan Crawford [ 06/Nov/19 ]

Also seeing this with a Lustre 2.12.3 client on CentOS 7.6 (3.10.0-957.27.2.el7.x86_64), with both ConnectX-4 and -5 on the client nodes.

Server nodes are running Lustre 2.10.6 on CentOS 7.5 (3.10.0-862.14.4.el7.x86_64). Servers are also a mix of Mellanox EDR and Intel/QLogic QDR.

All nodes are using the in-box RDMA drivers.

As most client nodes are still on Intel QDR, we previously set the ib0 peer_credits to 62 across all nodes. Lowering peer_credits to 42 on the problematic clients allows mounting of the file system.

Comment by James A Simmons [ 06/Nov/19 ]

Sorry, we are also having issues with our IB switch, so currently I'm not using our IB network.

Comment by Karsten Weiss [ 23/Jan/20 ]

I also still see this on a Lustre client v2_12_3-98-g6db0c4f082 (plus the LU-12637 patch) on CentOS 8.1 (4.18.0-147.3.1.el8_1.x86_64) with ConnectX-4, using the CentOS RDMA drivers (mlx5_core). I can also confirm that peer_credits=42 (instead of 63) works.

Comment by Michael Neff [ 11/Feb/20 ]

I also see this on a Lustre client with CentOS 7.7 and ConnectX-6 using Mellanox OFED 4.7.

Comment by Jeff Johnson [ 28/Apr/20 ]

I see this issue as well. CentOS 7.8, Lustre 2.13.0, MOFED 5.0-2.1.8, ConnectX-6. Setting peer_credits to 128 fails as described. Lowering peer_credits to 48 results in a functioning LNet.
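
As a diagnostic aside: this -ENOMEM typically means the requested send_wr exceeds what the HCA will accept (its advertised max_qp_wr capability, or a lower effective limit on mlx5 devices). The advertised capability can be checked with ibv_devinfo; the value shown here is illustrative:

$ ibv_devinfo -v | grep max_qp_wr
        max_qp_wr: 32768
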
Comment by Karsten Weiss [ 25/Jun/20 ]

May I suggest changing the "Affects Version/s" attribute of this bug from 2.13.0 to 2.12.x (including 2.12.5, which is an LTS release)? See e.g. the comments here or the reports on lustre-discuss.

Comment by Gerrit Updater [ 24/Nov/20 ]

Serguei Smirnov (ssmirnov@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40748
Subject: LU-12901 o2iblnd: retry qp creation with reduced queue depth
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: fe4fcd922196355b08981d9015f1635c88904fd3
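
The approach, per the patch subject: when QP creation in kiblnd_create_conn() fails with -ENOMEM, retry with a reduced queue depth instead of giving up. A minimal standalone sketch of that retry pattern (fake_create_qp, hca_limit and the halving step are hypothetical stand-ins for rdma_create_qp() and the device limit; this is not the actual patch code):

#include <errno.h>
#include <stdio.h>

/* Stand-in for rdma_create_qp(): rejects depths above a pretend HCA limit. */
static int fake_create_qp(int send_wr, int hca_limit)
{
        return send_wr > hca_limit ? -ENOMEM : 0;
}

int main(void)
{
        int hca_limit = 8000;   /* hypothetical device limit */
        int send_wr = 16317;    /* the depth from the error in the description */
        int rc;

        /* Retry with a smaller queue depth until the "device" accepts it. */
        while ((rc = fake_create_qp(send_wr, hca_limit)) == -ENOMEM &&
               send_wr > 1) {
                send_wr /= 2;
                printf("QP create failed (-ENOMEM), retrying with send_wr=%d\n",
                       send_wr);
        }
        printf("rc=%d, final send_wr=%d\n", rc, send_wr);
        return 0;
}

Presumably the real fix must also keep the connection's negotiated queue depth and credit accounting consistent with whatever depth the device actually granted.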

Comment by Gerrit Updater [ 13/Dec/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40748/
Subject: LU-12901 o2iblnd: retry qp creation with reduced queue depth
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 8a3ef5713cc4aed1ac7bd3ce177895caa597cc4c

Comment by Peter Jones [ 13/Dec/20 ]

Landed for 2.14

Comment by Gerrit Updater [ 20/Dec/21 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/45901
Subject: LU-12901 o2iblnd: retry qp creation with reduced queue depth
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 9e0736f2306286f2f2c653c4e06c17d2201d1c0f
