[LU-17072] LNet dependency to RoCE v1 Created: 31/Aug/23  Updated: 27/Nov/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.16.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Stephane Thiell Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None
Environment:

Rocky Linux 9.2 – 5.14.0-284.25.1.el9_2.x86_64 - Broadcom BCM57414


Attachments: Text File lnet-issue-self-lctl-ping-no-rocev1.txt    
Severity: 4
Rank (Obsolete): 9223372036854775807

 Description   

During testing with master (2.15.57_130_g40c4041 / 40c404129b8ee51af5da7ec422672cc1eba74cbe) on EL9.2 with a RoCE network, we noticed that LNet appears to depend on RoCE v1 being enabled. If only RoCE v2 is enabled and not v1, the IB layer seems to work well (ib_write_bw, ibv_rc_pingpong, etc.), but LNet doesn't work. Attached is a debug trace of an lctl ping from the node to itself (using @o2ib), which doesn't succeed.

In our case, enabling RoCE v1 on the hardware fixes the issue with LNet:

# bnxtnvm -dev=$ROCEIF setoption=disable_roce_v1#0
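
To see which RoCE versions a port actually advertises before and after changing that option, the kernel's GID table can be read from sysfs. The following is only a minimal sketch, not something taken from the attached trace: it assumes the standard /sys/class/infiniband/<device>/ports/<port>/gid_attrs/types/<index> layout and the type strings "IB/RoCE v1" and "RoCE v2" reported by the kernel. The function name roce_gid_types is made up for illustration. It only lists what the GID table exposes and says nothing about which entry LNet ends up using.

```python
from pathlib import Path


def roce_gid_types(sysfs_root="/sys/class/infiniband"):
    """Map (device, port) -> {gid_type: [gid indexes]} from the kernel GID table.

    Assumes the usual sysfs layout:
        <sysfs_root>/<device>/ports/<port>/gid_attrs/types/<index>
    """
    result = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return result

    for types_dir in sorted(root.glob("*/ports/*/gid_attrs/types")):
        device = types_dir.parts[-5]
        port = types_dir.parts[-3]
        by_type = {}
        entries = [p for p in types_dir.iterdir() if p.name.isdigit()]
        for entry in sorted(entries, key=lambda p: int(p.name)):
            try:
                gid_type = entry.read_text().strip()
            except OSError:
                # Unpopulated GID slots return an error on read; skip them.
                continue
            if gid_type:
                by_type.setdefault(gid_type, []).append(int(entry.name))
        result[(device, port)] = by_type
    return result


if __name__ == "__main__":
    for (device, port), by_type in roce_gid_types().items():
        for gid_type, indexes in by_type.items():
            print(f"{device} port {port}: {gid_type} at GID index(es) {indexes}")
```

If a port lists only "RoCE v2" entries while the lctl ping fails, and "IB/RoCE v1" entries show up once the option is changed and LNet starts working, that would be consistent with the behavior described above.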


 Comments   
Comment by Stephane Thiell [ 27/Nov/23 ]

To follow up on this issue: even after applying the workaround above, we hit other LNet issues when using RoCE with Lustre on Broadcom NICs, and realized that their RoCE support is not fully implemented on the BCM57414 NICs. We switched to NVIDIA ConnectX-6 Lx NICs instead, and RoCE seems to just work out of the box with Lustre.
