[LU-17072] LNet dependency to RoCE v1 Created: 31/Aug/23 Updated: 27/Nov/23 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.16.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Stephane Thiell | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Rocky Linux 9.2 – 5.14.0-284.25.1.el9_2.x86_64 - Broadcom BCM57414 |
||
| Attachments: |
|
| Severity: | 4 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
During testing with master (2.15.57_130_g40c4041 / 40c404129b8ee51af5da7ec422672cc1eba74cbe) on EL9.2 with a RoCE network, we noticed that LNet appears to depend on RoCE v1 being enabled. If only RoCE v2 is enabled and not v1, the IB layer seems to work well (ib_write_bw, ibv_rc_pingpong, etc.), but LNet doesn't work. I am attaching a debug trace of an lctl ping of the node to itself (using @o2ib), which doesn't succeed. In our case, enabling RoCE v1 on the hardware fixes the issue with LNet:
# bnxtnvm -dev=$ROCEIF setoption=disable_roce_v1#0 |
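To check which RoCE versions a node actually exposes before and after changing the NIC option, one possibility is to read the GID types that the Linux RDMA stack publishes in sysfs. The following is a minimal sketch and not part of the original report: it assumes the standard `/sys/class/infiniband/<device>/ports/<port>/gid_attrs/types/<index>` layout, where populated entries read as `IB/RoCE v1` or `RoCE v2`. The function name `roce_gid_types` is hypothetical. It only reports what the kernel advertises and does not by itself confirm or explain the LNet behavior described above.

```python
from pathlib import Path


def roce_gid_types(sysfs_root="/sys/class/infiniband"):
    """Return {(device, port): set of GID type strings} found under sysfs_root.

    Assumes the usual RDMA sysfs layout:
    <sysfs_root>/<device>/ports/<port>/gid_attrs/types/<index>
    """
    result = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return result  # no RDMA devices, or not a Linux host

    for dev in sorted(root.iterdir()):
        ports_dir = dev / "ports"
        if not ports_dir.is_dir():
            continue
        for port in sorted(ports_dir.iterdir()):
            types_dir = port / "gid_attrs" / "types"
            if not types_dir.is_dir():
                continue
            found = set()
            for entry in types_dir.iterdir():
                try:
                    text = entry.read_text().strip()
                except OSError:
                    # Unpopulated GID table slots typically fail to read.
                    continue
                if text:
                    found.add(text)
            result[(dev.name, port.name)] = found
    return result


if __name__ == "__main__":
    for (device, port), types in roce_gid_types().items():
        print(f"{device} port {port}: {', '.join(sorted(types)) or 'no GIDs'}")
```

If a port lists only `RoCE v2` while LNet over @o2ib fails, that would match the situation described in this ticket.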
| Comments |
| Comment by Stephane Thiell [ 27/Nov/23 ] |
|
Just to follow up on this issue I reported: even after applying the workaround above, we had other LNet issues when using RoCE with Lustre on Broadcom NICs, and realized that RoCE support is actually not fully implemented on the 57414 NICs. We switched to NVIDIA ConnectX-6 Lx NICs instead, and RoCE seems to just work out of the box with Lustre. |