[LU-14744] RDMA write fails with dump_cqe error Created: 08/Jun/21  Updated: 06/Mar/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Amir Shehata (Inactive) Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

Description

It looks like some RDMA_WRITE WRs can fail because Lustre posts them with too many SGEs. The symptom is:

[27213.113947] infiniband mlx5_0: dump_cqe:286:(pid 42728): dump error cqe
[27213.113951] 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[27213.113952] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[27213.113954] 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[27213.113955] 00000030: 00 00 00 00 02 00 51 04 08 00 13 f6 00 03 90 d2

The likely reason is that the number of SGEs in the WR exceeds ib_qp_init_attr.cap.max_send_sge or ib_device_attr.max_send_sge.

ib_device_attr.max_send_sge is a HW capability attribute. The application (Lustre in our case) can query this value to learn how many SGEs per WR the HW supports. When the application then creates a new QP, it sets ib_qp_init_attr.cap.max_send_sge to tell the HW how many SGEs it will use per WR on this QP. ib_qp_init_attr.cap.max_send_sge must be <= ib_device_attr.max_send_sge, and posting a WR with more SGEs than ib_qp_init_attr.cap.max_send_sge to the QP is not allowed.
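
The two constraints above can be sketched in plain C. This is only an illustration, not Lustre or kernel code: the struct types and the helpers pick_qp_max_send_sge() and wr_fits_qp() are hypothetical stand-ins that keep just the fields discussed here (max_send_sge on the device and QP side, num_sge on the WR side), so that the checks can be compiled and run outside the kernel.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-ins for the fields discussed in this ticket.
 * These are NOT the kernel definitions of ib_device_attr,
 * ib_qp_init_attr.cap or ib_send_wr; only the relevant members are kept.
 */
struct dev_attr_sketch { int max_send_sge; };      /* HW capability */
struct qp_cap_sketch   { uint32_t max_send_sge; }; /* per-QP limit  */
struct send_wr_sketch  { int num_sge; };           /* SGEs in a WR  */

/*
 * Constraint 1: the per-QP limit must not exceed the HW capability.
 * Return the wanted value clamped to what the device reports.
 */
static uint32_t pick_qp_max_send_sge(const struct dev_attr_sketch *dev,
                                     uint32_t wanted)
{
    if (dev->max_send_sge <= 0)
        return 0;
    if (wanted > (uint32_t)dev->max_send_sge)
        return (uint32_t)dev->max_send_sge;
    return wanted;
}

/*
 * Constraint 2: a WR may only be posted if its SGE count fits the
 * limit the QP was created with.
 */
static bool wr_fits_qp(const struct qp_cap_sketch *cap,
                       const struct send_wr_sketch *wr)
{
    return wr->num_sge >= 0 && (uint32_t)wr->num_sge <= cap->max_send_sge;
}
```

A caller would clamp once at QP creation time and run the second check (or split the transfer into several WRs) before each post. The point of failure suspected in this ticket corresponds to wr_fits_qp() returning false.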

