[LU-14744] RDMA write fails with dump_cqe error Created: 08/Jun/21 Updated: 06/Mar/23 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Amir Shehata (Inactive) | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 | ||||
| Rank (Obsolete): | 9223372036854775807 | ||||
| Description |
|
It looks like some RDMA_WRITE WRs can fail because Lustre posted the WR with too many SGEs. The symptom is:

[27213.113947] infiniband mlx5_0: dump_cqe:286:(pid 42728): dump error cqe
[27213.113951] 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[27213.113952] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[27213.113954] 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[27213.113955] 00000030: 00 00 00 00 02 00 51 04 08 00 13 f6 00 03 90 d2

The likely reason is that the number of SGEs exceeds ib_qp_init_attr.cap.max_send_sge or ib_device_attr.max_send_sge.

ib_device_attr.max_send_sge is a HW capability attribute. The application (Lustre in our case) can query this value to learn how many SGEs the HW supports. The application then creates a new QP and sets ib_qp_init_attr.cap.max_send_sge to tell the HW how many SGEs it will use for this QP. ib_qp_init_attr.cap.max_send_sge must be <= ib_device_attr.max_send_sge. Posting a WR with more SGEs than ib_qp_init_attr.cap.max_send_sge to the QP is not allowed. |
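The two constraints described above can be illustrated with a minimal userspace sketch. It is not Lustre or kernel code: `struct sge_limits`, `clamp_qp_send_sge()` and `wr_fits_qp()` are hypothetical names that only model the relationship between the device limit, the per-QP limit, and the SGE count of a posted WR. The values 30 and 256 in the usage note are made up for illustration and are not taken from any real device.

```c
#include <assert.h>

/*
 * Hypothetical stand-ins for the two limits discussed in the ticket.
 * These are NOT the kernel structures; they only carry the two numbers.
 */
struct sge_limits {
    unsigned int dev_max_send_sge; /* models ib_device_attr.max_send_sge */
    unsigned int qp_max_send_sge;  /* models ib_qp_init_attr.cap.max_send_sge */
};

/*
 * Pick the per-QP SGE limit: use what the caller asked for, but never
 * more than the device reports it can support.
 */
static unsigned int clamp_qp_send_sge(unsigned int requested,
                                      unsigned int dev_max)
{
    return requested < dev_max ? requested : dev_max;
}

/*
 * Return nonzero if a WR carrying num_sge entries respects the QP limit.
 * The assert encodes the invariant that the QP limit was set no higher
 * than the device limit when the QP was created.
 */
static int wr_fits_qp(unsigned int num_sge, const struct sge_limits *lim)
{
    assert(lim->qp_max_send_sge <= lim->dev_max_send_sge);
    return num_sge <= lim->qp_max_send_sge;
}
```

For example, with an assumed device limit of 30 and a requested QP limit of 256, `clamp_qp_send_sge(256, 30)` yields 30, and `wr_fits_qp(31, &lim)` then rejects a WR that carries 31 SGEs. A check of this kind before posting would flag the oversized WR in software instead of leaving it to surface as an error CQE.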