Lustre / LU-14744

RDMA write fails with dump_cqe error

Details

    • Bug
    • Resolution: Unresolved
    • Minor
    • None
    • None
    • None
    • 3
    • 9223372036854775807

    Description

      It looks like some RDMA_WRITE work requests (WRs) can fail because Lustre posts them with too many scatter/gather entries (SGEs). The symptom is:

      [27213.113947] infiniband mlx5_0: dump_cqe:286:(pid 42728): dump error cqe
      [27213.113951] 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      [27213.113952] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      [27213.113954] 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      [27213.113955] 00000030: 00 00 00 00 02 00 51 04 08 00 13 f6 00 03 90 d2
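
      As a rough aid for reading the dump above, here is a minimal sketch that pulls a few fields out of the 64-byte error CQE. It assumes the byte offsets of struct mlx5_err_cqe as I understand them from the Linux mlx5 driver headers (include/linux/mlx5/device.h); the names err_cqe_fields, decode_err_cqe and sample_cqe are hypothetical and exist only for this illustration. The offsets should be checked against the headers of the kernel actually in use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fields of interest from a 64-byte mlx5 error CQE.
 * ASSUMPTION: offsets mirror struct mlx5_err_cqe in the Linux mlx5
 * headers; this struct itself is hypothetical, not a kernel type. */
struct err_cqe_fields {
    uint8_t  vendor_err_synd; /* byte 0x36 */
    uint8_t  syndrome;        /* byte 0x37 */
    uint8_t  send_opcode;     /* byte 0x38: top byte of the opcode/QPN word */
    uint32_t qpn;             /* bytes 0x39..0x3b, big-endian */
    uint16_t wqe_counter;     /* bytes 0x3c..0x3d, big-endian */
    uint8_t  cqe_opcode;      /* high nibble of byte 0x3f */
};

/* The 64 bytes from the dump_cqe output in this report.
 * Bytes not listed here are zero, as in the dump. */
static const uint8_t sample_cqe[64] = {
    [0x34] = 0x02, [0x36] = 0x51, [0x37] = 0x04,
    [0x38] = 0x08, [0x3a] = 0x13, [0x3b] = 0xf6,
    [0x3d] = 0x03, [0x3e] = 0x90, [0x3f] = 0xd2,
};

/* Extract the fields above from a raw 64-byte error CQE. */
static struct err_cqe_fields decode_err_cqe(const uint8_t *cqe)
{
    struct err_cqe_fields f;

    assert(cqe != NULL);
    f.vendor_err_synd = cqe[0x36];
    f.syndrome        = cqe[0x37];
    f.send_opcode     = cqe[0x38];
    f.qpn             = ((uint32_t)cqe[0x39] << 16) |
                        ((uint32_t)cqe[0x3a] << 8) |
                        (uint32_t)cqe[0x3b];
    f.wqe_counter     = (uint16_t)(((uint16_t)cqe[0x3c] << 8) | cqe[0x3d]);
    f.cqe_opcode      = (uint8_t)(cqe[0x3f] >> 4);
    return f;
}
```

      Under these assumptions, decode_err_cqe(sample_cqe) yields syndrome 0x04, vendor error syndrome 0x51, send opcode 0x08, QPN 0x0013f6, WQE counter 3 and CQE opcode 0xd. If the opcode numbering in the mlx5 headers is as assumed, send opcode 0x08 would be consistent with an RDMA write. The meaning of syndrome 0x04 should be looked up in the driver headers rather than taken from this sketch.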

      The likely cause is that the number of SGEs in the WR exceeds ib_qp_init_attr.cap.max_send_sge or ib_device_attr.max_send_sge.

      ib_device_attr.max_send_sge is a HW capability attribute. The application (Lustre in our case) can query this value to learn how many SGEs the HW supports per send WR. The application then creates a new QP and sets ib_qp_init_attr.cap.max_send_sge to tell the HW how many SGEs it will use on this QP. ib_qp_init_attr.cap.max_send_sge must be <= ib_device_attr.max_send_sge. Posting a WR to the QP with more SGEs than ib_qp_init_attr.cap.max_send_sge is not allowed.
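
      The relationship between the two limits can be sketched as a small standalone model. This is not the kernel API and not Lustre code: dev_limits, qp_limits, pick_qp_max_send_sge and wr_fits_qp are hypothetical names that only mirror the fields discussed above, and the numbers in the usage note are made up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the fields named above; not kernel structs. */
struct dev_limits {
    uint32_t max_send_sge; /* models ib_device_attr.max_send_sge */
};

struct qp_limits {
    uint32_t max_send_sge; /* models ib_qp_init_attr.cap.max_send_sge */
};

/* Clamp the per-QP SGE count the application wants to the device limit,
 * so that the QP limit never exceeds the HW capability. */
static uint32_t pick_qp_max_send_sge(const struct dev_limits *dev,
                                     uint32_t wanted)
{
    assert(dev != NULL);
    return wanted < dev->max_send_sge ? wanted : dev->max_send_sge;
}

/* A WR may be posted only if its SGE count fits the limit the QP was
 * created with. */
static bool wr_fits_qp(const struct qp_limits *qp, uint32_t wr_num_sge)
{
    assert(qp != NULL);
    return wr_num_sge <= qp->max_send_sge;
}
```

      For example, with a made-up device limit of 30 and a requested 32 SGEs, pick_qp_max_send_sge returns 30, and wr_fits_qp then rejects a WR carrying 31 SGEs. A check of this kind before posting would turn the HW completion error into an early, explicit failure.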

          People

            wc-triage WC Triage
            ashehata Amir Shehata (Inactive)