Lustre / LU-7124

MLX5: Limit hit in cap.max_send_wr


Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.9.0

    Description

      Running Lustre with MLX5

      We were trying to increase O2IBLND's peer_credits to 32 on MLX5. Here is the problematic code:

              init_qp_attr->event_handler = kiblnd_qp_event;
              init_qp_attr->qp_context = conn;
              init_qp_attr->cap.max_send_wr = IBLND_SEND_WRS(version);
              init_qp_attr->cap.max_recv_wr = IBLND_RECV_WRS(version);
              init_qp_attr->cap.max_send_sge = 1;
              init_qp_attr->cap.max_recv_sge = 1;
              init_qp_attr->sq_sig_type = IB_SIGNAL_REQ_WR;
              init_qp_attr->qp_type = IB_QPT_RC;
              init_qp_attr->send_cq = cq;
              init_qp_attr->recv_cq = cq;
      
              rc = rdma_create_qp(cmid, conn->ibc_hdev->ibh_pd, init_qp_attr);
      
      #define IBLND_SEND_WRS(v)          ((IBLND_RDMA_FRAGS(v) + 1) * IBLND_CONCURRENT_SENDS(v))
      
      #define IBLND_RDMA_FRAGS(v)        ((v) == IBLND_MSG_VERSION_1 ? \
                                           IBLND_MAX_RDMA_FRAGS : IBLND_CFG_RDMA_FRAGS)
      
      #define IBLND_CFG_RDMA_FRAGS       (*kiblnd_tunables.kib_map_on_demand != 0 ? \
                                          *kiblnd_tunables.kib_map_on_demand :      \
                                           IBLND_MAX_RDMA_FRAGS)  /* max # of fragments configured by user */
      
      #define IBLND_MAX_RDMA_FRAGS         LNET_MAX_IOV           /* max # of fragments supported */
      
      /** limit on the number of fragments in discontiguous MDs */
      #define LNET_MAX_IOV    256
      

      With peer_credits set to 32, IBLND_SEND_WRS evaluates to (256 + 1) * 32 = 8224, well below the max_qp_wr of 16384 reported by the device:

      init_qp_attr->cap.max_send_wr = 8224
      
      [root@wt-2-00 ~]# ibv_devinfo -v | grep max_qp_wr
       max_qp_wr:   16384
      

      Even so, rdma_create_qp() returns -12 (ENOMEM, out of memory).

      With peer_credits set to 16, max_send_wr is 4112 and QP creation appears to succeed.
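
      The two values above follow directly from the quoted macros. A minimal sketch of that arithmetic is shown here; the helper names cfg_rdma_frags() and send_wrs() are hypothetical, and it assumes a non-v1 message version and that IBLND_CONCURRENT_SENDS(v) equals peer_credits, which is consistent with the 8224 and 4112 figures in this report.

```c
#include <assert.h>

/* Constants copied from the macros quoted in this ticket. */
#define LNET_MAX_IOV          256
#define IBLND_MAX_RDMA_FRAGS  LNET_MAX_IOV

/* Hypothetical helper mirroring IBLND_CFG_RDMA_FRAGS:
 * a map_on_demand of 0 falls back to the maximum fragment count. */
static int cfg_rdma_frags(int map_on_demand)
{
    assert(map_on_demand >= 0);
    return map_on_demand != 0 ? map_on_demand : IBLND_MAX_RDMA_FRAGS;
}

/* Hypothetical helper mirroring IBLND_SEND_WRS(v).
 * Assumption: concurrent_sends is equal to peer_credits. */
static int send_wrs(int rdma_frags, int concurrent_sends)
{
    assert(rdma_frags > 0 && concurrent_sends > 0);
    return (rdma_frags + 1) * concurrent_sends;
}
```

      Under these assumptions, send_wrs(cfg_rdma_frags(0), 32) gives 8224 and send_wrs(cfg_rdma_frags(0), 16) gives 4112. A non-zero map_on_demand lowers the result; for example, an illustrative value of 32 would give (32 + 1) * 32 = 1056.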

      We're running on MOFED 3.0.

      Is there any limitation that we're hitting on the MLX side? As far as I know MLX4 works with peer_credits set to 32.
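
      One way to narrow down where the effective limit lies is to retry QP creation with a smaller max_send_wr each time the call fails with -ENOMEM. The following is only a userspace sketch of that idea, not the driver's behavior and not necessarily how this issue was resolved. The create_qp_fn callback type, probe_max_send_wr(), and mock_create_qp() are hypothetical names; the mock stands in for a real call such as rdma_create_qp(), and its limit is an arbitrary illustrative number.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical callback: returns 0 on success or a negative errno,
 * standing in for a QP-creation call such as rdma_create_qp(). */
typedef int (*create_qp_fn)(unsigned int max_send_wr, void *arg);

/* Try start, then halve after every -ENOMEM until a request succeeds
 * or the value drops below min_wr. Returns the accepted max_send_wr,
 * or 0 if nothing was accepted or another error occurred. */
static unsigned int probe_max_send_wr(create_qp_fn create, void *arg,
                                      unsigned int start,
                                      unsigned int min_wr)
{
    unsigned int wr;

    if (create == NULL || min_wr == 0)
        return 0;

    for (wr = start; wr >= min_wr; wr /= 2) {
        int rc = create(wr, arg);

        if (rc == 0)
            return wr;      /* device accepted this size */
        if (rc != -ENOMEM)
            break;          /* unrelated failure: stop probing */
    }
    return 0;
}

/* Mock device for illustration only: rejects any request above the
 * limit passed through arg. The real limit is not known from this
 * report. */
static int mock_create_qp(unsigned int max_send_wr, void *arg)
{
    unsigned int limit = *(unsigned int *)arg;

    return max_send_wr > limit ? -ENOMEM : 0;
}
```

      With a mock limit of 8000, probing from 8224 rejects the first request and accepts 4112, which mirrors the behavior reported above for peer_credits 32 and 16. Halving is a coarse search; a finer step would locate the boundary more precisely at the cost of more attempts.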

      Full device info:

      [wt2user1@wildcat2 ~]$ ibv_devinfo -v
      hca_id: mlx5_0
              transport:                      InfiniBand (0)
              fw_ver:                         12.100.6440
              node_guid:                      e41d:2d03:0060:7652
              sys_image_guid:                 e41d:2d03:0060:7652
              vendor_id:                      0x02c9
              vendor_part_id:                 4115
              hw_ver:                         0x0
              board_id:                       MT_2180110032
              phys_port_cnt:                  1
              max_mr_size:                    0xffffffffffffffff
              page_size_cap:                  0xfffff000
              max_qp:                         262144
              max_qp_wr:                      16384
              device_cap_flags:               0x40509c36
                                              BAD_PKEY_CNTR
                                              BAD_QKEY_CNTR
                                              AUTO_PATH_MIG
                                              CHANGE_PHY_PORT
                                              PORT_ACTIVE_EVENT
                                              SYS_IMAGE_GUID
                                              RC_RNR_NAK_GEN
                                              XRC
                                              Unknown flags: 0x40408000
              device_cap_exp_flags:           0x5020007100000000
                                              EXP_DC_TRANSPORT
                                              EXP_MEM_MGT_EXTENSIONS
                                              EXP_CROSS_CHANNEL
                                              EXP_MR_ALLOCATE
                                              EXT_ATOMICS
                                              EXT_SEND NOP
                                              EXP_UMR
              max_sge:                        30
              max_sge_rd:                     0
              max_cq:                         16777216
              max_cqe:                        4194303
              max_mr:                         16777216
              max_pd:                         16777216
              max_qp_rd_atom:                 16
              max_ee_rd_atom:                 0
              max_res_rd_atom:                4194304
              max_qp_init_rd_atom:            16
              max_ee_init_rd_atom:            0
              atomic_cap:                     ATOMIC_HCA_REPLY_BE (64)
              log atomic arg sizes (mask)             3c
              max fetch and add bit boundary  64
              log max atomic inline           5
              max_ee:                         0
              max_rdd:                        0
              max_mw:                         0
              max_raw_ipv6_qp:                0
              max_raw_ethy_qp:                0
              max_mcast_grp:                  2097152
              max_mcast_qp_attach:            48
              max_total_mcast_qp_attach:      100663296
              max_ah:                         2147483647
              max_fmr:                        0
              max_srq:                        8388608
              max_srq_wr:                     16383
              max_srq_sge:                    31
              max_pkeys:                      128
              local_ca_ack_delay:             16
              hca_core_clock:                 0
              max_klm_list_size:              65536
              max_send_wqe_inline_klms:       20
              max_umr_recursion_depth:        4
              max_umr_stride_dimension:       1
              general_odp_caps:
              rc_odp_caps:
                                              NO SUPPORT
              uc_odp_caps:
                                              NO SUPPORT
              ud_odp_caps:
                                              NO SUPPORT
              dc_odp_caps:
                                              NO SUPPORT
              xrc_odp_caps:
                                              NO SUPPORT
              raw_eth_odp_caps:
                                              NO SUPPORT
              max_dct:                        262144
                      port:   1
                              state:                  PORT_ACTIVE (4)
                              max_mtu:                4096 (5)
                              active_mtu:             4096 (5)
                              sm_lid:                 19
                              port_lid:               1
                              port_lmc:               0x00
                              link_layer:             InfiniBand
                              max_msg_sz:             0x40000000
                              port_cap_flags:         0x2651e848
                              max_vl_num:             4 (3)
                              bad_pkey_cntr:          0x0
                              qkey_viol_cntr:         0x0
                              sm_sl:                  0
                              pkey_tbl_len:           128
                              gid_tbl_len:            8
                              subnet_timeout:         18
                              init_type_reply:        0
                              active_width:           4X (2)
                              active_speed:           25.0 Gbps (32)
                              phys_state:             LINK_UP (5)
                              GID[  0]:               fe80:0000:0000:0000:e41d:2d03:0060:7652
      

            People

              ashehata Amir Shehata (Inactive)