Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.9.0

    Description

      Running Lustre with MLX5

      We were trying to increase O2IBLND's peer_credits to 32 on MLX5. Here is the problematic code:

              init_qp_attr->event_handler = kiblnd_qp_event;
              init_qp_attr->qp_context = conn;
              init_qp_attr->cap.max_send_wr = IBLND_SEND_WRS(version);
              init_qp_attr->cap.max_recv_wr = IBLND_RECV_WRS(version);
              init_qp_attr->cap.max_send_sge = 1;
              init_qp_attr->cap.max_recv_sge = 1;
              init_qp_attr->sq_sig_type = IB_SIGNAL_REQ_WR;
              init_qp_attr->qp_type = IB_QPT_RC;
              init_qp_attr->send_cq = cq;
              init_qp_attr->recv_cq = cq;
      
              rc = rdma_create_qp(cmid, conn->ibc_hdev->ibh_pd, init_qp_attr);
      
      #define IBLND_SEND_WRS(v)          ((IBLND_RDMA_FRAGS(v) + 1) * IBLND_CONCURRENT_SENDS(v))
      
      #define IBLND_RDMA_FRAGS(v)        ((v) == IBLND_MSG_VERSION_1 ? \
                                           IBLND_MAX_RDMA_FRAGS : IBLND_CFG_RDMA_FRAGS)
      
      #define IBLND_CFG_RDMA_FRAGS       (*kiblnd_tunables.kib_map_on_demand != 0 ? \
                                          *kiblnd_tunables.kib_map_on_demand :      \
                                           IBLND_MAX_RDMA_FRAGS)  /* max # of fragments configured by user */
      
      #define IBLND_MAX_RDMA_FRAGS         LNET_MAX_IOV           /* max # of fragments supported */
      
      /** limit on the number of fragments in discontiguous MDs */
      #define LNET_MAX_IOV    256
      

      Basically, when peer_credits is set to 32 we end up with:

      init_qp_attr->cap.max_send_wr = 8224

      [root@wt-2-00 ~]# ibv_devinfo -v | grep max_qp_wr
       max_qp_wr:   16384

      rdma_create_qp() returns -12 (ENOMEM, out of memory).

      peer_credits of 16 (max_send_wr == 4112) seems to work.
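
      For reference, here is a minimal stand-alone sketch of the arithmetic above. It assumes map_on_demand is 0 (so IBLND_CFG_RDMA_FRAGS falls back to LNET_MAX_IOV) and that the concurrent-send count follows peer_credits, which is what the numbers above imply; it is not the kiblnd code itself.

        #include <stdio.h>

        #define LNET_MAX_IOV 256   /* IBLND_MAX_RDMA_FRAGS in the snippet above */

        /* IBLND_SEND_WRS(v) == (IBLND_RDMA_FRAGS(v) + 1) * IBLND_CONCURRENT_SENDS(v) */
        static int send_wrs(int rdma_frags, int concurrent_sends)
        {
                return (rdma_frags + 1) * concurrent_sends;
        }

        int main(void)
        {
                printf("peer_credits 16 -> max_send_wr %d\n", send_wrs(LNET_MAX_IOV, 16)); /* 4112 */
                printf("peer_credits 32 -> max_send_wr %d\n", send_wrs(LNET_MAX_IOV, 32)); /* 8224 */
                /* With a non-zero map_on_demand the fragment count, and hence
                 * max_send_wr, shrinks; e.g. a hypothetical map_on_demand of 32: */
                printf("map_on_demand 32, peer_credits 32 -> max_send_wr %d\n",
                       send_wrs(32, 32));                                        /* 1056 */
                return 0;
        }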

      We're running MOFED 3.0.

      Is there a limitation that we're hitting on the Mellanox side? As far as I know, MLX4 works with peer_credits set to 32.

      Full device info:

      [wt2user1@wildcat2 ~]$ ibv_devinfo -v
      hca_id: mlx5_0
              transport:                      InfiniBand (0)
              fw_ver:                         12.100.6440
              node_guid:                      e41d:2d03:0060:7652
              sys_image_guid:                 e41d:2d03:0060:7652
              vendor_id:                      0x02c9
              vendor_part_id:                 4115
              hw_ver:                         0x0
              board_id:                       MT_2180110032
              phys_port_cnt:                  1
              max_mr_size:                    0xffffffffffffffff
              page_size_cap:                  0xfffff000
              max_qp:                         262144
              max_qp_wr:                      16384
              device_cap_flags:               0x40509c36
                                              BAD_PKEY_CNTR
                                              BAD_QKEY_CNTR
                                              AUTO_PATH_MIG
                                              CHANGE_PHY_PORT
                                              PORT_ACTIVE_EVENT
                                              SYS_IMAGE_GUID
                                              RC_RNR_NAK_GEN
                                              XRC
                                              Unknown flags: 0x40408000
              device_cap_exp_flags:           0x5020007100000000
                                              EXP_DC_TRANSPORT
                                              EXP_MEM_MGT_EXTENSIONS
                                              EXP_CROSS_CHANNEL
                                              EXP_MR_ALLOCATE
                                              EXT_ATOMICS
                                              EXT_SEND NOP
                                              EXP_UMR
              max_sge:                        30
              max_sge_rd:                     0
              max_cq:                         16777216
              max_cqe:                        4194303
              max_mr:                         16777216
              max_pd:                         16777216
              max_qp_rd_atom:                 16
              max_ee_rd_atom:                 0
              max_res_rd_atom:                4194304
              max_qp_init_rd_atom:            16
              max_ee_init_rd_atom:            0
              atomic_cap:                     ATOMIC_HCA_REPLY_BE (64)
              log atomic arg sizes (mask)             3c
              max fetch and add bit boundary  64
              log max atomic inline           5
              max_ee:                         0
              max_rdd:                        0
              max_mw:                         0
              max_raw_ipv6_qp:                0
              max_raw_ethy_qp:                0
              max_mcast_grp:                  2097152
              max_mcast_qp_attach:            48
              max_total_mcast_qp_attach:      100663296
              max_ah:                         2147483647
              max_fmr:                        0
              max_srq:                        8388608
              max_srq_wr:                     16383
              max_srq_sge:                    31
              max_pkeys:                      128
              local_ca_ack_delay:             16
              hca_core_clock:                 0
              max_klm_list_size:              65536
              max_send_wqe_inline_klms:       20
              max_umr_recursion_depth:        4
              max_umr_stride_dimension:       1
              general_odp_caps:
              rc_odp_caps:
                                              NO SUPPORT
              uc_odp_caps:
                                              NO SUPPORT
              ud_odp_caps:
                                              NO SUPPORT
              dc_odp_caps:
                                              NO SUPPORT
              xrc_odp_caps:
                                              NO SUPPORT
              raw_eth_odp_caps:
                                              NO SUPPORT
              max_dct:                        262144
                      port:   1
                              state:                  PORT_ACTIVE (4)
                              max_mtu:                4096 (5)
                              active_mtu:             4096 (5)
                              sm_lid:                 19
                              port_lid:               1
                              port_lmc:               0x00
                              link_layer:             InfiniBand
                              max_msg_sz:             0x40000000
                              port_cap_flags:         0x2651e848
                              max_vl_num:             4 (3)
                              bad_pkey_cntr:          0x0
                              qkey_viol_cntr:         0x0
                              sm_sl:                  0
                              pkey_tbl_len:           128
                              gid_tbl_len:            8
                              subnet_timeout:         18
                              init_type_reply:        0
                              active_width:           4X (2)
                              active_speed:           25.0 Gbps (32)
                              phys_state:             LINK_UP (5)
                              GID[  0]:               fe80:0000:0000:0000:e41d:2d03:0060:7652
      

          Activity

            [LU-7124] MLX5: Limit hit in cap.max_send_wr
            scherementsev Sergey Cheremencev added a comment -

            [49672.067906] mlx5_ib:mlx5_0:calc_sq_size:485:(pid 8297): wqe_size 192
            [49672.067908] mlx5_ib:mlx5_0:calc_sq_size:507:(pid 8297): wqe count(65536) exceeds limits(16384)
            [49672.067910] mlx5_ib:mlx5_0:create_kernel_qp:1051:(pid 8297): err -12
            

            I don't think that http://review.whamcloud.com/18347/ is the right solution.
            It hides the fact that mlx5 doesn't support peer_credits > 16.
            At the very least, a warning should be added there.

            Moreover, the patch introduces several memory allocations/frees and mutex lock/unlock cycles inside rdma_create_qp->mlx5_ib_create_qp...:

            mlx5_ib_create_qp: attrx = kzalloc(sizeof(*attrx), GFP_KERNEL); 
            __create_qp: qp = kzalloc(sizeof(*qp), GFP_KERNEL);
            

            Also, there is a chance that ENOMEM could be returned because of low system memory, i.e. not from calc_sq_size.
            In that case it is a bad idea to allocate and free small pieces of memory.

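            For illustration only, a minimal sketch of the clamp-and-warn behaviour the comment above asks for. It is not the actual patch; create_qp_fn() is a hypothetical stand-in for rdma_create_qp(), fprintf() stands in for the kernel's CWARN(), and the 8192 device limit is an assumed value for this wqe_size.

            #include <errno.h>
            #include <stdio.h>

            /* Hypothetical stand-in for rdma_create_qp(): fails with -ENOMEM whenever
             * the requested send WR count exceeds what the device actually accepts. */
            static int create_qp_fn(int max_send_wr, int device_limit)
            {
                    return max_send_wr > device_limit ? -ENOMEM : 0;
            }

            /* Retry with a halved work-request count instead of failing outright,
             * and warn so the administrator knows peer_credits was effectively cut. */
            static int create_qp_with_fallback(int requested_wr, int device_limit)
            {
                    int wr = requested_wr;
                    int rc;

                    while ((rc = create_qp_fn(wr, device_limit)) == -ENOMEM && wr > 1) {
                            wr /= 2;
                            fprintf(stderr, "warning: max_send_wr reduced from %d to %d\n",
                                    requested_wr, wr);
                    }
                    return rc;
            }

            int main(void)
            {
                    /* 8224 is the value computed for peer_credits=32 above. */
                    return create_qp_with_fallback(8224, 8192);
            }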

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/18347/
            Subject: LU-7124 o2iblnd: limit cap.max_send_wr for MLX5
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 4083806828a94ee09c2dadf2cca8c224547d5ebc

            simmonsja James A Simmons added a comment (edited) -

            For me the problem was not being able to set my peer_credits setting higher than 16.


            chunteraa Chris Hunter (Inactive) added a comment -

            Is the problem memory fragmentation, as in LU-5718, or matching o2iblnd settings between client & server?


            shadow Alexey Lyashkov added a comment -

            First of all, don't file a new ticket when an older one exists. Please start your work with a search.

            Second, this is not a solution. If someone wants to avoid ENOMEM in this case, it could be done via a tunable, but this patch silently hides the problem with those settings.

            The real fix here would be to implement a shared receive queue for o2iblnd and a new memory registration model, which would dramatically reduce the number of work requests.


            gerrit Gerrit Updater added a comment -

            Dmitry Eremin (dmitry.eremin@intel.com) uploaded a new patch: http://review.whamcloud.com/18347
            Subject: LU-7124 o2iblnd: limit cap.max_send_wr for MLX5
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: e11441614b76bbc2887cc28fd4dfae7c2128963f


            simmonsja James A Simmons added a comment -

            I found that if I used map_on_demand I could increase peer_credits for mlx5. Now that ko2iblnd no longer supports PMR, you can't do this anymore. If we implement the Fast Registration API, we could possibly push peer_credits higher while using map_on_demand again. As a bonus, the Fast Registration API is supported with mlx4, so it can be tested more broadly.


            ashehata Amir Shehata (Inactive) added a comment -

            Here is the reply I got from a Mellanox engineer:

            Hi,

            I sent in the past an explanation to this list and I am going to repeat it.

            The number reported for max_qp_wr is the maximum value the HCA supports. But it is not guaranteed that this maximum is supported for any configuration of a QP. For example, the number of send SGEs and the transport service can affect this max value.

            From the spec:

            11.2.1.2 QUERY HCA
            Description:
            Returns the attributes for the specified HCA.
            The maximum values defined in this section are guaranteed not-to-exceed
            values. It is possible for an implementation to allocate some HCA
            resources from the same space. In that case, the maximum values returned
            are not guaranteed for all of those resources simultaneously

            Mlx5-based devices work as described above. Mlx4-based devices have some flexibility that allows them to use larger work queues, which is why you can define 16K WRs for mlx4 while for mlx5 you can only do 8K (in your specific case).

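            To check the advertised limit programmatically, here is a minimal user-space sketch (an illustration, not part of the ticket) that reads max_qp_wr the same way ibv_devinfo -v does; link with -libverbs. As the Mellanox reply explains, the value is only a not-to-exceed bound, so rdma_create_qp() can still fail below it depending on wqe_size, SGEs and transport.

            #include <stdio.h>
            #include <infiniband/verbs.h>

            int main(void)
            {
                    int num, i;
                    struct ibv_device **devs = ibv_get_device_list(&num);

                    if (!devs)
                            return 1;

                    for (i = 0; i < num; i++) {
                            struct ibv_context *ctx = ibv_open_device(devs[i]);
                            struct ibv_device_attr attr;

                            /* Same attribute ibv_devinfo prints as max_qp_wr. */
                            if (ctx && !ibv_query_device(ctx, &attr))
                                    printf("%s: max_qp_wr=%d\n",
                                           ibv_get_device_name(devs[i]), attr.max_qp_wr);
                            if (ctx)
                                    ibv_close_device(ctx);
                    }
                    ibv_free_device_list(devs);
                    return 0;
            }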

            People

              Assignee: ashehata Amir Shehata (Inactive)
              Reporter: ashehata Amir Shehata (Inactive)
              Votes: 0
              Watchers: 18
