[LU-7124] MLX5: Limit hit in cap.max_send_wr Created: 09/Sep/15 Updated: 24/Nov/20 Resolved: 14/Mar/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.9.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Amir Shehata (Inactive) | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Running Lustre with MLX5, we were trying to increase o2iblnd's peer_credits to 32. Here is the problematic code:

init_qp_attr->event_handler = kiblnd_qp_event;
init_qp_attr->qp_context = conn;
init_qp_attr->cap.max_send_wr = IBLND_SEND_WRS(version);
init_qp_attr->cap.max_recv_wr = IBLND_RECV_WRS(version);
init_qp_attr->cap.max_send_sge = 1;
init_qp_attr->cap.max_recv_sge = 1;
init_qp_attr->sq_sig_type = IB_SIGNAL_REQ_WR;
init_qp_attr->qp_type = IB_QPT_RC;
init_qp_attr->send_cq = cq;
init_qp_attr->recv_cq = cq;
rc = rdma_create_qp(cmid, conn->ibc_hdev->ibh_pd, init_qp_attr);
#define IBLND_SEND_WRS(v) ((IBLND_RDMA_FRAGS(v) + 1) * IBLND_CONCURRENT_SENDS(v))
#define IBLND_RDMA_FRAGS(v) ((v) == IBLND_MSG_VERSION_1 ? \
IBLND_MAX_RDMA_FRAGS : IBLND_CFG_RDMA_FRAGS)
#define IBLND_CFG_RDMA_FRAGS (*kiblnd_tunables.kib_map_on_demand != 0 ? \
*kiblnd_tunables.kib_map_on_demand : \
IBLND_MAX_RDMA_FRAGS) /* max # of fragments configured by user */
#define IBLND_MAX_RDMA_FRAGS LNET_MAX_IOV /* max # of fragments supported */
/** limit on the number of fragments in discontiguous MDs */
#define LNET_MAX_IOV 256
Basically, when setting peer_credits to 32, init_qp_attr->cap.max_send_wr = 8224.

[root@wt-2-00 ~]# ibv_devinfo -v | grep max_qp_wr
        max_qp_wr:                      16384

The API returns -12 (out of memory). peer_credits 16 (max_send_wr == 4112) seems to work. We're running on MOFED 3.0. Is there any limitation that we're hitting on the MLX side? As far as I know, MLX4 works with peer_credits set to 32.

Full device info:

[wt2user1@wildcat2 ~]$ ibv_devinfo -v
hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 12.100.6440
node_guid: e41d:2d03:0060:7652
sys_image_guid: e41d:2d03:0060:7652
vendor_id: 0x02c9
vendor_part_id: 4115
hw_ver: 0x0
board_id: MT_2180110032
phys_port_cnt: 1
max_mr_size: 0xffffffffffffffff
page_size_cap: 0xfffff000
max_qp: 262144
max_qp_wr: 16384
device_cap_flags: 0x40509c36
BAD_PKEY_CNTR
BAD_QKEY_CNTR
AUTO_PATH_MIG
CHANGE_PHY_PORT
PORT_ACTIVE_EVENT
SYS_IMAGE_GUID
RC_RNR_NAK_GEN
XRC
Unknown flags: 0x40408000
device_cap_exp_flags: 0x5020007100000000
EXP_DC_TRANSPORT
EXP_MEM_MGT_EXTENSIONS
EXP_CROSS_CHANNEL
EXP_MR_ALLOCATE
EXT_ATOMICS
EXT_SEND NOP
EXP_UMR
max_sge: 30
max_sge_rd: 0
max_cq: 16777216
max_cqe: 4194303
max_mr: 16777216
max_pd: 16777216
max_qp_rd_atom: 16
max_ee_rd_atom: 0
max_res_rd_atom: 4194304
max_qp_init_rd_atom: 16
max_ee_init_rd_atom: 0
atomic_cap: ATOMIC_HCA_REPLY_BE (64)
log atomic arg sizes (mask) 3c
max fetch and add bit boundary 64
log max atomic inline 5
max_ee: 0
max_rdd: 0
max_mw: 0
max_raw_ipv6_qp: 0
max_raw_ethy_qp: 0
max_mcast_grp: 2097152
max_mcast_qp_attach: 48
max_total_mcast_qp_attach: 100663296
max_ah: 2147483647
max_fmr: 0
max_srq: 8388608
max_srq_wr: 16383
max_srq_sge: 31
max_pkeys: 128
local_ca_ack_delay: 16
hca_core_clock: 0
max_klm_list_size: 65536
max_send_wqe_inline_klms: 20
max_umr_recursion_depth: 4
max_umr_stride_dimension: 1
general_odp_caps:
rc_odp_caps:
NO SUPPORT
uc_odp_caps:
NO SUPPORT
ud_odp_caps:
NO SUPPORT
dc_odp_caps:
NO SUPPORT
xrc_odp_caps:
NO SUPPORT
raw_eth_odp_caps:
NO SUPPORT
max_dct: 262144
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 19
port_lid: 1
port_lmc: 0x00
link_layer: InfiniBand
max_msg_sz: 0x40000000
port_cap_flags: 0x2651e848
max_vl_num: 4 (3)
bad_pkey_cntr: 0x0
qkey_viol_cntr: 0x0
sm_sl: 0
pkey_tbl_len: 128
gid_tbl_len: 8
subnet_timeout: 18
init_type_reply: 0
active_width: 4X (2)
active_speed: 25.0 Gbps (32)
phys_state: LINK_UP (5)
GID[ 0]: fe80:0000:0000:0000:e41d:2d03:0060:7652
|
| Comments |
| Comment by Amir Shehata (Inactive) [ 09/Sep/15 ] |
|
Here is the reply I got from a Mellanox engineer:

Hi, I sent an explanation to this list in the past and I am going to repeat it. The number reported for max_qp_wr is the maximum value the HCA supports, but it is not guaranteed that this maximum is supported for every configuration of a QP. For example, the number of send SGEs and the transport service can affect this max value. From the spec: 11.2.1.2 QUERY HCA. mlx5 devices work as described above. mlx4 devices have some flexibility that allows the user larger work queues, which is why you can define 16K WRs for mlx4 while for mlx5 you can only do 8K (in your specific case). |
| Comment by James A Simmons [ 08/Oct/15 ] |
|
I found that if I used map_on_demand, you could increase the peer_credits for mlx5. Now that ko2iblnd no longer supports PMR, you can't do this anymore. If we implement the Fast Registration API, we could possibly push the peer_credits higher using map_on_demand again. As a bonus, the Fast Registration API is supported with mlx4, so it can be tested more broadly. |
| Comment by Gerrit Updater [ 08/Feb/16 ] |
|
Dmitry Eremin (dmitry.eremin@intel.com) uploaded a new patch: http://review.whamcloud.com/18347 |
| Comment by Alexey Lyashkov [ 10/Feb/16 ] |
|
First of all, don't file a new ticket when an older one exists; please start your work with a search. Second, this is not a solution. If someone wants to avoid ENOMEM in this case, it may be done via a tunable, but you are silencing how dangerous those settings are. The real fix here would be to implement a shared receive queue for o2iblnd and a new memory registration model, which would dramatically reduce the number of work requests. |
| Comment by Chris Hunter (Inactive) [ 09/Mar/16 ] |
|
Is the problem memory fragmentation as in |
| Comment by James A Simmons [ 09/Mar/16 ] |
|
For me the problem was not being able to set my peer credit setting higher than 16. |
| Comment by Gerrit Updater [ 14/Mar/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/18347/ |
| Comment by Sergey Cheremencev [ 08/Apr/16 ] |
|
[49672.067906] mlx5_ib:mlx5_0:calc_sq_size:485:(pid 8297): wqe_size 192
[49672.067908] mlx5_ib:mlx5_0:calc_sq_size:507:(pid 8297): wqe count(65536) exceeds limits(16384)
[49672.067910] mlx5_ib:mlx5_0:create_kernel_qp:1051:(pid 8297): err -12

I don't think that http://review.whamcloud.com/18347/ is the right solution. Moreover, the patch leads to several memory allocations/frees and mutex locks/unlocks inside rdma_create_qp->mlx5_ib_create_qp...:

mlx5_ib_create_qp: attrx = kzalloc(sizeof(*attrx), GFP_KERNEL);
__create_qp: qp = kzalloc(sizeof(*qp), GFP_KERNEL);

There is also a chance that ENOMEM could be returned because of low system memory, i.e. not from calc_sq_size. |