[LU-5783] o2iblnd: investigate new memory registration mechanisms Created: 21/Oct/14 Updated: 04/Jun/16 Resolved: 13/May/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.9.0 |
| Type: | Improvement | Priority: | Minor |
| Reporter: | Isaac Huang (Inactive) | Assignee: | Dmitry Eremin (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
| Rank (Obsolete): | 16229 |
| Description |
|
OFED supports new memory registration mechanisms that o2iblnd currently does not use.
The first may improve throughput, and the second may solve the long-standing "too many RDMA frags" failures we have seen (I forget which ticket, Liang?). We should investigate and decide whether to add support for these OFED features to o2iblnd. |
| Comments |
| Comment by James A Simmons [ 18/Sep/15 ] |
|
I believe this is the replacement for FMR in the mlx5 driver |
| Comment by James A Simmons [ 22/Sep/15 ] |
|
I was just told by a coworker that the above APIs are being thrown out and replaced by a new RDMA API. When that API lands, FMR will be removed from all drivers. You can see the details in this thread: |
| Comment by James A Simmons [ 08/Oct/15 ] |
|
Started work on a patch to support the Fast Registration API. |
| Comment by Gerrit Updater [ 15/Dec/15 ] |
|
Dmitry Eremin (dmitry.eremin@intel.com) uploaded a new patch: http://review.whamcloud.com/17606 |
| Comment by Alexey Lyashkov [ 15/Dec/15 ] |
|
I'm sorry, but I am not able to post comments into Gerrit directly.

Consumer interface details:
- A new device capability flag IB_DEVICE_MEM_MGT_EXTENSIONS is added to indicate device support for these features.
- New send work request opcodes IB_WR_FAST_REG_MR, IB_WR_LOCAL_INV, and IB_WR_RDMA_READ_WITH_INV are added.
- A new consumer API function, ib_alloc_mr(), is added to allocate fast register memory regions.
- New consumer API functions, ib_alloc_fast_reg_page_list() and ib_free_fast_reg_page_list(), are added to allocate and free device-specific memory for fast registration page lists.
- A new consumer API function, ib_update_fast_reg_key(), is added to allow the key portion of the R_Key and L_Key of a fast registration MR to be updated. Consumers call this if desired before posting an IB_WR_FAST_REG_MR work request.

Consumers can use this as follows:
- MR is allocated with ib_alloc_mr().
- Page list memory is allocated with ib_alloc_fast_reg_page_list().
- MR R_Key/L_Key "key" field is updated with ib_update_fast_reg_key().
- MR is made VALID and bound to a specific page list via ib_post_send(IB_WR_FAST_REG_MR).
- MR is made INVALID via ib_post_send(IB_WR_LOCAL_INV), ib_post_send(IB_WR_RDMA_READ_WITH_INV), or an incoming send-with-invalidate operation.
- MR is deallocated with ib_dereg_mr().
- Page lists are deallocated via ib_free_fast_reg_page_list(). |
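To make that ordering concrete, here is a minimal sketch of one fast-registration cycle. It is not code from the ticket or from the 17606 patch; it uses the older verbs of that era, where the allocation call is named ib_alloc_fast_reg_mr() (later folded into ib_alloc_mr()), and field names follow the iser snippet quoted later in this ticket. Treat it as an illustration of the lifecycle, not a drop-in o2iblnd change.

    #include <rdma/ib_verbs.h>

    /* Hedged illustration only: one MR allocate / fast-reg / invalidate /
     * deregister cycle, on the pre-4.x IB_WR_FAST_REG_MR verbs. */
    static int fastreg_cycle_example(struct ib_pd *pd, struct ib_qp *qp,
                                     u64 *dma_addrs, int npages,
                                     u32 length, u8 key)
    {
            struct ib_mr *mr;
            struct ib_fast_reg_page_list *frpl;
            struct ib_send_wr fastreg_wr, inv_wr, *bad_wr;
            int i, rc;

            /* 1. Allocate the MR and its device-specific page list. */
            mr = ib_alloc_fast_reg_mr(pd, npages);
            if (IS_ERR(mr))
                    return PTR_ERR(mr);
            frpl = ib_alloc_fast_reg_page_list(pd->device, npages);
            if (IS_ERR(frpl)) {
                    rc = PTR_ERR(frpl);
                    goto out_mr;
            }
            for (i = 0; i < npages; i++)
                    frpl->page_list[i] = dma_addrs[i];

            /* 2. Refresh the key portion of the R_Key/L_Key before reuse. */
            ib_update_fast_reg_key(mr, key);

            /* 3. FAST_REG_MR work request binds the MR to this page list
             *    and makes it VALID. */
            memset(&fastreg_wr, 0, sizeof(fastreg_wr));
            fastreg_wr.opcode = IB_WR_FAST_REG_MR;
            fastreg_wr.send_flags = IB_SEND_SIGNALED;
            fastreg_wr.wr.fast_reg.iova_start = frpl->page_list[0];
            fastreg_wr.wr.fast_reg.page_list = frpl;
            fastreg_wr.wr.fast_reg.page_list_len = npages;
            fastreg_wr.wr.fast_reg.page_shift = PAGE_SHIFT;
            fastreg_wr.wr.fast_reg.length = length;
            fastreg_wr.wr.fast_reg.rkey = mr->rkey;
            fastreg_wr.wr.fast_reg.access_flags = IB_ACCESS_LOCAL_WRITE |
                                                  IB_ACCESS_REMOTE_WRITE |
                                                  IB_ACCESS_REMOTE_READ;
            rc = ib_post_send(qp, &fastreg_wr, &bad_wr);
            if (rc)
                    goto out_frpl;

            /* ... RDMA READ/WRITE traffic using mr->rkey goes here ... */

            /* 4. LOCAL_INV work request makes the MR INVALID again. */
            memset(&inv_wr, 0, sizeof(inv_wr));
            inv_wr.opcode = IB_WR_LOCAL_INV;
            inv_wr.send_flags = IB_SEND_SIGNALED;
            inv_wr.ex.invalidate_rkey = mr->rkey;
            rc = ib_post_send(qp, &inv_wr, &bad_wr);

            /* 5. Tear down. */
    out_frpl:
            ib_free_fast_reg_page_list(frpl);
    out_mr:
            ib_dereg_mr(mr);
            return rc;
    }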
| Comment by James A Simmons [ 16/Dec/15 ] |
|
This work might have flaws but it works like a charm on my mlx5 system. |
| Comment by Alexey Lyashkov [ 16/Dec/15 ] |
|
James, an mlx5 system may work in the older style. I have run o2iblnd without modification on my own mlx5 system, so why not? With ibh_mr_shift == 64 the driver just takes a single global DMA MR:

    if (hdev->ibh_mr_shift == 64) {
            LIBCFS_ALLOC(hdev->ibh_mrs, 1 * sizeof(*hdev->ibh_mrs));
            if (hdev->ibh_mrs == NULL) {
                    CERROR("Failed to allocate MRs table\n");
                    return -ENOMEM;
            }

            hdev->ibh_mrs[0] = NULL;
            hdev->ibh_nmrs   = 1;

            mr = ib_get_dma_mr(hdev->ibh_pd, acflags);
            if (IS_ERR(mr)) {
                    CERROR("Failed ib_get_dma_mr : %ld\n", PTR_ERR(mr));
                    kiblnd_hdev_cleanup_mrs(hdev);
                    return PTR_ERR(mr);
            }

            hdev->ibh_mrs[0] = mr;
            goto out;
    }

    struct ib_mr *
    kiblnd_find_rd_dma_mr(kib_hca_dev_t *hdev, kib_rdma_desc_t *rd)
    {
            struct ib_mr *prev_mr;
            struct ib_mr *mr;
            int           i;

            LASSERT(hdev->ibh_mrs[0] != NULL);

            if (*kiblnd_tunables.kib_map_on_demand > 0 &&
                *kiblnd_tunables.kib_map_on_demand <= rd->rd_nfrags)
                    return NULL;

            if (hdev->ibh_nmrs == 1)
                    return hdev->ibh_mrs[0];
            ...
    }

and the tx mapping path only falls back to the FMR/PMR pools when no pre-mapped MR is found:

    /* looking for pre-mapping MR */
    mr = kiblnd_find_rd_dma_mr(hdev, rd);
    if (mr != NULL) {
            /* found pre-mapping MR */
            rd->rd_key = (rd != tx->tx_rd) ? mr->rkey : mr->lkey;
            return 0;
    }

    if (net->ibn_fmr_ps != NULL)
            return kiblnd_fmr_map_tx(net, tx, rd, nob);
    else if (net->ibn_pmr_ps != NULL)
            return kiblnd_pmr_map_tx(net, tx, rd, nob);

So it is enough to send a local DMA address for the mapped page as part of the IB_WR_RDMA_WRITE / IB_WR_RDMA_READ request. A simple check. If fast registration is used, you may safely drop the number of WRs per QP from 256 * num_parallel_sends to 5 * num_parallel_sends. That is a way to solve the OOM problems on nodes that have been reported many times in the past. |
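For scale (the multiplier is from the comment above, the send count is an assumed example): with 8 parallel sends that is 256 * 8 = 2048 work requests per QP today versus 5 * 8 = 40 with fast registration, which is where the memory saving would come from.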
| Comment by James A Simmons [ 16/Dec/15 ] |
|
Actually for my mlx5 stack testing I am enabling map_on_demand and it is working. |
| Comment by Alexey Lyashkov [ 16/Dec/15 ] |
|
Well, with map_on_demand enabled it may work, but anyway:

    + if (!frd->frd_valid) {
    +         wrq = &frd->frd_inv_wr;
    +         wrq->next = &frd->frd_fastreg_wr;
    + } else {
    +         wrq = &frd->frd_fastreg_wr;
    + }

It may simplify the code by avoiding a call to invalidate in the error case, but it blocks an IB card optimization; per the IB spec, the WR workflow should be kept in order. |
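For reference, chaining the two work requests into a single post looks roughly like the sketch below. It uses the same old verbs API as the earlier example; the helper and its parameter names are illustrative, not taken from the patch.

    #include <rdma/ib_verbs.h>

    /* Sketch: post LOCAL_INV and FAST_REG_MR as one ordered chain.
     * The HCA executes the chain in posting order, so the invalidate,
     * when needed, always precedes the re-registration. */
    static int post_inv_then_fastreg(struct ib_qp *qp,
                                     struct ib_send_wr *inv_wr,
                                     struct ib_send_wr *fastreg_wr,
                                     bool need_local_inv)
    {
            struct ib_send_wr *first = fastreg_wr;
            struct ib_send_wr *bad_wr;

            if (need_local_inv) {
                    /* Invalidate the previous mapping first. */
                    inv_wr->opcode = IB_WR_LOCAL_INV;
                    inv_wr->next = fastreg_wr;
                    first = inv_wr;
            }
            fastreg_wr->next = NULL;

            return ib_post_send(qp, first, &bad_wr);
    }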
| Comment by James A Simmons [ 16/Dec/15 ] |
|
Oh, I agree. Just because something works doesn't mean it is done right. |
| Comment by Alexey Lyashkov [ 17/Dec/15 ] |
|
Well, I know where you took the original code from; in iser it looks like this:

    + /* Prepare FASTREG WR */
    + memset(&fastreg_wr, 0, sizeof(fastreg_wr));
    + fastreg_wr.opcode = IB_WR_FAST_REG_MR;
    + fastreg_wr.send_flags = IB_SEND_SIGNALED;
    + fastreg_wr.wr.fast_reg.iova_start = desc->data_frpl->page_list[0] + offset;
    + fastreg_wr.wr.fast_reg.page_list = desc->data_frpl;
    + fastreg_wr.wr.fast_reg.page_list_len = page_list_len;
    + fastreg_wr.wr.fast_reg.page_shift = SHIFT_4K;
    + fastreg_wr.wr.fast_reg.length = data_size;
    + fastreg_wr.wr.fast_reg.rkey = desc->data_mr->rkey;
    + fastreg_wr.wr.fast_reg.access_flags = (IB_ACCESS_LOCAL_WRITE |
    +                                        IB_ACCESS_REMOTE_WRITE |
    +                                        IB_ACCESS_REMOTE_READ);

So iser does not have two page lists as in the patch; it directly uses the preallocated DMA region pages. Our code will probably use roughly twice the memory if we don't use the same pages for transfer and mapping.

> I agree with you that perhaps we should just remove FMR support at this point since it will be going away.

The mlx4 driver, for example, already advertises the capability:

    if ((dev->dev->caps.bmme_flags & MLX4_BMME_FLAG_LOCAL_INV) &&
        (dev->dev->caps.bmme_flags & MLX4_BMME_FLAG_REMOTE_INV) &&
        (dev->dev->caps.bmme_flags & MLX4_BMME_FLAG_FAST_REG_WR))
            props->device_cap_flags |= IB_DEVICE_MEM_MGT_EXTENSIONS;

So it is probably better to just lower the priority of FMR rather than remove it. One more note: don't use a flag, but rather two function pointers, one to map and one to unmap a region, since the code paths are completely different. |
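A minimal sketch of the function-pointer approach suggested above (all names are hypothetical, not from the patch): the ops pair is chosen once at device setup based on the capability flag, and the tx path then calls through the pointers instead of testing a flag on every mapping.

    #include <rdma/ib_verbs.h>

    /* Stub bodies: real implementations would post IB_WR_FAST_REG_MR /
     * IB_WR_LOCAL_INV or use the FMR pool calls, respectively. */
    static int  kib_fastreg_map_tx(void *pool, void *tx)   { return 0; }
    static void kib_fastreg_unmap_tx(void *pool, void *tx) { }
    static int  kib_fmr_map_tx_stub(void *pool, void *tx)  { return 0; }
    static void kib_fmr_unmap_tx_stub(void *pool, void *tx){ }

    /* Hypothetical ops table: picked once, not branched on per tx. */
    struct kib_reg_ops {
            int  (*ro_map)(void *pool, void *tx);    /* register frags  */
            void (*ro_unmap)(void *pool, void *tx);  /* release mapping */
    };

    static const struct kib_reg_ops kib_fastreg_ops = {
            .ro_map   = kib_fastreg_map_tx,
            .ro_unmap = kib_fastreg_unmap_tx,
    };

    static const struct kib_reg_ops kib_fmr_ops = {
            .ro_map   = kib_fmr_map_tx_stub,
            .ro_unmap = kib_fmr_unmap_tx_stub,
    };

    /* Select the ops once, based on the device capability flag. */
    static const struct kib_reg_ops *
    kib_select_reg_ops(struct ib_device *ibdev)
    {
            struct ib_device_attr attr;

            if (ib_query_device(ibdev, &attr) == 0 &&
                (attr.device_cap_flags & IB_DEVICE_MEM_MGT_EXTENSIONS))
                    return &kib_fastreg_ops;

            return &kib_fmr_ops;    /* fall back to FMR pools */
    }

The chosen pointer pair could live in the per-device or per-net structure, so the hot path never branches on the registration type.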
| Comment by Jeremy Filizetti [ 18/Dec/15 ] |
|
I know we have mthca cards in use that rely on FMR and, I believe, do not support fast registration, so I'm fine with lowering the priority but not with removing the FMR code altogether. I'm having some issues with the patch, but there is already some good feedback here towards a second revision, so I'll wait until then before I dig deeper. |
| Comment by Gerrit Updater [ 13/Mar/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17606/ |
| Comment by Alexey Lyashkov [ 13/Mar/16 ] |
|
I see that Intel does not work in cooperation with other developers, as it landed the patch without resolving the comments in this ticket. |
| Comment by James A Simmons [ 13/Mar/16 ] |
|
You didn't review the patch in Gerrit, nor did you push a patch anywhere. I understand you don't want to use the Gerrit system without Gmail support, but if you want to improve this work I suggest you create a GitHub branch with your fix. I would gladly push the fix to Gerrit for you. You need to figure out a way to work with us. |
| Comment by Alexey Lyashkov [ 14/Mar/16 ] |
|
Intel hasn't fixed Gerrit so that I am able to review. When will you fix your own Gerrit to work with Google accounts? |
| Comment by Cory Spitz [ 13/Apr/16 ] |
|
Alexey, we discussed your issue at the OpenSFS LWG and we decided to stick with the status quo. |
| Comment by Peter Jones [ 13/May/16 ] |
|
As far as I can see, this has landed for 2.9. |