[LU-5783] o2iblnd: investigate new memory registration mechanisms Created: 21/Oct/14  Updated: 04/Jun/16  Resolved: 13/May/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.9.0

Type: Improvement Priority: Minor
Reporter: Isaac Huang (Inactive) Assignee: Dmitry Eremin (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-6850 Remove use of ib_reg_phys_mr() from o... Resolved
is related to LU-3322 ko2iblnd support for different map_on... Resolved
is related to LU-7124 MLX5: Limit hit in cap.max_send_wr Resolved
Rank (Obsolete): 16229

 Description   

OFED supports new memory registration mechanisms that o2iblnd currently doesn't use.

The first one may improve throughput, and the second one may solve the long-standing "too many RDMA frags" failures we had (forgot which ticket, Liang?).

We should do some investigation and decide whether to add support for these OFED features to o2iblnd.



 Comments   
Comment by James A Simmons [ 18/Sep/15 ]

I believe this is the replacement for FMR in the mlx5 driver.

Comment by James A Simmons [ 22/Sep/15 ]

I was just told by a coworker that the above APIs are being thrown out and replaced by a new RDMA API. When the new API hits, FMR will be removed from all drivers. You can see the details in this thread:

http://article.gmane.org/gmane.linux.drivers.rdma/29040

Comment by James A Simmons [ 08/Oct/15 ]

Started work on a patch to support the Fast Registration API.

Comment by Gerrit Updater [ 15/Dec/15 ]

Dmitry Eremin (dmitry.eremin@intel.com) uploaded a new patch: http://review.whamcloud.com/17606
Subject: LU-5783 o2iblnd: Add Fast Reg memory registration support
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 833db95ca7ada950c06de0e5bbdd1ca3da83c3db

Comment by Alexey Lyashkov [ 15/Dec/15 ]

I'm sorry, but I'm not able to post comments into gerrit directly.
But the patch is completely wrong.
First of all - you don't need to allocate any pages via ib_alloc_fast_reg_page_list(); you just need to map the pages from the MD to DMA addresses.
Second - you must change the QP attributes. You don't need such large WR / SGE limits if fast registration is enabled - typically you need 7 WRs * num_parallel_sends.
Third - you must INVALIDATE a memory region after use. That is possible via the READ_INV command, but for writes you need an additional WR with an invalidate operation.
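
A minimal sketch of those two invalidation paths, assuming the pre-4.x kernel verbs API (qp and rkey are placeholders, and the RDMA address/sge fields of the work requests are elided):

        struct ib_send_wr wr, inv_wr, *bad_wr;

        /* READ case: the read and the invalidate go in a single WR */
        memset(&wr, 0, sizeof(wr));
        wr.opcode = IB_WR_RDMA_READ_WITH_INV;
        wr.ex.invalidate_rkey = rkey;   /* rkey of the fast-reg MR to drop */
        ib_post_send(qp, &wr, &bad_wr);

        /* WRITE case: after the write completes, an additional
         * invalidate WR is required */
        memset(&inv_wr, 0, sizeof(inv_wr));
        inv_wr.opcode = IB_WR_LOCAL_INV;
        inv_wr.ex.invalidate_rkey = rkey;
        ib_post_send(qp, &inv_wr, &bad_wr);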

Consumer interface details:

 - A new device capability flag IB_DEVICE_MEM_MGT_EXTENSIONS is added
   to indicate device support for these features.

 - New send work request opcodes IB_WR_FAST_REG_MR, IB_WR_LOCAL_INV,
   IB_WR_RDMA_READ_WITH_INV are added.

 - A new consumer API function, ib_alloc_mr() is added to allocate
   fast register memory regions.

 - New consumer API functions, ib_alloc_fast_reg_page_list() and
   ib_free_fast_reg_page_list() are added to allocate and free
   device-specific memory for fast registration page lists.

 - A new consumer API function, ib_update_fast_reg_key(), is added to
   allow the key portion of the R_Key and L_Key of a fast registration
   MR to be updated.  Consumers call this if desired before posting
   a IB_WR_FAST_REG_MR work request.

Consumers can use this as follows:

 - MR is allocated with ib_alloc_mr().

 - Page list memory is allocated with ib_alloc_fast_reg_page_list().

 - MR R_Key/L_Key "key" field is updated with ib_update_fast_reg_key().

 - MR made VALID and bound to a specific page list via
   ib_post_send(IB_WR_FAST_REG_MR)

 - MR made INVALID via ib_post_send(IB_WR_LOCAL_INV),
   ib_post_send(IB_WR_RDMA_READ_WITH_INV) or an incoming send with
   invalidate operation.

 - MR is deallocated with ib_dereg_mr()

 - page lists dealloced via ib_free_fast_reg_page_list().
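
A minimal end-to-end sketch of that lifecycle, assuming the pre-4.x fast-reg verbs API (ib_alloc_fast_reg_mr() is the allocation call of that era; later kernels replace it with ib_alloc_mr()). pd, qp, dma_addr, npages and nob are placeholders, and error handling is omitted:

        #include <rdma/ib_verbs.h>

        static int fastreg_map_example(struct ib_pd *pd, struct ib_qp *qp,
                                       u64 *dma_addr, int npages, u32 nob)
        {
                struct ib_fast_reg_page_list *fr_pl;
                struct ib_send_wr fr_wr, inv_wr, *bad_wr;
                struct ib_mr *mr;
                static u8 key;
                int i;

                mr = ib_alloc_fast_reg_mr(pd, npages);         /* allocate MR */
                fr_pl = ib_alloc_fast_reg_page_list(pd->device, npages);

                for (i = 0; i < npages; i++)                   /* DMA addrs of */
                        fr_pl->page_list[i] = dma_addr[i];     /* buffer pages */

                ib_update_fast_reg_key(mr, ++key);             /* bump key     */

                memset(&fr_wr, 0, sizeof(fr_wr));              /* MR -> VALID  */
                fr_wr.opcode                    = IB_WR_FAST_REG_MR;
                fr_wr.send_flags                = IB_SEND_SIGNALED;
                fr_wr.wr.fast_reg.iova_start    = dma_addr[0];
                fr_wr.wr.fast_reg.page_list     = fr_pl;
                fr_wr.wr.fast_reg.page_list_len = npages;
                fr_wr.wr.fast_reg.page_shift    = PAGE_SHIFT;
                fr_wr.wr.fast_reg.length        = nob;
                fr_wr.wr.fast_reg.rkey          = mr->rkey;
                fr_wr.wr.fast_reg.access_flags  = IB_ACCESS_LOCAL_WRITE |
                                                  IB_ACCESS_REMOTE_WRITE |
                                                  IB_ACCESS_REMOTE_READ;
                ib_post_send(qp, &fr_wr, &bad_wr);

                /* ... the RDMA READ/WRITE against mr->rkey happens here ... */

                memset(&inv_wr, 0, sizeof(inv_wr));            /* MR -> INVALID */
                inv_wr.opcode             = IB_WR_LOCAL_INV;
                inv_wr.ex.invalidate_rkey = mr->rkey;
                ib_post_send(qp, &inv_wr, &bad_wr);

                ib_dereg_mr(mr);                               /* teardown */
                ib_free_fast_reg_page_list(fr_pl);
                return 0;
        }
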
Comment by James A Simmons [ 16/Dec/15 ]

This work might have flaws but it works like a charm on my mlx5 system.

Comment by Alexey Lyashkov [ 16/Dec/15 ]

James,

An mlx5 system may work in the older style. I have run o2iblnd without modification on my own mlx5 system. Why not?
The Fast Reg codepath is inactive in the common case; it's activated only with the map_on_demand option. By default o2iblnd uses a global MR mapping:

        if (hdev->ibh_mr_shift == 64) {
                LIBCFS_ALLOC(hdev->ibh_mrs, 1 * sizeof(*hdev->ibh_mrs));
                if (hdev->ibh_mrs == NULL) {
                        CERROR("Failed to allocate MRs table\n");
                        return -ENOMEM;
                }

                hdev->ibh_mrs[0] = NULL;
                hdev->ibh_nmrs   = 1;

                mr = ib_get_dma_mr(hdev->ibh_pd, acflags);
                if (IS_ERR(mr)) {
                        CERROR("Failed ib_get_dma_mr : %ld\n", PTR_ERR(mr));
                        kiblnd_hdev_cleanup_mrs(hdev);
                        return PTR_ERR(mr);
                }

                hdev->ibh_mrs[0] = mr;

                goto out;
        }
struct ib_mr *
kiblnd_find_rd_dma_mr(kib_hca_dev_t *hdev, kib_rdma_desc_t *rd)
{
        struct ib_mr *prev_mr;
        struct ib_mr *mr;
        int           i;

        LASSERT (hdev->ibh_mrs[0] != NULL);

        if (*kiblnd_tunables.kib_map_on_demand > 0 &&
            *kiblnd_tunables.kib_map_on_demand <= rd->rd_nfrags)
                return NULL;

        if (hdev->ibh_nmrs == 1)
                return hdev->ibh_mrs[0];
        ...
}

and then in kiblnd_map_tx():

        /* looking for pre-mapping MR */
        mr = kiblnd_find_rd_dma_mr(hdev, rd);
        if (mr != NULL) {
                /* found pre-mapping MR */
                rd->rd_key = (rd != tx->tx_rd) ? mr->rkey : mr->lkey;
                return 0;
        }

        if (net->ibn_fmr_ps != NULL)
                return kiblnd_fmr_map_tx(net, tx, rd, nob);
        else if (net->ibn_pmr_ps != NULL)
                return kiblnd_pmr_map_tx(net, tx, rd, nob);

So it's enough to send the local DMA address of each mapped page as part of an IB_WR_RDMA_WRITE / IB_WR_RDMA_READ request;
all the pages are sent in a WR chain.

A simple check: if fast reg is used, you may safely drop the number of WRs per QP from 256 * num_parallel_sends to 5 * num_parallel_sends.
Explanation: with the older code you need to fill in a remote address + R_Key for each IB_WR_RDMA_WRITE / IB_WR_RDMA_READ request. But with the IB memory extensions you are able to have a single WR request that covers the whole transfer; the page list filling the transfer is taken from the fast reg operation and addressed by a special L_Key.

It's a way to solve the OOM problems on nodes that were reported many times in the past (see the arithmetic sketch below).
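
As a rough illustration of that arithmetic (the constants are the ones quoted above; num_parallel_sends stands in for whatever concurrency tunable applies and its value here is made up):

        int num_parallel_sends = 8;                     /* placeholder */

        /* old scheme: one RDMA WR, each carrying a remote address +
         * R_Key, per fragment, up to 256 fragments per transfer */
        int max_send_wr_old  = 256 * num_parallel_sends;  /* 2048 WRs */

        /* fast-reg scheme: FASTREG + one RDMA + INVALIDATE per transfer,
         * plus a couple of control-message WRs */
        int max_send_wr_frwr = 5 * num_parallel_sends;    /*   40 WRs */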

Comment by James A Simmons [ 16/Dec/15 ]

Actually for my mlx5 stack testing I am enabling map_on_demand and it is working.

Comment by Alexey Lyashkov [ 16/Dec/15 ]

Well, with map_on_demand enabled it may work, but anyway:
1) IB_WR_RDMA_WRITE_IMM / IB_WR_RDMA_READ_WITH_INV should be reused to avoid extra WRs.
2) The total number of WRs per QP needs to be adjusted to reflect the extra WRs used for mapping purposes.
3) Don't prefer FMR where both exist. Fast Reg is supported by most cards now, and it's the better choice.
4) Don't use "invalidate->register" WR links; it directly affects (1).

+                       if (!frd->frd_valid) {
+                               wrq = &frd->frd_inv_wr;
+                               wrq->next = &frd->frd_fastreg_wr;
+                       } else {
+                               wrq = &frd->frd_fastreg_wr;
+                       }

It may simplify the code by avoiding an invalidate call in the error case, but it blocks an IB card optimization. Per the IB spec, the WR workflow should be in order (see the sketch after this list):
1) register
2) write/read
3) invalidate
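
A sketch of that ordering as a single posted chain, assuming the same pre-4.x ib_send_wr layout (registration and RDMA payload fields elided; rkey is a placeholder):

        struct ib_send_wr fastreg_wr, rdma_wr, inv_wr, *bad_wr;

        memset(&fastreg_wr, 0, sizeof(fastreg_wr));
        memset(&rdma_wr, 0, sizeof(rdma_wr));
        memset(&inv_wr, 0, sizeof(inv_wr));

        fastreg_wr.opcode = IB_WR_FAST_REG_MR;   /* 1) register   */
        fastreg_wr.next   = &rdma_wr;

        rdma_wr.opcode    = IB_WR_RDMA_WRITE;    /* 2) write/read */
        rdma_wr.next      = &inv_wr;

        inv_wr.opcode     = IB_WR_LOCAL_INV;     /* 3) invalidate */
        inv_wr.ex.invalidate_rkey = rkey;
        inv_wr.next       = NULL;

        /* one ib_post_send() keeps all three WRs in order on the
         * send queue */
        ib_post_send(qp, &fastreg_wr, &bad_wr);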

Comment by James A Simmons [ 16/Dec/15 ]

Oh I agree. Just because something works doesn't mean it is done right. I agree with you that perhaps we should just remove FMR support at this point since it will be going away.

Comment by Alexey Lyashkov [ 17/Dec/15 ]

Well, I know where you took the original code from, but one note.

+       /* Prepare FASTREG WR */
+       memset(&fastreg_wr, 0, sizeof(fastreg_wr));
+       fastreg_wr.opcode = IB_WR_FAST_REG_MR;
+       fastreg_wr.send_flags = IB_SEND_SIGNALED;
+       fastreg_wr.wr.fast_reg.iova_start = desc->data_frpl->page_list[0] + offset;
+       fastreg_wr.wr.fast_reg.page_list = desc->data_frpl;
+       fastreg_wr.wr.fast_reg.page_list_len = page_list_len;
+       fastreg_wr.wr.fast_reg.page_shift = SHIFT_4K;
+       fastreg_wr.wr.fast_reg.length = data_size;
+       fastreg_wr.wr.fast_reg.rkey = desc->data_mr->rkey;
+       fastreg_wr.wr.fast_reg.access_flags = (IB_ACCESS_LOCAL_WRITE  |
+                                              IB_ACCESS_REMOTE_WRITE |
+                                              IB_ACCESS_REMOTE_READ);
+

So iSER doesn't have two page lists as in the patch; it directly uses the preallocated DMA region pages. Our code will probably use twice the memory if we don't use the same pages for transfer and mapping.

> I agree with you that perhaps we should just remove FMR support at this point since it will be going away.
Not at all. Yes, Fast Reg was introduced in 2008, but not all mlx4 cards support it (if you have a good Mellanox contact, please ask him):

        if ((dev->dev->caps.bmme_flags & MLX4_BMME_FLAG_LOCAL_INV) &&
            (dev->dev->caps.bmme_flags & MLX4_BMME_FLAG_REMOTE_INV) &&
            (dev->dev->caps.bmme_flags & MLX4_BMME_FLAG_FAST_REG_WR))
                props->device_cap_flags |= IB_DEVICE_MEM_MGT_EXTENSIONS;

So it's probably better to just lower the priority of FMR.

One more note - don't use a flag; instead keep two function pointers, one to map and one to unmap a region, as the two code paths are completely different (see the sketch below).
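
A hedged sketch of that suggestion - all names here are hypothetical, not from the actual patch, except kiblnd_fmr_map_tx() which exists in o2iblnd today:

        /* select the registration strategy once at setup time instead
         * of testing a flag on every transfer */
        struct kib_reg_ops {
                int  (*ro_map_tx)(kib_net_t *net, kib_tx_t *tx,
                                  kib_rdma_desc_t *rd, int nob);
                void (*ro_unmap_tx)(kib_net_t *net, kib_tx_t *tx);
        };

        static struct kib_reg_ops kib_fastreg_ops = {
                .ro_map_tx   = kiblnd_fastreg_map_tx,   /* hypothetical */
                .ro_unmap_tx = kiblnd_fastreg_unmap_tx, /* hypothetical */
        };

        static struct kib_reg_ops kib_fmr_ops = {
                .ro_map_tx   = kiblnd_fmr_map_tx,       /* exists today */
                .ro_unmap_tx = kiblnd_fmr_unmap_tx,     /* hypothetical */
        };

        /* prefer Fast Reg when the HCA advertises the memory extensions
         * (dev_caps and ibn_reg_ops are hypothetical as well) */
        net->ibn_reg_ops = (dev_caps & IB_DEVICE_MEM_MGT_EXTENSIONS) ?
                           &kib_fastreg_ops : &kib_fmr_ops;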

Comment by Jeremy Filizetti [ 18/Dec/15 ]

I know we have mthca cards in use which use FMR and which, I believe, don't support fast reg, so I'm fine with lowering the priority but not removing the FMR code altogether. I'm having some issues with the patch, but there is already some good feedback here towards a second revision, so I'll wait until then before I dig deeper.

Comment by Gerrit Updater [ 13/Mar/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17606/
Subject: LU-5783 o2iblnd: Add Fast Reg memory registration support
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: f2600449ced3a2f78c4b3e650ff29bb88d1e7e45

Comment by Alexey Lyashkov [ 13/Mar/16 ]

I'd like to note that Intel doesn't work in cooperation with other developers, as it landed a patch without resolving the comments in this ticket.

Comment by James A Simmons [ 13/Mar/16 ]

You didn't review the patch in gerrit, nor did you push a patch anywhere. I understand you don't want to use the gerrit system without gmail support, but if you want to improve this work I suggest you create a github branch with your fix. I would gladly push the fix to gerrit for you. You need to figure out a way to work with us.

Comment by Alexey Lyashkov [ 14/Mar/16 ]

Intel doesn't fix gerrit to enable reviews. When will you fix your own gerrit to work with google?
I have been asking for two years; newer gerrit versions work fine with google, but Intel sticks with an older and buggy version.

Comment by Cory Spitz [ 13/Apr/16 ]

Alexey, we discussed your issue at the OpenSFS LWG and we decided to stick with the status quo.
James, it was very gracious and helpful of you to offer to push a fix to gerrit on behalf of Alexey. Alexey, I agree with James: you need to find a way to work with us. Please take him up on his offer, but I suggest you find a way to work with the community more generally. If you'd like the LWG to revisit this topic, please send mail to the LWG list.

Comment by Peter Jones [ 13/May/16 ]

As far as I can see this has been landed for 2.9.
