Lustre / LU-5783

o2iblnd: investigate new memory registration mechanisms

Details

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.9.0
    • None
    • None
    • 16229

    Description

      OFED has support for new memory registration mechanisms that the o2iblnd currently does not use.

      The first may improve throughput, and the second may solve the long-standing "too many RDMA frags" failures we have seen (I forget which ticket, Liang?).

      We should do some investigation and decide whether to add support for these OFED features to the o2iblnd.

      Attachments

      Issue Links

      Activity

            [LU-5783] o2iblnd: investigate new memory registration mechanisms

            Nice to see that Intel doesn't work in cooperation with other developers, as it landed a patch without resolving the comments in this ticket.

            shadow Alexey Lyashkov added a comment

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17606/
            Subject: LU-5783 o2iblnd: Add Fast Reg memory registration support
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: f2600449ced3a2f78c4b3e650ff29bb88d1e7e45

            gerrit Gerrit Updater added a comment

            I know we have mthca cards in use that use FMR, and I don't believe they support Fast Reg, so I'm fine with lowering the priority but not with removing the FMR code altogether. I'm having some issues with the patch, but there is already some good feedback here toward a second revision, so I'll wait until then before I dig deeper.

            jfilizetti Jeremy Filizetti added a comment

            Well, I know where you took the original code from, but one note.

            +       /* Prepare FASTREG WR */
            +       memset(&fastreg_wr, 0, sizeof(fastreg_wr));
            +       fastreg_wr.opcode = IB_WR_FAST_REG_MR;
            +       fastreg_wr.send_flags = IB_SEND_SIGNALED;
            +       fastreg_wr.wr.fast_reg.iova_start = desc->data_frpl->page_list[0] + offset;
            +       fastreg_wr.wr.fast_reg.page_list = desc->data_frpl;
            +       fastreg_wr.wr.fast_reg.page_list_len = page_list_len;
            +       fastreg_wr.wr.fast_reg.page_shift = SHIFT_4K;
            +       fastreg_wr.wr.fast_reg.length = data_size;
            +       fastreg_wr.wr.fast_reg.rkey = desc->data_mr->rkey;
            +       fastreg_wr.wr.fast_reg.access_flags = (IB_ACCESS_LOCAL_WRITE  |
            +                                              IB_ACCESS_REMOTE_WRITE |
            +                                              IB_ACCESS_REMOTE_READ);
            +
            

            So iser doesn't keep two page lists as in the patch; it directly uses the preallocated DMA region pages. Our code will probably use extra memory if we don't use the same pages for both the transfer and the mapping.

            > I agree with you that perhaps we should just remove FMR support at this since it will be going away.
            Not at all. Yes, Fast Reg was introduced in 2008, but not all mlx4 cards support it (if you have a good Mellanox contact, please ask him).

                    if ((dev->dev->caps.bmme_flags & MLX4_BMME_FLAG_LOCAL_INV) &&
                        (dev->dev->caps.bmme_flags & MLX4_BMME_FLAG_REMOTE_INV) &&
                        (dev->dev->caps.bmme_flags & MLX4_BMME_FLAG_FAST_REG_WR))
                            props->device_cap_flags |= IB_DEVICE_MEM_MGT_EXTENSIONS;
            

            So it's probably better to just lower the priority of FMR.

            One more note: don't use a flag; instead keep two function pointers, one to map and one to unmap a region, since the two code paths are completely different.

            shadow Alexey Lyashkov added a comment

            Oh I agree. Just because something works doesn't mean it is done right. I agree with you that perhaps we should just remove FMR support at this point since it will be going away.

            simmonsja James A Simmons added a comment

            Well, with map_on_demand enabled it may work, but anyway:
            1) IB_WR_RDMA_WRITE_IMM and IB_WR_RDMA_READ_WITH_INV should be reused to avoid extra WRs.
            2) The total number of WRs per QP needs to be adjusted to reflect the extra WRs used for mapping.
            3) Don't prefer FMR when both exist; Fast Reg is supported by most cards now and is the better choice.
            4) Don't use "invalidate -> register" WR chains, as this directly affects (1).

            +                       if (!frd->frd_valid) {
            +                               wrq = &frd->frd_inv_wr;
            +                               wrq->next = &frd->frd_fastreg_wr;
            +                       } else {
            +                               wrq = &frd->frd_fastreg_wr;
            +                       }
            

            It may simplify the code by avoiding an invalidate call in the error case, but it blocks an IB card optimization. Per the IB spec, the WR workflow should be in this order:
            1) register
            2) write/read
            3) invalidate

            shadow Alexey Lyashkov added a comment

            Actually for my mlx5 stack testing I am enabling map_on_demand and it is working.

            simmonsja James A Simmons added a comment
            shadow Alexey Lyashkov added a comment - - edited

            James,

            mlx5 systems may also work in the old style; I have run an unmodified o2iblnd on my own mlx5 system, so why not? The FMR/Fast Reg codepath is inactive in the common case; it is activated only with the map_on_demand option. By default o2iblnd uses a global MR mapping:

                    if (hdev->ibh_mr_shift == 64) {
                            LIBCFS_ALLOC(hdev->ibh_mrs, 1 * sizeof(*hdev->ibh_mrs));
                            if (hdev->ibh_mrs == NULL) {
                                    CERROR("Failed to allocate MRs table\n");
                                    return -ENOMEM;
                            }
            
                            hdev->ibh_mrs[0] = NULL;
                            hdev->ibh_nmrs   = 1;
            
                            mr = ib_get_dma_mr(hdev->ibh_pd, acflags);
                            if (IS_ERR(mr)) {
                                    CERROR("Failed ib_get_dma_mr : %ld\n", PTR_ERR(mr));
                                    kiblnd_hdev_cleanup_mrs(hdev);
                                    return PTR_ERR(mr);
                            }
            
                            hdev->ibh_mrs[0] = mr;
            
                            goto out;
                    }
            
            struct ib_mr *
            kiblnd_find_rd_dma_mr(kib_hca_dev_t *hdev, kib_rdma_desc_t *rd)
            {
                    struct ib_mr *prev_mr;
                    struct ib_mr *mr;
                    int           i;
            
                    LASSERT (hdev->ibh_mrs[0] != NULL);
            
                    if (*kiblnd_tunables.kib_map_on_demand > 0 &&
                        *kiblnd_tunables.kib_map_on_demand <= rd->rd_nfrags)
                            return NULL;
            
                    if (hdev->ibh_nmrs == 1)
                            return hdev->ibh_mrs[0];

                    /* ... */

                    /* looking for pre-mapping MR */
                    mr = kiblnd_find_rd_dma_mr(hdev, rd);
                    if (mr != NULL) {
                            /* found pre-mapping MR */
                            rd->rd_key = (rd != tx->tx_rd) ? mr->rkey : mr->lkey;
                            return 0;
                    }
            
                    if (net->ibn_fmr_ps != NULL)
                            return kiblnd_fmr_map_tx(net, tx, rd, nob);
                    else if (net->ibn_pmr_ps != NULL)
                            return kiblnd_pmr_map_tx(net, tx, rd, nob);
            

            So it is enough to send the local DMA address of each mapped page as part of the IB_WR_RDMA_WRITE / IB_WR_RDMA_READ request; all pages are sent in a WR chain.

            A simple check: if Fast Reg is used, you may safely drop the number of WRs per QP from 256 * num_parallel_sends to 5 * num_parallel_sends. Explanation: with the older code you need to fill in a remote address + R_Key for each IB_WR_RDMA_WRITE / IB_WR_RDMA_READ request, but with the IB memory extensions you can have a single WR request covering the whole transfer. The page list backing the transfer is taken from the Fast Reg operation and addressed by a special L_Key.

            It is a way to solve the OOM problems on nodes reported many times in the past.

            shadow Alexey Lyashkov added a comment - edited

            This work might have flaws but it works like a charm on my mlx5 system.

            simmonsja James A Simmons added a comment

            I'm sorry, but I am not able to post comments into Gerrit directly.
            But the patch is completely wrong.
            First of all, you don't need to allocate any pages via ib_alloc_fast_reg_page_list; you just need to map the pages from the MD to DMA addresses.
            Second, you must change the QP attributes. You don't need such large WR / SGE limits if fast registration is enabled; typically you need 7 WRs * num_parallel_sends.
            Third, you must INVALIDATE a memory region after use. That is possible via the RDMA_READ_WITH_INV command, but for writes you need an additional WR with an invalidate operation.

            Consumer interface details:
            
             - A new device capability flag IB_DEVICE_MEM_MGT_EXTENSIONS is added
               to indicate device support for these features.
            
             - New send work request opcodes IB_WR_FAST_REG_MR, IB_WR_LOCAL_INV,
               IB_WR_RDMA_READ_WITH_INV are added.
            
             - A new consumer API function, ib_alloc_mr() is added to allocate
               fast register memory regions.
            
             - New consumer API functions, ib_alloc_fast_reg_page_list() and
               ib_free_fast_reg_page_list() are added to allocate and free
               device-specific memory for fast registration page lists.
            
             - A new consumer API function, ib_update_fast_reg_key(), is added to
               allow the key portion of the R_Key and L_Key of a fast registration
               MR to be updated.  Consumers call this if desired before posting
               a IB_WR_FAST_REG_MR work request.
            
            Consumers can use this as follows:
            
             - MR is allocated with ib_alloc_mr().
            
             - Page list memory is allocated with ib_alloc_fast_reg_page_list().
            
             - MR R_Key/L_Key "key" field is updated with ib_update_fast_reg_key().
            
             - MR made VALID and bound to a specific page list via
               ib_post_send(IB_WR_FAST_REG_MR)
            
             - MR made INVALID via ib_post_send(IB_WR_LOCAL_INV),
               ib_post_send(IB_WR_RDMA_READ_WITH_INV) or an incoming send with
               invalidate operation.
            
             - MR is deallocated with ib_dereg_mr()
            
             - page lists dealloced via ib_free_fast_reg_page_list().
            
            shadow Alexey Lyashkov added a comment

            Dmitry Eremin (dmitry.eremin@intel.com) uploaded a new patch: http://review.whamcloud.com/17606
            Subject: LU-5783 o2iblnd: Add Fast Reg memory registration support
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 833db95ca7ada950c06de0e5bbdd1ca3da83c3db

            gerrit Gerrit Updater added a comment

            People

              Assignee: dmiter Dmitry Eremin (Inactive)
              Reporter: isaac Isaac Huang (Inactive)
              Votes: 0
              Watchers: 18

              Dates

                Created:
                Updated:
                Resolved: