Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-5783

o2iblnd: investigate new memory registration mechanisms

Details

    • Improvement
    • Resolution: Fixed
    • Minor
    • Lustre 2.9.0
    • None
    • None
    • 16229

    Description

      The OFED has support of new memory registration mechanisms that the o2iblnd currently don't use.

      The 1st one may improve throughput and the 2nd one may solve the long-standing issue of "too many RDMA frags" failures we had (forgot which ticket, Liang?).

      We should do some investigation and decide whether to add support of these OFED features to the o2iblnd.

      Attachments

        Issue Links

          Activity

            [LU-5783] o2iblnd: investigate new memory registration mechanisms
            pjones Peter Jones added a comment -

            As far as aI can see this has been landed for 2.9

            pjones Peter Jones added a comment - As far as aI can see this has been landed for 2.9
            spitzcor Cory Spitz added a comment -

            Alexey, we discussed your issue at the OpenSFS LWG and we decided to stick with the status quo.
            James, it was very gracious and helpful of you to offer to push a fix to gerri on behalf of Alexey. Alexey, I agree with James, you need to find a way to work with us. Please take him up on his offer, but I suggest you find a way to work with the community more generally. If you'd like the LWG to revisit this topic, please send mail to the LWG list.

            spitzcor Cory Spitz added a comment - Alexey, we discussed your issue at the OpenSFS LWG and we decided to stick with the status quo. James, it was very gracious and helpful of you to offer to push a fix to gerri on behalf of Alexey. Alexey, I agree with James, you need to find a way to work with us. Please take him up on his offer, but I suggest you find a way to work with the community more generally. If you'd like the LWG to revisit this topic, please send mail to the LWG list.

            Intel don't fix a gerrit to ability to review. When you have a fix own gerrit to work with google ?
            I was asked for two years, new gerrit versions work fine with google, but Intel stick on older and buggy version.

            shadow Alexey Lyashkov added a comment - Intel don't fix a gerrit to ability to review. When you have a fix own gerrit to work with google ? I was asked for two years, new gerrit versions work fine with google, but Intel stick on older and buggy version.

            You don't review the patch in gerrit nor do you push a patch somewhere. I understand you don't want to use the gerrit system without gmail support but if you want to improve this work I suggest you create a github branch with your fix. I would gladly push the fix to gerrit for you. You need to figure a way to work with us.

            simmonsja James A Simmons added a comment - You don't review the patch in gerrit nor do you push a patch somewhere. I understand you don't want to use the gerrit system without gmail support but if you want to improve this work I suggest you create a github branch with your fix. I would gladly push the fix to gerrit for you. You need to figure a way to work with us.

            I like to see, Intel don't work in cooperations with other developers, as it land a patch without resolving comments in ticket.

            shadow Alexey Lyashkov added a comment - I like to see, Intel don't work in cooperations with other developers, as it land a patch without resolving comments in ticket.

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17606/
            Subject: LU-5783 o2iblnd: Add Fast Reg memory registration support
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: f2600449ced3a2f78c4b3e650ff29bb88d1e7e45

            gerrit Gerrit Updater added a comment - Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17606/ Subject: LU-5783 o2iblnd: Add Fast Reg memory registration support Project: fs/lustre-release Branch: master Current Patch Set: Commit: f2600449ced3a2f78c4b3e650ff29bb88d1e7e45

            I know we have mthca cards in use which use FMR and I don't believe support fast reg so I'm fine with lowering the priority but not removing the FMR code altogether. I'm having some issues with the patch but there is already some good feedback here towards a second revision so I'll wait until then before I dig deeper.

            jfilizetti Jeremy Filizetti added a comment - I know we have mthca cards in use which use FMR and I don't believe support fast reg so I'm fine with lowering the priority but not removing the FMR code altogether. I'm having some issues with the patch but there is already some good feedback here towards a second revision so I'll wait until then before I dig deeper.

            well, i know were you take an original code but one note.

            +       /* Prepare FASTREG WR */
            +       memset(&fastreg_wr, 0, sizeof(fastreg_wr));
            +       fastreg_wr.opcode = IB_WR_FAST_REG_MR;
            +       fastreg_wr.send_flags = IB_SEND_SIGNALED;
            +       fastreg_wr.wr.fast_reg.iova_start = desc->data_frpl->page_list[0] + offset;
            +       fastreg_wr.wr.fast_reg.page_list = desc->data_frpl;
            +       fastreg_wr.wr.fast_reg.page_list_len = page_list_len;
            +       fastreg_wr.wr.fast_reg.page_shift = SHIFT_4K;
            +       fastreg_wr.wr.fast_reg.length = data_size;
            +       fastreg_wr.wr.fast_reg.rkey = desc->data_mr->rkey;
            +       fastreg_wr.wr.fast_reg.access_flags = (IB_ACCESS_LOCAL_WRITE  |
            +                                              IB_ACCESS_REMOTE_WRITE |
            +                                              IB_ACCESS_REMOTE_READ);
            +
            

            so iser don't have two page lists as in patch. It uses directly an preallocated DMA region pages. Probably our's code will have two more memory usage if we don't use same pages to transfer and mapping.

            > I agree with you that perhaps we should just remove FMR support at this since it will be going away.
            not at all. Yes, Fast reg introduced in 2008, but not at all mlx4 cards may support it (if you have a good melanox contact please ask him).

                    if ((dev->dev->caps.bmme_flags & MLX4_BMME_FLAG_LOCAL_INV) &&
                        (dev->dev->caps.bmme_flags & MLX4_BMME_FLAG_REMOTE_INV) &&
                        (dev->dev->caps.bmme_flags & MLX4_BMME_FLAG_FAST_REG_WR))
                            props->device_cap_flags |= IB_DEVICE_MEM_MGT_EXTENSIONS;
            

            so probably better to just lower priority for FMR.

            one more note - don't use a flag, but have two pointers to function to map and to unmap region, as code completely different.

            shadow Alexey Lyashkov added a comment - well, i know were you take an original code but one note. + /* Prepare FASTREG WR */ + memset(&fastreg_wr, 0, sizeof(fastreg_wr)); + fastreg_wr.opcode = IB_WR_FAST_REG_MR; + fastreg_wr.send_flags = IB_SEND_SIGNALED; + fastreg_wr.wr.fast_reg.iova_start = desc->data_frpl->page_list[0] + offset; + fastreg_wr.wr.fast_reg.page_list = desc->data_frpl; + fastreg_wr.wr.fast_reg.page_list_len = page_list_len; + fastreg_wr.wr.fast_reg.page_shift = SHIFT_4K; + fastreg_wr.wr.fast_reg.length = data_size; + fastreg_wr.wr.fast_reg.rkey = desc->data_mr->rkey; + fastreg_wr.wr.fast_reg.access_flags = (IB_ACCESS_LOCAL_WRITE | + IB_ACCESS_REMOTE_WRITE | + IB_ACCESS_REMOTE_READ); + so iser don't have two page lists as in patch. It uses directly an preallocated DMA region pages. Probably our's code will have two more memory usage if we don't use same pages to transfer and mapping. > I agree with you that perhaps we should just remove FMR support at this since it will be going away. not at all. Yes, Fast reg introduced in 2008, but not at all mlx4 cards may support it (if you have a good melanox contact please ask him). if ((dev->dev->caps.bmme_flags & MLX4_BMME_FLAG_LOCAL_INV) && (dev->dev->caps.bmme_flags & MLX4_BMME_FLAG_REMOTE_INV) && (dev->dev->caps.bmme_flags & MLX4_BMME_FLAG_FAST_REG_WR)) props->device_cap_flags |= IB_DEVICE_MEM_MGT_EXTENSIONS; so probably better to just lower priority for FMR. one more note - don't use a flag, but have two pointers to function to map and to unmap region, as code completely different.

            Oh I agree. Because something works doesn't mean it is done right I agree with you that perhaps we should just remove FMR support at this since it will be going away.

            simmonsja James A Simmons added a comment - Oh I agree. Because something works doesn't mean it is done right I agree with you that perhaps we should just remove FMR support at this since it will be going away.

            well, with enabling map_on_depand it may work, but anyway.
            1) IB_WR_RDMA_WRITE_IMM, IB_WR_RDMA_READ_INV should be reused to avoid extra WR.
            2) total number WR per QP need adjusted to reflect extra WR used for mapping purpose.
            3) don't prefer a FMR if exist. Fast Reg supported for most cards now, and it's better choose.
            4) don't use a "invalidate->register" WR links it's directly affects (1).

            +                       if (!frd->frd_valid) {
            +                               wrq = &frd->frd_inv_wr;
            +                               wrq->next = &frd->frd_fastreg_wr;
            +                       } else {
            +                               wrq = &frd->frd_fastreg_wr;
            +                       }
            

            It may a simplify a code by avoid calling a invalidate on error case, but it block IB card optimization. per IB spec WR workflow it should be in order
            1) register
            2) write/read
            3) invalidate

            shadow Alexey Lyashkov added a comment - well, with enabling map_on_depand it may work, but anyway. 1) IB_WR_RDMA_WRITE_IMM, IB_WR_RDMA_READ_INV should be reused to avoid extra WR. 2) total number WR per QP need adjusted to reflect extra WR used for mapping purpose. 3) don't prefer a FMR if exist. Fast Reg supported for most cards now, and it's better choose. 4) don't use a "invalidate->register" WR links it's directly affects (1). + if (!frd->frd_valid) { + wrq = &frd->frd_inv_wr; + wrq->next = &frd->frd_fastreg_wr; + } else { + wrq = &frd->frd_fastreg_wr; + } It may a simplify a code by avoid calling a invalidate on error case, but it block IB card optimization. per IB spec WR workflow it should be in order 1) register 2) write/read 3) invalidate

            People

              dmiter Dmitry Eremin (Inactive)
              isaac Isaac Huang (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              18 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: