[LU-10089] kiblnd_fmr_pool_map() Failed to map mr 10/11 elements


    Description

      The following group of messages appears in the console logs of the MDTs.

      2017-10-04 08:11:00 [407096.858161] LNetError: 174158:0:(o2iblnd.c:1893:kiblnd_fmr_pool_map()) Failed to map mr 10/11 elements
      2017-10-04 08:11:00 [407096.869697] LNetError: 174158:0:(o2iblnd_cb.c:590:kiblnd_fmr_map_tx()) Can't map 41033 pages: -22
      2017-10-04 08:11:00 [407096.880686] LNetError: 174158:0:(o2iblnd_cb.c:1582:kiblnd_send()) Can't setup GET sink for 172.19.1.112@o2ib100: -22
      2017-10-04 08:11:00 [407096.893504] LustreError: 174158:0:(events.c:449:server_bulk_callback()) event type 5, status -5, desc ffff883ebaa9bb00
      2017-10-04 08:12:40 [407196.901157] LustreError: 174158:0:(ldlm_lib.c:3186:target_bulk_io()) @@@ timeout on bulk WRITE after 100+0s  req@ffff883f27232850 x1579913603429696/t0(0) o1000->lquake-MDT0001-mdtlov_UUID@172.19.1.112@o2ib100:-1/-1 lens 352/0 e 0 to 0 dl 1507130003 ref 1 fl Interpret:/0/ffffffff rc 0/-1
      

      The nodes have Mellanox ConnectX-4 IB adapters.
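The negative numbers in the LNetError and LustreError lines above (`-22`, `status -5`) are negated Linux errno values. A minimal sketch for decoding them while triaging logs like these, assuming standard Linux errno numbering; the helper name `decode_rc` and the regular expression are hypothetical and are not part of Lustre:

```python
import errno
import os
import re

# Matches a negated errno written as ": -22" or "status -5".
RC_PATTERN = re.compile(r"(?:status |: )-(\d+)\b")


def decode_rc(line):
    """Return (symbol, message) for the first negated errno in a log line, or None."""
    match = RC_PATTERN.search(line)
    if match is None:
        return None
    code = int(match.group(1))
    # errno.errorcode maps the number to its symbolic name, e.g. 22 -> "EINVAL".
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


if __name__ == "__main__":
    sample = "LNetError: 174158:0:(o2iblnd_cb.c:590:kiblnd_fmr_map_tx()) Can't map 41033 pages: -22"
    print(decode_rc(sample))
```

On the messages above this reports EINVAL for the two mapping failures and EIO for the bulk callback status.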

          Activity


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/29551/
            Subject: LU-10089 o2iblnd: use IB_MR_TYPE_SG_GAPS
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 1b609396e468949f2420f14fed5ebfc999366b62

            ashehata Amir Shehata (Inactive) added a comment -

            I updated https://review.whamcloud.com/#/c/29551/ to address the comments.

            One thing to note: if you're using OPA, you should set map-on-demand to 256. I'm still analyzing this issue and hope to have a patch soon. It is tracked under LU-10129.

            simmonsja James A Simmons added a comment -

            There is a question about querying the IB device to see whether it supports IB_MR_TYPE_SG_GAPS instead of assuming that it always does.
            ofaaland Olaf Faaland added a comment -

            Hi Amir,

            I see https://review.whamcloud.com/#/c/29551/ has status "Ready to land", but hasn't been landed.  Is there further work needed, or is it just waiting for the next time a set of patches gets merged?  Thanks.
            ofaaland Olaf Faaland added a comment -

            This appears to be working well in my tests.

            ashehata Amir Shehata (Inactive) added a comment - - edited

            Some notes:
            1. Fastreg with MLX4 + OFED + ib_alloc_mr(.., IB_MR_TYPE_MEM_REG): gets BIND_ERR
            2. Fastreg with MLX4 + MOFED 4.1 + ib_alloc_mr(.., IB_MR_TYPE_MEM_REG): works
            3. IB_MR_TYPE_SG_GAPS is not supported for MLX4 on either OFED or MOFED
            4. MLX4 FMR mapping works differently on MOFED vs OFED. https://review.whamcloud.com/29290 is needed for OFED but not needed for MOFED (although it doesn't seem to hurt)

            So the best solution is to:
            1. Always use FMR for MLX4
            2. Always use FMR for OPA
            3. FMR is not available for MLX5 so fastreg will be used

            The three patches I described earlier seem like the ideal solution for now.
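The recommendation in the notes above amounts to a small decision table keyed by HCA family. A minimal sketch of that table as a lookup, purely illustrative: the names `REGISTRATION_MODE` and `pick_registration` are hypothetical, and this is not how o2iblnd implements the choice, which is made in C inside the kernel module.

```python
# Hypothetical model of the recommendation in the comment above.
# Keys are the HCA families discussed in this ticket.
REGISTRATION_MODE = {
    "mlx4": "fmr",      # fastreg gets BIND_ERR on OFED; SG_GAPS is unsupported
    "opa": "fmr",       # always use FMR for OPA
    "mlx5": "fastreg",  # FMR is not available for MLX5
}


def pick_registration(hca_family):
    """Return 'fmr' or 'fastreg' for a known HCA family (case-insensitive)."""
    key = hca_family.lower()
    if key not in REGISTRATION_MODE:
        raise ValueError(f"no recommendation recorded for {hca_family!r}")
    return REGISTRATION_MODE[key]


if __name__ == "__main__":
    for family in ("MLX4", "OPA", "MLX5"):
        print(family, "->", pick_registration(family))
```

Unknown families raise an error rather than falling back to a default, since the ticket records no recommendation for other hardware.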
            ofaaland Olaf Faaland added a comment -

            Note that we cannot land these patches to our production tree until they are through your review and testing process, and are merged to master at a minimum.  Let me know if there's anything I can do to help that along.

            ofaaland Olaf Faaland added a comment -

            Backport them to 2.8 fe, please. They will likely apply cleanly.

            The relationship between our stack and 2.8fe is complicated but they are not that different, and we will be switching to 2.8fe + a small stack of commits very soon.


            ashehata Amir Shehata (Inactive) added a comment -

            I believe that I'll push for landing these three patches as they stabilize master as well.

            I'm not sure how you guys will pick up the patches. Do you maintain your own tree, or would you need these patches backported?
            ofaaland Olaf Faaland added a comment -

            With brief testing, I see no errors on MLX-4 and OPA machines. What is the next step?

            ofaaland Olaf Faaland added a comment -

            I ran on an MLX-5 machine with good success. I'll test on MLX-4 and OPA this morning.


            People

              ashehata Amir Shehata (Inactive)
              ofaaland Olaf Faaland