FastReg (MLX5) support breaks when map_on_demand > 0

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.10.0, Upstream
    • Lustre 2.10.0, Upstream
    • 3

    Description

      When building against MOFED 4, the default for map_on_demand changes from 0 to 256. This breaks MLX5-based cards, which use the FastReg support in ko2iblnd. There are three problems with FastReg that need to be fixed:

      1. In kiblnd_fmr_pool_map(), when taking elements from the fpo_pool_list, if the list runs out the current code sets rc to -EBUSY when it should be -EAGAIN. EAGAIN triggers the pool to be made bigger; EBUSY just fails the transfer and the connection, which is not what we want.
      2. Even after fixing the setting of rc in item 1, bringing down the network via "lctl network down" trips this assert:
        [ 1172.255552] LNetError: 10176:0:(o2iblnd.c:1421:kiblnd_destroy_fmr_pool()) ASSERTION( fpo->fpo_map_count == 0 ) failed: 
      3. Every time the pool size is increased, this log message is printed again (with neterror on):
        May  9 00:22:26 trevis-407 kernel: LNet: Using FastReg for registration

      The first 2 items are blockers and must be fixed ASAP.  The 3rd might as well be addressed at the same time.
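
      The difference between the two return codes in item 1 can be illustrated with a small user-space model. This is only a sketch under stated assumptions: struct toy_pool and the toy_* functions are hypothetical names invented for illustration and are not the ko2iblnd code. It shows a caller that grows the pool and retries on -EAGAIN and gives up on any other error, which is the behaviour the description relies on.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical toy pool: it only tracks how many elements exist
 * and how many are currently handed out. Not the ko2iblnd layout. */
struct toy_pool {
	int size;     /* elements currently allocated to the pool */
	int in_use;   /* elements handed out and not yet returned */
	int max_size; /* hard cap on pool growth */
};

/* Take one element. An exhausted pool reports -EAGAIN ("grow and
 * retry") rather than -EBUSY, which the caller below would treat
 * as a hard failure. */
static int toy_pool_map(struct toy_pool *p)
{
	assert(p->in_use <= p->size);
	if (p->in_use == p->size)
		return -EAGAIN;
	p->in_use++;
	return 0;
}

/* Add capacity, refusing to exceed the cap. */
static int toy_pool_grow(struct toy_pool *p, int extra)
{
	if (p->size + extra > p->max_size)
		return -ENOMEM;
	p->size += extra;
	return 0;
}

/* Caller-side policy: grow and retry on -EAGAIN, stop on anything else. */
static int toy_map_with_retry(struct toy_pool *p)
{
	int rc = toy_pool_map(p);

	while (rc == -EAGAIN) {
		rc = toy_pool_grow(p, 1);
		if (rc != 0)
			return rc;
		rc = toy_pool_map(p);
	}
	return rc;
}
```

      In this model, returning -EBUSY from toy_pool_map() would skip the loop entirely and the caller would see an immediate failure, which mirrors the failed transfer and connection described above.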

          Activity

            [LU-9472] FastReg (MLX5) support breaks when map_on_demand > 0

            simmonsja James A Simmons added a comment - Not yet.

            dougo Doug Oucharek (Inactive) added a comment - Has this been pushed upstream yet?
            pjones Peter Jones added a comment - Landed for 2.10

            gerrit Gerrit Updater added a comment -
            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27015/
            Subject: LU-9472 lnd: Fix FastReg map/unmap for MLX5
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: b436c75d9488222190de8b30f56d720f8ec63d6f

            shadow Alexey Lyashkov added a comment - Tested the fix and it works for me.

            gerrit Gerrit Updater added a comment -
            Doug Oucharek (doug.s.oucharek@intel.com) uploaded a new patch: https://review.whamcloud.com/27015
            Subject: LU-9472 lnd: Fix FastReg map/unmap for MLX5
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: a6a1d45a72360b5cc7e9e3a65c7456fa62c19192

            simmonsja James A Simmons added a comment - Thanks for finding this.

            doug Doug Oucharek (Inactive) added a comment - The main problem, it turns out, is that the unmap routine is never being called for FastReg.  As such, we have to keep growing the pool and assert when trying to shut down networking (because pool items are leaking).
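
            The leak described in this comment can be modelled in a few lines. This is a hedged sketch, not the actual o2iblnd code: struct toy_fr_pool and the toy_* helpers are hypothetical, and map_count merely stands in for the fpo_map_count counter named in the assertion. It shows why a transfer that maps without ever unmapping leaves the counter nonzero, so a destroy-time check for map_count == 0 cannot pass.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a FastReg pool: only the outstanding
 * mapping counter is modelled (analogous to fpo_map_count). */
struct toy_fr_pool {
	int map_count;
};

/* Hand out one pool element. */
static void toy_fr_map(struct toy_fr_pool *p)
{
	p->map_count++;
}

/* Return one pool element. */
static void toy_fr_unmap(struct toy_fr_pool *p)
{
	assert(p->map_count > 0);
	p->map_count--;
}

/* True when the pool could be torn down without tripping a
 * "map_count == 0" assertion. */
static bool toy_fr_can_destroy(const struct toy_fr_pool *p)
{
	return p->map_count == 0;
}

/* One transfer. Passing do_unmap = false models the bug, where the
 * unmap routine is never called on the FastReg path. */
static void toy_transfer(struct toy_fr_pool *p, bool do_unmap)
{
	toy_fr_map(p);
	/* ... the data transfer itself would happen here ... */
	if (do_unmap)
		toy_fr_unmap(p);
}
```

            With do_unmap set to false, every transfer permanently consumes one element, which also explains why the pool has to keep growing under load.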

            People

              doug Doug Oucharek (Inactive)
              doug Doug Oucharek (Inactive)