Details

    • Task
    • Resolution: Fixed
    • Minor
    • Lustre 2.8.0

    Description

      There have been a few gnilnd changes since the last time we sync'd up. I'll be pushing up the latest commits.

        Activity

          [LU-7578] Push latest gnilnd changes

          jgmitter Joseph Gmitter (Inactive) added a comment - Patches have landed for 2.8

          simmonsja James A Simmons added a comment - All outstanding patches have landed.

          gerrit Gerrit Updater added a comment -

          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17666/
          Subject: LU-7578 gnilnd: Return correct error on GNI_RC_ERROR_NOMEM
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 919b8968d84d0d6ad57e2e6e5e1a8ccb02a1bd2c

          gerrit Gerrit Updater added a comment -

          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17665/
          Subject: LU-7578 gnilnd: Handle new return code in gni_mem_register()
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 37e5f21ee4db9cb3df063d5537511ec15c1196b3

          gerrit Gerrit Updater added a comment -

          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17664/
          Subject: LU-7578 gnilnd: Add module parameter reg_fail_timeout
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 5b787cb7a375372c7a4f3c405d38137a7a867677

          gerrit Gerrit Updater added a comment -

          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17663/
          Subject: LU-7578 gnilnd: Modify allocator flags to prevent waiting
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 4e7994f45811e66f50a5d174b1b5dfc20c65269b

          gerrit Gerrit Updater added a comment -

          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17667/
          Subject: LU-7578 gnilnd: Revert max_immediate setting
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 928c5050f7d2a8a2cabb6eeb3993b29166fdaf1e

          simmonsja James A Simmons added a comment -

          Just did another round of testing and I didn't see problems this time. Strange. Some unrelated change must have landed that fixed the problem the latest Gemini changes must have been exposing.
          chuckf Chuck Fossen added a comment -

          James, are you saying that gnilnd is now using more memory or that allocations are failing when the node is under high memory pressure?
          Also, I assume you are seeing this issue on compute nodes. Is that true?
          I don't see that these changes would cause gnilnd to use more memory.
          http://review.whamcloud.com/17663 changed the vmalloc allocation flags so an allocation will fail instead of waiting forever to allocate memory.
          We have seen heartbeat failures when a node needs to allocate memory to establish a connection in the case where Lustre is trying to write to disk in order to free memory.
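
The difference between a waiting and a non-waiting allocation described above can be illustrated with a small user-space sketch. This is not the gnilnd code and does not use the kernel allocator or its real flags; `mem_pool`, `alloc_mode`, `pool_reserve`, and `max_waits` are hypothetical names invented for this illustration. The stand-in reclaim step never frees anything, which models the case where freeing memory depends on the very connection being set up.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical allocation modes; these are not the real kernel GFP flags. */
enum alloc_mode {
    ALLOC_MAY_WAIT, /* keep retrying until reclaim frees enough memory */
    ALLOC_NO_WAIT   /* give up at once if memory is short */
};

/* Toy model of a node's memory: bytes free now, plus a wait counter. */
struct mem_pool {
    size_t free_bytes;
    unsigned int waits; /* number of times a caller waited on reclaim */
};

/*
 * Stand-in for memory reclaim. It frees nothing, modelling the case
 * where reclaim cannot make progress until the connection exists.
 */
static void wait_for_reclaim(struct mem_pool *pool)
{
    pool->waits++;
}

/*
 * Reserve `size` bytes. Returns 0 on success or -ENOMEM on failure.
 * `max_waits` bounds the loop so the sketch terminates; a truly
 * blocking allocator would have no such bound.
 */
int pool_reserve(struct mem_pool *pool, size_t size,
                 enum alloc_mode mode, unsigned int max_waits)
{
    unsigned int waited = 0;

    assert(pool != NULL);

    while (pool->free_bytes < size) {
        if (mode == ALLOC_NO_WAIT)
            return -ENOMEM; /* fail fast so the caller can retry later */
        if (waited++ >= max_waits)
            return -ENOMEM; /* sketch-only escape hatch */
        wait_for_reclaim(pool);
    }

    pool->free_bytes -= size;
    return 0;
}
```

With `ALLOC_NO_WAIT`, a request that cannot be satisfied returns `-ENOMEM` without ever calling the reclaim stand-in, so the caller stays responsive and can retry. With `ALLOC_MAY_WAIT`, the same request spins on reclaim until the artificial `max_waits` bound is hit.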

          hornc Chris Horn added a comment -

          James, I've passed along your comments to our gnilnd engineers and asked them to weigh in on this ticket.


          People

            hornc Chris Horn
            hornc Chris Horn
            Votes: 0
            Watchers: 6
