Details

    • Task
    • Resolution: Fixed
    • Minor
    • Lustre 2.8.0

    Description

      There have been a few gnilnd changes since the last time we sync'd up. I'll be pushing up the latest commits.

        Activity

          [LU-7578] Push latest gnilnd changes

          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17665/
          Subject: LU-7578 gnilnd: Handle new return code in gni_mem_register()
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 37e5f21ee4db9cb3df063d5537511ec15c1196b3

          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17664/
          Subject: LU-7578 gnilnd: Add module parameter reg_fail_timeout
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 5b787cb7a375372c7a4f3c405d38137a7a867677

          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17663/
          Subject: LU-7578 gnilnd: Modify allocator flags to prevent waiting
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 4e7994f45811e66f50a5d174b1b5dfc20c65269b

          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17667/
          Subject: LU-7578 gnilnd: Revert max_immediate setting
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 928c5050f7d2a8a2cabb6eeb3993b29166fdaf1e

          simmonsja James A Simmons added a comment -

          Just did another round of testing, and I didn't see problems this time. Strange. Some unrelated change must have landed that fixed the problem the latest Gemini changes must have been exposing.

          chuckf Chuck Fossen added a comment -

          James, are you saying that gnilnd is now using more memory or that allocations are failing when the node is under high memory pressure?
          Also, I assume you are seeing this issue on compute nodes. Is that true?
          I don't see that these changes would cause gnilnd to use more memory.
          http://review.whamcloud.com/17663 changed the vmalloc allocation flags so an allocation will fail instead of waiting forever to allocate memory.
          We have seen heartbeat failures when a node needs to allocate memory to establish a connection while Lustre is trying to write to disk in order to free memory.

          hornc Chris Horn added a comment -

          James, I've passed along your comments to our gnilnd engineers and asked them to weigh in on this ticket.

          simmonsja James A Simmons added a comment - - edited

          Chris, one of these patches is causing a regression in my testing. I'm seeing an increase in memory pressure that is causing jobs to fail.

          Chris Horn (hornc@cray.com) uploaded a new patch: http://review.whamcloud.com/17667
          Subject: LU-7578 gnilnd: Revert max_immediate setting
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: a382968a06574400bd48e2e0beb848ad1ba81304

          Chris Horn (hornc@cray.com) uploaded a new patch: http://review.whamcloud.com/17666
          Subject: LU-7578 gnilnd: Return correct error on GNI_RC_ERROR_NOMEM
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: b2d35706d02c0cf16ef404f865211e7fde14cfb1

          Chris Horn (hornc@cray.com) uploaded a new patch: http://review.whamcloud.com/17665
          Subject: LU-7578 gnilnd: Handle new return code in gni_mem_register()
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 237a4857ae42ef2fb569664a1e5f24398ae53687

          People

            hornc Chris Horn