Provide kernel API for adding peer/peer NI

Details

    • Improvement
    • Resolution: Fixed
    • Minor
    • Lustre 2.15.0

    Description

      Provide kernel API for adding peer and peer NI

      Implement LNetAddPeer() and LNetAddPeerNI() APIs to allow other
      kernel modules to add peer and peer NIs to LNet.

      Peers created via these APIs are not marked as having been configured
      by DLC. As such, they can be overwritten by discovery.
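The overwrite behaviour described above can be pictured with a small user-space model. This is only a sketch in Python, not LNet code: the `Peer` class, the `configured_by_dlc` flag name, and the helper functions are all hypothetical, and the real LNetAddPeer()/LNetAddPeerNI() signatures are not given in this ticket. It also assumes, as the description implies, that a DLC-configured peer is not replaced by discovery.

```python
from dataclasses import dataclass, field


@dataclass
class Peer:
    """Toy stand-in for an LNet peer entry (hypothetical, not the kernel struct)."""
    primary_nid: str
    nids: list = field(default_factory=list)
    configured_by_dlc: bool = False  # assumed flag name, for illustration only


def add_peer_via_kernel_api(table, nids):
    """Model of a kernel-API add: the peer is NOT marked as DLC-configured."""
    peer = Peer(primary_nid=nids[0], nids=list(nids), configured_by_dlc=False)
    table[peer.primary_nid] = peer
    return peer


def add_peer_via_dlc(table, nids):
    """Model of a DLC add: the peer IS marked as DLC-configured."""
    peer = Peer(primary_nid=nids[0], nids=list(nids), configured_by_dlc=True)
    table[peer.primary_nid] = peer
    return peer


def apply_discovery(table, primary_nid, discovered_nids):
    """Replace the NID list only for peers that were not configured by DLC."""
    peer = table[primary_nid]
    if peer.configured_by_dlc:
        return False  # assumption of this model: DLC configuration is kept
    peer.nids = list(discovered_nids)
    return True
```

In this model, a peer added with `add_peer_via_kernel_api()` has its NID list replaced by whatever `apply_discovery()` reports, while a peer added with `add_peer_via_dlc()` keeps its configured list.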

        Activity

          [LU-14661] Provide kernel API for adding peer/peer NI

          bzzz Alex Zhuravlev added a comment -

          hornc, can we discover new NIDs asynchronously so that the thread processing llog doesn't block?

          gtapase Gaurang Tapase added a comment -

          "IMO, increasing the timeout isn't the optimal way - mount takes very long (so failover), users have a bad experience, etc."

          Agree.

          bzzz Alex Zhuravlev added a comment -

          IMO, increasing the timeout isn't the optimal way - mount takes very long (so failover), users have a bad experience, etc.

          gtapase Gaurang Tapase added a comment -

          We can definitely improve HA agents to wait a bit more (right now they wait for 450s before giving up). I tried doubling the timeout (900s) and still saw the problem. We need a definite timeout within which the mount would succeed.

          bzzz Alex Zhuravlev added a comment -

          The problem is that mount takes too long in such a scenario: HA gives up and initiates a failover, which can hit the same problem.
          Would it be possible to make such a "try to connect" async, so that the mount process (adding NIDs from the config) is not blocked trying to connect one by one?
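The difference between probing NIDs one by one and queueing the probes asynchronously can be sketched in user space as follows. This is a hedged illustration only: `try_connect()` is a hypothetical placeholder that always fails after a short delay, and a Python thread pool stands in for whatever mechanism LNet would actually use; nothing here reflects the real mount or discovery code paths.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def try_connect(nid, delay=0.05):
    """Hypothetical probe: pretends the NID is unreachable after `delay` seconds."""
    time.sleep(delay)
    return nid, False


def add_nids_blocking(nids):
    """One-by-one probing: the caller waits for every attempt in turn."""
    return [try_connect(nid) for nid in nids]


def add_nids_async(nids, executor):
    """Queue every probe and return at once; results are collected later."""
    return [executor.submit(try_connect, nid) for nid in nids]


def collect(futures):
    """Wait for the queued probes and return their (nid, reachable) results."""
    return [future.result() for future in futures]
```

With the blocking variant the caller's wait grows with the number of unreachable NIDs, whereas with `add_nids_async()` the caller can continue (for example, keep processing the config) and call `collect()` only when the results are needed.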

          ssmirnov Serguei Smirnov added a comment -

          I was experimenting with the mount command which lists nids as X1,X2:Y1,Y2

          In my experiment all nids were fake, so all attempts to connect were expected to fail. I was just checking which nids LNet would try to connect to.

          Basically, it looks like before the LU-14661 patch, LNet would only try to connect to X1 and Y1 when discovering X and Y respectively. With the LU-14661 patch, LNet tries to connect to all listed nids. I think this behavior is expected though and is good in the sense that it gives the mount the chance to succeed if X1 is down but X2 is up.
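The before/after behaviour reported above can be written down as a small sketch. It assumes the `X1,X2:Y1,Y2` form used in the comment (colon between nodes, comma between a node's nids) and uses hypothetical helper names; it only models which nids would be tried and is not how LNet or mount parse real NID strings.

```python
def parse_mount_nids(spec):
    """Split 'X1,X2:Y1,Y2' into one nid list per node (placeholder names only)."""
    return [node.split(",") for node in spec.split(":") if node]


def nids_to_try(spec, all_listed):
    """Return the nids a connection attempt would target.

    all_listed=False models the reported pre-patch behaviour (first nid of
    each node); all_listed=True models the reported post-patch behaviour.
    """
    groups = parse_mount_nids(spec)
    if all_listed:
        return [nid for group in groups for nid in group]
    return [group[0] for group in groups]


print(nids_to_try("X1,X2:Y1,Y2", all_listed=False))  # ['X1', 'Y1']
print(nids_to_try("X1,X2:Y1,Y2", all_listed=True))   # ['X1', 'X2', 'Y1', 'Y2']
```

When every listed nid is unreachable, the second case doubles the number of attempts in this example, which is consistent with the longer mount times discussed in the thread.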

          bzzz Alex Zhuravlev added a comment -

          "If so, can you see if https://review.whamcloud.com/#/c/fs/lustre-release/+/47322/ helps?"

          According to the testing, no, it doesn't help.

          bzzz Alex Zhuravlev added a comment -

          "Is LNet peer discovery disabled in the good case but enabled in the bad case?"

          Not quite ready to answer, will have to check with a colleague.
          This is what we found yesterday: same branch, same setup - gets stuck with LU-14661 and does well with LU-14661 reverted.

          hornc Chris Horn added a comment -

          Is LNet peer discovery disabled in the good case but enabled in the bad case?

          bzzz Alex Zhuravlev added a comment -

          "Alex Zhuravlev, can you please attach the full log for the bad case as well?"

          Just attached.

          hornc Chris Horn added a comment -

          And can you provide the exact Lustre hash(es) being tested?

          People

            hornc Chris Horn
            hornc Chris Horn
            Votes: 0
            Watchers: 6