LNet: b2_12 discovery of non-MR peers may yield unreachable peer NIs

Details

    • Bug
    • Resolution: Unresolved
    • Minor
    • None
    • Lustre 2.12.4
    • None
    • 3

    Description

      If a non-MR peer (2.10.8) is discovered by a 2.12 MR peer, the following problem may occur: if the non-MR peer has LNets that are not defined on the MR peer, a NID on one of those undefined LNets may be listed as primary. This later causes communication problems when mounting.

      Here's an example of the buggy discovery:

       

      lnetctl discover 192.168.1.123@o2ib4
      discover:
          - primary nid: 192.168.1.123@o2ib
            Multi-Rail: False
            peer ni:
              - nid: 192.168.1.123@o2ib4
              - nid: 192.168.1.123@o2ib

      lnetctl peer show
      peer:
          - primary nid: 192.168.1.123@o2ib
            Multi-Rail: False
            peer ni:
              - nid: 192.168.1.123@o2ib4
                state: NA
              - nid: 192.168.1.123@o2ib
                state: NA

       

      In the example above, the peer running the discovery has its only NID on o2ib4, so designating a primary NID on o2ib for the discovered peer is a problem: the discovering peer has no local interface on o2ib.
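
The condition described above can be checked mechanically. The following is a minimal sketch, not part of LNet or lnetctl: the helper names `nid_net` and `primary_on_local_net` are hypothetical, the NIDs are copied from this ticket, and the check deliberately ignores LNet routes (a primary NID on a non-local net could still be reachable through a router).

```python
def nid_net(nid: str) -> str:
    """Return the LNet part of a NID, e.g. '192.168.1.123@o2ib4' -> 'o2ib4'."""
    return nid.rsplit("@", 1)[1]


def primary_on_local_net(primary_nid: str, local_nids: list[str]) -> bool:
    """True if the peer's primary NID is on an LNet that is configured locally.

    Hypothetical helper for illustration; routed nets are not considered.
    """
    local_nets = {nid_net(n) for n in local_nids}
    return nid_net(primary_nid) in local_nets


# Local NIDs of the MR peer running discovery, as reported in this ticket.
LOCAL_NIDS = ["0@lo", "192.168.1.105@o2ib4"]

buggy = primary_on_local_net("192.168.1.123@o2ib", LOCAL_NIDS)
expected = primary_on_local_net("192.168.1.123@o2ib4", LOCAL_NIDS)
print(buggy, expected)  # False True
```

The first call uses the primary NID reported by the buggy discovery; the second uses the NID that shares a net with the discovering peer.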

       

      Here's the lnet config on the MR peer (the peer running discovery):

      lnetctl net show
      net:
          - net type: lo
            local NI(s):
              - nid: 0@lo
                status: up
          - net type: o2ib4
            local NI(s):
              - nid: 192.168.1.105@o2ib4
                status: up
                interfaces:
                    0: ib0

      Here's the lnet config on the non-MR peer (the peer being discovered):

      lnetctl net show
      net:
          - net type: lo
            local NI(s):
              - nid: 0@lo
                status: up
          - net type: o2ib
            local NI(s):
              - nid: 192.168.1.123@o2ib
                status: up
                interfaces:
                    0: ib0
          - net type: o2ib4
            local NI(s):
              - nid: 192.168.1.123@o2ib4
                status: up
                interfaces:
                    0: ib0
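
To make the mismatch between the two configurations explicit, the sketch below extracts the NIDs from the two `lnetctl net show` outputs in this ticket and compares the resulting sets of nets. It is only an illustration: the function name `local_nets` is hypothetical, the regular expression is written against the output layout shown here and may not cover other lnetctl versions, and the loopback net `lo` is excluded because it is never shared between nodes.

```python
import re

# `lnetctl net show` output of the MR peer (the peer running discovery).
MR_PEER = """\
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: o2ib4
      local NI(s):
        - nid: 192.168.1.105@o2ib4
          status: up
          interfaces:
              0: ib0
"""

# `lnetctl net show` output of the non-MR peer (the peer being discovered).
NON_MR_PEER = """\
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: o2ib
      local NI(s):
        - nid: 192.168.1.123@o2ib
          status: up
          interfaces:
              0: ib0
    - net type: o2ib4
      local NI(s):
        - nid: 192.168.1.123@o2ib4
          status: up
          interfaces:
              0: ib0
"""


def local_nets(net_show_text: str) -> set[str]:
    """Return the LNets named by the '- nid:' lines, excluding loopback."""
    nids = re.findall(r"^[ \t]*-[ \t]*nid:[ \t]*(\S+)[ \t]*$",
                      net_show_text, flags=re.MULTILINE)
    return {nid.rsplit("@", 1)[1] for nid in nids} - {"lo"}


mr = local_nets(MR_PEER)
non_mr = local_nets(NON_MR_PEER)
print(sorted(mr & non_mr))  # ['o2ib4']
print(sorted(non_mr - mr))  # ['o2ib']
```

The second set holds the nets that exist only on the non-MR peer; a primary NID on any of them is not on a net local to the MR peer.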

          Activity

            [LU-13548] LNet: b2_12 discovery of non-MR peers may yield unreachable peer NIs

            Hi Olaf,

            Are you still having routing issues with this patch?

            Thanks,

            Serguei.

            ssmirnov Serguei Smirnov added a comment

            Hi Serguei and Aurelien,
            I put 2.12.6 + https://review.whamcloud.com/40857 on a system and had issues related to routers (and therefore LNet). I'm working on building 2.12.6 and will then check to see if I can still re-create the problem.

            ofaaland Olaf Faaland added a comment

            Serguei Smirnov (ssmirnov@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40857
            Subject: LU-13548 lnet: backport fix for discovery of non-MR peers
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: d20018597825ba5ad85ffec2bbd148ae4bc8ccb1

            gerrit Gerrit Updater added a comment

            Serguei, I don't know what your current workload is or when you will be able to port this fix, but I would appreciate it if you could push a partially ported patch somewhere so that I can look at it and see if I can finish the backport.

            degremoa Aurelien Degremont (Inactive) added a comment

            Thank you Serguei!

            charr Cameron Harr added a comment

            Hi Cameron,

            The earlier comment was about a test that I ran at the time, based on 2.12 + ported changes. It was just a proof of concept, as it broke something else. The actual patch with the proper fix went into a private branch, but it still needs to be ported to 2.12. I had assumed that the MRR feature was going to be ported to 2.12, but that was wrong. I'll add porting this fix to 2.12 to my list of things to do.

            Thanks,

            Serguei.

            ssmirnov Serguei Smirnov added a comment

            Serguei, you mention a port to 2.12 above. Which 2.12 minor version has the LU-11641 patch? Is this ticket still waiting on additional work?

            charr Cameron Harr added a comment

            Porting the changes from LU-11641 to 2.12 changes the discovery behaviour as follows:

            lnetctl discover 192.168.1.123@o2ib4
            discover:
                - primary nid: 192.168.1.123@o2ib4
                  Multi-Rail: False
                  peer ni:
                    - nid: 192.168.1.123@o2ib4

            lnetctl peer show
            peer:
                - primary nid: 192.168.1.123@o2ib4
                  Multi-Rail: False
                  peer ni:
                    - nid: 192.168.1.123@o2ib4
                      state: NA

            This is the correct behaviour. The same is observed on 2.13 (a 2.13 peer discovering a 2.10.8 peer, same configuration).
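
The corrected output can be reproduced from the ticket's data by keeping only the peer NIs that sit on a locally configured net. The sketch below illustrates the observed result only; it is not how the LU-11641 changes are implemented. The function name `usable_peer_view` is hypothetical, and taking the first remaining NID as primary is an assumption made for the example.

```python
def nid_net(nid: str) -> str:
    """Return the LNet part of a NID, e.g. '192.168.1.123@o2ib4' -> 'o2ib4'."""
    return nid.rsplit("@", 1)[1]


def usable_peer_view(peer_nids: list[str], local_nids: list[str]) -> dict:
    """Keep only peer NIDs on locally configured nets (loopback excluded).

    Hypothetical helper: the primary NID is assumed to be the first NID
    that survives the filter, or None if nothing is left.
    """
    local_nets = {nid_net(n) for n in local_nids} - {"lo"}
    kept = [n for n in peer_nids if nid_net(n) in local_nets]
    return {"primary nid": kept[0] if kept else None, "peer ni": kept}


# NIDs of the non-MR peer and of the discovering MR peer, from this ticket.
PEER_NIDS = ["192.168.1.123@o2ib", "192.168.1.123@o2ib4"]
LOCAL_NIDS = ["0@lo", "192.168.1.105@o2ib4"]

view = usable_peer_view(PEER_NIDS, LOCAL_NIDS)
print(view)
# {'primary nid': '192.168.1.123@o2ib4', 'peer ni': ['192.168.1.123@o2ib4']}
```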

             

            ssmirnov Serguei Smirnov added a comment

            People

              wc-triage WC Triage
              ssmirnov Serguei Smirnov