LNet: b2_12 discovery of non-MR peers may yield unreachable peer NIs

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.4
    • Labels: None
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      If a non-MR peer (2.10.8) is discovered by a 2.12 MR peer, the following problem may occur: if the non-MR peer has LNets that are not defined on the MR peer, a NID on one of the undefined LNets may be listed as primary. This later causes communication problems when mounting.

      Here's an example of the buggy discovery:

       

      lnetctl discover 192.168.1.123@o2ib4
      discover:
          - primary nid: 192.168.1.123@o2ib
            Multi-Rail: False
            peer ni:
              - nid: 192.168.1.123@o2ib4
              - nid: 192.168.1.123@o2ib

      lnetctl peer show
      peer:
          - primary nid: 192.168.1.123@o2ib
            Multi-Rail: False
            peer ni:
              - nid: 192.168.1.123@o2ib4
                state: NA
              - nid: 192.168.1.123@o2ib
                state: NA

       

      In the example above, the peer running the discovery has only one NID, on o2ib4, so recording the discovered peer under a primary NID on o2ib is a problem: the local node has no interface on that LNet.
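      The mismatch can be checked mechanically. The sketch below is a hypothetical helper, not part of lnetctl or LNet. It takes the NIDs from the outputs in this ticket and reports whether the peer's primary NID is on an LNet that the local node defines. The function names (lnet_of, check_peer) are illustrative only.

```python
def lnet_of(nid: str) -> str:
    """Return the normalised LNet name of a NID such as '192.168.1.123@o2ib4'."""
    if "@" not in nid:
        raise ValueError(f"not a NID: {nid!r}")
    net = nid.rpartition("@")[2]
    # LNet treats a network name without a number as network 0 (o2ib == o2ib0).
    return net if net[-1].isdigit() else net + "0"


def check_peer(primary_nid, peer_nids, local_nids):
    """Report whether a peer's primary NID is on a locally defined LNet."""
    local_nets = {lnet_of(n) for n in local_nids}
    return {
        "primary_on_local_net": lnet_of(primary_nid) in local_nets,
        "nids_on_local_nets": [n for n in peer_nids if lnet_of(n) in local_nets],
    }


# Values copied from the lnetctl outputs in this ticket.
local_nids = ["0@lo", "192.168.1.105@o2ib4"]
result = check_peer(
    "192.168.1.123@o2ib",
    ["192.168.1.123@o2ib4", "192.168.1.123@o2ib"],
    local_nids,
)
print(result)
```

      For the buggy discovery result, primary_on_local_net is False and the only NID on a local net is 192.168.1.123@o2ib4.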

       

      Here's the lnet config on the MR peer (the peer running discovery):

      lnetctl net show
      net:
          - net type: lo
            local NI(s):
              - nid: 0@lo
                status: up
          - net type: o2ib4
            local NI(s):
              - nid: 192.168.1.105@o2ib4
                status: up
                interfaces:
                    0: ib0

      Here's the lnet config on the non-MR peer (the peer being discovered):

      lnetctl net show
      net:
          - net type: lo
            local NI(s):
              - nid: 0@lo
                status: up
          - net type: o2ib
            local NI(s):
              - nid: 192.168.1.123@o2ib
                status: up
                interfaces:
                    0: ib0
          - net type: o2ib4
            local NI(s):
              - nid: 192.168.1.123@o2ib4
                status: up
                interfaces:
                    0: ib0

          Activity

            [LU-13548] LNet: b2_12 discovery of non-MR peers may yield unreachable peer NIs
            ssmirnov Serguei Smirnov added a comment (edited)

            FYI, with routing the following still results in a problem:

            NodeA --tcp0-- GW --tcp1-- NodeB 
            ------ NodeA ------
            lnetctl net show
                - net type: tcp9
                  local NI(s):
                    - nid: 192.168.122.10@tcp9
                - net type: tcp
                  local NI(s):
                    - nid: 192.168.122.142@tcp
            ------ NodeB ------
            lnetctl net show
            net:
                - net type: tcp1
                  local NI(s):
                    - nid: 192.168.122.40@tcp1
            ------ NodeB ------
            lnetctl peer show
            peer:
                - primary nid: 192.168.122.10@tcp9
                  Multi-Rail: True
                  peer ni:
                    - nid: 192.168.122.142@tcp
                      state: NA
                    - nid: 192.168.122.10@tcp9
                      state: NA

            Note that NodeB lists NodeA under the unreachable tcp9 primary NID. Even though NodeB knows a reachable NID for NodeA, a ping addressed to the primary NID fails:

            ------ NodeB ------
            lnetctl ping 192.168.122.10@tcp9
            manage:
                - ping:
                      errno: -1
                      descr: failed to ping 192.168.122.10@tcp9: Input/output error
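
            The routed case can be modelled the same way. The following is a minimal sketch, not LNet code: it treats a peer NID as usable if its LNet is either local or has a route configured. NodeB's route table is not shown in this comment, so the route to tcp (tcp0) through GW is an assumption taken from the diagram, and the function names are hypothetical.

```python
def normalise_net(net: str) -> str:
    """LNet treats an unnumbered network name as network 0 (tcp == tcp0)."""
    return net if net[-1].isdigit() else net + "0"


def reachable_nids(peer_nids, local_nets, routed_nets):
    """Return the peer NIDs whose LNet is local or has a route configured."""
    reachable = {normalise_net(n) for n in set(local_nets) | set(routed_nets)}
    return [
        nid for nid in peer_nids
        if normalise_net(nid.rpartition("@")[2]) in reachable
    ]


# NodeB as shown in this comment: one local net, tcp1.
node_b_local = ["tcp1"]
# Assumption: NodeB has a route to tcp0 through GW, as the diagram implies.
node_b_routed = ["tcp0"]

node_a_primary = "192.168.122.10@tcp9"
node_a_nids = ["192.168.122.142@tcp", "192.168.122.10@tcp9"]

usable = reachable_nids(node_a_nids, node_b_local, node_b_routed)
print("primary reachable:", node_a_primary in usable)
print("usable NIDs:", usable)
```

            Under that assumption, only 192.168.122.142@tcp is usable from NodeB, while the tcp9 NID recorded as primary is not.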

            This is being tracked in LU-14386.


            degremoa Aurelien Degremont (Inactive) added a comment

            My 2 cents: our minimal testing confirmed that this patch works, but I didn't test with routers.

            ssmirnov Serguei Smirnov added a comment

            Hi Olaf,

            Are you still having routing issues with this patch?

            Thanks,

            Serguei.

            ofaaland Olaf Faaland added a comment

            Hi Serguei and Aurelien,
            I put 2.12.6 + https://review.whamcloud.com/40857 on a system and had issues related to routers (and therefore LNet). I'm working on building 2.12.6 and will then check whether I can still re-create the problem.

            gerrit Gerrit Updater added a comment

            Serguei Smirnov (ssmirnov@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40857
            Subject: LU-13548 lnet: backport fix for discovery of non-MR peers
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: d20018597825ba5ad85ffec2bbd148ae4bc8ccb1

            degremoa Aurelien Degremont (Inactive) added a comment

            Serguei, I don't know what your current workload is or when you will be able to port this fix, but I would appreciate it if you could push a partially ported patch somewhere so that I can look at it and see if I can finish the backport.

            charr Cameron Harr added a comment

            Thank you Serguei!

            ssmirnov Serguei Smirnov added a comment

            Hi Cameron,

            The earlier comment was about a test that I ran at the time, based on 2.12 + ported changes. It was just a proof of concept, as it broke something else. The actual patch with the proper fix went into a private branch, but it still needs to be ported to 2.12. I had assumed that the MRR feature was going to be ported to 2.12, but that was wrong. I'll add porting this fix to 2.12 to my list of things to do.

            Thanks,

            Serguei.
            charr Cameron Harr added a comment

            Serguei, You mention a port to 2.12 above. Which 2.12 minor version has the LU-11641 patch? Is this ticket still waiting on additional work?

            ssmirnov Serguei Smirnov added a comment

            Porting the changes from LU-11641 to 2.12 changes the discovery behaviour as follows:

            lnetctl discover 192.168.1.123@o2ib4
            discover:
                - primary nid: 192.168.1.123@o2ib4
                  Multi-Rail: False
                  peer ni:
                    - nid: 192.168.1.123@o2ib4

            lnetctl peer show
            peer:
                - primary nid: 192.168.1.123@o2ib4
                  Multi-Rail: False
                  peer ni:
                    - nid: 192.168.1.123@o2ib4
                      state: NA

            This is the correct behaviour. The same is observed on 2.13 (a 2.13 peer discovering a 2.10.8 peer, same configuration).

            People

              wc-triage WC Triage
              ssmirnov Serguei Smirnov
              Votes: 1
              Watchers: 16
