[LU-13548] LNet: b2_12 discovery of non-MR peers may yield unreachable peer NIs Created: 12/May/20  Updated: 15/Mar/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.4
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: Serguei Smirnov
Assignee: WC Triage
Resolution: Unresolved
Votes: 1
Labels: None

Issue Links:
Related
is related to LU-11840 Multi rail dynamic discovery prevent ... Open
is related to LU-14386 LNet: select reachable remote peer nid Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

If a non-MR peer (2.10.8) is discovered by a 2.12 MR peer, the following problem may occur: if the non-MR peer has LNets that are not defined on the MR peer, a NID on one of those undefined LNets may be listed as the primary NID. This later causes communication problems when mounting.

Here's an example of the buggy discovery:

 

lnetctl discover 192.168.1.123@o2ib4
discover:
    - primary nid: 192.168.1.123@o2ib
      Multi-Rail: False
      peer ni:
        - nid: 192.168.1.123@o2ib4
        - nid: 192.168.1.123@o2ib

lnetctl peer show
peer:
    - primary nid: 192.168.1.123@o2ib
      Multi-Rail: False
      peer ni:
        - nid: 192.168.1.123@o2ib4
          state: NA
        - nid: 192.168.1.123@o2ib
          state: NA

 

In the example above, the peer running the discovery has only one NID, on o2ib4, so recording the discovered peer with a primary NID on o2ib is a problem: the discovering peer has no interface on o2ib.
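The condition described above can be checked mechanically. The following is only a minimal sketch, not LNet code: the helper names net_of and primary_is_reachable are hypothetical, and it assumes that a net name without a trailing number refers to net 0 of that type (so "o2ib" and "o2ib0" are the same net). It ignores routes and only compares the primary NID's net against the locally configured nets.

```python
def net_of(nid):
    """Return the net part of a NID, e.g. 'o2ib4' for '192.168.1.123@o2ib4'."""
    net = nid.rsplit("@", 1)[1]
    # Assumption: a net name with no trailing number means net 0 of that type.
    return net if net[-1].isdigit() else net + "0"


def primary_is_reachable(primary_nid, local_nids):
    """True if the primary NID is on a net that has a local NI (routes ignored)."""
    local_nets = {net_of(nid) for nid in local_nids}
    return net_of(primary_nid) in local_nets


# NIDs of the MR peer running discovery, taken from its 'lnetctl net show' output.
local_nids = ["0@lo", "192.168.1.105@o2ib4"]

# Primary NID reported by the buggy discovery: its net is not configured locally.
print(primary_is_reachable("192.168.1.123@o2ib", local_nids))

# The NID that was actually used for discovery is on a local net.
print(primary_is_reachable("192.168.1.123@o2ib4", local_nids))
```

For the configuration in this ticket the first check fails and the second succeeds, which matches the symptom: the peer is filed under a NID the local node cannot send to directly.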

 

Here's the lnet config on the MR peer (the peer running discovery):

lnetctl net show
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: o2ib4
      local NI(s):
        - nid: 192.168.1.105@o2ib4
          status: up
          interfaces:
              0: ib0

Here's the lnet config on the non-MR peer (the peer being discovered):

lnetctl net show
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: o2ib
      local NI(s):
        - nid: 192.168.1.123@o2ib
          status: up
          interfaces:
              0: ib0
    - net type: o2ib4
      local NI(s):
        - nid: 192.168.1.123@o2ib4
          status: up
          interfaces:
              0: ib0
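To spot this kind of mismatch on other node pairs, the two "lnetctl net show" outputs can be compared directly. The sketch below is an illustration only: it scans the text for "- net type:" lines with a regular expression rather than using a YAML parser, and the function names net_types and nets_only_on_peer are hypothetical. The sample strings are abbreviated copies of the two configurations shown above.

```python
import re

# Matches lines such as '    - net type: o2ib4' and captures the net name.
NET_TYPE_RE = re.compile(r"^[ \t]*-[ \t]*net type:[ \t]*(\S+)[ \t]*$", re.MULTILINE)


def net_types(net_show_text):
    """Return the net names listed in 'lnetctl net show' output, in order."""
    return NET_TYPE_RE.findall(net_show_text)


def nets_only_on_peer(local_text, peer_text):
    """Nets configured on the peer but not on the local node."""
    return sorted(set(net_types(peer_text)) - set(net_types(local_text)))


MR_PEER = """\
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
    - net type: o2ib4
      local NI(s):
        - nid: 192.168.1.105@o2ib4
"""

NON_MR_PEER = """\
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
    - net type: o2ib
      local NI(s):
        - nid: 192.168.1.123@o2ib
    - net type: o2ib4
      local NI(s):
        - nid: 192.168.1.123@o2ib4
"""

# Nets on which the non-MR peer has a NID that the MR peer cannot reach directly.
print(nets_only_on_peer(MR_PEER, NON_MR_PEER))
```

Any net reported here is a candidate for the problem in this ticket if discovery picks the peer's NID on that net as primary.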



 Comments   
Comment by Serguei Smirnov [ 12/May/20 ]

Porting the changes from LU-11641 to 2.12 changes the discovery behaviour as follows:

lnetctl discover 192.168.1.123@o2ib4
discover:
    - primary nid: 192.168.1.123@o2ib4
      Multi-Rail: False
      peer ni:
        - nid: 192.168.1.123@o2ib4

lnetctl peer show
peer:
    - primary nid: 192.168.1.123@o2ib4
      Multi-Rail: False
      peer ni:
        - nid: 192.168.1.123@o2ib4
          state: NA

This is the correct behaviour. The same is observed on 2.13 (a 2.13 peer discovering a 2.10.8 peer with the same configuration).

 

Comment by Cameron Harr [ 01/Dec/20 ]

Serguei, you mention a port to 2.12 above. Which 2.12 minor version has the LU-11641 patch? Is this ticket still waiting on additional work?

Comment by Serguei Smirnov [ 01/Dec/20 ]

Hi Cameron,

The earlier comment was about a test that I ran at the time, based on 2.12 plus the ported changes. It was just a proof of concept, as it broke something else. The actual patch with the proper fix went into a private branch, but it still needs to be ported to 2.12. I had assumed that the MRR feature was going to be ported to 2.12, but that was wrong. I'll add porting this fix to 2.12 to my list of things to do.

Thanks,

Serguei.

Comment by Cameron Harr [ 01/Dec/20 ]

Thank you Serguei!

Comment by Aurelien Degremont (Inactive) [ 02/Dec/20 ]

Serguei, I don't know what your current workload is or when you will be able to port this fix, but I would appreciate it if you could push a partially ported patch somewhere so that I can look at it and see whether I can finish the backport.

Comment by Gerrit Updater [ 03/Dec/20 ]

Serguei Smirnov (ssmirnov@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40857
Subject: LU-13548 lnet: backport fix for discovery of non-MR peers
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: d20018597825ba5ad85ffec2bbd148ae4bc8ccb1

Comment by Olaf Faaland [ 18/Dec/20 ]

Hi Serguei and Aurelien,
I put 2.12.6 + https://review.whamcloud.com/40857 on a system and had issues related to routers (and therefore LNet). I'm working on building 2.12.6 and will then check whether I can still re-create the problem.

Comment by Serguei Smirnov [ 27/Jan/21 ]

Hi Olaf,

Are you still having routing issues with this patch?

Thanks,

Serguei.

Comment by Aurelien Degremont (Inactive) [ 28/Jan/21 ]

My 2 cents: our minimal testing confirmed this patch is working. But I didn't test with routers.

Comment by Serguei Smirnov [ 28/Jan/21 ]

FYI, with routing the following still results in a problem:

NodeA --tcp0-- GW --tcp1-- NodeB 
------ NodeA ------
lnetctl net show
    - net type: tcp9
      local NI(s):
        - nid: 192.168.122.10@tcp9
    - net type: tcp
      local NI(s):
        - nid: 192.168.122.142@tcp
------ NodeB ------
lnetctl net show
net:
    - net type: tcp1
      local NI(s):
        - nid: 192.168.122.40@tcp1
------ NodeB ------
lnetctl peer show
peer:
    - primary nid: 192.168.122.10@tcp9
      Multi-Rail: True
      peer ni:
        - nid: 192.168.122.142@tcp
          state: NA
        - nid: 192.168.122.10@tcp9
          state: NA

Note that NodeB lists NodeA under the unreachable tcp9 primary NID. Even though NodeB is aware of the reachable NID for NodeA, it fails when the primary NID is used:

------ NodeB ------
lnetctl ping 192.168.122.10@tcp9
manage:
    - ping:
          errno: -1
          descr: failed to ping 192.168.122.10@tcp9: Input/output error

This is being tracked in LU-14386.
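The routed example above can be modelled in a few lines to show what a "reachable NID" would mean for NodeB. This is a hedged sketch only and is not the LU-14386 implementation: the function names norm_net, net_of and pick_reachable_nid are hypothetical, the route from NodeB to tcp through GW is an assumption (NodeB's route configuration is not shown in this ticket), and a net name without a trailing number is assumed to mean net 0 of that type.

```python
def norm_net(net):
    """Normalize a net name; assumption: 'tcp' and 'tcp0' name the same net."""
    return net if net[-1].isdigit() else net + "0"


def net_of(nid):
    """Return the normalized net part of a NID."""
    return norm_net(nid.rsplit("@", 1)[1])


def pick_reachable_nid(primary_nid, peer_nids, local_nets, routed_nets=()):
    """Return the primary NID if its net is reachable, else the first
    reachable peer NID, else None."""
    reachable = {norm_net(n) for n in local_nets} | {norm_net(n) for n in routed_nets}
    candidates = [primary_nid] + [n for n in peer_nids if n != primary_nid]
    for nid in candidates:
        if net_of(nid) in reachable:
            return nid
    return None


# NodeB's view of NodeA, from 'lnetctl peer show' on NodeB.
primary = "192.168.122.10@tcp9"
peer_nids = ["192.168.122.142@tcp", "192.168.122.10@tcp9"]

# NodeB has a local NI on tcp1; the route to tcp via GW is assumed.
print(pick_reachable_nid(primary, peer_nids, ["tcp1"], routed_nets=["tcp"]))

# Without any route, none of NodeA's NIDs is reachable from NodeB.
print(pick_reachable_nid(primary, peer_nids, ["tcp1"]))
```

Under these assumptions the tcp9 primary NID is skipped and NodeA's tcp NID is chosen instead, which is the NID NodeB can actually reach through the gateway.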
