[LU-13548] LNet: b2_12 discovery of non-MR peers may yield unreachable peer NIs Created: 12/May/20 Updated: 15/Mar/22 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Serguei Smirnov | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 1 |
| Labels: | None |
| Issue Links: |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
If a non-MR peer (2.10.8) is discovered by a 2.12 MR peer, the following problem may occur: if the non-MR peer has LNets that are not defined on the MR peer, a NID on one of those undefined LNets may be listed as the primary NID. This later causes communication problems when mounting. Here's an example of the buggy discovery:
In the example above, the peer running the discovery has only one NID, on o2ib4, so designating a peer with a primary NID on o2ib is a problem.
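To make the expected behaviour concrete, here is a minimal Python sketch of the selection rule implied above: prefer a peer NID whose LNet is also configured locally. This is not LNet code and not the patch discussed in this ticket. The function names and the IP addresses are hypothetical; only the net names o2ib and o2ib4 come from the description.

```python
from typing import Iterable, Optional


def net_of(nid: str) -> str:
    """Return the LNet name of a NID such as '10.0.4.2@o2ib4'."""
    addr, sep, net = nid.partition("@")
    if not addr or not sep or not net:
        raise ValueError(f"not a NID: {nid!r}")
    return net


def pick_primary(peer_nids: Iterable[str],
                 local_nets: Iterable[str]) -> Optional[str]:
    """Return the first peer NID on a locally configured LNet, else None.

    Hypothetical helper: it only illustrates the rule that the primary
    NID should sit on a net the discovering node actually has.
    """
    local = set(local_nets)
    for nid in peer_nids:
        if net_of(nid) in local:
            return nid
    return None


# Hypothetical addresses; the non-MR peer advertises o2ib first.
peer_nids = ["10.0.0.2@o2ib", "10.0.4.2@o2ib4"]
local_nets = ["o2ib4"]  # the discovering peer has only one NID, on o2ib4

print(pick_primary(peer_nids, local_nets))  # 10.0.4.2@o2ib4
```

With this rule the o2ib NID is skipped because o2ib is not defined on the discovering peer. The sketch ignores routed nets, which matter for the routing case raised later in the comments.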
Here's the lnet config on the MR peer (the peer running discovery):
Here's the lnet config on the non-MR peer (the peer being discovered):
|
| Comments |
| Comment by Serguei Smirnov [ 12/May/20 ] |
|
It has been determined that porting changes from
This is the correct behaviour. The same is observed on 2.13 (a 2.13 peer discovering a 2.10.8 peer with the same configuration).
|
| Comment by Cameron Harr [ 01/Dec/20 ] |
|
Serguei, You mention a port to 2.12 above. Which 2.12 minor version has the |
| Comment by Serguei Smirnov [ 01/Dec/20 ] |
|
Hi Cameron, The earlier comment was about a test that I ran at the time, based on 2.12 plus ported changes. It was just a proof of concept, as it broke something else. The actual patch with the proper fix went into a private branch, but it still needs to be ported to 2.12. I had assumed that the MRR feature was going to be ported to 2.12, but that was wrong. I'll add porting this fix to 2.12 to my list of things to do. Thanks, Serguei. |
| Comment by Cameron Harr [ 01/Dec/20 ] |
|
Thank you Serguei! |
| Comment by Aurelien Degremont (Inactive) [ 02/Dec/20 ] |
|
Serguei, I don't know what your current workload is or when you will be able to port this fix, but I would appreciate it if you could push a partially ported patch somewhere so that I can look at it and see if I can finish the backport. |
| Comment by Gerrit Updater [ 03/Dec/20 ] |
|
Serguei Smirnov (ssmirnov@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40857 |
| Comment by Olaf Faaland [ 18/Dec/20 ] |
|
Hi Serguei and Aurelien, |
| Comment by Serguei Smirnov [ 27/Jan/21 ] |
|
Hi Olaf, Are you still having routing issues with this patch? Thanks, Serguei. |
| Comment by Aurelien Degremont (Inactive) [ 28/Jan/21 ] |
|
My 2 cents: our minimal testing confirmed this patch is working. But I didn't test with routers. |
| Comment by Serguei Smirnov [ 28/Jan/21 ] |
|
FYI, with routing the following still results in a problem:
NodeA --tcp0-- GW --tcp1-- NodeB
------ NodeA ------
lnetctl net show
- net type: tcp9
  local NI(s):
    - nid: 192.168.122.10@tcp9
- net type: tcp
  local NI(s):
    - nid: 192.168.122.142@tcp
------ NodeB ------
lnetctl net show
net:
    - net type: tcp1
      local NI(s):
        - nid: 192.168.122.40@tcp1
------ NodeB ------
lnetctl peer show
peer:
    - primary nid: 192.168.122.10@tcp9
      Multi-Rail: True
      peer ni:
        - nid: 192.168.122.142@tcp
          state: NA
        - nid: 192.168.122.10@tcp9
          state: NA
Note that NodeB lists NodeA under the unreachable tcp9 primary NID. Even though NodeB is aware of a reachable NID for NodeA, communication fails when the primary NID is used:
------ NodeB ------
lnetctl ping 192.168.122.10@tcp9
manage:
    - ping:
          errno: -1
          descr: failed to ping 192.168.122.10@tcp9: Input/output error
This is being tracked in LU-14386 |
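
For anyone who wants to check whether a node is in the state shown above, here is a rough Python sketch that scans `lnetctl peer show` text for primary NIDs on nets the node cannot reach. It is only an illustration, not part of any patch. The function name is hypothetical, and the set of reachable nets is an assumption: tcp1 is NodeB's local net from the output above, and tcp is assumed to be routed through GW as the topology suggests.

```python
from typing import List, Set

# `lnetctl peer show` output from NodeB, as posted in the comment above.
PEER_SHOW = """\
peer:
    - primary nid: 192.168.122.10@tcp9
      Multi-Rail: True
      peer ni:
        - nid: 192.168.122.142@tcp
          state: NA
        - nid: 192.168.122.10@tcp9
          state: NA
"""


def unreachable_primaries(peer_show: str, reachable_nets: Set[str]) -> List[str]:
    """Return primary NIDs whose LNet is not in reachable_nets.

    Hypothetical helper: a plain line scan rather than a YAML parser,
    so it works on the text regardless of indentation.
    """
    bad = []
    for line in peer_show.splitlines():
        # Drop indentation and the leading list dash, then split key/value.
        key, sep, value = line.strip().lstrip("- ").partition(":")
        if sep and key.strip() == "primary nid":
            nid = value.strip()
            if nid.rpartition("@")[2] not in reachable_nets:
                bad.append(nid)
    return bad


# Assumption: tcp1 is local to NodeB and tcp is routed via GW.
reachable = {"tcp1", "tcp"}
print(unreachable_primaries(PEER_SHOW, reachable))  # ['192.168.122.10@tcp9']
```

The sketch compares net names as plain strings, so it does not treat `tcp` and `tcp0` as the same net. Pass whichever form appears in the output being checked.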