[LU-12889] Do not assume peers are MR capable Created: 20/Oct/19  Updated: 22/Oct/20  Resolved: 01/Feb/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0
Fix Version/s: Lustre 2.14.0

Type: Bug Priority: Critical
Reporter: Chris Horn Assignee: Chris Horn
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-12955 AST replies are dropped when servers ... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

If a peer has discovery disabled then it will not consolidate peer
NI information. This means we need to use a consistent source NI
when sending to it just like we do for non-MR peers.
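A rough illustration of what "consistent source NI" means here, as a minimal C sketch. It is not the LNet selection code: every name in it (`sketch_peer_state`, `sketch_select_src_ni`, `SKETCH_NO_NI`) is hypothetical, and it assumes local NIs can be identified by a simple index.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sentinel: no source NI has been pinned for this peer yet. */
#define SKETCH_NO_NI ((size_t)-1)

/* Hypothetical per-peer state; not the real struct lnet_peer. */
struct sketch_peer_state {
	bool   multi_rail;     /* true only once the peer is known to be MR capable */
	size_t pinned_src_ni;  /* local NI index used for earlier sends, or SKETCH_NO_NI */
};

/*
 * Pick the local NI (by index) to send from.
 * - MR peers: rotate round-robin over all local NIs.
 * - Non-MR peers (including peers with discovery disabled): reuse the NI
 *   chosen for the first send, so the peer always sees the same source.
 * The caller must pass num_local_nis > 0.
 */
static size_t sketch_select_src_ni(struct sketch_peer_state *peer,
				   size_t *rr_cursor, size_t num_local_nis)
{
	size_t choice;

	if (!peer->multi_rail && peer->pinned_src_ni != SKETCH_NO_NI)
		return peer->pinned_src_ni;

	choice = *rr_cursor % num_local_nis;
	*rr_cursor = choice + 1;

	if (!peer->multi_rail)
		peer->pinned_src_ni = choice;

	return choice;
}
```

In this sketch, treating a peer as MR capable by default would send successive messages from different NIs, which is the behavior that breaks a peer that cannot consolidate NI information.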

A comment in lnet_discovery_event_reply() indicates that this was a
known issue, but the situation is not handled properly.

Do not assume peers are multi-rail capable when peer objects are
allocated and initialized.

Do not mark a peer as multi-rail capable unless all of the following
conditions are satisfied:
1. The peer has the MR feature flag set.
2. The peer has discovery enabled.
3. We have discovery enabled locally.
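
The three conditions above can be sketched as a single predicate. This is only an illustration, not the code from the patch: the `SKETCH_*` feature bits, `struct sketch_peer`, and both functions are hypothetical names, and the sketch assumes the peer reports its discovery setting as a feature bit alongside the MR flag.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical feature bits reported by a peer; not the real LNet values. */
#define SKETCH_FEAT_MULTI_RAIL (1u << 0)
#define SKETCH_FEAT_DISCOVERY  (1u << 1)

/* Hypothetical peer object; not the real struct lnet_peer. */
struct sketch_peer {
	uint32_t features;    /* feature bits last reported by the peer */
	bool     multi_rail;  /* whether we treat the peer as MR capable */
};

/* A newly allocated peer is NOT assumed to be multi-rail capable. */
static void sketch_peer_init(struct sketch_peer *peer)
{
	peer->features = 0;
	peer->multi_rail = false;
}

/* All three conditions must hold before a peer is marked MR capable. */
static bool sketch_peer_is_mr_capable(uint32_t peer_features,
				      bool local_discovery_enabled)
{
	return (peer_features & SKETCH_FEAT_MULTI_RAIL) != 0 && /* 1. MR flag set */
	       (peer_features & SKETCH_FEAT_DISCOVERY) != 0 &&  /* 2. peer discovery on */
	       local_discovery_enabled;                         /* 3. local discovery on */
}

/* Record the features a peer reported and re-evaluate its MR status. */
static void sketch_peer_update(struct sketch_peer *peer, uint32_t peer_features,
			       bool local_discovery_enabled)
{
	peer->features = peer_features;
	peer->multi_rail = sketch_peer_is_mr_capable(peer_features,
						     local_discovery_enabled);
}
```

Under these assumptions, a peer that advertises the MR flag but has discovery disabled stays non-MR, which is the mixed configuration this ticket is about.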

Marked the ticket as critical because it can break setups where one side has discovery enabled and the other side has it disabled.



 Comments   
Comment by Gerrit Updater [ 20/Oct/19 ]

Chris Horn (hornc@cray.com) uploaded a new patch: https://review.whamcloud.com/36512
Subject: LU-12889 lnet: Do not assume peers are MR capable
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 4a448bf2e5de7675658d3c114c2b7af675b34e60

Comment by Chris Horn [ 12/Nov/19 ]

I don't know if this patch should wait to land until a solution is found for https://jira.whamcloud.com/browse/LU-12955

Comment by Gerrit Updater [ 01/Feb/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36512/
Subject: LU-12889 lnet: Do not assume peers are MR capable
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 3c580c93b8d3e94fac0ac2cf3cca2ff706c6497a

Comment by Peter Jones [ 01/Feb/20 ]

Chris, it seems to have landed - should it be reverted? For future reference, it is safer to apply a -1 (that can later be removed) in Gerrit if you want to "hit the pause button" on something landing for the time being - you can't assume that the gatekeeper is reading every single JIRA ticket.

Comment by Chris Horn [ 03/Feb/20 ]

pjones I don't think it needs to be reverted. The issue only impacts mixed MR/non-MR configurations so it shouldn't affect maloo testing. It should be sufficient to land the fix for LU-12955.

Comment by Peter Jones [ 03/Feb/20 ]

ok thanks hornc. I've flagged that ticket for 2.14. It looks like both you and Amir have possible approaches for that ticket but I'll leave the two of you to duke it out on which to use

Comment by Gerrit Updater [ 22/Oct/20 ]

Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40345
Subject: LU-12889 lnet: Do not assume peers are MR capable
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 8cdd1b7b76fafaac7e14c0b9b468f01f8ea89cfe
