[LU-13575] LNet should ensure round-robin interface selection when interfaces are healthy Created: 15/May/20  Updated: 14/Oct/23  Resolved: 10/Oct/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.15.0

Type: Improvement Priority: Minor
Reporter: Chris Horn Assignee: Chris Horn
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
Rank (Obsolete): 9223372036854775807
Epic Link: unlabelled-LU-13422

 Description   

When an interface fails, stays out of commission for a period of time, and is then brought back into commission, the sequence number of the interface that remained in use will be far larger than that of the newly recommissioned interface. As a result, the recommissioned interface is used continuously until its sequence number catches up with that of the in-use interface. This is not ideal behavior: the system has two available interfaces, but only one is being used, simply because of the sequence number, which is intended to enable round robin. Ideally, once an interface comes back into service, it should immediately be used in rotation with the other healthy interfaces.

A similar thing happens when there are a lot of source-specified sends. One NI gets a burst of sequence increments, so it then takes a while for the other NIs to "catch up".
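
The behavior described above can be reproduced with a small model. This is only an illustrative sketch in Python, not LNet code: the `Interface` class, `select_lowest_seq`, and `simulate_recovery` are hypothetical names, and the model assumes selection simply picks the healthy interface with the lowest sequence number and increments it by one.

```python
from dataclasses import dataclass


@dataclass
class Interface:
    """Hypothetical stand-in for a network interface (not an LNet struct)."""
    name: str
    seq: int = 0          # bumped by one each time the interface is selected
    healthy: bool = True


def select_lowest_seq(interfaces):
    """Assumed selection rule: lowest sequence number among healthy interfaces."""
    candidates = [ni for ni in interfaces if ni.healthy]
    chosen = min(candidates, key=lambda ni: ni.seq)
    chosen.seq += 1
    return chosen


def simulate_recovery(outage_sends=1000, later_sends=10):
    """Send while ni_b is down, bring it back, and record the next picks."""
    a, b = Interface("ni_a"), Interface("ni_b")
    b.healthy = False
    for _ in range(outage_sends):
        select_lowest_seq([a, b])      # only ni_a is eligible
    b.healthy = True                   # ni_b returns with a stale, low seq
    return [select_lowest_seq([a, b]).name for _ in range(later_sends)]


picks = simulate_recovery()
print(picks)
```

In this model every send after the recovery goes to `ni_b`, and `ni_a` is not selected again until `ni_b` has absorbed as many sends as `ni_a` handled during the outage.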

We should modify the sequence number manipulation to help ensure we actually round robin when desired, or otherwise modify the relevant code.
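
One possible way to manipulate the sequence number is sketched below in Python. It is an assumption made for illustration, not a description of what the patches on this ticket do: on each selection, the chosen interface's sequence number is set to one more than the largest sequence number among the healthy candidates, so a stale low value is consumed by a single send. All names (`Interface`, `select_round_robin`, `simulate_recovery`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Interface:
    """Hypothetical stand-in for a network interface (not an LNet struct)."""
    name: str
    seq: int = 0
    healthy: bool = True


def select_round_robin(interfaces):
    """Pick the lowest seq, then move the chosen interface to the back."""
    candidates = [ni for ni in interfaces if ni.healthy]
    chosen = min(candidates, key=lambda ni: ni.seq)
    # Assumed rule: jump past every other candidate instead of adding one,
    # so one selection is enough to erase any accumulated gap.
    chosen.seq = max(ni.seq for ni in candidates) + 1
    return chosen


def simulate_recovery(outage_sends=1000, later_sends=6):
    """Send while ni_b is down, bring it back, and record the next picks."""
    a, b = Interface("ni_a"), Interface("ni_b")
    b.healthy = False
    for _ in range(outage_sends):
        select_round_robin([a, b])     # only ni_a is eligible
    b.healthy = True                   # ni_b returns with a stale, low seq
    return [select_round_robin([a, b]).name for _ in range(later_sends)]


picks = simulate_recovery()
print(picks)
```

Under this rule the recovered interface is used on the very next send and the two interfaces alternate from then on, regardless of how long the outage lasted.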



 Comments   
Comment by Gerrit Updater [ 21/Sep/21 ]

"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/45003
Subject: LU-13575 lnet: Ensure round robin selection of local NIs
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1c9b558346be925aa5e3e7e9e24fc5d7ff6ea3b8

Comment by Gerrit Updater [ 21/Sep/21 ]

"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/45004
Subject: LU-13575 lnet: Ensure round robin selection of peer NIs
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 3a60ea6a3c8f341b130c4176821a1ad8c5067033

Comment by Gerrit Updater [ 10/Oct/21 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45003/
Subject: LU-13575 lnet: Ensure round robin selection of local NIs
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: a18c4a16246e6185919eda805eca52772bbc3efe

Comment by Gerrit Updater [ 10/Oct/21 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45004/
Subject: LU-13575 lnet: Ensure round robin selection of peer NIs
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: c51763948abfdbdc8e3f3ea7e73f2632320a095a

Comment by Peter Jones [ 10/Oct/21 ]

Landed for 2.15
