[LU-15713] Round robin across nets can be broken Created: 01/Apr/22  Updated: 03/Jul/23  Resolved: 11/Jul/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Major
Reporter: Chris Horn Assignee: Chris Horn
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue is very similar to LU-13575, but it concerns round robin across multiple nets, whereas that ticket was about round robin across interfaces within a single net.

Currently, if a peer has multiple network types (either multiple LNDs or multiple nets on its interfaces), there are situations where traffic is routed only to the interfaces on one net, for example when the peer is talking to another peer that only has interfaces on one of the nets, or when interfaces on the other net go down for an extended period of time. This causes the peer net/local net sequence numbers to diverge in the same manner documented in LU-13575. Once the sequence numbers have diverged, future traffic can funnel to just one of the available nets, leading to degraded performance.
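The divergence described above can be illustrated with a small toy model. This is only a sketch, not LNet code: it assumes that selection prefers the eligible net with the lowest sequence number and increments that number on each use, which is a simplification of the real selection logic. The names `Net`, `pick`, and `simulate`, and the net names `tcp` and `o2ib`, are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Net:
    name: str
    seq: int = 0  # bumped every time this net is selected (assumed model)


def pick(nets, eligible):
    """Select the eligible net with the lowest sequence number and bump it."""
    best = min((n for n in nets if n.name in eligible), key=lambda n: n.seq)
    best.seq += 1
    return best.name


def simulate(forced_sends, mixed_sends):
    """Send on one net only, then let both nets compete for later sends."""
    nets = [Net("tcp"), Net("o2ib")]
    # Phase 1: only "tcp" is usable (e.g. the remote peer has no o2ib interface).
    for _ in range(forced_sends):
        pick(nets, {"tcp"})
    # Phase 2: both nets are usable again; record which one each send uses.
    return [pick(nets, {"tcp", "o2ib"}) for _ in range(mixed_sends)]


if __name__ == "__main__":
    print(simulate(0, 4))     # no divergence: sends alternate between nets
    print(simulate(100, 10))  # diverged: every send lands on the same net
```

In this model, after 100 sends restricted to one net, the other net's sequence number is so far behind that it wins every subsequent selection until it catches up, so traffic is not spread across both nets during that time.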



 Comments   
Comment by Gerrit Updater [ 11/Jul/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46976/
Subject: LU-15713 lnet: Ensure round robin across nets
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 05413b3d84f7d1febb89cf4e9c86a7e017d147df

Comment by Peter Jones [ 11/Jul/22 ]

Landed for 2.16

Comment by Gerrit Updater [ 03/Jul/23 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51547
Subject: LU-15713 lnet: Ensure round robin across nets
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: 4b616f00cacc448f1be6607754feb77dbb347167
