[LU-17505] socklnd: return LNET_MSG_STATUS_NETWORK_TIMEOUT to LNet on ETIMEDOUT Created: 05/Feb/24  Updated: 08/Feb/24

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Serguei Smirnov Assignee: Serguei Smirnov
Resolution: Unresolved Votes: 0
Labels: lnet, lnet-health, socklnd

Issue Links:
Related
is related to LU-17379 try MGS NIDs more quickly at initial ... In Progress
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Currently, socklnd returns LNET_MSG_STATUS_LOCAL_TIMEOUT to LNet when an ETIMEDOUT error occurs. This causes LNet to decrement only the local NI health score, even though the problem may actually be with the remote NI. As a result, the peer NI health score is not decremented, and LNet continues to treat that peer NI as being as good a choice for sending as the other options.

Returning LNET_MSG_STATUS_NETWORK_TIMEOUT instead would cause LNet to decrement both the local NI and the peer NI health scores. If the local NI is healthy, it will recover its score quickly, while the proposed change would allow the peer NI score to stay properly lowered until the peer NI recovers.
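
The difference between the two statuses can be illustrated with a small standalone model. This is only a sketch, not the socklnd or LNet code: every `sketch_` name, the `proposed` flag, the negative-errno convention, and the health values (maximum 1000, step 100) are assumptions made for illustration, and health recovery is not modeled.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the statuses named in this ticket. */
enum sketch_hstatus {
	SKETCH_STATUS_OK,
	SKETCH_STATUS_LOCAL_TIMEOUT,
	SKETCH_STATUS_NETWORK_TIMEOUT,
	SKETCH_STATUS_OTHER,	/* any other error; not modeled here */
};

/* Assumed values, chosen only to make the example concrete. */
#define SKETCH_HEALTH_MAX	1000
#define SKETCH_HEALTH_STEP	100

struct sketch_health {
	int local_ni;	/* health score of the local NI */
	int peer_ni;	/* health score of the peer NI */
};

/*
 * Map a send result (0 or a negative errno, by assumption) to a status.
 * proposed == 0 models the current behavior described in the ticket,
 * proposed != 0 models the change proposed in the ticket.
 */
enum sketch_hstatus sketch_rc_to_hstatus(int rc, int proposed)
{
	if (rc == 0)
		return SKETCH_STATUS_OK;
	if (rc == -ETIMEDOUT)
		return proposed ? SKETCH_STATUS_NETWORK_TIMEOUT
				: SKETCH_STATUS_LOCAL_TIMEOUT;
	return SKETCH_STATUS_OTHER;
}

/* Lower a score by one step without going below zero. */
static int sketch_decrement(int score)
{
	return score > SKETCH_HEALTH_STEP ? score - SKETCH_HEALTH_STEP : 0;
}

/* Apply the health effect that the ticket attributes to each status. */
void sketch_apply_hstatus(struct sketch_health *h, enum sketch_hstatus st)
{
	assert(h != NULL);

	switch (st) {
	case SKETCH_STATUS_LOCAL_TIMEOUT:
		/* Only the local NI is penalized. */
		h->local_ni = sketch_decrement(h->local_ni);
		break;
	case SKETCH_STATUS_NETWORK_TIMEOUT:
		/* Both the local NI and the peer NI are penalized. */
		h->local_ni = sketch_decrement(h->local_ni);
		h->peer_ni = sketch_decrement(h->peer_ni);
		break;
	case SKETCH_STATUS_OK:
	case SKETCH_STATUS_OTHER:
		break;
	}
}
```

In this model, starting from two full scores and feeding in `-ETIMEDOUT`, the current mapping lowers only `local_ni`, so the peer NI still looks as healthy as any alternative; the proposed mapping lowers both, which is what lets selection prefer other peer NIs until the affected one recovers.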



 Comments   
Comment by Gerrit Updater [ 05/Feb/24 ]

"Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53930
Subject: LU-17505 socklnd: return NETWORK_TIMEOUT to LNet on ETIMEOUT
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 13e762fb4a6e36c91bf520a8681082ded9aee627

Generated at Sat Feb 10 03:35:59 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.