[LU-12303] Use lnet_health_sensitivity for restoring health for each lnet_recovery_internal Created: 15/May/19  Updated: 16/Oct/20  Resolved: 31/Mar/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0
Fix Version/s: Lustre 2.14.0

Type: Improvement Priority: Minor
Reporter: James A Simmons Assignee: Amir Shehata (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

Any lustre 2.12 system with LNet health enabled


Issue Links:
Related
is related to LU-12292 Decrement Health Value even if recove... Resolved
Rank (Obsolete): 9223372036854775807

 Description   

Currently, for each lnet_health_interval the LNet health is incremented by 1. The maximum LNet health value is 1000, so it can take up to 1000 seconds to recover, depending on the setup. A better way to handle this is to use lnet_health_sensitivity so that the health is incremented by the same amount by which it was decremented.
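
The recovery-time arithmetic can be sketched as follows. This is an illustration only, not LNet code: the function name and parameters are hypothetical, and it assumes a maximum health of 1000 and one increment per one-second interval, as described in this ticket.

```python
import math

# Assumed maximum health value, taken from the discussion in this ticket.
MAX_HEALTH = 1000


def recovery_seconds(current_health, increment=1, interval_s=1,
                     max_health=MAX_HEALTH):
    """Seconds needed to climb from current_health back to max_health,
    assuming one increment of size `increment` per interval."""
    missing = max(0, max_health - current_health)
    return math.ceil(missing / increment) * interval_s


# Incrementing by 1 per second from a health of 0.
print(recovery_seconds(0))                 # 1000
# Incrementing by a sensitivity of 100 per second instead.
print(recovery_seconds(0, increment=100))  # 10
```

Under these assumptions, restoring health in steps of 100 shortens the worst-case recovery from 1000 seconds to 10 seconds.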



 Comments   
Comment by Andreas Dilger [ 15/May/19 ]

If the lnet_health_sensitivity was also used to increment the interface health, then there would be no point in having a variable lnet_health_sensitivity value. It would mean that there are N health decrements and the same N health increments for an interface. Also note (AFAIK, but Amir to confirm) that while the retry interval is 1s, it will increment the health for every successful RPC sent/received.

With lnet_health_sensitivity=100 for decrements but 1 for increments, the interface can lose only about 1/100 = 1% of its messages and still remain in use. If it fails RPCs more than 1% of the time, its health will decrement faster than it increments, which is good because you don't want to be using that interface. If it fails less than 1% of the time, it will generally remain in use. This "minimum acceptable failure ratio" is tunable via lnet_health_sensitivity.
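
A rough sketch of that ratio, under a simplified model that is an assumption here rather than the actual LNet logic: every successful message adds `increment` to the health and every failed message subtracts `sensitivity`. The function names are hypothetical.

```python
def expected_drift(failure_ratio, sensitivity=100, increment=1):
    """Average health change per message in the simplified model."""
    return (1 - failure_ratio) * increment - failure_ratio * sensitivity


def break_even_failure_ratio(sensitivity=100, increment=1):
    """Failure ratio at which increments and decrements cancel out."""
    return increment / (increment + sensitivity)


# Decrement by 100, increment by 1: break-even is 1/101, roughly 1%.
print(break_even_failure_ratio(100, 1))
# 2% failures: health drifts down. 0.5% failures: health drifts up.
print(expected_drift(0.02) < 0, expected_drift(0.005) > 0)
# Same step both ways: the interface tolerates 50% failures,
# regardless of the sensitivity value chosen.
print(break_even_failure_ratio(100, 100))
```

In this model, using the same step for increments and decrements fixes the break-even point at 50% for any sensitivity, which is the concern raised above about the sensitivity value losing its meaning.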

Comment by James A Simmons [ 16/May/19 ]

Your docs are wrong. It states at

https://wiki.whamcloud.com/display/LNet/LNet+Health+User+Documentation

that the health increments once per time interval, which is one second. By those docs it could be 1000 seconds before the interface is seen as healthy.

Comment by Gerrit Updater [ 04/Dec/19 ]

Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36920
Subject: LU-12303 lnet: recover health at same rate as dec
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 601be615b8409dc74f2f6e5c49fe0810bc443a73

Comment by Gerrit Updater [ 31/Mar/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36920/
Subject: LU-12303 lnet: recover health at same rate as dec
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 1d94a29dbc018fd00aa1c8a7a7ae343e0c9a4b83

Comment by Peter Jones [ 31/Mar/20 ]

Landed for 2.14
