[LU-12303] Use lnet_health_sensitivity for restoring health for each lnet_recovery_internal Created: 15/May/19 Updated: 16/Oct/20 Resolved: 31/Mar/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.13.0 |
| Fix Version/s: | Lustre 2.14.0 |
| Type: | Improvement | Priority: | Minor |
| Reporter: | James A Simmons | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Any Lustre 2.12 system with LNet health enabled |
||
| Description |
|
Currently, for each lnet_health_interval the LNet health is incremented by 1. The maximum LNet health value is 1000, so it can take up to 1000 seconds to recover, depending on the setup. A better way to handle this is to increment the health at each interval by lnet_health_sensitivity, the same amount by which it was decremented. |
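The worst case described here is simple arithmetic. A minimal sketch, assuming a maximum health value of 1000 and a 1-second interval; the function and parameter names are hypothetical and not taken from the LNet source:

```python
import math

def worst_case_recovery_seconds(max_health=1000, interval_s=1, increment=1):
    """Seconds for an interface at health 0 to climb back to max_health,
    assuming exactly one increment per interval (illustrative model only)."""
    return math.ceil(max_health / increment) * interval_s

# Behaviour described in the ticket: +1 per interval.
print(worst_case_recovery_seconds(increment=1))    # 1000
# Proposed behaviour: +lnet_health_sensitivity (100 here) per interval.
print(worst_case_recovery_seconds(increment=100))  # 10
```

Under these assumptions, restoring by the sensitivity value shortens the worst case from 1000 intervals to 10.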
| Comments |
| Comment by Andreas Dilger [ 15/May/19 ] |
|
If lnet_health_sensitivity were also used to increment the interface health, there would be no point in having a variable lnet_health_sensitivity value. It would mean that there are N health decrements and the same N health increments for an interface. Also note (AFAIK, but Amir to confirm) that while the retry interval is 1s, the health is incremented for every successful RPC sent/received. With lnet_health_sensitivity=100 for decrements but 1 for increments, the interface can lose at most 1/100 = 1% of its messages and still continue to be in use. If it fails RPCs more than 1% of the time, its health will decrement faster than it increments, which is good because you don't want to be using that interface. If it fails less than 1% of the time, it will generally remain in use. This "minimum acceptable failure ratio" is tunable by lnet_health_sensitivity. |
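The ratio argument in this comment can be checked with a short sketch. It models only what the comment describes (each failed message subtracts the sensitivity, each successful one adds 1), which is itself marked as pending confirmation; the function names are hypothetical, and clamping of the health value to its minimum and maximum is ignored:

```python
def expected_health_drift(failure_rate, sensitivity=100, increment=1):
    """Expected change in health per message under the assumed model:
    a success adds `increment`, a failure subtracts `sensitivity`."""
    return (1.0 - failure_rate) * increment - failure_rate * sensitivity

def break_even_failure_rate(sensitivity=100, increment=1):
    """Failure rate at which increments and decrements cancel on average."""
    return increment / (increment + sensitivity)

# Below the threshold the health trends upward; above it, downward.
print(expected_health_drift(0.005) > 0)  # True
print(expected_health_drift(0.02) < 0)   # True
```

In this model the exact break-even point is 1/(sensitivity + 1), which for a sensitivity of 100 is just under the 1% quoted in the comment.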
| Comment by James A Simmons [ 16/May/19 ] |
|
Then your docs are wrong: https://wiki.whamcloud.com/display/LNet/LNet+Health+User+Documentation states that the health increments once per time interval, which is one second. By those docs it could be 1000 seconds before the interface is seen as healthy. |
| Comment by Gerrit Updater [ 04/Dec/19 ] |
|
Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36920 |
| Comment by Gerrit Updater [ 31/Mar/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36920/ |
| Comment by Peter Jones [ 31/Mar/20 ] |
|
Landed for 2.14 |