[LU-11816] LNet Health: Correct timeout defaults Created: 19/Dec/18  Updated: 15/Jan/20  Resolved: 14/Aug/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.13.0, Lustre 2.12.3

Type: Bug Priority: Minor
Reporter: Amir Shehata (Inactive) Assignee: Amir Shehata (Inactive)
Resolution: Fixed Votes: 0
Labels: lnet-health

Issue Links:
Duplicate
is duplicated by LU-12290 Inconsistent Timeout value (one is 5s... Resolved
Related
is related to LU-11817 LNet: timing statistics Open
is related to LU-12817 lnet: module parameters are not set c... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

Description

Turn on LNet health by default, since the ability to retransmit is useful.
Set up proper timeout defaults for both cases: health enabled and health disabled.

Detailed problem and solution documented here:
https://wiki.whamcloud.com/display/LNet/LNet+Transaction+Timeouts



Comments
Comment by Andreas Dilger [ 19/Dec/18 ]

Mark for 2.13 so that it is enabled by default and can start being used in regular testing. It would still be useful to backport "always on" to 2.12.x once this has been reasonably tested and we have a good idea of what values should be used.

Is there some way to extract stats about the commit times currently being seen (average, maximum), so that we have some idea of what values are reasonable on a heavily loaded system?

Comment by Gerrit Updater [ 19/Dec/18 ]

Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33898
Subject: LU-11816 lnet: setup health timeout defaults
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 8ef83df2f38e6c058b257ec7bae013c641085f9b

Comment by Amir Shehata (Inactive) [ 19/Dec/18 ]

Added LU-11817 to track stats.

Comment by Gerrit Updater [ 14/Feb/19 ]

Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34252
Subject: LU-11816 lnet: setup health timeout defaults
Project: fs/lustre-release
Branch: multi-rail
Current Patch Set: 1
Commit: 8a142c6bb4f043b1bb82f4293c20dc6c6a5402cc

Comment by Gerrit Updater [ 07/Jun/19 ]

Amir Shehata (ashehata@whamcloud.com) merged in patch https://review.whamcloud.com/34252/
Subject: LU-11816 lnet: setup health timeout defaults
Project: fs/lustre-release
Branch: multi-rail
Current Patch Set:
Commit: 8632e94aeb7e62da07f342a9897d15dfd8251148

Comment by Joseph Gmitter (Inactive) [ 14/Aug/19 ]

This ticket is resolved as the patch landed to master under the MR routing merge commit: https://review.whamcloud.com/#/c/34983/

Comment by Gerrit Updater [ 03/Sep/19 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36031
Subject: LU-11816 lnet: setup health timeout defaults
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: db81f3f293dbc0c9dba90ea1153f554b33fbb80b

Comment by Gerrit Updater [ 12/Sep/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36031/
Subject: LU-11816 lnet: setup health timeout defaults
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 0a5a2be2c4af59244ce4c26d58d1c6cc47fb2c0a

Comment by Gerrit Updater [ 12/Sep/19 ]

Oleg Drokin (green@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36172
Subject: LU-11816 test revert
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 4183d28324bf3b3a4ba8c327b11ba7bcd7f76f30

Comment by Gerrit Updater [ 12/Sep/19 ]

Oleg Drokin (green@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36173
Subject: Revert "LU-11816 lnet: setup health timeout defaults"
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 3578844e9b84535d2124e8094e621d966a47fd0b

Comment by Gerrit Updater [ 12/Sep/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36173/
Subject: Revert "LU-11816 lnet: setup health timeout defaults"
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 2e0e446bcab61276f6bc3052f2f03a87a7346795

Comment by Gerrit Updater [ 04/Oct/19 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36382
Subject: LU-11816 lnet: setup health timeout defaults
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 6ab0d12526b5f9ad9ce7ddc12a8a178514092163

Comment by Gerrit Updater [ 08/Oct/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36382/
Subject: LU-11816 lnet: setup health timeout defaults
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 113f39aa9b2381be9af4b50c1aad4268b0683507
