[LU-13145] LNet Health: increase transaction timeout Created: 16/Jan/20  Updated: 19/Dec/22  Resolved: 08/Feb/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.14.0, Lustre 2.12.4

Type: Bug Priority: Minor
Reporter: Amir Shehata (Inactive) Assignee: Amir Shehata (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Blocker
is blocked by LU-15288 LNet Health: increase transaction tim... Resolved
Related
is related to LU-13020 ko2iblnd tuning Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

On larger clusters, the current default transaction timeout of 10 seconds appears to be too short and causes RDMA timeouts.

The proposal is to increase the transaction timeout to 150s. With a retry count of 3, that would bring the LND timeout to 50s (150s / 3), which was its value before the health feature was introduced.
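The arithmetic in the proposal can be sketched as a small shell helper. This is only an illustration of the division described above: the function name `lnd_timeout_estimate` is hypothetical, and the exact formula LNet applies internally may differ from this simple quotient.

```shell
# Rough estimate only: divide the transaction timeout by the retry count,
# mirroring the arithmetic in the description (150 / 3 = 50).
# Assumption: this is NOT taken from the LNet source; the real
# computation in the kernel module may differ.
lnd_timeout_estimate() {
    transaction_timeout=$1
    retry_count=$2
    if [ "$retry_count" -le 0 ]; then
        echo "retry count must be positive" >&2
        return 1
    fi
    # Shell arithmetic is integer-only, so the result is truncated.
    echo $(( transaction_timeout / retry_count ))
}

lnd_timeout_estimate 150 3
```

With the values from the proposal, the helper prints 50.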



 Comments   
Comment by Gerrit Updater [ 16/Jan/20 ]

Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37253
Subject: LU-13145 lnet: increase transaction timeout
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 07ef6f9f0c142240a4586674fe9c556482f323c4

Comment by Gerrit Updater [ 31/Jan/20 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37390
Subject: LU-13145 lnet: use conservative health timeouts
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: f308fc8fc393d939f34e4d7587c0848667084a8c

Comment by Gerrit Updater [ 04/Feb/20 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37430
Subject: LU-13145 lnet: use conservative health timeouts
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 188b4cb2f782e348cfa4225745341ac8e3e267c3

Comment by Gerrit Updater [ 08/Feb/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37430/
Subject: LU-13145 lnet: use conservative health timeouts
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 361e9eaef13c0f472ad45388d3e147dabc32b737

Comment by Peter Jones [ 08/Feb/20 ]

Landed for 2.14

Comment by Gerrit Updater [ 08/Feb/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37390/
Subject: LU-13145 lnet: use conservative health timeouts
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 3c86a9361862d34a4efde73c4f1cb7603ec6b2f9

Comment by Andreas Dilger [ 05/Jun/20 ]

From LU-13020: on systems without this patch, equivalent behavior can be obtained by running the following commands, in the order shown, on all of the 2.12.3 nodes:

echo 150 > /sys/module/lnet/parameters/lnet_transaction_timeout
echo 2 > /sys/module/lnet/parameters/lnet_retry_count

These commands change the values only temporarily. To set them permanently, add the following line to /etc/modprobe.d/lnet.conf:

options lnet lnet_retry_count=2 lnet_transaction_timeout=150
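
After applying the workaround, it may be useful to read the values back and confirm they took effect. The following is a minimal sketch, assuming the parameters are readable under /sys/module/lnet/parameters as in the commands above. The function name `show_lnet_health_params` and its optional directory argument are hypothetical and are not part of Lustre.

```shell
# Print the current value of each LNet tunable named in this ticket.
# The optional first argument overrides the parameter directory
# (assumption: added only so the function can be tried against a
# scratch directory on a node without the lnet module loaded).
show_lnet_health_params() {
    param_dir=${1:-/sys/module/lnet/parameters}
    for name in lnet_transaction_timeout lnet_retry_count; do
        if [ -r "$param_dir/$name" ]; then
            echo "$name=$(cat "$param_dir/$name")"
        else
            # Report missing parameters on stderr, e.g. when lnet is not loaded.
            echo "$name=unavailable" >&2
        fi
    done
}
```

Calling `show_lnet_health_params` with no argument on a node where the workaround has been applied should report 150 for lnet_transaction_timeout and 2 for lnet_retry_count.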