[LU-13145] LNet Health: increase transaction timeout Created: 16/Jan/20 Updated: 19/Dec/22 Resolved: 08/Feb/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.14.0, Lustre 2.12.4 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Amir Shehata (Inactive) | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Severity: | 3 |
| Description |
|
On larger clusters, the current default transaction timeout of 10 seconds appears to be too short and causes RDMA timeouts. The proposal is to increase the timeout to 150s. With a retry count of 3, that brings the LND timeout to 50s, which was its initial value before LNet Health was introduced. |
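The arithmetic in the description can be sketched as follows. This is only an illustration: it assumes, as the description implies, that the per-attempt LND timeout is the transaction timeout divided by the retry count. The helper name `lnd_timeout_for` is hypothetical, and the exact formula LNet uses may differ.

```shell
#!/bin/sh
# Hypothetical helper: per-attempt LND timeout in seconds, ASSUMING it is
# transaction timeout / retry count, as the description implies.
lnd_timeout_for() {
    transaction_timeout=$1
    retry_count=$2
    echo $(( transaction_timeout / retry_count ))
}

lnd_timeout_for 150 3   # prints 50
```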
| Comments |
| Comment by Gerrit Updater [ 16/Jan/20 ] |
|
|
| Comment by Gerrit Updater [ 31/Jan/20 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37390 |
| Comment by Gerrit Updater [ 04/Feb/20 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37430 |
| Comment by Gerrit Updater [ 08/Feb/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37430/ |
| Comment by Peter Jones [ 08/Feb/20 ] |
|
Landed for 2.14 |
| Comment by Gerrit Updater [ 08/Feb/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37390/ |
| Comment by Andreas Dilger [ 05/Jun/20 ] |
|
These values can be set at runtime with:

    echo 150 > /sys/module/lnet/parameters/lnet_transaction_timeout
    echo 2 > /sys/module/lnet/parameters/lnet_retry_count

This changes the values only temporarily. To set them permanently, add the following line to /etc/modprobe.d/lnet.conf:

    options lnet lnet_retry_count=2 lnet_transaction_timeout=150 |
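To check which values are currently in effect, the two parameter files named in the comment above can be read back. The sketch below is not part of the original comment. The `show_lnet_params` function and the `PARAM_DIR` override are hypothetical, added only so the script can be tried on a machine without the lnet module loaded.

```shell
#!/bin/sh
# Directory holding the LNet module parameters (path taken from the comment
# above). The PARAM_DIR override is an assumption made for this sketch.
PARAM_DIR=${PARAM_DIR:-/sys/module/lnet/parameters}

# Print name=value for each parameter, or "unavailable" if the file
# cannot be read (for example, when the lnet module is not loaded).
show_lnet_params() {
    for name in lnet_transaction_timeout lnet_retry_count; do
        if [ -r "$PARAM_DIR/$name" ]; then
            echo "$name=$(cat "$PARAM_DIR/$name")"
        else
            echo "$name=unavailable"
        fi
    done
}

show_lnet_params
```

Running this after the `echo` commands, or after a reboot with the modprobe.d line in place, gives a quick confirmation that the intended values were picked up.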