[LU-13574] Provide mechanism to ensure a sane timeout hierarchy Created: 15/May/20  Updated: 15/May/20

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Minor
Reporter: Chris Horn Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Epic Link: unlabelled-LU-13422

 Description   

The introduction of lnet_transaction_timeout and LNet resends has disrupted the timeout hierarchy that was in place before these features existed. lnet_transaction_timeout needs to be large enough that LNet can make lnet_retry_count resend attempts before a message is finalized to the upper layer. LND timeouts, in turn, must be small enough that those resends fit within the transaction timeout period.

The current code allows situations in which error handling in the upper layers is delayed by LNet's resends. For example, suppose that without LNet health we had ko2iblnd timeout = 10 and at_min = 15. An RPC could then time out at 15 seconds, after which ptlrpc would retry.
Now, with LNet health, an lnd_timeout of 10 gives lnet_transaction_timeout = 20 and lnet_retry_count = 2, so at_min must be > 20. The earliest RPC timeout is therefore > 20 seconds, which delays ptlrpc's error handling by > 5 seconds relative to the first case.
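The constraints in the example above can be written down as a small sanity check. This is only a sketch, not LNet code: the TimeoutConfig type and the check_hierarchy function are hypothetical names, and the rule that the transaction timeout must cover lnd_timeout * retry_count seconds is an assumption taken from the numbers in this ticket (10 * 2 = 20), not from the actual LNet implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeoutConfig:
    """Hypothetical container for the tunables named in this ticket (seconds)."""
    lnd_timeout: int          # e.g. ko2iblnd timeout
    retry_count: int          # lnet_retry_count
    transaction_timeout: int  # lnet_transaction_timeout
    at_min: int               # ptlrpc adaptive timeout floor


def check_hierarchy(cfg: TimeoutConfig) -> list[str]:
    """Return a list of violations; an empty list means the hierarchy is sane."""
    problems = []

    # Assumption: each attempt may take up to lnd_timeout, and at least one
    # attempt is always made, so the budget is lnd_timeout * attempts.
    attempts = max(cfg.retry_count, 1)
    needed = cfg.lnd_timeout * attempts
    if cfg.transaction_timeout < needed:
        problems.append(
            f"transaction_timeout {cfg.transaction_timeout} < "
            f"{needed} needed for {attempts} attempts"
        )

    # The RPC layer should not give up before LNet has finalized the message.
    if cfg.at_min <= cfg.transaction_timeout:
        problems.append(
            f"at_min {cfg.at_min} must be > "
            f"transaction_timeout {cfg.transaction_timeout}"
        )
    return problems


if __name__ == "__main__":
    # Values from the ticket: at_min = 15 was fine before LNet health,
    # but is too small once the transaction timeout is 20.
    for at_min in (15, 21):
        cfg = TimeoutConfig(lnd_timeout=10, retry_count=2,
                            transaction_timeout=20, at_min=at_min)
        print(at_min, check_hierarchy(cfg) or "ok")
```

With the ticket's values, at_min = 15 is flagged while at_min = 21 passes, which is the delay of more than 5 seconds described above.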

We need to figure out how to get the best of both worlds. One idea Amir had was to modify the LNetPut/LNetGet API to take a maximum timeout parameter. This would let callers tell LNet how long they are willing to wait. LNet could then decide, on a per-Put/Get basis, whether it has enough time left to attempt resending a message.
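The per-message decision could look roughly like the following. This is a minimal sketch of the idea only: should_resend, max_timeout, and elapsed are hypothetical names, not part of the existing LNetPut/LNetGet API, and it assumes that a resend attempt can take up to a full lnd_timeout.

```python
def should_resend(elapsed: float, max_timeout: float, lnd_timeout: float) -> bool:
    """Decide whether one more send attempt fits in the caller's budget.

    elapsed     -- seconds already spent on this message (hypothetical input)
    max_timeout -- caller-supplied limit for the whole Put/Get (hypothetical)
    lnd_timeout -- worst-case duration of a single attempt (assumption)
    """
    if lnd_timeout <= 0:
        raise ValueError("lnd_timeout must be positive")
    # Resend only if a full worst-case attempt still fits before the deadline.
    return elapsed + lnd_timeout <= max_timeout


if __name__ == "__main__":
    # A caller willing to wait 15 s: the first attempt used 10 s,
    # so a second 10 s attempt does not fit and LNet would finalize now.
    print(should_resend(elapsed=10, max_timeout=15, lnd_timeout=10))
    # A caller willing to wait 20 s leaves room for one resend.
    print(should_resend(elapsed=10, max_timeout=20, lnd_timeout=10))
```

Under these assumptions, a caller with a 15 second budget gets its error after the first failed attempt, as in the pre-health case, while a caller with a larger budget still benefits from resends.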

