Details
-
Improvement
-
Resolution: Unresolved
-
Minor
Description
The introduction of lnet_transaction_timeout and LNet resends has perturbed the timeout hierarchy that was in place prior to these features. The lnet_transaction_timeout needs to be large enough such that LNet can attempt lnet_retry_count resend attempts before a message is finalized to the upper layer. LND timeouts, then, must be small enough to accommodate those resends within the transaction timeout period.
The current code allows for some situations where error handling in upper layers is delayed by LNet's resends. For example, suppose that without LNet health we had ko2iblnd timeout = 10 and at_min = 15. An RPC could then time out at 15 seconds, and ptlrpc would retry.
Now, with LNet health, we'd have lnet_transaction_timeout = 20 and lnet_retry_count = 2 for an lnd_timeout of 10. at_min must then be > 20, so the earliest RPC timeout would be > 20 seconds, and we've delayed ptlrpc by > 5 seconds relative to the first case.
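To make the arithmetic in the two cases concrete, here is a rough Python sketch. It is only a model of the numbers above, not Lustre code: the TimeoutConfig class and hierarchy_ok() are hypothetical names, the constraint lnd_timeout * retry_count <= lnet_transaction_timeout < at_min is my reading of the description, and at_min = 21 is just the smallest whole-second value that satisfies "at_min > 20".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeoutConfig:
    """Hypothetical container for the tunables discussed in this ticket (seconds)."""
    lnd_timeout: int
    at_min: int
    lnet_transaction_timeout: int = 0  # 0 models "no LNet health/resends" (assumption)
    retry_count: int = 0


def hierarchy_ok(cfg):
    """Check the assumed ordering: LND resend budget <= transaction timeout < at_min."""
    if cfg.retry_count == 0:
        # Without resends, assume only that the LND timeout fires before at_min.
        return cfg.lnd_timeout < cfg.at_min
    resend_budget = cfg.lnd_timeout * cfg.retry_count
    return resend_budget <= cfg.lnet_transaction_timeout < cfg.at_min


# Case 1: no LNet health, ko2iblnd timeout = 10, at_min = 15.
before = TimeoutConfig(lnd_timeout=10, at_min=15)

# Case 2: LNet health, transaction timeout = 20, retry count = 2, lnd_timeout = 10.
# at_min = 21 is the smallest integer satisfying at_min > 20.
after = TimeoutConfig(lnd_timeout=10, at_min=21,
                      lnet_transaction_timeout=20, retry_count=2)

# Extra time before ptlrpc can see the earliest RPC timeout.
extra_delay = after.at_min - before.at_min
```

With whole-second values the extra delay comes out to 6 seconds, which matches the "> 5 seconds" stated above. Keeping at_min = 15 with the second set of LNet values fails the check, which is the conflict this ticket is about.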
We need to figure out how to get the best of both worlds. One idea Amir had was to modify the LNetPut/LNetGet API to take a max timeout parameter. This would let callers tell LNet how long they are willing to wait. LNet could then decide, on a per-Put/Get basis, whether it has enough time left to attempt resending a message.
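A rough sketch of what that per-message decision might look like, in Python for brevity. Nothing here is an existing LNet interface: should_resend() and its parameters are hypothetical, and treating the retry count as a cap on total attempts simply mirrors the example numbers above (2 attempts of 10 seconds inside a 20 second budget).

```python
def should_resend(elapsed, max_timeout, lnd_timeout, attempts_made, retry_count):
    """Decide whether another send attempt fits in the caller's budget.

    elapsed       -- seconds since the Put/Get was first issued
    max_timeout   -- hypothetical caller-supplied limit, in seconds
    lnd_timeout   -- worst-case duration of a single attempt, in seconds
    attempts_made -- send attempts already made for this message
    retry_count   -- assumed cap on total attempts
    """
    if attempts_made >= retry_count:
        return False
    # Another attempt may take up to lnd_timeout; only try if it can
    # complete before the caller's deadline.
    return elapsed + lnd_timeout <= max_timeout


# First attempt timed out after 10 seconds.
# A caller willing to wait 20 seconds gets one resend ...
resend_with_20s = should_resend(elapsed=10, max_timeout=20, lnd_timeout=10,
                                attempts_made=1, retry_count=2)
# ... while a caller willing to wait only 15 seconds (the at_min = 15 case)
# gets the failure right away, so ptlrpc can react without extra delay.
resend_with_15s = should_resend(elapsed=10, max_timeout=15, lnd_timeout=10,
                                attempts_made=1, retry_count=2)
```

Under these assumptions, a caller that passes a 15 second limit would see the failure after the first LND timeout, as in the pre-health behavior, while callers with more slack would still benefit from resends.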