  1. Lustre
  2. LU-13574

Provide mechanism to ensure a sane timeout hierarchy

Details

    • Improvement
    • Resolution: Unresolved
    • Minor

    Description

      The introduction of lnet_transaction_timeout and LNet resends has perturbed the timeout hierarchy that was in place prior to these features. lnet_transaction_timeout needs to be large enough that LNet can make lnet_retry_count resend attempts before a message is finalized to the upper layer. LND timeouts, in turn, must be small enough for those resends to fit within the transaction timeout period.
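
      The constraint described above can be sketched as a simple consistency check. This is an illustrative model only, not LNet code: the function name and the `attempts` parameter are hypothetical, and how `attempts` maps onto lnet_retry_count (for instance, whether the initial send is counted) is an assumption left to the caller.

```python
def check_timeout_hierarchy(lnd_timeout, attempts, transaction_timeout):
    """Return a list of problems; an empty list means the values are consistent.

    Hypothetical helper: 'attempts' is the number of sends that must fit
    inside the transaction timeout. Its relation to lnet_retry_count is an
    assumption, not taken from the LNet source.
    """
    if lnd_timeout <= 0 or attempts <= 0 or transaction_timeout <= 0:
        return ["all values must be positive"]

    problems = []
    # Worst case: every attempt runs for the full LND timeout before failing.
    needed = attempts * lnd_timeout
    if needed > transaction_timeout:
        problems.append(
            f"{attempts} attempts x {lnd_timeout}s = {needed}s "
            f"exceeds transaction timeout of {transaction_timeout}s"
        )
    return problems


# Two attempts of 10s each fit exactly into a 20s transaction timeout.
print(check_timeout_hierarchy(lnd_timeout=10, attempts=2, transaction_timeout=20))
# A third attempt would not fit, so one problem is reported.
print(check_timeout_hierarchy(lnd_timeout=10, attempts=3, transaction_timeout=20))
```

      A check of this kind could run whenever one of the tunables changes, so that an inconsistent combination is reported instead of silently accepted.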

      The current code allows situations where error handling in upper layers is delayed by LNet's resends. For example, suppose that without LNet health we had ko2iblnd timeout = 10 and at_min = 15. An RPC could time out at 15 seconds, and ptlrpc would retry.
      Now, with LNet health, we'd have lnet_transaction_timeout = 20 and retry_count = 2 for an lnd_timeout of 10, so at_min must be > 20. The earliest RPC timeout would then be > 20 seconds, which delays ptlrpc by more than 5 seconds relative to the first case.
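
      The arithmetic of the two cases can be written out as follows. This is a minimal sketch: the function name is hypothetical, and it only models the lower bound on the earliest RPC timeout as described in the example.

```python
def rpc_timeout_floor(at_min, lnet_transaction_timeout=None):
    """Lower bound (seconds) on the earliest RPC timeout.

    Hypothetical model: without LNet health only at_min applies. With LNet
    health the RPC timeout must also exceed lnet_transaction_timeout, so the
    value returned in that case is a strict lower bound.
    """
    if lnet_transaction_timeout is None:
        return at_min
    return max(at_min, lnet_transaction_timeout)


before = rpc_timeout_floor(at_min=15)
after = rpc_timeout_floor(at_min=15, lnet_transaction_timeout=20)
# ptlrpc learns of the failure more than (after - before) seconds later.
print(before, after, after - before)  # 15 20 5
```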

      We need to figure out how to get the best of both worlds. One idea Amir had was to modify the LNetPut/LNetGet API to take a max timeout parameter. This would let callers tell LNet how long they are willing to wait. LNet could then decide, on a per-Put/Get basis, whether it has enough time to attempt resending a message.
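
      One way the per-call decision might look is sketched below. Everything here is hypothetical: LNetPut/LNetGet take no such parameter today, and the names `max_timeout`, `max_attempts`, and `plan_attempts` are invented for illustration. The sketch assumes each attempt can take up to the full LND timeout and that the initial send always happens.

```python
def plan_attempts(max_timeout, lnd_timeout, max_attempts):
    """Return how many sends (initial send plus resends) fit in the caller's budget.

    Hypothetical sketch of the proposed per-Put/Get decision:
      max_timeout  -- how long the caller is willing to wait, in seconds
      lnd_timeout  -- worst-case duration of one attempt, in seconds
      max_attempts -- upper limit on sends allowed by configuration
    """
    if max_timeout <= 0 or lnd_timeout <= 0 or max_attempts <= 0:
        raise ValueError("all arguments must be positive")

    # Number of full-length attempts that fit inside the caller's budget.
    fits = max_timeout // lnd_timeout
    # The initial send is always made, even if the budget is below lnd_timeout.
    return max(1, min(fits, max_attempts))


# A caller willing to wait 15s gets a single 10s attempt and no resend,
# so the failure surfaces before the 15s mark.
print(plan_attempts(max_timeout=15, lnd_timeout=10, max_attempts=2))  # 1
# A caller willing to wait 20s leaves room for one resend.
print(plan_attempts(max_timeout=20, lnd_timeout=10, max_attempts=2))  # 2
```

      Under these assumptions, a caller with a short deadline skips the resends and sees the failure early, while a caller with a longer deadline still gets them.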

          People

            wc-triage WC Triage
            hornc Chris Horn