Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-10275

ptlrpc reply acknowledgement

    XMLWordPrintable

Details

    • New Feature
    • Resolution: Unresolved
    • Minor
    • None
    • None
    • 16881

    Description

      Because most ptlrpc messages do not have ACK , RPC client cannot distinguish message loss from long service time. Also, in current implementation, message-resend can only be triggered by RPC client after service timeout, no matter which message is lost in lifecycle of RPC.

      To improve Lustre RAS against message loss, we should allow message resend for any step of RPC lifecycle. However, current RPC client already has request message timeout/resend protocol and adaptive timeout, it may need fundamental changes if we want to have ACK for request message and use network timeout instead of service time to trigger request message resend. This may require a lot more efforts and resources, so it is not covered by this document.

      Reply-resend is relatively simple and more practicable, RPC server can repeatedly resend reply at fix time interval (e.g. 20 seconds), which should be sufficient even for latency in environment with router. Reply-resend can be stopped when there is an ACK for reply message, or client is evicted/disconnected.

      Attachments

        Issue Links

          Activity

            People

              ashehata Amir Shehata (Inactive)
              liang Liang Zhen (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated: