Details

    • New Feature
    • Resolution: Unresolved
    • Minor
    • None
    • None
    • 16881

    Description

      Because most ptlrpc messages are not acknowledged (no ACK), an RPC client cannot distinguish message loss from a long service time. Also, in the current implementation, a message resend can only be triggered by the RPC client after the service timeout expires, no matter which message was lost during the RPC lifecycle.

      To improve Lustre RAS against message loss, we should allow message resend at any step of the RPC lifecycle. However, the RPC client already has a request timeout/resend protocol and adaptive timeouts, so adding an ACK for request messages and using a network timeout instead of the service time to trigger request resend may need fundamental changes. That would require considerably more effort and resources, so it is not covered by this document.

      Reply resend is relatively simple and more practical: the RPC server can resend the reply repeatedly at a fixed time interval (e.g. 20 seconds), which should be sufficient even for the latency of environments with routers. Reply resend can stop when the reply message is ACKed, or when the client is evicted or disconnected.
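
      The reply-resend rule described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the ptlrpc implementation: `PendingReply`, `should_resend`, `run_reply_resend`, and the event names "ack" and "evict" are all hypothetical, and time is modelled as integer seconds instead of real timers.

```python
from dataclasses import dataclass

RESEND_INTERVAL = 20  # seconds; the example interval from the description


@dataclass
class PendingReply:
    """Hypothetical server-side record of a reply that awaits an ACK."""
    xid: int
    acked: bool = False
    client_connected: bool = True


def should_resend(reply):
    # Resending stops once the reply is ACKed or the client is
    # evicted/disconnected.
    return not reply.acked and reply.client_connected


def run_reply_resend(reply, events, max_time):
    """Simulate resending `reply` every RESEND_INTERVAL seconds.

    `events` maps a simulated time in seconds to "ack" or "evict"
    (hypothetical event names). Returns the times at which the reply
    was sent, including the initial send at time 0.
    """
    sent_at = []
    for now in range(max_time + 1):
        event = events.get(now)
        if event == "ack":
            reply.acked = True
        elif event == "evict":
            reply.client_connected = False
        if not should_resend(reply):
            break
        if now % RESEND_INTERVAL == 0:
            sent_at.append(now)
    return sent_at
```

      For example, if the ACK arrives 45 seconds after the first send, the simulated server sends the reply at 0, 20, and 40 seconds and then stops. An eviction at 10 seconds stops it after the initial send.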

          Activity

            [LU-10275] ptlrpc reply acknowledgement

            ashehata Amir Shehata (Inactive) added a comment - This looks like it's covered with the LNet Health work. I'll take a look at the docs in more detail to see what he had intended.

            adilger Andreas Dilger added a comment - Amir, how does this relate to our recent discussions about LNet Health and reply timeouts? Are these patches still useful?
            adilger Andreas Dilger made changes -
            Labels New: lnet performance
            adilger Andreas Dilger made changes -
            Key Original: INTL-173 New: LU-10275
            Workflow Original: classic default workflow [ 34287 ] New: Sub-task Blocking [ 57098 ]
            Project Original: Intel Internal [ 10117 ] New: Lustre [ 10000 ]
            adilger Andreas Dilger made changes -
            Assignee Original: Zhenyu Xu [ bobijam ] New: Amir Shehata [ ashehata ]
            adilger Andreas Dilger made changes -
            Link New: This issue is related to INTL-166 [ INTL-166 ]
            pjones Peter Jones made changes -
            Link New: This issue is related to JFC-11 [ JFC-11 ]
            liang Liang Zhen (Inactive) made changes -
            Assignee Original: Liang Zhen [ liang ] New: Zhenyu Xu [ bobijam ]

            liang Liang Zhen (Inactive) added a comment - I will work on CORAL soon, so have to reassign this ticket to bobijam. I will maintain patches for a while, but can't finish the landing process.

            liang Liang Zhen (Inactive) added a comment - Andreas, yes I think I can do this, I will update the patch to make it optional.

            People

              ashehata Amir Shehata (Inactive)
              liang Liang Zhen (Inactive)
              Votes: 0
              Watchers: 8
