Details

    • Type: New Feature
    • Resolution: Unresolved
    • Priority: Minor
    • 16881

    Description

      Because most ptlrpc messages do not have an ACK, an RPC client cannot distinguish message loss from long service time. Also, in the current implementation, message resend can only be triggered by the RPC client after a service timeout, regardless of which message was lost in the RPC lifecycle.

      To improve Lustre RAS against message loss, we should allow message resend at any step of the RPC lifecycle. However, the current RPC client already has a request-message timeout/resend protocol and adaptive timeouts; it would need fundamental changes if we want to ACK request messages and use a network timeout instead of service time to trigger request-message resend. That would require significantly more effort and resources, so it is not covered by this document.

      Reply-resend is relatively simple and more practical: the RPC server can repeatedly resend the reply at a fixed time interval (e.g. 20 seconds), which should be sufficient even for the latency of routed environments. Reply-resend stops when there is an ACK for the reply message, or when the client is evicted or disconnected.
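      The reply-resend loop described above can be sketched as a small simulation (illustrative only, not Lustre code; the function name, the per-attempt delivery list, and the lifetime cap standing in for eviction/disconnect are all assumptions made here):

```python
def reply_resend(deliveries, interval=20, lifetime=200):
    """Resend a reply every `interval` seconds until one attempt is
    delivered (i.e. the client ACKs it) or `lifetime` elapses
    (standing in for client eviction/disconnect).

    `deliveries[i]` says whether the i-th (re)send reaches the client.
    Returns elapsed seconds until the ACK, or None if never ACKed.
    """
    elapsed = 0
    for delivered in deliveries:
        if delivered:
            return elapsed  # ACK received; stop resending
        elapsed += interval
        if elapsed >= lifetime:
            return None     # client evicted/disconnected
    return None             # ran out of attempts without an ACK
```

      For example, with the default 20-second interval, if the first two sends are lost and the third gets through, the reply is ACKed 40 seconds in; if every send is lost until the lifetime cap, resending stops.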

      Attachments

      Issue Links

      Activity

            [LU-10275] ptlrpc reply acknowledgement

            ashehata Amir Shehata (Inactive) added a comment - This looks like it's covered with the LNet Health work. I'll take a look at the docs in more detail to see what he had intended.

            adilger Andreas Dilger added a comment - Amir, how does this relate to our recent discussions about LNet Health and reply timeouts? Are these patches still useful?

            liang Liang Zhen (Inactive) added a comment - I will work on CORAL soon, so have to reassign this ticket to bobijam. I will maintain patches for a while, but can't finish the landing process.

            liang Liang Zhen (Inactive) added a comment - Andreas, yes I think I can do this, I will update the patch to make it optional.

            adilger Andreas Dilger added a comment - Liang, thanks for the data. It looks like the overhead is noticeable, but not so bad that the change is unusable.

            Is it possible to make this feature optional, so that we can turn it on or off to debug?

            liang Liang Zhen (Inactive) added a comment - Andreas, do you have any concern/comment on these data?

            liang Liang Zhen (Inactive) added a comment - performance data for ACKed lnet messages.
            liang Liang Zhen (Inactive) added a comment - - edited

            Andreas, because this patch can only improve reliability in one direction so far, I did some tests with message drop enabled for reply portals only. Without this patch, I got client evictions from time to time while running a random workload. After I applied this patch (because it is a small cluster, I set the reply-resend interval to a small value like 2-4 seconds instead of the default, so it can resend within AT), there were almost no client evictions. Of course we can't expect improvement like this in the real world because message drop can happen in both directions, but I think it is a good step to start.

            This feature can be applied on a per-client basis, or, put another way, we can upgrade any node without breaking interoperability: the feature is enabled only when both ends of the ptlrpc connection have this patch. It can also be enabled/disabled at runtime.

            I will attach some data. These data are not from this patch, because I collected them before I worked on this patch, but from a simple patch that enables ACK for all ptlrpc messages, so I think they are essentially the same. From these data, we lost about 10% performance for lightweight metadata operations (0-stripe) when we have ACK for all messages (both request & reply), so I assume we may lose 5% performance with reply-ACK only.
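            Why a 2-4 second resend interval keeps clients alive in that test can be made concrete with a toy model: if reply drops are independent with probability p, the chance that the initial send and every resend inside the adaptive-timeout window are all lost shrinks geometrically with the number of attempts. The drop rate and window size below are made-up illustration values, not numbers from the test cluster:

```python
def p_all_lost(drop_rate, at_window, interval):
    """Probability that the initial send and every resend within the
    AT window are all lost, assuming independent drops (toy model)."""
    attempts = at_window // interval + 1  # initial send plus resends
    return drop_rate ** attempts
```

            With a 50% drop rate and a 30-second AT window, a 3-second interval gives 11 attempts, so the client would be evicted only if all 11 sends are lost (about 0.05% of the time), versus a 50% chance with a single send and no resend.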

            adilger Andreas Dilger added a comment - Liang, have you done any testing on this to determine how much it improves reliability, recoverability, etc? Any idea on what kind of performance impact it has on normal operation? What kind of interoperability is needed for this (i.e. can it be done on a per-client basis, or do all clients need it, or what)?
            liang Liang Zhen (Inactive) added a comment - - edited patch list:
            http://review.whamcloud.com/#/c/13203
            http://review.whamcloud.com/#/c/13204
            http://review.whamcloud.com/#/c/13219
            http://review.whamcloud.com/#/c/13220
            http://review.whamcloud.com/#/c/13227
            http://review.whamcloud.com/#/c/13228
            http://review.whamcloud.com/#/c/13489

            People

              ashehata Amir Shehata (Inactive)
              liang Liang Zhen (Inactive)
              Votes: 0
              Watchers: 8
