<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:33:36 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-10275] ptlrpc reply acknowledgement</title>
                <link>https://jira.whamcloud.com/browse/LU-10275</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Because most ptlrpc messages are not acknowledged (ACKed), an RPC client cannot distinguish message loss from a long service time. Also, in the current implementation, a message resend can only be triggered by the RPC client after the service timeout, no matter which message in the RPC lifecycle was lost.&lt;/p&gt;

&lt;p&gt;To improve Lustre RAS (reliability, availability, serviceability) against message loss, we should allow a message to be resent at any step of the RPC lifecycle. However, the RPC client already has a request timeout/resend protocol and adaptive timeouts (AT). Adding an ACK for request messages, and using a network timeout instead of the service time to trigger a request resend, would require fundamental changes. Because that would take considerably more effort and resources, it is not covered by this document.&lt;/p&gt;

&lt;p&gt;Reply-resend is relatively simple and more practical: the RPC server can repeatedly resend the reply at a fixed interval (e.g. 20 seconds), which should be long enough even for the extra latency of a routed environment. Reply-resend stops when an ACK for the reply message arrives, or when the client is evicted or disconnected.&lt;/p&gt;</description>
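<!--
The reply-resend rule described above can be sketched as a small per-reply state machine. This is a minimal, hypothetical C sketch, not code from the patches listed in this ticket: struct reply_state, reply_send_first(), reply_resend_tick() and the status names are illustrative, time is modelled as whole seconds supplied by the caller, and the actual LNet retransmission is reduced to a counter. The server resends the same reply each time the fixed interval elapses, and stops as soon as an ACK arrives or the client is evicted or disconnected.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-reply status kept by the server (illustrative names). */
enum reply_status {
    REPLY_PENDING,   /* reply sent, no ACK yet: keep resending */
    REPLY_ACKED,     /* client ACKed the reply: stop resending */
    REPLY_ABANDONED  /* client evicted or disconnected: stop resending */
};

struct reply_state {
    enum reply_status status;
    unsigned int interval;   /* fixed resend interval in seconds, e.g. 20 */
    unsigned int next_send;  /* time (s) at which the next resend is due */
    unsigned int sends;      /* transmissions so far, including the first */
};

/* First transmission of the reply at time 'now'. */
static void reply_send_first(struct reply_state *rs, unsigned int now,
                             unsigned int interval)
{
    assert(interval > 0);
    rs->status = REPLY_PENDING;
    rs->interval = interval;
    rs->next_send = now + interval;
    rs->sends = 1;
}

/*
 * Called periodically with the current time. 'acked' and 'client_gone'
 * report events seen since the previous call. Returns true if the reply
 * was resent on this call.
 */
static bool reply_resend_tick(struct reply_state *rs, unsigned int now,
                              bool acked, bool client_gone)
{
    if (rs->status != REPLY_PENDING)
        return false;            /* already finished: nothing to do */
    if (acked) {
        rs->status = REPLY_ACKED;
        return false;
    }
    if (client_gone) {
        rs->status = REPLY_ABANDONED;
        return false;
    }
    if (now < rs->next_send)
        return false;            /* interval has not elapsed yet */
    rs->sends += 1;              /* resend the same reply */
    rs->next_send = now + rs->interval;
    return true;
}
```

A fixed interval (rather than, say, exponential backoff) follows the description above. In a real server each resend would be an LNet retransmission of the reply, which this sketch only counts.
-->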
                <environment></environment>
        <key id="28000">LU-10275</key>
            <summary>ptlrpc reply acknowledgement</summary>
                <type id="2" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11311&amp;avatarType=issuetype">New Feature</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="ashehata">Amir Shehata</assignee>
                                    <reporter username="liang">Liang Zhen</reporter>
                        <labels>
                            <label>lnet</label>
                            <label>performance</label>
                    </labels>
                <created>Sun, 21 Dec 2014 07:00:49 +0000</created>
                <updated>Tue, 18 Dec 2018 02:58:06 +0000</updated>
                                                                                <due></due>
                            <votes>0</votes>
                                    <watches>8</watches>
                                                                            <comments>
                            <comment id="102288" author="liang" created="Wed, 24 Dec 2014 09:14:15 +0000"  >&lt;p&gt;I created a dedicated patch for the LNet event add-on: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-10274&quot; title=&quot;LNet event add-on function&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-10274&quot;&gt;&lt;del&gt;INTL-166&lt;/del&gt;&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="102495" author="liang" created="Sun, 4 Jan 2015 13:57:07 +0000"  >&lt;p&gt;patch list:&lt;br/&gt;
&lt;a href=&quot;http://review.whamcloud.com/#/c/13203&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/13203&lt;/a&gt;&lt;br/&gt;
&lt;a href=&quot;http://review.whamcloud.com/#/c/13204&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/13204&lt;/a&gt;&lt;br/&gt;
&lt;a href=&quot;http://review.whamcloud.com/#/c/13219&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/13219&lt;/a&gt;&lt;br/&gt;
&lt;a href=&quot;http://review.whamcloud.com/#/c/13220&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/13220&lt;/a&gt;&lt;br/&gt;
&lt;a href=&quot;http://review.whamcloud.com/#/c/13227&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/13227&lt;/a&gt;&lt;br/&gt;
&lt;a href=&quot;http://review.whamcloud.com/#/c/13228&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/13228&lt;/a&gt;&lt;br/&gt;
&lt;a href=&quot;http://review.whamcloud.com/#/c/13489&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/13489&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="107994" author="adilger" created="Wed, 25 Feb 2015 20:21:26 +0000"  >&lt;p&gt;Liang, have you done any testing on this to determine how much it improves reliability, recoverability, etc?  Any idea on what kind of performance impact it has on normal operation?  What kind of interoperability is needed for this (i.e. can it be done on a per-client basis, or do all clients need it, or what)?&lt;/p&gt;</comment>
                            <comment id="108052" author="liang" created="Thu, 26 Feb 2015 03:45:44 +0000"  >&lt;p&gt;Andreas, because this patch only improves reliability in one direction so far, I tested with message drop enabled for the reply portals only. Without this patch, I saw occasional client evictions while running a random workload. After applying this patch (on this small cluster I set the reply-resend interval to a small value, 2-4 seconds, instead of the default, so that the reply could be resent within the AT timeout), there were almost no client evictions. Of course, we can&apos;t expect this much improvement in the real world, because messages can be dropped in both directions, but I think it is a good first step.&lt;br/&gt;
This feature can be applied on a per-client basis; in other words, any node can be upgraded without breaking interoperability, because the feature is enabled only when both ends of a ptlrpc connection have this patch. It can also be enabled or disabled at runtime.&lt;/p&gt;

&lt;p&gt;I will attach some data. These data were not collected with this patch, because I gathered them before working on it; they come from a simple patch that enables ACK for all ptlrpc messages, so I think the results should be essentially the same. From these data, we lost about 10% performance for lightweight metadata operations (0-stripe) when all messages (both request and reply) are ACKed, so I assume we may lose about 5% with reply ACK only.&lt;/p&gt;
</comment>
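<!--
The interoperability rule in the comment above (reply ACK is active only when both ends of a ptlrpc connection support it, and it can be switched on or off at runtime) can be sketched as a capability check. This is a hypothetical C illustration: the flag CONNECT_REPLY_ACK, its bit position, and the switch reply_ack_enabled are invented for this sketch and are not taken from the patches or from Lustre's connect-flag definitions. In this sketch, turning the switch off takes effect immediately for new replies, while turning it on only helps connections that negotiated the feature; the ticket does not say how the actual patch handles this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical capability bit exchanged at connect time; the name and bit
 * position are invented and are not one of Lustre's real connect flags. */
#define CONNECT_REPLY_ACK ((uint64_t)1 << 40)

/* Hypothetical node-local runtime switch (e.g. exposed as a tunable). */
static bool reply_ack_enabled = true;

/* Flags this node advertises when connecting: the reply-ACK bit is offered
 * only while the local switch is on. */
static uint64_t local_connect_flags(uint64_t base_flags)
{
    if (reply_ack_enabled)
        return base_flags | CONNECT_REPLY_ACK;
    return base_flags & ~CONNECT_REPLY_ACK;
}

/* Reply ACK is negotiated only if both peers advertised the bit, so a peer
 * without the patch (which never sets it) keeps the old behaviour. */
static bool reply_ack_negotiated(uint64_t client_flags, uint64_t server_flags)
{
    return ((client_flags & server_flags) & CONNECT_REPLY_ACK) != 0;
}

/* Checked per reply: turning the local switch off stops new replies from
 * using ACK/resend even on a connection that negotiated it. */
static bool reply_ack_active(bool negotiated)
{
    return negotiated && reply_ack_enabled;
}
```

With this scheme, upgrading a single client or server is safe: until both sides carry the patch, reply_ack_negotiated() returns false and replies are handled as before.
-->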
                            <comment id="108053" author="liang" created="Thu, 26 Feb 2015 03:47:22 +0000"  >&lt;p&gt;Performance data for ACKed LNet messages.&lt;/p&gt;</comment>
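<!--
The 5% figure in comment 108052 is an estimate rather than a measurement. It follows under one assumption, stated here and not demonstrated by the attached data: the ACK overhead grows in proportion to the number of ACKed messages per RPC. ACKing both the request and the reply (two messages) cost about 10% for lightweight metadata operations, and reply-only ACK covers one of those two messages:

```latex
\text{overhead}_{\text{reply only}} \approx \text{overhead}_{\text{request and reply}} \times \frac{1}{2} = 10\% \times \frac{1}{2} = 5\%
```

If request ACKs and reply ACKs do not cost the same, the reply-only overhead would differ from this figure; only a measurement with the reply-ACK patch itself would settle it.
-->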
                            <comment id="109455" author="liang" created="Wed, 11 Mar 2015 14:53:26 +0000"  >&lt;p&gt;Andreas, do you have any concerns or comments on these data?&lt;/p&gt;</comment>
                            <comment id="109528" author="adilger" created="Thu, 12 Mar 2015 06:07:35 +0000"  >&lt;p&gt;Liang, thanks for the data.  It looks like the overhead is noticeable, but not so bad that the change is unusable.&lt;/p&gt;

&lt;p&gt;Is it possible to make this feature optional, so that we can turn it on or off to debug?&lt;/p&gt;</comment>
                            <comment id="109709" author="liang" created="Sat, 14 Mar 2015 03:31:08 +0000"  >&lt;p&gt;Andreas, yes, I think I can do this. I will update the patch to make it optional.&lt;/p&gt;</comment>
                            <comment id="110070" author="liang" created="Thu, 19 Mar 2015 05:57:27 +0000"  >&lt;p&gt;I will be working on CORAL soon, so I have to reassign this ticket to bobijam. I will maintain the patches for a while, but I can&apos;t see them through the landing process.&lt;/p&gt;</comment>
                            <comment id="238714" author="adilger" created="Mon, 17 Dec 2018 22:49:00 +0000"  >&lt;p&gt;Amir, how does this relate to our recent discussions about LNet Health and reply timeouts?  Are these patches still useful?&lt;/p&gt;</comment>
                            <comment id="238728" author="ashehata" created="Tue, 18 Dec 2018 02:58:06 +0000"  >&lt;p&gt;This looks like it&apos;s covered by the LNet Health work. I&apos;ll take a closer look at the attached docs to see what Liang had intended.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                                        </outwardlinks>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="27892">LU-10274</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="17117" name="ack_perf_5.xlsx" size="70280" author="liang" created="Thu, 26 Feb 2015 03:47:22 +0000"/>
                            <attachment id="16640" name="pltrpc_reply_resend_2.docx" size="132132" author="liang" created="Sun, 21 Dec 2014 07:00:49 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                    <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzx2yv:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>16881</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>