<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:15:10 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-15068] Race between commit callback and reply_out_callback::LNET_EVENT_SEND</title>
                <link>https://jira.whamcloud.com/browse/LU-15068</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;When LNet is under load, messages can be queued while waiting for a peer TX credit or a network TX credit. While running benchmarks on a large-scale system we observed clients hitting &quot;slow reply&quot; timeouts for MDS_REINT RPCs. Tracing revealed that the server received the MDS_REINT RPC and sent a reply to the client, but the reply was queued in LNet because no peer credits were available.&lt;/p&gt;

&lt;p&gt;Shortly afterwards the commit callback fired and queued the reply state for handling via &lt;tt&gt;ptlrpc_commit_replies() -&amp;gt; rs_batch_add()&lt;/tt&gt;:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;void ptlrpc_commit_replies(struct obd_export *exp)
{
...
                if (rs-&amp;gt;rs_transno &amp;lt;= exp-&amp;gt;exp_last_committed) {
                        list_del_init(&amp;amp;rs-&amp;gt;rs_obd_list);
                        rs_batch_add(&amp;amp;batch, rs);
                } 
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The reply state&apos;s MD handle was then unlinked by &lt;tt&gt;ptlrpc_handle_rs()&lt;/tt&gt;:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;static int
ptlrpc_handle_rs(struct ptlrpc_reply_state *rs)
{
...
        if ((!been_handled &amp;amp;&amp;amp; rs-&amp;gt;rs_on_net) || nlocks &amp;gt; 0) {
                spin_unlock(&amp;amp;rs-&amp;gt;rs_lock);

                if (!been_handled &amp;amp;&amp;amp; rs-&amp;gt;rs_on_net) {
                        LNetMDUnlink(rs-&amp;gt;rs_md_h);
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;But the reply never left the server; it was still queued in LNet. Since the MD was unlinked, LNet aborted the send once a credit became available. The client eventually hit &quot;timeout for slow reply&quot;, which caused it to reconnect.&lt;/p&gt;
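
<![CDATA[<p>The sequence above can be sketched as a small state model. This is only an illustration under stated assumptions: every name in it (toy_reply, toy_commit, toy_credit_available, toy_run, the defer_until_sent flag) is hypothetical, and none of it is Lustre or LNet code. The deferred-unlink branch is an assumed remedy shown for contrast, not necessarily how any actual fix is structured.</p>

```c
#include <assert.h>

/* Toy model of the race described above. All names are hypothetical. */
enum send_outcome { SEND_PENDING, SEND_DELIVERED, SEND_ABORTED };

struct toy_reply {
	int md_linked;      /* 1 while the reply buffer (MD) is still attached */
	int queued;         /* 1 while the reply waits for a TX credit */
	int unlink_wanted;  /* 1 if the commit path asked for a deferred unlink */
	enum send_outcome outcome;
};

/* The reply is handed to the network layer, but no credit is available. */
static void toy_queue_reply(struct toy_reply *rs)
{
	rs->md_linked = 1;
	rs->queued = 1;
	rs->unlink_wanted = 0;
	rs->outcome = SEND_PENDING;
}

/*
 * Commit callback path. With defer_until_sent == 0 the MD is unlinked
 * immediately, as in the reported behaviour. Otherwise the unlink is
 * postponed while the reply is still queued (assumed remedy).
 */
static void toy_commit(struct toy_reply *rs, int defer_until_sent)
{
	assert(rs->md_linked); /* the model only commits a linked reply */

	if (defer_until_sent && rs->queued)
		rs->unlink_wanted = 1;
	else
		rs->md_linked = 0;
}

/*
 * A TX credit becomes available: the send goes out only if the MD is
 * still linked, otherwise it is aborted.
 */
static void toy_credit_available(struct toy_reply *rs)
{
	if (!rs->queued)
		return;

	rs->queued = 0;
	rs->outcome = rs->md_linked ? SEND_DELIVERED : SEND_ABORTED;

	if (rs->unlink_wanted) {
		/* Perform the unlink that the commit path deferred. */
		rs->md_linked = 0;
		rs->unlink_wanted = 0;
	}
}

/* Run one queue, commit, credit sequence and report the outcome. */
static enum send_outcome toy_run(int defer_until_sent)
{
	struct toy_reply rs;

	toy_queue_reply(&rs);
	toy_commit(&rs, defer_until_sent);
	toy_credit_available(&rs);
	return rs.outcome;
}
```

<p>In this model toy_run(0) ends in SEND_ABORTED, matching the behaviour reported here, while toy_run(1) ends in SEND_DELIVERED because the unlink waits until the queued send has gone out.</p>
]]>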

&lt;p&gt;I can readily reproduce the issue on a four-node cluster with 1 MDS, 1 OSS and 2 clients:&lt;br/&gt;
1. Run mdtest create&lt;br/&gt;
2. Start LST in the background: a simultaneous read and write session with the MDS in the &quot;to&quot; group and the OSS and 2 clients in the &quot;from&quot; group, at concurrency 64&lt;br/&gt;
3. Run mdtest delete&lt;/p&gt;

&lt;p&gt;LST causes credit starvation during the mdtest delete phase, so replies are more readily queued in LNet, as described above.&lt;/p&gt;</description>
                <environment></environment>
        <key id="66521">LU-15068</key>
            <summary>Race between commit callback and reply_out_callback::LNET_EVENT_SEND</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="hornc">Chris Horn</assignee>
                                    <reporter username="hornc">Chris Horn</reporter>
                        <labels>
                    </labels>
                <created>Wed, 6 Oct 2021 16:02:30 +0000</created>
                <updated>Sat, 12 Mar 2022 00:20:56 +0000</updated>
                            <resolved>Tue, 30 Nov 2021 13:44:36 +0000</resolved>
                                                    <fixVersion>Lustre 2.15.0</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>9</watches>
                                                                            <comments>
                            <comment id="314828" author="gerrit" created="Wed, 6 Oct 2021 16:22:57 +0000"  >&lt;p&gt;&quot;Chris Horn &amp;lt;chris.horn@hpe.com&amp;gt;&quot; uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/45138&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/45138&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-15068&quot; title=&quot;Race between commit callback and reply_out_callback::LNET_EVENT_SEND&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-15068&quot;&gt;&lt;del&gt;LU-15068&lt;/del&gt;&lt;/a&gt; ptlrpc: Do not unlink difficult reply until sent&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 7060fe3cbbc35c44cf071da993cf1fdea60e7f1f&lt;/p&gt;</comment>
                            <comment id="319459" author="gerrit" created="Tue, 30 Nov 2021 03:45:40 +0000"  >&lt;p&gt;&quot;Oleg Drokin &amp;lt;green@whamcloud.com&amp;gt;&quot; merged in patch &lt;a href=&quot;https://review.whamcloud.com/45138/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/45138/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-15068&quot; title=&quot;Race between commit callback and reply_out_callback::LNET_EVENT_SEND&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-15068&quot;&gt;&lt;del&gt;LU-15068&lt;/del&gt;&lt;/a&gt; ptlrpc: Do not unlink difficult reply until sent&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 5c156b48425aae245537aaf10229734166463347&lt;/p&gt;</comment>
                            <comment id="319534" author="pjones" created="Tue, 30 Nov 2021 13:44:36 +0000"  >&lt;p&gt;Landed for 2.15&lt;/p&gt;</comment>
                            <comment id="320842" author="gerrit" created="Tue, 14 Dec 2021 13:35:31 +0000"  >&lt;p&gt;&quot;Etienne AUJAMES &amp;lt;eaujames@ddn.com&amp;gt;&quot; uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/45849&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/45849&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-15068&quot; title=&quot;Race between commit callback and reply_out_callback::LNET_EVENT_SEND&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-15068&quot;&gt;&lt;del&gt;LU-15068&lt;/del&gt;&lt;/a&gt; ptlrpc: Do not unlink difficult reply until sent&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_12&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 6234ec04be8e5523ffda71ce4a25dbbb63002a57&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                                                <inwardlinks description="is related to">
                                                        </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i026hz:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>