<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:50:39 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-5341] Intermittent hangs waiting for RPCs in unregistering phase to timeout</title>
                <link>https://jira.whamcloud.com/browse/LU-5341</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Regression introduced by the patch for &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5073&quot; title=&quot;lustre don&amp;#39;t able to unload modules in conf-sanity 31.&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5073&quot;&gt;&lt;del&gt;LU-5073&lt;/del&gt;&lt;/a&gt;, &lt;a href=&quot;http://review.whamcloud.com/10353&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/10353&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;See also &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5259&quot; title=&quot;request gets stuck in UNREGISTERING phase&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5259&quot;&gt;&lt;del&gt;LU-5259&lt;/del&gt;&lt;/a&gt;,  &lt;a href=&quot;http://review.whamcloud.com/10846&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/10846&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When reply_in_callback executes before request_out_callback, the task sending the RPC can hang in ptlrpc_set_wait until the request times out. An example is a call to mdc_close: here mdc_close takes 55 seconds to complete even though the server handles the request promptly.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Client:
&amp;gt; 00000002:00000001:18.0:1403799444.471726:0:14636:0:(mdc_request.c:829:mdc_close()) Process entered
...
&amp;gt; 00000100:00000001:18.0:1403799444.848775:0:14636:0:(client.c:1382:ptlrpc_send_new_req()) Process entered
&amp;gt; 00000100:00000040:18.0:1403799444.848776:0:14636:0:(lustre_net.h:3250:ptlrpc_rqphase_move()) @@@ move req &quot;New&quot; -&amp;gt; &quot;Rpc&quot;  req@ffff880fffb10c00 x1471331744844100/t0(0) o35-&amp;gt;snx11063-MDT0000-mdc-ffff880835a8a400@10.149.150.4@o2ib4000:23/10 lens 392/15080 e 0 to 0 dl 0 ref 2 fl New:/0/ffffffff rc 0/-1
...
&amp;gt; 00000100:00000200:18.0:1403799444.852847:0:14636:0:(events.c:100:reply_in_callback()) @@@ type 6, status 0  req@ffff880fffb10c00 x1471331744844100/t0(0) o35-&amp;gt;snx11063-MDT0000-mdc-ffff880835a8a400@10.149.150.4@o2ib4000:23/10 lens 392/15080 e 0 to 0 dl 1403799539 ref 3 fl Rpc:R/0/ffffffff rc 0/-1
...
&amp;gt; 00000100:00000040:18.0:1403799444.852858:0:14636:0:(lustre_net.h:3250:ptlrpc_rqphase_move()) @@@ move req &quot;Rpc&quot; -&amp;gt; &quot;Unregistering&quot;  req@ffff880fffb10c00 x1471331744844100/t0(0) o35-&amp;gt;snx11063-MDT0000-mdc-ffff880835a8a400@10.149.150.4@o2ib4000:23/10 lens 392/15080 e 0 to 0 dl 1403799539 ref 3 fl Rpc:R/0/ffffffff rc 0/-1
...
&amp;gt; 00000100:00000001:18.0:1403799499.847659:0:14636:0:(client.c:1929:ptlrpc_expired_set()) Process entered
&amp;gt; 00000100:00000001:18.0:1403799499.847661:0:14636:0:(client.c:1965:ptlrpc_expired_set()) Process leaving (rc=1 : 1 : 1)
...
&amp;gt; 00000100:00000040:18.0:1403799499.847669:0:14636:0:(lustre_net.h:3250:ptlrpc_rqphase_move()) @@@ move req &quot;Unregistering&quot; -&amp;gt; &quot;Rpc&quot;  req@ffff880fffb10c00 x1471331744844100/t0(0) o35-&amp;gt;snx11063-MDT0000-mdc-ffff880835a8a400@10.149.150.4@o2ib4000:23/10 lens 392/15080 e 0 to 0 dl 1403799539 ref 2 fl Unregistering:R/0/ffffffff rc 0/-1
...
&amp;gt; 00000002:00000001:18.0:1403799444.472256:0:14636:0:(mdc_request.c:922:mdc_close()) Process leaving (rc=0 : 0 : 0)


MDS:
&amp;gt; 00000100:00100000:14.0:1403799444.858578:0:20307:0:(service.c:1886:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt_rdpg_178:0a613492-ed84-267d-d415-d0983ee768af+19:14636:x1471331744844100:12345-253@gni1:35
&amp;gt; 00000100:00100000:14.0:1403799444.858660:0:20307:0:(service.c:1930:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt_rdpg_178:0a613492-ed84-267d-d415-d0983ee768af+19:14636:x1471331744844100:12345-253@gni1:35 Request procesed in 89us (129us total) trans 193287303580 rc  0/0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;When reply_in_callback executes, rq_req_unlink is still set, so the RPC is moved to the Unregistering phase until request_out_callback has a chance to reset the flag. But when request_out_callback is finally invoked, it is with an LNET_EVENT_SEND event:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt; 00000100:00000200:2:1404844097.216045:0:5109:0:(events.c:68:request_out_callback()) @@@ type 5, status 0  req@ffff8801fb962c00 x1473065104065052/t0(0) o35-&amp;gt;snx11014-MDT0000-mdc-ffff88040f1a6000@10.10.100.4@o2ib6000:23/10 lens 392/2536 e 0 to 0 dl 1404844281 ref 3 fl Unregistering:R/0/ffffffff rc 0/-1
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;(Note: this trace is from a different dump but shows the event type and RPC state. Apologies for mixing data from different dumps, but I don&apos;t have a single debug log that shows the whole story.)&lt;/p&gt;

&lt;p&gt;With a send event, request_out_callback does not call ptlrpc_client_wake_req, so the mdc_close task waiting in ptlrpc_set_wait is not woken up, ptlrpc_check_set is not invoked, and mdc_close waits until the rpc times out before continuing. &lt;/p&gt;
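
&lt;p&gt;One way to avoid the stall (a sketch only; this is not the patch that landed, and the rq_replied check below is illustrative) would be for request_out_callback to also wake the waiting task when the send event finally unlinks the request after its reply has already arrived, so ptlrpc_check_set can move the request out of the Unregistering phase immediately instead of waiting for the timeout:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;void request_out_callback(lnet_event_t *ev)
{
...
        if (ev-&amp;gt;unlinked) {
                req-&amp;gt;rq_req_unlink = 0;
                /* Sketch: if the reply already arrived while the
                 * request sat in Unregistering, wake the waiter so
                 * ptlrpc_check_set() can advance the request instead
                 * of letting it sit until the RPC timeout fires. */
                if (req-&amp;gt;rq_replied)
                        ptlrpc_client_wake_req(req);
        }
...
}
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;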

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;void request_out_callback(lnet_event_t *ev)
{
...
        DEBUG_REQ(D_NET, req, &quot;type %d, status %d&quot;, ev-&amp;gt;type, ev-&amp;gt;status);
...
        if (ev-&amp;gt;unlinked)
                req-&amp;gt;rq_req_unlink = 0;

        if (ev-&amp;gt;type == LNET_EVENT_UNLINK || ev-&amp;gt;status != 0) {

                /* Failed send: make it seem like the reply timed out, just
                 * like failing sends in client.c does currently...  */

                req-&amp;gt;rq_net_err = 1;
                ptlrpc_client_wake_req(req);
        }
...
}
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Note: waiting for the RPC timeout does not appear to cause any functional errors, but the performance penalty is significant.&lt;/p&gt;
</description>
                <environment></environment>
        <key id="25564">LU-5341</key>
            <summary>Intermittent hangs waiting for RPCs in unregistering phase to timeout</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="6" iconUrl="https://jira.whamcloud.com/images/icons/statuses/closed.png" description="The issue is considered finished, the resolution is correct. Issues which are closed can be reopened.">Closed</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="bogl">Bob Glossman</assignee>
                                    <reporter username="amk">Ann Koehler</reporter>
                        <labels>
                            <label>patch</label>
                    </labels>
                <created>Mon, 14 Jul 2014 16:51:27 +0000</created>
                <updated>Thu, 28 Jul 2016 15:08:37 +0000</updated>
                            <resolved>Thu, 28 Jul 2016 15:08:37 +0000</resolved>
                                    <version>Lustre 2.6.0</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>5</watches>
                                                                            <comments>
                            <comment id="88950" author="jhammond" created="Mon, 14 Jul 2014 16:57:26 +0000"  >&lt;p&gt;I&apos;ve noticed that insanity tests 0 and 1 (and possibly others) fail about 50% of the time when run locally. I bisected that to the landing of &lt;a href=&quot;http://review.whamcloud.com/10846&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/10846&lt;/a&gt;.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;== insanity test 1: MDS/MDS failure == 11:51:49 (1404751909)
Stopping /mnt/mds1 (opts:) on t
Failover mds1 to t
Stopping /mnt/mds2 (opts:) on t
Reintegrating MDS2
11:52:26 (1404751946) waiting for t network 900 secs ...
11:52:26 (1404751946) network interface is UP
Starting mds2:   -o loop /tmp/lustre-mdt2 /mnt/mds2
Started lustre-MDT0001
11:52:41 (1404751961) waiting for t network 900 secs ...
11:52:41 (1404751961) network interface is UP
Starting mds1:   -o loop /tmp/lustre-mdt1 /mnt/mds1
mount.lustre: mount /dev/loop1 at /mnt/mds1 failed: Input/output error
Is the MGS running?
Start of /tmp/lustre-mdt1 on mds1 failed 5
 insanity test_1: @@@@@@ FAIL: test_1 failed with 5
  Trace dump:
  = lustre/tests/../tests/test-framework.sh:4504:error_noexit()
  = lustre/tests/../tests/test-framework.sh:4535:error()
  = lustre/tests/../tests/test-framework.sh:4781:run_one()
  = lustre/tests/../tests/test-framework.sh:4816:run_one_logged()
  = lustre/tests/../tests/test-framework.sh:4636:run_test()
  = lustre/tests/insanity.sh:207:main()
Dumping lctl log to /tmp/test_logs/1404751907/insanity.test_1.*.1404751982.log
Dumping logs only on local client.
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="88957" author="amk" created="Mon, 14 Jul 2014 17:39:01 +0000"  >&lt;p&gt;Patch available at &lt;a href=&quot;http://review.whamcloud.com/11090&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/11090&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This patch is an expedient fix to the bug that minimizes perturbations to the ptlrpc state machine.  In the longer term,  the whole set of issues associated with &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5073&quot; title=&quot;lustre don&amp;#39;t able to unload modules in conf-sanity 31.&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5073&quot;&gt;&lt;del&gt;LU-5073&lt;/del&gt;&lt;/a&gt; warrant an architectural review. &lt;/p&gt;</comment>
                            <comment id="94519" author="cliffw" created="Fri, 19 Sep 2014 17:14:43 +0000"  >&lt;p&gt;The patch has a negative review, can you address those concerns?&lt;/p&gt;</comment>
                            <comment id="100887" author="hornc" created="Fri, 5 Dec 2014 23:12:15 +0000"  >&lt;p&gt;Looks like this was addressed in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5696&quot; title=&quot;missing wakeup for ptlrpc_check_set&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5696&quot;&gt;&lt;del&gt;LU-5696&lt;/del&gt;&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="160191" author="amk" created="Thu, 28 Jul 2016 15:08:37 +0000"  >&lt;p&gt;Closing since the mod was superseded by &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5696&quot; title=&quot;missing wakeup for ptlrpc_check_set&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5696&quot;&gt;&lt;del&gt;LU-5696&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="25321">LU-5259</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="24741">LU-5073</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzwrcn:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>14898</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>