<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:10:22 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-782] sanity.sh subtest test_220 failed with 3, error -107 -ENOTCONN</title>
                <link>https://jira.whamcloud.com/browse/LU-782</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;This issue was created by maloo for Chris Gearing &amp;lt;chris@whamcloud.com&amp;gt;&lt;/p&gt;

&lt;p&gt;This issue relates to the following test suite run: &lt;a href=&quot;https://maloo.whamcloud.com/test_sets/af3669de-f9cd-11e0-b486-52540025f9af&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://maloo.whamcloud.com/test_sets/af3669de-f9cd-11e0-b486-52540025f9af&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The sub-test test_220 failed with the following error:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;test_220 failed with 3&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Info required for matching: sanity 220&lt;/p&gt;</description>
                <environment></environment>
        <key id="12212">LU-782</key>
            <summary>sanity.sh subtest test_220 failed with 3, error -107 -ENOTCONN</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="5">Cannot Reproduce</resolution>
                                        <assignee username="bobijam">Zhenyu Xu</assignee>
                                    <reporter username="maloo">Maloo</reporter>
                        <labels>
                    </labels>
                <created>Fri, 21 Oct 2011 05:17:55 +0000</created>
                <updated>Mon, 27 Feb 2017 23:35:15 +0000</updated>
                            <resolved>Mon, 27 Feb 2017 23:35:15 +0000</resolved>
                                                                        <due></due>
                            <votes>0</votes>
                                    <watches>6</watches>
                                                                            <comments>
                            <comment id="22226" author="pjones" created="Mon, 31 Oct 2011 11:45:40 +0000"  >&lt;p&gt;Bobijam&lt;/p&gt;

&lt;p&gt;Could you please look into this regular failure? Note the comments from Tappro on LU-748 and the work in this area done by Andreas on LU-614.&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="22269" author="bobijam" created="Tue, 1 Nov 2011 21:23:57 +0000"  >&lt;p&gt;another hit &lt;a href=&quot;https://maloo.whamcloud.com/sub_tests/ec9a1b44-f451-11e0-908b-52540025f9af&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://maloo.whamcloud.com/sub_tests/ec9a1b44-f451-11e0-908b-52540025f9af&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For an unknown reason, while creating an object, the MDS reports -ENOTCONN:&lt;/p&gt;

&lt;p&gt;(lov_request.c:569:lov_update_create_set()) error creating fid 0x78d0 sub-object on OST idx 0/1: rc = -107&lt;/p&gt;

&lt;p&gt;But I haven&apos;t found any evidence of a network error between MDS0 and OST0.&lt;/p&gt;</comment>
                            <comment id="22321" author="bobijam" created="Wed, 2 Nov 2011 21:29:19 +0000"  >&lt;p&gt;Couldn&apos;t find the cause of this -107 error; a -1 debug log is needed to reveal more details.&lt;/p&gt;</comment>
                            <comment id="23444" author="pjones" created="Fri, 25 Nov 2011 11:35:39 +0000"  >&lt;p&gt;Sarah&lt;/p&gt;

&lt;p&gt;Could you please try to repeat this test manually and gather a -1 debug log?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="23449" author="sarah" created="Sun, 27 Nov 2011 02:45:23 +0000"  >&lt;p&gt;sure, will keep you updated&lt;/p&gt;</comment>
                            <comment id="25121" author="yong.fan" created="Thu, 22 Dec 2011 04:36:14 +0000"  >&lt;p&gt;Another failure instance:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://maloo.whamcloud.com/test_sets/7d12e158-2c63-11e1-ab2e-5254004bbbd3&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://maloo.whamcloud.com/test_sets/7d12e158-2c63-11e1-ab2e-5254004bbbd3&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="41362" author="spiechurski" created="Mon, 2 Jul 2012 06:16:14 +0000"  >&lt;p&gt;We seem to be hitting the same issue.&lt;br/&gt;
File creation on specific OSTs fails after 50 seconds. The next available OST is then assigned to the file, with the MDS reporting the &quot;error creating fid 0xXX sub-object on OST idx XX/1: rc = -107&quot; message.&lt;br/&gt;
The failing OSTs are not always the same, but each stays in the failing state long enough to reproduce the issue consistently for several minutes.&lt;br/&gt;
I will soon attach full debug logs from the client, the MDS, and the OSS serving the failing OST during an &apos;lfs setstripe -c 1 -o 3 toto&apos; which shows the problem.&lt;br/&gt;
This looks like a communication problem between the MDS and OSS, but communication with other OSTs on the same OSS does not cause any problem.&lt;/p&gt;

&lt;p&gt;In the MDS debug log, Request x1405820398383011 is never seen as &quot;Sent&quot; and also never appears in the OSS debug logs.&lt;/p&gt;</comment>
                            <comment id="41364" author="spiechurski" created="Mon, 2 Jul 2012 08:09:18 +0000"  >&lt;p&gt;Full debug log of lfs setstripe -c 1 -o 3 toto on client, MDS and OSS.&lt;/p&gt;</comment>
                            <comment id="41558" author="spiechurski" created="Fri, 6 Jul 2012 10:36:30 +0000"  >&lt;p&gt;A crashdump of the MDS in this state is now available if needed.&lt;/p&gt;</comment>
                            <comment id="41586" author="bobijam" created="Sun, 8 Jul 2012 23:24:21 +0000"  >&lt;p&gt;Is it easy to reproduce? If so, would you mind collecting a -1 debug log for it?&lt;/p&gt;</comment>
                            <comment id="41589" author="spiechurski" created="Mon, 9 Jul 2012 03:33:21 +0000"  >&lt;p&gt;This is easy to reproduce in the current state of the cluster.&lt;br/&gt;
The tarball I uploaded (issue-20120702.tar.gz) already contains a -1 debug log for it on client+mds+oss.&lt;br/&gt;
The debug log on the MDS mentions an RPC that times out after 50s, but this RPC number was never logged as &quot;Sent&quot;, and the OSS log never mentions it as &quot;Received&quot;.&lt;/p&gt;</comment>
                            <comment id="41607" author="bobijam" created="Mon, 9 Jul 2012 10:35:09 +0000"  >&lt;p&gt;I think the log does not cover the life cycle of request x1405820398383011; its first appearance on the MDS is its timeout event. The log shows it timed out, and it seems to be a no-resend request, which most likely goes through the following path.&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeHeader panelHeader&quot; style=&quot;border-bottom-width: 1px;&quot;&gt;&lt;b&gt;ptlrpc_check_set()&lt;/b&gt;&lt;/div&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt; &lt;span class=&quot;code-comment&quot;&gt;//
&lt;/span&gt;                                &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (ptlrpc_import_delay_req(imp, req, &amp;amp;status)){
                                        /* put on delay list - only &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; we wait
                                         * recovery finished - before send */
                                        cfs_list_del_init(&amp;amp;req-&amp;gt;rq_list);
                                        cfs_list_add_tail(&amp;amp;req-&amp;gt;rq_list,
                                                          &amp;amp;imp-&amp;gt; \
                                                          imp_delayed_list);
                                        cfs_spin_unlock(&amp;amp;imp-&amp;gt;imp_lock);
                                        &lt;span class=&quot;code-keyword&quot;&gt;continue&lt;/span&gt;;
                                }

                                &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (status != 0)  {
                                        req-&amp;gt;rq_status = status;
                                        ptlrpc_rqphase_move(req,
                                                RQ_PHASE_INTERPRET);
                                        cfs_spin_unlock(&amp;amp;imp-&amp;gt;imp_lock);
                                        GOTO(interpret, req-&amp;gt;rq_status);
                                }
                                &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (ptlrpc_no_resend(req) &amp;amp;&amp;amp; !req-&amp;gt;rq_wait_ctx) {        &lt;span class=&quot;code-comment&quot;&gt;/////// HERE
&lt;/span&gt;                                        req-&amp;gt;rq_status = -ENOTCONN;
                                        ptlrpc_rqphase_move(req,
                                                RQ_PHASE_INTERPRET);
                                        cfs_spin_unlock(&amp;amp;imp-&amp;gt;imp_lock);
                                        GOTO(interpret, req-&amp;gt;rq_status);
                                }
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The request never got sent. Is it possible to get a log covering the life cycle of the -107 request?&lt;/p&gt;

&lt;p&gt;P.S. Which Lustre version are you using?&lt;/p&gt;</comment>
                            <comment id="41610" author="spiechurski" created="Mon, 9 Jul 2012 12:27:34 +0000"  >&lt;p&gt;Here is a new debug log session.&lt;br/&gt;
The MDS is helios17, the OST used is scratch-OST000d (mounted on helios26), and the client is helios86.&lt;br/&gt;
Their NIDs are respectively 10.4.72.51@o2ib, 10.4.72.8@o2ib, and 10.4.67.2@o2ib.&lt;/p&gt;

&lt;p&gt;The request number leading to the -107 is 1406627952493697, and once again it appears only when it times out on the MDS. I left 5 seconds between activating debug and triggering the issue, so the whole life cycle of the RPC should appear.&lt;/p&gt;

&lt;p&gt;We are using our internal Bull version Bull.2.213 which is based upon 2.1.1 + a few patches, including ORNL-22 and &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-1144&quot; title=&quot;implement a NUMA aware ptlrpcd binding policy&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-1144&quot;&gt;&lt;del&gt;LU-1144&lt;/del&gt;&lt;/a&gt;.&lt;br/&gt;
The introduced ptlrpc_bind_policy parameter is set to 3.&lt;/p&gt;

&lt;p&gt;We should soon test the behavior with ptlrpc_bind_policy=1 to &quot;revert&quot; to the old code path.&lt;/p&gt;</comment>
                            <comment id="41644" author="bobijam" created="Mon, 9 Jul 2012 23:46:28 +0000"  >&lt;p&gt;Strangely, the birth of the request still escaped our monitoring.&lt;/p&gt;

&lt;div class=&quot;panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;panelHeader&quot; style=&quot;border-bottom-width: 1px;&quot;&gt;&lt;b&gt;$ grep ffff880bcb72f800 lustre-debug-20120709.helios17&lt;/b&gt;&lt;/div&gt;&lt;div class=&quot;panelContent&quot;&gt;
&lt;p&gt;154252:00000100:00000040:11.0:1341848952.925668:0:5006:0:(client.c:1794:ptlrpc_expire_one_request()) @@@ Request x1406627952493697 sent from scratch-OST000d-osc-MDT0000 to NID 10.4.72.8@o2ib has timed out for sent delay: &amp;#91;sent 1341848902&amp;#93; &amp;#91;real_sent 0&amp;#93; &amp;#91;current 1341848952&amp;#93; &amp;#91;deadline 50s&amp;#93; &amp;#91;delay 0s&amp;#93;  req@ffff880bcb72f800 x1406627952493697/t0(0) o-1-&amp;gt;scratch-OST000d_UUID@10.4.72.8@o2ib:0/0 lens 0/0 e 0 to 1 dl 1341848952 ref 1 fl Rpc:XN/ffffffff/ffffffff rc 0/-1&lt;/p&gt;

&lt;p&gt;154265:00000100:00000040:11.0:1341848952.925691:0:5006:0:(lustre_net.h:1675:ptlrpc_rqphase_move()) @@@ move req &quot;Rpc&quot; -&amp;gt; &quot;Interpret&quot;  req@ffff880bcb72f800 x1406627952493697/t0(0) o-1-&amp;gt;scratch-OST000d_UUID@10.4.72.8@o2ib:0/0 lens 0/0 e 0 to 1 dl 1341848952 ref 1 fl Rpc:XN/ffffffff/ffffffff rc -107/-1&lt;/p&gt;

&lt;p&gt;154276:00000100:00000001:11.0:1341848952.937088:0:5006:0:(client.c:2440:ptlrpc_request_addref()) Process leaving (rc=18446612182972168192 : -131890737383424 : ffff880bcb72f800)&lt;/p&gt;

&lt;p&gt;154277:00000100:00000040:11.0:1341848952.937089:0:5006:0:(lustre_net.h:1675:ptlrpc_rqphase_move()) @@@ move req &quot;Interpret&quot; -&amp;gt; &quot;Complete&quot;  req@ffff880bcb72f800 x1406627952493697/t0(0) o-1-&amp;gt;scratch-OST000d_UUID@10.4.72.8@o2ib:0/0 lens 0/0 e 0 to 1 dl 1341848952 ref 2 fl Interpret:XN/ffffffff/ffffffff rc -107/-1&lt;/p&gt;

&lt;p&gt;154279:00000100:00000040:11.0:1341848952.937093:0:5006:0:(client.c:2201:__ptlrpc_req_finished()) @@@ refcount now 1  req@ffff880bcb72f800 x1406627952493697/t0(0) o-1-&amp;gt;scratch-OST000d_UUID@10.4.72.8@o2ib:0/0 lens 0/0 e 0 to 1 dl 1341848952 ref 2 fl Complete:XN/ffffffff/ffffffff rc -107/-1&lt;/p&gt;

&lt;p&gt;154285:00000100:00000040:11.0:1341848952.937101:0:5006:0:(client.c:2201:__ptlrpc_req_finished()) @@@ refcount now 0  req@ffff880bcb72f800 x1406627952493697/t0(0) o-1-&amp;gt;scratch-OST000d_UUID@10.4.72.8@o2ib:0/0 lens 0/0 e 0 to 1 dl 1341848952 ref 1 fl Complete:XN/ffffffff/ffffffff rc -107/-1&lt;/p&gt;

&lt;p&gt;154290:00000100:00000010:11.0:1341848952.937107:0:5006:0:(client.c:2166:__ptlrpc_free_req()) kfreed &apos;request&apos;: 968 at ffff880bcb72f800.&lt;/p&gt;

&lt;p&gt;...&lt;/p&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="186376" author="adilger" created="Mon, 27 Feb 2017 23:35:15 +0000"  >&lt;p&gt;Close old bug.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="11682" name="debuglog-LU-782.tar.xz" size="7534048" author="spiechurski" created="Mon, 9 Jul 2012 12:27:34 +0000"/>
                            <attachment id="11666" name="issue-20120702.tar.gz" size="9424050" author="spiechurski" created="Mon, 2 Jul 2012 08:09:18 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                    <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzv51z:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>4352</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>