<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:14:55 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
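For instance, to fetch just those two fields for this issue with curl (a sketch that assumes
the standard JIRA issue-xml view URL, which is not shown in this export):

  curl 'https://jira.whamcloud.com/si/jira.issueviews:issue-xml/LU-1246/LU-1246.xml?field=key&field=summary'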
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-1246] SANITY_QUOTA test_32 failed in cleanup_and_setup_lustre with LOAD_MODULES_REMOTE=true</title>
                <link>https://jira.whamcloud.com/browse/LU-1246</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;SANITY_QUOTA test_32 always failed. The test was started from service331 (actually a Lustre client):&lt;/p&gt;

&lt;p&gt;...&lt;br/&gt;
Formatting mgs, mds, osts&lt;br/&gt;
...&lt;br/&gt;
Setup mgs, mdt, osts&lt;br/&gt;
start mds /dev/sdb1 -o errors=panic,acl&lt;br/&gt;
Starting mds: -o errors=panic,acl  /dev/sdb1 /mnt/mds&lt;br/&gt;
service360: Reading test skip list from /usr/lib64/lustre/tests/cfg/tests-to-skip.sh&lt;br/&gt;
service360: #!/bin/bash&lt;br/&gt;
service360: #SANITY_BIGFILE_EXCEPT=&quot;64b&quot;&lt;br/&gt;
service360: #export SANITY_EXCEPT=&quot;$SANITY_BIGFILE_EXCEPT&quot;&lt;br/&gt;
service360: MDSSIZE=2000000, OSTSIZE=2000000.&lt;br/&gt;
service360: ncli_nas.sh: Before init_clients_lists&lt;br/&gt;
service360: ncli_nas.sh: Done init_clients_lists&lt;br/&gt;
service360: lnet.debug=0x33f1504&lt;br/&gt;
service360: lnet.subsystem_debug=0xffb7e3ff&lt;br/&gt;
service360: lnet.debug_mb=16&lt;br/&gt;
Started lustre-MDT0000&lt;br/&gt;
start ost1 /dev/sdb1 -o errors=panic,mballoc,extents&lt;br/&gt;
Starting ost1: -o errors=panic,mballoc,extents  /dev/sdb1 /mnt/ost1&lt;br/&gt;
service361: mount.lustre: mount /dev/sdb1 at /mnt/ost1 failed: Cannot send after transport endpoint shutdown&lt;br/&gt;
mount -t lustre  /dev/sdb1 /mnt/ost1&lt;br/&gt;
Start of /dev/sdb1 on ost1 failed 108&lt;br/&gt;
...&lt;/p&gt;


&lt;p&gt;It seems this test is the only one that sets LOAD_MODULES_REMOTE=true before calling cleanup_and_setup_lustre, and it failed. Sometimes only OST1 hit the error 108 problem; sometimes both OST1 and OST2 were hit by it. I put a &quot;sleep 3&quot; in setupall()&lt;br/&gt;
after the MDS started but before trying to start the OSTs, but it did not help.&lt;/p&gt;



&lt;p&gt;The &apos;dmesg&apos; from the MDS (service360) showed:&lt;br/&gt;
Lustre: DEBUG MARKER: == test 32: check lqs hash(bug 21846) ========================================== == 11:05:01&lt;br/&gt;
Lustre: MDT lustre-MDT0000 has stopped.&lt;br/&gt;
LustreError: 28890:0:(ldlm_request.c:1039:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway&lt;br/&gt;
LustreError: 28890:0:(ldlm_request.c:1597:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108&lt;br/&gt;
Lustre: MGS has stopped.&lt;br/&gt;
Lustre: server umount lustre-MDT0000 complete&lt;br/&gt;
Lustre: Removed LNI 10.151.26.38@o2ib&lt;br/&gt;
Lustre: OBD class driver, http://wiki.whamcloud.com/&lt;br/&gt;
Lustre:     Lustre Version: 1.8.6.81&lt;br/&gt;
Lustre:     Build Version: lustre/scripts-1.8.6&lt;br/&gt;
Lustre: Listener bound to ib1:10.151.26.38:987:mlx4_0&lt;br/&gt;
Lustre: Register global MR array, MR size: 0xffffffffffffffff, array size: 1&lt;br/&gt;
Lustre: Added LNI 10.151.26.38@o2ib [8/64/0/180]&lt;br/&gt;
Lustre: Filtering OBD driver; http://wiki.whamcloud.com/&lt;br/&gt;
LDISKFS-fs (sdb1): mounted filesystem with ordered data mode&lt;br/&gt;
LDISKFS-fs (sdb1): mounted filesystem with ordered data mode&lt;br/&gt;
LDISKFS-fs (sdb1): mounted filesystem with ordered data mode&lt;br/&gt;
Lustre: MGS MGS started&lt;br/&gt;
Lustre: MGC10.151.26.38@o2ib: Reactivating import&lt;br/&gt;
Lustre: MGS: Logs for fs lustre were removed by user request.  All servers must be restarted in order to regenerate the logs.&lt;br/&gt;
Lustre: Setting parameter lustre-mdtlov.lov.stripesize in log lustre-MDT0000&lt;br/&gt;
Lustre: Enabling user_xattr&lt;br/&gt;
Lustre: Enabling ACL&lt;br/&gt;
Lustre: lustre-MDT0000: new disk, initializing&lt;br/&gt;
Lustre: lustre-MDT0000: Now serving lustre-MDT0000 on /dev/sdb1 with recovery enabled&lt;br/&gt;
Lustre: 30206:0:(lproc_mds.c:271:lprocfs_wr_group_upcall()) lustre-MDT0000: group upcall set to /usr/sbin/l_getgroups&lt;br/&gt;
Lustre: MGS: Regenerating lustre-OSTffff log by user request.&lt;br/&gt;
Lustre: lustre-MDT0000: temporarily refusing client connection from 10.151.25.182@o2ib&lt;br/&gt;
LustreError: 30130:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (-11)  req@ffff8103f64cc000 x1397073513545734/t0 o38-&amp;gt;&amp;lt;?&amp;gt;@&amp;lt;?&amp;gt;:0/0 lens 368/0 e 0 to 0 dl 1332353120 ref 1 fl Interpret:/0/0 rc -11/0&lt;br/&gt;
Lustre: 30308:0:(mds_lov.c:1155:mds_notify()) MDS lustre-MDT0000: add target lustre-OST0000_UUID&lt;br/&gt;
Lustre: 29699:0:(quota_master.c:1718:mds_quota_recovery()) Only 0/1 OSTs are active, abort quota recovery&lt;br/&gt;
Lustre: MDS lustre-MDT0000: lustre-OST0000_UUID now active, resetting orphans&lt;br/&gt;
Lustre: DEBUG MARKER: Using TIMEOUT=20&lt;br/&gt;
Lustre: DEBUG MARKER: sanity-quota test_32: @@@@@@ FAIL: Rehearsh didn&apos;t happen&lt;br/&gt;
Lustre: DEBUG MARKER: == test 99: Quota off =============================== == 11:08:33&lt;/p&gt;


&lt;p&gt;The &apos;dmesg&apos; from OST1 (service361) showed:&lt;br/&gt;
Lustre: DEBUG MARKER: == test 32: check lqs hash(bug 21846) ========================================== == 11:05:01&lt;br/&gt;
Lustre: OST lustre-OST0000 has stopped.&lt;br/&gt;
LustreError: 5972:0:(ldlm_request.c:1039:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway&lt;br/&gt;
LustreError: 5972:0:(ldlm_request.c:1597:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108&lt;br/&gt;
Lustre: 5972:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1395992728675719 sent from MGC10.151.26.38@o2ib to NID 10.151.26.38@o2ib 6s ago has timed out (6s prior to deadline).&lt;br/&gt;
  req@ffff810407edd800 x1395992728675719/t0 o251-&amp;gt;MGS@MGC10.151.26.38@o2ib_0:26/25 lens 192/384 e 0 to 1 dl 1332352932 ref 1 fl Rpc:N/0/0 rc 0/0&lt;br/&gt;
Lustre: server umount lustre-OST0000 complete&lt;br/&gt;
LDISKFS-fs (sdb1): mounted filesystem with ordered data mode&lt;br/&gt;
LDISKFS-fs (sdb1): mounted filesystem with ordered data mode&lt;br/&gt;
LDISKFS-fs (sdb1): mounted filesystem with ordered data mode&lt;br/&gt;
Lustre: 20460:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1395992728675720 sent from MGC10.151.26.38@o2ib to NID 10.151.26.38@o2ib 0s ago has failed due to network error (5s prior to deadline).&lt;br/&gt;
  req@ffff8103ca5edc00 x1395992728675720/t0 o250-&amp;gt;MGS@MGC10.151.26.38@o2ib_0:26/25 lens 368/584 e 0 to 1 dl 1332353093 ref 1 fl Rpc:N/0/0 rc 0/0&lt;br/&gt;
LustreError: 7058:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff8103b2ca7800 x1395992728675721/t0 o253-&amp;gt;MGS@MGC10.151.26.38@o2ib_0:26/25 lens 4736/4928 e 0 to 1 dl 0 ref 1 fl Rpc:/0/0 rc 0/0&lt;br/&gt;
LustreError: 7058:0:(obd_mount.c:1112:server_start_targets()) Required registration failed for lustre-OSTffff: -108&lt;br/&gt;
LustreError: 7058:0:(obd_mount.c:1670:server_fill_super()) Unable to start targets: -108&lt;br/&gt;
LustreError: 7058:0:(obd_mount.c:1453:server_put_super()) no obd lustre-OSTffff&lt;br/&gt;
LustreError: 7058:0:(obd_mount.c:147:server_deregister_mount()) lustre-OSTffff not registered&lt;br/&gt;
Lustre: server umount lustre-OSTffff complete&lt;br/&gt;
LustreError: 7058:0:(obd_mount.c:2065:lustre_fill_super()) Unable to mount  (-108)&lt;br/&gt;
Lustre: DEBUG MARKER: Using TIMEOUT=20&lt;br/&gt;
Lustre: DEBUG MARKER: sanity-quota test_32: @@@@@@ FAIL: Rehearsh didn&apos;t happen&lt;br/&gt;
Lustre: DEBUG MARKER: == test 99: Quota off =============================== == 11:08:33&lt;/p&gt;

</description>
                <environment>One mgs/mds, two OSS, two clients, lustre-1.8.6.81.&lt;br/&gt;
Server is running rhel5.7, client is running sles11sp1.&lt;br/&gt;
ofed 1.5.4.1.</environment>
        <key id="13646">LU-1246</key>
            <summary>SANITY_QUOTA test_32 failed in cleanup_and_setup_lustre with LOAD_MODULES_REMOTE=true</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="5">Cannot Reproduce</resolution>
                                        <assignee username="niu">Niu Yawei</assignee>
                                    <reporter username="jaylan">Jay Lan</reporter>
                        <labels>
                    </labels>
                <created>Wed, 21 Mar 2012 14:43:11 +0000</created>
                <updated>Wed, 30 May 2012 18:31:00 +0000</updated>
                            <resolved>Wed, 30 May 2012 18:31:00 +0000</resolved>
                                    <version>Lustre 1.8.6</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>3</watches>
                                                                            <comments>
                            <comment id="31773" author="jaylan" created="Wed, 21 Mar 2012 14:52:09 +0000"  >&lt;p&gt;Each time after failure, a &apos;lctl ping&apos; between mds and ost&apos;s (both direction) worked. Manually executing mount command from ost also worked. Only failed during the test.&lt;/p&gt;</comment>
                            <comment id="31774" author="pjones" created="Wed, 21 Mar 2012 14:57:55 +0000"  >&lt;p&gt;Niu&lt;/p&gt;

&lt;p&gt;Could you please comment?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="31856" author="niu" created="Thu, 22 Mar 2012 09:35:16 +0000"  >&lt;p&gt;Hi, Jay&lt;/p&gt;

&lt;p&gt;Could you try commenting out the &quot;LOAD_MODULES_REMOTE=true&quot; in sanity-quota test_32() to see if the problem goes away?&lt;/p&gt;

&lt;p&gt;In the load_modules() of test-framework.sh, there is a comment:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;    # bug 19124
    # load modules on remote nodes optionally
    # lustre-tests have to be installed on these nodes
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Could you make sure that the lustre-tests are installed correctly on the remote nodes (MDS &amp;amp; OSS)? Thanks.&lt;/p&gt;</comment>
                            <comment id="31884" author="jaylan" created="Thu, 22 Mar 2012 13:23:23 +0000"  >&lt;p&gt;Hi Niu,&lt;/p&gt;

&lt;p&gt;1. I turned off LOAD_MODULES_REMOTE a few days ago, and the problem went away.&lt;br/&gt;
   After I turned it back on, the problem came back. That is why I put that&lt;br/&gt;
   condition on the Summary line.&lt;br/&gt;
2. The lustre-tests are installed correctly on the MDS and OSS. SANITY_QUOTA&lt;br/&gt;
   test_32 was the only test that failed for me this time. All other tests passed.&lt;/p&gt;</comment>
                            <comment id="31885" author="jaylan" created="Thu, 22 Mar 2012 13:26:31 +0000"  >&lt;p&gt;BTW, by &quot;all other tests&quot; I meant the test suites that Maloo runs when a new patch&lt;br/&gt;
is proposed. SANITY_QUOTA is only one of the test suites.&lt;/p&gt;</comment>
                            <comment id="31964" author="niu" created="Fri, 23 Mar 2012 03:15:12 +0000"  >&lt;p&gt;Thanks, Jay. &lt;/p&gt;

&lt;p&gt;I don&apos;t know why the OST can&apos;t communicate with the MGS in your case. Is it possible to get a full debug log from the MDS &amp;amp; OSS? (You can set PTLDEBUG to -1 on the MDS &amp;amp; OSS nodes; I think the test will dump the debug log automatically when a test fails, or you can dump the debug log to a file with &apos;lctl dk&apos;.)&lt;/p&gt;</comment>
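                            <!-- A minimal sketch of the debug-collection steps suggested above,
                                 assuming a Lustre 1.8-era toolchain on the MDS and OSS nodes
                                 (the dump file path is illustrative only):

                                     lctl set_param debug=-1        # full debug mask, what PTLDEBUG=-1 requests via test-framework.sh
                                     # re-run the failing test, then dump the kernel debug buffer to a file:
                                     lctl dk /tmp/lustre-debug.log
                            -->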
                            <comment id="32132" author="jaylan" created="Mon, 26 Mar 2012 14:06:45 +0000"  >&lt;p&gt;Hi Niu,&lt;/p&gt;

&lt;p&gt;I will have to do that later. The MDS and OSS nodes have been re-imaged to rhel6.2 with lustre-2.1.1 server code. When I am done with 2.1.1, I will re-image them back to 1.8.6 and provide the information you need.&lt;/p&gt;

&lt;p&gt;BTW, does the following message (cited from the dmesg of OSS1 in the &quot;Description&quot; of this bug report) imply that the timeout first occurred on the MDS node?&lt;/p&gt;

&lt;p&gt;Lustre: 5972:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1395992728675719 sent from MGC10.151.26.38@o2ib to NID 10.151.26.38@o2ib 6s ago has timed out (6s prior to deadline).&lt;/p&gt;</comment>
                            <comment id="32157" author="niu" created="Mon, 26 Mar 2012 23:06:26 +0000"  >&lt;p&gt;Thanks, Jay.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;BTW, does the following message (cited from the dmesg of OSS1 in the &quot;Description&quot; of this bug report) imply that the timeout first occurred on the MDS node?&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;I think it indicates that the OST-to-MGS request timed out.&lt;/p&gt;</comment>
                            <comment id="39657" author="jaylan" created="Wed, 30 May 2012 17:59:30 +0000"  >&lt;p&gt;Hi Niu,&lt;/p&gt;

&lt;p&gt;We upgraded our servers to 2.1.1 last week, and I have not seen this problem in testing with the 2.1 servers.&lt;/p&gt;

&lt;p&gt;Thus, this problem is not important to us any more. You may close it.&lt;/p&gt;</comment>
                            <comment id="39659" author="pjones" created="Wed, 30 May 2012 18:31:00 +0000"  >&lt;p&gt;ok thanks Jay&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                    <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzvf7b:</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>6108</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>