<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:15:48 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-1342] Test failure on sanity-quota test_29</title>
                <link>https://jira.whamcloud.com/browse/LU-1342</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;This looks like a duplicate of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-492&quot; title=&quot;Test failure on sanity-quota test_29&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-492&quot;&gt;&lt;del&gt;LU-492&lt;/del&gt;&lt;/a&gt;, but my software already contains the fix for &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-492&quot; title=&quot;Test failure on sanity-quota test_29&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-492&quot;&gt;&lt;del&gt;LU-492&lt;/del&gt;&lt;/a&gt;. The patch for &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-492&quot; title=&quot;Test failure on sanity-quota test_29&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-492&quot;&gt;&lt;del&gt;LU-492&lt;/del&gt;&lt;/a&gt; did not help in my testing.&lt;/p&gt;

&lt;p&gt;The git source of our code is at &lt;a href=&quot;https://github.com/jlan/lustre-nas/tree/nas-2.1.1&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://github.com/jlan/lustre-nas/tree/nas-2.1.1&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The command I issued was:&lt;/p&gt;
&lt;p&gt;# ONLY=29 cfg/nas.v3.sh SANITY_QUOTA&lt;br/&gt;
The script files nas.v3.sh and ncli_nas.v3.sh are attached.&lt;br/&gt;
The test log tarball sanity-quota-1335289931.tar.bz2 is also attached.&lt;/p&gt;


&lt;p&gt;The failure is reproducible.&lt;/p&gt;

&lt;p&gt;test_29()&lt;br/&gt;
{&lt;br/&gt;
...&lt;/p&gt;

&lt;p&gt;# actually send a RPC to make service at_current confined within at_max&lt;br/&gt;
$LFS setquota -u $TSTUSR -b 0 -B $BLK_LIMIT -i 0 -I 0 $DIR || error &quot;should succeed&quot;&lt;br/&gt;
&amp;lt;=== succeeded&lt;/p&gt;


&lt;p&gt;#define OBD_FAIL_MDS_QUOTACTL_NET 0x12e&lt;br/&gt;
lustre_fail mds 0x12e&lt;br/&gt;
&amp;lt;==== fine&lt;/p&gt;

&lt;p&gt;$LFS setquota -u $TSTUSR -b 0 -B $BLK_LIMIT -i 0 -I 0 $DIR &amp;amp; pid=$!&lt;br/&gt;
&amp;lt;==== &quot;setquota failed: Transport endpoint is not connected&quot;&lt;/p&gt;

&lt;p&gt;echo &quot;sleeping for 10 * 1.25 + 5 + 10 seconds&quot;&lt;br/&gt;
sleep 28&lt;br/&gt;
ps -p $pid &amp;amp;&amp;amp; error &quot;lfs hadn&apos;t finished by timeout&quot;&lt;br/&gt;
&amp;lt;==== the process is still alive; it dies later due to timeout.&lt;br/&gt;
...&lt;/p&gt;

&lt;p&gt;Is the &quot;setquota failed: Transport endpoint is not connected&quot; error expected?&lt;br/&gt;
I saw it in the test log.&lt;br/&gt;
Was it the result of &quot;lustre_fail mds 0x12e&quot;, or does it mean the MDS did not see the lustre_fail request? Remote commands were sent via pdsh.&lt;/p&gt;

&lt;p&gt;If I use &quot;sleep 40&quot; (instead of &quot;sleep 28&quot;) after that, the lfs&lt;br/&gt;
command times out before the check and the test passes. It seems&lt;br/&gt;
the sleep formula &quot;10 * 1.25 + 5 + 10 seconds&quot; is not long enough?&lt;/p&gt;
</description>
                <environment>Server: rhel6.2 with lustre-2.1.1&lt;br/&gt;
Client: rhel6.2 with lustre-client-2.1.1&lt;br/&gt;
&lt;br/&gt;
MDS/MGS: service360&lt;br/&gt;
OSS1:        service361&lt;br/&gt;
OSS2:        service362&lt;br/&gt;
Client1:       service333&lt;br/&gt;
Client2:       service334</environment>
        <key id="14173">LU-1342</key>
            <summary>Test failure on sanity-quota test_29</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="bobijam">Zhenyu Xu</assignee>
                                    <reporter username="jaylan">Jay Lan</reporter>
                        <labels>
                    </labels>
                <created>Tue, 24 Apr 2012 14:15:17 +0000</created>
                <updated>Wed, 17 Apr 2013 12:50:44 +0000</updated>
                            <resolved>Sat, 22 Dec 2012 10:33:08 +0000</resolved>
                                    <version>Lustre 2.1.1</version>
                                    <fixVersion>Lustre 2.1.4</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>2</watches>
                                                                            <comments>
                            <comment id="35440" author="pjones" created="Wed, 25 Apr 2012 08:05:27 +0000"  >&lt;p&gt;Bobi&lt;/p&gt;

&lt;p&gt;Could you please look into this one?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="35474" author="bobijam" created="Wed, 25 Apr 2012 22:58:41 +0000"  >&lt;p&gt;I guess the test needs to take network latency into account in the wait time value.&lt;/p&gt;

&lt;p&gt;Patch tracking at &lt;a href=&quot;http://review.whamcloud.com/2601&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/2601&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="35507" author="bogl" created="Thu, 26 Apr 2012 12:57:21 +0000"  >&lt;p&gt;&quot;setquota failed: Transport endpoint is not connected&quot; is the expected error.&lt;br/&gt;
This is the error reported when the client rpc fails within the sleep time.&lt;br/&gt;
The write to fail_loc on your server is happening correctly.  It causes the MDS_QUOTACTL rpc handled on the server to abort and never return a reply.  This is shown by the following from the server log:&lt;/p&gt;

&lt;p&gt;00000100:00100000:7.0:1335289642.393283:0:14946:0:(service.c:1536:ptlrpc_server_handle_req_in()) got req x1398863346727737&lt;br/&gt;
00000100:00100000:7.0:1335289642.393297:0:14946:0:(service.c:1713:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt_01:d1093c79-65be-f3fa-3770-e19614ebeee7+6:31674:x1398863346727737:12345-10.151.25.184@o2ib:48&lt;br/&gt;
00000004:00020000:7.0:1335289642.393301:0:14946:0:(libcfs_fail.h:81:cfs_fail_check_set()) *** cfs_fail_loc=12e ***&lt;br/&gt;
00000100:00100000:7.0:1335289642.415971:0:14946:0:(service.c:1760:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt_01:d1093c79-65be-f3fa-3770-e19614ebeee7+5:31674:x1398863346727737:12345-10.151.25.184@o2ib:48 Request procesed in 22680us (22700us total) trans 0 rc 0/-999&lt;/p&gt;

&lt;p&gt;In the client log I see the following related to the failing rpc (xid = 1398863346727737, opcode = 48 = MDS_QUOTACTL)&lt;/p&gt;

&lt;p&gt;00000100:00100000:5.0:1335289903.436104:0:31674:0:(client.c:1395:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc lfs:d1093c79-65be-f3fa-3770-e19614ebeee7:31674:1398863346727737:10.151.26.38@o2ib:48&lt;br/&gt;
00000100:00100000:5.0:1335289903.436135:0:31674:0:(client.c:1993:ptlrpc_set_wait()) set ffff8801940b8640 going to sleep for 23 seconds&lt;br/&gt;
00000100:00100000:5.0:1335289926.435314:0:31674:0:(client.c:1993:ptlrpc_set_wait()) set ffff8801940b8640 going to sleep for 1 seconds&lt;br/&gt;
00000100:00100000:5.0:1335289927.435311:0:31674:0:(client.c:1993:ptlrpc_set_wait()) set ffff8801940b8640 going to sleep for 1 seconds&lt;br/&gt;
00000100:00100000:5.0:1335289928.435312:0:31674:0:(client.c:1993:ptlrpc_set_wait()) set ffff8801940b8640 going to sleep for 1 seconds&lt;br/&gt;
00000100:00100000:5.0:1335289929.435312:0:31674:0:(client.c:1993:ptlrpc_set_wait()) set ffff8801940b8640 going to sleep for 1 seconds&lt;br/&gt;
00000100:00100000:5.0:1335289930.435311:0:31674:0:(client.c:1993:ptlrpc_set_wait()) set ffff8801940b8640 going to sleep for 1 seconds&lt;br/&gt;
00000100:00100000:5.0:1335289931.435300:0:31674:0:(client.c:1993:ptlrpc_set_wait()) set ffff8801940b8640 going to sleep for 1 seconds&lt;/p&gt;


&lt;p&gt;The initial sleep of 23 sec shown seems excessively high for the timeout of 10 set by the test script.  The formula mentioned in the script comment makes it seem like the number should be more like 17 or 18 ( 10 * 1.25 + 5 ).   I&apos;m wondering if there are some other settings in your environment forcing the rpc timeouts to be higher than normal, for example ldlm_timeout or timeouts related to your interconnect (IB).&lt;/p&gt;

&lt;p&gt;In attempting to reproduce this failure locally with tcp interconnect I find my lfs process timing out and returning in way under 10 secs every time.  It never comes close to reaching the 28 sec sleep in the test script.&lt;/p&gt;</comment>
                            <comment id="35537" author="jaylan" created="Thu, 26 Apr 2012 20:17:35 +0000"  >&lt;p&gt;I found the problem.&lt;/p&gt;

&lt;p&gt;We set at_min=15 in our systems in addition to at_max.&lt;br/&gt;
test_29 resets at_max to 10 but leaves at_min at a higher value, which&lt;br/&gt;
forces the timeout value to be 15 instead of 10.&lt;/p&gt;

&lt;p&gt;The test should save both at_max and at_min before the test, and restore&lt;br/&gt;
them after the test.&lt;/p&gt;</comment>
                            <comment id="35541" author="bobijam" created="Thu, 26 Apr 2012 22:06:38 +0000"  >&lt;p&gt;The sleep time should cover the worst case. I think I can improve the test script by checking the lfs process before the deadline, which would be better.&lt;/p&gt;</comment>
                            <comment id="35571" author="jaylan" created="Fri, 27 Apr 2012 14:53:34 +0000"  >&lt;p&gt;How do I add a comment to &lt;a href=&quot;http://review.whamcloud.com/2601&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/2601&lt;/a&gt; ?&lt;/p&gt;

&lt;p&gt;Patch Set 2 did work in my environment. However, remember that my problem&lt;br/&gt;
arose because test_29 lowered at_max on the client to a value smaller than at_min. Our site&lt;br/&gt;
sets at_min to 15, so deadline=56 works, but in theory it can fail at a&lt;br/&gt;
site with a higher at_min value.&lt;/p&gt;


&lt;p&gt;BTW, I understand the first 10 in the formula &quot;2 * (10 * 1.25 + 5 + 10)&quot; is the&lt;br/&gt;
timeout value test_29 wants to set. I guess the second 10 is not the same timeout&lt;br/&gt;
value; otherwise, you would simply do 10 * 2.25 ;) So, what is the second 10?&lt;/p&gt;</comment>
                            <comment id="35581" author="pjones" created="Fri, 27 Apr 2012 15:05:32 +0000"  >&lt;p&gt;&amp;gt; How do I add comment to &lt;a href=&quot;http://review.whamcloud.com/2601&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/2601&lt;/a&gt; ?&lt;/p&gt;

&lt;p&gt;I would guess that the missing step would be to log in to Gerrit...&lt;/p&gt;</comment>
                            <comment id="44164" author="jaylan" created="Tue, 4 Sep 2012 16:23:47 +0000"  >&lt;p&gt;Patch set #7 landed on master on July 12.&lt;br/&gt;
Can we land this patch on the b2_1 branch and close the ticket?&lt;/p&gt;</comment>
                            <comment id="44182" author="bobijam" created="Tue, 4 Sep 2012 21:17:11 +0000"  >&lt;p&gt;b2_1 patch port tracking at &lt;a href=&quot;http://review.whamcloud.com/3870&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/3870&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="49603" author="pjones" created="Sat, 22 Dec 2012 10:33:08 +0000"  >&lt;p&gt;Landed for 2.1.4 and 2.4&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="11229" name="nas.v3.sh" size="2662" author="jaylan" created="Tue, 24 Apr 2012 14:15:17 +0000"/>
                            <attachment id="11230" name="ncli_nas.v3.sh" size="2401" author="jaylan" created="Tue, 24 Apr 2012 14:15:17 +0000"/>
                            <attachment id="11231" name="sanity-quota-1335289931.tar.bz2" size="4828965" author="jaylan" created="Tue, 24 Apr 2012 14:15:17 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzv693:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>4547</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>