<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:16:15 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-1396] Test failure on test suite mds-survey, subtest test_2</title>
                <link>https://jira.whamcloud.com/browse/LU-1396</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;This issue was created by maloo for sarah &amp;lt;sarah@whamcloud.com&amp;gt;&lt;/p&gt;

&lt;p&gt;This issue relates to the following test suite run: &lt;a href=&quot;https://maloo.whamcloud.com/test_sets/9a890324-9abe-11e1-96af-52540035b04c&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://maloo.whamcloud.com/test_sets/9a890324-9abe-11e1-96af-52540035b04c&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The sub-test test_2 failed with the following error:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;mds-survey failed&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Info required for matching: mds-survey 2&lt;/p&gt;</description>
                <environment></environment>
        <key id="14391">LU-1396</key>
            <summary>Test failure on test suite mds-survey, subtest test_2</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="1" iconUrl="https://jira.whamcloud.com/images/icons/priorities/blocker.svg">Blocker</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="di.wang">Di Wang</assignee>
                                    <reporter username="maloo">Maloo</reporter>
                        <labels>
                    </labels>
                <created>Thu, 10 May 2012 14:03:24 +0000</created>
                <updated>Mon, 30 Jul 2012 12:28:48 +0000</updated>
                            <resolved>Mon, 30 Jul 2012 12:28:48 +0000</resolved>
                                    <version>Lustre 2.3.0</version>
                                    <fixVersion>Lustre 2.3.0</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>3</watches>
                                                                            <comments>
                            <comment id="38568" author="pjones" created="Thu, 10 May 2012 16:52:16 +0000"  >&lt;p&gt;Wangdi&lt;/p&gt;

&lt;p&gt;Could you please comment on this test failure?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="38570" author="di.wang" created="Thu, 10 May 2012 16:57:05 +0000"  >&lt;p&gt;Sure, I will investigate these mds-survey problems.&lt;/p&gt;</comment>
                            <comment id="39025" author="di.wang" created="Thu, 17 May 2012 17:31:07 +0000"  >&lt;p&gt;The problem for this bug is that during mds-survey destory,  MDT suddenly put too much destroy RPC on wire, so it can not send out requests in time, &lt;/p&gt;

&lt;p&gt;Lustre: 31465:0:(client.c:1762:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for sent delay: &lt;span class=&quot;error&quot;&gt;&amp;#91;sent 1336857426/real 0&amp;#93;&lt;/span&gt;  req@ffff88002d402400 x1401789656262897/t0(0) o6-&amp;gt;lustre-OST0002-osc-MDT0000@10.10.4.111@tcp:6/4 lens 512/400 e 0 to 1 dl 1336857463 ref 2 fl Rpc:X/0/ffffffff rc 0/-1&lt;br/&gt;
Lustre: lustre-OST0002-osc-MDT0000: Connection to lustre-OST0002 (at 10.10.4.111@tcp) was lost; in progress operations using this service will wait for recovery to complete&lt;br/&gt;
Lustre: lustre-OST0000-osc-MDT0000: Connection to lustre-OST0000 (at 10.10.4.111@tcp) was lost; in progress operations using this service will wait for recovery to complete&lt;br/&gt;
Lustre: 31466:0:(client.c:1762:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for sent delay: &lt;span class=&quot;error&quot;&gt;&amp;#91;sent 1336857435/real 0&amp;#93;&lt;/span&gt;  req@ffff880002103400 x1401789656280716/t0(0) o6-&amp;gt;lustre-OST0004-osc-MDT0000@10.10.4.111@tcp:6/4 lens 512/400 e 0 to 1 dl 1336857472 ref 2 fl Rpc:X/0/ffffffff rc 0/-1&lt;br/&gt;
Lustre: 31466:0:(client.c:1762:ptlrpc_expire_one_request()) Skipped 659 previous similar messages&lt;br/&gt;
Lustre: 31465:0:(client.c:1762:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for sent delay: &lt;span class=&quot;error&quot;&gt;&amp;#91;sent 1336857434/real 0&amp;#93;&lt;/span&gt;  req@ffff880008753c00 x1401789656279001/t0(0) o6-&amp;gt;lustre-OST0002-osc-MDT0000@10.10.4.111@tcp:6/4 lens 512/400 e 0 to 1 dl 1336857471 ref 2 fl Rpc:X/0/ffffffff rc 0/-1&lt;br/&gt;
Lustre: 31465:0:(client.c:1762:ptlrpc_expire_one_request()) Skipped 5983 previous similar messages&lt;/p&gt;



&lt;p&gt;Then the MDS reconnects and syncs, but that fails because the OST is still busy with destroys. So it deactivates the import, and we get the failure.&lt;/p&gt;

&lt;p&gt;MDS&lt;/p&gt;

&lt;p&gt;LustreError: 8319:0:(lov_log.c:158:lov_llog_origin_connect()) error osc_llog_connect tgt 1 (-11)&lt;br/&gt;
LustreError: 8320:0:(mds_lov.c:872:__mds_lov_synchronize()) lustre-OST0002_UUID failed at llog_origin_connect: -11&lt;/p&gt;



&lt;p&gt;14:20:12:LustreError: 8318:0:(osc_create.c:612:osc_create()) lustre-OST0000-osc-MDT0000: oscc recovery failed: -11&lt;br/&gt;
14:20:12:LustreError: 8318:0:(lov_obd.c:1066:lov_clear_orphans()) error in orphan recovery on OST idx 0/7: rc = -11&lt;br/&gt;
14:20:12:LustreError: 8318:0:(mds_lov.c:882:__mds_lov_synchronize()) lustre-OST0000_UUID failed at mds_lov_clear_orphans: -11&lt;/p&gt;


&lt;p&gt;OST&lt;/p&gt;

&lt;p&gt;14:18:39:Lustre: lustre-OST0000: Client lustre-MDT0000-mdtlov_UUID (at 10.10.4.110@tcp) reconnecting&lt;br/&gt;
14:18:40:Lustre: lustre-OST0000: Client lustre-MDT0000-mdtlov_UUID (at 10.10.4.110@tcp) refused reconnection, still busy with 59 active RPCs&lt;br/&gt;
14:18:40:Lustre: lustre-OST0001: Client lustre-MDT0000-mdtlov_UUID (at 10.10.4.110@tcp) reconnecting&lt;/p&gt;


&lt;p&gt;So the root cause of this problem is that we put too many destroy RPCs on the wire between the MDT and OST during mds-survey, but do not throttle them as we do for client-&amp;gt;OST. We probably should do that for mds-survey. But after lod/osp (2.4) lands, the MDT will destroy objects through the llog instead of sending an RPC immediately for each destroy, or will even batch multiple destroys in one RPC, so the problem will go away at that point. Maybe we can live with it until then? Andreas, Oleg, what is your idea?&lt;/p&gt;
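
&lt;p&gt;A minimal sketch of such a throttle (illustrative only, not the landed patch), assuming the cl_destroy_in_flight counter, cl_destroy_waitq wait queue and cl_max_rpcs_in_flight limit that appear in the osc code quoted later in this ticket; the two helper names are hypothetical:&lt;/p&gt;

&lt;p&gt;/* Illustrative sketch: cap in-flight destroy RPCs on the MDT-&amp;gt;OST path. */&lt;br/&gt;
static int destroy_in_flight_below_limit(struct client_obd *cli)&lt;br/&gt;
{&lt;br/&gt;
        /* cl_destroy_in_flight counts destroys already on the wire;&lt;br/&gt;
         * cl_max_rpcs_in_flight is the per-import RPC cap. */&lt;br/&gt;
        return atomic_read(&amp;amp;cli-&amp;gt;cl_destroy_in_flight) &amp;lt;&lt;br/&gt;
               cli-&amp;gt;cl_max_rpcs_in_flight;&lt;br/&gt;
}&lt;br/&gt;
&lt;br/&gt;
/* Before queueing another OST_DESTROY, wait until an earlier destroy&lt;br/&gt;
 * completes and wakes cl_destroy_waitq, then claim a slot. */&lt;br/&gt;
static void wait_for_destroy_slot(struct client_obd *cli)&lt;br/&gt;
{&lt;br/&gt;
        wait_event(cli-&amp;gt;cl_destroy_waitq,&lt;br/&gt;
                   destroy_in_flight_below_limit(cli));&lt;br/&gt;
        atomic_inc(&amp;amp;cli-&amp;gt;cl_destroy_in_flight);&lt;br/&gt;
}&lt;/p&gt;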


&lt;p&gt;Thanks&lt;/p&gt;</comment>
                            <comment id="39237" author="adilger" created="Tue, 22 May 2012 17:10:31 +0000"  >&lt;p&gt;Di,&lt;br/&gt;
there should be some form of RPC rate limiting for the MDS-&amp;gt;OSS communications.  A similar problem was hit in the past, and the MDS OSCs were put under the max_rpcs_in_flight (8) limit at that time.  Is there some reason why this code is bypassing that limit?  I can understand that 8 RPCs in flight is too few, but the solution is to increase the MDS OSC max_rpcs_in_flight limit (e.g. =50) instead of bypassing the limit entirely.&lt;/p&gt;</comment>
                            <comment id="39245" author="di.wang" created="Tue, 22 May 2012 17:23:02 +0000"  >&lt;p&gt;Andreas, I do not understand why destroy RPCs between MDS and OST bypassing that limit control neither. &lt;/p&gt;

&lt;p&gt;osc_destroy()&lt;br/&gt;
{&lt;br/&gt;
        .....................&lt;br/&gt;
&lt;br/&gt;
        /* don&apos;t throttle destroy RPCs for the MDT */&lt;br/&gt;
        if (!(cli-&amp;gt;cl_import-&amp;gt;imp_connect_flags_orig &amp;amp; OBD_CONNECT_MDS)) {&lt;br/&gt;
                req-&amp;gt;rq_interpret_reply = osc_destroy_interpret;&lt;br/&gt;
                if (!osc_can_send_destroy(cli)) {&lt;br/&gt;
                        struct l_wait_info lwi = LWI_INTR(LWI_ON_SIGNAL_NOOP,&lt;br/&gt;
                                                          NULL);&lt;br/&gt;
&lt;br/&gt;
                        /*&lt;br/&gt;
                         * Wait until the number of on-going destroy RPCs drops&lt;br/&gt;
                         * under max_rpc_in_flight&lt;br/&gt;
                         */&lt;br/&gt;
                        l_wait_event_exclusive(cli-&amp;gt;cl_destroy_waitq,&lt;br/&gt;
                                               osc_can_send_destroy(cli), &amp;amp;lwi);&lt;br/&gt;
                }&lt;br/&gt;
        }&lt;br/&gt;
&lt;br/&gt;
        ....................&lt;br/&gt;
}&lt;/p&gt;

&lt;p&gt;Hmm, according to the blame log, it seems to be because if we throttle destroy RPCs here, abort recovery will take a very long time (bug 18049). But increasing max_rpcs_in_flight to 50 might work here. I will cook a patch. Thanks.&lt;/p&gt;
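
&lt;p&gt;A minimal sketch of where such a bump might go (illustrative only; the helper name and the placement are assumptions, not the actual patch), reusing the OBD_CONNECT_MDS flag from the code above to recognise the MDT&apos;s own OSC imports:&lt;/p&gt;

&lt;p&gt;/* Illustrative sketch: give MDS-&amp;gt;OST imports a larger RPC window than&lt;br/&gt;
 * the default of 8, as suggested in this ticket. */&lt;br/&gt;
#define MDS_OSC_MAX_RIF 50&lt;br/&gt;
&lt;br/&gt;
static void bump_mds_osc_rpcs_in_flight(struct client_obd *cli)&lt;br/&gt;
{&lt;br/&gt;
        if (cli-&amp;gt;cl_import-&amp;gt;imp_connect_flags_orig &amp;amp; OBD_CONNECT_MDS)&lt;br/&gt;
                cli-&amp;gt;cl_max_rpcs_in_flight = MDS_OSC_MAX_RIF;&lt;br/&gt;
}&lt;/p&gt;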

</comment>
                            <comment id="39350" author="di.wang" created="Thu, 24 May 2012 17:29:49 +0000"  >&lt;p&gt;&lt;a href=&quot;http://review.whamcloud.com/#change,2899&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#change,2899&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="39454" author="bobijam" created="Mon, 28 May 2012 07:54:01 +0000"  >&lt;p&gt;according to &lt;a href=&quot;https://bugzilla.lustre.org/show_bug.cgi?id=16006#c93&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://bugzilla.lustre.org/show_bug.cgi?id=16006#c93&lt;/a&gt; description&lt;/p&gt;

&lt;p&gt;Client eviction can make the call path of a ptlrpcd thread go through osc_destroy, and if the RPC throttle makes that ptlrpcd wait for old RPCs to complete while the same ptlrpcd is responsible for interpreting the replies, a deadlock ensues.&lt;/p&gt;

&lt;p&gt;Thus just adding a throttle on OST destroy RPCs is not enough; the above scenario needs to be taken care of. That&apos;s the reason the code has &quot;if (!(cli-&amp;gt;cl_import-&amp;gt;imp_connect_flags_orig &amp;amp; OBD_CONNECT_MDS))&quot; there.&lt;/p&gt;
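
&lt;p&gt;A minimal sketch of the kind of guard this implies (illustrative only; thread_is_ptlrpcd() is a hypothetical helper, not an existing Lustre function): only block for a destroy slot when the caller is not a ptlrpcd thread, so reply handling can keep making progress.&lt;/p&gt;

&lt;p&gt;/* Illustrative sketch: never block a ptlrpcd thread on the destroy&lt;br/&gt;
 * throttle, since that thread must stay free to interpret the replies&lt;br/&gt;
 * that would release destroy slots.  thread_is_ptlrpcd() is hypothetical. */&lt;br/&gt;
static void maybe_wait_for_destroy_slot(struct client_obd *cli)&lt;br/&gt;
{&lt;br/&gt;
        if (thread_is_ptlrpcd(current))&lt;br/&gt;
                return;&lt;br/&gt;
&lt;br/&gt;
        wait_event(cli-&amp;gt;cl_destroy_waitq,&lt;br/&gt;
                   atomic_read(&amp;amp;cli-&amp;gt;cl_destroy_in_flight) &amp;lt;&lt;br/&gt;
                   cli-&amp;gt;cl_max_rpcs_in_flight);&lt;br/&gt;
}&lt;/p&gt;</comment>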
                            <comment id="39780" author="di.wang" created="Thu, 31 May 2012 19:34:08 +0000"  >&lt;p&gt;hmm, you mean the same ptlrpcd is responsible for decreasing the cl_destroy_in_flight? Why those threads which already sent out &quot;destroy RPC&quot; will descrease it?  they are blocked too in this situation?&lt;/p&gt;</comment>
                            <comment id="39790" author="di.wang" created="Fri, 1 Jun 2012 01:56:28 +0000"  >&lt;p&gt;Bobi just advised me with this trace from bug 16006&lt;/p&gt;

&lt;p&gt;Call Trace:&lt;br/&gt;
       &amp;lt;ffffffffa0539d2d&amp;gt;{:osc:osc_destroy+1693} &amp;lt;ffffffff80133824&amp;gt;{default_wake_function+0}&lt;br/&gt;
       &amp;lt;ffffffffa05c36f4&amp;gt;{:lov:lov_prep_destroy_set+1780}&lt;br/&gt;
       &amp;lt;ffffffffa05aa13e&amp;gt;{:lov:lov_destroy+1550} &amp;lt;ffffffffa05b7c4e&amp;gt;{:lov:lov_alloc_memmd+206}&lt;br/&gt;
       &amp;lt;ffffffffa05cc5dc&amp;gt;{:lov:lsm_unpackmd_plain+28} &amp;lt;ffffffffa05b8404&amp;gt;{:lov:lov_unpackmd+1124}&lt;br/&gt;
       &amp;lt;ffffffff8018ed9e&amp;gt;{dput+55} &amp;lt;ffffffffa06c6178&amp;gt;{:mds:mds_osc_destroy_orphan+3672}&lt;br/&gt;
       &amp;lt;ffffffff801eba5d&amp;gt;{__up_read+16} &amp;lt;ffffffff8030e026&amp;gt;{__down_write+52}&lt;br/&gt;
       &amp;lt;ffffffffa06d5aca&amp;gt;{:mds:mds_destroy_export+2090} &amp;lt;ffffffffa04a3964&amp;gt;{:ptlrpc:ptlrpc_put_connection+484}&lt;br/&gt;
       &amp;lt;ffffffffa04163ce&amp;gt;{:obdclass:class_decref+62} &amp;lt;ffffffffa040479d&amp;gt;{:obdclass:class_export_destroy+381}&lt;br/&gt;
       &amp;lt;ffffffffa0408206&amp;gt;{:obdclass:obd_zombie_impexp_cull+150}&lt;br/&gt;
       &amp;lt;ffffffffa04cda35&amp;gt;{:ptlrpc:ptlrpcd_check+229} &amp;lt;ffffffffa04cdeea&amp;gt;{:ptlrpc:ptlrpcd+810}&lt;br/&gt;
       &amp;lt;ffffffff80133824&amp;gt;{default_wake_function+0} &amp;lt;ffffffffa049d8a0&amp;gt;{:ptlrpc:ptlrpc_expired_set+0}&lt;br/&gt;
       &amp;lt;ffffffffa049d8a0&amp;gt;{:ptlrpc:ptlrpc_expired_set+0} &amp;lt;ffffffff80110de3&amp;gt;{child_rip+8}&lt;br/&gt;
       &amp;lt;ffffffffa04cdbc0&amp;gt;{:ptlrpc:ptlrpcd+0} &amp;lt;ffffffff80110ddb&amp;gt;{child_rip+0}&lt;/p&gt;

&lt;p&gt;[9:42:43 PM PDT] Bobi Jam: osc_destroy()-&amp;gt;osc_can_send_destroy()&lt;/p&gt;

&lt;p&gt;So osc_destroy might be called in the middle of ptlrpcd, and we should not block the ptlrpcd thread in this case. I will update my patch. Thanks.&lt;/p&gt;</comment>
                            <comment id="42444" author="pjones" created="Mon, 30 Jul 2012 12:28:48 +0000"  >&lt;p&gt;Patch landed for 2.3. Please reopen if further work needed.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzv6fr:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>4577</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>