<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:41:11 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-11128] replay-single test timeout</title>
                <link>https://jira.whamcloud.com/browse/LU-11128</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;This issue was created by maloo for bobijam &amp;lt;bobijam@whamcloud.com&amp;gt;&lt;/p&gt;

&lt;p&gt;This issue relates to the following test suite run: &lt;a href=&quot;https://testing.whamcloud.com/test_sets/5c95f0b2-8186-11e8-b441-52540065bddc&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.whamcloud.com/test_sets/5c95f0b2-8186-11e8-b441-52540065bddc&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;test_115 failed with the following error:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Timeout occurred after 216 mins, last suite running was replay-single, restarting cluster to continue tests
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The MDS dmesg keeps showing the following error messages during several tests, and the test takes too long.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[ 2545.541360] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server.
[ 2545.571570] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.9.5.210@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
[ 2545.618732] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.9.5.212@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
[ 2545.618926] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.9.5.212@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
[ 2545.619112] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.9.5.212@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
...
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Another hit occurred at &lt;a href=&quot;https://testing.whamcloud.com/test_sets/08372d04-8188-11e8-97ff-52540065bddc&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.whamcloud.com/test_sets/08372d04-8188-11e8-97ff-52540065bddc&lt;/a&gt;&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;test_80c &apos;Timeout occurred after 159 mins, last suite running was replay-single, restarting cluster to continue tests&apos; 
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV&lt;br/&gt;
 replay-single test_115 - Timeout occurred after 216 mins, last suite running was replay-single, restarting cluster to continue tests&lt;br/&gt;
&#160;&lt;/p&gt;</description>
                <environment></environment>
        <key id="52659">LU-11128</key>
            <summary>replay-single test timeout</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="bzzz">Alex Zhuravlev</assignee>
                                    <reporter username="maloo">Maloo</reporter>
                        <labels>
                    </labels>
                <created>Sun, 8 Jul 2018 13:33:12 +0000</created>
                <updated>Tue, 28 May 2019 17:29:53 +0000</updated>
                            <resolved>Tue, 2 Oct 2018 21:55:51 +0000</resolved>
                                    <version>Lustre 2.12.0</version>
                                    <fixVersion>Lustre 2.12.0</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>5</watches>
                                                                            <comments>
                            <comment id="230369" author="adilger" created="Tue, 17 Jul 2018 18:52:55 +0000"  >&lt;p&gt;There are a significant number of test timeouts on replay-single (60 in the past 40 weeks), but they are spread across a large number of different subtests, so they may be harder to notice.  A subtest query for &lt;tt&gt;TIMEOUT&lt;/tt&gt; failures with the subtest field left blank will show this.&lt;/p&gt;</comment>
                            <comment id="232035" author="bzzz" created="Thu, 16 Aug 2018 09:59:13 +0000"  >&lt;p&gt;I&apos;m working on this as it seems to affect many patches I&apos;ve been working on and, more importantly, it seems to be related to&#160;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7236&quot; title=&quot;connections on demand&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7236&quot;&gt;&lt;del&gt;LU-7236&lt;/del&gt;&lt;/a&gt; (idle connections can disconnect).&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://review.whamcloud.com/#/c/32980/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/32980/&lt;/a&gt;&#160;is a patch to reproduce/debug the issue. So far it looks like a late DISCONNECT reply resets the just-reinitiated connection, which then gets stuck.&lt;/p&gt;</comment>
                            <comment id="232307" author="bzzz" created="Mon, 20 Aug 2018 17:07:25 +0000"  >&lt;p&gt;&lt;a href=&quot;https://review.whamcloud.com/#/c/32980/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/32980&lt;/a&gt;&#160;has passed many replay-single runs; I think it&apos;s ready for inspection and regular testing.&lt;/p&gt;</comment>
                            <comment id="233442" author="adilger" created="Thu, 13 Sep 2018 05:40:40 +0000"  >&lt;p&gt;One question I had about this failure - how short/long is the idle connection timeout? It seems like we shouldn&apos;t be getting so many timeouts in the middle of actively running tests. Is there some correlation between the tests failing with this issue and the length of time the client is idle?&lt;/p&gt;

&lt;p&gt;When you think about it, we don&apos;t want the connections to be dropping after only a few seconds of idle time, or we may get big reconnection storms if the system is still mostly in use, which will also hurt performance because of dropped grant and such. &lt;/p&gt;</comment>
                            <comment id="233443" author="bzzz" created="Thu, 13 Sep 2018 05:43:14 +0000"  >&lt;p&gt;20s by default&lt;/p&gt;</comment>
                            <comment id="233484" author="adilger" created="Thu, 13 Sep 2018 21:22:58 +0000"  >&lt;p&gt;I wonder if 20s is too short by default?  Especially in the case of large systems where there may be thousands of clients that have nearly identical behaviour (e.g. active/idle at the same time, though possibly to different OSTs).&lt;/p&gt;

&lt;p&gt;On the one hand, the 20s timeout is definitely good for finding issues with this code during testing, but I think the default should be longer (e.g. 60s or 300s) depending on how long it takes for a large number of clients to reconnect.  We could still set a shorter time in the test-framework to ensure the code continues to be tested.  For testing purposes, it might also make sense to have an option (e.g. &quot;&lt;tt&gt;lctl set_param osc.&amp;#42;.idle_timeout=debug&lt;/tt&gt;&quot; and &quot;&lt;tt&gt;...=nodebug&lt;/tt&gt;&quot; or similar) to print a message to the console when the client disconnects (e.g. &quot;&lt;tt&gt;testfs-OST0004: disconnect after 50s idle&lt;/tt&gt;&quot; and &quot;&lt;tt&gt;testfs-OST0004: reconnect after 650s idle&lt;/tt&gt;&quot; or similar) so that we can help debug problems related to this feature.  The console message should be enabled during testing.&lt;/p&gt;

&lt;p&gt;I see in the code that &lt;tt&gt;idle_timeout&lt;/tt&gt; has a maximum value of &lt;tt&gt;CONNECT_SWITCH_MAX = 50s&lt;/tt&gt;, which seems a bit short to me?  Is that because the &lt;tt&gt;OBD_PING&lt;/tt&gt; RPCs will keep the connection alive if it is longer than this?  What happens if &lt;tt&gt;ping_interval&lt;/tt&gt; (default 25s) is shorter than &lt;tt&gt;idle_timeout&lt;/tt&gt;?  Is that why the default idle_timeout is 20s?&lt;/p&gt;</comment>
                            <comment id="233510" author="gerrit" created="Fri, 14 Sep 2018 08:52:39 +0000"  >&lt;p&gt;Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/33168&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/33168&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11128&quot; title=&quot;replay-single test timeout&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11128&quot;&gt;&lt;del&gt;LU-11128&lt;/del&gt;&lt;/a&gt; ptlrpc: add debugging for idle connections&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 216c9c7cbcd38fa56ee2240c8a13066ad66b3f77&lt;/p&gt;</comment>
                            <comment id="233511" author="bzzz" created="Fri, 14 Sep 2018 09:28:29 +0000"  >&lt;p&gt;Andreas, I&apos;m fine with changing the defaults, and yes, one of the reasons to keep it short is to hit the code more frequently.&lt;/p&gt;

&lt;p&gt;A ping reply is not counted:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
&lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (lustre_msg_get_opc(req-&amp;gt;rq_reqmsg) != OBD_PING)
    req-&amp;gt;rq_import-&amp;gt;imp_last_reply_time = ktime_get_real_seconds();&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Then the check for idle:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
&lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (now - imp-&amp;gt;imp_last_reply_time &amp;lt; imp-&amp;gt;imp_idle_timeout)
    &lt;span class=&quot;code-keyword&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;code-keyword&quot;&gt;false&lt;/span&gt;;&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
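<!--
The two fragments quoted in comment 233511 can be read together as the sketch
below. It is a simplified userspace illustration, not the Lustre source: the
sketch_ names, the opcode enum, and the two-field struct are assumptions made
for this example. Only imp_last_reply_time, imp_idle_timeout, and the two
conditions come from the thread, and the real idle check may test further
conditions that the thread does not quote.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcodes for illustration; not the real Lustre values. */
enum sketch_opc { SKETCH_OPC_PING, SKETCH_OPC_OTHER };

/* Simplified stand-in for the import fields named in the thread. */
struct sketch_import {
	int64_t imp_last_reply_time; /* seconds; time of the last non-ping reply */
	int64_t imp_idle_timeout;    /* seconds */
};

/* Record a reply. Ping replies deliberately do not refresh the timestamp. */
static void sketch_reply_received(struct sketch_import *imp,
				  enum sketch_opc opc, int64_t now)
{
	assert(imp != NULL);
	if (opc != SKETCH_OPC_PING)
		imp->imp_last_reply_time = now;
}

/* True once no non-ping reply has arrived for imp_idle_timeout seconds. */
static bool sketch_import_is_idle(const struct sketch_import *imp, int64_t now)
{
	assert(imp != NULL);
	if (now - imp->imp_last_reply_time < imp->imp_idle_timeout)
		return false;
	return true;
}
```

In this model a ping reply leaves the timestamp untouched whatever the ping
interval is, so pings alone do not keep an otherwise idle import from being
considered idle.
-->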
                            <comment id="233569" author="adilger" created="Sat, 15 Sep 2018 05:15:04 +0000"  >&lt;p&gt;In that case, can you submit a patch to increase the default idle_timeout value and set it lower in the test config for testing?&lt;/p&gt;</comment>
                            <comment id="233854" author="gerrit" created="Fri, 21 Sep 2018 03:31:11 +0000"  >&lt;p&gt;Oleg Drokin (green@whamcloud.com) merged in patch &lt;a href=&quot;https://review.whamcloud.com/33168/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/33168/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11128&quot; title=&quot;replay-single test timeout&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11128&quot;&gt;&lt;del&gt;LU-11128&lt;/del&gt;&lt;/a&gt; ptlrpc: add debugging for idle connections&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 0aa58d26f5df2b71a040ed6f0f419b925528b6ad&lt;/p&gt;</comment>
                            <comment id="234245" author="gerrit" created="Tue, 2 Oct 2018 21:23:04 +0000"  >&lt;p&gt;Oleg Drokin (green@whamcloud.com) merged in patch &lt;a href=&quot;https://review.whamcloud.com/32980/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/32980/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11128&quot; title=&quot;replay-single test timeout&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11128&quot;&gt;&lt;del&gt;LU-11128&lt;/del&gt;&lt;/a&gt; ptlrpc: new request vs disconnect race&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 93d20d171c20491a96e5e85d7442a002f300619d&lt;/p&gt;</comment>
                            <comment id="234247" author="pjones" created="Tue, 2 Oct 2018 21:55:51 +0000"  >&lt;p&gt;Landed for 2.12&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10010">
                    <name>Duplicate</name>
                                                                <inwardlinks description="is duplicated by">
                                        <issuelink>
            <issuekey id="52653">LU-11126</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="52839">LU-11183</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="32398">LU-7236</issuekey>
        </issuelink>
                            </outwardlinks>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="53273">LU-11362</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="53086">LU-11269</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="53363">LU-11405</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzzyvr:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>