<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:57:05 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-6084] Tests are failed due to &apos;recovery is aborted by hard timeout&apos;</title>
                <link>https://jira.whamcloud.com/browse/LU-6084</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Many recovery tests have started to fail because of an unexpected recovery abort due to the hard timeout.&lt;/p&gt;

&lt;p&gt;This issue relates to the following test suite run: &lt;a href=&quot;https://testing.hpdd.intel.com/test_sets/722432b2-80fa-11e4-9c9a-5254006e85c2&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.hpdd.intel.com/test_sets/722432b2-80fa-11e4-9c9a-5254006e85c2&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The sub-test test_4k failed with the following error:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;onyx-35vm1.onyx.hpdd.intel.com evicted
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;MDS dmesg&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Lustre: lustre-MDT0000: Denying connection for new client lustre-MDT0000-lwp-OST0000_UUID (at 10.2.4.141@tcp), waiting for all 6 known clients (0 recovered, 5 in progress, and 0 evicted) to recover in 0:25
Lustre: Skipped 90 previous similar messages
INFO: task tgt_recov:2119 blocked for more than 120 seconds.
      Not tainted 2.6.32-431.29.2.el6_lustre.x86_64 #1
&quot;echo 0 &amp;gt; /proc/sys/kernel/hung_task_timeout_secs&quot; disables this message.
tgt_recov     D 0000000000000000     0  2119      2 0x00000080
 ffff88006fb2fda0 0000000000000046 0000000000000000 ffff880002316880
 ffff88006fb2fd10 ffffffff81030b59 ffff88006fb2fd20 ffffffff810554f8
 ffff88006faad058 ffff88006fb2ffd8 000000000000fbc8 ffff88006faad058
Call Trace:
 [&amp;lt;ffffffff81030b59&amp;gt;] ? native_smp_send_reschedule+0x49/0x60
 [&amp;lt;ffffffff810554f8&amp;gt;] ? resched_task+0x68/0x80
 [&amp;lt;ffffffff8109b2ce&amp;gt;] ? prepare_to_wait+0x4e/0x80
 [&amp;lt;ffffffffa080d9c0&amp;gt;] ? check_for_clients+0x0/0x70 [ptlrpc]
 [&amp;lt;ffffffffa080ef2d&amp;gt;] target_recovery_overseer+0xad/0x2d0 [ptlrpc]
 [&amp;lt;ffffffffa080d610&amp;gt;] ? exp_connect_healthy+0x0/0x20 [ptlrpc]
 [&amp;lt;ffffffff8109afa0&amp;gt;] ? autoremove_wake_function+0x0/0x40
 [&amp;lt;ffffffffa0815850&amp;gt;] ? target_recovery_thread+0x0/0x1a20 [ptlrpc]
 [&amp;lt;ffffffffa0815f34&amp;gt;] target_recovery_thread+0x6e4/0x1a20 [ptlrpc]
 [&amp;lt;ffffffff81061d12&amp;gt;] ? default_wake_function+0x12/0x20
 [&amp;lt;ffffffffa0815850&amp;gt;] ? target_recovery_thread+0x0/0x1a20 [ptlrpc]
 [&amp;lt;ffffffff8109abf6&amp;gt;] kthread+0x96/0xa0
 [&amp;lt;ffffffff8100c20a&amp;gt;] child_rip+0xa/0x20
 [&amp;lt;ffffffff8109ab60&amp;gt;] ? kthread+0x0/0xa0
 [&amp;lt;ffffffff8100c200&amp;gt;] ? child_rip+0x0/0x20
Lustre: lustre-MDT0000: recovery is timed out, evict stale exports
Lustre: lustre-MDT0000: disconnecting 1 stale clients
Lustre: 2119:0:(ldlm_lib.c:1767:target_recovery_overseer()) recovery is aborted by hard timeout
Lustre: 2119:0:(ldlm_lib.c:1773:target_recovery_overseer()) recovery is aborted, evict exports in recovery
Lustre: 2119:0:(ldlm_lib.c:1415:abort_req_replay_queue()) @@@ aborted:  req@ffff880079bf6980 x1487142925659804/t0(38654705688) o36-&amp;gt;c0baea22-119d-b8af-1550-c0592a66b0c4@10.2.4.138@tcp:277/0 lens 520/0 e 0 to 0 dl 1418252677 ref 1 fl Complete:/4/ffffffff rc 0/-1
Lustre: lustre-MDT0000: Recovery over after 3:00, of 6 clients 0 recovered and 6 were evicted.
Lustre: DEBUG MARKER: /usr/sbin/lctl mark  replay-vbr test_4k: @@@@@@ FAIL: onyx-35vm1.onyx.hpdd.intel.com evicted 
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</description>
                <environment>lustre-master build #2770</environment>
        <key id="28058">LU-6084</key>
            <summary>Tests are failed due to &apos;recovery is aborted by hard timeout&apos;</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="2" iconUrl="https://jira.whamcloud.com/images/icons/priorities/critical.svg">Critical</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="tappro">Mikhail Pershin</assignee>
                                    <reporter username="maloo">Maloo</reporter>
                        <labels>
                            <label>HB</label>
                    </labels>
                <created>Tue, 6 Jan 2015 18:14:28 +0000</created>
                <updated>Wed, 11 Mar 2015 12:30:49 +0000</updated>
                            <resolved>Mon, 9 Feb 2015 03:02:42 +0000</resolved>
                                    <version>Lustre 2.7.0</version>
                                    <fixVersion>Lustre 2.7.0</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>10</watches>
                                                                            <comments>
                            <comment id="102659" author="green" created="Tue, 6 Jan 2015 18:28:30 +0000"  >&lt;p&gt;So the backtrace has nothing to do with this bug. It&apos;s just a known harmless message due to recovery taking kind of long.&lt;/p&gt;</comment>
                            <comment id="103349" author="jlevi" created="Tue, 13 Jan 2015 18:08:32 +0000"  >&lt;p&gt;Mike,&lt;br/&gt;
Could you please comment on this one?&lt;br/&gt;
Thank you!&lt;/p&gt;</comment>
                            <comment id="103503" author="tappro" created="Wed, 14 Jan 2015 18:39:26 +0000"  >&lt;p&gt;15:08:55:Lustre: 2135:0:(ldlm_lib.c:1767:target_recovery_overseer()) recovery is aborted by hard timeout&lt;br/&gt;
15:08:55:Lustre: 2135:0:(ldlm_lib.c:1773:target_recovery_overseer()) recovery is aborted, evict exports in recovery&lt;/p&gt;

&lt;p&gt;That means the test does not work as expected. The test (like many others in replay-vbr) removes one client during recovery, simulating a missed client. It expects recovery to time out and continue with VBR for the existing clients, then checks whether an eviction occurred or not. But here we see that recovery was simply aborted by the hard timeout, so all clients were evicted regardless of whether they could have proceeded with recovery. &lt;/p&gt;

&lt;p&gt;Therefore the replay-vbr tests are broken in that sense; we must set the hard timeout to a bigger value to let recovery proceed as expected. I also suspect we have the same situation as with the replay-dual tests and the patch for &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5079&quot; title=&quot;conf-sanity test_47 timeout&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5079&quot;&gt;&lt;del&gt;LU-5079&lt;/del&gt;&lt;/a&gt;, which may cause recovery timeout extension when recoveries happen one after another several times, as they do in replay-vbr.sh &lt;/p&gt;

&lt;p&gt;I know Yu Jian made related changes to replay-dual; the same should be done for replay-vbr, along with disabling hard_timeout or raising it to a bigger value&lt;/p&gt;</comment>
                            <comment id="103824" author="gerrit" created="Sun, 18 Jan 2015 13:15:41 +0000"  >&lt;p&gt;Mike Pershin (mike.pershin@intel.com) uploaded a new patch: &lt;a href=&quot;http://review.whamcloud.com/13447&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/13447&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6084&quot; title=&quot;Tests are failed due to &amp;#39;recovery is aborted by hard timeout&amp;#39;&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6084&quot;&gt;&lt;del&gt;LU-6084&lt;/del&gt;&lt;/a&gt; recovery: use soft timeout to limit recovery timer&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 8447952e0056294e24fa1da98c6dbb0d9e149324&lt;/p&gt;</comment>
                            <comment id="103836" author="tappro" created="Mon, 19 Jan 2015 06:58:24 +0000"  >&lt;p&gt;The cause is commit df89c74a320278acac7466a83393af6abd99932b from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4119&quot; title=&quot;recovery time hard doesn&amp;#39;t limit recovery duration&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4119&quot;&gt;&lt;del&gt;LU-4119&lt;/del&gt;&lt;/a&gt;, &quot;ldlm: abort recovery by time_hard&quot;&lt;/p&gt;

&lt;p&gt;The commit itself is correct, but the extend_recovery_timer() code originally capped the recovery timeout at time_hard. Previously that was not a problem: recovery was never aborted by the hard timeout, so the timer could be extended several times during recovery even with the hard-timeout limit.&lt;br/&gt;
After that patch, the whole recovery is limited by the hard timeout and aborted when it expires. That means a single timer extension up to the hard timeout can consume the entire recovery window just waiting for missed clients, after which recovery is aborted outright. Meanwhile, the timer is set at least 3 times during recovery:&lt;br/&gt;
1. waiting for clients to connect&lt;br/&gt;
2. waiting for clients to replay requests&lt;br/&gt;
3. waiting for lock replays&lt;/p&gt;

&lt;p&gt;The patch changes the extend_recovery_timer() logic to limit each extension to (hard timeout / 3), so the timer can expire 3 times during a single recovery without reaching the hard limit, and recovery can finish softly with VBR.&lt;/p&gt;

&lt;p&gt;One thing to consider: of these 3 recovery steps, the first is the most important and is expected to last the longest because of client reconnections. The other two, request replay and lock replay, usually complete quite quickly; the timer is needed there only for the case where some already-connected clients have died. We might therefore use a (hard_timeout / 2) limit, on the basis that half of the recovery time can be spent waiting for reconnections and the other half on recovery completion. That would reduce evictions of clients that miss the connection stage.&lt;/p&gt;</comment>
                            <comment id="104086" author="simmonsja" created="Tue, 20 Jan 2015 21:12:10 +0000"  >&lt;p&gt;Will this be needed along with &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4119&quot; title=&quot;recovery time hard doesn&amp;#39;t limit recovery duration&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4119&quot;&gt;&lt;del&gt;LU-4119&lt;/del&gt;&lt;/a&gt; for b2_5?&lt;/p&gt;</comment>
                            <comment id="104173" author="tappro" created="Wed, 21 Jan 2015 14:28:50 +0000"  >&lt;p&gt;James, yes, I think so.&lt;/p&gt;</comment>
                            <comment id="104616" author="gerrit" created="Sat, 24 Jan 2015 13:35:07 +0000"  >&lt;p&gt;Mike Pershin (mike.pershin@intel.com) uploaded a new patch: &lt;a href=&quot;http://review.whamcloud.com/13520&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/13520&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6084&quot; title=&quot;Tests are failed due to &amp;#39;recovery is aborted by hard timeout&amp;#39;&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6084&quot;&gt;&lt;del&gt;LU-6084&lt;/del&gt;&lt;/a&gt; ptlrpc: don&apos;t remember server stat from early reply&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: d69461cb7aca7400f19bae8a244f473aaa3d3760&lt;/p&gt;</comment>
                            <comment id="104617" author="tappro" created="Sat, 24 Jan 2015 13:43:53 +0000"  >&lt;p&gt;The second patch covers another issue, which causes the request timeout to grow and, as a result, the recovery time to grow as well. Normally the server does not return its request processing time to the client for recovery requests, so as not to pollute the AT history with bogus data. But besides the normal reply there can be an early reply, which also passes the server processing time to the client, causing the same issue.&lt;/p&gt;</comment>
                            <comment id="104843" author="tappro" created="Tue, 27 Jan 2015 15:20:15 +0000"  >&lt;p&gt;The first patch has been abandoned; the check for recovery_time_hard was in fact correct, and the problem is not there. The second patch has been updated and should resolve the issue. It prevents the AT stats from being updated with data from recovery requests, on both the server and the client. &lt;/p&gt;</comment>
                            <comment id="106219" author="gerrit" created="Mon, 9 Feb 2015 02:40:43 +0000"  >&lt;p&gt;Oleg Drokin (oleg.drokin@intel.com) merged in patch &lt;a href=&quot;http://review.whamcloud.com/13520/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/13520/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6084&quot; title=&quot;Tests are failed due to &amp;#39;recovery is aborted by hard timeout&amp;#39;&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6084&quot;&gt;&lt;del&gt;LU-6084&lt;/del&gt;&lt;/a&gt; ptlrpc: prevent request timeout grow due to recovery&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 84f813bf639a7d078e19a3cf41f7c06a3824caa9&lt;/p&gt;</comment>
                            <comment id="106220" author="pjones" created="Mon, 9 Feb 2015 03:02:42 +0000"  >&lt;p&gt;Landed for 2.7&lt;/p&gt;</comment>
                            <comment id="106227" author="gerrit" created="Mon, 9 Feb 2015 06:27:35 +0000"  >&lt;p&gt;Jian Yu (jian.yu@intel.com) uploaded a new patch: &lt;a href=&quot;http://review.whamcloud.com/13685&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/13685&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6084&quot; title=&quot;Tests are failed due to &amp;#39;recovery is aborted by hard timeout&amp;#39;&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6084&quot;&gt;&lt;del&gt;LU-6084&lt;/del&gt;&lt;/a&gt; ptlrpc: prevent request timeout grow due to recovery&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_5&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 2b1cb3b7dfa56fe1a67508b1113ee64453b5bab3&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="21496">LU-4119</issuekey>
        </issuelink>
                            </outwardlinks>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="27988">LU-6057</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="28684">LU-6238</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzx39j:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>16931</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>