<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:37:24 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-3844] Double recovery period in 1.8.9 after OSS failure</title>
                <link>https://jira.whamcloud.com/browse/LU-3844</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;We frequently observe that the recovery process repeats itself after the recovery timer expires.  Often the first timer expires because a client has died, so recovery then runs through the whole recovery period a second time.&lt;/p&gt;

&lt;p&gt;In this example, recovery took 60 minutes rather than the better case of 30.  Does this fall under the case implied by the wording &apos;Will be in recovery for at least 30:00&apos;?  During the failure, several OSSes had to be rebooted.&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;blakec@widow-mgmt3 ~&amp;#93;&lt;/span&gt;$ tail -10000 /data/log/apps/lustrekernel|grep oss12a2|grep -i recov|grep widow1-OST00b5:&lt;br/&gt;
Aug 27 14:16:46 widow-oss12a2 kernel: [  544.793348] Lustre: widow1-OST00b5: Now serving widow1-OST00b5 on /dev/mpath/widow-ddn12a-l48 with recovery enabled&lt;br/&gt;
Aug 27 14:16:46 widow-oss12a2 kernel: [  544.793363] Lustre: widow1-OST00b5: Will be in recovery for at least 30:00, or until 12345 clients reconnect&lt;br/&gt;
Aug 27 14:17:13 widow-oss12a2 kernel: [  572.156957] LustreError: 16171:0:(ldlm_lib.c:946:target_handle_connect()) widow1-OST00b5: denying connection for new client 172.30.221.31@o2ib3 (66a3b9fa-4a34-e8cb-c90e-1c98fa38f2e6): 12336 clients in recovery for 1772s&lt;br/&gt;
Aug 27 14:19:01 widow-oss12a2 kernel: [  679.468170] LustreError: 15390:0:(ldlm_lib.c:946:target_handle_connect()) widow1-OST00b5: denying connection for new client 172.30.221.25@o2ib3 (e65b92af-bef9-b8b2-5069-d8894f06fb19): 12294 clients in recovery for 1665s&lt;br/&gt;
Aug 27 14:55:51 widow-oss12a2 kernel: [ 2885.008362] LustreError: 16132:0:(ldlm_lib.c:946:target_handle_connect()) widow1-OST00b5: denying connection for new client 172.30.221.18@o2ib3 (5ac80ce4-58aa-16fd-8c53-889df4ba3118): 12174 clients in recovery for 1254s&lt;br/&gt;
Aug 27 14:59:35 widow-oss12a2 kernel: [ 3108.624936] Lustre: 15386:0:(ldlm_lib.c:1817:target_queue_last_replay_reply()) widow1-OST00b5: 12173 recoverable clients remain&lt;br/&gt;
Aug 27 15:16:46 widow-oss12a2 kernel: [ 4137.296402] Lustre: widow1-OST00b5: Recovery period over after 60:00, of 12345 clients 12342 recovered and 2 were evicted.&lt;br/&gt;
Aug 27 15:16:46 widow-oss12a2 kernel: [ 4137.296415] Lustre: widow1-OST00b5: sending delayed replies to recovered clients&lt;/p&gt;

&lt;p&gt;In an earlier case, only the MDS failed, and recovery finished in less than 30 minutes because all clients reconnected. I&apos;m sure there are cases where recovery takes just 30 minutes, but now, when it reaches 30 minutes, it usually goes on to 60 minutes.&lt;/p&gt;

&lt;p&gt;obd_timeout is set to 600&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;root@widow-oss10a1 ~&amp;#93;&lt;/span&gt;# cat /proc/sys/lustre/timeout  &lt;br/&gt;
600&lt;/p&gt;

&lt;p&gt;ldlm timeout is set to 200&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;root@widow-oss10a1 ~&amp;#93;&lt;/span&gt;# cat /proc/sys/lustre/ldlm_timeout &lt;br/&gt;
200&lt;/p&gt;
</description>
                <environment>kernel 2.6.18-348.3.1.el5, rhel5.9, distribution-provided ofed, gni behind o2iblnd routers</environment>
        <key id="20641">LU-3844</key>
            <summary>Double recovery period in 1.8.9 after OSS failure</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="6" iconUrl="https://jira.whamcloud.com/images/icons/statuses/closed.png" description="The issue is considered finished, the resolution is correct. Issues which are closed can be reopened.">Closed</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="hongchao.zhang">Hongchao Zhang</assignee>
                                    <reporter username="blakecaldwell">Blake Caldwell</reporter>
                        <labels>
                    </labels>
                <created>Tue, 27 Aug 2013 19:39:41 +0000</created>
                <updated>Wed, 28 Aug 2013 21:46:00 +0000</updated>
                            <resolved>Wed, 28 Aug 2013 21:45:59 +0000</resolved>
                                    <version>Lustre 1.8.9</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>4</watches>
                                                                            <comments>
                            <comment id="65203" author="jamesanunez" created="Tue, 27 Aug 2013 20:32:25 +0000"  >&lt;p&gt;Hongchao, &lt;/p&gt;

&lt;p&gt;Would you please comment on this one?&lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
James&lt;/p&gt;</comment>
                            <comment id="65235" author="hongchao.zhang" created="Wed, 28 Aug 2013 11:24:47 +0000"  >&lt;p&gt;Hi Blake,&lt;/p&gt;

&lt;p&gt;The extra recovery period is caused by VBR (version-based recovery).&lt;/p&gt;

&lt;p&gt;In target_recovery_check_and_stop, after the first recovery period (3*obd_timeout = 1800s = 30m) has expired, obd_device-&amp;gt;obd_version_recov is set and the extra recovery period (30m) is started by calling &quot;reset_recovery_timer&quot;.&lt;/p&gt;
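
&lt;p&gt;The timing above can be checked with a small sketch. This is illustrative Python, not Lustre code: the function name recovery_windows and its parameters are hypothetical, the factor of 3 comes from the &quot;3*obd_timeout&quot; figure above, and the extension is assumed to equal the first window, as the 30m figure above suggests. The start time is taken from the &quot;Will be in recovery&quot; log line in the description.&lt;/p&gt;

```python
from datetime import datetime, timedelta

OBD_TIMEOUT_S = 600    # value of /proc/sys/lustre/timeout in the report
RECOVERY_FACTOR = 3    # "3*obd_timeout", per the explanation above


def recovery_windows(obd_timeout_s, factor=RECOVERY_FACTOR, vbr_extension=True):
    """Return (first_window_s, extra_window_s) for a hypothetical target.

    Assumption: when the first window expires with clients still missing,
    the timer is reset once to a window of the same length.
    """
    first = obd_timeout_s * factor
    extra = first if vbr_extension else 0
    return first, extra


first, extra = recovery_windows(OBD_TIMEOUT_S)

# Start of recovery, from the "Will be in recovery for at least 30:00" line.
start = datetime(2013, 8, 27, 14, 16, 46)
end = start + timedelta(seconds=first + extra)

print(first, extra, end.time())
```

&lt;p&gt;With obd_timeout = 600 this gives two 1800s windows ending at 15:16:46, which matches the &quot;Recovery period over after 60:00&quot; log line.&lt;/p&gt;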

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;&lt;span class=&quot;code-object&quot;&gt;int&lt;/span&gt; target_recovery_check_and_stop(struct obd_device *obd)
{
        &lt;span class=&quot;code-object&quot;&gt;int&lt;/span&gt; abort_recovery = 0;
                
        &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (obd-&amp;gt;obd_stopping || !obd-&amp;gt;obd_recovering)
                &lt;span class=&quot;code-keyword&quot;&gt;return&lt;/span&gt; 1;
                
        spin_lock_bh(&amp;amp;obd-&amp;gt;obd_processing_task_lock);
        abort_recovery = obd-&amp;gt;obd_abort_recovery;
        obd-&amp;gt;obd_abort_recovery = 0;
        spin_unlock_bh(&amp;amp;obd-&amp;gt;obd_processing_task_lock);
        &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (!abort_recovery)
                &lt;span class=&quot;code-keyword&quot;&gt;return&lt;/span&gt; 0;
        &lt;span class=&quot;code-comment&quot;&gt;/** check &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; fs version-capable */&lt;/span&gt;
        &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (target_fs_version_capable(obd)) {
                class_handle_stale_exports(obd);
        } &lt;span class=&quot;code-keyword&quot;&gt;else&lt;/span&gt; {
                CWARN(&lt;span class=&quot;code-quote&quot;&gt;&quot;Versions are not supported by ldiskfs, VBR is OFF\n&quot;&lt;/span&gt;);
                class_disconnect_stale_exports(obd, exp_flags_from_obd(obd));
        }
        &lt;span class=&quot;code-comment&quot;&gt;/* VBR: no clients are remained to replay, stop recovery */&lt;/span&gt;
        spin_lock_bh(&amp;amp;obd-&amp;gt;obd_processing_task_lock);
        &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (obd-&amp;gt;obd_recovering &amp;amp;&amp;amp; obd-&amp;gt;obd_recoverable_clients == 0) {
                spin_unlock_bh(&amp;amp;obd-&amp;gt;obd_processing_task_lock);
                target_stop_recovery(obd, 0);
                &lt;span class=&quot;code-keyword&quot;&gt;return&lt;/span&gt; 1;
        }
        &lt;span class=&quot;code-comment&quot;&gt;/* always check versions now */&lt;/span&gt;
        obd-&amp;gt;obd_version_recov = 1;
        cfs_waitq_signal(&amp;amp;obd-&amp;gt;obd_next_transno_waitq);
        spin_unlock_bh(&amp;amp;obd-&amp;gt;obd_processing_task_lock);
        &lt;span class=&quot;code-comment&quot;&gt;/* reset timer, recovery will proceed with versions now */&lt;/span&gt;
        reset_recovery_timer(obd, OBD_RECOVERY_TIME_SOFT, 1);
        &lt;span class=&quot;code-keyword&quot;&gt;return&lt;/span&gt; 0;
}
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="65286" author="blakecaldwell" created="Wed, 28 Aug 2013 17:51:19 +0000"  >&lt;p&gt;Thanks much for the explanation!&lt;/p&gt;</comment>
                            <comment id="65310" author="jamesanunez" created="Wed, 28 Aug 2013 20:54:07 +0000"  >&lt;p&gt;Blake, &lt;/p&gt;

&lt;p&gt;Is there anything else we need to do under this ticket or should we close it?&lt;/p&gt;

&lt;p&gt;Thanks, &lt;br/&gt;
James&lt;/p&gt;</comment>
                            <comment id="65312" author="blakecaldwell" created="Wed, 28 Aug 2013 21:00:41 +0000"  >&lt;p&gt;This can be closed.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzvze7:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9949</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>