<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:30:56 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-9973] MDT recovery status never completed </title>
                <link>https://jira.whamcloud.com/browse/LU-9973</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Hi, &lt;/p&gt;

&lt;p&gt;When I re-mount the MDTs, each MDT tries to recover all of its clients, but the recovery status never reaches completion (even after several hours). The recovery process does not end even after it hits the hard limit (900). I can only abort it with &quot;lctl --device MDTxxx abort_recovery&quot;.&lt;br/&gt;
1. Can I abort the recovery process with &quot;mount.lustre -o abort_recov xx&quot;? I cannot execute the command successfully. (The command hangs.)&lt;br/&gt;
2. Can I evict all stalled clients to end the recovery process?&lt;br/&gt;
3. Are there any side effects to aborting the recovery process with &quot;lctl abort_recovery&quot;?&lt;/p&gt;
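
&lt;p&gt;For reference, a minimal sketch of the commands in question (the mount source/target paths and CLIENT_UUID below are placeholders, not my exact setup):&lt;/p&gt;

&lt;pre&gt;# watch recovery progress on the MDS
lctl get_param mdt.hpcfs-MDT0000.recovery_status
# abort recovery on an already-mounted target
lctl --device hpcfs-MDT0000 abort_recovery
# or skip recovery entirely at mount time
mount -t lustre -o abort_recov /dev/mapper/mdt0 /mnt/lustre/mdt0
# evict one stalled client by UUID (verify the exact parameter path on your version)
lctl set_param mdt.hpcfs-MDT0000.evict_client=CLIENT_UUID&lt;/pre&gt;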

&lt;p&gt;Thanks.&lt;/p&gt;


&lt;p&gt;[  570.019086] LNet: 14185:0:(o2iblnd_cb.c:3209:kiblnd_check_conns()) Timed out tx for 192.168.5.202@o2ib6: 7 seconds&lt;br/&gt;
[  622.166399] Lustre: 18740:0:(ldlm_lib.c:1784:extend_recovery_timer()) hpcfs-MDT0000: extended recovery timer reaching hard limit: 900, extend: 1&lt;br/&gt;
[  622.973445] Lustre: 26593:0:(ldlm_lib.c:1784:extend_recovery_timer()) hpcfs-MDT0001: extended recovery timer reaching hard limit: 900, extend: 1&lt;br/&gt;
[  622.973448] Lustre: 26593:0:(ldlm_lib.c:1784:extend_recovery_timer()) Skipped 2 previous similar messages&lt;br/&gt;
[  682.170260] Lustre: 18740:0:(ldlm_lib.c:1784:extend_recovery_timer()) hpcfs-MDT0000: extended recovery timer reaching hard limit: 900, extend: 1&lt;br/&gt;
[  682.170264] Lustre: 18740:0:(ldlm_lib.c:1784:extend_recovery_timer()) Skipped 2 previous similar messages&lt;br/&gt;
[  684.027392] LNet: 14185:0:(o2iblnd_cb.c:3209:kiblnd_check_conns()) Timed out tx for 192.168.7.202@o2ib8: 7 seconds&lt;br/&gt;
[  684.027397] LNet: 14185:0:(o2iblnd_cb.c:3209:kiblnd_check_conns()) Skipped 3 previous similar messages&lt;br/&gt;
[  693.713019] Lustre: 14236:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1505204590/real 0]  req@ffff887c5585ad00 x1578320959384656/t0(0) o38-&amp;gt;hpcfs-MDT0000-osp-MDT0001@192.168.8.202@o2ib9:24/4 lens 520/544 e 0 to 1 dl 1505204596 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1&lt;br/&gt;
[  693.713024] Lustre: 14236:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 4 previous similar messages&lt;br/&gt;
[  737.721925] LustreError: 11-0: hpcfs-OST0000-osc-MDT0001: operation ost_connect to node 192.168.17.212@o2ib18 failed: rc = -19&lt;/p&gt;</description>
                <environment></environment>
        <key id="48274">LU-9973</key>
            <summary>MDT recovery status never completed </summary>
                <type id="9" iconUrl="https://jira.whamcloud.com/images/icons/issuetypes/undefined.png">Question/Request</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="laisiyao">Lai Siyao</assignee>
                                    <reporter username="sebg-crd-pm">sebg-crd-pm</reporter>
                        <labels>
                    </labels>
                <created>Tue, 12 Sep 2017 11:20:45 +0000</created>
                <updated>Tue, 19 Sep 2017 00:13:55 +0000</updated>
                                            <version>Lustre 2.10.0</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>3</watches>
                                                                            <comments>
                            <comment id="208146" author="pjones" created="Tue, 12 Sep 2017 17:11:40 +0000"  >&lt;p&gt;Lai&lt;/p&gt;

&lt;p&gt;Could you please advise?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="208149" author="adilger" created="Tue, 12 Sep 2017 17:41:46 +0000"  >&lt;p&gt;It looks like you may be having some sort of communication problem between MDT0000 and MDT0001?  The MDS recovery can not complete with multiple MDTs until they are all available, and it looks like there timeouts during connect (o38 = &lt;tt&gt;MDS_CONNECT&lt;/tt&gt;):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Request sent has timed out for sent delay: [sent 1505204590/real 0] req@ffff887c5585ad00 x1578320959384656/t0(0) o38-&amp;gt;hpcfs-MDT0000-osp-MDT0001@192.168.8.202@o2ib9
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
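
&lt;p&gt;A quick way to check that path directly, using the NID from the message above (a sketch, run from the node hosting MDT0001):&lt;/p&gt;

&lt;pre&gt;# list the NIDs this node is configured with
lctl list_nids
# ping the peer MDS over the network the connect request used
lctl ping 192.168.8.202@o2ib9&lt;/pre&gt;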

&lt;p&gt;What is also a bit strange to me is that you appear to have a large number of different IB networks.  Just in the messages posted here, I see &lt;tt&gt;o2ib6&lt;/tt&gt;, &lt;tt&gt;o2ib8&lt;/tt&gt;, &lt;tt&gt;o2ib9&lt;/tt&gt;, and &lt;tt&gt;o2ib18&lt;/tt&gt; reported.  Do you have an extremely large number of clients, several different filesystems, or are you configuring a separate LNet network for each host?  That isn&apos;t &lt;em&gt;necessarily&lt;/em&gt; a source of problems, but it is unusual and increases the chance that a misconfiguration causes problems.&lt;/p&gt;
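
&lt;p&gt;To review how many LNet networks are actually configured, something like the following may help (assuming the lnetctl utility that ships with 2.10 is installed; a sketch, not a prescription):&lt;/p&gt;

&lt;pre&gt;# dump the configured LNet networks and their interfaces
lnetctl net show
# list known peers and which networks they are reached on
lnetctl peer show&lt;/pre&gt;</comment>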
                            <comment id="208708" author="sebg-crd-pm" created="Tue, 19 Sep 2017 00:13:55 +0000"  >&lt;p&gt;Thank you for your advise. I can get MDT recovery status  completed now.&lt;br/&gt;
Too many  o2ibx made MDTs need spent much  time to try create workable connections.&lt;br/&gt;
And too many logs need spent more than 8 hours to update them. (The update log rate too slow, so I clear update log fininally)&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzzjzb:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>