<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:21:12 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-8863] LFSCK fails to complete, node cannot recover after LFSCK aborted. </title>
                <link>https://jira.whamcloud.com/browse/LU-8863</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;LFSCK fails to complete on lola-8, MDT0000:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;                                lctl lfsck_start -M soaked-MDT0000 -s 1000 -t namespace
                        fi
                fi
2016-11-21 13:21:36,440:fsmgmt.fsmgmt:INFO     lfsck started on lola-8
2016-11-21 13:21:52,069:fsmgmt.fsmgmt:INFO     lfsck still in progress &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; soaked-MDT0000 after 15s
2016-11-21 13:22:22,672:fsmgmt.fsmgmt:INFO     lfsck still in progress &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; soaked-MDT0000 after 45s
2016-11-21 13:23:23,898:fsmgmt.fsmgmt:INFO     lfsck still in progress &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; soaked-MDT0000 after 105s
2016-11-21 13:25:26,280:fsmgmt.fsmgmt:INFO     lfsck still in progress &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; soaked-MDT0000 after 225s
2016-11-21 13:29:31,072:fsmgmt.fsmgmt:INFO     lfsck still in progress &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; soaked-MDT0000 after 465s
2016-11-21 13:37:40,601:fsmgmt.fsmgmt:INFO     lfsck still in progress &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; soaked-MDT0000 after 945s
2016-11-21 13:53:59,778:fsmgmt.fsmgmt:INFO     lfsck still in progress &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; soaked-MDT0000 after 1905s
2016-11-21 14:26:38,226:fsmgmt.fsmgmt:INFO     lfsck still in progress &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; soaked-MDT0000 after 3825s
2016-11-21 15:31:55,117:fsmgmt.fsmgmt:INFO     lfsck still in progress &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; soaked-MDT0000 after 7665s
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;I aborted LFSCK with lfsck_stop (see the command sketch below).&lt;br/&gt;
The LFSCK stopped, but clients and other servers were not able to re-connect.&lt;/p&gt;
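&lt;p&gt;A minimal sketch of the abort and status-check commands involved; the exact invocation is an assumption based on the lfsck_start command shown above:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;# abort the LFSCK currently running on this MDT (assumed invocation)
lctl lfsck_stop -M soaked-MDT0000
# query the namespace LFSCK state/progress on the same MDT
lctl get_param mdd.soaked-MDT0000.lfsck_namespace
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;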
&lt;p&gt;Example client:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;Lustre: Evicted from MGS (at MGC192.168.1.108@o2ib10_0) after server handle changed from 0xd9fafa0ca7b8e5dc to 0x732870fe43aa2fe7
Lustre: MGC192.168.1.108@o2ib10: Connection restored to MGC192.168.1.108@o2ib10_0 (at 192.168.1.108@o2ib10)
Lustre: Skipped 1 previous similar message
LustreError: 183198:0:(lmv_obd.c:1402:lmv_statfs()) can&apos;t stat MDS #0 (soaked-MDT0000-mdc-ffff880426c9c000), error -4
LustreError: 183198:0:(llite_lib.c:1736:ll_statfs_internal()) md_statfs fails: rc = -4
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The system appears to be wedged in this state; rebooting and remounting the lola-8 MDT does not fix the issue.&lt;br/&gt;
I dumped the Lustre log on lola-8 while it was in LFSCK, attached.&lt;br/&gt;
Also attached is the lfsck_layout output.&lt;/p&gt;</description>
                <environment>Soak cluster lustre: 2.8.60_5_gcc5601d</environment>
        <key id="41752">LU-8863</key>
            <summary>LFSCK fails to complete, node cannot recover after LFSCK aborted. </summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="5">Cannot Reproduce</resolution>
                                        <assignee username="laisiyao">Lai Siyao</assignee>
                                    <reporter username="cliffw">Cliff White</reporter>
                        <labels>
                            <label>soak</label>
                    </labels>
                <created>Tue, 22 Nov 2016 22:26:09 +0000</created>
                <updated>Fri, 12 Aug 2022 22:00:37 +0000</updated>
                            <resolved>Fri, 12 Aug 2022 22:00:37 +0000</resolved>
                                    <version>Lustre 2.9.0</version>
                    <version>Lustre 2.10.0</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>7</watches>
                                                                            <comments>
                            <comment id="174741" author="pjones" created="Tue, 22 Nov 2016 22:40:06 +0000"  >&lt;p&gt;Lai&lt;/p&gt;

&lt;p&gt;Please could you advise on this one? The system is left in the hung state so you should be able to access it directly if that would help.&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="174742" author="cliffw" created="Tue, 22 Nov 2016 22:46:19 +0000"  >&lt;p&gt;After reboot of second MDT (lola-9) reconnection is still hung.&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;Lustre: Skipped 8 previous similar messages
Lustre: soaked-MDT0000: Received &lt;span class=&quot;code-keyword&quot;&gt;new&lt;/span&gt; LWP connection from 192.168.1.109@o2ib10, removing former export from same NID
Lustre: soaked-MDT0000: Client soaked-MDT0001-mdtlov_UUID (at 192.168.1.109@o2ib10) refused connection, still busy with 7 references
format at ldlm_lib.c:1221:target_handle_connect doesn&apos;t end in newline
Lustre: soaked-MDT0000: Rejecting reconnect from the known client soaked-MDT0000-lwp-MDT0001_UUID (at 192.168.1.109@o2ib10) because it is indicating it is a &lt;span class=&quot;code-keyword&quot;&gt;new&lt;/span&gt; client
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="174842" author="laisiyao" created="Wed, 23 Nov 2016 15:30:47 +0000"  >&lt;p&gt;I just applied access to Lola: DCO-6349. I&apos;ll check the system tomorrow.&lt;/p&gt;</comment>
                            <comment id="174854" author="cliffw" created="Wed, 23 Nov 2016 16:46:53 +0000"  >&lt;p&gt;I have been restarting the 4 MDS nodes to attempt to clear this, upon restarting the 4th MDS, (192.168.1.111@o2ib)  MDT000 finally completed recovery.&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;Lustre: soaked-MDT0000: Received &lt;span class=&quot;code-keyword&quot;&gt;new&lt;/span&gt; LWP connection from 192.168.1.110@o2ib10, removing former export from same NID
Lustre: Skipped 47 previous similar messages
Lustre: soaked-MDT0000: Client soaked-MDT0002-mdtlov_UUID (at 192.168.1.110@o2ib10) refused connection, still busy with 10 references
Lustre: Skipped 47 previous similar messages
Lustre: 5798:0:(client.c:2111:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1479919197/real 1479919241]  req@ffff88041ce450c0 x1551735784078192/t0(0) o38-&amp;gt;soaked-MDT0003-osp-MDT0000@192.168.1.111@o2ib10:24/4 lens 520/544 e 0 to 1 dl 1479919252 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
Lustre: 5798:0:(client.c:2111:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
Lustre: MGS: Connection restored to 192.168.1.111@o2ib10 (at 192.168.1.111@o2ib10)
format at ldlm_lib.c:1221:target_handle_connect doesn&apos;t end in newline
Lustre: soaked-MDT0000: Rejecting reconnect from the known client soaked-MDT0000-lwp-MDT0003_UUID (at 192.168.1.111@o2ib10) because it is indicating it is a &lt;span class=&quot;code-keyword&quot;&gt;new&lt;/span&gt; client
Lustre: soaked-MDT0000: recovery is timed out, evict stale exports
Lustre: soaked-MDT0000: disconnecting 12 stale clients
Lustre: 5967:0:(ldlm_lib.c:1624:abort_req_replay_queue()) @@@ aborted:  req@ffff880821355080 x1551097408033824/t0(171804417562) o35-&amp;gt;d7ef0a77-d6cc-d6e6-b6ee-f0c4dd8b3805@192.168.1.113@o2ib100:-1/-1 lens 512/0 e 2753 to 0 dl 1479919489 ref 1 fl Complete:/4/ffffffff rc 0/-1
LustreError: 5967:0:(ldlm_lib.c:1645:abort_lock_replay_queue()) @@@ aborted:  req@ffff880827c96050 x1551509616864032/t0(0) o101-&amp;gt;soaked-MDT0003-mdtlov_UUID@192.168.1.111@o2ib10:-1/-1 lens 328/0 e 2752 to 0 dl 1479919496 ref 1 fl Complete:/40/ffffffff rc 0/-1
Lustre: soaked-MDT0000: Denying connection &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;code-keyword&quot;&gt;new&lt;/span&gt; client c3576180-37f0-1109-b79d-ead98e580c5d(at 192.168.1.127@o2ib100), waiting &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; 21 known clients (6 recovered, 3 in progress, and 12 evicted) to recover in 21187373:09
Lustre: Skipped 3 previous similar messages
Lustre: 5967:0:(ldlm_lib.c:2035:target_recovery_overseer()) soaked-MDT0000 recovery is aborted by hard timeout
Lustre: 5967:0:(ldlm_lib.c:2045:target_recovery_overseer()) recovery is aborted, evict exports in recovery
Lustre: soaked-MDT0000: Recovery over after 1147:10, of 21 clients 6 recovered and 15 were evicted.
Lustre: soaked-MDT0000: Connection restored to soaked-MDT0001-mdtlov_UUID (at 192.168.1.109@o2ib10)
Lustre: soaked-MDT0000: Connection restored to 3f3093dd-91db-9fe9-0a49-09ac8c98d9ec (at 192.168.1.125@o2ib100)
Lustre: Skipped 1 previous similar message
Lustre: soaked-MDT0000: Connection restored to b111dd6e-3c52-31d6-0957-ebaf9d411485 (at 192.168.1.123@o2ib100)
Lustre: Skipped 1 previous similar message
Lustre: soaked-MDT0000: Connection restored to 0ac581dd-e230-df88-61be-a489f0ddcab5 (at 192.168.1.126@o2ib100)
Lustre: Skipped 1 previous similar message
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;However, clients remain hung; still investigating.&lt;/p&gt;</comment>
                            <comment id="175088" author="yong.fan" created="Sat, 26 Nov 2016 11:05:49 +0000"  >&lt;p&gt;According to the logs, both the namespace LFSCK and layout LFSCK were running before the lfsck_stop. Except some inconsistent owners were repaired by layout LFSCK, all others looks normally. In fact, the inconsistent owner is fake because of async chown operation during the layout LFSCK. So from the LFSCK view, during the log interval, there were no other inconsistency found.&lt;/p&gt;

&lt;p&gt;But just like &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-8742&quot; title=&quot;lfsck &amp;gt; 1000 seconds&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-8742&quot;&gt;&lt;del&gt;LU-8742&lt;/del&gt;&lt;/a&gt;, the LFSCK ran some &quot;slowly&quot;, as to you have to stop lfsck_stop manually. But according to the logs, the LFSCK was really in scanning, not hung. As to how long it will take (already more than 720 seconds), depends on the objects count in the system.&lt;/p&gt;

&lt;p&gt;About the system hung issue, I cannot establish any relationship with the uncompleted LFSCK, because the LFSCK runs at background, and it did not found insistency. Means from the LFSCK view, the system was mountable. As the last comment described, recovery timeout finally, some clients evicted, that may be related with why &apos;clients remain hung&apos;.&lt;/p&gt;</comment>
                            <comment id="176282" author="adilger" created="Fri, 2 Dec 2016 21:13:05 +0000"  >&lt;p&gt;Cliff, have you checked if DNE recovery and filesystem usage is always blocked until all MDTs are available?  It may be that LFSCK is being blocked by the DNE recovery?  Preferably, we&apos;d want the filesystem to be usable as soon as MDT0000 is up (at least for those parts that are on available MDTs), but it may be that DNE2 recovery is waiting for all MDTs to be up before it allows recovery to complete?&lt;/p&gt;</comment>
                            <comment id="176283" author="cliffw" created="Fri, 2 Dec 2016 21:29:19 +0000"  >&lt;p&gt;This is new to me, i am not sure how to check for DNE recovery as opposed to just recovery. The filesystem recovery completes before we trigger LFSCK in soak. &lt;/p&gt;</comment>
                            <comment id="181981" author="cliffw" created="Tue, 24 Jan 2017 19:53:46 +0000"  >&lt;p&gt;Hitting this again on latest master&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;]# lctl get_param mdd.*.lfsck_layout
mdd.soaked-MDT0000.lfsck_layout=
name: lfsck_layout
magic: 0xb1734d76
version: 2
status: scanning-phase1
flags:
param:
last_completed_time: 1485198303
time_since_last_completed: 89193 seconds
latest_start_time: 1485275978
time_since_latest_start: 11518 seconds
last_checkpoint_time: 1485287462
time_since_last_checkpoint: 34 seconds
latest_start_position: 77
last_checkpoint_position: 769655424
first_failure_position: 0
success_count: 2
repaired_dangling: 807
repaired_unmatched_pair: 0
repaired_multiple_referenced: 0
repaired_orphan: 0
repaired_inconsistent_owner: 2310456
repaired_others: 0
skipped: 0
failed_phase1: 0
failed_phase2: 0
checked_phase1: 4033934
checked_phase2: 0
run_time_phase1: 11517 seconds
run_time_phase2: 0 seconds
average_speed_phase1: 350 items/sec
average_speed_phase2: N/A
real-time_speed_phase1: 246 items/sec
real-time_speed_phase2: N/A
current_position: 773325008
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Logs&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;ERROR    lfsck found errors lola-8/soaked-MDT0000: lf_repaired: 0
ERROR    lfsck found errors lola-8/soaked-MDT0000: lf_repaired: 0
ERROR    lfsck found errors lola-8/soaked-MDT0000: lf_repaired: 0
ERROR    lfsck found errors lola-8/soaked-MDT0000: lf_repaired: 0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="40166">LU-8647</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="24147" name="lola-8.layout.txt" size="880" author="cliffw" created="Tue, 22 Nov 2016 22:26:09 +0000"/>
                            <attachment id="24148" name="lola-8.nov21-2016.txt.gz" size="2740095" author="cliffw" created="Tue, 22 Nov 2016 22:26:09 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzywd3:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>