<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:56:50 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-6057] replay-dual test_9 failed - post-failover df: 1</title>
                <link>https://jira.whamcloud.com/browse/LU-6057</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;While running the LFSCK Phase 3 test plan, replay-dual test 9 failed with the following error from the fail() routine:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;c13: mdc.scratch-MDT0000-mdc-*.mds_server_uuid in FULL state after 181 sec
c12: stat: cannot read file system information for `/lustre/scratch&apos;: Input/output error
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;replay-dual test 10 failed with the same error message:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;c13: mdc.scratch-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
c11: stat: cannot read file system information for `/lustre/scratch&apos;: Input/output error
pdsh@c13: c11: ssh exited with exit code 1
c13: stat: cannot read file system information for `/lustre/scratch&apos;: Input/output error
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The test results are at &lt;a href=&quot;https://testing.hpdd.intel.com/test_sets/78dc0abe-861b-11e4-ac52-5254006e85c2&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.hpdd.intel.com/test_sets/78dc0abe-861b-11e4-ac52-5254006e85c2&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It&#8217;s not clear from the logs what caused this error. For test 9, the client that could not stat the file system, c12, has the following in dmesg right before the test fails:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00800000:00020000:5.0:1418766109.528501:0:25671:0:(lmv_obd.c:1477:lmv_statfs()) can&apos;t stat MDS #0 (scratch-MDT0000-mdc-ffff8808028cbc00), error -5
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
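
&lt;p&gt;For reference, the failing check can be approximated by hand along the following lines. This is only a rough sketch of what the post-failover df check does, not the exact test_9 code; the target and mount point names are taken from the logs above:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# on the client: check the MDC import state for MDT0000 (the test waits for it to reach FULL)
lctl get_param mdc.scratch-MDT0000-mdc-*.mds_server_uuid
# then issue a statfs against the mount point; this is the call that returned EIO (-5) on c12
stat -f /lustre/scratch
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;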

&lt;p&gt;On the primary MDS, MDS0, the recovery looks like it is having issues:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Lustre: *** cfs_fail_loc=119, val=2147483648***
LustreError: 12646:0:(ldlm_lib.c:2384:target_send_reply_msg()) @@@ dropping reply  req@ffff880d0ee74c80 x1487677070285728/t128849018882(128849018882) o36-&amp;gt;558cba8f-7f43-4143-5d8a-c7adfced85eb@192.168.2.112@o2ib:308/0 lens 488/448 e 0 to 0 dl 1418766108 ref 1 fl Complete:/4/0 rc 0/0
Lustre: scratch-MDT0000: recovery is timed out, evict stale exports
Lustre: scratch-MDT0000: disconnecting 1 stale clients
Lustre: 12646:0:(ldlm_lib.c:1767:target_recovery_overseer()) recovery is aborted by hard timeout
Lustre: 12646:0:(ldlm_lib.c:1773:target_recovery_overseer()) recovery is aborted, evict exports in recovery
Lustre: 12646:0:(ldlm_lib.c:1773:target_recovery_overseer()) Skipped 2 previous similar messages
Lustre: 12646:0:(ldlm_lib.c:1415:abort_req_replay_queue()) @@@ aborted:  req@ffff880275011380 x1487683234832604/t0(128849018884) o36-&amp;gt;d08d2f7b-4c89-7208-ad20-237f0ed0a102@192.168.2.113@o2ib:294/0 lens 488/0 e 6 to 0 dl 1418766094 ref 1 fl Complete:/4/ffffffff rc 0/-1
Lustre: 12646:0:(ldlm_lib.c:1767:target_recovery_overseer()) recovery is aborted by hard timeout
Lustre: 12646:0:(ldlm_lib.c:2060:target_recovery_thread()) too long recovery - read logs
Lustre: scratch-MDT0000: Recovery over after 3:01, of 7 clients 1 recovered and 6 were evicted.
LustreError: dumping log to /tmp/lustre-log.1418766079.12646
Lustre: Skipped 3 previous similar messages
Lustre: DEBUG MARKER: mdc.scratch-MDT0000-mdc-*.mds_server_uuid in FULL state after 181 sec
Lustre: DEBUG MARKER: replay-dual test_9: @@@@@@ FAIL: post-failover df: 1
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</description>
                <environment>OpenSFS cluster running lustre-master tag 2.6.91 build # 2771 with two MDSs with one MDT each, three OSSs with two OSTs each and three clients.</environment>
        <key id="27988">LU-6057</key>
            <summary>replay-dual test_9 failed - post-failover df: 1</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="2" iconUrl="https://jira.whamcloud.com/images/icons/priorities/critical.svg">Critical</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="3">Duplicate</resolution>
                                        <assignee username="yong.fan">nasf</assignee>
                                    <reporter username="jamesanunez">James Nunez</reporter>
                        <labels>
                    </labels>
                <created>Fri, 19 Dec 2014 17:34:54 +0000</created>
                <updated>Thu, 12 Feb 2015 18:30:06 +0000</updated>
                            <resolved>Tue, 20 Jan 2015 02:59:31 +0000</resolved>
                                    <version>Lustre 2.7.0</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>5</watches>
                                                                            <comments>
                            <comment id="102514" author="yong.fan" created="Mon, 5 Jan 2015 01:03:33 +0000"  >&lt;p&gt;Because some of the client missed to replay the expected requests during the MDS0 restart, they were evicted by the MDS0. So it is normal that the subsequent statfs() on those evicted client got failure. Unfortunately, the debug log on the MDS0 only contained the events after the recovery, so we cannot know what happened during the recovery and what caused the eviction.&lt;/p&gt;

&lt;p&gt;According to our current LFSCK implementation, if a former LFSCK run was paused or crashed before the MDS0 restart, it will be automatically resumed after the MDS0 recovery. I found the related logs on MDS0:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000004:00000001:0.0:1418766079.340082:0:12646:0:(mdt_handler.c:5813:mdt_postrecov()) Process entered
00000004:00000001:0.0:1418766079.340083:0:12646:0:(mdd_device.c:1458:mdd_iocontrol()) Process entered
00100000:00000001:0.0:1418766079.340085:0:12646:0:(lfsck_lib.c:2618:lfsck_start()) Process entered
00100000:00000001:0.0:1418766079.340086:0:12646:0:(lfsck_lib.c:2632:lfsck_start()) Process leaving via put (rc=0 : 0 : 0x0)
00000004:00000001:0.0:1418766079.340088:0:12646:0:(mdd_device.c:1478:mdd_iocontrol()) Process leaving (rc=0 : 0 : 0)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;That means MDS0 tried to resume LFSCK after the recovery; since there was no inconsistency and no paused/crashed LFSCK before the MDS0 restart, lfsck_start() exited immediately, as expected. So the replay-dual test_9 failure is not related to LFSCK and looks more like a general master branch issue. Unfortunately, without recovery logs on MDS0 it is hard to investigate further.&lt;/p&gt;
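
&lt;p&gt;For completeness, the post-recovery LFSCK state can also be checked from userspace with standard lctl commands. This is only a sketch assuming the scratch filesystem in this run; parameter paths may differ between Lustre versions:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# show the namespace LFSCK status on the MDT (nothing should have been resumed in this case)
lctl get_param mdd.scratch-MDT0000.lfsck_namespace
# start or resume a namespace LFSCK manually if needed
lctl lfsck_start -M scratch-MDT0000 -t namespace
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>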
                            <comment id="102529" author="yong.fan" created="Mon, 5 Jan 2015 15:00:00 +0000"  >&lt;p&gt;The client on the node c12 was evicted. According to its log, the c12 client tried to replay the create RPC after the MDS0 restart:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000080:00000001:6.0:1418765877.138456:0:25481:0:(namei.c:927:ll_mknod()) Process entered
00000080:00200000:6.0:1418765877.138457:0:25481:0:(namei.c:931:ll_mknod()) VFS Op:name=fsa-c12, dir=[0x200000007:0x1:0x0](ffff88080bc08bb8) mode 100644 dev 0
...
00000100:00000001:6.0:1418765877.338798:0:25510:0:(client.c:2140:ptlrpc_set_wait()) Process entered
00000100:00000001:6.0:1418765877.338798:0:25510:0:(client.c:1423:ptlrpc_send_new_req()) Process entered
00000100:00000040:6.0:1418765877.338801:0:25510:0:(lustre_net.h:3328:ptlrpc_rqphase_move()) @@@ move req &quot;New&quot; -&amp;gt; &quot;Rpc&quot;  req@ffff88080a994680 x1487677070285744/t0(0) o36-&amp;gt;scratch-MDT0000-mdc-ffff880802a13000@192.168.2.125@o2ib:12/10 lens 488/568 e 0 to 0 dl 0 ref 2 fl New:/0/ffffffff rc 0/-1
...
00000100:00100000:4.0:1418765898.178881:0:6899:0:(client.c:2544:ptlrpc_free_committed()) @@@ stopping search  req@ffff88080a994680 x1487677070285744/t128849018888(128849018888) o36-&amp;gt;scratch-MDT0000-mdc-ffff880802a13000@192.168.2.125@o2ib:12/10 lens 488/416 e 0 to 0 dl 1418766088 ref 1 fl Complete:R/4/0 rc 0/0
00000100:00000001:4.0:1418765898.178886:0:6899:0:(client.c:2576:ptlrpc_free_committed()) Process leaving
00000100:00080000:4.0:1418765898.178887:0:6899:0:(recover.c:93:ptlrpc_replay_next()) import ffff88081075c000 from scratch-MDT0000_UUID committed 124554051594 last 0
00000100:00000001:4.0:1418765898.178890:0:6899:0:(client.c:2842:ptlrpc_replay_req()) Process entered
00000100:00080000:4.0:1418765898.178892:0:6899:0:(client.c:2866:ptlrpc_replay_req()) @@@ REPLAY  req@ffff88080a994680 x1487677070285744/t128849018888(128849018888) o36-&amp;gt;scratch-MDT0000-mdc-ffff880802a13000@192.168.2.125@o2ib:12/10 lens 488/416 e 0 to 0 dl 1418766088 ref 1 fl New:R/4/0 rc 0/0
00000100:00000001:4.0:1418765898.178897:0:6899:0:(client.c:2643:ptlrpc_request_addref()) Process entered
00000100:00000001:4.0:1418765898.178898:0:6899:0:(client.c:2645:ptlrpc_request_addref()) Process leaving (rc=18446612166851774080 : -131906857777536 : ffff88080a994680)
00000100:00000040:4.0:1418765898.178901:0:6899:0:(ptlrpcd.c:246:ptlrpcd_add_req()) @@@ add req [ffff88080a994680] to pc [ptlrpcd_rcv:-1]  req@ffff88080a994680 x1487677070285744/t128849018888(128849018888) o36-&amp;gt;scratch-MDT0000-mdc-ffff880802a13000@192.168.2.125@o2ib:12/10 lens 488/416 e 0 to 0 dl 1418766088 ref 2 fl New:R/4/0 rc 0/0
...
00000100:00000001:2.0:1418766079.080356:0:6906:0:(ptlrpcd.c:363:ptlrpcd_check()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:3.0:1418766079.088831:0:6899:0:(client.c:1176:ptlrpc_check_status()) Process leaving (rc=18446744073709551509 : -107 : ffffffffffffff95)
00000100:00000001:3.0:1418766079.088834:0:6899:0:(client.c:1340:after_reply()) Process leaving (rc=18446744073709551509 : -107 : ffffffffffffff95)
00000100:00000001:7.0:1418766079.088837:0:6902:0:(client.c:2001:ptlrpc_expired_set()) Process entered
00000100:00000001:7.0:1418766079.088838:0:6902:0:(client.c:2037:ptlrpc_expired_set()) Process leaving (rc=1 : 1 : 1)
00000100:00000040:3.0:1418766079.088838:0:6899:0:(lustre_net.h:3328:ptlrpc_rqphase_move()) @@@ move req &quot;Rpc&quot; -&amp;gt; &quot;Interpret&quot;  req@ffff88080a994680 x1487677070285744/t128849018888(128849018888) o36-&amp;gt;scratch-MDT0000-mdc-ffff880802a13000@192.168.2.125@o2ib:12/10 lens 488/192 e 0 to 0 dl 1418766109 ref 2 fl Rpc:R/4/0 rc -107/-107
...
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;When c12 replayed the create RPC, it eventually got failure -107 (-ENOTCONN). The -107 was returned because MDS0 had already marked the recovery as failed.&lt;/p&gt;

&lt;p&gt;On the MDS0 side, some expected RPCs were never sent to MDS0 for replay (we do not know which RPCs were missed), so the recovery could not complete. Unfortunately, the MDS0 debug logs only contain information from after the recovery.&lt;/p&gt;</comment>
                            <comment id="102724" author="jamesanunez" created="Wed, 7 Jan 2015 07:23:19 +0000"  >&lt;p&gt;No other programs were running while replay-dual was running. &lt;/p&gt;</comment>
                            <comment id="103834" author="tappro" created="Mon, 19 Jan 2015 06:36:49 +0000"  >&lt;p&gt;There is unexpected recovery abort so test cannot be considered as valid. It is related to &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6084&quot; title=&quot;Tests are failed due to &amp;#39;recovery is aborted by hard timeout&amp;#39;&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6084&quot;&gt;&lt;del&gt;LU-6084&lt;/del&gt;&lt;/a&gt;, I mean recovery abort issue only.&lt;/p&gt;</comment>
                            <comment id="103968" author="yong.fan" created="Tue, 20 Jan 2015 02:59:31 +0000"  >&lt;p&gt;It is another failure instance of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6084&quot; title=&quot;Tests are failed due to &amp;#39;recovery is aborted by hard timeout&amp;#39;&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6084&quot;&gt;&lt;del&gt;LU-6084&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="28058">LU-6084</issuekey>
        </issuelink>
                            </outwardlinks>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="28684">LU-6238</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzx2wn:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>16869</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>