<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:03:07 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-6773] DNE2 Failover and recovery soak testing</title>
                <link>https://jira.whamcloud.com/browse/LU-6773</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;With async update, cross-MDT operations do not need to synchronize updates on each target. Instead, updates are recorded on each target, and recovery of the filesystem from failure takes place using these update records. All operations across MDTs are enabled; for example, cross-MDT rename and link succeed and do not return -EXDEV, so a workload like dbench that is doing renames should function correctly in a striped directory.&lt;br/&gt;
1. Set up Lustre with 4 MDSes (each MDS has one MDT), 4 OSTs, and at least 8 clients.&lt;br/&gt;
2. Each client will create a striped directory (with stripe count = 4). Under each striped directory:&lt;br/&gt;
a. Half of the clients will keep running tar and untar in the striped directory.&lt;br/&gt;
b. The other half of the clients will run dbench in the striped directory.&lt;br/&gt;
3. Randomly reboot one of the MDSes at least once every 30 minutes and fail over to the backup MDS if the test configuration allows it.&lt;br/&gt;
4. The test should keep running for at least 24 hours without reporting application errors.&lt;br/&gt;
The goal of the failover and recovery soak testing is not necessarily to resolve every issue found during testing, especially non-DNE issues, but rather to have a good idea of the relative stability of DNE + Async Commits during recovery.&lt;/p&gt;</description>
                <environment></environment>
        <key id="30857">LU-6773</key>
            <summary>DNE2 Failover and recovery soak testing</summary>
                <type id="3" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11318&amp;avatarType=issuetype">Task</type>
                                            <priority id="1" iconUrl="https://jira.whamcloud.com/images/icons/priorities/blocker.svg">Blocker</priority>
                        <status id="6" iconUrl="https://jira.whamcloud.com/images/icons/statuses/closed.png" description="The issue is considered finished, the resolution is correct. Issues which are closed can be reopened.">Closed</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="jamesanunez">James Nunez</assignee>
                                    <reporter username="rhenwood">Richard Henwood</reporter>
                        <labels>
                    </labels>
                <created>Mon, 29 Jun 2015 22:14:37 +0000</created>
                <updated>Thu, 14 Jun 2018 21:41:37 +0000</updated>
                            <resolved>Wed, 26 Aug 2015 16:42:59 +0000</resolved>
                                                                        <due></due>
                            <votes>0</votes>
                                    <watches>5</watches>
                                                                            <comments>
                            <comment id="120444" author="di.wang" created="Mon, 6 Jul 2015 17:18:55 +0000"  >&lt;p&gt;The RPM has been installed on all of the nodes: &lt;a href=&quot;https://build.hpdd.intel.com/job/lustre-reviews/33136/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://build.hpdd.intel.com/job/lustre-reviews/33136/&lt;/a&gt;.&lt;/p&gt;</comment>
                            <comment id="120816" author="rhenwood" created="Thu, 9 Jul 2015 14:00:58 +0000"  >&lt;p&gt;Also: Please collect logs from this work during the run and attach them to this ticket.&lt;/p&gt;</comment>
                            <comment id="121393" author="rhenwood" created="Wed, 15 Jul 2015 20:05:21 +0000"  >&lt;p&gt;An update on this activity:&lt;/p&gt;

&lt;p&gt;James Nunez has been engaged with recovery testing on the OpenSFS cluster hosted by IU for the last two weeks. Over the past week, IU have been fully engaged helping us with a stretch goal to enable a fail-over configuration. No mechanism to gracefully force the logical drives to all run on a single controller could be identified. It is expected that physically pulling a controller may force a fail-over, but this activity cannot be scheduled for the 24-hour duration required by the test.&lt;/p&gt;

&lt;p&gt;I&apos;m investigating alternatives.&lt;/p&gt;</comment>
                            <comment id="121394" author="rhenwood" created="Wed, 15 Jul 2015 20:08:00 +0000"  >&lt;p&gt;During discussion, we have identified three stages of testing for this ticket:&lt;/p&gt;

&lt;ol&gt;
	&lt;li&gt;soft recovery: forced unmount of an active file system. Remount after a period of time.&lt;/li&gt;
	&lt;li&gt;hard recovery: hard reboot an MDS of an active file system.&lt;/li&gt;
	&lt;li&gt;hard recovery with fail-over: hard reboot an MDS of an active file system. The file system remains available throughout.&lt;/li&gt;
&lt;/ol&gt;
</comment>
                            <comment id="122207" author="di.wang" created="Sat, 25 Jul 2015 07:06:53 +0000"  >&lt;p&gt;With build &lt;a href=&quot;https://build.hpdd.intel.com/job/lustre-reviews/33580/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://build.hpdd.intel.com/job/lustre-reviews/33580/&lt;/a&gt; and hard reboot (10-minute reboot interval), the test fails after 35 failovers. The target is 48 failovers.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Duration:               86400
Server failover period: 600 seconds
Exited after:           22249 seconds
Number of failovers before exit:
mds1: 6 times
mds2: 12 times
mds3: 9 times
mds4: 8 times
ost1: 0 times
ost2: 0 times
ost3: 0 times
ost4: 0 times
Status: FAIL: rc=7
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="122422" author="di.wang" created="Tue, 28 Jul 2015 16:22:19 +0000"  >&lt;p&gt;The test fails after 41 failovers with build &lt;a href=&quot;https://build.hpdd.intel.com/job/lustre-reviews/33612/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://build.hpdd.intel.com/job/lustre-reviews/33612/&lt;/a&gt;&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Duration:               86400
Server failover period: 1800 seconds
Exited after:           72283 seconds
Number of failovers before exit:
mds1: 10 times
mds2: 10 times
mds3: 10 times
mds4: 11 times
ost1: 0 times
ost2: 0 times
ost3: 0 times
ost4: 0 times
Status: FAIL: rc=7
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="122942" author="di.wang" created="Sun, 2 Aug 2015 01:41:42 +0000"  >&lt;p&gt;OK, this test just passed with build &lt;a href=&quot;https://build.hpdd.intel.com/job/lustre-reviews/33759/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://build.hpdd.intel.com/job/lustre-reviews/33759/&lt;/a&gt;. Here is the test log.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;==== Checking the clients loads AFTER failover -- failure NOT OK
mds4 has failed over 9 times, and counting...
2015-08-01 16:59:58 Terminating clients loads ...
Duration:               86400
Server failover period: 1800 seconds
Exited after:           84832 seconds
Number of failovers before exit:
mds1: 16 times
mds2: 7 times
mds3: 16 times
mds4: 9 times
ost1: 0 times
ost2: 0 times
ost3: 0 times
ost4: 0 times
Status: PASS: rc=0
PASS failover_mds (84837s)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="123034" author="di.wang" created="Mon, 3 Aug 2015 16:44:00 +0000"  >&lt;p&gt;The build &lt;a href=&quot;https://build.hpdd.intel.com/job/lustre-reviews/33759/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://build.hpdd.intel.com/job/lustre-reviews/33759/&lt;/a&gt; is based on master with the following patches:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://review.whamcloud.com/#/c/15812/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/15812/&lt;/a&gt;  (&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6928&quot; title=&quot;Version mismatch during DNE replay&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6928&quot;&gt;&lt;del&gt;LU-6928&lt;/del&gt;&lt;/a&gt;)&lt;br/&gt;
&lt;a href=&quot;http://review.whamcloud.com/#/c/15793/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/15793/&lt;/a&gt;  (&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6924&quot; title=&quot;remote regular file are missing after recovery.&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6924&quot;&gt;&lt;del&gt;LU-6924&lt;/del&gt;&lt;/a&gt;)&lt;br/&gt;
&lt;a href=&quot;http://review.whamcloud.com/#/c/15730/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/15730/&lt;/a&gt;  (&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6846&quot; title=&quot;dt_record_write()) ASSERTION( dt-&amp;gt;do_body_ops-&amp;gt;dbo_write ) failed: &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6846&quot;&gt;&lt;del&gt;LU-6846&lt;/del&gt;&lt;/a&gt;)&lt;br/&gt;
&lt;a href=&quot;http://review.whamcloud.com/#/c/15725/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/15725/&lt;/a&gt;  (&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6905&quot; title=&quot;For OSP to MDT, it should rename ost_conn(server)_uuid to mdt_conn(server)_uuid&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6905&quot;&gt;&lt;del&gt;LU-6905&lt;/del&gt;&lt;/a&gt;)&lt;br/&gt;
&lt;a href=&quot;http://review.whamcloud.com/#/c/15721/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/15721/&lt;/a&gt;  (&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6896&quot; title=&quot;update llog object is missing during recovery.&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6896&quot;&gt;&lt;del&gt;LU-6896&lt;/del&gt;&lt;/a&gt;)&lt;br/&gt;
&lt;a href=&quot;http://review.whamcloud.com/#/c/15691/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/15691/&lt;/a&gt;  (&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6875&quot; title=&quot;thandle_get_sub_by_dt dereferences ERR_PTR pointer on error&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6875&quot;&gt;&lt;del&gt;LU-6875&lt;/del&gt;&lt;/a&gt;)&lt;br/&gt;
&lt;a href=&quot;http://review.whamcloud.com/#/c/15690/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/15690/&lt;/a&gt;  (&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6881&quot; title=&quot;sub_trans_commit_cb() is racy&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6881&quot;&gt;&lt;del&gt;LU-6881&lt;/del&gt;&lt;/a&gt;)&lt;br/&gt;
&lt;a href=&quot;http://review.whamcloud.com/#/c/15682/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/15682/&lt;/a&gt;  (&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6882&quot; title=&quot;osp_prep_update_req() doesn&amp;#39;t set rq_bulk_write properly&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6882&quot;&gt;&lt;del&gt;LU-6882&lt;/del&gt;&lt;/a&gt;)&lt;br/&gt;
&lt;a href=&quot;http://review.whamcloud.com/#/c/15595/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/15595/&lt;/a&gt;  (&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6846&quot; title=&quot;dt_record_write()) ASSERTION( dt-&amp;gt;do_body_ops-&amp;gt;dbo_write ) failed: &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6846&quot;&gt;&lt;del&gt;LU-6846&lt;/del&gt;&lt;/a&gt;)&lt;br/&gt;
&lt;a href=&quot;http://review.whamcloud.com/#/c/15594/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/15594/&lt;/a&gt;  (&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6819&quot; title=&quot;LBUG ASSERTION( tdtd-&amp;gt;tdtd_last_update_transno &amp;lt;= transno ) failed&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6819&quot;&gt;&lt;del&gt;LU-6819&lt;/del&gt;&lt;/a&gt;)&lt;br/&gt;
&lt;a href=&quot;http://review.whamcloud.com/#/c/15576/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/15576/&lt;/a&gt; (&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6840&quot; title=&quot;update memory reply data in DNE update replay &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6840&quot;&gt;&lt;del&gt;LU-6840&lt;/del&gt;&lt;/a&gt;)&lt;br/&gt;
&lt;a href=&quot;http://review.whamcloud.com/#/c/14497/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/14497/&lt;/a&gt; (&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6475&quot; title=&quot;race between open and migration&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6475&quot;&gt;&lt;del&gt;LU-6475&lt;/del&gt;&lt;/a&gt;)&lt;br/&gt;
&lt;a href=&quot;http://review.whamcloud.com/#/c/13224/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/13224/&lt;/a&gt; (&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6852&quot; title=&quot;MDS is evicted during 24-24 hours failover.&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6852&quot;&gt;&lt;del&gt;LU-6852&lt;/del&gt;&lt;/a&gt;)&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10120">
                    <name>Blocker</name>
                                            <outwardlinks description="is blocking">
                                        <issuelink>
            <issuekey id="31106">LU-6858</issuekey>
        </issuelink>
                            </outwardlinks>
                                                                <inwardlinks description="is blocked by">
                                        <issuelink>
            <issuekey id="31055">LU-6837</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="31059">LU-6840</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="31087">LU-6852</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="31033">LU-6831</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="18538" name="recovery-mds-scale.suite_log.c24.log" size="715717" author="di.wang" created="Sun, 2 Aug 2015 01:41:42 +0000"/>
                            <attachment id="18481" name="recovery-mds-scale.suite_log.c24.log" size="251157" author="di.wang" created="Sat, 25 Jul 2015 07:06:53 +0000"/>
                            <attachment id="18504" name="recovery-mds-scale.test_failover_mds.test_log.c24.log" size="624926" author="di.wang" created="Tue, 28 Jul 2015 16:22:19 +0000"/>
                            <attachment id="18539" name="test_logs.tgz" size="222" author="di.wang" created="Sun, 2 Aug 2015 01:41:42 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzxgtb:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                </customfields>
    </item>
</channel>
</rss>