<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:40:01 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-4138] Problem with migrating from 1 MDT to 2 MDT</title>
                <link>https://jira.whamcloud.com/browse/LU-4138</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Background:&lt;/p&gt;

&lt;p&gt;Our objective is to upgrade our Lustre software from 1.8.7 to 2.4.*. &lt;br/&gt;
    We also want to split our current active/standby MDS with a shared MDT into&lt;br/&gt;
    2 MDTs in an active/active configuration. &lt;/p&gt;

&lt;p&gt;    The requirement is that all data remain in place during the upgrade/MDS &lt;br/&gt;
    split.&lt;/p&gt;

&lt;p&gt;= We went through the Lustre software upgrade from 1.8.7 (CentOS/el5) to&lt;br/&gt;
  2.4.0 (CentOS/el6) successfully. During this process, we kept the &lt;br/&gt;
  1 MDS/MDT.&lt;/p&gt;

&lt;p&gt;= We then configured 2 other machines as the new MDS servers. &lt;br/&gt;
  We transferred the network interfaces to one of the new MDS servers.&lt;/p&gt;

&lt;p&gt;= We formatted the MDT on the new MDS:&lt;/p&gt;

&lt;p&gt;mkfs.lustre --reformat --fsname=rhino --param mdt.quota_type=ug --mgs --mdt --index=0 /dev/md0&lt;/p&gt;

&lt;p&gt;= We copied the existing MDT contents and ea.bak files over to the new server&lt;br/&gt;
(with GNU tar version 1.27):&lt;/p&gt;

&lt;p&gt;/usr/local/bin/tar czvf /share/apps/tmp/rhino_mdt.tgz --sparse .&lt;/p&gt;

&lt;p&gt;getfattr -R -d -m &apos;.*&apos; -e hex -P . &amp;gt; /tmp/ea-$(date +%Y%m%d).bak&lt;/p&gt;


&lt;p&gt;= We then ran:&lt;/p&gt;

&lt;p&gt;/usr/local/bin/tar xzvpf /share/apps/tmp/rhino_mdt.tgz --sparse&lt;/p&gt;

&lt;p&gt;setfattr --restore=/share/apps/tmp/ea-20131023.bak&lt;/p&gt;
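The file-level backup/restore pattern above (GNU tar with --sparse, plus a getfattr dump and setfattr restore of extended attributes) can be sketched on throwaway directories instead of a real MDT. Paths and file names here are hypothetical, not the reporter's actual ones:

```shell
# Minimal sketch of the MDT backup/restore round trip, exercised on
# scratch directories rather than an ldiskfs-mounted MDT.
set -e
src=$(mktemp -d)
dst=$(mktemp -d)
echo "mdt-object" > "$src/obj0"

# Backup: archive everything, preserving sparse files.
( cd "$src" ; tar czf /tmp/mdt-backup.tgz --sparse . )
# On a real MDT, also dump extended attributes at this point:
#   getfattr -R -d -m '.*' -e hex -P . > /tmp/ea-backup.bak

# Restore: extract with permissions preserved (-p), then restore the EAs.
( cd "$dst" ; tar xzpf /tmp/mdt-backup.tgz --sparse )
#   setfattr --restore=/tmp/ea-backup.bak

cat "$dst/obj0"
```

The EA dump/restore step matters on a real MDT, since Lustre stores its object mappings in extended attributes; skipping it leaves the restored MDT unusable.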

&lt;p&gt;= We attempted to mount new MDT:&lt;/p&gt;

&lt;p&gt;mount -t lustre /dev/md1 /rhino&lt;/p&gt;

&lt;p&gt;= We got errors:&lt;/p&gt;

&lt;p&gt;mount.lustre: mount /dev/md1 at /rhino failed: File exists&lt;/p&gt;
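"File exists" here is the mount helper reporting errno 17 (EEXIST); the dmesg output below also shows rc = -98, which is EADDRINUSE (reused by Lustre to mean "index already in use"). A generic way to decode such negative kernel return codes, assuming python3 is available:

```shell
# Decode the negative return codes seen in the mount failure and dmesg log.
# On Linux, errno 17 = EEXIST and errno 98 = EADDRINUSE.
for rc in 17 98; do
    python3 -c "import errno, os; print(-$rc, errno.errorcode[$rc], os.strerror($rc))"
done
```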

&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;from dmesg&amp;#93;&lt;/span&gt;&lt;/p&gt;


&lt;p&gt;LDISKFS-fs (md1): mounted filesystem with ordered data mode. quota=on. Opts: &lt;br/&gt;
Lustre: 13422:0:(mgs_llog.c:238:mgs_fsdb_handler()) MDT using 1.8 OSC name scheme&lt;br/&gt;
LustreError: 140-5: Server rhino-MDT0000 requested index 0, but that index is already in use. Use --writeconf to force&lt;br/&gt;
LustreError: 13376:0:(mgs_llog.c:3625:mgs_write_log_target()) Can&apos;t get index (-98)&lt;br/&gt;
LustreError: 13376:0:(mgs_handler.c:408:mgs_handle_target_reg()) Failed to write rhino-MDT0000 log (-98)&lt;br/&gt;
LustreError: 13321:0:(obd_mount_server.c:1124:server_register_target()) rhino-MDT0000: error registering with the MGS: rc = -98 (not fatal)&lt;br/&gt;
Lustre: 13423:0:(obd_config.c:1428:class_config_llog_handler()) For 1.8 interoperability, rename obd type from mds to mdt&lt;br/&gt;
Lustre: rhino-MDT0000: used disk, loading&lt;br/&gt;
Lustre: 13423:0:(mdt_handler.c:4946:mdt_process_config()) For interoperability, skip this mdt.quota_type. It is obsolete.&lt;br/&gt;
Lustre: 13423:0:(mdt_handler.c:4946:mdt_process_config()) Skipped 1 previous similar message&lt;br/&gt;
LustreError: 13423:0:(genops.c:320:class_newdev()) Device rhino-OST0000-osc already exists at 8, won&apos;t add&lt;br/&gt;
LustreError: 13423:0:(obd_config.c:374:class_attach()) Cannot create device rhino-OST0000-osc of type osp : -17&lt;br/&gt;
LustreError: 13423:0:(obd_config.c:1553:class_config_llog_handler()) MGC192.168.95.245@tcp: cfg command failed: rc = -17&lt;br/&gt;
Lustre:    cmd=cf001 0:rhino-OST0000-osc  1:osp  2:rhino-mdtlov_UUID  &lt;br/&gt;
LustreError: 15c-8: MGC192.168.95.245@tcp: The configuration from log &apos;rhino-MDT0000&apos; failed (-17). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.&lt;br/&gt;
LustreError: 13321:0:(obd_mount_server.c:1258:server_start_targets()) failed to start server rhino-MDT0000: -17&lt;br/&gt;
LustreError: 13321:0:(obd_mount_server.c:1700:server_fill_super()) Unable to start targets: -17&lt;br/&gt;
LustreError: 13321:0:(obd_mount_server.c:849:lustre_disconnect_lwp()) rhino-MDT0000-lwp-MDT0000: Can&apos;t end config log rhino-client.&lt;br/&gt;
LustreError: 13321:0:(obd_mount_server.c:1427:server_put_super()) rhino-MDT0000: failed to disconnect lwp. (rc=-2)&lt;br/&gt;
Lustre: Failing over rhino-MDT0000&lt;br/&gt;
LustreError: 137-5: rhino-MDT0000_UUID: not available for connect from 192.168.95.248@tcp (no target)&lt;br/&gt;
Lustre: 13321:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: &lt;span class=&quot;error&quot;&gt;&amp;#91;sent 1382564544/real 1382564544&amp;#93;&lt;/span&gt;  req@ffff880343c20c00 x1449708353487088/t0(0) o251-&amp;gt;MGC192.168.95.245@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1382564550 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1&lt;br/&gt;
Lustre: server umount rhino-MDT0000 complete&lt;br/&gt;
LustreError: 13321:0:(obd_mount.c:1275:lustre_fill_super()) Unable to mount  (-17)&lt;/p&gt;</description>
                <environment>CentOS 6.3, Lustre 2.4.0</environment>
        <key id="21612">LU-4138</key>
            <summary>Problem with migrating from 1 MDT to 2 MDT</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="mdiep">Minh Diep</assignee>
                                    <reporter username="haisong">Haisong Cai</reporter>
                        <labels>
                            <label>Sdsc</label>
                    </labels>
                <created>Wed, 23 Oct 2013 21:59:06 +0000</created>
                <updated>Fri, 8 May 2015 21:09:07 +0000</updated>
                            <resolved>Fri, 8 May 2015 21:09:07 +0000</resolved>
                                    <version>Lustre 2.4.0</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>9</watches>
                                                                            <comments>
                            <comment id="69701" author="pjones" created="Thu, 24 Oct 2013 00:01:08 +0000"  >&lt;p&gt;Yu, Jian&lt;/p&gt;

&lt;p&gt;Could you please advise on this one?&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="69760" author="mdiep" created="Thu, 24 Oct 2013 15:32:07 +0000"  >&lt;p&gt;Haisong,&lt;/p&gt;

&lt;p&gt;I just noticed this&lt;br/&gt;
LustreError: 15c-8: MGC192.168.95.245@tcp: The configuration from log &apos;rhino-MDT0000&apos; failed (-17). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.&lt;/p&gt;

&lt;p&gt;the ip address on the MGC seems to be the old/previous ip, can you confirm? If you moved the MDS with ip change, we should --writeconf to wipe the ip as well, no?&lt;/p&gt;</comment>
                            <comment id="69770" author="adilger" created="Thu, 24 Oct 2013 16:20:43 +0000"  >&lt;p&gt;Better would be to use &quot;lctl replace_nids&quot; instead of a whole writeconf.&lt;/p&gt;</comment>
                            <comment id="69821" author="adilger" created="Thu, 24 Oct 2013 18:07:58 +0000"  >&lt;p&gt;I think the goal is to replace the old MDS hardware with a new node and disks. &lt;/p&gt;

&lt;p&gt;Lustre is confused because you are formatting the new MDS and then restoring from the tar backup, but for some reason the MDT thinks it is new. Maybe the label on the MDS needs to be fixed?  What does &quot;e2label /dev/md1&quot; report?&lt;/p&gt;

&lt;p&gt;I submitted a patch under &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14&quot; title=&quot;live replacement of OST&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14&quot;&gt;&lt;del&gt;LU-14&lt;/del&gt;&lt;/a&gt; to add the &quot;--replace&quot; option to mkfs.lustre for this case, but it is not in 2.4.&lt;/p&gt;

&lt;p&gt;Note that I would also recommend to use Lustre 2.4.1 instead of 2.4.0 so you get the other fixes included there. &lt;/p&gt;</comment>
                            <comment id="69824" author="mdiep" created="Thu, 24 Oct 2013 18:20:19 +0000"  >&lt;ol&gt;
	&lt;li&gt;e2label /dev/md1&lt;br/&gt;
rhino-MDT0000&lt;/li&gt;
&lt;/ol&gt;


&lt;p&gt;The new server has been configured with the same IP address as the old one. My question above was incorrect because I logged into a different server. We tried lctl replace_nids /dev/md1 nids, but it requires the MGT/MDT to be mounted, which failed:&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@lustre-mds-0-0 modprobe.d&amp;#93;&lt;/span&gt;# lctl replace_nids /dev/md1  192.168.95.245@tcp&lt;br/&gt;
No device found for name MGS: Invalid argument&lt;br/&gt;
This command must be run on the MGS.&lt;br/&gt;
error: replace_nids: Invalid argument&lt;/p&gt;</comment>
                            <comment id="69837" author="mdiep" created="Thu, 24 Oct 2013 19:52:11 +0000"  >&lt;p&gt;Here are the commands that I ran to make it work. I will try a couple more different scenarios.&lt;/p&gt;

&lt;p&gt;  658  reboot&lt;br/&gt;
  659  mkfs.lustre --reformat --fsname=rhino --param mdt.quota_type=ug --mgs --mdt --index=0 /dev/md0&lt;br/&gt;
  660  mount -t ldiskfs /dev/md0 /mnt&lt;br/&gt;
  661  cd /mnt&lt;br/&gt;
  662  ls&lt;br/&gt;
  663  /usr/local/bin/tar xzvpf /share/apps/tmp/rhino_mdt.tgz --sparse&lt;br/&gt;
  664  setfattr --restore=/share/apps/tmp/ea-20131023.bak&lt;br/&gt;
  665  cd&lt;br/&gt;
  666  umount /mnt&lt;br/&gt;
  667  tunefs.lustre --writeconf --reformat --fsname=rhino /dev/md0&lt;br/&gt;
  668  tunefs.lustre --writeconf --reformat --fsname=rhino --mgs --mdt /dev/md0&lt;br/&gt;
  669  mount -t lustre /dev/md0 /rhino/&lt;br/&gt;
  670  lctl dl&lt;/p&gt;</comment>
                            <comment id="69864" author="yujian" created="Fri, 25 Oct 2013 02:45:23 +0000"  >&lt;p&gt;Hi Minh,&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;tunefs.lustre --writeconf --reformat --fsname=rhino --mgs --mdt /dev/md0&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;IMHO, the &quot;--reformat&quot; option is not needed here. So, do I understand correctly that running &quot;tunefs.lustre --writeconf&quot; can resolve the original &quot;index is already in use&quot; failure because the &quot;LDD_F_WRITECONF&quot; flag is set?&lt;/p&gt;</comment>
                            <comment id="69866" author="mdiep" created="Fri, 25 Oct 2013 04:13:14 +0000"  >&lt;p&gt;Hi YuJian,&lt;/p&gt;

&lt;p&gt;No, tunefs.lustre --writeconf --mgs --mdt /dev/md0 did not solve the issue; in this case, this is a bug.&lt;br/&gt;
I had to use either --reformat --fsname=rhino or --reformat --fsname=rhino --mgs --mdt (not both).&lt;/p&gt;
</comment>
                            <comment id="71181" author="haisong" created="Fri, 8 Nov 2013 23:02:34 +0000"  >&lt;p&gt;Minh,&lt;/p&gt;

&lt;p&gt;   The IP address is the one from the previous server running 1.8.7.&lt;br/&gt;
   Our test has been &quot;data in place&quot; from the existing 1.8.7 servers to&lt;br/&gt;
2.4.*; we are going to keep the IPs.&lt;br/&gt;
   Am I missing something here?&lt;/p&gt;

&lt;p&gt;   It&apos;s true that in our test case, MGC192.168.95.245@tcp was used by&lt;br/&gt;
other hardware.&lt;br/&gt;
   But we have moved the cable from the old server to the new one. The old server has no&lt;br/&gt;
Lustre running.&lt;/p&gt;

&lt;p&gt;Haisong&lt;/p&gt;</comment>
                            <comment id="71550" author="mdiep" created="Thu, 14 Nov 2013 17:17:22 +0000"  >&lt;p&gt;I found that we needed to --writeconf and unmount all the OSTs while working on the MDS/MDT. So far it seems like user error. I will take this bug and verify further.&lt;/p&gt;</comment>
                            <comment id="71638" author="mdiep" created="Fri, 15 Nov 2013 17:16:16 +0000"  >&lt;p&gt;Procedure to restore and upgrade MDS and add another MDT&lt;/p&gt;

&lt;p&gt;On MDS0&lt;/p&gt;

&lt;p&gt;1. mkfs.lustre --reformat --fsname=rhino --param mdt.quota_type=ug --mgs --mdt --index=0 /dev/md0&lt;br/&gt;
2. mount -t ldiskfs /dev/md0 /mnt&lt;br/&gt;
3. cd /mnt&lt;br/&gt;
4. /usr/local/bin/tar xzvpf /share/apps/tmp/rhino_mdt.tgz --sparse&lt;br/&gt;
5. setfattr --restore=/share/apps/tmp/ea-20131023.bak&lt;br/&gt;
6. cd; umount /mnt&lt;br/&gt;
7. tunefs.lustre --erase-params /dev/md0&lt;br/&gt;
8. tunefs.lustre --writeconf --param=&quot;failover.node=&amp;lt;MDS1 nid&amp;gt;&quot; --mgs --mdt /dev/md0&lt;br/&gt;
9. mount -t lustre -o writeconf /dev/md0 /rhino/&lt;br/&gt;
10. umount /rhino&lt;br/&gt;
11. mount -t ldiskfs /dev/md0 /mnt&lt;br/&gt;
12. ls -l /mnt/CONFIG* (check the timestamp of the file to see if it&apos;s current)&lt;br/&gt;
13. llog_reader /mnt/CONFIG*/rhino-MDT0000 (the output should show about 7 to 9 lines with a current timestamp)&lt;br/&gt;
14. umount /mnt&lt;br/&gt;
15. mount -t lustre /dev/md0 /rhino&lt;/p&gt;

&lt;p&gt;On MDS1&lt;/p&gt;

&lt;p&gt;1. mkfs.lustre --reformat --fsname rhino --param mdt.quota_type=ug --mgsnode &amp;lt;MDS0 nid&amp;gt; --failnode &amp;lt;MDS0 nid&amp;gt; --mdt --index 1 /dev/md1&lt;br/&gt;
2. mount -t lustre /dev/md1 /rhino&lt;br/&gt;
3. lctl dl on both MDS0 and MDS1 to see if they have MDT0001&lt;/p&gt;

&lt;p&gt;On OSS (repeat on all OSS)&lt;/p&gt;

&lt;p&gt;1. tunefs.lustre --erase-params /dev/sdb (repeat for all devices) &lt;br/&gt;
2. tunefs.lustre --writeconf --param=&quot;ost.quota_type=ug&quot; --param=&quot;failover.mode=failout&quot; --mgsnode=&amp;lt;MDS0 nid&amp;gt; --mgsnode=&amp;lt;MDS1 nid&amp;gt; --ost /dev/sdb (repeat on all devices)&lt;br/&gt;
3. mount -t lustre -o writeconf /dev/sdb /rhino/sdb (repeat on all devices)&lt;/p&gt;

&lt;p&gt;On Clients&lt;/p&gt;

&lt;p&gt;mount -t lustre &amp;lt;MDS0 nid&amp;gt;:&amp;lt;MDS1 nid&amp;gt;:/rhino /rhino&lt;/p&gt;



&lt;p&gt;NOTE: do not use --servicenode option due to &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4243&quot; title=&quot;multiple servicenodes or failnids: wrong client llog registration&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4243&quot;&gt;&lt;del&gt;LU-4243&lt;/del&gt;&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="72048" author="mdiep" created="Thu, 21 Nov 2013 16:19:20 +0000"  >&lt;p&gt;Latest status: we were able to test failover while traffic was going to both MDTs.&lt;/p&gt;

&lt;p&gt;SDSC continues testing and will tear the test cluster down and start over again from 1.8.9 to make sure before the production upgrade.&lt;/p&gt;</comment>
                            <comment id="79079" author="jfc" created="Wed, 12 Mar 2014 01:08:51 +0000"  >&lt;p&gt;Hello Minh and Haisong,&lt;br/&gt;
Any further progress on this issue?&lt;br/&gt;
Should we mark this as resolved?&lt;br/&gt;
Thanks,&lt;br/&gt;
~ jfc.&lt;/p&gt;</comment>
                            <comment id="79082" author="mdiep" created="Wed, 12 Mar 2014 01:19:06 +0000"  >&lt;p&gt;Sure, you can close it for now. We&apos;ll reopen it when we move to 2 MDTs later this year.&lt;/p&gt;</comment>
                            <comment id="114768" author="adilger" created="Fri, 8 May 2015 21:09:07 +0000"  >&lt;p&gt;Closing per last comments.&lt;/p&gt;

&lt;p&gt;Note that with newer Lustre it is possible to use &lt;tt&gt;mkfs.lustre --replace --index=0&lt;/tt&gt; to replace an existing target with one that has been restored from a file-level backup.  This has been tested with OST replacement, but it should also work with MDTs (it marks the target so that it doesn&apos;t try to register with the MGS as a new device).&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzw6pj:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>11229</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>