<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:37:15 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-3829] MDT mount fails if mkfs.lustre is run with multiple mgsnode arguments on MDSs where MGS is not running</title>
                <link>https://jira.whamcloud.com/browse/LU-3829</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;If multiple --mgsnode arguments are provided to mkfs.lustre while formatting an MDT, then mounting that MDT fails on any MDS where the MGS is not running.&lt;/p&gt;

&lt;p&gt;Reproduction Steps:&lt;br/&gt;
Step 1) On MDS0, run the following script:&lt;br/&gt;
mgs_dev=&apos;/dev/mapper/vg_v-mgs&apos;&lt;br/&gt;
mds0_dev=&apos;/dev/mapper/vg_v-mdt&apos;&lt;/p&gt;

&lt;p&gt;mgs_pri_nid=&apos;10.10.11.210@tcp1&apos;&lt;br/&gt;
mgs_sec_nid=&apos;10.10.11.211@tcp1&apos;&lt;/p&gt;

&lt;p&gt;mkfs.lustre --mgs --reformat $mgs_dev&lt;br/&gt;
mkfs.lustre --mgsnode=$mgs_pri_nid --mgsnode=$mgs_sec_nid --failnode=$mgs_sec_nid --reformat --fsname=v --mdt --index=0 $mds0_dev&lt;/p&gt;

&lt;p&gt;mount -t lustre $mgs_dev /lustre/mgs/&lt;br/&gt;
mount -t lustre $mds0_dev  /lustre/v/mdt&lt;/p&gt;

&lt;p&gt;After this, the MGS and MDT0 are mounted on MDS0.&lt;/p&gt;

&lt;p&gt;Step 2.1) On MDS1:&lt;br/&gt;
mdt1_dev=&apos;/dev/mapper/vg_mdt1_v-mdt1&apos;&lt;br/&gt;
mdt2_dev=&apos;/dev/mapper/vg_mdt2_v-mdt2&apos;&lt;/p&gt;

&lt;p&gt;mgs_pri_nid=&apos;10.10.11.210@tcp1&apos;&lt;br/&gt;
mgs_sec_nid=&apos;10.10.11.211@tcp1&apos;&lt;/p&gt;

&lt;p&gt;mkfs.lustre --mgsnode=$mgs_pri_nid --mgsnode=$mgs_sec_nid --failnode=$mgs_pri_nid --reformat --fsname=v --mdt --index=1 $mdt1_dev # The resulting MDT does not mount.&lt;/p&gt;

&lt;p&gt;mount -t lustre $mdt1_dev /lustre/v/mdt1&lt;/p&gt;

&lt;p&gt;The mount of MDT1 will fail with the following error:&lt;br/&gt;
mount.lustre: mount /dev/mapper/vg_mdt1_v-mdt1 at /lustre/v/mdt1 failed: Input/output error&lt;br/&gt;
Is the MGS running?&lt;/p&gt;

&lt;p&gt;These are messages from Lustre logs while trying to mount MDT1:&lt;br/&gt;
LDISKFS-fs (dm-20): mounted filesystem with ordered data mode. quota=on. Opts: &lt;br/&gt;
LDISKFS-fs (dm-20): mounted filesystem with ordered data mode. quota=on. Opts: &lt;br/&gt;
LDISKFS-fs (dm-20): mounted filesystem with ordered data mode. quota=on. Opts: &lt;br/&gt;
Lustre: 7564:0:(client.c:1896:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: &lt;span class=&quot;error&quot;&gt;&amp;#91;sent 1377197751/real 1377197751&amp;#93;&lt;/span&gt;  req@ffff880027956c00 x1444089351391184/t0(0) o250-&amp;gt;MGC10.10.11.210@tcp1@0@lo:26/25 lens 400/544 e 0 to 1 dl 1377197756 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1&lt;br/&gt;
LustreError: 8059:0:(client.c:1080:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff880027956800 x1444089351391188/t0(0) o253-&amp;gt;MGC10.10.11.210@tcp1@0@lo:26/25 lens 4768/4768 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1&lt;br/&gt;
LustreError: 15f-b: v-MDT0001: cannot register this server with the MGS: rc = -5. Is the MGS running?&lt;br/&gt;
LustreError: 8059:0:(obd_mount_server.c:1732:server_fill_super()) Unable to start targets: -5&lt;br/&gt;
LustreError: 8059:0:(obd_mount_server.c:848:lustre_disconnect_lwp()) v-MDT0000-lwp-MDT0001: Can&apos;t end config log v-client.&lt;br/&gt;
LustreError: 8059:0:(obd_mount_server.c:1426:server_put_super()) v-MDT0001: failed to disconnect lwp. (rc=-2)&lt;br/&gt;
LustreError: 8059:0:(obd_mount_server.c:1456:server_put_super()) no obd v-MDT0001&lt;br/&gt;
LustreError: 8059:0:(obd_mount_server.c:137:server_deregister_mount()) v-MDT0001 not registered&lt;br/&gt;
Lustre: server umount v-MDT0001 complete&lt;br/&gt;
LustreError: 8059:0:(obd_mount.c:1277:lustre_fill_super()) Unable to mount  (-5) &lt;/p&gt;

&lt;p&gt;Step 2.2) On MDS1:&lt;br/&gt;
mdt1_dev=&apos;/dev/mapper/vg_mdt1_v-mdt1&apos;&lt;br/&gt;
mdt2_dev=&apos;/dev/mapper/vg_mdt2_v-mdt2&apos;&lt;/p&gt;

&lt;p&gt;mgs_pri_nid=&apos;10.10.11.210@tcp1&apos;&lt;br/&gt;
mgs_sec_nid=&apos;10.10.11.211@tcp1&apos;&lt;/p&gt;

&lt;p&gt;mkfs.lustre --mgsnode=$mgs_pri_nid  --failnode=$mgs_pri_nid --reformat --fsname=v --mdt --index=1 $mdt1_dev&lt;/p&gt;

&lt;p&gt;mount -t lustre $mdt1_dev /lustre/v/mdt1&lt;/p&gt;


&lt;p&gt;With this, MDT1 mounts successfully. The only difference is that the second &quot;--mgsnode&quot; is not provided to mkfs.lustre.&lt;/p&gt;

&lt;p&gt;Step 3) On MDS1 again:&lt;br/&gt;
mkfs.lustre --mgsnode=$mgs_pri_nid --mgsnode=$mgs_sec_nid  --failnode=$mgs_pri_nid --reformat --fsname=v --mdt --index=2 $mdt2_dev&lt;br/&gt;
mount -t lustre $mdt2_dev /lustre/v/mdt2&lt;/p&gt;

&lt;p&gt;Once MDT1 is mounted, using a second &quot;--mgsnode&quot; option works without any errors and the mount of MDT2 succeeds.&lt;/p&gt;

&lt;p&gt;Lustre Versions: Reproducible on versions 2.4.0 and 2.4.91.&lt;/p&gt;

&lt;p&gt;Conclusion: Due to this bug, MDTs do not mount on MDSs that are not running the MGS. With the workaround (omitting the second &quot;--mgsnode&quot;), HA will not be properly configured.&lt;br/&gt;
Also note that this issue is not related to DNE: the same issue and &quot;workaround&quot; apply to an MDT of a different filesystem on MDS1 as well.&lt;/p&gt;</description>
                <environment></environment>
        <key id="20587">LU-3829</key>
            <summary>MDT mount fails if mkfs.lustre is run with multiple mgsnode arguments on MDSs where MGS is not running</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="2" iconUrl="https://jira.whamcloud.com/images/icons/priorities/critical.svg">Critical</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="bobijam">Zhenyu Xu</assignee>
                                    <reporter username="kalpak">Kalpak Shah</reporter>
                        <labels>
                    </labels>
                <created>Fri, 23 Aug 2013 07:06:57 +0000</created>
                <updated>Thu, 26 Jan 2017 04:24:11 +0000</updated>
                            <resolved>Wed, 2 Oct 2013 18:19:09 +0000</resolved>
                                    <version>Lustre 2.4.0</version>
                    <version>Lustre 2.5.0</version>
                                    <fixVersion>Lustre 2.5.0</fixVersion>
                    <fixVersion>Lustre 2.4.2</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>17</watches>
                                                                            <comments>
                            <comment id="64954" author="spimpale" created="Fri, 23 Aug 2013 14:12:51 +0000"  >&lt;p&gt;In the above case, while mounting an MDT on MDS1, one of the mgsnodes is MDS1 itself.&lt;/p&gt;

&lt;p&gt;It looks like ptlrpc_uuid_to_peer() calculates the distance to NIDs using LNetDist() and chooses the one with the least distance (which in this case turns out to be MDS1 itself, which does not have a running MGS).&lt;/p&gt;

&lt;p&gt;Removing MDS1 from mgsnode and adding a different node worked for me.&lt;/p&gt;</comment>
                            <comment id="65322" author="pjones" created="Wed, 28 Aug 2013 22:22:55 +0000"  >&lt;p&gt;Bobijam&lt;/p&gt;

&lt;p&gt;Could you please comment on this one?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="65335" author="bobijam" created="Thu, 29 Aug 2013 00:31:07 +0000"  >&lt;p&gt;in Step 2.1&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;These are messages from Lustre logs while trying to mount MDT1:&lt;br/&gt;
...&lt;br/&gt;
Lustre: 7564:0:(client.c:1896:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: &lt;span class=&quot;error&quot;&gt;&amp;#91;sent 1377197751/real 1377197751&amp;#93;&lt;/span&gt; req@ffff880027956c00 x1444089351391184/t0(0) o250-&amp;gt;MGC10.10.11.210@tcp1@0@lo:26/25 lens 400/544 e 0 to 1 dl 1377197756 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1&lt;br/&gt;
LustreError: 8059:0:(client.c:1080:ptlrpc_import_delay_req()) @@@ send limit expired req@ffff880027956800 x1444089351391188/t0(0) o253-&amp;gt;MGC10.10.11.210@tcp1@0@lo:26/25 lens 4768/4768 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1&lt;br/&gt;
LustreError: 15f-b: v-MDT0001: cannot register this server with the MGS: rc = -5. Is the MGS running?&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;So this is MDS1 (which has 10.10.11.211) trying to connect to the MGS service on MDS0 (10.10.11.210), is it?&lt;/p&gt;</comment>
                            <comment id="65339" author="spimpale" created="Thu, 29 Aug 2013 06:18:44 +0000"  >&lt;p&gt;Yes. But as you can see, while trying to connect it has incorrectly calculated the peer to be 0@lo.&lt;br/&gt;
If we give just a single mgsnode (10.10.11.210), we see the following message in the logs:&lt;/p&gt;

&lt;p&gt;(import.c:480:import_select_connection()) MGC10.10.11.210@tcp: connect to NID 10.10.11.210@tcp last attempt 0&lt;/p&gt;

&lt;p&gt;But if two mgsnodes are given (MDS0 and MDS1 in this case), we see the following message:&lt;/p&gt;

&lt;p&gt;(import.c:480:import_select_connection()) MGC10.10.11.210@tcp: connect to NID 0@lo last attempt 0&lt;/p&gt;</comment>
                            <comment id="65352" author="bobijam" created="Thu, 29 Aug 2013 11:03:21 +0000"  >&lt;p&gt;It&apos;s a complex configuration which, IIRC, the manual does not mention (separate MGS and MDS while specifying multiple mgsnodes for the MDT device).&lt;/p&gt;

&lt;p&gt;The problem is:&lt;/p&gt;

&lt;p&gt;When you specify multiple mgsnodes for the MDT, it adds (remote_uuid, remote_nid) pairs; in this case MDS1 has (MGC10.10.11.210@tcp1, 10.10.11.210@tcp1) and (MGC10.10.11.210@tcp1, 10.10.11.211@tcp1).&lt;/p&gt;

&lt;p&gt;When the MGC tries to start upon this MDT, it tries to find the correct peer NID (aka remote NID) for MGC10.10.11.210, and it has two choices. It always chooses 10.10.11.211@tcp1 as its peer NID (its source NID is always 10.10.11.211@tcp1 or 0@lo), since it thinks remote NID 10.10.11.211@tcp1 has the shortest distance (0), and tries to connect to the MGS service on 10.10.11.211@tcp1, which cannot succeed.&lt;/p&gt;

&lt;p&gt;When the import retries, it repeats the above procedure and eventually fails.&lt;/p&gt;</comment>
                            <comment id="65353" author="apittman" created="Thu, 29 Aug 2013 12:21:49 +0000"  >&lt;p&gt;This is a normal configuration for running two filesystems on an active/active pair of MDS nodes; in addition, running DNE with a single filesystem and two MDTs on an active/active pair of MDS nodes is affected as well. Ideally we&apos;d locate the MGS on entirely separate nodes to allow IR; however, that requires additional hardware and is a relatively expensive solution, so whilst doing that would avoid this problem it&apos;s not a universal solution.&lt;/p&gt;</comment>
                            <comment id="65427" author="bobijam" created="Fri, 30 Aug 2013 11:04:58 +0000"  >&lt;p&gt;Would you please try this patch &lt;a href=&quot;http://review.whamcloud.com/7509&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/7509&lt;/a&gt; ?&lt;/p&gt;

&lt;p&gt;The idea is to change the import connection setting: rather than always setting the closest connection as the import connection, it adds all possible connections, giving the import a chance to try other connections.&lt;/p&gt;</comment>
                            <comment id="65435" author="apittman" created="Fri, 30 Aug 2013 13:40:43 +0000"  >&lt;p&gt;I&apos;ve applied this patch to the 2.4.0 tag on b2_4 and can confirm that initial checks show MDTs are starting correctly now.&lt;/p&gt;

&lt;p&gt;We are now seeing a different issue: I am unable to start a client on a node which is already running as an MDT/OSS:&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;root@victrix-oss0 ~&amp;#93;&lt;/span&gt;# modprobe lustre&lt;br/&gt;
WARNING: Error inserting lov (/lib/modules/2.6.32-358.14.1.el6_lustre.es61.x86_64/updates/kernel/fs/lustre/lov.ko): Operation already in progress&lt;br/&gt;
FATAL: Error inserting lustre (/lib/modules/2.6.32-358.14.1.el6_lustre.es61.x86_64/updates/kernel/fs/lustre/lustre.ko): Operation already in progress&lt;/p&gt;

&lt;p&gt;The log file is here:&lt;br/&gt;
Aug 30 06:21:46 victrix-oss0 kernel: : LustreError: 3339:0:(lprocfs_status.c:497:lprocfs_register()) entry &apos;osc&apos; already registered&lt;br/&gt;
Aug 30 06:21:46 victrix-oss0 kernel: : LustreError: 165-2: Nothing registered for client mount! Is the &apos;lustre&apos; module loaded?&lt;br/&gt;
Aug 30 06:21:46 victrix-oss0 kernel: : LustreError: 3341:0:(obd_mount.c:1297:lustre_fill_super()) Unable to mount  (-19)&lt;/p&gt;

&lt;p&gt;This doesn&apos;t look directly related to the changeset; however, we weren&apos;t seeing it before, and this changeset should be the only change to the build. I&apos;ll keep investigating this second problem; let me know if you want me to file it as a separate issue.&lt;/p&gt;</comment>
                            <comment id="65442" author="kalpak" created="Fri, 30 Aug 2013 14:54:56 +0000"  >&lt;p&gt;The failure to load the lov module definitely looks like an unrelated error.&lt;/p&gt;</comment>
                            <comment id="65768" author="apittman" created="Wed, 4 Sep 2013 19:24:04 +0000"  >&lt;p&gt;We found an issue with our build system: whilst picking up this patch I also picked up some other changes, which is where the LOV issue came from.&lt;/p&gt;

&lt;p&gt;This patch fixes the issue for us in all cases we&apos;ve found. Would it be possible to have it in 2.4.1?&lt;/p&gt;</comment>
                            <comment id="65788" author="bobijam" created="Thu, 5 Sep 2013 02:35:37 +0000"  >&lt;p&gt;patch tracking for b2_4 (&lt;a href=&quot;http://review.whamcloud.com/7553&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/7553&lt;/a&gt;)&lt;/p&gt;</comment>
                            <comment id="67039" author="kalpak" created="Thu, 19 Sep 2013 17:21:54 +0000"  >&lt;p&gt;This is with 2.4.1 + patchsets from &lt;a href=&quot;http://review.whamcloud.com/7553&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/7553&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;With patchset 2, we are seeing an issue whereby remote directory operations across MDSes fail. health_check on these MDSes reports something like:&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;root@gemina-mds0 ~&amp;#93;&lt;/span&gt;# cat /proc/fs/lustre/health_check &lt;br/&gt;
device v-MDT0003-osp-MDT0000 reported unhealthy&lt;br/&gt;
device v-MDT0002-osp-MDT0001 reported unhealthy&lt;br/&gt;
device v-MDT0002-osp-MDT0000 reported unhealthy&lt;br/&gt;
NOT HEALTHY&lt;/p&gt;

&lt;p&gt;And it remains in this unhealthy state. Clients are unable to perform remote directory operations at all.&lt;/p&gt;

&lt;p&gt;With patchset 1, we do not see this issue. &lt;/p&gt;</comment>
                            <comment id="67335" author="james beal" created="Tue, 24 Sep 2013 11:38:43 +0000"  >&lt;p&gt;Another data point: we are running a pure Lustre 2.4.1 system with a filesystem that we have upgraded from Lustre 1.8. We have the MGS and MDT on separate nodes.&lt;/p&gt;

&lt;p&gt;In summary, at least for this filesystem, the MDT has to be mounted at least once collocated with the MGS after a writeconf (including the upgrade from Lustre 1.8 to Lustre 2.4.1).&lt;/p&gt;

&lt;p&gt;Initially the MDT was failing to mount with the following messages in the kernel log.&lt;/p&gt;

&lt;p&gt;Sep 24 08:31:56 lus03-mds2 kernel: Lustre: Lustre: Build Version: v2_4_1_0--CHANGED-2.6.32-jb23-358.18.1.el6-lustre-2.4.1&lt;br/&gt;
Sep 24 08:31:56 lus03-mds2 kernel: LNet: Added LNI 172.17.128.134@tcp &lt;span class=&quot;error&quot;&gt;&amp;#91;8/256/0/180&amp;#93;&lt;/span&gt;&lt;br/&gt;
Sep 24 08:31:56 lus03-mds2 kernel: LNet: Accept secure, port 988&lt;br/&gt;
Sep 24 08:33:00 lus03-mds2 kernel: LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. quota=on. Opts: &lt;br/&gt;
Sep 24 08:33:11 lus03-mds2 kernel: LustreError: 14026:0:(client.c:1052:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff8804a59b0800 x1447043180527624/t0(0) o253-&amp;gt;MGC172.17.128.135@tcp@0@lo:26/25 lens 4768/4768 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1&lt;br/&gt;
Sep 24 08:33:11 lus03-mds2 kernel: LustreError: 14026:0:(obd_mount_server.c:1124:server_register_target()) lus03-MDT0000: error registering with the MGS: rc = -5 (not fatal)&lt;br/&gt;
Sep 24 08:33:13 lus03-mds2 kernel: LustreError: 137-5: lus03-MDT0000_UUID: not available for connect from 172.17.97.43@tcp (no target)&lt;br/&gt;
Sep 24 08:33:18 lus03-mds2 kernel: LustreError: 14026:0:(client.c:1052:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff880462faec00 x1447043180527628/t0(0) o101-&amp;gt;MGC172.17.128.135@tcp@0@lo:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1&lt;br/&gt;
Sep 24 08:33:23 lus03-mds2 kernel: LustreError: 137-5: lus03-MDT0000_UUID: not available for connect from 172.17.97.209@tcp (no target)&lt;br/&gt;
Sep 24 08:33:24 lus03-mds2 kernel: LustreError: 14026:0:(client.c:1052:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff88049c84dc00 x1447043180527632/t0(0) o101-&amp;gt;MGC172.17.128.135@tcp@0@lo:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1&lt;br/&gt;
Sep 24 08:33:24 lus03-mds2 kernel: Lustre: 14113:0:(obd_config.c:1428:class_config_llog_handler()) For 1.8 interoperability, rename obd type from mds to mdt&lt;br/&gt;
Sep 24 08:33:24 lus03-mds2 kernel: Lustre: lus03-MDT0000: used disk, loading&lt;br/&gt;
Sep 24 08:33:24 lus03-mds2 kernel: LustreError: 14113:0:(sec_config.c:1115:sptlrpc_target_local_read_conf()) missing llog context&lt;br/&gt;
Sep 24 08:33:24 lus03-mds2 kernel: Lustre: 14113:0:(mdt_handler.c:4948:mdt_process_config()) For interoperability, skip this mdt.quota_type. It is obsolete.&lt;br/&gt;
Sep 24 08:33:24 lus03-mds2 kernel: Lustre: 13962:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: &lt;span class=&quot;error&quot;&gt;&amp;#91;sent 1380008004/real 1380008004&amp;#93;&lt;/span&gt;  req@ffff880498bc7000 x1447043180527640/t0(0) o8-&amp;gt;lus03-OST0001-osc@172.17.128.130@tcp:28/4 lens 400/544 e 0 to 1 dl 1380008109 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1&lt;br/&gt;
Sep 24 08:33:24 lus03-mds2 kernel: Lustre: 13962:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: &lt;span class=&quot;error&quot;&gt;&amp;#91;sent 1380008004/real 1380008004&amp;#93;&lt;/span&gt;  req@ffff88044f451000 x1447043180527648/t0(0) o8-&amp;gt;lus03-OST0003-osc@172.17.128.131@tcp:28/4 lens 400/544 e 0 to 1 dl 1380008109 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1&lt;br/&gt;
Sep 24 08:33:25 lus03-mds2 kernel: Lustre: 13962:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: &lt;span class=&quot;error&quot;&gt;&amp;#91;sent 1380008005/real 1380008005&amp;#93;&lt;/span&gt;  req@ffff88047389e000 x1447043180527668/t0(0) o8-&amp;gt;lus03-OST0008-osc@172.17.128.130@tcp:28/4 lens 400/544 e 0 to 1 dl 1380008110 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1&lt;br/&gt;
Sep 24 08:33:25 lus03-mds2 kernel: Lustre: 13962:0:(client.c:1868:ptlrpc_expire_one_request()) Skipped 4 previous similar messages&lt;br/&gt;
Sep 24 08:33:26 lus03-mds2 kernel: Lustre: 13962:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: &lt;span class=&quot;error&quot;&gt;&amp;#91;sent 1380008006/real 1380008006&amp;#93;&lt;/span&gt;  req@ffff88047389e000 x1447043180527704/t0(0) o8-&amp;gt;lus03-OST0011-osc@172.17.128.132@tcp:28/4 lens 400/544 e 0 to 1 dl 1380008111 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1&lt;br/&gt;
Sep 24 08:33:26 lus03-mds2 kernel: Lustre: 13962:0:(client.c:1868:ptlrpc_expire_one_request()) Skipped 5 previous similar messages&lt;br/&gt;
Sep 24 08:33:33 lus03-mds2 kernel: LustreError: 14026:0:(client.c:1052:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff880498810400 x1447043180527756/t0(0) o101-&amp;gt;MGC172.17.128.135@tcp@0@lo:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1&lt;br/&gt;
Sep 24 08:33:38 lus03-mds2 kernel: LustreError: 14026:0:(client.c:1052:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff880498810400 x1447043180527760/t0(0) o101-&amp;gt;MGC172.17.128.135@tcp@0@lo:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1&lt;br/&gt;
Sep 24 08:33:38 lus03-mds2 kernel: LustreError: 13a-8: Failed to get MGS log lus03-client and no local copy.&lt;br/&gt;
Sep 24 08:33:38 lus03-mds2 kernel: LustreError: 15c-8: MGC172.17.128.135@tcp: The configuration from log &apos;lus03-client&apos; failed (-107). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.&lt;br/&gt;
Sep 24 08:33:38 lus03-mds2 kernel: LustreError: 14026:0:(obd_mount_server.c:1275:server_start_targets()) lus03-MDT0000: failed to start LWP: -107&lt;br/&gt;
Sep 24 08:33:38 lus03-mds2 kernel: LustreError: 14026:0:(obd_mount_server.c:1700:server_fill_super()) Unable to start targets: -107&lt;br/&gt;
Sep 24 08:33:38 lus03-mds2 kernel: Lustre: Failing over lus03-MDT0000&lt;/p&gt;

&lt;p&gt;I chatted to Sven at DDN, and in the first instance we tried to fix the &quot;rename obd type from mds to mdt&quot; message with a tunefs.lustre --writeconf setting &quot;mdd.quota_type=ug&quot;; this did not fix the issue (however, it is worth noting that I didn&apos;t use erase-config, so I ended up repeating the writeconf later). We then looked at &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3829&quot; title=&quot;MDT mount fails if mkfs.lustre is run with multiple mgsnode arguments on MDSs where MGS is not running&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3829&quot;&gt;&lt;del&gt;LU-3829&lt;/del&gt;&lt;/a&gt;, and as a test we mounted the MDT on the MGS node, which succeeded. Out of curiosity I unmounted the MDT from the MGS node and tried to mount it on its native MDT node; this succeeded.&lt;/p&gt;

&lt;p&gt;Sep 24 09:51:48 lus03-mds2 kernel: Lustre: Lustre: Build Version: v2_4_1_0--CHANGED-2.6.32-jb23-358.18.1.el6-lustre-2.4.1&lt;br/&gt;
Sep 24 09:51:48 lus03-mds2 kernel: LNet: Added LNI 172.17.128.134@tcp &lt;span class=&quot;error&quot;&gt;&amp;#91;8/256/0/180&amp;#93;&lt;/span&gt;&lt;br/&gt;
Sep 24 09:51:48 lus03-mds2 kernel: LNet: Accept secure, port 988&lt;br/&gt;
Sep 24 09:51:48 lus03-mds2 kernel: LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. quota=on. Opts: &lt;br/&gt;
Sep 24 09:52:00 lus03-mds2 kernel: LustreError: 21871:0:(client.c:1052:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff8804b67e9c00 x1447048205303816/t0(0) o253-&amp;gt;MGC172.17.128.135@tcp@0@lo:26/25 lens 4768/4768 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1&lt;br/&gt;
Sep 24 09:52:00 lus03-mds2 kernel: LustreError: 21871:0:(obd_mount_server.c:1124:server_register_target()) lus03-MDT0000: error registering with the MGS: rc = -5 (not fatal)&lt;br/&gt;
Sep 24 09:52:06 lus03-mds2 kernel: LustreError: 21871:0:(client.c:1052:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff88046f6e2c00 x1447048205303820/t0(0) o101-&amp;gt;MGC172.17.128.135@tcp@0@lo:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1&lt;br/&gt;
Sep 24 09:52:12 lus03-mds2 kernel: LustreError: 21871:0:(client.c:1052:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff88046f6e2c00 x1447048205303824/t0(0) o101-&amp;gt;MGC172.17.128.135@tcp@0@lo:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1&lt;br/&gt;
Sep 24 09:52:12 lus03-mds2 kernel: Lustre: 21992:0:(obd_config.c:1428:class_config_llog_handler()) For 1.8 interoperability, rename obd type from mds to mdt&lt;br/&gt;
Sep 24 09:52:12 lus03-mds2 kernel: Lustre: lus03-MDT0000: used disk, loading&lt;br/&gt;
Sep 24 09:52:12 lus03-mds2 kernel: LustreError: 21992:0:(sec_config.c:1115:sptlrpc_target_local_read_conf()) missing llog context&lt;br/&gt;
Sep 24 09:52:12 lus03-mds2 kernel: Lustre: 21992:0:(mdt_handler.c:4948:mdt_process_config()) For interoperability, skip this mdt.quota_type. It is obsolete.&lt;br/&gt;
Sep 24 09:52:12 lus03-mds2 kernel: Lustre: 21906:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: &lt;span class=&quot;error&quot;&gt;&amp;#91;sent 1380012732/real 1380012732&amp;#93;&lt;/span&gt;  req@ffff88049dba5400 x1447048205303832/t0(0) o8-&amp;gt;lus03-OST0001-osc@172.17.128.130@tcp:28/4 lens 400/544 e 0 to 1 dl 1380012837 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1&lt;br/&gt;
Sep 24 09:52:12 lus03-mds2 kernel: Lustre: 21906:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: &lt;span class=&quot;error&quot;&gt;&amp;#91;sent 1380012732/real 1380012732&amp;#93;&lt;/span&gt;  req@ffff8804a56a7400 x1447048205303840/t0(0) o8-&amp;gt;lus03-OST0003-osc@172.17.128.131@tcp:28/4 lens 400/544 e 0 to 1 dl 1380012837 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1&lt;br/&gt;
Sep 24 09:52:18 lus03-mds2 kernel: LustreError: 21871:0:(client.c:1052:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff880486069400 x1447048205303948/t0(0) o101-&amp;gt;MGC172.17.128.135@tcp@0@lo:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1&lt;br/&gt;
Sep 24 09:52:24 lus03-mds2 kernel: LustreError: 21871:0:(client.c:1052:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff880486069400 x1447048205303952/t0(0) o101-&amp;gt;MGC172.17.128.135@tcp@0@lo:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1&lt;br/&gt;
Sep 24 09:52:24 lus03-mds2 kernel: LustreError: 11-0: lus03-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation mds_connect failed with -11.&lt;br/&gt;
Sep 24 09:52:35 lus03-mds2 kernel: LustreError: 21871:0:(client.c:1052:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff880486069400 x1447048205303960/t0(0) o253-&amp;gt;MGC172.17.128.135@tcp@0@lo:26/25 lens 4768/4768 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1&lt;/p&gt;

&lt;p&gt;We noted that the writeconf had appended the params, so I repeated the process with erase-conf and tried to remount the MDT on its native node, which failed. I then mounted the MDT collocated with the MGS and it succeeded; after this I unmounted the MDT and it mounted successfully on the MDT&apos;s native node.&lt;/p&gt;</comment>
                            <comment id="67637" author="bobijam" created="Thu, 26 Sep 2013 01:17:33 +0000"  >&lt;p&gt;Hi Kalpak,&lt;/p&gt;

&lt;p&gt;Can you collect the Lustre debug log of the remote directory operation failure? Thanks.&lt;/p&gt;</comment>
                            <comment id="68180" author="pjones" created="Wed, 2 Oct 2013 18:19:09 +0000"  >&lt;p&gt;Landed for 2.5.0. Should also land to b2_4 for 2.4.2&lt;/p&gt;</comment>
                            <comment id="144491" author="kjstrosahl" created="Thu, 3 Mar 2016 15:07:30 +0000"  >&lt;p&gt;I&apos;m experiencing the same bug in 2.5.3 when trying to attach an OST while the MGS is on its failover node.&lt;/p&gt;

&lt;p&gt;Mar  3 08:50:45 scoss1501 kernel: Lustre: Lustre: Build Version: 2.5.3-RC1--PRISTINE-2.6.32-431.23.3.el6_lustre.x86_64&lt;br/&gt;
Mar  3 08:50:51 scoss1501 kernel: Lustre: 7401:0:(client.c:1918:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: &lt;span class=&quot;error&quot;&gt;&amp;#91;sent 1457013046/real 1457013046&amp;#93;&lt;/span&gt;  req@ffff88202ba58c00 x1527788910673924/t0(0) o250-&amp;gt;MGC&amp;lt;head A&amp;gt;@o2ib@&amp;lt;head A&amp;gt;@o2ib:26/25 lens 400/544 e 0 to 1 dl 1457013051 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1&lt;br/&gt;
Mar  3 08:50:57 scoss1501 kernel: LustreError: 7385:0:(client.c:1083:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff88202ba58800 x1527788910673928/t0(0) o253-&amp;gt;MGC&amp;lt;head A&amp;gt;@o2ib@&amp;lt;head A&amp;gt;@o2ib:26/25 lens 4768/4768 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1&lt;br/&gt;
Mar  3 08:50:57 scoss1501 kernel: LustreError: 15f-b: lustre2-OST0042: cannot register this server with the MGS: rc = -5. Is the MGS running?&lt;br/&gt;
Mar  3 08:50:57 scoss1501 kernel: LustreError: 7385:0:(obd_mount_server.c:1723:server_fill_super()) Unable to start targets: -5&lt;br/&gt;
Mar  3 08:50:57 scoss1501 kernel: LustreError: 7385:0:(obd_mount_server.c:845:lustre_disconnect_lwp()) lustre2-MDT0000-lwp-OST0042: Can&apos;t end config log lustre2-client.&lt;br/&gt;
Mar  3 08:50:57 scoss1501 kernel: LustreError: 7385:0:(obd_mount_server.c:1420:server_put_super()) lustre2-OST0042: failed to disconnect lwp. (rc=-2)&lt;br/&gt;
Mar  3 08:50:57 scoss1501 kernel: LustreError: 7385:0:(obd_mount_server.c:1450:server_put_super()) no obd lustre2-OST0042&lt;br/&gt;
Mar  3 08:50:57 scoss1501 kernel: LustreError: 7385:0:(obd_mount_server.c:135:server_deregister_mount()) lustre2-OST0042 not registered&lt;br/&gt;
Mar  3 08:50:57 scoss1501 kernel: Lustre: server umount lustre2-OST0042 complete&lt;br/&gt;
Mar  3 08:50:57 scoss1501 kernel: LustreError: 7385:0:(obd_mount.c:1325:lustre_fill_super()) Unable to mount  (-5)&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="38169">LU-8397</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzvz2v:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9893</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10021"><![CDATA[2]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>