[LU-3829] MDT mount fails if mkfs.lustre is run with multiple mgsnode arguments on MDSs where MGS is not running Created: 23/Aug/13  Updated: 26/Jan/17  Resolved: 02/Oct/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0, Lustre 2.5.0
Fix Version/s: Lustre 2.5.0, Lustre 2.4.2

Type: Bug Priority: Critical
Reporter: Kalpak Shah (Inactive) Assignee: Zhenyu Xu
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-8397 take comma as separator of mgsnode's ... Resolved
Severity: 2
Rank (Obsolete): 9893

 Description   

If multiple --mgsnode arguments are provided to mkfs.lustre while formatting an MDT, then mounting that MDT fails on any MDS where the MGS is not running.

Reproduction Steps:
Step 1) On MDS0, run the following script:
mgs_dev='/dev/mapper/vg_v-mgs'
mds0_dev='/dev/mapper/vg_v-mdt'

mgs_pri_nid='10.10.11.210@tcp1'
mgs_sec_nid='10.10.11.211@tcp1'

mkfs.lustre --mgs --reformat $mgs_dev
mkfs.lustre --mgsnode=$mgs_pri_nid --mgsnode=$mgs_sec_nid --failnode=$mgs_sec_nid --reformat --fsname=v --mdt --index=0 $mds0_dev

mount -t lustre $mgs_dev /lustre/mgs/
mount -t lustre $mds0_dev /lustre/v/mdt

So the MGS and MDT0 will be mounted on MDS0.
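A quick sanity check on MDS0 at this point (a sketch using standard Lustre utilities; exact output varies by version):
lctl dl            # should list the MGS, MGC and v-MDT0000 devices in the UP state
mount -t lustre    # should show both targets mounted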

Step 2.1) On MDS1:
mdt1_dev='/dev/mapper/vg_mdt1_v-mdt1'
mdt2_dev='/dev/mapper/vg_mdt2_v-mdt2'

mgs_pri_nid='10.10.11.210@tcp1'
mgs_sec_nid='10.10.11.211@tcp1'

mkfs.lustre --mgsnode=$mgs_pri_nid --mgsnode=$mgs_sec_nid --failnode=$mgs_pri_nid --reformat --fsname=v --mdt --index=1 $mdt1_dev # The resulting MDT fails to mount (see below).

mount -t lustre $mdt1_dev /lustre/v/mdt1

The mount of MDT1 will fail with the following error:
mount.lustre: mount /dev/mapper/vg_mdt1_v-mdt1 at /lustre/v/mdt1 failed: Input/output error
Is the MGS running?

These are messages from Lustre logs while trying to mount MDT1:
LDISKFS-fs (dm-20): mounted filesystem with ordered data mode. quota=on. Opts:
LDISKFS-fs (dm-20): mounted filesystem with ordered data mode. quota=on. Opts:
LDISKFS-fs (dm-20): mounted filesystem with ordered data mode. quota=on. Opts:
Lustre: 7564:0:(client.c:1896:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1377197751/real 1377197751] req@ffff880027956c00 x1444089351391184/t0(0) o250->MGC10.10.11.210@tcp1@0@lo:26/25 lens 400/544 e 0 to 1 dl 1377197756 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
LustreError: 8059:0:(client.c:1080:ptlrpc_import_delay_req()) @@@ send limit expired req@ffff880027956800 x1444089351391188/t0(0) o253->MGC10.10.11.210@tcp1@0@lo:26/25 lens 4768/4768 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
LustreError: 15f-b: v-MDT0001: cannot register this server with the MGS: rc = -5. Is the MGS running?
LustreError: 8059:0:(obd_mount_server.c:1732:server_fill_super()) Unable to start targets: -5
LustreError: 8059:0:(obd_mount_server.c:848:lustre_disconnect_lwp()) v-MDT0000-lwp-MDT0001: Can't end config log v-client.
LustreError: 8059:0:(obd_mount_server.c:1426:server_put_super()) v-MDT0001: failed to disconnect lwp. (rc=-2)
LustreError: 8059:0:(obd_mount_server.c:1456:server_put_super()) no obd v-MDT0001
LustreError: 8059:0:(obd_mount_server.c:137:server_deregister_mount()) v-MDT0001 not registered
Lustre: server umount v-MDT0001 complete
LustreError: 8059:0:(obd_mount.c:1277:lustre_fill_super()) Unable to mount (-5)
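Since the error only says "Is the MGS running?", it can be worth confirming that plain LNet connectivity from MDS1 to the MGS NID is fine before digging further (a sketch; these are standard lctl subcommands):
lctl list_nids                  # local NIDs on MDS1, expected to include 10.10.11.211@tcp1
lctl ping 10.10.11.210@tcp1     # the MGS node should answer over LNet even though the mount fails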

Step 2.2) On MDS1:
mdt1_dev='/dev/mapper/vg_mdt1_v-mdt1'
mdt2_dev='/dev/mapper/vg_mdt2_v-mdt2'

mgs_pri_nid='10.10.11.210@tcp1'
mgs_sec_nid='10.10.11.211@tcp1'

mkfs.lustre --mgsnode=$mgs_pri_nid --failnode=$mgs_pri_nid --reformat --fsname=v --mdt --index=1 $mdt1_dev

mount -t lustre $mdt1_dev /lustre/v/mdt1

With this, MDT1 mounts successfully. The only difference is that the second "--mgsnode" is not provided to mkfs.lustre.

Step 3) On MDS1 again:
mkfs.lustre --mgsnode=$mgs_pri_nid --mgsnode=$mgs_sec_nid --failnode=$mgs_pri_nid --reformat --fsname=v --mdt --index=2 $mdt2_dev
mount -t lustre $mdt2_dev /lustre/v/mdt2

Once MDT1 is mounted, using a second "--mgsnode" option works without any errors and the mount of MDT2 succeeds.

Lustre Versions: Reproducible on 2.4.0 and 2.4.91.

Conclusion: Due to this bug, MDTs do not mount on MDSs that are not running the MGS. With the workaround (a single --mgsnode), HA will not be properly configured.
Also note that this issue is not related to DNE. The same issue and "workaround" apply to an MDT of a different filesystem on MDS1 as well.



 Comments   
Comment by Swapnil Pimpale (Inactive) [ 23/Aug/13 ]

In the above case, while mounting an MDT on MDS1, one of the mgsnode NIDs is MDS1 itself.

It looks like ptlrpc_uuid_to_peer() calculates the distance to each NID using LNetDist() and chooses the one with the least distance (which in this case turns out to be MDS1 itself, which does not have a running MGS).
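This choice can also be seen from the command line (a sketch; lctl which_nid asks LNet which NID from a list it would use, i.e. the same LNetDist-based decision):
lctl which_nid 10.10.11.210@tcp1 10.10.11.211@tcp1
# on MDS1 this is expected to print 10.10.11.211@tcp1 (the local node), matching the bad MGC peer choice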

Removing MDS1 from mgsnode and adding a different node worked for me.

Comment by Peter Jones [ 28/Aug/13 ]

Bobijam

Could you please comment on this one?

Thanks

Peter

Comment by Zhenyu Xu [ 29/Aug/13 ]

In Step 2.1:

These are messages from Lustre logs while trying to mount MDT1:
...
Lustre: 7564:0:(client.c:1896:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1377197751/real 1377197751] req@ffff880027956c00 x1444089351391184/t0(0) o250->MGC10.10.11.210@tcp1@0@lo:26/25 lens 400/544 e 0 to 1 dl 1377197756 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
LustreError: 8059:0:(client.c:1080:ptlrpc_import_delay_req()) @@@ send limit expired req@ffff880027956800 x1444089351391188/t0(0) o253->MGC10.10.11.210@tcp1@0@lo:26/25 lens 4768/4768 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
LustreError: 15f-b: v-MDT0001: cannot register this server with the MGS: rc = -5. Is the MGS running?

So this is MDS1 (which has 10.10.11.211) trying to connect to the MGS service on MDS0 (10.10.11.210), is that right?

Comment by Swapnil Pimpale (Inactive) [ 29/Aug/13 ]

Yes. But as you can see, while trying to connect it has incorrectly calculated the peer to be 0@lo.
If we give just a single mgsnode (10.10.11.210), we see the following message in the logs:

(import.c:480:import_select_connection()) MGC10.10.11.210@tcp: connect to NID 10.10.11.210@tcp last attempt 0

But if two mgsnodes are given (MDS0 and MDS1 in this case), we see the following message:

(import.c:480:import_select_connection()) MGC10.10.11.210@tcp: connect to NID 0@lo last attempt 0
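For anyone reproducing this, the import_select_connection() messages above are debug messages, so they may need the Lustre debug buffer enabled and dumped around the mount attempt (a sketch; debug=-1 enables all masks and is very verbose):
lctl set_param debug=-1
mount -t lustre $mdt1_dev /lustre/v/mdt1
lctl dk /tmp/lustre-debug.log
grep import_select_connection /tmp/lustre-debug.log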

Comment by Zhenyu Xu [ 29/Aug/13 ]

It's a complex configuration which, IIRC, the manual does not mention (a separate MGS and MDS while specifying multiple mgsnode options for the MDT device).

The problem is:

When you specify multiple mgsnode options for an MDT, it adds (remote_uuid, remote_nid) pairs; in this case MDS1 has (MGC10.10.11.210@tcp1, 10.10.11.210@tcp1) and (MGC10.10.11.210@tcp1, 10.10.11.211@tcp1).

When the MGC starts up for this MDT, it tries to find the correct peer NID (i.e. remote NID) for MGC10.10.11.210. It has two choices, but it always chooses 10.10.11.211@tcp1 as its peer NID (its source NID is always 10.10.11.211@tcp1 or 0@lo), because it considers remote NID 10.10.11.211@tcp1 to have the shortest distance (0). It then tries to connect to the MGS service on 10.10.11.211@tcp1, which cannot succeed.

When the import retries, it repeats the above procedure and eventually fails.
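The mgsnode pairs recorded on the target can be checked directly (a sketch; --dryrun prints the stored parameters without modifying the device):
tunefs.lustre --dryrun /dev/mapper/vg_mdt1_v-mdt1 | grep -i mgsnode
# expected to show both mgsnode=10.10.11.210@tcp1 and mgsnode=10.10.11.211@tcp1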

Comment by Ashley Pittman (Inactive) [ 29/Aug/13 ]

This is a normal configuration for running two filesystems on an active/active pair of MDS nodes; in addition, running DNE with a single filesystem and two MDTs on an active/active pair of MDS nodes is affected as well. Ideally we'd locate the MGS on an entirely separate node to allow IR, but that requires additional hardware and is a relatively expensive solution, so whilst doing that would avoid this problem it's not a universal solution.

Comment by Zhenyu Xu [ 30/Aug/13 ]

Would you please try this patch http://review.whamcloud.com/7509 ?

The idea is to change the import connection setup: instead of always setting only the closest connection as the import connection, it adds all possible connections, giving the import a chance to try other connections.

Comment by Ashley Pittman (Inactive) [ 30/Aug/13 ]

I've applied this patch to the 2.4.0 tag on b2_4, and initial checks confirm that MDTs are starting correctly now.

We are now seeing a different issue: I am unable to start a client on a node which is already running as an MDT/OSS:
[root@victrix-oss0 ~]# modprobe lustre
WARNING: Error inserting lov (/lib/modules/2.6.32-358.14.1.el6_lustre.es61.x86_64/updates/kernel/fs/lustre/lov.ko): Operation already in progress
FATAL: Error inserting lustre (/lib/modules/2.6.32-358.14.1.el6_lustre.es61.x86_64/updates/kernel/fs/lustre/lustre.ko): Operation already in progress

The log file is here:
Aug 30 06:21:46 victrix-oss0 kernel: : LustreError: 3339:0:(lprocfs_status.c:497:lprocfs_register()) entry 'osc' already registered
Aug 30 06:21:46 victrix-oss0 kernel: : LustreError: 165-2: Nothing registered for client mount! Is the 'lustre' module loaded?
Aug 30 06:21:46 victrix-oss0 kernel: : LustreError: 3341:0:(obd_mount.c:1297:lustre_fill_super()) Unable to mount (-19)

This doesn't look directly related to the changeset; however, we weren't seeing it before, and this changeset should be the only change to the build. I'll keep investigating this second problem; let me know if you want me to file it as a separate issue.

Comment by Kalpak Shah (Inactive) [ 30/Aug/13 ]

The failure to load the lov module definitely looks like an unrelated error.

Comment by Ashley Pittman (Inactive) [ 04/Sep/13 ]

We found an issue with our build system; whilst picking up this patch I also picked up some other changes, which is where the LOV issue came from.

This patch fixes the issue for us in all the cases we've found; would it be possible to have it in 2.4.1?

Comment by Zhenyu Xu [ 05/Sep/13 ]

Patch tracking for b2_4: http://review.whamcloud.com/7553

Comment by Kalpak Shah (Inactive) [ 19/Sep/13 ]

This is with 2.4.1 + patchsets from http://review.whamcloud.com/7553.

With patchset 2, we are seeing an issue whereby remote directory operations across MDSs fail. health_check on these MDSs reports something like:
[root@gemina-mds0 ~]# cat /proc/fs/lustre/health_check
device v-MDT0003-osp-MDT0000 reported unhealthy
device v-MDT0002-osp-MDT0001 reported unhealthy
device v-MDT0002-osp-MDT0000 reported unhealthy
NOT HEALTHY

And it remains in this unhealthy state. Clients are unable to perform remote directory operations at all.

With patchset 1, we do not see this issue.

Comment by James Beal [ 24/Sep/13 ]

Another data point: this is a pure Lustre 2.4.1 system with a filesystem that we have upgraded from Lustre 1.8. We have the MGS and MDT on separate nodes.

In summary, at least for this filesystem, the MDT has to be mounted at least once collocated with the MGS after a writeconf (including the upgrade from Lustre 1.8 to Lustre 2.4.1).

Initially the MDT was failing to mount with the following messages in the kernel log.

Sep 24 08:31:56 lus03-mds2 kernel: Lustre: Lustre: Build Version: v2_4_1_0--CHANGED-2.6.32-jb23-358.18.1.el6-lustre-2.4.1
Sep 24 08:31:56 lus03-mds2 kernel: LNet: Added LNI 172.17.128.134@tcp [8/256/0/180]
Sep 24 08:31:56 lus03-mds2 kernel: LNet: Accept secure, port 988
Sep 24 08:33:00 lus03-mds2 kernel: LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. quota=on. Opts:
Sep 24 08:33:11 lus03-mds2 kernel: LustreError: 14026:0:(client.c:1052:ptlrpc_import_delay_req()) @@@ send limit expired req@ffff8804a59b0800 x1447043180527624/t0(0) o253->MGC172.17.128.135@tcp@0@lo:26/25 lens 4768/4768 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
Sep 24 08:33:11 lus03-mds2 kernel: LustreError: 14026:0:(obd_mount_server.c:1124:server_register_target()) lus03-MDT0000: error registering with the MGS: rc = -5 (not fatal)
Sep 24 08:33:13 lus03-mds2 kernel: LustreError: 137-5: lus03-MDT0000_UUID: not available for connect from 172.17.97.43@tcp (no target)
Sep 24 08:33:18 lus03-mds2 kernel: LustreError: 14026:0:(client.c:1052:ptlrpc_import_delay_req()) @@@ send limit expired req@ffff880462faec00 x1447043180527628/t0(0) o101->MGC172.17.128.135@tcp@0@lo:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
Sep 24 08:33:23 lus03-mds2 kernel: LustreError: 137-5: lus03-MDT0000_UUID: not available for connect from 172.17.97.209@tcp (no target)
Sep 24 08:33:24 lus03-mds2 kernel: LustreError: 14026:0:(client.c:1052:ptlrpc_import_delay_req()) @@@ send limit expired req@ffff88049c84dc00 x1447043180527632/t0(0) o101->MGC172.17.128.135@tcp@0@lo:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
Sep 24 08:33:24 lus03-mds2 kernel: Lustre: 14113:0:(obd_config.c:1428:class_config_llog_handler()) For 1.8 interoperability, rename obd type from mds to mdt
Sep 24 08:33:24 lus03-mds2 kernel: Lustre: lus03-MDT0000: used disk, loading
Sep 24 08:33:24 lus03-mds2 kernel: LustreError: 14113:0:(sec_config.c:1115:sptlrpc_target_local_read_conf()) missing llog context
Sep 24 08:33:24 lus03-mds2 kernel: Lustre: 14113:0:(mdt_handler.c:4948:mdt_process_config()) For interoperability, skip this mdt.quota_type. It is obsolete.
Sep 24 08:33:24 lus03-mds2 kernel: Lustre: 13962:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1380008004/real 1380008004] req@ffff880498bc7000 x1447043180527640/t0(0) o8->lus03-OST0001-osc@172.17.128.130@tcp:28/4 lens 400/544 e 0 to 1 dl 1380008109 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Sep 24 08:33:24 lus03-mds2 kernel: Lustre: 13962:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1380008004/real 1380008004] req@ffff88044f451000 x1447043180527648/t0(0) o8->lus03-OST0003-osc@172.17.128.131@tcp:28/4 lens 400/544 e 0 to 1 dl 1380008109 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Sep 24 08:33:25 lus03-mds2 kernel: Lustre: 13962:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1380008005/real 1380008005] req@ffff88047389e000 x1447043180527668/t0(0) o8->lus03-OST0008-osc@172.17.128.130@tcp:28/4 lens 400/544 e 0 to 1 dl 1380008110 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Sep 24 08:33:25 lus03-mds2 kernel: Lustre: 13962:0:(client.c:1868:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
Sep 24 08:33:26 lus03-mds2 kernel: Lustre: 13962:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1380008006/real 1380008006] req@ffff88047389e000 x1447043180527704/t0(0) o8->lus03-OST0011-osc@172.17.128.132@tcp:28/4 lens 400/544 e 0 to 1 dl 1380008111 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Sep 24 08:33:26 lus03-mds2 kernel: Lustre: 13962:0:(client.c:1868:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
Sep 24 08:33:33 lus03-mds2 kernel: LustreError: 14026:0:(client.c:1052:ptlrpc_import_delay_req()) @@@ send limit expired req@ffff880498810400 x1447043180527756/t0(0) o101->MGC172.17.128.135@tcp@0@lo:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
Sep 24 08:33:38 lus03-mds2 kernel: LustreError: 14026:0:(client.c:1052:ptlrpc_import_delay_req()) @@@ send limit expired req@ffff880498810400 x1447043180527760/t0(0) o101->MGC172.17.128.135@tcp@0@lo:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
Sep 24 08:33:38 lus03-mds2 kernel: LustreError: 13a-8: Failed to get MGS log lus03-client and no local copy.
Sep 24 08:33:38 lus03-mds2 kernel: LustreError: 15c-8: MGC172.17.128.135@tcp: The configuration from log 'lus03-client' failed (-107). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
Sep 24 08:33:38 lus03-mds2 kernel: LustreError: 14026:0:(obd_mount_server.c:1275:server_start_targets()) lus03-MDT0000: failed to start LWP: -107
Sep 24 08:33:38 lus03-mds2 kernel: LustreError: 14026:0:(obd_mount_server.c:1700:server_fill_super()) Unable to start targets: -107
Sep 24 08:33:38 lus03-mds2 kernel: Lustre: Failing over lus03-MDT0000

I chatted to Sven at DDN, and in the first instance we tried to fix the "rename obd type from mds to mdt" message with a tunefs.lustre --writeconf setting "mdd.quota_type=ug"; this did not fix the issue (however, it is worth noting that I didn't use erase-config, so I ended up repeating the writeconf later). We then looked at LU-3829 and, as a test, mounted the MDT on the MGS node, and that succeeded. Out of curiosity I unmounted the MDT from the MGS node and tried to mount it on its native MDT node; this succeeded.

Sep 24 09:51:48 lus03-mds2 kernel: Lustre: Lustre: Build Version: v2_4_1_0--CHANGED-2.6.32-jb23-358.18.1.el6-lustre-2.4.1
Sep 24 09:51:48 lus03-mds2 kernel: LNet: Added LNI 172.17.128.134@tcp [8/256/0/180]
Sep 24 09:51:48 lus03-mds2 kernel: LNet: Accept secure, port 988
Sep 24 09:51:48 lus03-mds2 kernel: LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. quota=on. Opts:
Sep 24 09:52:00 lus03-mds2 kernel: LustreError: 21871:0:(client.c:1052:ptlrpc_import_delay_req()) @@@ send limit expired req@ffff8804b67e9c00 x1447048205303816/t0(0) o253->MGC172.17.128.135@tcp@0@lo:26/25 lens 4768/4768 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
Sep 24 09:52:00 lus03-mds2 kernel: LustreError: 21871:0:(obd_mount_server.c:1124:server_register_target()) lus03-MDT0000: error registering with the MGS: rc = -5 (not fatal)
Sep 24 09:52:06 lus03-mds2 kernel: LustreError: 21871:0:(client.c:1052:ptlrpc_import_delay_req()) @@@ send limit expired req@ffff88046f6e2c00 x1447048205303820/t0(0) o101->MGC172.17.128.135@tcp@0@lo:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
Sep 24 09:52:12 lus03-mds2 kernel: LustreError: 21871:0:(client.c:1052:ptlrpc_import_delay_req()) @@@ send limit expired req@ffff88046f6e2c00 x1447048205303824/t0(0) o101->MGC172.17.128.135@tcp@0@lo:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
Sep 24 09:52:12 lus03-mds2 kernel: Lustre: 21992:0:(obd_config.c:1428:class_config_llog_handler()) For 1.8 interoperability, rename obd type from mds to mdt
Sep 24 09:52:12 lus03-mds2 kernel: Lustre: lus03-MDT0000: used disk, loading
Sep 24 09:52:12 lus03-mds2 kernel: LustreError: 21992:0:(sec_config.c:1115:sptlrpc_target_local_read_conf()) missing llog context
Sep 24 09:52:12 lus03-mds2 kernel: Lustre: 21992:0:(mdt_handler.c:4948:mdt_process_config()) For interoperability, skip this mdt.quota_type. It is obsolete.
Sep 24 09:52:12 lus03-mds2 kernel: Lustre: 21906:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1380012732/real 1380012732] req@ffff88049dba5400 x1447048205303832/t0(0) o8->lus03-OST0001-osc@172.17.128.130@tcp:28/4 lens 400/544 e 0 to 1 dl 1380012837 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Sep 24 09:52:12 lus03-mds2 kernel: Lustre: 21906:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1380012732/real 1380012732] req@ffff8804a56a7400 x1447048205303840/t0(0) o8->lus03-OST0003-osc@172.17.128.131@tcp:28/4 lens 400/544 e 0 to 1 dl 1380012837 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Sep 24 09:52:18 lus03-mds2 kernel: LustreError: 21871:0:(client.c:1052:ptlrpc_import_delay_req()) @@@ send limit expired req@ffff880486069400 x1447048205303948/t0(0) o101->MGC172.17.128.135@tcp@0@lo:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
Sep 24 09:52:24 lus03-mds2 kernel: LustreError: 21871:0:(client.c:1052:ptlrpc_import_delay_req()) @@@ send limit expired req@ffff880486069400 x1447048205303952/t0(0) o101->MGC172.17.128.135@tcp@0@lo:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
Sep 24 09:52:24 lus03-mds2 kernel: LustreError: 11-0: lus03-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation mds_connect failed with -11.
Sep 24 09:52:35 lus03-mds2 kernel: LustreError: 21871:0:(client.c:1052:ptlrpc_import_delay_req()) @@@ send limit expired req@ffff880486069400 x1447048205303960/t0(0) o253->MGC172.17.128.135@tcp@0@lo:26/25 lens 4768/4768 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1

We noted that the writeconf had appended the params, so I repeated the process with erase-conf and tried to remount the MDT on its native node, which failed. I remounted the MDT collocated with the MGS and it succeeded; after this I unmounted the MDT and it mounted successfully on the MDT's native node.
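For reference, the sequence that ended up working here boils down to the following (a sketch; the device path and mount point are illustrative, not the actual ones used):
# on the MGS node, immediately after the writeconf:
mount -t lustre /dev/<mdt_device> /lustre/lus03/mdt    # first mount, collocated with the MGS
umount /lustre/lus03/mdt
# then on the MDT's normal MDS node:
mount -t lustre /dev/<mdt_device> /lustre/lus03/mdt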

Comment by Zhenyu Xu [ 26/Sep/13 ]

Hi Kalpak,

Can you collect the Lustre debug log of the remote directory operation failure? Thanks.

Comment by Peter Jones [ 02/Oct/13 ]

Landed for 2.5.0. Should also land to b2_4 for 2.4.2.

Comment by Kurt J. Strosahl (Inactive) [ 03/Mar/16 ]

I'm experiencing the same bug in 2.5.3 when trying to attach an OST while the MGS is on its failover node.

Mar 3 08:50:45 scoss1501 kernel: Lustre: Lustre: Build Version: 2.5.3-RC1--PRISTINE-2.6.32-431.23.3.el6_lustre.x86_64
Mar 3 08:50:51 scoss1501 kernel: Lustre: 7401:0:(client.c:1918:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1457013046/real 1457013046] req@ffff88202ba58c00 x1527788910673924/t0(0) o250->MGC<head A>@o2ib@<head A>@o2ib:26/25 lens 400/544 e 0 to 1 dl 1457013051 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Mar 3 08:50:57 scoss1501 kernel: LustreError: 7385:0:(client.c:1083:ptlrpc_import_delay_req()) @@@ send limit expired req@ffff88202ba58800 x1527788910673928/t0(0) o253->MGC<head A>@o2ib@<head A>@o2ib:26/25 lens 4768/4768 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
Mar 3 08:50:57 scoss1501 kernel: LustreError: 15f-b: lustre2-OST0042: cannot register this server with the MGS: rc = -5. Is the MGS running?
Mar 3 08:50:57 scoss1501 kernel: LustreError: 7385:0:(obd_mount_server.c:1723:server_fill_super()) Unable to start targets: -5
Mar 3 08:50:57 scoss1501 kernel: LustreError: 7385:0:(obd_mount_server.c:845:lustre_disconnect_lwp()) lustre2-MDT0000-lwp-OST0042: Can't end config log lustre2-client.
Mar 3 08:50:57 scoss1501 kernel: LustreError: 7385:0:(obd_mount_server.c:1420:server_put_super()) lustre2-OST0042: failed to disconnect lwp. (rc=-2)
Mar 3 08:50:57 scoss1501 kernel: LustreError: 7385:0:(obd_mount_server.c:1450:server_put_super()) no obd lustre2-OST0042
Mar 3 08:50:57 scoss1501 kernel: LustreError: 7385:0:(obd_mount_server.c:135:server_deregister_mount()) lustre2-OST0042 not registered
Mar 3 08:50:57 scoss1501 kernel: Lustre: server umount lustre2-OST0042 complete
Mar 3 08:50:57 scoss1501 kernel: LustreError: 7385:0:(obd_mount.c:1325:lustre_fill_super()) Unable to mount (-5)
