Details
Bug
Resolution: Unresolved
Minor
Description
I am attempting to use lctl replace_nids, per the Lustre manual, to change NIDs for OSDs. The intention is to convert servers that were formatted without servicenode entries, so that the targets can be configured for failover.
The documentation is ambiguous, and my attempts to use this command fail: every lctl conf_param call made after running lctl replace_nids returns an error. I have tried several experiments, without success. The following is an outline of the process followed, which covers several variations.
If the replace_nids command is not suitable for this exercise, then the documentation should clarify the use cases for which it is suitable.
A very simple test was also attempted, whereby the MDS NID was changed from 10.10.2.12@tcp0 to 10.10.2.14@tcp0. The result is the same (see last test case).
Note: "failure" here means that Lustre parameters can no longer be altered (in the example, changing the quota settings fails). The file system still mounts and can be used by a client.
I'd like to have the documentation clarified with the exact syntax and process, as well as use cases for the lctl replace_nids command, in case there is something I have missed.
Format MGS, MDT0000, OST0000, OST0001 as ldiskfs, no failover:
mds1: mkfs.lustre --mgs /dev/sda
mds2: mkfs.lustre --mdt --index 0 --mgsnode 10.10.2.11@tcp0 --fsname demo /dev/sdb
oss1: mkfs.lustre --ost --index 0 --fsname demo --mgsnode 10.10.2.11@tcp0 /dev/sda
oss2: mkfs.lustre --ost --index 1 --fsname demo --mgsnode 10.10.2.11@tcp0 /dev/sdb
Mount on client and confirm FS is operating correctly (create files, check stripes, check lfs df).
Use a simple check that parameters can be set persistently:
lctl conf_param demo.quota.ost=ug
...
lctl conf_param demo.quota.ost=none
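Since this check is repeated after every variation, it can be wrapped in a small helper. The following is only a sketch: the function name check_conf_param and the LCTL override variable are hypothetical, it sets the two values back to back rather than with other steps in between, and it assumes that lctl exits non-zero when conf_param is rejected.

```shell
#!/bin/bash
# LCTL can be overridden for a dry run; it defaults to the real lctl binary.
LCTL=${LCTL:-lctl}

# check_conf_param FSNAME: set quota.ost to "ug" and then to "none",
# reporting whether both persistent updates were accepted.
# Assumption: lctl returns a non-zero exit status when conf_param fails.
check_conf_param() {
    local fsname=$1 value
    for value in ug none; do
        if ! "$LCTL" conf_param "$fsname.quota.ost=$value"; then
            echo "FAIL: $fsname.quota.ost=$value"
            return 1
        fi
    done
    echo "PASS: conf_param accepted on $fsname"
}
```

On the MGS this would be called as check_conf_param demo.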
Umount client, MDT0000, OST0000, OST0001. MGS remains online.
Run tunefs.lustre on MDT0000, adding servicenodes:
tunefs.lustre --erase-params \
  --servicenode 10.10.2.12@tcp0:10.10.2.11@tcp0 \
  --mgsnode 10.10.2.11@tcp0 --mgsnode 10.10.2.12@tcp0 \
  /dev/sdb
On MGS, run lctl replace_nids:
lctl replace_nids demo-MDT0000 10.10.2.12@tcp0:10.10.2.11@tcp0
Remount MDT0000
MGS syslog contains:
Dec 9 00:13:45 rh7z-mds1 kernel: Lustre: Found index 0 for demo-MDT0000, updating log
Remount OST0000, OST0001, client in sequence.
Verify FS is online, files still accessible on client.
Re-run a simple check that parameters can be set persistently:
lctl conf_param demo.quota.ost=ug
Returns:
error: conf_param: File exists
MGS syslog reports errors:
Dec 9 00:14:56 rh7z-mds1 kernel: LustreError: 4879:0:(llog.c:336:llog_init_handle()) MGS: llog uuid mismatch: config_uuid/
Dec 9 00:14:56 rh7z-mds1 kernel: LustreError: 4879:0:(mgs_llog.c:1446:record_start_log()) MGS: can't start log demo-MDT0000.bak: rc = -17
Dec 9 00:14:56 rh7z-mds1 kernel: LustreError: 4879:0:(mgs_llog.c:1543:mgs_write_log_direct_all()) MGS: writing log demo-MDT0000.bak: rc = -17
Dec 9 00:14:56 rh7z-mds1 kernel: LustreError: 4879:0:(mgs_llog.c:3626:mgs_write_log_param()) err -17 on param 'quota.ost=none'
Dec 9 00:14:56 rh7z-mds1 kernel: LustreError: 4879:0:(mgs_handler.c:993:mgs_iocontrol()) MGS: setparam err: rc = -17
Dec 9 00:14:56 rh7z-mds1 kernel: LustreError: 4879:0:(mgs_handler.c:993:mgs_iocontrol()) Skipped 1 previous similar message
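The rc = -17 in these messages is a negated Linux errno value; errno 17 is EEXIST, whose standard message is "File exists", which matches what lctl conf_param prints. A minimal sketch for pulling the reporting function and return code out of such lines follows. The names summarize_rc and errno_name are hypothetical, the pattern assumes only the LustreError layout shown above, and the errno table is deliberately tiny.

```shell
#!/bin/bash
# summarize_rc: read kernel log lines on stdin and print "<function> <rc>"
# for each line of the form "(file.c:line:function()) ... rc = <number>".
# Lines without a trailing "rc = N" are ignored.
summarize_rc() {
    sed -n -E 's/.*\([a-z_]+\.c:[0-9]+:([a-z_0-9]+)\(\)\).*rc = (-?[0-9]+)$/\1 \2/p'
}

# errno_name: map a (possibly negated) Linux errno number to its symbolic
# name. Only a few common values are listed.
errno_name() {
    case "${1#-}" in
        2)  echo ENOENT ;;
        17) echo EEXIST ;;
        22) echo EINVAL ;;
        *)  echo "errno-${1#-}" ;;
    esac
}
```

For example, piping the MGS syslog through summarize_rc lists which functions returned -17, and errno_name -17 prints EEXIST.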
Umount client, MDT0000, OST0000, OST0001. MGS remains online.
Run tunefs.lustre on MDT0000, adding servicenodes:
tunefs.lustre --erase-params \
  --servicenode 10.10.2.12@tcp0:10.10.2.11@tcp0 \
  --mgsnode 10.10.2.11@tcp0 --mgsnode 10.10.2.12@tcp0 \
  /dev/sdb
On MGS, run lctl replace_nids, using comma separator instead of colon, following Lustre manual explicitly:
lctl replace_nids demo-MDT0000 10.10.2.12@tcp0,10.10.2.11@tcp0
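The only difference between this attempt and the previous one is the NID separator. To rule out typing mistakes when switching between the two forms, the list can be generated rather than edited by hand. This is a sketch only: to_comma_nids is a hypothetical helper, its format check is deliberately loose, and it assumes IPv4-style NIDs that contain no colons of their own.

```shell
#!/bin/bash
# to_comma_nids: rewrite "nid1:nid2" as "nid1,nid2" and reject entries that
# do not look like <address>@<network>.
# Assumption: NIDs contain no ':' themselves (i.e. no IPv6 addresses).
to_comma_nids() {
    local list=${1//:/,} nid
    local IFS=,
    for nid in $list; do
        case "$nid" in
            ?*@?*) ;;                                     # looks like addr@net
            *) echo "not a NID: $nid" >&2; return 1 ;;
        esac
    done
    printf '%s\n' "$list"
}
```

With this, to_comma_nids 10.10.2.12@tcp0:10.10.2.11@tcp0 prints the comma-separated argument used in the command above.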
Remount MDT0000
MGS syslog contains:
Dec 9 00:33:00 rh7z-mds1 kernel: Lustre: Found index 0 for demo-MDT0000, updating log
Remount MDT0000, OST0000, OST0001, client
Re-run a simple check that parameters can be set persistently:
[root@rh7z-mds1 ~]# lctl conf_param demo.quota.ost=none
error: conf_param: File exists
MGS syslog reports errors:
Dec 9 00:33:52 rh7z-mds1 kernel: LustreError: 4969:0:(llog.c:336:llog_init_handle()) MGS: llog uuid mismatch: config_uuid/
Dec 9 00:33:52 rh7z-mds1 kernel: LustreError: 4969:0:(mgs_llog.c:1446:record_start_log()) MGS: can't start log demo-MDT0000.bak: rc = -17
Dec 9 00:33:52 rh7z-mds1 kernel: LustreError: 4969:0:(mgs_llog.c:1543:mgs_write_log_direct_all()) MGS: writing log demo-MDT0000.bak: rc = -17
Dec 9 00:33:52 rh7z-mds1 kernel: LustreError: 4969:0:(mgs_llog.c:3626:mgs_write_log_param()) err -17 on param 'quota.ost=none'
Dec 9 00:33:52 rh7z-mds1 kernel: LustreError: 4969:0:(mgs_handler.c:993:mgs_iocontrol()) MGS: setparam err: rc = -17
Umount client, MDT0000, OST0000, OST0001. MGS remains online.
Run tunefs.lustre on MDT0000 with a single servicenode NID and a single mgsnode:
tunefs.lustre --erase-params --servicenode 10.10.2.12@tcp0 --mgsnode 10.10.2.11@tcp0 /dev/sdb
On MGS, run lctl replace_nids:
[root@rh7z-mds1 ~]# lctl replace_nids demo-MDT0000 10.10.2.12@tcp0
Remount MDT0000, OST0000, OST0001, client
MGS syslog contains:
Dec 9 00:38:31 rh7z-mds1 kernel: Lustre: Found index 0 for demo-MDT0000, updating log
Try to change the quota setting again; MGS reports the same error.
Umount client, MDT0000, OST0000, OST0001. MGS remains online.
Run tunefs.lustre on MDT0000 with the equivalent of the original settings:
tunefs.lustre --erase-params --mgsnode 10.10.2.11@tcp0 /dev/sdb
Remount MDT0000, OST0000, OST0001, client
Re-run a simple check that parameters can be set persistently:
[root@rh7z-mds1 ~]# lctl conf_param demo.quota.ost=none
error: conf_param: File exists
MGS syslog reports the same error.
Umount all, reformat all targets to create new FS.
Mount MGT, MDT0000, OST0000, OST0001
Verify that client can mount the FS.
Run quota test as before.
Umount client, MDT0000, OST0000, OST0001
Remove kernel modules on MDT0000 host.
Change IPv4 address from 10.10.2.12 to 10.10.2.14, reload lnet module and verify that new NID is applied.
On MGS, run:
lctl replace_nids demo-MDT0000 10.10.2.14@tcp0
Remount MDT0000, OST0000, OST0001, client
Verify that FS is usable.
Re-run quota conf_param test. Fails as before.