[LU-8948] Cannot change conf_param settings after changing the NID of a Lustre OSD using lctl replace_nids Created: 09/Dec/16 Updated: 16/Jul/19 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Malcolm Cowe (Inactive) | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
I am attempting to use lctl replace_nids, per the Lustre manual, to change the NIDs of OSDs. The intention is to convert servers that were formatted without servicenode entries so that the targets can be configured for failover. The documentation is ambiguous, and my attempts fail: once lctl replace_nids has been run, every subsequent lctl conf_param command fails. I have tried several experiments without success. The outline below covers the process followed, including several variations. If replace_nids is not suitable for this exercise, the documentation should clarify the use cases for which it is suitable.

A very simple test was also attempted, in which the MDS NID was changed from 10.10.2.12@tcp0 to 10.10.2.14@tcp0. The result is the same (see the last test case).

Note: failure here means an inability to alter Lustre parameters (in the example, changing the quota settings fails). The file system does mount and can be used by a client. I'd like the documentation to be clarified with the exact syntax and process for lctl replace_nids, as well as its use cases, in case there is something I have missed.
Format MGS, MDT0000, OST0000, OST0001 as ldiskfs, no failover:
    mds1: mkfs.lustre --mgs /dev/sda
    mds2: mkfs.lustre --mdt --index 0 --mgsnode 10.10.2.11@tcp0 --fsname demo /dev/sdb
    oss1: mkfs.lustre --ost --index 0 --fsname demo --mgsnode 10.10.2.11@tcp0 /dev/sda
    oss2: mkfs.lustre --ost --index 1 --fsname demo --mgsnode 10.10.2.11@tcp0 /dev/sdb
Mount on client and confirm FS is operating correctly (create files, check stripes, check lfs df).
Use a simple check that parameters can be set persistently:
    lctl conf_param demo.quota.ost=ug
    ...
    lctl conf_param demo.quota.ost=none
Umount client, MDT0000, OST0000, OST0001. MGS remains online.
Run tunefs.lustre on MDT0000, adding servicenodes:
    tunefs.lustre --erase-params \
        --servicenode 10.10.2.12@tcp0:10.10.2.11@tcp0 \
        --mgsnode 10.10.2.11@tcp0 --mgsnode 10.10.2.12@tcp0 \
        /dev/sdb
On MGS, run lctl replace_nids:
    lctl replace_nids demo-MDT0000 10.10.2.12@tcp0:10.10.2.11@tcp0
Remount MDT0000.
MGS syslog contains:
Dec 9 00:13:45 rh7z-mds1 kernel: Lustre: Found index 0 for demo-MDT0000, updating log
Remount OST0000, OST0001, client in sequence.
Verify FS is online, files still accessible on client.
Re-run a simple check that parameters can be set persistently:
    lctl conf_param demo.quota.ost=ug
Returns:
    error: conf_param: File exists
MGS syslog reports errors:
Dec 9 00:14:56 rh7z-mds1 kernel: LustreError: 4879:0:(llog.c:336:llog_init_handle()) MGS: llog uuid mismatch: config_uuid/
Dec 9 00:14:56 rh7z-mds1 kernel: LustreError: 4879:0:(mgs_llog.c:1446:record_start_log()) MGS: can't start log demo-MDT0000.bak: rc = -17
Dec 9 00:14:56 rh7z-mds1 kernel: LustreError: 4879:0:(mgs_llog.c:1543:mgs_write_log_direct_all()) MGS: writing log demo-MDT0000.bak: rc = -17
Dec 9 00:14:56 rh7z-mds1 kernel: LustreError: 4879:0:(mgs_llog.c:3626:mgs_write_log_param()) err -17 on param 'quota.ost=none'
Dec 9 00:14:56 rh7z-mds1 kernel: LustreError: 4879:0:(mgs_handler.c:993:mgs_iocontrol()) MGS: setparam err: rc = -17
Dec 9 00:14:56 rh7z-mds1 kernel: LustreError: 4879:0:(mgs_handler.c:993:mgs_iocontrol()) Skipped 1 previous similar message
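The rc = -17 in the syslog lines above is the negated Linux errno EEXIST, whose message is "File exists", which is the text lctl conf_param printed. The following is a minimal sketch for decoding such return codes from log lines. The helper name decode_rc and the regular expression are illustrative assumptions, not part of any Lustre tooling.

```python
import errno
import os
import re

# Matches the numeric code after "rc = " or "err " in the MGS syslog lines quoted above.
RC_PATTERN = re.compile(r"(?:rc = |err )(-?\d+)")


def decode_rc(line):
    """Return (code, errno name, message) for the first return code in a log line, or None."""
    match = RC_PATTERN.search(line)
    if match is None:
        return None
    code = int(match.group(1))
    # Kernel code reports errors as negated errno values, so look up the absolute value.
    name = errno.errorcode.get(abs(code), "UNKNOWN")
    return code, name, os.strerror(abs(code))


sample = ("LustreError: 4879:0:(mgs_llog.c:1446:record_start_log()) "
          "MGS: can't start log demo-MDT0000.bak: rc = -17")
print(decode_rc(sample))
```

The message string comes from the local C library, so it may be worded differently on non-Linux systems.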
Umount client, MDT0000, OST0000, OST0001. MGS remains online.
Run tunefs.lustre on MDT0000, adding servicenodes:
    tunefs.lustre --erase-params \
        --servicenode 10.10.2.12@tcp0:10.10.2.11@tcp0 \
        --mgsnode 10.10.2.11@tcp0 --mgsnode 10.10.2.12@tcp0 \
        /dev/sdb
On MGS, run lctl replace_nids, using comma separator instead of colon, following Lustre manual explicitly:
    lctl replace_nids demo-MDT0000 10.10.2.12@tcp0,10.10.2.11@tcp0
Remount MDT0000.
MGS syslog contains:
Dec 9 00:33:00 rh7z-mds1 kernel: Lustre: Found index 0 for demo-MDT0000, updating log
Remount MDT0000, OST0000, OST0001, client.
Re-run a simple check that parameters can be set persistently:
    [root@rh7z-mds1 ~]# lctl conf_param demo.quota.ost=none
    error: conf_param: File exists
MGS syslog reports errors:
Dec 9 00:33:52 rh7z-mds1 kernel: LustreError: 4969:0:(llog.c:336:llog_init_handle()) MGS: llog uuid mismatch: config_uuid/
Dec 9 00:33:52 rh7z-mds1 kernel: LustreError: 4969:0:(mgs_llog.c:1446:record_start_log()) MGS: can't start log demo-MDT0000.bak: rc = -17
Dec 9 00:33:52 rh7z-mds1 kernel: LustreError: 4969:0:(mgs_llog.c:1543:mgs_write_log_direct_all()) MGS: writing log demo-MDT0000.bak: rc = -17
Dec 9 00:33:52 rh7z-mds1 kernel: LustreError: 4969:0:(mgs_llog.c:3626:mgs_write_log_param()) err -17 on param 'quota.ost=none'
Dec 9 00:33:52 rh7z-mds1 kernel: LustreError: 4969:0:(mgs_handler.c:993:mgs_iocontrol()) MGS: setparam err: rc = -17
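The two replace_nids attempts above differ only in the separator between NIDs (colon versus comma), and both end in the same error. As a rough illustration that both spellings name the same pair of NIDs, here is a small sketch that splits either form into (address, network) pairs. The split_nids helper is hypothetical and says nothing about which separator lctl itself accepts. It also assumes IPv4-style addresses, since a colon split would break IPv6 NIDs.

```python
import re


def split_nids(nid_list):
    """Split a NID list on ':' or ',' and return a list of (address, network) tuples."""
    nids = []
    for item in re.split(r"[:,]", nid_list):
        item = item.strip()
        if not item:
            continue
        # A NID has the form address@network, for example 10.10.2.12@tcp0.
        address, sep, network = item.partition("@")
        if not sep or not address or not network:
            raise ValueError(f"not a NID of the form address@network: {item!r}")
        nids.append((address, network))
    return nids


colon_form = split_nids("10.10.2.12@tcp0:10.10.2.11@tcp0")
comma_form = split_nids("10.10.2.12@tcp0,10.10.2.11@tcp0")
print(colon_form == comma_form)
```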
Umount client, MDT0000, OST0000, OST0001. MGS remains online.
Run tunefs.lustre on MDT0000 with a single servicenode NID and a single mgsnode:
    tunefs.lustre --erase-params --servicenode 10.10.2.12@tcp0 --mgsnode 10.10.2.11@tcp0 /dev/sdb
On MGS, run lctl replace_nids:
    [root@rh7z-mds1 ~]# lctl replace_nids demo-MDT0000 10.10.2.12@tcp0
Remount MDT0000, OST0000, OST0001, client.
MGS syslog contains:
Dec 9 00:38:31 rh7z-mds1 kernel: Lustre: Found index 0 for demo-MDT0000, updating log
Try to change quota setting again, MGS reports the same error.

Umount client, MDT0000, OST0000, OST0001. MGS remains online.
Run tunefs.lustre on MDT0000 with the equivalent of the original settings:
    tunefs.lustre --erase-params --mgsnode 10.10.2.11@tcp0 /dev/sdb
Remount MDT0000, OST0000, OST0001, client.
Re-run a simple check that parameters can be set persistently:
    [root@rh7z-mds1 ~]# lctl conf_param demo.quota.ost=none
    error: conf_param: File exists
MGS syslog reports same error.

Umount all, reformat all targets to create new FS.
Mount MGT, MDT0000, OST0000, OST0001.
Verify that client can mount the FS.
Run quota test as before.
Umount client, MDT0000, OST0000, OST0001.
Remove kernel modules on MDT0000 host.
Change IPv4 address from 10.10.2.12 to 10.10.2.14, reload lnet module and verify that new NID is applied.
On MGT, run:
    lctl replace_nids demo-MDT0000 10.10.2.14@tcp0
Remount MDT0000, OST0000, OST0001, client.
Verify that FS is usable.
Re-run quota conf_param test. Fails as before.
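The tunefs.lustre invocations in the test cases above differ only in their servicenode and mgsnode lists. To make the variations easier to compare, here is a minimal sketch that builds each command line from those lists. The tunefs_command function is a hypothetical helper that only assembles the arguments and does not run anything. The flags are the ones used in this ticket (--erase-params, --servicenode, --mgsnode).

```python
import shlex


def tunefs_command(device, mgsnodes, servicenodes=()):
    """Build a tunefs.lustre argument list in the shape used in the test cases above."""
    argv = ["tunefs.lustre", "--erase-params"]
    for nid in servicenodes:
        argv += ["--servicenode", nid]
    for nid in mgsnodes:
        argv += ["--mgsnode", nid]
    argv.append(device)
    return argv


variants = {
    # Failover pair, as in the first two test cases.
    "failover": tunefs_command(
        "/dev/sdb",
        mgsnodes=["10.10.2.11@tcp0", "10.10.2.12@tcp0"],
        servicenodes=["10.10.2.12@tcp0:10.10.2.11@tcp0"],
    ),
    # Single servicenode NID and single mgsnode.
    "single": tunefs_command(
        "/dev/sdb",
        mgsnodes=["10.10.2.11@tcp0"],
        servicenodes=["10.10.2.12@tcp0"],
    ),
    # Equivalent of the original settings: no servicenode at all.
    "original": tunefs_command("/dev/sdb", mgsnodes=["10.10.2.11@tcp0"]),
}

for label, argv in variants.items():
    print(f"{label}: {shlex.join(argv)}")
```

Printing the commands rather than executing them keeps the sketch safe to run on a machine without Lustre installed.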
|
| Comments |
| Comment by Andreas Dilger [ 15/Jun/17 ] |
|
Hi Artem, could you please comment on this issue, since you wrote replace_nids originally. |
| Comment by Artem Blagodarenko (Inactive) [ 16/Jun/17 ] |
|
Strange, the .bak file is accessed during conf_param:
(mgs_llog.c:1543:mgs_write_log_direct_all()) MGS: writing log demo-MDT0000.bak: rc = -17
replace_nids creates a .bak file before changing configs, so there is probably some kind of conflict with conf_param. Investigating. |
| Comment by Artem Blagodarenko (Inactive) [ 16/Jun/17 ] |
|
Found the reason. |
| Comment by Andreas Dilger [ 17/Jun/17 ] |
|
I think I commented in a separate patch that we should not process all of the files in the CONFIGS directory, but instead only the files of the form fsname-<MDT,OST,client,sptlrpc,params> and other known names. This avoids all sorts of problems if other backup file names are used. |
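The filtering idea in the comment above can be sketched as a name check. This is illustrative Python only, since a real fix would live in the MGS code. The function name, the four-hex-digit target index, and the sample listing are assumptions, and the "other known names" mentioned in the comment are left out.

```python
import re


def is_known_config_log(name, fsname):
    """Return True if name looks like fsname-<MDTxxxx|OSTxxxx|client|sptlrpc|params>."""
    pattern = (
        rf"{re.escape(fsname)}-"
        r"(?:MDT[0-9a-fA-F]{4}|OST[0-9a-fA-F]{4}|client|sptlrpc|params)"
    )
    # fullmatch rejects any suffix, so a backup such as demo-MDT0000.bak is skipped.
    return re.fullmatch(pattern, name) is not None


# Hypothetical CONFIGS listing; demo-MDT0000.bak is the name seen in the syslog errors.
listing = [
    "demo-MDT0000",
    "demo-MDT0000.bak",
    "demo-OST0000",
    "demo-OST0001",
    "demo-client",
    "demo-params",
]

to_process = [name for name in listing if is_known_config_log(name, "demo")]
print(to_process)
```

With a filter of this kind, a conf_param update would never try to write to demo-MDT0000.bak, which is the log named in the rc = -17 errors.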
| Comment by Artem Blagodarenko (Inactive) [ 17/Jun/17 ] |
|
> This avoids all sorts of problems if other backup file names are used. |
| Comment by Artem Blagodarenko (Inactive) [ 16/Jul/19 ] |
|
Can we close this issue as |