Lustre / LU-8948

Cannot change conf_param settings after changing the NID of a Lustre OSD using lctl replace_nids

Details

    • Bug
    • Resolution: Unresolved
    • Minor

    Description

      I am attempting to use lctl replace_nids, per the Lustre manual, to change the NIDs of OSDs. The intention is to convert servers that were formatted without servicenode entries, so that the targets can be configured for failover.

      The documentation is ambiguous, and my attempt to use this command fails: every lctl conf_param command issued after lctl replace_nids returns an error. I have tried several experiments, without success. The following is an outline of the process followed, which covers several variations.

      If the replace_nids command is not suitable for this exercise, then the documentation should clarify the use cases for which it is suitable.

      A very simple test was also attempted, whereby the MDS NID was changed from 10.10.2.12@tcp0 to 10.10.2.14@tcp0. The result is the same (see last test case).

      Note: failure here constitutes an inability to alter Lustre parameters (in the example, changing the quota settings fails). The file system does mount and can be used by a client.

      I'd like to have the documentation clarified with the exact syntax and process, as well as use cases for the lctl replace_nids command, in case there is something I have missed.

       


      Format MGS, MDT0000, OST0000, OST0001 as ldiskfs, no failover:

      mds1: mkfs.lustre --mgs /dev/sda
      mds2: mkfs.lustre --mdt --index 0 --mgsnode 10.10.2.11@tcp0 --fsname demo /dev/sdb
      oss1: mkfs.lustre --ost --index 0 --fsname demo --mgsnode 10.10.2.11@tcp0 /dev/sda
      oss2: mkfs.lustre --ost --index 1 --fsname demo --mgsnode 10.10.2.11@tcp0 /dev/sdb
      
      
      
      

      Mount on client and confirm FS is operating correctly (create files, check stripes, check lfs df).

      Use a simple check that parameters can be set persistently:

      lctl conf_param demo.quota.ost=ug
      ...
      lctl conf_param demo.quota.ost=none
      
      
      
      

      Umount client, MDT0000, OST0000, OST0001. MGS remains online.

      Run tunefs.lustre on MDT0000, adding servicenodes:

      tunefs.lustre --erase-params \
        --servicenode 10.10.2.12@tcp0:10.10.2.11@tcp0 \
        --mgsnode 10.10.2.11@tcp0 --mgsnode 10.10.2.12@tcp0 \
        /dev/sdb
      
      
      
      

      On MGS, run lctl replace_nids:

      lctl replace_nids demo-MDT0000 10.10.2.12@tcp0:10.10.2.11@tcp0
      
      
      
      

      Remount MDT0000

      MGS syslog contains:

      Dec  9 00:13:45 rh7z-mds1 kernel: Lustre: Found index 0 for demo-MDT0000, updating log
      
      
      
      

      Remount OST0000, OST0001, client in sequence.

      Verify FS is online, files still accessible on client.

      Re-run a simple check that parameters can be set persistently:

      lctl conf_param demo.quota.ost=ug
      
      
      
      

      Returns:

      error: conf_param: File exists
      
      
      
      

      MGS syslog reports errors:

      Dec  9 00:14:56 rh7z-mds1 kernel: LustreError: 4879:0:(llog.c:336:llog_init_handle()) MGS: llog uuid mismatch: config_uuid/
      Dec  9 00:14:56 rh7z-mds1 kernel: LustreError: 4879:0:(mgs_llog.c:1446:record_start_log()) MGS: can't start log demo-MDT0000.bak: rc = -17
      Dec  9 00:14:56 rh7z-mds1 kernel: LustreError: 4879:0:(mgs_llog.c:1543:mgs_write_log_direct_all()) MGS: writing log demo-MDT0000.bak: rc = -17
      Dec  9 00:14:56 rh7z-mds1 kernel: LustreError: 4879:0:(mgs_llog.c:3626:mgs_write_log_param()) err -17 on param 'quota.ost=none'
      Dec  9 00:14:56 rh7z-mds1 kernel: LustreError: 4879:0:(mgs_handler.c:993:mgs_iocontrol()) MGS: setparam err: rc = -17
      Dec  9 00:14:56 rh7z-mds1 kernel: LustreError: 4879:0:(mgs_handler.c:993:mgs_iocontrol()) Skipped 1 previous similar message
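
      The log lines above can be broken down mechanically to show where the failure propagates and what the return code means. The following is a minimal sketch, not part of the original report: the function name parse_lustre_error and the regular expressions are illustrative assumptions based only on the line format shown here. It decodes rc = -17 as EEXIST, which is the errno behind the "File exists" message printed by lctl conf_param.

```python
import errno
import os
import re

# Assumed layout, taken from the syslog excerpt in this report:
#   LustreError: <pid>:<n>:(<file>:<line>:<function>()) <message>
LINE_RE = re.compile(
    r"LustreError: (?P<pid>\d+):\d+:"
    r"\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\)"
    r" (?P<msg>.*)$"
)
# Return codes appear either as "rc = -17" or as "err -17" in these messages.
RC_RE = re.compile(r"(?:rc = |err )(-\d+)")


def parse_lustre_error(line):
    """Return source location, function, rc and errno name, or None if no match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    rc_match = RC_RE.search(m.group("msg"))
    rc = int(rc_match.group(1)) if rc_match else None
    name = errno.errorcode.get(-rc) if rc is not None else None
    return {
        "file": m.group("file"),
        "line": int(m.group("line")),
        "func": m.group("func"),
        "rc": rc,
        "errno": name,
    }


if __name__ == "__main__":
    sample = (
        "Dec  9 00:14:56 rh7z-mds1 kernel: LustreError: "
        "4879:0:(mgs_llog.c:1446:record_start_log()) "
        "MGS: can't start log demo-MDT0000.bak: rc = -17"
    )
    info = parse_lustre_error(sample)
    # Print the call site, the symbolic errno, and the system's text for it.
    print(info["func"], info["errno"], os.strerror(-info["rc"]))
```

      Applied to the excerpt, the first line carrying rc = -17 comes from record_start_log() while starting the log demo-MDT0000.bak, and the later lines report the same code from the functions that log after it.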
      
      
      
      

      Umount client, MDT0000, OST0000, OST0001. MGS remains online.

      Run tunefs.lustre on MDT0000, adding servicenodes:

      tunefs.lustre --erase-params \
        --servicenode 10.10.2.12@tcp0:10.10.2.11@tcp0 \
        --mgsnode 10.10.2.11@tcp0 --mgsnode 10.10.2.12@tcp0 \
        /dev/sdb
      
      
      
      

      On MGS, run lctl replace_nids, using comma separator instead of colon, following Lustre manual explicitly:

      lctl replace_nids demo-MDT0000 10.10.2.12@tcp0,10.10.2.11@tcp0
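
      To confirm that the colon and comma attempts name the same NIDs and differ only in the separator, the two argument strings can be compared after splitting. This is a minimal sketch under stated assumptions: split_nids is a hypothetical helper, it handles only IPv4 NIDs of the form address@network as used in this report, and it does not reflect how lctl itself parses its arguments.

```python
import ipaddress
import re


def split_nids(nid_list):
    """Split a NID list on ':' or ',' and check each entry is <IPv4>@<network>."""
    nids = []
    for item in re.split(r"[:,]", nid_list):
        addr, sep, net = item.partition("@")
        if not sep or not net:
            raise ValueError(f"not a NID: {item!r}")
        # IPv4Address raises a ValueError subclass on a malformed address.
        ipaddress.IPv4Address(addr)
        nids.append((addr, net))
    return nids


if __name__ == "__main__":
    colon_form = split_nids("10.10.2.12@tcp0:10.10.2.11@tcp0")
    comma_form = split_nids("10.10.2.12@tcp0,10.10.2.11@tcp0")
    # Both attempts pass the same NIDs in the same order.
    print(colon_form == comma_form)
```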
      
      
      
      

      Remount MDT0000

      MGS syslog contains:

      Dec  9 00:33:00 rh7z-mds1 kernel: Lustre: Found index 0 for demo-MDT0000, updating log
      
      
      
      

      Remount MDT0000, OST0000, OST0001, client

      Re-run a simple check that parameters can be set persistently:

      [root@rh7z-mds1 ~]# lctl conf_param demo.quota.ost=none
      error: conf_param: File exists
      
      
      
      

      MGS syslog reports errors:

      Dec  9 00:33:52 rh7z-mds1 kernel: LustreError: 4969:0:(llog.c:336:llog_init_handle()) MGS: llog uuid mismatch: config_uuid/
      Dec  9 00:33:52 rh7z-mds1 kernel: LustreError: 4969:0:(mgs_llog.c:1446:record_start_log()) MGS: can't start log demo-MDT0000.bak: rc = -17
      Dec  9 00:33:52 rh7z-mds1 kernel: LustreError: 4969:0:(mgs_llog.c:1543:mgs_write_log_direct_all()) MGS: writing log demo-MDT0000.bak: rc = -17
      Dec  9 00:33:52 rh7z-mds1 kernel: LustreError: 4969:0:(mgs_llog.c:3626:mgs_write_log_param()) err -17 on param 'quota.ost=none'
      Dec  9 00:33:52 rh7z-mds1 kernel: LustreError: 4969:0:(mgs_handler.c:993:mgs_iocontrol()) MGS: setparam err: rc = -17
      
      
      
      

      Umount client, MDT0000, OST0000, OST0001. MGS remains online.

      Run tunefs.lustre on MDT0000 with a single servicenode NID and a single mgsnode:

      tunefs.lustre --erase-params --servicenode 10.10.2.12@tcp0 --mgsnode 10.10.2.11@tcp0 /dev/sdb
      
      
      
      

      On MGS, run lctl replace_nids:

      [root@rh7z-mds1 ~]# lctl replace_nids demo-MDT0000 10.10.2.12@tcp0
      
      
      
      

      Remount MDT0000, OST0000, OST0001, client

      MGS syslog contains:

      Dec  9 00:38:31 rh7z-mds1 kernel: Lustre: Found index 0 for demo-MDT0000, updating log
      
      
      
      

      Try to change quota setting again, MGS reports the same error.


      Umount client, MDT0000, OST0000, OST0001. MGS remains online.

      Run tunefs.lustre on MDT0000 with the equivalent of the original settings:

      tunefs.lustre --erase-params --mgsnode 10.10.2.11@tcp0 /dev/sdb
      
      
      
      

      Remount MDT0000, OST0000, OST0001, client

      Re-run a simple check that parameters can be set persistently:

      [root@rh7z-mds1 ~]# lctl conf_param demo.quota.ost=none
      error: conf_param: File exists
      
      
      
      

      MGS syslog reports same error.


      Umount all, reformat all targets to create new FS.

      Mount MGT, MDT0000, OST0000, OST0001

      Verify that client can mount the FS.

      Run quota test as before.

      Umount client, MDT0000, OST0000, OST0001

      Remove kernel modules on MDT0000 host.

      Change IPv4 address from 10.10.2.12 to 10.10.2.14, reload lnet module and verify that new NID is applied.

      On the MGS, run:

      lctl replace_nids demo-MDT0000 10.10.2.14@tcp0
      
      
      

      Remount MDT0000, OST0000, OST0001, client

      Verify that FS is usable.

      Re-run quota conf_param test. Fails as before.

       

          People

            wc-triage WC Triage
            malkolm Malcolm Cowe (Inactive)