DNE: Setting of remote_dir_gid parameter not persistent

    Description

      The error occurred during soak testing of build '20160713' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160713). MDSes have been configured using ldiskfs, OSTs using zfs. The test environment consists of 4 MDSes with 1 MDT each and 6 OSSes with 4 OSTs each. MDS and OSS nodes are configured in an active-active HA configuration.
      Roles:

      • lola-8 MGS/MDS
      • lola-[9-11] MDS

      DNE has been enabled using the command sequence (see Lustre manual page 96):

      pdsh -g mds 'lctl set_param mdt.*.enable_remote_dir=1'
      pdsh -g mds 'lctl set_param mdt.*.enable_remote_dir_gid=-1'
      and in particular
      pdsh -w lola-8 'lctl set_param -P mdt.*.enable_remote_dir=1'
      pdsh -w lola-8 'lctl set_param -P mdt.*.enable_remote_dir_git=-1'
      

      (The latter two commands only work on the MGS node.)
      The problem occurs after any of the nodes lola-[9-11] has been restarted, or after resources have been failed over or failed back.
      While the parameter 'enable_remote_dir' is persistent on the non-MGS MDSes, the parameter 'enable_remote_dir_gid' is not.
      Therefore the command:

      [soaktest@lola-16 ~]$ lfs setdirstripe -c 4 -i 1  /mnt/soaked/soaktest/hsm_rbh/
      error on LL_IOC_LMV_SETSTRIPE '/mnt/soaked/soaktest/hsm_rbh/' (3): Operation not permitted
      error: setdirstripe: create stripe dir '/mnt/soaked/soaktest/hsm_rbh/' failed
      
      ---------------
      --> Remote dir setting:
      ----------------
      lola-8
      ----------------
      Remote dir_gid setting soaked-MDT0000: -1
      ----------------
      lola-9
      ----------------
      Remote dir_gid setting soaked-MDT0001: 0
      ----------------
      lola-10
      ----------------
      Remote dir_gid setting soaked-MDT0002: -1
      ----------------
      lola-11
      ----------------
      Remote dir_gid setting soaked-MDT0003: 0
      

      failed. This will break all test (slurm) jobs that rely on this functionality.
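The affected MDTs can be picked out by filtering parameter output for targets whose `enable_remote_dir_gid` differs from the expected value. The following is only a minimal sketch: the function name `find_gid_mismatch` is hypothetical, and it assumes input lines of the form `host: mdt.<target>.enable_remote_dir_gid=<value>`, as printed when `lctl get_param` is run through `pdsh`.

```shell
#!/bin/sh
# find_gid_mismatch EXPECTED   (hypothetical helper, not part of Lustre)
# Reads "host: mdt.<target>.enable_remote_dir_gid=<value>" lines on stdin
# and prints "<host> <target> <value>" for every target whose value
# differs from EXPECTED. Lines for other parameters are ignored.
find_gid_mismatch() {
    expected=$1
    awk -v want="$expected" '
        /\.enable_remote_dir_gid=/ {
            host = $1
            sub(/:$/, "", host)          # strip the trailing colon added by pdsh
            split($2, kv, "=")           # kv[1] = parameter path, kv[2] = value
            split(kv[1], path, ".")      # path[2] = target name, e.g. soaked-MDT0001
            if (kv[2] != want) print host, path[2], kv[2]
        }'
}

# Possible usage on a live system (assumes the pdsh group "mds" from this ticket):
#   pdsh -g mds 'lctl get_param mdt.*.enable_remote_dir_gid' | find_gid_mismatch -1
```

An empty result would indicate that every MDT reports the expected value.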

      After setting the parameters on the nodes again the command

      [soaktest@lola-16 ~]$ lfs setdirstripe -c 4 -i 1  /mnt/soaked/soaktest/hsm_rbh/
      [soaktest@lola-16 ~]$ lfs setdirstripe -c 4 -i 1  -D /mnt/soaked/soaktest/hsm_rbh/
      [soaktest@lola-16 ~]$ lfs getdirstripe  /mnt/soaked/soaktest/hsm_rbh/
      /mnt/soaked/soaktest/hsm_rbh/
      [soaktest@lola-16 ~]$ lfs getdirstripe  /mnt/soaked/soaktest/hsm_rbh/
      /mnt/soaked/soaktest/hsm_rbh/
      lmv_stripe_count: 4 lmv_stripe_offset: 1
      mdtidx           FID[seq:oid:ver]
           1           [0x240007160:0x3:0x0]
           2           [0x28000d714:0x3:0x0]
           3           [0x2c000a810:0x1:0x0]
           0           [0x20000fe01:0x3:0x0]
      

      completes successfully.

          Activity

            [LU-8414] DNE: Setting of remote_dir_gid parameter not persistent

            Sorry, I didn't read and think carefully enough. Indeed, conf_param fixes the problem:

            [root@lola-16 ~]# pdsh -g mds 'lctl get_param mdt.*.enable_remote_dir ; lctl get_param mdt.*.enable_remote_dir_gid' | sort -k 2,2
            lola-8: mdt.soaked-MDT0000.enable_remote_dir=0
            lola-8: mdt.soaked-MDT0000.enable_remote_dir_gid=0
            lola-9: mdt.soaked-MDT0001.enable_remote_dir=0
            lola-9: mdt.soaked-MDT0001.enable_remote_dir_gid=0
            lola-10: mdt.soaked-MDT0002.enable_remote_dir=0
            lola-10: mdt.soaked-MDT0002.enable_remote_dir_gid=0
            lola-11: mdt.soaked-MDT0003.enable_remote_dir=0
            lola-11: mdt.soaked-MDT0003.enable_remote_dir_gid=0
            [root@lola-16 ~]# ssh lola-8 'lctl conf_param soaked.mdt.enable_remote_dir=1'
            [root@lola-16 ~]# pdsh -g mds 'lctl get_param mdt.*.enable_remote_dir ; lctl get_param mdt.*.enable_remote_dir_gid' | sort -k 2,2
            lola-8: mdt.soaked-MDT0000.enable_remote_dir=0
            lola-8: mdt.soaked-MDT0000.enable_remote_dir_gid=0
            lola-9: mdt.soaked-MDT0001.enable_remote_dir=1
            lola-9: mdt.soaked-MDT0001.enable_remote_dir_gid=0
            lola-10: mdt.soaked-MDT0002.enable_remote_dir=0
            lola-10: mdt.soaked-MDT0002.enable_remote_dir_gid=0
            lola-11: mdt.soaked-MDT0003.enable_remote_dir=1
            lola-11: mdt.soaked-MDT0003.enable_remote_dir_gid=0
            [root@lola-16 ~]# ssh lola-8 'lctl conf_param soaked.mdt.enable_remote_dir_gid=-1'
            [root@lola-16 ~]# pdsh -g mds 'lctl get_param mdt.*.enable_remote_dir ; lctl get_param mdt.*.enable_remote_dir_gid' | sort -k 2,2
            lola-8: mdt.soaked-MDT0000.enable_remote_dir=1
            lola-8: mdt.soaked-MDT0000.enable_remote_dir_gid=-1
            lola-9: mdt.soaked-MDT0001.enable_remote_dir=1
            lola-9: mdt.soaked-MDT0001.enable_remote_dir_gid=-1
            lola-10: mdt.soaked-MDT0002.enable_remote_dir=1
            lola-10: mdt.soaked-MDT0002.enable_remote_dir_gid=-1
            lola-11: mdt.soaked-MDT0003.enable_remote_dir=1
            lola-11: mdt.soaked-MDT0003.enable_remote_dir_gid=-1
            
            heckes Frank Heckes (Inactive) added the comment above.

            Frank, as shown in my previous comment, the formats of conf_param and set_param are different. For conf_param you need to specify the filesystem name first instead of the device name:

            lctl conf_param soaked.mdt.enable_remote_dir=1
            lctl conf_param soaked.mdt.enable_remote_dir_gid=-1
            
            adilger Andreas Dilger added the comment above.
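The naming difference described in that comment can be illustrated with a small string rewrite. This is a sketch only: `to_conf_param` is a hypothetical helper, and it covers just the filesystem-wide form shown in this ticket (`<fsname>.mdt.<parameter>=<value>`), not every name that conf_param may accept.

```shell
#!/bin/sh
# to_conf_param FSNAME SETTING   (hypothetical helper, not part of Lustre)
# Rewrites a set_param-style setting such as
#   mdt.*.enable_remote_dir=1   or   mdt.soaked-MDT0000.enable_remote_dir=1
# into the filesystem-wide conf_param form
#   <fsname>.mdt.enable_remote_dir=1
to_conf_param() {
    fsname=$1
    setting=$2
    type=${setting%%.*}     # first component, e.g. "mdt"
    rest=${setting#*.}      # drop the type: "*.enable_remote_dir=1"
    rest=${rest#*.}         # drop the device or wildcard: "enable_remote_dir=1"
    printf '%s.%s.%s\n' "$fsname" "$type" "$rest"
}

# Example (quote the argument so the shell does not expand the "*"):
#   to_conf_param soaked 'mdt.*.enable_remote_dir_gid=-1'
```

The resulting string would then be passed to `lctl conf_param` on the MGS node, as in the two commands quoted in the comment.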

            I checked with tip of master (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160916)

            Usage of conf_param had (has) no effect:

            [root@lola-16 ~]# pdsh -g mds 'lctl conf_param mdt.*.enable_remote_dir=1 ; lctl conf_param mdt.*.enable_remote_dir_gid=-1'
            lola-11: No device found for name MGS: Invalid argument
            lola-11: This command must be run on the MGS.
            lola-11: error: conf_param: No such device
            lola-11: No device found for name MGS: Invalid argument
            lola-11: This command must be run on the MGS.
            lola-11: error: conf_param: No such device
            lola-8: error: conf_param: Invalid argument
            lola-8: error: conf_param: Invalid argument
            lola-10: No device found for name MGS: Invalid argument
            lola-10: This command must be run on the MGS.
            lola-10: error: conf_param: No such device
            lola-10: No device found for name MGS: Invalid argument
            lola-10: This command must be run on the MGS.
            lola-10: error: conf_param: No such device
            lola-9: No device found for name MGS: Invalid argument
            lola-9: This command must be run on the MGS.
            lola-9: error: conf_param: No such device
            lola-9: No device found for name MGS: Invalid argument
            lola-9: This command must be run on the MGS.
            lola-9: error: conf_param: No such device
            [root@lola-16 ~]# pdsh -g mds 'lctl get_param mdt.*.enable_remote_dir ; lctl get_param mdt.*.enable_remote_dir_gid'
            lola-11: mdt.soaked-MDT0003.enable_remote_dir=0
            lola-11: mdt.soaked-MDT0003.enable_remote_dir_gid=0
            lola-8: mdt.soaked-MDT0000.enable_remote_dir=0
            lola-8: mdt.soaked-MDT0000.enable_remote_dir_gid=0
            lola-9: mdt.soaked-MDT0001.enable_remote_dir=0
            lola-9: mdt.soaked-MDT0001.enable_remote_dir_gid=0
            lola-10: mdt.soaked-MDT0002.enable_remote_dir=0
            lola-10: mdt.soaked-MDT0002.enable_remote_dir_gid=0 
            

            Checking with set_param -P:

            [root@lola-16 ~]# pdsh -g mds 'lctl set_param -P mdt.*.enable_remote_dir=1 ; lctl set_param -P mdt.*.enable_remote_dir_gid=-1'
            lola-11: No device found for name MGS: Invalid argument
            lola-11: This command must be run on the MGS.
            lola-11: error: executing set_param: No such device
            lola-11: No device found for name MGS: Invalid argument
            lola-11: This command must be run on the MGS.
            lola-11: error: executing set_param: No such device
            lola-10: No device found for name MGS: Invalid argument
            lola-10: This command must be run on the MGS.
            lola-10: error: executing set_param: No such device
            lola-10: No device found for name MGS: Invalid argument
            lola-10: This command must be run on the MGS.
            lola-10: error: executing set_param: No such device
            lola-9: No device found for name MGS: Invalid argument
            lola-9: This command must be run on the MGS.
            lola-9: error: executing set_param: No such device
            lola-9: No device found for name MGS: Invalid argument
            lola-9: This command must be run on the MGS.
            lola-9: error: executing set_param: No such device
            [root@lola-16 ~]# pdsh -g mds 'lctl get_param mdt.*.enable_remote_dir ; lctl get_param mdt.*.enable_remote_dir_gid'
            lola-11: mdt.soaked-MDT0003.enable_remote_dir=1
            lola-11: mdt.soaked-MDT0003.enable_remote_dir_gid=-1
            lola-8: mdt.soaked-MDT0000.enable_remote_dir=1
            lola-8: mdt.soaked-MDT0000.enable_remote_dir_gid=-1
            lola-9: mdt.soaked-MDT0001.enable_remote_dir=1
            lola-9: mdt.soaked-MDT0001.enable_remote_dir_gid=-1
            lola-10: mdt.soaked-MDT0002.enable_remote_dir=1
            lola-10: mdt.soaked-MDT0002.enable_remote_dir_gid=-1
            

            After a crash and remount of mdt-1 on lola-9, the parameters turn out to be persistently configured:

            [root@lola-16 ~]# date
            Sun Sep 18 02:54:52 PDT 2016
            [root@lola-16 ~]# pdsh -g mds 'lctl get_param mdt.*.enable_remote_dir ; lctl get_param mdt.*.enable_remote_dir_gid'
            lola-11: mdt.soaked-MDT0003.enable_remote_dir=1
            lola-11: mdt.soaked-MDT0003.enable_remote_dir_gid=-1
            lola-8: mdt.soaked-MDT0000.enable_remote_dir=1
            lola-8: mdt.soaked-MDT0000.enable_remote_dir_gid=-1
            lola-9: mdt.soaked-MDT0001.enable_remote_dir=1
            lola-9: mdt.soaked-MDT0001.enable_remote_dir_gid=-1
            lola-10: mdt.soaked-MDT0002.enable_remote_dir=1
            lola-10: mdt.soaked-MDT0002.enable_remote_dir_gid=-1
            

            Checked again after a crash of lola-11:

            Sun Sep 18 07:34:04 PDT 2016
            [root@lola-16 ~]# pdsh -g mds 'lctl get_param mdt.*.enable_remote_dir ; lctl get_param mdt.*.enable_remote_dir_gid'
            lola-11: mdt.soaked-MDT0003.enable_remote_dir=1
            lola-11: mdt.soaked-MDT0003.enable_remote_dir_gid=-1
            lola-9: mdt.soaked-MDT0001.enable_remote_dir=1
            lola-9: mdt.soaked-MDT0001.enable_remote_dir_gid=-1
            lola-8: mdt.soaked-MDT0000.enable_remote_dir=1
            lola-8: mdt.soaked-MDT0000.enable_remote_dir_gid=-1
            lola-10: mdt.soaked-MDT0002.enable_remote_dir=1
            lola-10: mdt.soaked-MDT0002.enable_remote_dir_gid=-1
            

            So set_param -P seems to work, but the irrelevant error messages are confusing. (This is similar to the problem described in LU-8456.)

            heckes Frank Heckes (Inactive) added the comment above.
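A persistence check like the one in that comment can be made less error-prone by saving the parameter output before and after a restart and comparing the two snapshots. The following is a rough sketch under stated assumptions: `params_changed` and the snapshot file names are hypothetical, and each file is assumed to hold saved `pdsh` output in the `host: mdt.<target>.<param>=<value>` form shown in the transcripts.

```shell
#!/bin/sh
# params_changed BEFORE_FILE AFTER_FILE   (hypothetical helper)
# Prints the lines of AFTER_FILE that do not appear in BEFORE_FILE,
# i.e. parameters whose reported value differs after the restart.
# Line order is ignored because pdsh output order varies between runs.
params_changed() {
    before_sorted=$(mktemp) || return 2
    after_sorted=$(mktemp) || { rm -f "$before_sorted"; return 2; }
    LC_ALL=C sort "$1" > "$before_sorted"
    LC_ALL=C sort "$2" > "$after_sorted"
    # comm -13 suppresses lines unique to the first file and lines common
    # to both, leaving only lines unique to the second (after) file.
    LC_ALL=C comm -13 "$before_sorted" "$after_sorted"
    rm -f "$before_sorted" "$after_sorted"
}

# Possible usage (file names are placeholders):
#   pdsh -g mds 'lctl get_param mdt.*.enable_remote_dir ; lctl get_param mdt.*.enable_remote_dir_gid' > before.txt
#   ... restart or fail over an MDS ...
#   pdsh -g mds 'lctl get_param mdt.*.enable_remote_dir ; lctl get_param mdt.*.enable_remote_dir_gid' > after.txt
#   params_changed before.txt after.txt
```

No output would suggest that all captured parameters kept their values across the restart.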

            I'll check this tomorrow.

            heckes Frank Heckes (Inactive) added the comment above.

            Frank, instead of using lctl set_param -P, which is known to have some problems (LU-7004, LU-7183) can you please try:

            lctl conf_param myth.mdt.enable_remote_dir=1
            lctl conf_param myth.mdt.enable_remote_dir_gid=-1
            

            since this is the "traditional" way of setting permanent parameters and uses a different mechanism. If this fixes your problem then we should also fix the manual.

            adilger Andreas Dilger added the comment above.
            laisiyao Lai Siyao added a comment -

            I tested remounting 2 MDTs on 2 MDS servers, and both parameters were persistent.

            Also, in your test perhaps neither parameter was persistent, but because the remote_dir_gid default value is 1, it looks as if it were persistent.

            Could you collect debug logs on the non-MGS MDSes during failover or restart?


            We are still seeing this issue with the current tip of master.

            cliffw Cliff White (Inactive) added the comment above.

            To state this explicitly: setdirstripe commands work without problems when the 'enable_remote_dir_gid' parameter is set to -1 on all MDS nodes.

            heckes Frank Heckes (Inactive) added the comment above.

            People

              laisiyao Lai Siyao
              heckes Frank Heckes (Inactive)