[LU-8414] DNE: Setting of remote_dir_gid parameter not persistent Created: 19/Jul/16  Updated: 13/Oct/16  Resolved: 13/Oct/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.9.0
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Critical
Reporter: Frank Heckes (Inactive) Assignee: Lai Siyao
Resolution: Cannot Reproduce Votes: 0
Labels: soak
Environment:

lola
build: https://build.hpdd.intel.com/job/lustre-master/3406


Issue Links:
Related
is related to LU-7004 fix "lctl set_param -P" to allow depr... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The error occurred during soak testing of build '20160713' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160713). MDSes have been configured using ldiskfs, OSTs using zfs. The test environment consists of 4 MDSes with 1 MDT each and 6 OSSes with 4 OSTs each. MDS and OSS nodes are configured in an active-active HA configuration.
Roles:

  • lola-8 MGS/MDS
  • lola-[9-11] MDS

DNE has been enabled using the command sequence (see Lustre manual page 96):

pdsh -g mds 'lctl set_param mdt.*.enable_remote_dir=1'
pdsh -g mds 'lctl set_param mdt.*.enable_remote_dir_gid=-1'
especially
pdsh -w lola-8 'lctl set_param -P mdt.*.enable_remote_dir=1'
pdsh -w lola-8 'lctl set_param -P mdt.*.enable_remote_dir_gid=-1'

(The latter two commands work only on the MGS node.)
The problem occurs after any of the nodes lola-[9-11] has been restarted or its resources have been failed over / failed back.
While the parameter 'enable_remote_dir' is persistent on the non-MGS MDSes, the parameter 'enable_remote_dir_gid' isn't.
Therefore the following command fails:

[soaktest@lola-16 ~]$ lfs setdirstripe -c 4 -i 1  /mnt/soaked/soaktest/hsm_rbh/
error on LL_IOC_LMV_SETSTRIPE '/mnt/soaked/soaktest/hsm_rbh/' (3): Operation not permitted
error: setdirstripe: create stripe dir '/mnt/soaked/soaktest/hsm_rbh/' failed

---------------
--> Remote dir setting:
----------------
lola-8
----------------
Remote dir_gid setting soaked-MDT0000: -1
----------------
lola-9
----------------
Remote dir_gid setting soaked-MDT0001: 0
----------------
lola-10
----------------
Remote dir_gid setting soaked-MDT0002: -1
----------------
lola-11
----------------
Remote dir_gid setting soaked-MDT0003: 0

This breaks all test (slurm) jobs that rely on this functionality.

After setting the parameters on the nodes again, the commands

[soaktest@lola-16 ~]$ lfs setdirstripe -c 4 -i 1  /mnt/soaked/soaktest/hsm_rbh/
[soaktest@lola-16 ~]$ lfs setdirstripe -c 4 -i 1  -D /mnt/soaked/soaktest/hsm_rbh/
[soaktest@lola-16 ~]$ lfs getdirstripe  /mnt/soaked/soaktest/hsm_rbh/
/mnt/soaked/soaktest/hsm_rbh/
[soaktest@lola-16 ~]$ lfs getdirstripe  /mnt/soaked/soaktest/hsm_rbh/
/mnt/soaked/soaktest/hsm_rbh/
lmv_stripe_count: 4 lmv_stripe_offset: 1
mdtidx           FID[seq:oid:ver]
     1           [0x240007160:0x3:0x0]
     2           [0x28000d714:0x3:0x0]
     3           [0x2c000a810:0x1:0x0]
     0           [0x20000fe01:0x3:0x0]

complete successfully.



 Comments   
Comment by Frank Heckes (Inactive) [ 19/Jul/16 ]

To state this explicitly: the setdirstripe commands work without problems when the 'enable_remote_dir_gid' parameter is set to -1 on all MDS nodes.
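
The condition stated above can be checked mechanically. The following is a minimal shell sketch and not part of the original report: the function name find_bad_gid is hypothetical, and it assumes input in the pdsh-prefixed 'lctl get_param' format that appears elsewhere in this ticket (for example 'lola-9: mdt.soaked-MDT0001.enable_remote_dir_gid=0').

```shell
# find_bad_gid (hypothetical helper): read `lctl get_param` output on stdin,
# optionally prefixed by pdsh with "node: ", and print every line whose
# enable_remote_dir_gid value is not -1. Prints nothing if all MDTs are fine.
find_bad_gid() {
    awk -F= '
        /enable_remote_dir_gid=/ {
            v = $2
            gsub(/[ \t\r]/, "", v)   # tolerate trailing whitespace
            if (v != "-1") print
        }'
}

# Intended use on the admin node (assumes pdsh and the "mds" group from this ticket):
#   pdsh -g mds 'lctl get_param mdt.*.enable_remote_dir_gid' | find_bad_gid
```

Any line printed names an MDT on which setdirstripe by a non-root user would be expected to fail with "Operation not permitted", as in the description.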

Comment by Cliff White (Inactive) [ 17/Aug/16 ]

We are still seeing this issue with the current tip of master.

Comment by Lai Siyao [ 05/Sep/16 ]

I tested remounting 2 MDTs on 2 MDS servers, and both parameters were persistent.

Besides, in your test perhaps neither parameter was persistent, but because the remote_dir_gid default value is 1, it looks like it is persistent.

Could you collect debug logs on the non-MGS MDSes during failover or restart?

Comment by Andreas Dilger [ 08/Sep/16 ]

Frank, instead of using lctl set_param -P, which is known to have some problems (LU-7004, LU-7183), can you please try:

lctl conf_param myth.mdt.enable_remote_dir=1
lctl conf_param myth.mdt.enable_remote_dir_gid=-1

since this is the "traditional" way of setting permanent parameters and uses a different mechanism. If this fixes your problem then we should also fix the manual.

Comment by Frank Heckes (Inactive) [ 14/Sep/16 ]

I'll check this tomorrow.

Comment by Frank Heckes (Inactive) [ 19/Sep/16 ]

I checked with tip of master (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160916)

Usage of conf_param had (has) no effect:

[root@lola-16 ~]# pdsh -g mds 'lctl conf_param mdt.*.enable_remote_dir=1 ; lctl conf_param mdt.*.enable_remote_dir_gid=-1'
lola-11: No device found for name MGS: Invalid argument
lola-11: This command must be run on the MGS.
lola-11: error: conf_param: No such device
lola-11: No device found for name MGS: Invalid argument
lola-11: This command must be run on the MGS.
lola-11: error: conf_param: No such device
lola-8: error: conf_param: Invalid argument
lola-8: error: conf_param: Invalid argument
lola-10: No device found for name MGS: Invalid argument
lola-10: This command must be run on the MGS.
lola-10: error: conf_param: No such device
lola-10: No device found for name MGS: Invalid argument
lola-10: This command must be run on the MGS.
lola-10: error: conf_param: No such device
lola-9: No device found for name MGS: Invalid argument
lola-9: This command must be run on the MGS.
lola-9: error: conf_param: No such device
lola-9: No device found for name MGS: Invalid argument
lola-9: This command must be run on the MGS.
lola-9: error: conf_param: No such device
[root@lola-16 ~]# pdsh -g mds 'lctl get_param mdt.*.enable_remote_dir ; lctl get_param mdt.*.enable_remote_dir_gid'
lola-11: mdt.soaked-MDT0003.enable_remote_dir=0
lola-11: mdt.soaked-MDT0003.enable_remote_dir_gid=0
lola-8: mdt.soaked-MDT0000.enable_remote_dir=0
lola-8: mdt.soaked-MDT0000.enable_remote_dir_gid=0
lola-9: mdt.soaked-MDT0001.enable_remote_dir=0
lola-9: mdt.soaked-MDT0001.enable_remote_dir_gid=0
lola-10: mdt.soaked-MDT0002.enable_remote_dir=0
lola-10: mdt.soaked-MDT0002.enable_remote_dir_gid=0 

Checking with set_param -P:

[root@lola-16 ~]# pdsh -g mds 'lctl set_param -P mdt.*.enable_remote_dir=1 ; lctl set_param -P mdt.*.enable_remote_dir_gid=-1'
lola-11: No device found for name MGS: Invalid argument
lola-11: This command must be run on the MGS.
lola-11: error: executing set_param: No such device
lola-11: No device found for name MGS: Invalid argument
lola-11: This command must be run on the MGS.
lola-11: error: executing set_param: No such device
lola-10: No device found for name MGS: Invalid argument
lola-10: This command must be run on the MGS.
lola-10: error: executing set_param: No such device
lola-10: No device found for name MGS: Invalid argument
lola-10: This command must be run on the MGS.
lola-10: error: executing set_param: No such device
lola-9: No device found for name MGS: Invalid argument
lola-9: This command must be run on the MGS.
lola-9: error: executing set_param: No such device
lola-9: No device found for name MGS: Invalid argument
lola-9: This command must be run on the MGS.
lola-9: error: executing set_param: No such device
[root@lola-16 ~]# pdsh -g mds 'lctl get_param mdt.*.enable_remote_dir ; lctl get_param mdt.*.enable_remote_dir_gid'
lola-11: mdt.soaked-MDT0003.enable_remote_dir=1
lola-11: mdt.soaked-MDT0003.enable_remote_dir_gid=-1
lola-8: mdt.soaked-MDT0000.enable_remote_dir=1
lola-8: mdt.soaked-MDT0000.enable_remote_dir_gid=-1
lola-9: mdt.soaked-MDT0001.enable_remote_dir=1
lola-9: mdt.soaked-MDT0001.enable_remote_dir_gid=-1
lola-10: mdt.soaked-MDT0002.enable_remote_dir=1
lola-10: mdt.soaked-MDT0002.enable_remote_dir_gid=-1

After a crash and remount of mdt-1 on lola-9, the parameters turn out to be persistently configured:

[root@lola-16 ~]# date
Sun Sep 18 02:54:52 PDT 2016
[root@lola-16 ~]# pdsh -g mds 'lctl get_param mdt.*.enable_remote_dir ; lctl get_param mdt.*.enable_remote_dir_gid'
lola-11: mdt.soaked-MDT0003.enable_remote_dir=1
lola-11: mdt.soaked-MDT0003.enable_remote_dir_gid=-1
lola-8: mdt.soaked-MDT0000.enable_remote_dir=1
lola-8: mdt.soaked-MDT0000.enable_remote_dir_gid=-1
lola-9: mdt.soaked-MDT0001.enable_remote_dir=1
lola-9: mdt.soaked-MDT0001.enable_remote_dir_gid=-1
lola-10: mdt.soaked-MDT0002.enable_remote_dir=1
lola-10: mdt.soaked-MDT0002.enable_remote_dir_gid=-1

Checked again after a crash of lola-11:

Sun Sep 18 07:34:04 PDT 2016
[root@lola-16 ~]# pdsh -g mds 'lctl get_param mdt.*.enable_remote_dir ; lctl get_param mdt.*.enable_remote_dir_gid'
lola-11: mdt.soaked-MDT0003.enable_remote_dir=1
lola-11: mdt.soaked-MDT0003.enable_remote_dir_gid=-1
lola-9: mdt.soaked-MDT0001.enable_remote_dir=1
lola-9: mdt.soaked-MDT0001.enable_remote_dir_gid=-1
lola-8: mdt.soaked-MDT0000.enable_remote_dir=1
lola-8: mdt.soaked-MDT0000.enable_remote_dir_gid=-1
lola-10: mdt.soaked-MDT0002.enable_remote_dir=1
lola-10: mdt.soaked-MDT0002.enable_remote_dir_gid=-1

So set_param -P seems to work, but the irrelevant error messages from the non-MGS nodes are confusing. (This is similar to the problem described in LU-8456.)
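
One way to keep the relevant messages visible when a command is sent to all MDS nodes is to filter out the three-line error group that every non-MGS node prints. The following is a minimal sketch, assuming the message texts are exactly those shown in the output above; the function name drop_non_mgs_noise is hypothetical and is not part of lctl or pdsh.

```shell
# drop_non_mgs_noise (hypothetical helper): read pdsh output on stdin and
# drop the three messages that a non-MGS node prints when it is asked to
# run conf_param or set_param -P. All other lines pass through unchanged.
drop_non_mgs_noise() {
    awk '!/No device found for name MGS/ &&
         !/This command must be run on the MGS/ &&
         !/: No such device/'
}

# Intended use (assumes pdsh and the "mds" group from this ticket):
#   pdsh -g mds 'lctl set_param -P mdt.*.enable_remote_dir=1' 2>&1 | drop_non_mgs_noise
```

Applied to the conf_param attempt above, only the two 'lola-8: error: conf_param: Invalid argument' lines would remain, which are the messages coming from the MGS itself. Running the command only on the MGS node avoids the noise in the first place.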

Comment by Andreas Dilger [ 19/Sep/16 ]

Frank, as shown in my previous comment, the formats of conf_param and set_param are different. For conf_param you need to specify the filesystem name first instead of the device name:

lctl conf_param soaked.mdt.enable_remote_dir=1
lctl conf_param soaked.mdt.enable_remote_dir_gid=-1
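
The difference in format can be illustrated with a small shell sketch. The helper to_conf_param is hypothetical (it is not an lctl feature) and handles only the 'mdt.*.<param>=<value>' form used in this ticket, which it maps to '<fsname>.mdt.<param>=<value>'.

```shell
# to_conf_param FSNAME SPEC (hypothetical helper)
# Rewrite a set_param-style spec such as 'mdt.*.enable_remote_dir=1' into the
# conf_param form '<fsname>.mdt.enable_remote_dir=1' shown in this ticket.
# Any other spec is rejected rather than guessed at.
to_conf_param() {
    fsname=$1
    spec=$2
    case $spec in
        mdt.\*.*)
            rest=${spec#mdt.\*.}          # strip the literal "mdt.*." prefix
            printf '%s.mdt.%s\n' "$fsname" "$rest"
            ;;
        *)
            echo "unsupported spec: $spec" >&2
            return 1
            ;;
    esac
}

# Example, to be run on the MGS node only:
#   lctl conf_param "$(to_conf_param soaked 'mdt.*.enable_remote_dir_gid=-1')"
```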

Comment by Frank Heckes (Inactive) [ 20/Sep/16 ]

Sorry, I didn't read and think carefully. Indeed, conf_param fixes the problem:

[root@lola-16 ~]# pdsh -g mds 'lctl get_param mdt.*.enable_remote_dir ; lctl get_param mdt.*.enable_remote_dir_gid' | sort -k 2,2
lola-8: mdt.soaked-MDT0000.enable_remote_dir=0
lola-8: mdt.soaked-MDT0000.enable_remote_dir_gid=0
lola-9: mdt.soaked-MDT0001.enable_remote_dir=0
lola-9: mdt.soaked-MDT0001.enable_remote_dir_gid=0
lola-10: mdt.soaked-MDT0002.enable_remote_dir=0
lola-10: mdt.soaked-MDT0002.enable_remote_dir_gid=0
lola-11: mdt.soaked-MDT0003.enable_remote_dir=0
lola-11: mdt.soaked-MDT0003.enable_remote_dir_gid=0
[root@lola-16 ~]# ssh lola-8 'lctl conf_param soaked.mdt.enable_remote_dir=1'
[root@lola-16 ~]# pdsh -g mds 'lctl get_param mdt.*.enable_remote_dir ; lctl get_param mdt.*.enable_remote_dir_gid' | sort -k 2,2
lola-8: mdt.soaked-MDT0000.enable_remote_dir=0
lola-8: mdt.soaked-MDT0000.enable_remote_dir_gid=0
lola-9: mdt.soaked-MDT0001.enable_remote_dir=1
lola-9: mdt.soaked-MDT0001.enable_remote_dir_gid=0
lola-10: mdt.soaked-MDT0002.enable_remote_dir=0
lola-10: mdt.soaked-MDT0002.enable_remote_dir_gid=0
lola-11: mdt.soaked-MDT0003.enable_remote_dir=1
lola-11: mdt.soaked-MDT0003.enable_remote_dir_gid=0
[root@lola-16 ~]# ssh lola-8 'lctl conf_param soaked.mdt.enable_remote_dir_gid=-1'
[root@lola-16 ~]# pdsh -g mds 'lctl get_param mdt.*.enable_remote_dir ; lctl get_param mdt.*.enable_remote_dir_gid' | sort -k 2,2
lola-8: mdt.soaked-MDT0000.enable_remote_dir=1
lola-8: mdt.soaked-MDT0000.enable_remote_dir_gid=-1
lola-9: mdt.soaked-MDT0001.enable_remote_dir=1
lola-9: mdt.soaked-MDT0001.enable_remote_dir_gid=-1
lola-10: mdt.soaked-MDT0002.enable_remote_dir=1
lola-10: mdt.soaked-MDT0002.enable_remote_dir_gid=-1
lola-11: mdt.soaked-MDT0003.enable_remote_dir=1
lola-11: mdt.soaked-MDT0003.enable_remote_dir_gid=-1

Comment by Andreas Dilger [ 20/Sep/16 ]

Frank, can you please file an LUDOC ticket with details of what needs to be fixed in the manual so that this is documented correctly?

Comment by Frank Heckes (Inactive) [ 20/Sep/16 ]

Done (LUDOC-355).
I just wonder whether there is a use case in which customers would like to enable only a subset of the available remote MDTs. The fix enables all or nothing, so in that case the 'set_param -P' procedure would need to be executed.

Comment by Cliff White (Inactive) [ 13/Oct/16 ]

On the current tip of master, after 5 MDT failovers, the remote dir setting is persistent:

# lfs setdirstripe -c 4 -i 1 /mnt/soaked/bah
[root@lola-16 jobs]# lfs getdirstripe /mnt/soaked/bah
lmv_stripe_count: 4 lmv_stripe_offset: 1
mdtidx           FID[seq:oid:ver]
     1           [0x240002b10:0x21ad:0x0]
     2           [0x280001b74:0x21ad:0x0]
     3           [0x2c0003ac8:0x21ad:0x0]
     0           [0x200002b4f:0x21ad:0x0]

Comment by Peter Jones [ 13/Oct/16 ]

Closing as no longer appearing
