[LU-8414] DNE: Setting of remote_dir_gid parameter not persistent Created: 19/Jul/16 Updated: 13/Oct/16 Resolved: 13/Oct/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.9.0 |
| Fix Version/s: | Lustre 2.9.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Frank Heckes (Inactive) | Assignee: | Lai Siyao |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | soak | ||
| Environment: |
lola |
||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
Error occurred during soak testing of build '20160713' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160713). MDSes have been configured using ldiskfs, OSTs using zfs. The test environment consists of 4 MDSes with 1 MDT each and 6 OSSes with 4 OSTs each. MDS and OSS nodes are configured in an active-active HA configuration.
DNE has been enabled using the command sequence (see Lustre manual page 96):
pdsh -g mds 'lctl set_param mdt.*.enable_remote_dir=1'
pdsh -g mds 'lctl set_param mdt.*.enable_remote_dir_gid=-1'
especially
pdsh -w lola-8 'lctl set_param -P mdt.*.enable_remote_dir=1'
pdsh -w lola-8 'lctl set_param -P mdt.*.enable_remote_dir_git=-1'
(The latter two commands only work on the MGS node.)
[soaktest@lola-16 ~]$ lfs setdirstripe -c 4 -i 1 /mnt/soaked/soaktest/hsm_rbh/
error on LL_IOC_LMV_SETSTRIPE '/mnt/soaked/soaktest/hsm_rbh/' (3): Operation not permitted
error: setdirstripe: create stripe dir '/mnt/soaked/soaktest/hsm_rbh/' failed
---------------
--> Remote dir setting:
---------------- lola-8 ----------------
Remote dir_gid setting soaked-MDT0000: -1
---------------- lola-9 ----------------
Remote dir_gid setting soaked-MDT0001: 0
---------------- lola-10 ----------------
Remote dir_gid setting soaked-MDT0002: -1
---------------- lola-11 ----------------
Remote dir_gid setting soaked-MDT0003: 0
The setdirstripe command above failed. This will break all test (slurm) jobs that rely on this functionality.
After setting the parameters on the nodes again the command
[soaktest@lola-16 ~]$ lfs setdirstripe -c 4 -i 1 /mnt/soaked/soaktest/hsm_rbh/
[soaktest@lola-16 ~]$ lfs setdirstripe -c 4 -i 1 -D /mnt/soaked/soaktest/hsm_rbh/
[soaktest@lola-16 ~]$ lfs getdirstripe /mnt/soaked/soaktest/hsm_rbh/
/mnt/soaked/soaktest/hsm_rbh/
[soaktest@lola-16 ~]$ lfs getdirstripe /mnt/soaked/soaktest/hsm_rbh/
/mnt/soaked/soaktest/hsm_rbh/
lmv_stripe_count: 4 lmv_stripe_offset: 1
mdtidx FID[seq:oid:ver]
1 [0x240007160:0x3:0x0]
2 [0x28000d714:0x3:0x0]
3 [0x2c000a810:0x1:0x0]
0 [0x20000fe01:0x3:0x0]
ended successfully. |
| Comments |
| Comment by Frank Heckes (Inactive) [ 19/Jul/16 ] |
|
To state this explicitly: the setdirstripe commands work without problems when the 'enable_remote_dir_gid' parameter is set to -1 on all MDS nodes. |
| Comment by Cliff White (Inactive) [ 17/Aug/16 ] |
|
We are still seeing this issue with current tip of master |
| Comment by Lai Siyao [ 05/Sep/16 ] |
|
I tested remounting 2 MDTs on 2 MDS servers and saw that both parameters are persistent. Also, in your test perhaps neither parameter was persistent, but because the remote_dir_gid default value is 1, it looked as if it were persistent. Could you get debug logs on the non-MGS MDSes during failover or restart? |
| Comment by Andreas Dilger [ 08/Sep/16 ] |
|
Frank, instead of using lctl set_param -P, which is known to have some problems ( please try:
lctl conf_param myth.mdt.enable_remote_dir=1
lctl conf_param myth.mdt.enable_remote_dir_gid=-1
since this is the "traditional" way of setting permanent parameters and uses a different mechanism. If this fixes your problem then we should also fix the manual. |
| Comment by Frank Heckes (Inactive) [ 14/Sep/16 ] |
|
I'll check this tomorrow. |
| Comment by Frank Heckes (Inactive) [ 19/Sep/16 ] |
|
I checked with tip of master (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160916). Usage of conf_param had (has) no effect:
[root@lola-16 ~]# pdsh -g mds 'lctl conf_param mdt.*.enable_remote_dir=1 ; lctl conf_param mdt.*.enable_remote_dir_gid=-1'
lola-11: No device found for name MGS: Invalid argument
lola-11: This command must be run on the MGS.
lola-11: error: conf_param: No such device
lola-11: No device found for name MGS: Invalid argument
lola-11: This command must be run on the MGS.
lola-11: error: conf_param: No such device
lola-8: error: conf_param: Invalid argument
lola-8: error: conf_param: Invalid argument
lola-10: No device found for name MGS: Invalid argument
lola-10: This command must be run on the MGS.
lola-10: error: conf_param: No such device
lola-10: No device found for name MGS: Invalid argument
lola-10: This command must be run on the MGS.
lola-10: error: conf_param: No such device
lola-9: No device found for name MGS: Invalid argument
lola-9: This command must be run on the MGS.
lola-9: error: conf_param: No such device
lola-9: No device found for name MGS: Invalid argument
lola-9: This command must be run on the MGS.
lola-9: error: conf_param: No such device
[root@lola-16 ~]# pdsh -g mds 'lctl get_param mdt.*.enable_remote_dir ; lctl get_param mdt.*.enable_remote_dir_gid'
lola-11: mdt.soaked-MDT0003.enable_remote_dir=0
lola-11: mdt.soaked-MDT0003.enable_remote_dir_gid=0
lola-8: mdt.soaked-MDT0000.enable_remote_dir=0
lola-8: mdt.soaked-MDT0000.enable_remote_dir_gid=0
lola-9: mdt.soaked-MDT0001.enable_remote_dir=0
lola-9: mdt.soaked-MDT0001.enable_remote_dir_gid=0
lola-10: mdt.soaked-MDT0002.enable_remote_dir=0
lola-10: mdt.soaked-MDT0002.enable_remote_dir_gid=0
Checking with set_param -P:
[root@lola-16 ~]# pdsh -g mds 'lctl set_param -P mdt.*.enable_remote_dir=1 ; lctl set_param -P mdt.*.enable_remote_dir_gid=-1'
lola-11: No device found for name MGS: Invalid argument
lola-11: This command must be run on the MGS.
lola-11: error: executing set_param: No such device
lola-11: No device found for name MGS: Invalid argument
lola-11: This command must be run on the MGS.
lola-11: error: executing set_param: No such device
lola-10: No device found for name MGS: Invalid argument
lola-10: This command must be run on the MGS.
lola-10: error: executing set_param: No such device
lola-10: No device found for name MGS: Invalid argument
lola-10: This command must be run on the MGS.
lola-10: error: executing set_param: No such device
lola-9: No device found for name MGS: Invalid argument
lola-9: This command must be run on the MGS.
lola-9: error: executing set_param: No such device
lola-9: No device found for name MGS: Invalid argument
lola-9: This command must be run on the MGS.
lola-9: error: executing set_param: No such device
[root@lola-16 ~]# pdsh -g mds 'lctl get_param mdt.*.enable_remote_dir ; lctl get_param mdt.*.enable_remote_dir_gid'
lola-11: mdt.soaked-MDT0003.enable_remote_dir=1
lola-11: mdt.soaked-MDT0003.enable_remote_dir_gid=-1
lola-8: mdt.soaked-MDT0000.enable_remote_dir=1
lola-8: mdt.soaked-MDT0000.enable_remote_dir_gid=-1
lola-9: mdt.soaked-MDT0001.enable_remote_dir=1
lola-9: mdt.soaked-MDT0001.enable_remote_dir_gid=-1
lola-10: mdt.soaked-MDT0002.enable_remote_dir=1
lola-10: mdt.soaked-MDT0002.enable_remote_dir_gid=-1
After a crash and remount of mdt-1 on lola-9, the parameters turn out to be persistently configured:
[root@lola-16 ~]# date
Sun Sep 18 02:54:52 PDT 2016
[root@lola-16 ~]# pdsh -g mds 'lctl get_param mdt.*.enable_remote_dir ; lctl get_param mdt.*.enable_remote_dir_gid'
lola-11: mdt.soaked-MDT0003.enable_remote_dir=1
lola-11: mdt.soaked-MDT0003.enable_remote_dir_gid=-1
lola-8: mdt.soaked-MDT0000.enable_remote_dir=1
lola-8: mdt.soaked-MDT0000.enable_remote_dir_gid=-1
lola-9: mdt.soaked-MDT0001.enable_remote_dir=1
lola-9: mdt.soaked-MDT0001.enable_remote_dir_gid=-1
lola-10: mdt.soaked-MDT0002.enable_remote_dir=1
lola-10: mdt.soaked-MDT0002.enable_remote_dir_gid=-1
Checked after crash of lola-11:
Sun Sep 18 07:34:04 PDT 2016
[root@lola-16 ~]# pdsh -g mds 'lctl get_param mdt.*.enable_remote_dir ; lctl get_param mdt.*.enable_remote_dir_gid'
lola-11: mdt.soaked-MDT0003.enable_remote_dir=1
lola-11: mdt.soaked-MDT0003.enable_remote_dir_gid=-1
lola-9: mdt.soaked-MDT0001.enable_remote_dir=1
lola-9: mdt.soaked-MDT0001.enable_remote_dir_gid=-1
lola-8: mdt.soaked-MDT0000.enable_remote_dir=1
lola-8: mdt.soaked-MDT0000.enable_remote_dir_gid=-1
lola-10: mdt.soaked-MDT0002.enable_remote_dir=1
lola-10: mdt.soaked-MDT0002.enable_remote_dir_gid=-1
So set_param -P seems to work, but the irrelevant error messages are confusing. (This is a similar problem as described in |
| Comment by Andreas Dilger [ 19/Sep/16 ] |
|
Frank, as shown in my previous comment, the formats of conf_param and set_param are different. For conf_param you need to specify the filesystem name first instead of the device name:
lctl conf_param soaked.mdt.enable_remote_dir=1
lctl conf_param soaked.mdt.enable_remote_dir_gid=-1 |
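The difference between the two formats in this comment can be sketched as a small translation helper. This is only an illustration: the function name to_conf_param is hypothetical, it handles only the wildcard device form (mdt.*.name=value) used in this ticket, and it is not part of lctl or any Lustre tool.

```python
def to_conf_param(fsname: str, set_param_spec: str) -> str:
    """Rewrite a wildcard set_param spec into the conf_param form.

    'mdt.*.enable_remote_dir=1' with fsname 'soaked' becomes
    'soaked.mdt.enable_remote_dir=1' (filesystem name first, no device name).
    Hypothetical helper covering only the 'type.*.param=value' shape.
    """
    name, sep, value = set_param_spec.partition("=")
    if not sep:
        raise ValueError("expected 'name=value', got %r" % set_param_spec)
    parts = name.split(".")
    # parts[0] is the obd type (e.g. 'mdt'), parts[1] must be the '*' wildcard
    if len(parts) < 3 or parts[1] != "*":
        raise ValueError("expected 'type.*.param', got %r" % name)
    obd_type = parts[0]
    param = ".".join(parts[2:])
    return "%s.%s.%s=%s" % (fsname, obd_type, param, value)


if __name__ == "__main__":
    for spec in ("mdt.*.enable_remote_dir=1", "mdt.*.enable_remote_dir_gid=-1"):
        print("lctl conf_param " + to_conf_param("soaked", spec))
```

The printed commands would still have to be run on the MGS node, as the error messages earlier in this ticket indicate.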
| Comment by Frank Heckes (Inactive) [ 20/Sep/16 ] |
|
Sorry, I didn't read and think carefully. Indeed conf_param fixes the problem:
[root@lola-16 ~]# pdsh -g mds 'lctl get_param mdt.*.enable_remote_dir ; lctl get_param mdt.*.enable_remote_dir_gid' | sort -k 2,2
lola-8: mdt.soaked-MDT0000.enable_remote_dir=0
lola-8: mdt.soaked-MDT0000.enable_remote_dir_gid=0
lola-9: mdt.soaked-MDT0001.enable_remote_dir=0
lola-9: mdt.soaked-MDT0001.enable_remote_dir_gid=0
lola-10: mdt.soaked-MDT0002.enable_remote_dir=0
lola-10: mdt.soaked-MDT0002.enable_remote_dir_gid=0
lola-11: mdt.soaked-MDT0003.enable_remote_dir=0
lola-11: mdt.soaked-MDT0003.enable_remote_dir_gid=0
[root@lola-16 ~]# ssh lola-8 'lctl conf_param soaked.mdt.enable_remote_dir=1'
[root@lola-16 ~]# pdsh -g mds 'lctl get_param mdt.*.enable_remote_dir ; lctl get_param mdt.*.enable_remote_dir_gid' | sort -k 2,2
lola-8: mdt.soaked-MDT0000.enable_remote_dir=0
lola-8: mdt.soaked-MDT0000.enable_remote_dir_gid=0
lola-9: mdt.soaked-MDT0001.enable_remote_dir=1
lola-9: mdt.soaked-MDT0001.enable_remote_dir_gid=0
lola-10: mdt.soaked-MDT0002.enable_remote_dir=0
lola-10: mdt.soaked-MDT0002.enable_remote_dir_gid=0
lola-11: mdt.soaked-MDT0003.enable_remote_dir=1
lola-11: mdt.soaked-MDT0003.enable_remote_dir_gid=0
[root@lola-16 ~]# ssh lola-8 'lctl conf_param soaked.mdt.enable_remote_dir_gid=-1'
[root@lola-16 ~]# pdsh -g mds 'lctl get_param mdt.*.enable_remote_dir ; lctl get_param mdt.*.enable_remote_dir_gid' | sort -k 2,2
lola-8: mdt.soaked-MDT0000.enable_remote_dir=1
lola-8: mdt.soaked-MDT0000.enable_remote_dir_gid=-1
lola-9: mdt.soaked-MDT0001.enable_remote_dir=1
lola-9: mdt.soaked-MDT0001.enable_remote_dir_gid=-1
lola-10: mdt.soaked-MDT0002.enable_remote_dir=1
lola-10: mdt.soaked-MDT0002.enable_remote_dir_gid=-1
lola-11: mdt.soaked-MDT0003.enable_remote_dir=1
lola-11: mdt.soaked-MDT0003.enable_remote_dir_gid=-1 |
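Since the intermediate listing in this comment shows some MDTs picking up the new value before others, a check like the following might help a soak harness spot MDTs that have not yet reached the expected settings. It is a minimal sketch that parses saved pdsh/lctl get_param output; the names parse_get_param and lagging_mdts are hypothetical, and the sample text is copied from the intermediate state above.

```python
import re
from collections import defaultdict

# Matches lines such as 'lola-9: mdt.soaked-MDT0001.enable_remote_dir_gid=-1'
LINE_RE = re.compile(
    r"^(?P<host>\S+): mdt\.(?P<mdt>[^.]+)\.(?P<param>\w+)=(?P<value>-?\d+)$"
)


def parse_get_param(output):
    """Return {mdt_name: {param: int_value}} from pdsh 'lctl get_param' output."""
    settings = defaultdict(dict)
    for line in output.splitlines():
        match = LINE_RE.match(line.strip())
        if match:  # silently skip error lines and anything else that is not a setting
            settings[match["mdt"]][match["param"]] = int(match["value"])
    return dict(settings)


def lagging_mdts(settings, expected):
    """List MDTs where any expected parameter is missing or has another value."""
    return sorted(
        mdt
        for mdt, params in settings.items()
        if any(params.get(name) != value for name, value in expected.items())
    )


# Intermediate state from this comment, after the first conf_param call
SAMPLE = """\
lola-8: mdt.soaked-MDT0000.enable_remote_dir=0
lola-8: mdt.soaked-MDT0000.enable_remote_dir_gid=0
lola-9: mdt.soaked-MDT0001.enable_remote_dir=1
lola-9: mdt.soaked-MDT0001.enable_remote_dir_gid=0
lola-10: mdt.soaked-MDT0002.enable_remote_dir=0
lola-10: mdt.soaked-MDT0002.enable_remote_dir_gid=0
lola-11: mdt.soaked-MDT0003.enable_remote_dir=1
lola-11: mdt.soaked-MDT0003.enable_remote_dir_gid=0
"""

if __name__ == "__main__":
    parsed = parse_get_param(SAMPLE)
    print(lagging_mdts(parsed, {"enable_remote_dir": 1}))
```

For the sample, the two MDTs still reporting enable_remote_dir=0 (soaked-MDT0000 and soaked-MDT0002) are listed; passing both parameters in the expected dictionary would flag all four.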
| Comment by Andreas Dilger [ 20/Sep/16 ] |
|
Frank, can you please file an LUDOC ticket with details of what needs to be fixed in the manual so that this is documented correctly? |
| Comment by Frank Heckes (Inactive) [ 20/Sep/16 ] |
|
done ( |
| Comment by Cliff White (Inactive) [ 13/Oct/16 ] |
|
On current tip of master, after 5 MDT failovers, remote dir is persistent:
# lfs setdirstripe -c 4 -i 1 /mnt/soaked/bah
[root@lola-16 jobs]# lfs getdirstripe /mnt/soaked/bah
lmv_stripe_count: 4 lmv_stripe_offset: 1
mdtidx FID[seq:oid:ver]
1 [0x240002b10:0x21ad:0x0]
2 [0x280001b74:0x21ad:0x0]
3 [0x2c0003ac8:0x21ad:0x0]
0 [0x200002b4f:0x21ad:0x0]
|
| Comment by Peter Jones [ 13/Oct/16 ] |
|
Closing as no longer appearing |