[LU-10626] utils/tests: lctl set_param -P does not appear to do anything Created: 07/Feb/18  Updated: 05/Aug/20  Resolved: 17/Nov/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.12.0

Type: Bug Priority: Minor
Reporter: CEA Assignee: James A Simmons
Resolution: Fixed Votes: 0
Labels: cea

Issue Links:
Related
is related to LU-8066 Move lustre procfs handling to sysfs ... Open
is related to LU-7004 fix "lctl set_param -P" to allow depr... Resolved
is related to LU-11677 lctl set_param -P silently fails Open
is related to LU-11361 set_conf_param_and_check should check... Closed
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

When running sanity-hsm on a single node setup I run into this error:

Set HSM on and start
Waiting 20 secs for update
Waiting 10 secs for update
Update not seen after 20s: wanted 'enabled' got 'stopped'
 sanity-hsm : @@@@@@ FAIL: hsm_control state is not 'enabled' on mds1 
  Trace dump:
  = /home/guest/lustre-release/lustre/tests/test-framework.sh:5336:error()
  = /home/guest/lustre-release/lustre/tests/sanity-hsm.sh:617:mdts_check_param()
  = /home/guest/lustre-release/lustre/tests/sanity-hsm.sh:723:cdt_check_state()
  = /home/guest/lustre-release/lustre/tests/sanity-hsm.sh:1005:main()

It seems "lctl set_param -P mdt.*.hsm_control enabled" silently fails. I do not know if it was recently modified... I tested other HSM tunables, and they did not work either. Without the -P option, everything works fine.



 Comments   
Comment by John Hammond [ 07/Feb/18 ]

Hi Quentin,

Is this on a local build? It may be the same as LU-10627. It may be that bind mounting /usr/sbin/lctl to $LUSTRE/utils/lctl no longer works with the libtool script.

Comment by Quentin Bouget [ 09/Feb/18 ]

Hi John,

Yes it is on a local build. Your workaround for LU-10627 solved this too. Thanks again.

Comment by Quentin Bouget [ 09/Mar/18 ]

Well, the workaround stopped working. It still works for LU-10627 though. Any ideas?

For now, I am stuck with:

  • mounting lustre;
  • setting mdt.*.hsm_control to enabled manually (without the "-P" flag);
  • launching sanity-hsm.
Comment by Jean-Baptiste Riaux (Inactive) [ 27/Mar/18 ]

Hi Quentin,

Was it a single-node setup?
I am currently working on this and I see the same behavior whatever parameter is specified (not only hsm_control), with or without wildcards.
Moreover, passing a non-existent value is not reported.

On a single node setup:

[root@mds1 tests]# ./llmount.sh
Stopping clients: mds1 /mnt/lustre (opts:-f)
Stopping clients: mds1 /mnt/lustre2 (opts:-f)
Loading modules from /home/riauxjb/master3/lustre-release/lustre
detected 4 online CPUs by sysfs
Force libcfs to create 2 CPU partitions
../libcfs/libcfs/libcfs options: 'cpu_npartitions=2'
gss/krb5 is not supported
quota/lquota options: 'hash_lqs_cur_bits=3'
Formatting mgs, mds, osts
Format mds1: /tmp/lustre-mdt1
Format mds2: /tmp/lustre-mdt2
Format ost1: /tmp/lustre-ost1
Format ost2: /tmp/lustre-ost2
Format ost3: /tmp/lustre-ost3
Format ost4: /tmp/lustre-ost4
Checking servers environments
Checking clients mds1 environments
Loading modules from /home/riauxjb/master3/lustre-release/lustre
detected 4 online CPUs by sysfs
Force libcfs to create 2 CPU partitions
gss/krb5 is not supported
Setup mgs, mdt, osts
Starting mds1:   /dev/mapper/mds1_flakey /mnt/lustre-mds1
Commit the device label on /tmp/lustre-mdt1
Started lustre-MDT0000
Starting mds2:   /dev/mapper/mds2_flakey /mnt/lustre-mds2
Commit the device label on /tmp/lustre-mdt2
Started lustre-MDT0001
Starting ost1:   /dev/mapper/ost1_flakey /mnt/lustre-ost1
Commit the device label on /tmp/lustre-ost1
Started lustre-OST0000
Starting ost2:   /dev/mapper/ost2_flakey /mnt/lustre-ost2
Commit the device label on /tmp/lustre-ost2
Started lustre-OST0001
Starting ost3:   /dev/mapper/ost3_flakey /mnt/lustre-ost3
Commit the device label on /tmp/lustre-ost3
Started lustre-OST0002
Starting ost4:   /dev/mapper/ost4_flakey /mnt/lustre-ost4
Commit the device label on /tmp/lustre-ost4
Started lustre-OST0003
Starting client: mds1:  -o user_xattr,flock mds1@tcp:/lustre /mnt/lustre
UUID                   1K-blocks        Used   Available Use% Mounted on
lustre-MDT0000_UUID       125368        2008      112124   2% /mnt/lustre[MDT:0]
lustre-MDT0001_UUID       125368        1832      112300   2% /mnt/lustre[MDT:1]
lustre-OST0000_UUID       325368       13648      284560   5% /mnt/lustre[OST:0]
lustre-OST0001_UUID       325368       13652      284556   5% /mnt/lustre[OST:1]
lustre-OST0002_UUID       325368       13652      284556   5% /mnt/lustre[OST:2]
lustre-OST0003_UUID       325368       13652      284556   5% /mnt/lustre[OST:3]

filesystem_summary:      1301472       54604     1138228   5% /mnt/lustre

Using TIMEOUT=20
seting jobstats to procname_uid
Setting lustre.sys.jobid_var from disable to procname_uid
Waiting 90 secs for update
disable quota as required

[root@mds1 tests]# /home/riauxjb/master3/lustre-release/lustre/utils/lctl set_param -P mdt.lustre-MDT0001.hsm_control=enabled
[root@mds1 tests]# lctl get_param  mdt.*.hsm_control
mdt.lustre-MDT0000.hsm_control=stopped
mdt.lustre-MDT0001.hsm_control=stopped

[root@mds1 tests]# lctl get_param  mdt.*.hsm_control
mdt.lustre-MDT0000.hsm_control=stopped
mdt.lustre-MDT0001.hsm_control=stopped

[root@mds1 tests]# lctl get_param  mdt.*.hsm_control
mdt.lustre-MDT0000.hsm_control=stopped
mdt.lustre-MDT0001.hsm_control=stopped

[root@mds1 tests]# /home/riauxjb/master3/lustre-release/lustre/utils/lctl set_param -P mdt.lustre-MDT0001.hsm_control=bs
[root@mds1 tests]# 

But it works fine when Lustre is deployed on separate nodes (here, one MDS and two OSSes, with one OSS acting as client):

[root@mds1 models]# shine format -f testfs
Format testfs on mds1,oss[1-2]: are you sure? (y)es/(N)o: y
Format successful.
= FILESYSTEM STATUS (testfs) =
TYPE # STATUS  NODES
---- - ------  -----
MGT  1 offline mds1
MDT  2 offline mds1
OST  2 offline oss[1-2]
[root@mds1 models]# shine start -f testfs
[16:25] In progress for 2 component(s) on oss[1-2] ...
Start successful.
= FILESYSTEM STATUS (testfs) =
TYPE # STATUS NODES
---- - ------ -----
MGT  1 online mds1
MDT  2 online mds1
OST  2 online oss[1-2]

[root@mds1 models]# lctl get_param mdt.*.hsm_control
mdt.testfs-MDT0000.hsm_control=disabled
mdt.testfs-MDT0001.hsm_control=disabled
[root@mds1 models]# lctl set_param -P mdt.*.hsm_control=enalbed

=> typo, no warning!

[root@mds1 models]# lctl set_param -P mdt.*.hsm_control=enabled

Wait a few seconds:

[root@mds1 models]# lctl get_param mdt.*.hsm_control
mdt.testfs-MDT0000.hsm_control=enabled
mdt.testfs-MDT0001.hsm_control=enabled

I am currently investigating lctl's "set_param -P" option with gdb.

Comment by James A Simmons [ 27/Mar/18 ]

Do you have the udev rule installed?

Comment by Jean-Baptiste Riaux (Inactive) [ 28/Mar/18 ]

I have this:

[root@mds1 tests]# cat /etc/udev/rules.d/99-lustre.rules
KERNEL=="obd", MODE="0666"
# set sysfs values on client
SUBSYSTEM=="lustre", ACTION=="change", ENV{PARAM}=="?*", RUN+="/usr/sbin/lctl set_param $env{PARAM}=$env{SETTING}"

 

Comment by Quentin Bouget [ 03/Apr/18 ]

Hi Jean-Baptiste,

> Was it a single node setup ?

Yes, it was.

Hi James,

> Do you have the udev rule installed?

I have the same udev rules installed as Jean-Baptiste.
 
I just noticed this, by the way (still on a single-node setup):

[root]# llmount.sh
[root]# lctl get_param mdt.*.hsm_control
mdt.lustre-MDT0000.hsm_control=stopped
[root]# lctl set_param -P mdt.*.hsm_control=enabled
[root]# lctl get_param mdt.*.hsm_control
mdt.lustre-MDT0000.hsm_control=stopped
[root]# lctl set_param mdt.*.hsm_control=enabled
error: set_param: setting /proc/fs/lustre/mdt/lustre-MDT0000/hsm_control=enabled: Operation already in progress
[root]# lctl get_param mdt.*.hsm_control
mdt.lustre-MDT0000.hsm_control=enabled

Hope this helps =)

Comment by James A Simmons [ 03/Apr/18 ]

This is a libtool wrapper issue. I'm thinking about a solution for this. The problem is that the $LUSTRE/utils path is being used; it should be $LUSTRE/utils/.libs. I'm trying to figure out a clean way to fix this.

Comment by Quentin Bouget [ 03/Apr/18 ]

Are you sure? I used the "--disable-shared" option with ./configure...

[root]# which lctl
/usr/sbin/lctl
[root]# head -c 4 /usr/sbin/lctl | tail -c 3
ELF
[root]# pwd
/home/root/lustre-release
[root]# find -name ".lib"
[root]#
Comment by James A Simmons [ 03/Apr/18 ]

Oh, that is interesting. So this only happens on a single-node setup? This is really strange. Let me see if I can reproduce it.

Comment by Quentin Bouget [ 16/Jul/18 ]

Is there any way to emulate "lctl set_param -P"?

Comment by James A Simmons [ 16/Jul/18 ]

I'm going to try this out on my Ubuntu18 setup once I get e2fsprogs for it.

Comment by James A Simmons [ 27/Jul/18 ]

Can you try patch https://review.whamcloud.com/#/c/32835 to see if it resolves your issue?

Comment by Quentin Bouget [ 30/Jul/18 ]

No, it does not.

I finally debugged this, though; the issue is whether or not you use an equals sign to set a parameter with "lctl set_param -P". For example:

Without '=':

[root]# lctl set_param -P mdt.*.hsm_control enabled
[root]# sleep 6
[root]# dmesg | tail
[ 3226.555763] LDISKFS-fs (dm-2): file extents enabled, maximum tree depth=5
[ 3226.560817] LDISKFS-fs (dm-2): mounted filesystem with ordered data mode. Opts: errors=remount-ro
[ 3226.578427] LDISKFS-fs (dm-2): file extents enabled, maximum tree depth=5
[ 3226.583304] LDISKFS-fs (dm-2): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
[ 3230.763833] Lustre: Mounted lustre-client
[ 3231.860718] Lustre: DEBUG MARKER: Using TIMEOUT=20
[ 3285.107374] LustreError: 16495:0:(mdt_coordinator.c:2161:mdt_hsm_cdt_control_seq_write()) lustre-MDT0000: Valid coordinator control commands are: enabled shutdown disabled purge help
[ 3285.114621] LustreError: 16505:0:(mdt_coordinator.c:2161:mdt_hsm_cdt_control_seq_write()) lustre-MDT0000: Valid coordinator control commands are: enabled shutdown disabled purge help
[ 3285.121183] LustreError: 16506:0:(mdt_coordinator.c:2161:mdt_hsm_cdt_control_seq_write()) lustre-MDT0000: Valid coordinator control commands are: enabled shutdown disabled purge help
[ 3285.127897] LustreError: 16507:0:(mdt_coordinator.c:2161:mdt_hsm_cdt_control_seq_write()) lustre-MDT0000: Valid coordinator control commands are: enabled shutdown disabled purge help
[root]# lctl get_param mdt.*.hsm_control
mdt.lustre-MDT0000.hsm_control=stopped

I checked: mdt_hsm_cdt_control_seq_write() receives a NULL pointer as its (const char __user *)buffer argument. This is what led me to check lustre/utils/lustre_cfg.c:jt_lcfg_mgsparam2() and try the same lctl command with the '=' sign:

[root]# lctl set_param -P mdt.*.hsm_control=enabled
[root]# sleep 6
[root]# dmesg | tail
[ 3226.583304] LDISKFS-fs (dm-2): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
[ 3230.763833] Lustre: Mounted lustre-client
[ 3231.860718] Lustre: DEBUG MARKER: Using TIMEOUT=20
[ 3285.107374] LustreError: 16495:0:(mdt_coordinator.c:2161:mdt_hsm_cdt_control_seq_write()) lustre-MDT0000: Valid coordinator control commands are: enabled shutdown disabled purge help
[ 3285.114621] LustreError: 16505:0:(mdt_coordinator.c:2161:mdt_hsm_cdt_control_seq_write()) lustre-MDT0000: Valid coordinator control commands are: enabled shutdown disabled purge help
[ 3285.121183] LustreError: 16506:0:(mdt_coordinator.c:2161:mdt_hsm_cdt_control_seq_write()) lustre-MDT0000: Valid coordinator control commands are: enabled shutdown disabled purge help
[ 3285.127897] LustreError: 16507:0:(mdt_coordinator.c:2161:mdt_hsm_cdt_control_seq_write()) lustre-MDT0000: Valid coordinator control commands are: enabled shutdown disabled purge help
[ 3357.759861] LustreError: 16707:0:(mdt_coordinator.c:1092:mdt_hsm_cdt_start()) lustre-MDT0000: Coordinator already started or stopping
[ 3357.766074] LustreError: 16708:0:(mdt_coordinator.c:1092:mdt_hsm_cdt_start()) lustre-MDT0000: Coordinator already started or stopping
[ 3357.772283] LustreError: 16709:0:(mdt_coordinator.c:1092:mdt_hsm_cdt_start()) lustre-MDT0000: Coordinator already started or stopping
[root]# lctl get_param mdt.*.hsm_control
mdt.lustre-MDT0000.hsm_control=enabled

I believe the repeated outputs in dmesg are symptomatic of a separate issue (which may be more of a feature than a bug; I do not really know).

Comment by James A Simmons [ 01/Aug/18 ]

Now I totally understand why it fails. In fact, I'm surprised it ever passed. You MUST use '=' with the command "lctl set_param -P $param=$value"; that is what gets cached in the config logs. This is a bug buried in the HSM testing scripts. I'm working on supporting lctl set_param -P, so I will include a fix in https://review.whamcloud.com/#/c/30087

 

Comment by Quentin Bouget [ 02/Aug/18 ]

> You MUST use '=' with command "lctl set_param -P $param=$value".

But why? "lctl set_param param value" handles things just fine.

> This is a bug buried in the hsm testing scripts.

I would not say that. To me this is a bug buried in lctl set_param -P's parser:

  • no error is reported
  • this behaviour is not consistent with that of lctl set_param (without the -P option).
Comment by Andreas Dilger [ 08/Aug/18 ]

The use of lctl set_param without an '=' is not very good. It was accepted for some time for compatibility reasons, but I'd rather print a warning if the argument is not param=value. Using it without '=' makes the arguments harder to parse, especially when there are multiple values on the same command line (which lctl can handle just fine), and it is more prone to errors.

Comment by Quentin Bouget [ 09/Aug/18 ]

OK, so we only need to add a warning to "lctl set_param" and an error report to "lctl set_param -P" when they are used with the "param value" syntax.

By the way, sanity-hsm actually uses the right syntax; I just did not run make install on my setup, so the upcall to /sbin/lctl (silently) fails. Is there something we can do to support lctl not being installed at the correct path?

Comment by Andreas Dilger [ 09/Aug/18 ]

It would be better if "lctl set_param -P" also worked with the improper syntax, and only printed a warning.
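A minimal shell sketch of the behaviour proposed here: accept the deprecated space-separated form, print a warning, and normalize it to "param=value". This is illustrative only, not lctl's actual code; the function name and the echoed command line are made up for the sketch.

```shell
# Hypothetical sketch, not lctl's real implementation: normalize the
# deprecated "param value" form to "param=value" and warn, instead of
# failing silently.
set_param_P() {
    param=$1
    value=$2
    case "$param" in
    *=*)
        ;;                              # already "param=value", nothing to do
    *)
        if [ -n "$value" ]; then
            echo "warning: 'param value' syntax is deprecated," \
                 "use '${param}=${value}' instead" >&2
            param="${param}=${value}"
        fi
        ;;
    esac
    # Real code would invoke lctl here; the sketch just prints the command.
    echo "lctl set_param -P $param"
}

set_param_P "mdt.*.hsm_control" enabled
```

With this, both syntaxes end up writing the same "param=value" record, which is what gets stored in the config logs, while the user still learns the preferred form.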

Comment by James A Simmons [ 23/Aug/18 ]

So this is something that has been broken for a very long time. With udev rules it is possible to work around it. What you want to do is create a udev rule on the fly of the form:

SUBSYSTEM=="lustre", ACTION=="change", ENV{PARAM}=="?*", RUN+="/my/absolute/path/lctl set_param '$env{PARAM}=$env{SETTING}'"

You must generate the above with the absolute path to the lctl you are using in your local source tree, so it has to be generated on the fly. udev rules can NOT use relative paths unless they are stored in /lib/udev, which is not what we want.

The generated rule must be stored in udev's temporary location, which is "/dev/.udev/rules.d/*".

Then you need to make udevd reload its rules with the command:

udevadm control --reload-rules

 

With this info I bet you can create a patch for the test suite to make it work without /etc/udev/rules.d/99-lustre.rules installed. I would recommend only doing the above if 99-lustre.rules is missing.
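The steps above can be sketched as follows. The LUSTRE and RULES_DIR defaults are placeholders (a scratch directory stands in for the udev temporary rules location mentioned above), and the udevadm call is skipped when the tool is absent, so this is a sketch rather than a drop-in test-suite patch.

```shell
# Sketch: generate a udev rule with an absolute path to the in-tree lctl,
# then ask udevd to reload its rules. Paths below are assumed defaults.
LUSTRE=${LUSTRE:-$HOME/lustre-release/lustre}
RULES_DIR=${RULES_DIR:-/tmp/lustre-udev-demo/rules.d}
LCTL=$LUSTRE/utils/lctl

mkdir -p "$RULES_DIR"
# $LCTL expands to the absolute path; \$env stays literal for udev.
cat > "$RULES_DIR/99-lustre-test.rules" <<EOF
SUBSYSTEM=="lustre", ACTION=="change", ENV{PARAM}=="?*", RUN+="$LCTL set_param '\$env{PARAM}=\$env{SETTING}'"
EOF

# Reload udev rules if udevadm is available (needs root on a real node).
command -v udevadm >/dev/null 2>&1 && udevadm control --reload-rules || true
```

On a real node the rule would only be generated when /etc/udev/rules.d/99-lustre.rules is missing, per the recommendation above.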

Comment by Quentin Bouget [ 24/Aug/18 ]

As I mentioned, sanity-hsm already uses the right syntax. The reason the test suite failed on my setup is that I usually do not install Lustre on my VM. Most of the time this is fine, but not for lctl set_param -P, which I guess translates into an upcall to lctl <something> on the MGS. If Lustre is not installed, the upcall points to a non-existent file (/usr/sbin/lctl).

The syntax mess only complicated the debugging.

I think there are two things that need fixing:

  1. The syntax inconsistency (but maybe the udev rule is already good enough);
  2. The silent failures of lctl set_param -P.
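Given point 2, a cheap preflight check could surface this setup problem early. A hedged sketch: the function name is made up, and /usr/sbin/lctl is the assumed upcall path from the comment above.

```shell
# Hypothetical preflight sketch: "lctl set_param -P" ends up upcalling the
# installed lctl (assumed /usr/sbin/lctl here), so flag a missing binary
# before the -P setting silently does nothing.
check_installed_lctl() {
    path=${1:-/usr/sbin/lctl}
    if [ -x "$path" ]; then
        echo "ok: $path present"
        return 0
    fi
    echo "warning: $path missing; 'lctl set_param -P' will silently fail" >&2
    return 1
}

check_installed_lctl /bin/sh
```

Something like this in the test framework would have turned the silent failure into an immediate, explicit error.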
Comment by Gerrit Updater [ 12/Sep/18 ]

James Simmons (uja.ornl@yahoo.com) uploaded a new patch: https://review.whamcloud.com/33143
Subject: LU-10626 test: create custom udev rule
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2fff4fb16c64e243cd2a1cc71766dfd1ba0c71bc

Comment by Gerrit Updater [ 17/Nov/18 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33143/
Subject: LU-10626 test: create custom udev rule
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 73ecd83a24e220b8097facf5bf3f5d93f523702c

Comment by James A Simmons [ 17/Nov/18 ]

I moved the lctl set_param -P brokenness to another ticket.

Comment by Quentin Bouget [ 20/Nov/18 ]

Ok, thanks.

Generated at Sat Feb 10 02:36:46 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.