[LU-10626] utils/tests: lctl set_param -P does not appear to do anything Created: 07/Feb/18 Updated: 05/Aug/20 Resolved: 17/Nov/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.12.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | CEA | Assignee: | James A Simmons |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | cea |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
When running sanity-hsm on a single node setup I run into this error:

 Set HSM on and start
 Waiting 20 secs for update
 Waiting 10 secs for update
 Update not seen after 20s: wanted 'enabled' got 'stopped'
  sanity-hsm : @@@@@@ FAIL: hsm_control state is not 'enabled' on mds1
 Trace dump:
 = /home/guest/lustre-release/lustre/tests/test-framework.sh:5336:error()
 = /home/guest/lustre-release/lustre/tests/sanity-hsm.sh:617:mdts_check_param()
 = /home/guest/lustre-release/lustre/tests/sanity-hsm.sh:723:cdt_check_state()
 = /home/guest/lustre-release/lustre/tests/sanity-hsm.sh:1005:main()

It seems lctl set_param -P mdt.*.hsm_control enabled silently fails. I do not know whether it was recently modified... I tested other hsm tunables; they did not work either. Without the -P option, everything works fine. |
| Comments |
| Comment by John Hammond [ 07/Feb/18 ] |
|
Hi Quentin, Is this on a local build? It may be the same as |
| Comment by Quentin Bouget [ 09/Feb/18 ] |
|
Hi John, Yes it is on a local build. Your workaround for |
| Comment by Quentin Bouget [ 09/Mar/18 ] |
|
Well, the workaround stopped working. It still works for

For now, I am stuck with:
|
| Comment by Jean-Baptiste Riaux (Inactive) [ 27/Mar/18 ] |
|
Hi Quentin, Was it a single node setup?

On a single node setup:

 [root@mds1 tests]# ./llmount.sh
 Stopping clients: mds1 /mnt/lustre (opts:-f)
 Stopping clients: mds1 /mnt/lustre2 (opts:-f)
 Loading modules from /home/riauxjb/master3/lustre-release/lustre
 detected 4 online CPUs by sysfs
 Force libcfs to create 2 CPU partitions
 ../libcfs/libcfs/libcfs options: 'cpu_npartitions=2'
 gss/krb5 is not supported
 quota/lquota options: 'hash_lqs_cur_bits=3'
 Formatting mgs, mds, osts
 Format mds1: /tmp/lustre-mdt1
 Format mds2: /tmp/lustre-mdt2
 Format ost1: /tmp/lustre-ost1
 Format ost2: /tmp/lustre-ost2
 Format ost3: /tmp/lustre-ost3
 Format ost4: /tmp/lustre-ost4
 Checking servers environments
 Checking clients mds1 environments
 Loading modules from /home/riauxjb/master3/lustre-release/lustre
 detected 4 online CPUs by sysfs
 Force libcfs to create 2 CPU partitions
 gss/krb5 is not supported
 Setup mgs, mdt, osts
 Starting mds1: /dev/mapper/mds1_flakey /mnt/lustre-mds1
 Commit the device label on /tmp/lustre-mdt1
 Started lustre-MDT0000
 Starting mds2: /dev/mapper/mds2_flakey /mnt/lustre-mds2
 Commit the device label on /tmp/lustre-mdt2
 Started lustre-MDT0001
 Starting ost1: /dev/mapper/ost1_flakey /mnt/lustre-ost1
 Commit the device label on /tmp/lustre-ost1
 Started lustre-OST0000
 Starting ost2: /dev/mapper/ost2_flakey /mnt/lustre-ost2
 Commit the device label on /tmp/lustre-ost2
 Started lustre-OST0001
 Starting ost3: /dev/mapper/ost3_flakey /mnt/lustre-ost3
 Commit the device label on /tmp/lustre-ost3
 Started lustre-OST0002
 Starting ost4: /dev/mapper/ost4_flakey /mnt/lustre-ost4
 Commit the device label on /tmp/lustre-ost4
 Started lustre-OST0003
 Starting client: mds1: -o user_xattr,flock mds1@tcp:/lustre /mnt/lustre
 UUID                 1K-blocks   Used  Available Use% Mounted on
 lustre-MDT0000_UUID     125368   2008     112124   2% /mnt/lustre[MDT:0]
 lustre-MDT0001_UUID     125368   1832     112300   2% /mnt/lustre[MDT:1]
 lustre-OST0000_UUID     325368  13648     284560   5% /mnt/lustre[OST:0]
 lustre-OST0001_UUID     325368  13652     284556   5% /mnt/lustre[OST:1]
 lustre-OST0002_UUID     325368  13652     284556   5% /mnt/lustre[OST:2]
 lustre-OST0003_UUID     325368  13652     284556   5% /mnt/lustre[OST:3]
 filesystem_summary:    1301472  54604    1138228   5% /mnt/lustre
 Using TIMEOUT=20
 seting jobstats to procname_uid
 Setting lustre.sys.jobid_var from disable to procname_uid
 Waiting 90 secs for update
 disable quota as required
 [root@mds1 tests]# /home/riauxjb/master3/lustre-release/lustre/utils/lctl set_param -P mdt.lustre-MDT0001.hsm_control=enabled
 [root@mds1 tests]# lctl get_param mdt.*.hsm_control
 mdt.lustre-MDT0000.hsm_control=stopped
 mdt.lustre-MDT0001.hsm_control=stopped
 [root@mds1 tests]# lctl get_param mdt.*.hsm_control
 mdt.lustre-MDT0000.hsm_control=stopped
 mdt.lustre-MDT0001.hsm_control=stopped
 [root@mds1 tests]# lctl get_param mdt.*.hsm_control
 mdt.lustre-MDT0000.hsm_control=stopped
 mdt.lustre-MDT0001.hsm_control=stopped
 [root@mds1 tests]# /home/riauxjb/master3/lustre-release/lustre/utils/lctl set_param -P mdt.lustre-MDT0001.hsm_control=bs
 [root@mds1 tests]#

But it's working fine when Lustre is deployed on separate nodes (here, one MDS, two OSS, one OSS as client):

 [root@mds1 models]# shine format -f testfs
 Format testfs on mds1,oss[1-2]: are you sure? (y)es/(N)o: y
 Format successful.
 = FILESYSTEM STATUS (testfs) =
 TYPE # STATUS  NODES
 ---- - ------  -----
 MGT  1 offline mds1
 MDT  2 offline mds1
 OST  2 offline oss[1-2]
 [root@mds1 models]# shine start -f testfs
 [16:25] In progress for 2 component(s) on oss[1-2] ...
 Start successful.
 = FILESYSTEM STATUS (testfs) =
 TYPE # STATUS NODES
 ---- - ------ -----
 MGT  1 online mds1
 MDT  2 online mds1
 OST  2 online oss[1-2]
 [root@mds1 models]# lctl get_param mdt.*.hsm_control
 mdt.testfs-MDT0000.hsm_control=disabled
 mdt.testfs-MDT0001.hsm_control=disabled
 [root@mds1 models]# lctl set_param -P mdt.*.hsm_control=enalbed    => typo, no warning!
 [root@mds1 models]# lctl set_param -P mdt.*.hsm_control=enabled

Wait a few seconds:

 [root@mds1 models]# lctl get_param mdt.*.hsm_control
 mdt.testfs-MDT0000.hsm_control=enabled
 mdt.testfs-MDT0001.hsm_control=enabled

I am currently investigating by running lctl with the set_param -P option under gdb. |
| Comment by James A Simmons [ 27/Mar/18 ] |
|
Do you have the udev rule installed? |
| Comment by Jean-Baptiste Riaux (Inactive) [ 28/Mar/18 ] |
|
I have this: [root@mds1 tests]# cat /etc/udev/rules.d/99-lustre.rules
KERNEL=="obd", MODE="0666"
# set sysfs values on client
SUBSYSTEM=="lustre", ACTION=="change", ENV{PARAM}=="?*", RUN+="/usr/sbin/lctl set_param $env{PARAM}=$env{SETTING}"
|
| Comment by Quentin Bouget [ 03/Apr/18 ] |
|
Hi Jean-Baptiste,
Yes, it was.

Hi James,
I have the same udev rules installed as Jean-Baptiste.

 [root]# llmount.sh
 [root]# lctl get_param mdt.*.hsm_control
 mdt.lustre-MDT0000.hsm_control=stopped
 [root]# lctl set_param -P mdt.*.hsm_control=enabled
 [root]# lctl get_param mdt.*.hsm_control
 mdt.lustre-MDT0000.hsm_control=stopped
 [root]# lctl set_param mdt.*.hsm_control=enabled
 error: set_param: setting /proc/fs/lustre/mdt/lustre-MDT0000/hsm_control=enabled: Operation already in progress
 [root]# lctl get_param mdt.*.hsm_control
 mdt.lustre-MDT0000.hsm_control=enabled

Hope this helps =) |
| Comment by James A Simmons [ 03/Apr/18 ] |
|
This is a libtool wrapper issue; I'm thinking about a solution for this. The issue is the $LUSTRE/util path being used. It should be $LUSTRE/util/.lib. I'm trying to figure out a clean method to fix this. |
| Comment by Quentin Bouget [ 03/Apr/18 ] |
|
Are you sure? I used the "--disable-shared" option with ./configure...

 [root]# which lctl
 /usr/sbin/lctl
 [root]# head -c 4 /usr/sbin/lctl | tail -c 3
 ELF
 [root]# pwd
 /home/root/lustre-release
 [root]# find -name ".lib"
 [root]# |
| Comment by James A Simmons [ 03/Apr/18 ] |
|
Oh, that is interesting. So this only happens on a single node setup? This is really strange. Let me see if I can duplicate it. |
| Comment by Quentin Bouget [ 16/Jul/18 ] |
|
Is there any way to emulate "lctl set_param -P"? |
| Comment by James A Simmons [ 16/Jul/18 ] |
|
I'm going to try this out on my Ubuntu18 setup once I get e2fsprogs for it. |
| Comment by James A Simmons [ 27/Jul/18 ] |
|
Can you try patch https://review.whamcloud.com/#/c/32835 to see if it resolves your issues? |
| Comment by Quentin Bouget [ 30/Jul/18 ] |
|
No, it does not. I finally debugged this, though; the issue is related to whether or not you use an equal sign to set a parameter with lctl set_param -P. For example, without '=':

 [root]# lctl set_param -P mdt.*.hsm_control enabled
 [root]# sleep 6
 [root]# dmesg | tail
 [ 3226.555763] LDISKFS-fs (dm-2): file extents enabled, maximum tree depth=5
 [ 3226.560817] LDISKFS-fs (dm-2): mounted filesystem with ordered data mode. Opts: errors=remount-ro
 [ 3226.578427] LDISKFS-fs (dm-2): file extents enabled, maximum tree depth=5
 [ 3226.583304] LDISKFS-fs (dm-2): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
 [ 3230.763833] Lustre: Mounted lustre-client
 [ 3231.860718] Lustre: DEBUG MARKER: Using TIMEOUT=20
 [ 3285.107374] LustreError: 16495:0:(mdt_coordinator.c:2161:mdt_hsm_cdt_control_seq_write()) lustre-MDT0000: Valid coordinator control commands are: enabled shutdown disabled purge help
 [ 3285.114621] LustreError: 16505:0:(mdt_coordinator.c:2161:mdt_hsm_cdt_control_seq_write()) lustre-MDT0000: Valid coordinator control commands are: enabled shutdown disabled purge help
 [ 3285.121183] LustreError: 16506:0:(mdt_coordinator.c:2161:mdt_hsm_cdt_control_seq_write()) lustre-MDT0000: Valid coordinator control commands are: enabled shutdown disabled purge help
 [ 3285.127897] LustreError: 16507:0:(mdt_coordinator.c:2161:mdt_hsm_cdt_control_seq_write()) lustre-MDT0000: Valid coordinator control commands are: enabled shutdown disabled purge help
 [root]# lctl get_param mdt.*.hsm_control
 mdt.lustre-MDT0000.hsm_control=stopped

I checked: mdt_hsm_cdt_control_seq_write() receives a NULL pointer as its (const char __user *)buffer argument. This is what led me to check lustre/utils/lustre_cfg.c:jt_lcfg_mgsparam2() and try the same lctl command with the '=' sign:

 [root]# lctl set_param -P mdt.*.hsm_control=enabled
 [root]# sleep 6
 [root]# dmesg | tail
 [ 3226.583304] LDISKFS-fs (dm-2): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
 [ 3230.763833] Lustre: Mounted lustre-client
 [ 3231.860718] Lustre: DEBUG MARKER: Using TIMEOUT=20
 [ 3285.107374] LustreError: 16495:0:(mdt_coordinator.c:2161:mdt_hsm_cdt_control_seq_write()) lustre-MDT0000: Valid coordinator control commands are: enabled shutdown disabled purge help
 [ 3285.114621] LustreError: 16505:0:(mdt_coordinator.c:2161:mdt_hsm_cdt_control_seq_write()) lustre-MDT0000: Valid coordinator control commands are: enabled shutdown disabled purge help
 [ 3285.121183] LustreError: 16506:0:(mdt_coordinator.c:2161:mdt_hsm_cdt_control_seq_write()) lustre-MDT0000: Valid coordinator control commands are: enabled shutdown disabled purge help
 [ 3285.127897] LustreError: 16507:0:(mdt_coordinator.c:2161:mdt_hsm_cdt_control_seq_write()) lustre-MDT0000: Valid coordinator control commands are: enabled shutdown disabled purge help
 [ 3357.759861] LustreError: 16707:0:(mdt_coordinator.c:1092:mdt_hsm_cdt_start()) lustre-MDT0000: Coordinator already started or stopping
 [ 3357.766074] LustreError: 16708:0:(mdt_coordinator.c:1092:mdt_hsm_cdt_start()) lustre-MDT0000: Coordinator already started or stopping
 [ 3357.772283] LustreError: 16709:0:(mdt_coordinator.c:1092:mdt_hsm_cdt_start()) lustre-MDT0000: Coordinator already started or stopping
 [root]# lctl get_param mdt.*.hsm_control
 mdt.lustre-MDT0000.hsm_control=enabled

I believe the repeated outputs in dmesg are symptomatic of a separate issue (which may be more of a feature than a bug, I do not really know). |
| Comment by James A Simmons [ 01/Aug/18 ] |
|
Now I totally understand why it fails. In fact, I'm surprised it ever passed. You MUST use '=' with the command "lctl set_param -P $param=$value"; that is what gets cached into the config logs. This is a bug buried in the hsm testing scripts. I'm working on supporting lctl set_param -P, so I will include a fix in https://review.whamcloud.com/#/c/30087
|
| Comment by Quentin Bouget [ 02/Aug/18 ] |
|
> You MUST use '=' with command "lctl set_param -P $param=$value".

But why? "lctl set_param param value" handles things just fine.

> This is a bug buried in the hsm testing scripts.

I would not say that. To me, this is a bug buried in lctl set_param -P's parser:
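As an aside, the tolerant behaviour argued for here could be emulated outside lctl with a small wrapper that rewrites the space-separated form into the '=' form before making the call. This is only a sketch; normalize_set_param_args is a made-up helper name, not part of lctl:

```shell
#!/bin/sh
# Sketch only: rewrite "param value" pairs into the "param=value" form
# that "lctl set_param -P" records in the config logs.  Any argument
# without an '=' is joined with the argument that follows it.
# (normalize_set_param_args is a hypothetical helper, not part of lctl.)
normalize_set_param_args() {
	out=""
	prev=""
	for arg in "$@"; do
		case "$arg" in
		*=*)
			# flush a dangling bare parameter, keep k=v as-is
			[ -n "$prev" ] && { out="$out $prev"; prev=""; }
			out="$out $arg"
			;;
		*)
			if [ -n "$prev" ]; then
				out="$out $prev=$arg"	# join "param value"
				prev=""
			else
				prev="$arg"
			fi
			;;
		esac
	done
	[ -n "$prev" ] && out="$out $prev"
	printf '%s\n' "${out# }"
}

normalize_set_param_args "mdt.*.hsm_control" enabled
# prints: mdt.*.hsm_control=enabled
```

A wrapper could then pass the normalized argument list on to the real lctl, so both syntaxes end up cached in the config logs the same way.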
|
| Comment by Andreas Dilger [ 08/Aug/18 ] |
|
The use of lctl set_param without an '=' is not very good. It was accepted for some time for compatibility reasons, but I'd rather print a warning if the argument is not param=value. Using it without '=' makes it harder to parse, especially if there are multiple values on the same command-line (which lctl can handle just fine), and it is more prone to errors. |
| Comment by Quentin Bouget [ 09/Aug/18 ] |
|
Ok, so we only need to add a warning to "lctl set_param" and an error report to "lctl set_param -P" when they are used with the "param value" syntax. By the way, sanity-hsm actually uses the right syntax, I just did not run make install on my setup so the upcall to /sbin/lctl (silently) fails. Is there something we can do to support lctl not being installed at the correct path? |
| Comment by Andreas Dilger [ 09/Aug/18 ] |
|
It would be better if "lctl set_param -P" also worked with the improper syntax, and only printed a warning. |
| Comment by James A Simmons [ 23/Aug/18 ] |
|
So this is something that has been broken for a very long time. With udev rules it is possible to work around it. What you want to do is create a udev rule on the fly of the format:

 SUBSYSTEM=="lustre", ACTION=="change", ENV{PARAM}=="?*", RUN+="/my/absolute/path/lctl set_param '$env{PARAM}=$env{SETTING}'"

You must generate the above with the absolute path to the lctl you are using in your local source tree, so it has to be generated on the fly: udev rules can NOT use relative paths, except if they are stored in /lib/udev, which is not what we want. The generated rule must be stored in the udev temporary location, which is "/dev/.udev/rules.d/*". Then you need to restart udevd with the command:

 udevadm control --reload-rules
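A rough sketch of that recipe, for illustration only (emit_lustre_udev_rule and the INSTALL_LUSTRE_RULE switch are invented names, not part of the test framework; the temporary rules directory and reload command are the ones described above):

```shell
#!/bin/sh
# Sketch only: build the udev rule text with an absolute path to the
# in-tree lctl, since udev rules cannot use relative paths.
# (emit_lustre_udev_rule and INSTALL_LUSTRE_RULE are invented names.)
emit_lustre_udev_rule() {
	# $1: absolute path to the lctl binary to run on "change" events
	printf '%s\n' \
	    "SUBSYSTEM==\"lustre\", ACTION==\"change\", ENV{PARAM}==\"?*\", RUN+=\"$1 set_param '\$env{PARAM}=\$env{SETTING}'\""
}

# Install the generated rule only when explicitly asked for, e.g. when
# /etc/udev/rules.d/99-lustre.rules is missing on a development node.
if [ "${INSTALL_LUSTRE_RULE:-0}" = 1 ]; then
	mkdir -p /dev/.udev/rules.d
	emit_lustre_udev_rule "$PWD/lustre/utils/lctl" \
	    > /dev/.udev/rules.d/99-lustre.rules
	udevadm control --reload-rules	# pick up the temporary rule
fi
```

The guard keeps the install step opt-in, matching the recommendation to only do this when 99-lustre.rules is missing.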
With this info I bet you can create a patch for the test suite to make it work without /etc/udev/rules.d/99-lustre.rules installed. I would recommend only doing the above if 99-lustre.rules is missing. |
| Comment by Quentin Bouget [ 24/Aug/18 ] |
|
As I mentioned, sanity-hsm already uses the right syntax. The reason the test suite failed on my setup is that I usually do not install Lustre on my VM. Most of the time this is fine, but not for lctl set_param -P, which I guess translates to an upcall to lctl <something> on the MGS. If Lustre is not installed, the upcall points to a non-existent file (/usr/sbin/lctl). The syntax mess only complicated the debugging. I think there are two things that need fixing:
|
| Comment by Gerrit Updater [ 12/Sep/18 ] |
|
James Simmons (uja.ornl@yahoo.com) uploaded a new patch: https://review.whamcloud.com/33143 |
| Comment by Gerrit Updater [ 17/Nov/18 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33143/ |
| Comment by James A Simmons [ 17/Nov/18 ] |
|
I moved the lctl set_param -P brokenness to another ticket. |
| Comment by Quentin Bouget [ 20/Nov/18 ] |
|
Ok, thanks. |