[LU-12932] parallel-scale test rr_alloc fails with ‘failed while setting qos_threshold_rr & creat_count’ Created: 04/Nov/19  Updated: 07/Jan/21  Resolved: 08/Nov/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0
Fix Version/s: Lustre 2.13.0

Type: Bug
Priority: Critical
Reporter: James Nunez (Inactive)
Assignee: Lai Siyao
Resolution: Fixed
Votes: 0
Labels: None

Issue Links:
Related
is related to LU-12624 DNE3: striped directory allocate stri... Resolved
is related to LU-14303 parallel-scale test rr_alloc fails wi... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

parallel-scale test_rr_alloc fails getting/setting ‘lov.lustre-MDT0000*.qos_threshold_rr’. These failures started on approximately 29 OCT 2019 and may be related to the landing of the LU-12624 patches (‘striped directory allocate stripes by QoS’).

Looking at the suite_log for https://testing.whamcloud.com/test_sets/c0dad8f4-fd58-11e9-8e77-52540065bddc, we see errors getting and setting qos_threshold_rr on the MDS:

CMD: trevis-20vm12 params=\$(/usr/sbin/lctl get_param lov.lustre-MDT0000*.qos_threshold_rr);
			 [[ -z \"lustre-MDT0000\" ]] && param= ||
			 param=\$(grep lustre-MDT0000 <<< \"\$params\");
			 [[ -z \$param ]] && param=\"\$params\";
			 while read s; do echo mds1 \$s;
			 done <<< \"\$param\"
trevis-20vm12: error: get_param: param_path 'lov/lustre-MDT0000*/qos_threshold_rr': No such file or directory
CMD: trevis-20vm12 params=\$(/usr/sbin/lctl get_param osp.lustre-OST*-osc-MDT0000.create_count);
			 [[ -z \"lustre-MDT0000\" ]] && param= ||
			 param=\$(grep lustre-MDT0000 <<< \"\$params\");
			 [[ -z \$param ]] && param=\"\$params\";
			 while read s; do echo mds1 \$s;
			 done <<< \"\$param\"
CMD: trevis-20vm12 /usr/sbin/lctl set_param -n 		lov.lustre-MDT0000*.qos_threshold_rr 100 		osp.lustre-OST*-osc-MDT0000.create_count 3488
trevis-20vm12: error: set_param: param_path 'lov/lustre-MDT0000*/qos_threshold_rr': No such file or directory
 parallel-scale test_rr_alloc: @@@@@@ FAIL: failed while setting qos_threshold_rr & creat_count 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:6108:error()
  = /usr/lib64/lustre/tests/functions.sh:1004:run_rr_alloc()
  = /usr/lib64/lustre/tests/parallel-scale.sh:165:test_rr_alloc()

Looking at the sanity suite_log at https://testing.whamcloud.com/test_sets/9ee6cf78-fd58-11e9-8e77-52540065bddc, we see failures getting the qos_threshold_rr parameter:

== sanity test 116a: stripe QOS: free space balance ================================================== 00:49:17 (1572569357)
Free space priority CMD: trevis-20vm12 lctl get_param -n lo[vd].*-mdtlov.qos_prio_free
91%
CMD: trevis-20vm12 /usr/sbin/lctl set_param -n os[cd]*.*MD*.force_sync 1
CMD: trevis-20vm12 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: trevis-20vm12 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: trevis-20vm12 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: trevis-20vm12 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: trevis-20vm12 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: trevis-20vm12 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: trevis-20vm12 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
sleep 5 for ZFS zfs
sleep 5 for ZFS zfs
Waiting for local destroys to complete
OST kbytes available: 1878016 1901568 1912832 1904640 1900544 1911808 1909760
Min free space: OST 0: 1878016
Max free space: OST 2: 1912832
CMD: trevis-20vm12 lctl get_param -n *.*MDT0000-mdtlov.qos_threshold_rr
trevis-20vm12: error: get_param: param_path '*/*MDT0000-mdtlov/qos_threshold_rr': No such file or directory
Check for uneven OSTs: diff=34816KB (1%) must be > % ...ok
Don't need to fill OST0
diff=34816=1% must be > % for QOS mode.../usr/lib64/lustre/tests/sanity.sh: line 10107: [: 1: unary operator expected
failed - QOS mode won't be used
sleep 5 for ZFS zfs
Waiting for local destroys to complete
cleanup time 6

 SKIP: sanity test_116a QOS imbalance criteria not met
SKIP 116a (29s)

== sanity test 116b: QoS shouldn't LBUG if not enough OSTs found on the 2nd pass ===================== 00:49:46 (1572569386)
CMD: trevis-20vm12 lctl get_param -n lo[vd].lustre-MDT0000-mdtlov.qos_threshold_rr
trevis-20vm12: error: get_param: param_path 'lo[vd]/lustre-MDT0000-mdtlov/qos_threshold_rr': No such file or directory

 SKIP: sanity test_116b no QOS
SKIP 116b (1s)
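
The "unary operator expected" message in test 116a is what an unquoted numeric test prints when one operand is empty, which is what happens once get_param fails and returns nothing for the threshold. The following is only a minimal sketch of how such a comparison could be guarded; ost_imbalance_pct and qos_imbalance_met are hypothetical names, not functions from sanity.sh, and dividing by the minimum is an assumption that happens to reproduce the 1% shown in the log.

```shell
# Hypothetical helper: integer percent imbalance between the least-free and
# most-free OST, given the "kbytes available" values as arguments.
ost_imbalance_pct() {
	local min=$1 max=$1 kb
	for kb in "$@"; do
		if [ "$kb" -lt "$min" ]; then min=$kb; fi
		if [ "$kb" -gt "$max" ]; then max=$kb; fi
	done
	echo $(( (max - min) * 100 / min ))
}

# Hypothetical helper: returns 0 if the imbalance exceeds the threshold,
# 1 if it does not, and 2 if the threshold is missing or not a number
# (for example because get_param failed and printed nothing).
qos_imbalance_met() {
	local diff_pct=$1
	local threshold=${2%\%}    # drop one trailing "%", if present

	case $threshold in
	''|*[!0-9]*)
		echo "qos_threshold_rr unavailable: '$2'" >&2
		return 2
		;;
	esac
	[ "$diff_pct" -gt "$threshold" ]
}
```

With the free-space values from the log, ost_imbalance_pct 1878016 1901568 1912832 1904640 1900544 1911808 1909760 prints 1, and qos_imbalance_met 1 "" returns 2 with a readable message instead of a shell syntax error.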

In sanityn, https://testing.whamcloud.com/test_sets/ab3b11a8-fd58-11e9-8e77-52540065bddc, we see similar failures:

== sanityn test 93: alloc_rr should not allocate on same ost ========================================= 08:34:06 (1572597246)
CMD: trevis-20vm12 lctl get_param -n lod.lustre-MDT*/qos_threshold_rr
trevis-20vm12: error: get_param: param_path 'lod/lustre-MDT*/qos_threshold_rr': No such file or directory
CMD: trevis-20vm12 lctl set_param -n lod.lustre-MDT*/qos_threshold_rr 100
trevis-20vm12: error: set_param: param_path 'lod/lustre-MDT*/qos_threshold_rr': No such file or directory
CMD: trevis-20vm12 lctl set_param fail_loc=0x00000163
fail_loc=0x00000163
CMD: trevis-20vm12 lctl set_param fail_loc=0x0
fail_loc=0x0
CMD: trevis-20vm12 lctl set_param -n 		'lod.lustre-MDT*/qos_threshold_rr' 

trevis-20vm12: error: set_param: setting lod.lustre-MDT*/qos_threshold_rr: no value

Other failures are at
https://testing.whamcloud.com/test_sets/1447e4bc-fce3-11e9-b934-52540065bddc
https://testing.whamcloud.com/test_sets/25c770d8-fcff-11e9-8e77-52540065bddc
https://testing.whamcloud.com/test_sets/53262b58-fd06-11e9-bbc3-52540065bddc



 Comments   
Comment by James Nunez (Inactive) [ 04/Nov/19 ]

The manual may need to be updated if qos_threshold_rr was moved or removed. In 19.6.2. Stripe Allocation Methods we have

The allocation method is determined by the amount of free-space imbalance on the OSTs. When free space is relatively balanced across OSTs, the faster round-robin allocator is used, which maximizes network balancing. The weighted allocator is used when any two OSTs are out of balance by more than the specified threshold (17% by default). The threshold between the two allocation methods is defined in the file /proc/fs/fsname/lov/fsname-mdtlov/qos_threshold_rr.

To set the qos_threshold_rr to 25, enter this command on the MGS:

lctl set_param lov.fsname-mdtlov.qos_threshold_rr=25

and in 39.7. Allocating Free Space on OSTs, we have

Free space distribution can be tuned using these two tunable parameters:

lod.*.qos_threshold_rr - The threshold at which the allocation method switches from round-robin to weighted is set in this file. The default is to switch to the weighted algorithm when any two OSTs are out of balance by more than 17 percent.
Comment by Andreas Dilger [ 04/Nov/19 ]

We should not change the existing parameter names easily, as that breaks user tunable parameters. In particular, the qos_* parameters are actually changed by some users to disable QOS weighted allocation to get more uniform performance.

So, I do not think that changing the documentation is the right thing to do.

It may be that the problem is the use of "lov.*.qos_threshold_rr" instead of "lod.*.qos_threshold_rr" in the test scripts. That can be fixed by using "lod.*.*" everywhere, since the use of "lov.*.*" on the MDS is very old and should be removed. Patch https://review.whamcloud.com/35185 "LU-8066 tests: use lod / osp tunables on servers" landed for this, but it appears to have missed at least lustre/tests/functions.sh::run_rr_alloc().

There is also lustre/tests/qos.sh but it isn't clear that this test script is working at all these days (it does things on the client that should be on the MDS, uses old positional setstripe arguments, etc.), and could just be removed.

If it isn't just a matter of s/lov/lod/ in the script because the parameter names changed, we need to change back to the old parameter names so that the tests work again and we don't break existing configurations.
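
As an illustration only, remaining server-side uses of the old "lov." prefix for qos_* tunables could be located with a small grep wrapper. This is a rough sketch and not part of any landed patch; find_lov_qos_uses is a hypothetical name, and the pattern deliberately does not match scripts that already use a "lo[vd]." glob.

```shell
# Hypothetical audit helper: print (with line numbers) every line that still
# addresses a qos_* tunable through the "lov." prefix.
# Reads the files given as arguments, or stdin when none are given.
find_lov_qos_uses() {
	grep -n 'lov\.[^[:space:]]*qos_' "$@"
}
```

For example, find_lov_qos_uses lustre/tests/functions.sh would list candidate lines to switch to "lod."; the function returns non-zero when nothing matches.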

Comment by Gerrit Updater [ 04/Nov/19 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36666
Subject: LU-12932 tests: remove obsolete qos.sh test script
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 5b53b5a0decfe5b6fffa2252e697e9a805454e7f

Comment by James A Simmons [ 04/Nov/19 ]

Looks like qos_thresholdrr was renamed to lod_qos_thresholdrr which is causing the breakage.

Comment by Gerrit Updater [ 04/Nov/19 ]

James Simmons (jsimmons@infradead.org) uploaded a new patch: https://review.whamcloud.com/36667
Subject: LU-12932 lod: restore qos_thresholdrr sysfs file
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 88a518e7bdada1ed92278c6bd50b9b37b0ac6ca1

Comment by Gerrit Updater [ 05/Nov/19 ]

Andreas Dilger (adilger@whamcloud.com) merged in patch https://review.whamcloud.com/36667/
Subject: LU-12932 lod: restore qos_thresholdrr sysfs file
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 9ad541aa34002eb0f3d19ba9512b713ffcaf77bc

Comment by Gerrit Updater [ 05/Nov/19 ]

Andreas Dilger (adilger@whamcloud.com) merged in patch https://review.whamcloud.com/36666/
Subject: LU-12932 tests: remove obsolete qos.sh test script
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 15443866cc98b1c81551ce8d2172b2902c51eebd

Comment by Peter Jones [ 05/Nov/19 ]

Landed for 2.13

Comment by James Nunez (Inactive) [ 05/Nov/19 ]

I don't think this is fixed. Looking at the results for the patch https://review.whamcloud.com/36667/ at https://testing.whamcloud.com/test_sets/59e536a8-ff8d-11e9-8e77-52540065bddc, we see that qos_threshold_rr is missing:

== sanityn test 93: alloc_rr should not allocate on same ost ========================================= 03:24:52 (1572924292)
CMD: trevis-18vm4 lctl get_param -n lod.lustre-MDT*/qos_threshold_rr
trevis-18vm4: error: get_param: param_path 'lod/lustre-MDT*/qos_threshold_rr': No such file or directory
CMD: trevis-18vm4 lctl set_param -n lod.lustre-MDT*/qos_threshold_rr 100
trevis-18vm4: error: set_param: param_path 'lod/lustre-MDT*/qos_threshold_rr': No such file or directory
CMD: trevis-18vm4 lctl set_param fail_loc=0x00000163
fail_loc=0x00000163
CMD: trevis-18vm4 lctl set_param fail_loc=0x0
fail_loc=0x0
CMD: trevis-18vm4 lctl set_param -n 		'lod.lustre-MDT*/qos_threshold_rr' 
trevis-18vm4: error: set_param: setting lod.lustre-MDT*/qos_threshold_rr: no value
/mnt/lustre/f93.sanityn-1/file1

Maybe the parameter changed to qos_thresholdrr?

Comment by Andreas Dilger [ 06/Nov/19 ]

Sigh, the previously existing parameter was named "qos_threshold_rr" and not "qos_thresholdrr", so another patch is needed.
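
A one-character naming slip like this could, in principle, be caught by checking for the exact expected name before running the tests. The sketch below is an assumption-laden illustration: has_param_named is a hypothetical helper, and it assumes parameter paths arrive one per line in dotted form (as "lctl list_param" is expected to print them).

```shell
# Hypothetical check: succeed only if some parameter path read from stdin
# ends in exactly ".<name>", where <name> is the first argument.
has_param_named() {
	grep -q -- "\.$1\$"
}
```

On an MDS this might be used as lctl list_param 'lod.*.*' | has_param_named qos_threshold_rr, which would fail if only "qos_thresholdrr" were present.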

Comment by Gerrit Updater [ 06/Nov/19 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36683
Subject: LU-12932 lod: rename qos_threshold_rr parameter
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 12992baae51be17271ee0656fab8d236a80cb4d1

Comment by Gerrit Updater [ 06/Nov/19 ]

James Simmons (jsimmons@infradead.org) uploaded a new patch: https://review.whamcloud.com/36686
Subject: LU-12932 lod: rename qos_threshold_rr parameter
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 5b4cbf2478783de790e7f00ec92c9d226f842897

Comment by Gerrit Updater [ 08/Nov/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36686/
Subject: LU-12932 lod: rename qos_threshold_rr parameter
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: aa4269f5c2e3c834cdff63dc32d7a7183f32374a

Comment by Peter Jones [ 08/Nov/19 ]

Landed for 2.13
