[LU-12932] parallel-scale test rr_alloc fails with ‘failed while setting qos_threshold_rr & creat_count’ Created: 04/Nov/19 Updated: 07/Jan/21 Resolved: 08/Nov/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.13.0 |
| Fix Version/s: | Lustre 2.13.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | James Nunez (Inactive) | Assignee: | Lai Siyao |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
parallel-scale test_rr_alloc fails getting/setting 'lov.lustre-MDT0000*.qos_threshold_rr'. These failures started on approximately 29 Oct 2019 and may be related to the 'striped directory allocate stripes by QoS' changes.

Looking at the suite_log for https://testing.whamcloud.com/test_sets/c0dad8f4-fd58-11e9-8e77-52540065bddc, we see the errors getting and setting qos_threshold_rr on the MDS:

CMD: trevis-20vm12 params=\$(/usr/sbin/lctl get_param lov.lustre-MDT0000*.qos_threshold_rr); [[ -z \"lustre-MDT0000\" ]] && param= || param=\$(grep lustre-MDT0000 <<< \"\$params\"); [[ -z \$param ]] && param=\"\$params\"; while read s; do echo mds1 \$s; done <<< \"\$param\"
trevis-20vm12: error: get_param: param_path 'lov/lustre-MDT0000*/qos_threshold_rr': No such file or directory
CMD: trevis-20vm12 params=\$(/usr/sbin/lctl get_param osp.lustre-OST*-osc-MDT0000.create_count); [[ -z \"lustre-MDT0000\" ]] && param= || param=\$(grep lustre-MDT0000 <<< \"\$params\"); [[ -z \$param ]] && param=\"\$params\"; while read s; do echo mds1 \$s; done <<< \"\$param\"
CMD: trevis-20vm12 /usr/sbin/lctl set_param -n lov.lustre-MDT0000*.qos_threshold_rr 100 osp.lustre-OST*-osc-MDT0000.create_count 3488
trevis-20vm12: error: set_param: param_path 'lov/lustre-MDT0000*/qos_threshold_rr': No such file or directory
 parallel-scale test_rr_alloc: @@@@@@ FAIL: failed while setting qos_threshold_rr & creat_count
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:6108:error()
  = /usr/lib64/lustre/tests/functions.sh:1004:run_rr_alloc()
  = /usr/lib64/lustre/tests/parallel-scale.sh:165:test_rr_alloc()

Looking at the sanity suite_log at https://testing.whamcloud.com/test_sets/9ee6cf78-fd58-11e9-8e77-52540065bddc, we see failures getting the qos_threshold_rr parameter:

== sanity test 116a: stripe QOS: free space balance ================================================== 00:49:17 (1572569357)
Free space priority
CMD: trevis-20vm12 lctl get_param -n lo[vd].*-mdtlov.qos_prio_free
91%
CMD: trevis-20vm12 /usr/sbin/lctl set_param -n os[cd]*.*MD*.force_sync 1
CMD: trevis-20vm12 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: trevis-20vm12 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: trevis-20vm12 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: trevis-20vm12 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: trevis-20vm12 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: trevis-20vm12 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: trevis-20vm12 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
sleep 5 for ZFS zfs
sleep 5 for ZFS zfs
Waiting for local destroys to complete
OST kbytes available: 1878016 1901568 1912832 1904640 1900544 1911808 1909760
Min free space: OST 0: 1878016
Max free space: OST 2: 1912832
CMD: trevis-20vm12 lctl get_param -n *.*MDT0000-mdtlov.qos_threshold_rr
trevis-20vm12: error: get_param: param_path '*/*MDT0000-mdtlov/qos_threshold_rr': No such file or directory
Check for uneven OSTs: diff=34816KB (1%) must be > % ...ok
Don't need to fill OST0
diff=34816=1% must be > % for QOS mode.../usr/lib64/lustre/tests/sanity.sh: line 10107: [: 1: unary operator expected
failed - QOS mode won't be used
sleep 5 for ZFS zfs
Waiting for local destroys to complete
cleanup time 6
SKIP: sanity test_116a QOS imbalance criteria not met
SKIP 116a (29s)
== sanity test 116b: QoS shouldn't LBUG if not enough OSTs found on the 2nd pass ===================== 00:49:46 (1572569386)
CMD: trevis-20vm12 lctl get_param -n lo[vd].lustre-MDT0000-mdtlov.qos_threshold_rr
trevis-20vm12: error: get_param: param_path 'lo[vd]/lustre-MDT0000-mdtlov/qos_threshold_rr': No such file or directory
SKIP: sanity test_116b no QOS
SKIP 116b (1s)

In sanityn, https://testing.whamcloud.com/test_sets/ab3b11a8-fd58-11e9-8e77-52540065bddc, we see similar failures:

== sanityn test 93: alloc_rr should not allocate on same ost ========================================= 08:34:06 (1572597246)
CMD: trevis-20vm12 lctl get_param -n lod.lustre-MDT*/qos_threshold_rr
trevis-20vm12: error: get_param: param_path 'lod/lustre-MDT*/qos_threshold_rr': No such file or directory
CMD: trevis-20vm12 lctl set_param -n lod.lustre-MDT*/qos_threshold_rr 100
trevis-20vm12: error: set_param: param_path 'lod/lustre-MDT*/qos_threshold_rr': No such file or directory
CMD: trevis-20vm12 lctl set_param fail_loc=0x00000163
fail_loc=0x00000163
CMD: trevis-20vm12 lctl set_param fail_loc=0x0
fail_loc=0x0
CMD: trevis-20vm12 lctl set_param -n 'lod.lustre-MDT*/qos_threshold_rr'
trevis-20vm12: error: set_param: setting lod.lustre-MDT*/qos_threshold_rr: no value

Other failures are at |
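The failure pattern above can be illustrated in miniature: lctl maps a dotted parameter name onto the procfs-style param_path shown in its error messages, so a lookup under the old 'lov' namespace fails once the tunable only exists under 'lod'. This is a minimal sketch, not lctl's actual implementation; it only mimics the dot-to-slash translation visible in the errors above:

```shell
# Minimal sketch (not lctl's real code): lctl treats '.' and '/' as
# interchangeable separators, so a parameter name corresponds to the
# param_path in the error messages with dots replaced by slashes.
param_to_path() { printf '%s\n' "$1" | tr '.' '/'; }

# The test scripts still query the old 'lov' namespace ...
param_to_path 'lov.lustre-MDT0000*.qos_threshold_rr'
# -> lov/lustre-MDT0000*/qos_threshold_rr  (the ENOENT path in the logs)

# ... but after the QoS changes the entry should live under 'lod':
param_to_path 'lod.lustre-MDT0000*.qos_threshold_rr'
# -> lod/lustre-MDT0000*/qos_threshold_rr
```

This also explains why mixed forms like 'lod.lustre-MDT*/qos_threshold_rr' in sanityn test 93 resolve to the same path as the all-dots spelling.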
| Comments |
| Comment by James Nunez (Inactive) [ 04/Nov/19 ] |
|
The manual may need to be updated if qos_threshold_rr was moved or removed.

In 19.6.2. Stripe Allocation Methods we have:

"The allocation method is determined by the amount of free-space imbalance on the OSTs. When free space is relatively balanced across OSTs, the faster round-robin allocator is used, which maximizes network balancing. The weighted allocator is used when any two OSTs are out of balance by more than the specified threshold (17% by default). The threshold between the two allocation methods is defined in the file /proc/fs/fsname/lov/fsname-mdtlov/qos_threshold_rr. To set the qos_threshold_rr to 25, enter this command on the MGS:

lctl set_param lov.fsname-mdtlov.qos_threshold_rr=25"

and in 39.7. Allocating Free Space on OSTs, we have:

"Free space distribution can be tuned using these two tunable parameters:

lod.*.qos_threshold_rr - The threshold at which the allocation method switches from round-robin to weighted is set in this file. The default is to switch to the weighted algorithm when any two OSTs are out of balance by more than 17 percent." |
| Comment by Andreas Dilger [ 04/Nov/19 ] |
|
We should not change the existing parameter names easily, as that breaks user tunable parameters. In particular, the qos_* parameters are actually changed by some users to disable QOS weighted allocation to get more uniform performance. So I do not think that changing the documentation is the right thing to do.

It may be that the problem is the use of "lov.*.qos_threshold_rr" instead of "lod.*.qos_threshold_rr" in the test scripts. That can be fixed by using "lod.*.*" everywhere, since the use of "lov.*.*" on the MDS is very old and should be removed. The patch https://review.whamcloud.com/35185 "LU-8066 tests: use lod / osp tunables on servers" landed for this, but it looks like it missed at least lustre/tests/functions.sh::run_rr_alloc(). There is also lustre/tests/qos.sh, but it isn't clear that this test script works at all these days (it does things on the client that should be done on the MDS, uses old positional setstripe arguments, etc.), and it could just be removed.

If it isn't just a matter of s/lov/lod/ in the scripts because the parameter names changed, we need to change back to the old parameter names so that the tests work again and we don't break existing configurations. |
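The s/lov/lod/ fix described here is mechanical. A hedged sketch of the substitution follows; the sample line is illustrative, not the exact contents of functions.sh::run_rr_alloc():

```shell
# Illustrative only: rewrite the MDS-side 'lov.' parameter prefix to
# 'lod.' in a test-script line, leaving the rest of the line untouched.
line='qos=$($LCTL get_param -n lov.$FSNAME-MDT0000*.qos_threshold_rr)'
printf '%s\n' "$line" | sed 's/get_param -n lov\./get_param -n lod./'
# -> qos=$($LCTL get_param -n lod.$FSNAME-MDT0000*.qos_threshold_rr)
```

Anchoring the pattern on "get_param -n " avoids touching client-side 'lov.' parameters, which keep their name.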
| Comment by Gerrit Updater [ 04/Nov/19 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36666 |
| Comment by James A Simmons [ 04/Nov/19 ] |
|
Looks like qos_thresholdrr was renamed to lod_qos_thresholdrr, which is causing the breakage. |
| Comment by Gerrit Updater [ 04/Nov/19 ] |
|
James Simmons (jsimmons@infradead.org) uploaded a new patch: https://review.whamcloud.com/36667 |
| Comment by Gerrit Updater [ 05/Nov/19 ] |
|
Andreas Dilger (adilger@whamcloud.com) merged in patch https://review.whamcloud.com/36667/ |
| Comment by Gerrit Updater [ 05/Nov/19 ] |
|
Andreas Dilger (adilger@whamcloud.com) merged in patch https://review.whamcloud.com/36666/ |
| Comment by Peter Jones [ 05/Nov/19 ] |
|
Landed for 2.13 |
| Comment by James Nunez (Inactive) [ 05/Nov/19 ] |
|
I don't think this is fixed. Looking at the results for the patch https://review.whamcloud.com/36667/ at https://testing.whamcloud.com/test_sets/59e536a8-ff8d-11e9-8e77-52540065bddc, we see that qos_threshold_rr is missing:

== sanityn test 93: alloc_rr should not allocate on same ost ========================================= 03:24:52 (1572924292)
CMD: trevis-18vm4 lctl get_param -n lod.lustre-MDT*/qos_threshold_rr
trevis-18vm4: error: get_param: param_path 'lod/lustre-MDT*/qos_threshold_rr': No such file or directory
CMD: trevis-18vm4 lctl set_param -n lod.lustre-MDT*/qos_threshold_rr 100
trevis-18vm4: error: set_param: param_path 'lod/lustre-MDT*/qos_threshold_rr': No such file or directory
CMD: trevis-18vm4 lctl set_param fail_loc=0x00000163
fail_loc=0x00000163
CMD: trevis-18vm4 lctl set_param fail_loc=0x0
fail_loc=0x0
CMD: trevis-18vm4 lctl set_param -n 'lod.lustre-MDT*/qos_threshold_rr'
trevis-18vm4: error: set_param: setting lod.lustre-MDT*/qos_threshold_rr: no value
/mnt/lustre/f93.sanityn-1/file1

Maybe the parameter changed to qos_thresholdrr? |
| Comment by Andreas Dilger [ 06/Nov/19 ] |
|
Sigh, the previously existing parameter was named "qos_threshold_rr", not "qos_thresholdrr", so another patch is needed. |
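The two spellings differ only by the underscore before "rr", which is easy to misread in review. A trivial check (the names come from this ticket; everything else is illustrative):

```shell
# The user-visible name the tests expect, vs. the spelling the module
# ended up exposing -- one underscore apart.
expected='qos_threshold_rr'
exposed='qos_thresholdrr'
if [ "$expected" != "$exposed" ]; then
    printf 'mismatch: tests expect %s but module exposes %s\n' \
        "$expected" "$exposed"
fi
```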
| Comment by Gerrit Updater [ 06/Nov/19 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36683 |
| Comment by Gerrit Updater [ 06/Nov/19 ] |
|
James Simmons (jsimmons@infradead.org) uploaded a new patch: https://review.whamcloud.com/36686 |
| Comment by Gerrit Updater [ 08/Nov/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36686/ |
| Comment by Peter Jones [ 08/Nov/19 ] |
|
Landed for 2.13 |