  Lustre / LU-12932

parallel-scale test rr_alloc fails with ‘failed while setting qos_threshold_rr & creat_count’

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.13.0
    • Affects Version/s: Lustre 2.13.0
    • Severity: 3

    Description

      parallel-scale test_rr_alloc fails getting/setting ‘lov.lustre-MDT0000*.qos_threshold_rr’. These failures started around 29 October 2019 and may be related to the landing of the ‘striped directory allocate stripes by QoS’ patches from LU-12624.

      Looking at the suite_log for https://testing.whamcloud.com/test_sets/c0dad8f4-fd58-11e9-8e77-52540065bddc, we see errors getting and setting qos_threshold_rr on the MDS:

      CMD: trevis-20vm12 params=\$(/usr/sbin/lctl get_param lov.lustre-MDT0000*.qos_threshold_rr);
      			 [[ -z \"lustre-MDT0000\" ]] && param= ||
      			 param=\$(grep lustre-MDT0000 <<< \"\$params\");
      			 [[ -z \$param ]] && param=\"\$params\";
      			 while read s; do echo mds1 \$s;
      			 done <<< \"\$param\"
      trevis-20vm12: error: get_param: param_path 'lov/lustre-MDT0000*/qos_threshold_rr': No such file or directory
      CMD: trevis-20vm12 params=\$(/usr/sbin/lctl get_param osp.lustre-OST*-osc-MDT0000.create_count);
      			 [[ -z \"lustre-MDT0000\" ]] && param= ||
      			 param=\$(grep lustre-MDT0000 <<< \"\$params\");
      			 [[ -z \$param ]] && param=\"\$params\";
      			 while read s; do echo mds1 \$s;
      			 done <<< \"\$param\"
      CMD: trevis-20vm12 /usr/sbin/lctl set_param -n 		lov.lustre-MDT0000*.qos_threshold_rr 100 		osp.lustre-OST*-osc-MDT0000.create_count 3488
      trevis-20vm12: error: set_param: param_path 'lov/lustre-MDT0000*/qos_threshold_rr': No such file or directory
       parallel-scale test_rr_alloc: @@@@@@ FAIL: failed while setting qos_threshold_rr & creat_count 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:6108:error()
        = /usr/lib64/lustre/tests/functions.sh:1004:run_rr_alloc()
        = /usr/lib64/lustre/tests/parallel-scale.sh:165:test_rr_alloc()
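
      The 'No such file or directory' errors mean lctl cannot find the parameter files: as the messages show, it maps the dotted name 'lov.lustre-MDT0000*.qos_threshold_rr' to the path 'lov/lustre-MDT0000*/qos_threshold_rr' under /proc or /sys, and nothing matches on the MDS. A hedged diagnostic sketch (not part of the test output) to see which subsystem, if any, still exposes the tunable:

      # Run on the MDS; probe both the old client-style lov name and the
      # server-side lod name, since the parameter may have moved or been renamed.
      lctl list_param 'lov.*.qos*' 'lod.*.qos*' 2>/dev/null
      # Read it with a fallback from one location to the other:
      lctl get_param -n 'lod.*-mdtlov.qos_threshold_rr' 2>/dev/null ||
              lctl get_param -n 'lov.*-mdtlov.qos_threshold_rr'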
      

      Looking at the sanity suite_log at https://testing.whamcloud.com/test_sets/9ee6cf78-fd58-11e9-8e77-52540065bddc, we see failures getting the qos_threshold_rr parameter:

      == sanity test 116a: stripe QOS: free space balance ================================================== 00:49:17 (1572569357)
      Free space priority CMD: trevis-20vm12 lctl get_param -n lo[vd].*-mdtlov.qos_prio_free
      91%
      CMD: trevis-20vm12 /usr/sbin/lctl set_param -n os[cd]*.*MD*.force_sync 1
      CMD: trevis-20vm12 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
      CMD: trevis-20vm12 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
      CMD: trevis-20vm12 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
      CMD: trevis-20vm12 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
      CMD: trevis-20vm12 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
      CMD: trevis-20vm12 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
      CMD: trevis-20vm12 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
      sleep 5 for ZFS zfs
      sleep 5 for ZFS zfs
      Waiting for local destroys to complete
      OST kbytes available: 1878016 1901568 1912832 1904640 1900544 1911808 1909760
      Min free space: OST 0: 1878016
      Max free space: OST 2: 1912832
      CMD: trevis-20vm12 lctl get_param -n *.*MDT0000-mdtlov.qos_threshold_rr
      trevis-20vm12: error: get_param: param_path '*/*MDT0000-mdtlov/qos_threshold_rr': No such file or directory
      Check for uneven OSTs: diff=34816KB (1%) must be > % ...ok
      Don't need to fill OST0
      diff=34816=1% must be > % for QOS mode.../usr/lib64/lustre/tests/sanity.sh: line 10107: [: 1: unary operator expected
      failed - QOS mode won't be used
      sleep 5 for ZFS zfs
      Waiting for local destroys to complete
      cleanup time 6
      
       SKIP: sanity test_116a QOS imbalance criteria not met
      SKIP 116a (29s)
      
      == sanity test 116b: QoS shouldn't LBUG if not enough OSTs found on the 2nd pass ===================== 00:49:46 (1572569386)
      CMD: trevis-20vm12 lctl get_param -n lo[vd].lustre-MDT0000-mdtlov.qos_threshold_rr
      trevis-20vm12: error: get_param: param_path 'lo[vd]/lustre-MDT0000-mdtlov/qos_threshold_rr': No such file or directory
      
       SKIP: sanity test_116b no QOS
      SKIP 116b (1s)
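
      The '[: 1: unary operator expected' message from sanity.sh line 10107 is a knock-on effect of the failed get_param: the threshold read back is empty, so the '[ ... ]' comparison is left with only two operands. A minimal sketch of the failure mode (variable names are illustrative, not the actual ones in sanity.sh):

      diff=1                               # percent imbalance computed by the test
      threshold=$(lctl get_param -n 'lod.*-mdtlov.qos_threshold_rr' 2>/dev/null)
      [ $diff -gt $threshold ]             # expands to "[ 1 -gt ]" when $threshold is
                                           # empty -> "[: 1: unary operator expected"
      [ "$diff" -gt "${threshold:-17}" ]   # quoting plus a default keeps the test valid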
      

      In sanityn, https://testing.whamcloud.com/test_sets/ab3b11a8-fd58-11e9-8e77-52540065bddc, we see similar failures:

      == sanityn test 93: alloc_rr should not allocate on same ost ========================================= 08:34:06 (1572597246)
      CMD: trevis-20vm12 lctl get_param -n lod.lustre-MDT*/qos_threshold_rr
      trevis-20vm12: error: get_param: param_path 'lod/lustre-MDT*/qos_threshold_rr': No such file or directory
      CMD: trevis-20vm12 lctl set_param -n lod.lustre-MDT*/qos_threshold_rr 100
      trevis-20vm12: error: set_param: param_path 'lod/lustre-MDT*/qos_threshold_rr': No such file or directory
      CMD: trevis-20vm12 lctl set_param fail_loc=0x00000163
      fail_loc=0x00000163
      CMD: trevis-20vm12 lctl set_param fail_loc=0x0
      fail_loc=0x0
      CMD: trevis-20vm12 lctl set_param -n 		'lod.lustre-MDT*/qos_threshold_rr' 
      
      trevis-20vm12: error: set_param: setting lod.lustre-MDT*/qos_threshold_rr: no value
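
      The final 'no value' error looks like the test restoring the previously saved qos_threshold_rr: the initial get_param failed, so the saved value is empty and the restore becomes 'set_param -n <param>' with nothing to set. A guarded save/restore sketch (names are illustrative):

      saved=$(lctl get_param -n 'lod.lustre-MDT*.qos_threshold_rr' 2>/dev/null)
      lctl set_param -n 'lod.lustre-MDT*.qos_threshold_rr' 100
      # ... test body runs here ...
      # only restore when something was actually saved
      [ -n "$saved" ] && lctl set_param -n 'lod.lustre-MDT*.qos_threshold_rr' "$saved"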
      

      Other failures are at:
      https://testing.whamcloud.com/test_sets/1447e4bc-fce3-11e9-b934-52540065bddc
      https://testing.whamcloud.com/test_sets/25c770d8-fcff-11e9-8e77-52540065bddc
      https://testing.whamcloud.com/test_sets/53262b58-fd06-11e9-bbc3-52540065bddc

          Activity


            adilger Andreas Dilger added a comment:

            Sigh, the parameter name that previously existed was named "qos_threshold_rr" and not "qos_thresholdrr", so another patch is needed.


            jamesanunez James Nunez (Inactive) added a comment:

            I don't think this is fixed. Looking at the results for the patch https://review.whamcloud.com/36667/ at https://testing.whamcloud.com/test_sets/59e536a8-ff8d-11e9-8e77-52540065bddc, we see that qos_threshold_rr is missing:

            == sanityn test 93: alloc_rr should not allocate on same ost ========================================= 03:24:52 (1572924292)
            CMD: trevis-18vm4 lctl get_param -n lod.lustre-MDT*/qos_threshold_rr
            trevis-18vm4: error: get_param: param_path 'lod/lustre-MDT*/qos_threshold_rr': No such file or directory
            CMD: trevis-18vm4 lctl set_param -n lod.lustre-MDT*/qos_threshold_rr 100
            trevis-18vm4: error: set_param: param_path 'lod/lustre-MDT*/qos_threshold_rr': No such file or directory
            CMD: trevis-18vm4 lctl set_param fail_loc=0x00000163
            fail_loc=0x00000163
            CMD: trevis-18vm4 lctl set_param fail_loc=0x0
            fail_loc=0x0
            CMD: trevis-18vm4 lctl set_param -n 		'lod.lustre-MDT*/qos_threshold_rr' 
            trevis-18vm4: error: set_param: setting lod.lustre-MDT*/qos_threshold_rr: no value
            /mnt/lustre/f93.sanityn-1/file1
            

            Maybe the parameter changed to qos_thresholdrr?

            pjones Peter Jones added a comment:

            Landed for 2.13


            gerrit Gerrit Updater added a comment:

            Andreas Dilger (adilger@whamcloud.com) merged in patch https://review.whamcloud.com/36666/
            Subject: LU-12932 tests: remove obsolete qos.sh test script
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 15443866cc98b1c81551ce8d2172b2902c51eebd


            gerrit Gerrit Updater added a comment:

            Andreas Dilger (adilger@whamcloud.com) merged in patch https://review.whamcloud.com/36667/
            Subject: LU-12932 lod: restore qos_thresholdrr sysfs file
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 9ad541aa34002eb0f3d19ba9512b713ffcaf77bc


            gerrit Gerrit Updater added a comment:

            James Simmons (jsimmons@infradead.org) uploaded a new patch: https://review.whamcloud.com/36667
            Subject: LU-12932 lod: restore qos_thresholdrr sysfs file
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 88a518e7bdada1ed92278c6bd50b9b37b0ac6ca1


            simmonsja James A Simmons added a comment:

            Looks like qos_thresholdrr was renamed to lod_qos_thresholdrr, which is causing the breakage.
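
            If the sysfs attribute was renamed in the lod code, the file name visible on the MDS changes with it. A hedged way to check what is actually exported (the /sys/fs/lustre location for lod attributes is an assumption):

            lctl list_param 'lod.*.*threshold*'
            find /sys/fs/lustre -name '*threshold*' 2>/dev/null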


            gerrit Gerrit Updater added a comment:

            Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36666
            Subject: LU-12932 tests: remove obsolete qos.sh test script
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 5b53b5a0decfe5b6fffa2252e697e9a805454e7f

            adilger Andreas Dilger added a comment (edited):

            We should not change existing parameter names lightly, as that breaks tunables that users have already set. In particular, the qos_* parameters are changed by some users to disable QOS weighted allocation and get more uniform performance.

            So, I do not think that changing the documentation is the right thing to do.

            It may be that the problem is the use of "lov.*.qos_threshold_rr" instead of "lod.*.qos_threshold_rr" in the test scripts. That can be fixed by using "lod.*.*" everywhere, since the use of "lov.*.*" on the MDS is very old and should be removed. Patch https://review.whamcloud.com/35185 "LU-8066 tests: use lod / osp tunables on servers" landed for this, but it looks like it missed at least lustre/tests/functions.sh::run_rr_alloc().

            There is also lustre/tests/qos.sh, but it isn't clear that this test script works at all these days (it does things on the client that should be done on the MDS, uses old positional setstripe arguments, etc.), so it could just be removed.

            If it isn't just a matter of s/lov/lod/ in the script because the parameter names themselves changed, then we need to change back to the old parameter names so that the tests work again and existing configurations are not broken.
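
            For the test-script side, the substitution described above would look roughly like this in functions.sh::run_rr_alloc() (a sketch only; do_facet, $SINGLEMDS and $FSNAME are the usual test-framework helpers, and the exact lines in functions.sh are not reproduced here):

            # before: old client-style lov name, which the logs above show is
            # no longer present on the MDS
            do_facet $SINGLEMDS lctl get_param lov.$FSNAME-MDT0000*.qos_threshold_rr
            # after: server-side lod name, as the LU-8066 patch used elsewhere
            do_facet $SINGLEMDS lctl get_param lod.$FSNAME-MDT0000*.qos_threshold_rr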


            jamesanunez James Nunez (Inactive) added a comment:

            The manual may need to be updated if qos_threshold_rr was moved or removed. In section 19.6.2, "Stripe Allocation Methods", we have:

            The allocation method is determined by the amount of free-space imbalance on the OSTs. When free space is relatively balanced across OSTs, the faster round-robin allocator is used, which maximizes network balancing. The weighted allocator is used when any two OSTs are out of balance by more than the specified threshold (17% by default). The threshold between the two allocation methods is defined in the file /proc/fs/fsname/lov/fsname-mdtlov/qos_threshold_rr.
            
            To set the qos_threshold_r to 25, enter this command on the MGS:
            
            lctl set_param lov.fsname-mdtlov.qos_threshold_rr=25
            

            and in section 39.7, "Allocating Free Space on OSTs", we have:

            Free space distribution can be tuned using these two tunable parameters:
            
            lod.*.qos_threshold_rr - The threshold at which the allocation method switches from round-robin to weighted is set in this file. The default is to switch to the weighted algorithm when any two OSTs are out of balance by more than 17 percent.
            

            People

              Assignee: laisiyao Lai Siyao
              Reporter: jamesanunez James Nunez (Inactive)
              Votes: 0
              Watchers: 5
