
[LU-11170] sanity test 415 fails with 'rename took N > M sec'

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.16.0
    • Affects Version/s: Lustre 2.12.0, Lustre 2.13.0, Lustre 2.12.3, Lustre 2.14.0, Lustre 2.12.6, Lustre 2.15.0
    • Severity: 3

    Description

      This issue was created by maloo for James Nunez <james.a.nunez@intel.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/98efeea0-7f0b-11e8-8fe6-52540065bddc

      sanity test 415 fails for DNE on ZFS with the error:

      total: 500 open/close in 0.87 seconds: 572.35 ops/second
      rename 500 files took 283 sec
       sanity test_415: @@@@@@ FAIL: rename took 283 sec 
      

      So far, this test only fails for ZFS.

      This test started failing on 2018-07-03 with logs at https://testing.whamcloud.com/test_sets/98efeea0-7f0b-11e8-8fe6-52540065bddc

      Other test failures at
      https://testing.whamcloud.com/test_sets/8de6a208-8945-11e8-9028-52540065bddc
      https://testing.whamcloud.com/test_sets/9bb5e252-8e9c-11e8-b0aa-52540065bddc
      https://testing.whamcloud.com/test_sets/029c73ea-8e9e-11e8-87f3-52540065bddc

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      sanity test_415 - rename took 283 sec
      sanity test_415 - rename took 154 > 125 sec

    Activity

            pjones Peter Jones added a comment -

            Seems to be merged for 2.16


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50654/
            Subject: LU-11170 tests: don't fail sanity/415 in VM
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 73a7b1c2a3f0114618db7781adb56974ed682f24

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50654/ Subject: LU-11170 tests: don't fail sanity/415 in VM Project: fs/lustre-release Branch: master Current Patch Set: Commit: 73a7b1c2a3f0114618db7781adb56974ed682f24

            "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50654
            Subject: LU-11170 tests: don't fail sanity/415 in VM
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 01b1f0ec789313c077e3a6aff64ef145efc4d9f4

            gerrit Gerrit Updater added a comment - "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50654 Subject: LU-11170 tests: don't fail sanity/415 in VM Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 01b1f0ec789313c077e3a6aff64ef145efc4d9f4

            adilger Andreas Dilger added a comment -

            With the added debugging and changes from my previous patch, it looks like this issue is not simply that some VMs do renames more slowly than others: some test runs pass after almost 500s, while others fail in 130s.

            It seems possible that variable VM performance between the "uncontended" rename pass and the "contended" rename pass is causing the failures. Looking at recent test results in Maloo, most subtest runs take 20-50s to finish, but some of the passing "uncontended" runs take longer to complete than some of the failures, which points to VM performance and/or external contention:

            rename 500 files without 'touch' took 173 sec
            rename 500 files with 'touch' took 302 sec
            :
            rename 500 files without 'touch' took 181 sec
            rename 500 files with 'touch' took 287 sec
            :
            rename 500 files without 'touch' took 137 sec
            rename 500 files with 'touch' took 204 sec
            :
            rename 500 files without 'touch' took 58 sec
            rename 500 files with 'touch' took 77 sec
            

            The main question this test should be answering is whether a single contended rename is slow because it is blocked by DLM contention, or whether all of the contended renames are uniformly about 2.2x slower than the uncontended renames.
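
            As an illustration of the ratio idea, a minimal bash sketch follows; the variable names time_uncontended and time_contended are hypothetical placeholders for the two measured durations in seconds (not names from sanity.sh), and the 4x threshold is arbitrary:

            # hedged sketch: fail only if the contended ("with touch") pass is
            # much slower than the uncontended baseline measured in the same VM,
            # instead of comparing against a fixed wall-clock limit
            max_ratio=4
            if (( time_contended > time_uncontended * max_ratio )); then
                    error "contended rename ${time_contended}s > ${max_ratio}x uncontended ${time_uncontended}s"
            fi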

            I was hoping there was a way to make this test more robust by using the rename stats to see whether there are "long pole" renames where max >> average, which would indicate a real failure, but even that doesn't appear to be very reliable. In my local testing (single VM, otherwise idle system) of an uncontended rename I see times with max > 10x min and max > 4x avg (the stats columns are "stat, count, 'samples', [unit], min, max, sum", so avg=sum/100 can easily be calculated visually):

            # mkdir /mnt/testfs/dir; touch /mnt/testfs/dir/f{1..100}
            # lctl set_param llite.*.stats=clear; (cd /mnt/testfs/dir; rename f g f[1-9]*); lctl get_param llite.*.stats | grep -w rename
            rename                    100 samples [usec] 1289 15669 305488
            rename                    100 samples [usec] 1194 17080 443209
            rename                    100 samples [usec] 1393 17190 445666
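
            For reference, a small sketch of that max-vs-average check; it assumes the column layout quoted above (name, count, 'samples', [unit], min, max, sum) and uses an arbitrary 4x threshold:

            # hedged sketch: flag "long pole" renames where max greatly exceeds
            # the average; fields: name count "samples" [unit] min max sum
            lctl get_param -n llite.*.stats | awk '/^rename / {
                    count = $2; min = $5; max = $6; sum = $7;
                    avg = sum / count;
                    printf "min=%d max=%d avg=%.0f usec (max/avg = %.1f)\n", min, max, avg, max / avg;
                    if (max > 4 * avg)
                            print "WARNING: long-pole rename detected";
            }'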
            

            Comparatively, running this on my real client+server hardware shows much closer times between min/max/avg, with max=1.1-1.3x avg:

            rename                    100 samples [usec] 556 839 65490 43136492
            rename                    100 samples [usec] 466 849 68130 46780816
            rename                    100 samples [usec] 591 826 68434 47090098
            rename                    100 samples [usec] 633 801 72962 53361444
            rename                    100 samples [usec] 615 816 73353 53967167
            

            Lai, since you originally wrote patch https://review.whamcloud.com/32738 and this test, do you have any suggestions for Vitaliy on how to make it more reliable (e.g. grepping something from the client or MDS debug logs to detect DLM lock cancellation/contention)?

            Otherwise, if you don't have any immediate suggestions, I don't think this test can produce reliable results when run in a VM, and it should be changed to use error_not_in_vm instead of error.
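
            A minimal sketch of that change, assuming the check compares a measured $duration against a $max_time limit; both names are placeholders, not necessarily the variables used in sanity.sh:

            # hedged sketch: error_not_in_vm() from test-framework.sh reports the
            # failure but does not fail the subtest when running inside a VM
            (( duration <= max_time )) ||
                    error_not_in_vm "rename took $duration > $max_time sec"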

            ssmirnov Serguei Smirnov added a comment - +1 on master: https://testing.whamcloud.com/test_sets/d8e1c14f-849d-46a5-b422-c77955590561

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49724/
            Subject: LU-11170 tests: add debugging to sanity/415
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 6594babc73851fab335c514cd1fee018425e7bb3

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49724/ Subject: LU-11170 tests: add debugging to sanity/415 Project: fs/lustre-release Branch: master Current Patch Set: Commit: 6594babc73851fab335c514cd1fee018425e7bb3
            qian_wc Qian Yingjin added a comment - +1 on master: https://testing.whamcloud.com/test_sets/c37adf87-d38a-4a38-afa4-effadab56dd5
            scherementsev Sergey Cheremencev added a comment - +1 master https://testing.whamcloud.com/test_sets/55564f7c-ab54-4536-a2e5-1ec9777a2363

            "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49724
            Subject: LU-11170 tests: add debugging to sanity/415
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: e33e8906febd161a89bb190648e74dcc3b7414cb

            gerrit Gerrit Updater added a comment - "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49724 Subject: LU-11170 tests: add debugging to sanity/415 Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: e33e8906febd161a89bb190648e74dcc3b7414cb

            adilger Andreas Dilger added a comment -

            Still hitting this in about 1-3% of runs during normal testing. Failures usually complete within 140s, with one taking as long as 269s.

            The fact that ZFS is also very slow means that the failures may actually be caused by a regression in the Commit-on-Share (COS) code, because commits on ZFS can take a long time. I'm going to push a patch to improve the debuggability of this test; hopefully we can identify the source of the problem.
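
            If COS is the suspect, one quick check, sketched below, would be to temporarily disable commit-on-share on the MDS and see whether the slow renames disappear; the mdt.*.commit_on_sharing parameter name is assumed here, not taken from this ticket:

            # hedged sketch, run on the MDS: inspect and temporarily disable
            # Commit-on-Share, rerun sanity.sh test_415, then restore the setting
            lctl get_param mdt.*.commit_on_sharing
            lctl set_param mdt.*.commit_on_sharing=0
            # ... rerun the test here ...
            lctl set_param mdt.*.commit_on_sharing=1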

            People

              Assignee: vkuznetsov Vitaliy Kuznetsov
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 18
