Details
- Type: Bug
- Resolution: Fixed
- Priority: Critical
- Affects Version/s: Lustre 2.12.0, Lustre 2.13.0, Lustre 2.12.3, Lustre 2.14.0, Lustre 2.12.6, Lustre 2.15.0
Description
This issue was created by maloo for James Nunez <james.a.nunez@intel.com>
This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/98efeea0-7f0b-11e8-8fe6-52540065bddc
sanity test 415 fails for DNE with ZFS with the following error:
total: 500 open/close in 0.87 seconds: 572.35 ops/second
rename 500 files took 283 sec
sanity test_415: @@@@@@ FAIL: rename took 283 sec
So far, this test only fails for ZFS.
This test started failing on 2018-07-03 with logs at https://testing.whamcloud.com/test_sets/98efeea0-7f0b-11e8-8fe6-52540065bddc
Other test failures at:
https://testing.whamcloud.com/test_sets/8de6a208-8945-11e8-9028-52540065bddc
https://testing.whamcloud.com/test_sets/9bb5e252-8e9c-11e8-b0aa-52540065bddc
https://testing.whamcloud.com/test_sets/029c73ea-8e9e-11e8-87f3-52540065bddc
VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanity test_415 - rename took 283 sec
sanity test_415 - rename took 154 > 125 sec
Issue Links
- is blocking:
  - LU-12336 Update ZFS Version to 0.8.2 (Resolved)
- is duplicated by:
  - LU-11225 sanity test_415: rename took 176 sec (Resolved)
  - LU-12697 sanity test_415: rename took 133 sec (Resolved)
  - LU-14103 sanity test_415: sanity test_415: @@@@@@ FAIL: rename took 101 sec (Resolved)
  - LU-16237 sanity test_415: rename took 130 sec (Resolved)
  - LU-12843 sanity test_415: rename took 396 sec (Closed)
- is related to:
  - LU-11102 lock revoke may not take effect (Resolved)
With the added debugging and changes from my previous patch, it looks like this issue is not simply that some VMs do renames more slowly than others: some test runs pass after almost 500s, while others fail in 130s.
It seems possible that variable VM performance between the "uncontended" rename test run and the "contended" rename test run is causing the failures. Looking at recent test results in Maloo, most of the subtest runs take between 20s and 50s to finish, but some of the passing "uncontended" runs take longer to complete than some of the failures, which is due to VM performance and/or external contention:
The main question this test should be answering is whether a single contended rename is slow because it is blocked by DLM contention, or whether all of the contended renames are simply about 2.2x slower than the uncontended renames.
I was hoping there was a way to make this test more robust by using the rename stats to see if there are "long pole" renames where max >> average that would indicate a test failure, but even that doesn't appear to be very reliable. In my local testing (single VM, on an otherwise idle system) of an uncontended rename I see times with max > 10x min and max > 4x avg (the stats columns are "stat, count, 'samples', [unit], min, max, sum", so avg = sum/100 can easily be calculated visually):
Comparatively, running this on my real client+server hardware shows much closer times between min/max/avg, with max=1.1-1.3x avg:
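A stats-based check along those lines would look something like the sketch below (untested; the llite.*.stats parameter name, the awk field positions, and the 4x threshold are assumptions based on the generic "stat count samples [unit] min max sum" format described above, not existing sanity.sh code):

check_rename_outliers() {
	# Pick out the "rename" line from the client-side stats; assumed format:
	#   rename <count> samples [usec] <min> <max> <sum>
	lctl get_param -n llite.*.stats | awk '
		$1 == "rename" {
			count = $2; min = $5; max = $6; sum = $7
			avg = (count > 0) ? sum / count : 0
			printf "rename: count=%d min=%d max=%d avg=%.1f\n",
				count, min, max, avg
			# "long pole" rename: max far above the average
			if (avg > 0 && max > 4 * avg)
				bad = 1
		}
		END { exit bad ? 1 : 0 }'
}

# usage (illustrative): only flag the run when an outlier rename was seen
# check_rename_outliers || echo "long-pole rename detected"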
Lai, since you wrote patch https://review.whamcloud.com/32738 and this test originally, do you have any suggestions for Vitaliy on how to make it more reliable (e.g. grep something from the client or MDS debug logs to detect DLM lock cancel/contention)?
Otherwise, if you don't have any immediate suggestions, I don't think this test can produce reliable results when run in a VM, and it should be changed to use error_not_in_vm instead of error.
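For reference, the kind of change I mean is just swapping the failure call at the end of test_415, roughly as follows (a sketch only; the variable names are illustrative, not the actual sanity.sh code):

	# error() always fails the subtest; error_not_in_vm() (from
	# test-framework.sh) fails on real hardware but only reports the
	# problem without failing the subtest when running inside a VM.
	(( rename_time <= max_time )) ||
		error_not_in_vm "rename took $rename_time sec"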