
[LU-11170] sanity test 415 fails with 'rename took N > M sec'

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.16.0
    • Affects Version/s: Lustre 2.12.0, Lustre 2.13.0, Lustre 2.12.3, Lustre 2.14.0, Lustre 2.12.6, Lustre 2.15.0
    • Severity: 3

    Description

      This issue was created by maloo for James Nunez <james.a.nunez@intel.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/98efeea0-7f0b-11e8-8fe6-52540065bddc

      sanity test 415 fails for DNE on ZFS with the error:

      total: 500 open/close in 0.87 seconds: 572.35 ops/second
      rename 500 files took 283 sec
       sanity test_415: @@@@@@ FAIL: rename took 283 sec 
      

      So far, this test only fails for ZFS.

      This test started failing on 2018-07-03 with logs at https://testing.whamcloud.com/test_sets/98efeea0-7f0b-11e8-8fe6-52540065bddc

      Other test failures at
      https://testing.whamcloud.com/test_sets/8de6a208-8945-11e8-9028-52540065bddc
      https://testing.whamcloud.com/test_sets/9bb5e252-8e9c-11e8-b0aa-52540065bddc
      https://testing.whamcloud.com/test_sets/029c73ea-8e9e-11e8-87f3-52540065bddc

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      sanity test_415 - rename took 283 sec
      sanity test_415 - rename took 154 > 125 sec

    Activity

            pjones Peter Jones added a comment -

            Seems to be merged for 2.16


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50654/
            Subject: LU-11170 tests: don't fail sanity/415 in VM
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 73a7b1c2a3f0114618db7781adb56974ed682f24

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50654/ Subject: LU-11170 tests: don't fail sanity/415 in VM Project: fs/lustre-release Branch: master Current Patch Set: Commit: 73a7b1c2a3f0114618db7781adb56974ed682f24

            "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50654
            Subject: LU-11170 tests: don't fail sanity/415 in VM
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 01b1f0ec789313c077e3a6aff64ef145efc4d9f4

            gerrit Gerrit Updater added a comment - "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50654 Subject: LU-11170 tests: don't fail sanity/415 in VM Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 01b1f0ec789313c077e3a6aff64ef145efc4d9f4

            adilger Andreas Dilger added a comment -

            With the added debugging and changes from my previous patch, it looks like this issue is not simply that some VMs do renames more slowly than others: some test runs pass after almost 500s, while others fail in 130s.

            It seems possible that variable VM performance between the "uncontended" rename pass and the "contended" rename pass is causing the failures. Looking at recent test results in Maloo, most subtest runs take 20-50s to finish, but some of the passing "uncontended" runs take longer to complete than some of the failures, which points to VM performance and/or external contention:

            rename 500 files without 'touch' took 173 sec
            rename 500 files with 'touch' took 302 sec
            :
            rename 500 files without 'touch' took 181 sec
            rename 500 files with 'touch' took 287 sec
            :
            rename 500 files without 'touch' took 137 sec
            rename 500 files with 'touch' took 204 sec
            :
            rename 500 files without 'touch' took 58 sec
            rename 500 files with 'touch' took 77 sec
            

            The main question this test should be answering is whether a single contended rename is slow because it is blocked by DLM contention, or whether all of the contended renames are uniformly about 2.2x slower than the uncontended renames.
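
            As an illustration of the ratio idea, a minimal bash sketch follows; the variable names time_uncontended and time_contended are hypothetical placeholders for the two measured durations in seconds (not names from sanity.sh), and the 4x threshold is arbitrary:

            # hedged sketch: fail only if the contended ("with touch") pass is
            # much slower than the uncontended baseline measured in the same VM,
            # instead of comparing against a fixed wall-clock limit
            max_ratio=4
            if (( time_contended > time_uncontended * max_ratio )); then
                    error "contended rename ${time_contended}s > ${max_ratio}x uncontended ${time_uncontended}s"
            fi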

            I was hoping there was a way to make this test more robust by using the rename stats to see whether there are "long pole" renames where max >> average, which would indicate a real failure, but even that doesn't appear to be very reliable. In my local testing (single VM, otherwise idle system) of an uncontended rename I see times with max > 10x min and max > 4x avg (the stats columns are "stat, count, 'samples', [unit], min, max, sum", so avg=sum/100 can easily be calculated visually):

            # mkdir /mnt/testfs/dir; touch /mnt/testfs/dir/f{1..100}
            # lctl set_param llite.*.stats=clear; (cd /mnt/testfs/dir; rename f g f[1-9]*); lctl get_param llite.*.stats | grep -w rename
            rename                    100 samples [usec] 1289 15669 305488
            rename                    100 samples [usec] 1194 17080 443209
            rename                    100 samples [usec] 1393 17190 445666
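
            For reference, a small sketch of that max-vs-average check; it assumes the column layout quoted above (name, count, 'samples', [unit], min, max, sum) and uses an arbitrary 4x threshold:

            # hedged sketch: flag "long pole" renames where max greatly exceeds
            # the average; fields: name count "samples" [unit] min max sum
            lctl get_param -n llite.*.stats | awk '/^rename / {
                    count = $2; min = $5; max = $6; sum = $7;
                    avg = sum / count;
                    printf "min=%d max=%d avg=%.0f usec (max/avg = %.1f)\n", min, max, avg, max / avg;
                    if (max > 4 * avg)
                            print "WARNING: long-pole rename detected";
            }'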
            

            Comparatively, running this on my real client+server hardware shows much closer times between min/max/avg, with max=1.1-1.3x avg:

            rename                    100 samples [usec] 556 839 65490 43136492
            rename                    100 samples [usec] 466 849 68130 46780816
            rename                    100 samples [usec] 591 826 68434 47090098
            rename                    100 samples [usec] 633 801 72962 53361444
            rename                    100 samples [usec] 615 816 73353 53967167
            

            Lai, since you originally wrote patch https://review.whamcloud.com/32738 and this test, do you have any suggestions for Vitaliy on how to make it more reliable (e.g. grepping something from the client or MDS debug logs to detect DLM lock cancellation/contention)?

            Otherwise, if you don't have any immediate suggestions, I don't think this test can produce reliable results when run in a VM, and it should be changed to use error_not_in_vm instead of error.
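
            A minimal sketch of that change, assuming the check compares a measured $duration against a $max_time limit; both names are placeholders, not necessarily the variables used in sanity.sh:

            # hedged sketch: error_not_in_vm() from test-framework.sh reports the
            # failure but does not fail the subtest when running inside a VM
            (( duration <= max_time )) ||
                    error_not_in_vm "rename took $duration > $max_time sec"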

            ssmirnov Serguei Smirnov added a comment - +1 on master: https://testing.whamcloud.com/test_sets/d8e1c14f-849d-46a5-b422-c77955590561

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49724/
            Subject: LU-11170 tests: add debugging to sanity/415
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 6594babc73851fab335c514cd1fee018425e7bb3

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49724/ Subject: LU-11170 tests: add debugging to sanity/415 Project: fs/lustre-release Branch: master Current Patch Set: Commit: 6594babc73851fab335c514cd1fee018425e7bb3
            qian_wc Qian Yingjin added a comment - +1 on master: https://testing.whamcloud.com/test_sets/c37adf87-d38a-4a38-afa4-effadab56dd5
            scherementsev Sergey Cheremencev added a comment - +1 master https://testing.whamcloud.com/test_sets/55564f7c-ab54-4536-a2e5-1ec9777a2363

            "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49724
            Subject: LU-11170 tests: add debugging to sanity/415
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: e33e8906febd161a89bb190648e74dcc3b7414cb

            gerrit Gerrit Updater added a comment - "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49724 Subject: LU-11170 tests: add debugging to sanity/415 Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: e33e8906febd161a89bb190648e74dcc3b7414cb

            adilger Andreas Dilger added a comment -

            Still hitting this in about 1-3% of runs during normal testing. Failures usually complete within 140s, with one taking as long as 269s.

            The fact that ZFS is also very slow means that the failures may actually be caused by a regression in the Commit-on-Share (COS) code, because commits on ZFS can take a long time. I'm going to push a patch to improve the debuggability of this test; hopefully we can identify the source of the problem.
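
            If COS is the suspect, one quick check, sketched below, would be to temporarily disable commit-on-share on the MDS and see whether the slow renames disappear; the mdt.*.commit_on_sharing parameter name is assumed here, not taken from this ticket:

            # hedged sketch, run on the MDS: inspect and temporarily disable
            # Commit-on-Share, rerun sanity.sh test_415, then restore the setting
            lctl get_param mdt.*.commit_on_sharing
            lctl set_param mdt.*.commit_on_sharing=0
            # ... rerun the test here ...
            lctl set_param mdt.*.commit_on_sharing=1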

            People

              Assignee: vkuznetsov Vitaliy Kuznetsov
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 18
