[LU-11170] sanity test 415 fails with 'rename took N > M sec' | Created: 24/Jul/18 | Updated: 12/May/23 |
|
| Status: | Reopened |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.0, Lustre 2.13.0, Lustre 2.12.3, Lustre 2.14.0, Lustre 2.12.6, Lustre 2.15.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Maloo | Assignee: | Vitaliy Kuznetsov |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | DNE, always_except, zfs |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for James Nunez <james.a.nunez@intel.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/98efeea0-7f0b-11e8-8fe6-52540065bddc

sanity test 415 fails for DNE with ZFS with the error:

total: 500 open/close in 0.87 seconds: 572.35 ops/second
rename 500 files took 283 sec
 sanity test_415: @@@@@@ FAIL: rename took 283 sec

So far, this test only fails for ZFS. This test started failing on 2018-07-03 with logs at https://testing.whamcloud.com/test_sets/98efeea0-7f0b-11e8-8fe6-52540065bddc |
| Comments |
| Comment by Lai Siyao [ 27/Jul/18 ] |
|
This is caused by |
| Comment by James Nunez (Inactive) [ 03/Aug/18 ] |
|
As Lai said, this test is failing only for the patch that adds it. So this ticket was opened prematurely; closing it. |
| Comment by James Nunez (Inactive) [ 03/Aug/18 ] |
|
Reopening to test out async ZFS patches. |
| Comment by Andreas Dilger [ 03/Aug/18 ] |
|
We should consider disabling sync on ZFS using the tunables from patch https://review.whamcloud.com/7761 :

"Two new tunables are introduced to control the delay, the osd_object_sync_delay_us and osd_txg_sync_delay_us module options. These values default to -1 which preserves the safe full sync pool behavior. Setting these values to zero or larger will replace the pool sync with a delay of N microseconds."
|
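For illustration only, a sketch of setting these options at module load time (the option names come from the patch text quoted above; the file name is hypothetical, and it is assumed the ZFS OSD module is loaded as osd_zfs):

# /etc/modprobe.d/osd_zfs.conf (hypothetical): replace the full pool sync
# with a zero-microsecond delay; the default of -1 keeps the safe sync behavior
options osd_zfs osd_object_sync_delay_us=0 osd_txg_sync_delay_us=0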
| Comment by Andreas Dilger [ 03/Aug/18 ] |
|
We should set osd-zfs.*.osd_object_sync_delay_us=0 during this test to see if it reduces the runtime to a reasonable level. It may be worthwhile to submit a separate patch to set this in test-framework.sh for all ZFS targets to see if it significantly reduces the overall test time. We should be able to handle this within the test framework, since most failover testing that we do will sync before any recovery barrier that we impose, and a regular unmount will still flush any uncommitted changes to disk. |
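A rough sketch of what such a per-test override could look like in sanity.sh, assuming the usual test-framework helpers (do_facet, do_nodes, mdts_nodes, stack_trap) and the lctl path as written above; note that a later comment in this ticket finds the parameter is actually a module parameter, so the exact path needs adjusting:

# hypothetical test_415 preamble: disable the object sync delay on all MDTs
# for the duration of the test and restore the saved value on exit
saved=$(do_facet mds1 $LCTL get_param -n osd-zfs.*.osd_object_sync_delay_us)
do_nodes $(mdts_nodes) $LCTL set_param osd-zfs.*.osd_object_sync_delay_us=0
stack_trap "do_nodes $(mdts_nodes) $LCTL set_param \
	osd-zfs.*.osd_object_sync_delay_us=$saved" EXIT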
| Comment by James Nunez (Inactive) [ 07/Aug/18 ] |
|
In the past week, sanity test 415 has failed 100% of the time for review-dne-zfs-part-1. This test does not fail for review-ldiskfs, any of the review-dne-part sessions, or review-zfs. We should land https://review.whamcloud.com/#/c/32933 . |
| Comment by Gerrit Updater [ 07/Aug/18 ] |
|
James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/32957 |
| Comment by Andreas Dilger [ 07/Aug/18 ] |
|
Failing patches will need to be rebased to get the fix for this issue. |
| Comment by James Nunez (Inactive) [ 10/Sep/18 ] |
|
For some reason, we can't find the osd_object_sync_delay_us parameter on the MDS. In the test log https://testing.whamcloud.com/test_sets/1f79aa3c-9c04-11e8-b0aa-52540065bddc, we see:

CMD: onyx-45vm9 /usr/sbin/lctl get_param -n osd-zfs.*.osd_object_sync_delay_us

This parameter landed on master, https://review.whamcloud.com/#/c/7761/, in January and thus should be present in this release, 2.11.53.54. |
| Comment by Jian Yu [ 23/Aug/19 ] |
|
+1 on master branch: https://testing.whamcloud.com/test_sets/7eaab0f8-c5db-11e9-98c8-52540065bddc |
| Comment by Jian Yu [ 24/Aug/19 ] |
|
There were 35 failures last week. This is blocking patch review testing on the master branch. |
| Comment by Andreas Dilger [ 26/Aug/19 ] |
|
It appears that osd_object_sync_delay_us was not added as a Lustre parameter, but rather as a module parameter. So instead of using "lctl set_param" to set this, it is necessary to use something like "echo 0 > /sys/module/osd_zfs/parameters/osd_object_sync_delay_us" to set it. |
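A minimal sketch of applying this across all ZFS servers from the test framework (hypothetical; assumes the sysfs directory is named osd_zfs and the usual comma_list/mdts_nodes/osts_nodes helpers):

# module parameters live under /sys/module, not in the lctl parameter tree
do_nodes $(comma_list $(mdts_nodes) $(osts_nodes)) \
	"echo 0 > /sys/module/osd_zfs/parameters/osd_object_sync_delay_us"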
| Comment by Peter Jones [ 28/Aug/19 ] |
|
Nathaniel, could you please investigate? Thanks, Peter |
| Comment by Gerrit Updater [ 11/Sep/19 ] |
|
James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36157 |
| Comment by Gerrit Updater [ 07/Oct/19 ] |
|
Nathaniel Clark (nclark@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36395 |
| Comment by Chris Horn [ 09/Oct/19 ] |
|
+1 on master https://testing.whamcloud.com/test_sessions/24c45246-045b-4744-81ac-7ebdc79bcf0c |
| Comment by Gerrit Updater [ 16/Oct/19 ] |
|
James Simmons (jsimmons@infradead.org) uploaded a new patch: https://review.whamcloud.com/36463 |
| Comment by Gerrit Updater [ 18/Oct/19 ] |
|
Olaf Faaland-LLNL (faaland1@llnl.gov) uploaded a new patch: https://review.whamcloud.com/36491 |
| Comment by Gerrit Updater [ 21/Oct/19 ] |
|
|
| Comment by Gerrit Updater [ 12/Nov/19 ] |
|
|
| Comment by Emoly Liu [ 26/Nov/19 ] |
|
+1 on master: https://testing.whamcloud.com/test_sets/2c84fba2-0fce-11ea-98f1-52540065bddc |
| Comment by Jian Yu [ 27/Jan/20 ] |
|
More failure instances on master branch: |
| Comment by Chris Horn [ 12/Aug/20 ] |
|
+1 on master https://testing.whamcloud.com/test_sets/df8737c2-e377-4dac-9fd7-f0472a3927ad |
| Comment by James Nunez (Inactive) [ 28/Oct/20 ] |
|
A recent failure for review-dne-zfs-part-1 with el8.2 server and client: https://testing.whamcloud.com/test_sets/e788cbb4-2b21-473d-a127-64dcbc8ceec8 |
| Comment by James Nunez (Inactive) [ 03/Nov/20 ] |
|
I'm seeing this issue for ldiskfs testing: https://testing.whamcloud.com/test_sets/9e51c284-86f5-4042-86b1-2436e537e34f and https://testing.whamcloud.com/test_sets/2febebf4-9666-4f7f-a633-8afc6aeebf22 . Since this ticket is/was for ZFS, should I create a new ticket for the ldiskfs failures? |
| Comment by Serguei Smirnov [ 16/Jul/21 ] |
|
+1 on b2_14: https://testing.whamcloud.com/test_sets/034e37f7-45d5-4a95-b720-46540caa8ccd |
| Comment by Vladimir Saveliev [ 19/Jul/21 ] |
|
+1 on master https://testing.whamcloud.com/test_sets/41f2f737-0a6f-438d-b764-ef446639e144 |
| Comment by Alex Zhuravlev [ 18/Aug/21 ] |
|
+1 on master: https://testing.whamcloud.com/test_sessions/9af65bc1-a8a0-4805-acde-0f77e8d8abce |
| Comment by Artem Blagodarenko (Inactive) [ 18/Nov/21 ] |
|
+1 on master: https://testing.whamcloud.com/test_sets/8a693eb7-f6c9-4e44-b663-58a5297fb9b8 |
| Comment by Artem Blagodarenko (Inactive) [ 30/Nov/21 ] |
|
+1 https://testing.whamcloud.com/test_sets/5999564a-11d5-4543-bffd-74ea8cd2ba60 |
| Comment by Sergey Cheremencev [ 02/Dec/21 ] |
|
+1 https://testing.whamcloud.com/test_sets/1f9a4f4e-e6cd-4029-bf4f-5a1b6c144bd1 (note: this one is not ZFS) |
| Comment by Sergey Cheremencev [ 08/Dec/21 ] |
|
+1 on master https://testing.whamcloud.com/test_sets/993a9a55-1a43-4a47-a976-271c19135ca0 |
| Comment by Serguei Smirnov [ 08/Dec/21 ] |
|
+1 on master https://testing.whamcloud.com/test_sets/03a97eca-b958-4f73-9383-ab34d6143360 |
| Comment by Chris Horn [ 07/Feb/22 ] |
|
+1 on master https://testing.whamcloud.com/test_sets/8ecb13cd-7b09-4f2b-a410-e3c20eb5c244 |
| Comment by Gerrit Updater [ 13/May/22 ] |
|
"John L. Hammond <jhammond@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/47341 |
| Comment by Nikitas Angelinas [ 18/May/22 ] |
|
+1 on master: https://testing.whamcloud.com/test_sets/3adb9c61-eb90-4af3-9e2c-c18365f5dd36 |
| Comment by Gerrit Updater [ 18/May/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47341/ |
| Comment by Peter Jones [ 20/May/22 ] |
|
Tried to avoid again |
| Comment by Andreas Dilger [ 01/Jun/22 ] |
|
+1 on master: |
| Comment by Sarah Liu [ 08/Jun/22 ] |
|
Hit on 2.15 with an aarch64 client. |
| Comment by Alexander Boyko [ 11/Jul/22 ] |
|
master https://testing.whamcloud.com/test_sets/8ff591d9-c8e4-43ab-84d3-357f79417e08 |
| Comment by Andreas Dilger [ 20/Jan/23 ] |
|
Still hitting this in about 1-3% of runs during normal testing. Failing runs usually complete within 140s, though one took as long as 269s. The fact that ZFS is also very slow means that the failures may actually be caused by a regression in the COS code, because commits on ZFS can take a long time. I'm going to push a patch to improve the debuggability of the test; hopefully we can identify the source of the problem. |
| Comment by Gerrit Updater [ 20/Jan/23 ] |
|
"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49724 |
| Comment by Sergey Cheremencev [ 24/Jan/23 ] |
|
+1 master https://testing.whamcloud.com/test_sets/55564f7c-ab54-4536-a2e5-1ec9777a2363 |
| Comment by Qian Yingjin [ 28/Jan/23 ] |
|
+1 on master: |
| Comment by Gerrit Updater [ 31/Jan/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49724/ |
| Comment by Serguei Smirnov [ 29/Mar/23 ] |
|
+1 on master: https://testing.whamcloud.com/test_sets/d8e1c14f-849d-46a5-b422-c77955590561 |
| Comment by Andreas Dilger [ 04/Apr/23 ] |
|
With the added debugging and changes from my previous patch, it looks like this issue is not only because some VMs do renames slower than others. Some tests pass after almost 500s, while others fail in 130s. It seems possible that variable VM performance between the "uncontended" rename test run and the "contended" rename test run is causing failures.

Looking at recent test results in Maloo, most of the subtest runs take between 20-50s to finish, but some of the passing "uncontended" runs take longer to complete than some of the failures, which is due to VM performance and/or external contention:

rename 500 files without 'touch' took 173 sec
rename 500 files with 'touch' took 302 sec
:
rename 500 files without 'touch' took 181 sec
rename 500 files with 'touch' took 287 sec
:
rename 500 files without 'touch' took 137 sec
rename 500 files with 'touch' took 204 sec
:
rename 500 files without 'touch' took 58 sec
rename 500 files with 'touch' took 77 sec

The main question this test should be answering is whether a single contended rename is slow and blocked due to DLM contention, or whether all of the contended renames are about 2.2x slower than the uncontended renames.

I was hoping there was a way to make this test more robust by using the rename stats to see if there are "long pole" renames where max >> average that show a test failure, but even that doesn't appear to be very reliable. In my local testing (single VM, on an otherwise idle system) of an uncontended rename, I see times with max > 10x min, and max > 4x avg (stats columns are "stat, count, 'samples', [unit], min, max, sum", so avg=sum/100 can be easily calculated visually):

# mkdir /mnt/testfs/dir; touch /mnt/testfs/dir/f{1..100}
# lctl set_param llite.*.stats=clear; (cd /mnt/testfs/dir; rename f g f[1-9]*); lctl get_param llite.*.stats | grep -w rename
rename 100 samples [usec] 1289 15669 305488
rename 100 samples [usec] 1194 17080 443209
rename 100 samples [usec] 1393 17190 445666

Comparatively, running this on my real client+server hardware shows much closer times between min/max/avg, with max=1.1-1.3x avg:

rename 100 samples [usec] 556 839 65490 43136492
rename 100 samples [usec] 466 849 68130 46780816
rename 100 samples [usec] 591 826 68434 47090098
rename 100 samples [usec] 633 801 72962 53361444
rename 100 samples [usec] 615 816 73353 53967167

Lai, since you wrote patch https://review.whamcloud.com/32738 and this test originally, do you have any suggestions for Vitaliy on how to make it more reliable (e.g. grep something from the client or MDS debug logs to detect DLM lock cancel/contention)? Otherwise, if you don't have any immediate suggestions, I don't think this test can produce reliable results when run in a VM, and it should be changed to use error_not_in_vm instead of error. |
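For what it's worth, a sketch (not from the ticket) of the max-vs-average check described above, parsing the llite stats line format shown ("stat count 'samples' [unit] min max sum ..."):

# hypothetical long-pole detector: flag renames where max greatly exceeds avg
lctl get_param -n llite.*.stats | awk '/^rename / {
	avg = $7 / $2;    # avg = sum / count
	if ($6 > 4 * avg) # $6 = max latency in usec
		printf "long-pole rename: max %d usec vs avg %.0f usec\n", $6, avg;
}'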
| Comment by Gerrit Updater [ 17/Apr/23 ] |
|
"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50654 |
| Comment by Gerrit Updater [ 22/Apr/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50654/ |