[LU-11170] sanity test 415 fails with 'rename took N > M sec' Created: 24/Jul/18  Updated: 12/May/23

Status: Reopened
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0, Lustre 2.13.0, Lustre 2.12.3, Lustre 2.14.0, Lustre 2.12.6, Lustre 2.15.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Maloo Assignee: Vitaliy Kuznetsov
Resolution: Unresolved Votes: 0
Labels: DNE, always_except, zfs

Issue Links:
Blocker
is blocking LU-12336 Update ZFS Version to 0.8.2 Resolved
Duplicate
is duplicated by LU-11225 sanity test_415: rename took 176 sec Resolved
is duplicated by LU-12697 sanity test_415: rename took 133 sec Resolved
is duplicated by LU-14103 sanity test_415: sanity test_415: @@@... Resolved
is duplicated by LU-16237 sanity test_415: rename took 130 sec Resolved
is duplicated by LU-12843 sanity test_415: rename took 396 sec Closed
Related
is related to LU-11102 lock revoke may not take effect Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for James Nunez <james.a.nunez@intel.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/98efeea0-7f0b-11e8-8fe6-52540065bddc

sanity test 415 fails for DNE with ZFS with the error

total: 500 open/close in 0.87 seconds: 572.35 ops/second
rename 500 files took 283 sec
 sanity test_415: @@@@@@ FAIL: rename took 283 sec 

So far, this test only fails for ZFS.

This test started failing on 2018-07-03 with logs at https://testing.whamcloud.com/test_sets/98efeea0-7f0b-11e8-8fe6-52540065bddc

Other test failures at
https://testing.whamcloud.com/test_sets/8de6a208-8945-11e8-9028-52540065bddc
https://testing.whamcloud.com/test_sets/9bb5e252-8e9c-11e8-b0aa-52540065bddc
https://testing.whamcloud.com/test_sets/029c73ea-8e9e-11e8-87f3-52540065bddc

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanity test_415 - rename took 283 sec
sanity test_415 - rename took 154 > 125 sec



 Comments   
Comment by Lai Siyao [ 27/Jul/18 ]

This is caused by LU-11102, but the patch https://review.whamcloud.com/#/c/32738/ has not landed yet.

Comment by James Nunez (Inactive) [ 03/Aug/18 ]

As Lai said, this test is failing only for the patch that adds it, so this ticket was opened prematurely. Closing the ticket.

Comment by James Nunez (Inactive) [ 03/Aug/18 ]

Reopening to test out async ZFS patches.

Comment by Andreas Dilger [ 03/Aug/18 ]

We should consider disabling sync on ZFS using the tunables from patch https://review.whamcloud.com/7761 "LU-10460 osd-zfs: Add tunables to disable sync":

    Two new tunables are introduced to control the delay, the
    osd_object_sync_delay_us and osd_txg_sync_delay_us module options.
    These values default to -1 which preserves the safe full sync
    pool behavior.  Setting these values to zero or larger will
    replace the pool sync with a delay of N microseconds.
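For reference, a minimal sketch of how these module options might be applied on a ZFS MDS/OSS node. The option names come from the patch description above; the module name osd_zfs for modprobe purposes is an assumption, and the zero values are illustrative only:

    # illustrative /etc/modprobe.d/osd_zfs.conf entry (takes effect at next module load)
    options osd_zfs osd_object_sync_delay_us=0 osd_txg_sync_delay_us=0
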
Comment by Andreas Dilger [ 03/Aug/18 ]

We should set osd-zfs.*.osd_object_sync_delay_us=0 during this test to see if it reduces the runtime to a reasonable level.

It may be worthwhile to submit a separate patch to set this in test-framework.sh for all ZFS targets to see if it significantly reduces the overall test time. We should be able to handle this within the test framework, since most failover testing that we do will sync before any recovery barrier that we impose, and a regular unmount will still flush any uncommitted changes to disk.

Comment by James Nunez (Inactive) [ 07/Aug/18 ]

Over the past week, sanity test 415 has failed 100% of the time for review-dne-zfs-part-1.

This test does not fail for review-ldiskfs, any of the review-dne-part sessions, or review-zfs.

We should land https://review.whamcloud.com/#/c/32933 .

Comment by Gerrit Updater [ 07/Aug/18 ]

James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/32957
Subject: LU-11170 tests: add delay sync for sanity test 415
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d5ec750ab15e8fe7d051a32bb486c358ce2970a1

Comment by Andreas Dilger [ 07/Aug/18 ]

Failing patches will need to be rebased to get the fix for this issue.

Comment by James Nunez (Inactive) [ 10/Sep/18 ]

For some reason, we can't find the osd_object_sync_delay_us parameter on the MDS. In the test log https://testing.whamcloud.com/test_sets/1f79aa3c-9c04-11e8-b0aa-52540065bddc, we see:

CMD: onyx-45vm9 /usr/sbin/lctl get_param -n osd-zfs.*.osd_object_sync_delay_us
onyx-45vm9: error: get_param: param_path 'osd-zfs/*/osd_object_sync_delay_us': No such file or directory
CMD: onyx-45vm10,onyx-45vm9 /usr/sbin/lctl set_param osd-zfs.*.osd_object_sync_delay_us=0
onyx-45vm9: error: set_param: param_path 'osd-zfs/*/osd_object_sync_delay_us': No such file or directory
onyx-45vm10: error: set_param: param_path 'osd-zfs/*/osd_object_sync_delay_us': No such file or directory

This parameter landed on master, https://review.whamcloud.com/#/c/7761/, in January and thus should be present in this release, 2.11.53.54.

Comment by Jian Yu [ 23/Aug/19 ]

+1 on master branch: https://testing.whamcloud.com/test_sets/7eaab0f8-c5db-11e9-98c8-52540065bddc

Comment by Jian Yu [ 24/Aug/19 ]

There were 35 failures last week. This is blocking patch review testing on master branch.

Comment by Andreas Dilger [ 26/Aug/19 ]

For some reason, we can't find the osd_object_sync_delay_us parameter on the MDS. In the test log https://testing.whamcloud.com/test_sets/1f79aa3c-9c04-11e8-b0aa-52540065bddc, we see:

CMD: onyx-45vm9 /usr/sbin/lctl get_param -n osd-zfs.*.osd_object_sync_delay_us
onyx-45vm9: error: get_param: param_path 'osd-zfs/*/osd_object_sync_delay_us': No such file or directory
CMD: onyx-45vm10,onyx-45vm9 /usr/sbin/lctl set_param osd-zfs.*.osd_object_sync_delay_us=0
onyx-45vm9: error: set_param: param_path 'osd-zfs/*/osd_object_sync_delay_us': No such file or directory
onyx-45vm10: error: set_param: param_path 'osd-zfs/*/osd_object_sync_delay_us': No such file or directory

This parameter landed on master, https://review.whamcloud.com/#/c/7761/, in January and thus should be present in this release, 2.11.53.54.

It appears that osd_object_sync_delay_us was not added as a Lustre parameter, but rather as a module parameter. So instead of setting it with "lctl set_param", it is necessary to use something like "echo 0 > /sys/module/osd_zfs/parameters/osd_object_sync_delay_us".
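
A hedged sketch of how a test could override this on the MDS nodes, given that it is a module parameter rather than an lctl parameter (do_facet, do_nodes, mdts_nodes and stack_trap are the usual test-framework.sh helpers; the save/override/restore pattern here is illustrative, not the actual test_415 code):

    # inside a test function: save the current value, force zero delay, restore on exit
    local param=/sys/module/osd_zfs/parameters/osd_object_sync_delay_us
    local old=$(do_facet mds1 "cat $param")
    do_nodes $(mdts_nodes) "echo 0 > $param"
    stack_trap "do_nodes $(mdts_nodes) 'echo $old > $param'" EXIT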

Comment by Peter Jones [ 28/Aug/19 ]

Nathaniel

Could you please investigate?

Thanks

Peter

Comment by Gerrit Updater [ 11/Sep/19 ]

James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36157
Subject: LU-11170 tests: stop running sanity 415 for ZFS
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 68d374f4f914403d27712acbee9bf8293a2a3731

Comment by Gerrit Updater [ 07/Oct/19 ]

Nathaniel Clark (nclark@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36395
Subject: LU-11170 test: Skip sanity/415 for ZFS 0.8.0+
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e3584f574441185dac6cbd0f306597a6713ede26

Comment by Chris Horn [ 09/Oct/19 ]

+1 on master https://testing.whamcloud.com/test_sessions/24c45246-045b-4744-81ac-7ebdc79bcf0c

Comment by Gerrit Updater [ 16/Oct/19 ]

James Simmons (jsimmons@infradead.org) uploaded a new patch: https://review.whamcloud.com/36463
Subject: LU-11170 test: try zfs tunable
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 016f2edb1e00232ddd7366d6a54482b2b5025c29

Comment by Gerrit Updater [ 18/Oct/19 ]

Olaf Faaland-LLNL (faaland1@llnl.gov) uploaded a new patch: https://review.whamcloud.com/36491
Subject: LU-11170 osd-zfs: test ZFS version update to 0.8.2
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 3c59355d08d0aa7a64e1caf534567cd896966dc2

Comment by Gerrit Updater [ 21/Oct/19 ]

Olaf Faaland-LLNL (faaland1@llnl.gov) uploaded a new patch: https://review.whamcloud.com/36525
Subject: LU-11170 osd-zfs: gather data ZFS version update to 0.8.2
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 8f04a216c726881a64887dce3c18d7124c3e90d1

Comment by Gerrit Updater [ 12/Nov/19 ]

Olaf Faaland-LLNL (faaland1@llnl.gov) uploaded a new patch: https://review.whamcloud.com/36742
Subject: LU-11170 osd-zfs: test ZFS version update to 0.8.2
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: de47eb0b013af40a36505a331be912c235bf8a62

Comment by Emoly Liu [ 26/Nov/19 ]

+1 on master: https://testing.whamcloud.com/test_sets/2c84fba2-0fce-11ea-98f1-52540065bddc

Comment by Jian Yu [ 27/Jan/20 ]

More failure instances on master branch:
https://testing.whamcloud.com/test_sets/86e61e30-4121-11ea-ac52-52540065bddc
https://testing.whamcloud.com/test_sets/dcdc6f56-40b8-11ea-83b6-52540065bddc

Comment by Chris Horn [ 12/Aug/20 ]

+1 on master https://testing.whamcloud.com/test_sets/df8737c2-e377-4dac-9fd7-f0472a3927ad

Comment by James Nunez (Inactive) [ 28/Oct/20 ]

A recent failure for review-dne-zfs-part-1 for el8.2 server and client at https://testing.whamcloud.com/test_sets/e788cbb4-2b21-473d-a127-64dcbc8ceec8 .

Comment by James Nunez (Inactive) [ 03/Nov/20 ]

I'm seeing this issue for ldiskfs testing: https://testing.whamcloud.com/test_sets/9e51c284-86f5-4042-86b1-2436e537e34f and https://testing.whamcloud.com/test_sets/2febebf4-9666-4f7f-a633-8afc6aeebf22. Since this ticket is/was for ZFS, should I create a new ticket for the ldiskfs failures?

Comment by Serguei Smirnov [ 16/Jul/21 ]

+1 on b2_14: https://testing.whamcloud.com/test_sets/034e37f7-45d5-4a95-b720-46540caa8ccd

Comment by Vladimir Saveliev [ 19/Jul/21 ]

+1 on master https://testing.whamcloud.com/test_sets/41f2f737-0a6f-438d-b764-ef446639e144

Comment by Alex Zhuravlev [ 18/Aug/21 ]

+1 on master: https://testing.whamcloud.com/test_sessions/9af65bc1-a8a0-4805-acde-0f77e8d8abce

Comment by Artem Blagodarenko (Inactive) [ 18/Nov/21 ]

+1 on master:

https://testing.whamcloud.com/test_sets/8a693eb7-f6c9-4e44-b663-58a5297fb9b8

Comment by Artem Blagodarenko (Inactive) [ 30/Nov/21 ]

+1 https://testing.whamcloud.com/test_sets/5999564a-11d5-4543-bffd-74ea8cd2ba60

Comment by Sergey Cheremencev [ 02/Dec/21 ]

+1 https://testing.whamcloud.com/test_sets/1f9a4f4e-e6cd-4029-bf4f-5a1b6c144bd1

note, it is not ZFS

Comment by Sergey Cheremencev [ 08/Dec/21 ]

+1 on master https://testing.whamcloud.com/test_sets/993a9a55-1a43-4a47-a976-271c19135ca0

Comment by Serguei Smirnov [ 08/Dec/21 ]

+1 on master https://testing.whamcloud.com/test_sets/03a97eca-b958-4f73-9383-ab34d6143360

Comment by Chris Horn [ 07/Feb/22 ]

+1 on master https://testing.whamcloud.com/test_sets/8ecb13cd-7b09-4f2b-a410-e3c20eb5c244

Comment by Gerrit Updater [ 13/May/22 ]

"John L. Hammond <jhammond@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/47341
Subject: LU-11170 test: increase time limit in sanity test_415
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e0700d52edf88f2146a1b5380970e6c2cfc55573

Comment by Nikitas Angelinas [ 18/May/22 ]

+1 on master: https://testing.whamcloud.com/test_sets/3adb9c61-eb90-4af3-9e2c-c18365f5dd36

Comment by Gerrit Updater [ 18/May/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47341/
Subject: LU-11170 test: increase time limit in sanity test_415
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: b974d7d8a9759601b6f851895ee32714ccf56f28

Comment by Peter Jones [ 20/May/22 ]

Tried to avoid again

Comment by Andreas Dilger [ 01/Jun/22 ]

+1 on master:
https://testing.whamcloud.com/test_sets/2e6c4a29-6de8-4e85-bc7c-a986208af412

Comment by Sarah Liu [ 08/Jun/22 ]

on 2.15 aarch64 client
https://testing.whamcloud.com/test_sets/771da1f7-4616-4a2a-b3fc-d0028e98a485

Comment by Alexander Boyko [ 11/Jul/22 ]

master

https://testing.whamcloud.com/test_sets/8ff591d9-c8e4-43ab-84d3-357f79417e08

Comment by Andreas Dilger [ 20/Jan/23 ]

Still hitting this about 1-3% of runs during normal testing. Failures usually complete within 140s, one as long as 269s.

The fact that ZFS is also very slow means that the failures may actually be caused by a regression in the COS code, because commits on ZFS can take a long time. I'm going to push a patch to improve the debuggability of the test; hopefully we can identify the source of the problem.

Comment by Gerrit Updater [ 20/Jan/23 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49724
Subject: LU-11170 tests: add debugging to sanity/415
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e33e8906febd161a89bb190648e74dcc3b7414cb

Comment by Sergey Cheremencev [ 24/Jan/23 ]

+1 master

https://testing.whamcloud.com/test_sets/55564f7c-ab54-4536-a2e5-1ec9777a2363

Comment by Qian Yingjin [ 28/Jan/23 ]

+1 on master:
https://testing.whamcloud.com/test_sets/c37adf87-d38a-4a38-afa4-effadab56dd5

Comment by Gerrit Updater [ 31/Jan/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49724/
Subject: LU-11170 tests: add debugging to sanity/415
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 6594babc73851fab335c514cd1fee018425e7bb3

Comment by Serguei Smirnov [ 29/Mar/23 ]

+1 on master: https://testing.whamcloud.com/test_sets/d8e1c14f-849d-46a5-b422-c77955590561

Comment by Andreas Dilger [ 04/Apr/23 ]

With the added debugging and changes from my previous patch, it looks like this issue is not simply that some VMs do renames more slowly than others: some test runs pass after almost 500s, while others fail in 130s.

It seems possible that variable VM performance between the "uncontended" rename test run and the "contended" rename test run is causing the failures. Looking at recent test results in Maloo, most of the subtest runs take between 20 and 50s to finish, but some of the passing "uncontended" runs take longer to complete than some of the failures, which is due to VM performance and/or external contention:

rename 500 files without 'touch' took 173 sec
rename 500 files with 'touch' took 302 sec
:
rename 500 files without 'touch' took 181 sec
rename 500 files with 'touch' took 287 sec
:
rename 500 files without 'touch' took 137 sec
rename 500 files with 'touch' took 204 sec
:
rename 500 files without 'touch' took 58 sec
rename 500 files with 'touch' took 77 sec

The main question this test should be answering is whether a single contended rename is slow and blocked due to DLM contention, or whether all of the contended renames are about 2.2x slower than the uncontended renames.

I was hoping there was a way to make this test more robust by using the rename stats to detect "long pole" renames, where max >> average would indicate a test failure, but even that doesn't appear to be very reliable. In my local testing (single VM, on an otherwise idle system) of an uncontended rename, I see times with max > 10x min and max > 4x avg (the stats columns are "stat, count, 'samples', [unit], min, max, sum", so avg=sum/100 can easily be calculated visually):

# mkdir /mnt/testfs/dir; touch /mnt/testfs/dir/f{1..100}
# lctl set_param llite.*.stats=clear; (cd /mnt/testfs/dir; rename f g f[1-9]*); lctl get_param llite.*.stats | grep -w rename
rename                    100 samples [usec] 1289 15669 305488
rename                    100 samples [usec] 1194 17080 443209
rename                    100 samples [usec] 1393 17190 445666

Comparatively, running this on my real client+server hardware shows much closer times between min/max/avg, with max=1.1-1.3x avg:

rename                    100 samples [usec] 556 839 65490 43136492
rename                    100 samples [usec] 466 849 68130 46780816
rename                    100 samples [usec] 591 826 68434 47090098
rename                    100 samples [usec] 633 801 72962 53361444
rename                    100 samples [usec] 615 816 73353 53967167
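
As an illustration only, the max-vs-average check described above could be scripted from the llite stats line roughly like this (not part of the test; the 4x threshold is the one mentioned above and purely illustrative):

    # parse "rename <count> samples [usec] <min> <max> <sum>" and flag long-pole renames
    lctl get_param -n llite.*.stats | awk '/^rename / {
            count = $2; min = $5; max = $6; sum = $7; avg = sum / count
            printf "rename: min=%d max=%d avg=%.1f max/avg=%.1f\n", min, max, avg, max / avg
            if (max > 4 * avg) print "long-pole rename detected (max > 4x avg)"
    }'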

Lai, since you wrote patch https://review.whamcloud.com/32738 and this test originally, do you have any suggestions for Vitaliy on how to make it more reliable (e.g. grep something from the client or MDS debug logs to detect DLM lock cancel/contention)?

Otherwise, if you don't have any immediate suggestions, I don't think this test can produce reliable results when run in a VM, and it should be changed to use error_not_in_vm instead of error.
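
For context, the suggested change amounts to something like the following in the test's failure path (a sketch only; error_not_in_vm is the existing test-framework.sh helper that, as I understand it, only fails the test when not running in a VM, and the variable names here are illustrative):

    # hypothetical failure check in test_415
    (( duration <= limit )) ||
            error_not_in_vm "rename took $duration > $limit sec"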

Comment by Gerrit Updater [ 17/Apr/23 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50654
Subject: LU-11170 tests: don't fail sanity/415 in VM
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 01b1f0ec789313c077e3a6aff64ef145efc4d9f4

Comment by Gerrit Updater [ 22/Apr/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50654/
Subject: LU-11170 tests: don't fail sanity/415 in VM
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 73a7b1c2a3f0114618db7781adb56974ed682f24

Generated at Sat Feb 10 02:41:33 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.