[LU-14824] sanity test_413a: timeout Created: 07/Jul/21  Updated: 22/Dec/23  Resolved: 08/Mar/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0, Lustre 2.15.4

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Lai Siyao
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-14659 sanity test_413a: subdirs shouldn't b... Resolved
is related to LU-16507 sanity test_413a: division by 0 (erro... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for S Buisson <sbuisson@ddn.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/a8d333cd-1b69-4f00-9829-2590702b0c0e

test_413a failed with the following error:

Timeout occurred after 492 mins, last suite running was sanity

The test hangs for an unknown reason; nothing is visible in the console logs of the client or server nodes. The last message in the test log is:

Mkdir (stripe_count 3) roundrobin:

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanity test_413a - Timeout occurred after 492 mins, last suite running was sanity



 Comments   
Comment by Emoly Liu [ 02/Aug/21 ]

+1 on master: https://testing.whamcloud.com/test_sets/9255ef4d-66b3-44ce-aafd-7dd654d0508a

Comment by Lai Siyao [ 03/Aug/21 ]

This appears to happen only on the ZFS backend, and the likely cause is slow striped-directory mkdir. I'll look into the test scripts to see how to improve this.

Comment by Sergey Cheremencev [ 02/Dec/21 ]

+1 on master: https://testing.whamcloud.com/test_sets/014222ff-aefc-4677-8545-f8b4bf0975c2

Comment by Chris Horn [ 02/Dec/21 ]

+1 on master https://testing.whamcloud.com/test_sets/712a4fd4-1460-412c-a436-f648b3e0fc3d

Comment by Cory Spitz [ 06/Dec/21 ]

Proposing for 2.15.0 given the recent activity with master.

Comment by Lai Siyao [ 29/Dec/21 ]

Andreas, all the failures are on ZFS systems, and I don't see anything special in the test logs. This test creates lots of files/directories and unlinks them afterwards; it gets stuck during the unlink phase. Should we just disable this test on ZFS systems?

Comment by Andreas Dilger [ 29/Dec/21 ]

I would prefer not to disable it if possible. I think one option to speed up the test for ZFS is to use larger DoM files, since this also reduces free inodes on a ZFS MDT, and should give the same behavior as creating a large number of inodes, unlike on ldiskfs.
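As a rough illustration of the suggestion above, the sketch below estimates how many files would be needed to consume a given amount of MDT space at different per-file DoM sizes. The 1 GiB target and the file sizes are hypothetical values chosen only to show the scaling; they are not taken from test_413a or from any patch on this ticket.

```python
def files_needed(target_bytes: int, file_size_bytes: int) -> int:
    """Number of files of a given size needed to consume target_bytes (rounded up)."""
    if file_size_bytes <= 0:
        raise ValueError("file_size_bytes must be positive")
    return -(-target_bytes // file_size_bytes)  # ceiling division on integers


KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

# Hypothetical values for illustration only (not from the test script).
TARGET = 1 * GIB
for size in (4 * KIB, 64 * KIB, 1 * MIB):
    print(f"{size // KIB:>5} KiB per file -> {files_needed(TARGET, size):>7} files")
```

Fewer, larger files mean fewer create and unlink operations for the same amount of consumed MDT space, which is the property the comment relies on for ZFS.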

Comment by Gerrit Updater [ 30/Dec/21 ]

"Lai Siyao <lai.siyao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45955
Subject: LU-14824 test: collect debug logs on zfs system
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1d583433b1a4ad23a99ecb85fe4dc6858edaef20

Comment by Andreas Dilger [ 10/Mar/22 ]

+1 on master: https://testing.whamcloud.com/test_sets/3720ae52-c898-40bd-9bb0-f41ea075568c

Currently failing in about 2.5% of all runs, but in 7.5% of ZFS runs.

Comment by Gerrit Updater [ 10/Mar/22 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/46774
Subject: LU-14824 tests: reduce sanity test_413 ZFS test time
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 502cc9bb0ee94537a77e56f3888c25f11f8790a0

Comment by Andreas Dilger [ 10/Mar/22 ]

It is worth noting that patch https://review.whamcloud.com/46734 "LU-15528 mdt: enqueue newly created object locks in TXN mode" and the later patch https://review.whamcloud.com/46733 "LU-15526 mdt: enable remote PDO lock" make sanity test_413a run about 10x faster (~150-170s vs. ~1100-3000s) than on unpatched systems:

https://testing.whamcloud.com/search?server_file_system_type_id=00437f32-318d-11e1-9c6d-5254004bbbd3&test_set_script_id=f9516376-32bc-11e0-aaee-52540025f9ae&sub_test_script_id=44d5fa14-70d0-11e9-a6f2-52540065bddc&start_date=2022-03-07&end_date=2022-03-09&source=sub_tests#redirect
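For reference, the bounds implied by the quoted timings can be worked out directly. The sketch below uses only the two ranges from the comment above; the function name is hypothetical, and the result is simple arithmetic on those ranges, not a new measurement.

```python
from typing import Tuple


def speedup_range(fast: Tuple[float, float], slow: Tuple[float, float]) -> Tuple[float, float]:
    """Return (min, max) speedup given (low, high) runtimes in seconds."""
    fast_lo, fast_hi = fast
    slow_lo, slow_hi = slow
    # Smallest speedup: fastest unpatched run vs. slowest patched run.
    # Largest speedup: slowest unpatched run vs. fastest patched run.
    return slow_lo / fast_hi, slow_hi / fast_lo


patched = (150.0, 170.0)      # seconds, from the comment above
unpatched = (1100.0, 3000.0)  # seconds, from the comment above

low, high = speedup_range(patched, unpatched)
print(f"speedup between {low:.1f}x and {high:.1f}x")
```

With these ranges the speedup falls between roughly 6.5x and 20x, which is consistent with the "about 10x" figure.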

Comment by Nikitas Angelinas [ 18/May/22 ]

+1 on master: https://testing.whamcloud.com/test_sets/46d8f39d-da3a-4039-bd02-d7c90196d7f9

Comment by Gerrit Updater [ 13/Jan/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/45955/
Subject: LU-14824 test: sanity 413a/b unlink timeout
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 5ff3e400f1a74ea49b7eb9cf19715f0fae08c3f5

Comment by Peter Jones [ 13/Jan/23 ]

Landed for 2.16

Comment by Andreas Dilger [ 16/Jan/23 ]

Patch is being reverted due to many timeouts in ldiskfs since landing.

Patch 45955 was pushed and tested on 2022-03-06, but it looks like another patch may have landed in the same code in the meantime and caused the failures.

Comment by Andreas Dilger [ 16/Jan/23 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49646
Subject: LU-14824 Revert "test: sanity 413a/b unlink timeout"
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 46277a6a1affb8b21abb28941ff4b471d2b3bd32

Comment by Gerrit Updater [ 17/Jan/23 ]

"Andreas Dilger <adilger@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49646/
Subject: LU-14824 Revert "test: sanity 413a/b unlink timeout"
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 601ed56575a304c15ccb6d98a252162e64ef95e9

Comment by Gerrit Updater [ 27/Jan/23 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49799
Subject: LU-14824 test: sanity 413a/b unlink timeout v2
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 09dd0a091defa34ed7de2402b64f0e38dbefd9a6

Comment by Gerrit Updater [ 08/Mar/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49799/
Subject: LU-14824 test: sanity 413a/b unlink timeout v2
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 09fc9ccb1acc534b3fb074433dcbcecebb175384

Comment by Gerrit Updater [ 04/Jul/23 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51561
Subject: LU-14824 test: sanity 413a/b unlink timeout v2
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: 6c3655aa62aadbed870cc9a1b8ec559156238aac
