[LU-16198] sanity test_33hh: MDT index match 49/250 times Created: 30/Sep/22  Updated: 15/Oct/22  Resolved: 15/Oct/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Andreas Dilger
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-15720 imbalanced file creation in 'crush' s... Resolved
is related to LU-13481 sanity test_33h: MDT index mismatch 5... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for S Buisson <sbuisson@ddn.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/4a579469-93b2-4c09-9114-b0f1258c2fc9

test_33hh failed with the following error:

MDT index match 49/250 times

Test log is:

== sanity test 33hh: temp file is located on the same MDT as target (crush2) ========================================================== 16:36:28 (1664469388)
MDS1_VERSION=34550793 version_code=34537472
striped dir -i1 -c4 -H crush2 /mnt/lustre/d33hh.sanity
pattern .f33hh.sanity.XXXXXX
/mnt/lustre/d33hh.sanity/.f33hh.sanity.NPJRGZ MDT index mismatch 0 != 2
pattern f33hh.sanity.XXXXXXXX
1/250 MDT index mismatches, expect ~2-4
pattern .f33hh.sanity.XXXXXX
pattern f33hh.sanity.XXXXXXXX
52/250 matches, expect ~62 for crush2
pattern=.f33hh.sanity....XXX
pattern=f33hh.sanity....XXXXX
49/250 matches, expect ~62 for crush2
 sanity test_33hh: @@@@@@ FAIL: MDT index match 49/250 times

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanity test_33hh - MDT index match 49/250 times



 Comments   
Comment by Qian Yingjin [ 30/Sep/22 ]

+1 on master: https://testing.whamcloud.com/test_sets/3fe086d9-2c34-49c2-9413-42383acff1c8

Comment by Andreas Dilger [ 30/Sep/22 ]

This test failure is mostly due to random chance: the filenames created by "mktemp()" are sometimes all digits, all uppercase letters, or all lowercase letters. Statistically it should be rare to get many such names in one run, but because this subtest is run so often (421 runs in the past week), it will occasionally fail at random (0.47%).

Options for fixing it would include:

  • increase the margin of error allowed for the test to pass. Currently the threshold is 80% of the expected number of files per MDT (250 files / 4 MDTs = 62, so 62 * 4/5 = 49 files at the low end, or 62 * 5/4 = 77 files at the high end). In the past 3 months the subtest has failed 20 times, almost all of them with 46/250 or more. One failure was 45/250 and one was 80/250, so using 5/7 = 71% (62 * 5/7 = 44 files, or 62 * 7/5 = 86 files) should avoid virtually all random errors.
  • automatically re-run the subtest if it fails once. This would also reduce the chance of randomly reporting an error from 0.47% = 2/421 to about 1/40000, or less than once every two years at the current rate of ~1500 runs per month.
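
The arithmetic behind both options can be checked with a short sketch. This is illustrative Python, not the code in sanity.sh or in the patch: the function names are hypothetical, the integer division mirrors the rounding used in the figures above, and the retry estimate assumes the two runs fail independently.

```python
from fractions import Fraction


def match_bounds(files, mdts, num, den):
    """Return (expected, low, high) match counts per MDT.

    Hypothetical helper: integer division reproduces the rounding
    in the figures quoted above (for example 250 / 4 = 62).
    """
    expected = files // mdts
    low = expected * num // den    # lower bound, e.g. 62 * 4/5
    high = expected * den // num   # upper bound, e.g. 62 * 5/4
    return expected, low, high


def double_failure_odds(failures, runs):
    """Chance that a run and its retry both fail.

    Assumes the two runs are independent, which is an assumption
    made here and not something measured in this ticket.
    """
    p = Fraction(failures, runs)
    return p * p


if __name__ == "__main__":
    print("4/5 margin:", match_bounds(250, 4, 4, 5))
    print("5/7 margin:", match_bounds(250, 4, 5, 7))
    odds = double_failure_odds(2, 421)
    runs_per_false_failure = 1 / odds
    print("runs per random double failure:", float(runs_per_false_failure))
    print("months at 1500 runs/month:", float(runs_per_false_failure / 1500))
```

With the 4/5 margin the bounds come out as 49 and 77 files, and with the 5/7 margin as 44 and 86 files, matching the numbers above. Under the independence assumption a double failure is expected roughly once in 44000 runs, which is the same order as the 1/40000 estimate.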
Comment by Gerrit Updater [ 01/Oct/22 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/48713
Subject: LU-16198 tests: increase margin for sanity/33hh
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 0efdd531c2896812c430e0ca623ef67fb2002ca1

Comment by Gerrit Updater [ 15/Oct/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48713/
Subject: LU-16198 tests: increase margin for sanity/33hh
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: e17471792388e59f44040d48dd8138ec865663af

Comment by Andreas Dilger [ 15/Oct/22 ]

Landed for 2.16

Generated at Sat Feb 10 03:24:53 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.