[LU-10754] sanityn test 47b fails with 'create must fail' Created: 02/Mar/18  Updated: 26/Oct/21  Resolved: 22/May/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0, Lustre 2.12.2
Fix Version/s: Lustre 2.13.0

Type: Bug Priority: Major
Reporter: James Nunez (Inactive) Assignee: Patrick Farrell (Inactive)
Resolution: Fixed Votes: 0
Labels: dne, zfs

Issue Links:
Duplicate
is duplicated by LU-13097 sanityn test_47b: create must fail Resolved
Related
is related to LU-12470 sanityn test_47b: create isn't blocked Resolved
is related to LU-2233 sanityn/40-46 takes ~800-900s Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

Description

sanityn test_47b fails for DNE/ZFS testing. From the client test_log, we see

== sanityn test 47b: pdirops: remote mkdir vs create ================================================= 09:50:51 (1519725051)
CMD: onyx-32vm9 lctl set_param fail_loc=0x80000145
fail_loc=0x80000145
 sanityn test_47b: @@@@@@ FAIL: create must fail 
lfs mkdir: error on LL_IOC_LMV_SETSTRIPE '/mnt/lustre/f47b.sanityn' (3): stripe already set
lfs setdirstripe: cannot create stripe dir '/mnt/lustre/f47b.sanityn': File exists

This test started failing on 2018-01-19.

The console and dmesg logs don't contain any more information about this failure than what is in the suite_log.

Logs for test sessions with this failure are at
https://testing.hpdd.intel.com/test_sets/b9932fbe-fd24-11e7-bd00-52540065bddc
https://testing.hpdd.intel.com/test_sets/316d208c-1bb4-11e8-a7cd-52540065bddc
https://testing.hpdd.intel.com/test_sets/5ce20dce-1905-11e8-a6ad-52540065bddc



Comments
Comment by Alex Zhuravlev [ 28/Jan/19 ]

https://testing.whamcloud.com/test_sessions/b9d6b421-a589-4a3c-8daa-a9e2e08e5b31

Comment by Patrick Farrell (Inactive) [ 31/Jan/19 ]

https://testing.whamcloud.com/test_sessions/bc2f83f6-83cc-4baa-8854-25a3f29ecb0c

Comment by Patrick Farrell (Inactive) [ 31/Jan/19 ]

What I learned after looking at these logs for a bit is that the DNE2 protocol is complex and I don't understand it, but also that we are hanging such that the lfs mkdir does not run until after the multiop has started:

$LFS mkdir -i 1 $DIR1/$tfile &
PID1=$!
sleep 1
multiop $DIR2/$tfile oO_CREAT:O_EXCL:c && error "create must fail"

The lfs mkdir hangs waiting on a lock (not the intended PDO lock, I think - much earlier...?).  The sequence of events is a bit beyond me right now, but it involves cancelling locks on MDT1, which are presumably held because of previous tests.  Somewhere in there, something gets stuck, and the lock cancellation does not complete (on MDT0) until after the create operation has started, hence the mkdir failing with EEXIST.
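
For what it's worth, one generic way to get more detail on which lock the lfs mkdir is blocked on would be to capture DLM tracing around that window. This is only a standard lctl debugging sketch, not something taken from this ticket:

lctl set_param debug=+dlmtrace        # enable LDLM lock tracing
lctl clear                            # empty the kernel debug buffer
# ... re-run sanityn test_47b here ...
lctl dk /tmp/sanityn_47b_dlm.log      # dump the trace for inspection (output path is arbitrary)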

Comment by Patrick Farrell (Inactive) [ 04/Feb/19 ]

I can't prove it, but I have a guess on this.

The lock it's waiting for was created by/during a previous test, and I think there's some uncommitted stuff under it that's getting synced. In that case, 'sleep 1' simply isn't long enough for that to happen, due to our lack of the ZIL and the long commit intervals on ZFS (once per second).
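
If one wanted to sanity-check the commit-interval theory, the ZFS transaction group timeout on the MDS can be inspected (and, on newer ZFS releases, a commit forced) with something like the following; this is a generic ZFS sketch, not part of any proposed patch:

cat /sys/module/zfs/parameters/zfs_txg_timeout   # upper bound (seconds) between txg commits
zpool sync                                       # force an immediate txg commit (ZFS 0.8+)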

 

I'm going to push a patch to change the sleeps to 2 seconds.

Comment by Patrick Farrell (Inactive) [ 04/Feb/19 ]

Discussed with bzzz; he's going to take a look and see what we can do. Maybe we can do better than just increasing the sleep (which isn't 100% reliable, but we also want to avoid syncs...).

Comment by Andreas Dilger [ 10/May/19 ]

What about just cancelling all of the MDC locks at the start of the test, so that we aren't waiting on that in the middle of the test? That avoids having to make the sleep long enough for the 1% case, without wasting time by making it too long for the other 99% of runs. We still need a small sleep due to fork/exec, but that can be sleep 0.2 or similar.
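
A minimal sketch of that idea, reusing the test snippet quoted above and the test-framework cancel_lru_locks helper (the actual patch may of course differ):

# drop MDC DLM locks cached by earlier tests so the mkdir does not stall on
# their cancellation in the middle of this test
cancel_lru_locks mdc
# roughly equivalent low-level form:
#   lctl set_param -n ldlm.namespaces.*mdc*.lru_size=clear

$LFS mkdir -i 1 $DIR1/$tfile &
PID1=$!
sleep 0.2       # small delay only to cover fork/exec of the background lfs
multiop $DIR2/$tfile oO_CREAT:O_EXCL:c && error "create must fail"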

Comment by Patrick Farrell (Inactive) [ 10/May/19 ]

That makes a lot more sense than my suggestion.

Andreas, has this failed recently?  I ignored it because I saw it a few times, then hadn't seen it since.

Comment by Andreas Dilger [ 10/May/19 ]

PS: I post here because it seems like there has been an uptick in the frequency of test failures:

  • 2019-05-10: 4 failures already today
  • week ending 2019-05-09: 4 failures
  • week ending 2019-05-02: 2 failures
  • week ending 2019-04-25: 2 failures

There were a number of patches landed on 2019-05-08, so it is likely that one of them is involved (possibly just increasing the size of the race window by making some piece of code/test slower).

Comment by Patrick Farrell (Inactive) [ 10/May/19 ]

Presumably:

https://review.whamcloud.com/4392

which modifies these tests.

Comment by Andreas Dilger [ 10/May/19 ]

It looks like the recent uptick in failures is caused by the landing on 2019-05-10 of patch https://review.whamcloud.com/4392 "LU-2233 tests: improve tests sanityn/40-47", which reduced the "sleep" time to 0.2s and increased the probability of other lock operations still being blocked when this test was run.
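
Purely as an illustration of an alternative to tuning a fixed sleep (this is not what was landed), the test could instead poll until the background lfs mkdir appears to be sleeping in the kernel before issuing the racing create; the wchan check below is only a rough heuristic:

$LFS mkdir -i 1 $DIR1/$tfile &
PID1=$!
# wait up to ~5s for the mkdir to block; /proc/<pid>/wchan shows "0" while the
# task is still runnable and a kernel symbol name once it is sleeping
for i in $(seq 1 50); do
    wchan=$(cat /proc/$PID1/wchan 2>/dev/null)
    [ -n "$wchan" ] && [ "$wchan" != "0" ] && break
    sleep 0.1
done
multiop $DIR2/$tfile oO_CREAT:O_EXCL:c && error "create must fail"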

Comment by Gerrit Updater [ 10/May/19 ]

Patrick Farrell (pfarrell@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34848
Subject: LU-10754 tests: Clear mdc locks before tests
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1a4b819075e2936ed9dc5d4b3600a7be37270c8a

Comment by Gerrit Updater [ 12/May/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34848/
Subject: LU-10754 tests: Clear mdc locks before tests
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 26a7abe812834663a6d42ba0d04606650c71cb9f

Comment by Gerrit Updater [ 13/May/19 ]

Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34853
Subject: LU-10754 debug: try to sleep for 1s
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 24860d1c1948a2a7ccec0be8eb1ea8ee78e63cb6

Comment by James A Simmons [ 22/May/19 ]

This is causing about 1/3 of test suite runs to fail.

Comment by Gerrit Updater [ 22/May/19 ]

Andreas Dilger (adilger@whamcloud.com) merged in patch https://review.whamcloud.com/34853/
Subject: LU-10754 tests: sanityn/47b to sleep for 1s
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 85ac0da36748311528aa7b89f62679f087b7445c

Comment by Gerrit Updater [ 01/Jul/19 ]

Patrick Farrell (pfarrell@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35399
Subject: LU-10754 tests: Debug full testing failure
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c6779689cbcd658b600933425db1d5365b3a3337

Comment by Patrick Farrell (Inactive) [ 08/Jul/19 ]

OK, figured it out:
https://review.whamcloud.com/#/c/4392/
breaks interop.

Wondering if we can just land it to b2_12?

Comment by Andreas Dilger [ 09/Jul/19 ]

Wondering if we can just land it to b2_12?

If this patch is landed to b2_12, will it then cause interop testing problems for 2.12 vs. 2.10? Not that I'm totally against this, since we run far more testing on master than b2_12, but ideally we should just add a version interop check, or just skip this test for interop testing on b2_12 so that there aren't gratuitous errors that need to be looked at.
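
A version guard along those lines might look roughly like this in sanityn.sh, using the test-framework version_code/lustre_version_code helpers; the cutoff version and the choice of checking the MDS (rather than the client) are only placeholders for illustration:

# hypothetical interop guard for test_47b; the real cutoff would be whichever
# release first carries the LU-2233 test rework
[ $(lustre_version_code $SINGLEMDS) -lt $(version_code 2.12.53) ] &&
    skip "MDS version predates the sanityn 40-47 rework" && return 0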

Comment by Jian Yu [ 06/Aug/19 ]

The failure still occurred on the master branch:
https://testing.whamcloud.com/test_sets/c376ce50-b86c-11e9-b753-52540065bddc
https://testing.whamcloud.com/test_sets/3e5c96d8-b7c6-11e9-b88c-52540065bddc
https://testing.whamcloud.com/test_sets/29aab538-b66a-11e9-b88c-52540065bddc

Comment by Patrick Farrell (Inactive) [ 06/Aug/19 ]

yujian,

There are a number of bugs for this, unfortunately.  Here's the one we're using now, which has an unlanded fix: LU-12470

Comment by Jian Yu [ 07/Aug/19 ]

Thank you, Patrick.
