[LU-10754] sanityn test 47b fails with 'create must fail' Created: 02/Mar/18 Updated: 26/Oct/21 Resolved: 22/May/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.11.0, Lustre 2.12.2 |
| Fix Version/s: | Lustre 2.13.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | James Nunez (Inactive) | Assignee: | Patrick Farrell (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | dne, zfs |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
sanityn test_47b fails for DNE/ZFS testing. From the client test_log, we see:

== sanityn test 47b: pdirops: remote mkdir vs create ================================================= 09:50:51 (1519725051)
CMD: onyx-32vm9 lctl set_param fail_loc=0x80000145
fail_loc=0x80000145
sanityn test_47b: @@@@@@ FAIL: create must fail
lfs mkdir: error on LL_IOC_LMV_SETSTRIPE '/mnt/lustre/f47b.sanityn' (3): stripe already set
lfs setdirstripe: cannot create stripe dir '/mnt/lustre/f47b.sanityn': File exists

This test started failing on 2018-01-19. The console and dmesg logs don’t have any more information in them for this failure than what is in the suite_log. Logs for test sessions with this failure are at |
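For orientation, here is a rough sketch of the test's structure, reconstructed from the log above and from the $LFS mkdir line quoted in the comments below. The do_facet target, the multiop argument string, and the exact helper calls are approximations, not the verbatim test:

```bash
# Approximate shape of sanityn.sh test_47b (a sketch, not the verbatim test).
# The fail_loc injects a one-shot delay on the MDS parent-directory (PDO) lock
# path, so the create on the second mount should queue behind the remote mkdir
# and then fail with EEXIST once the mkdir completes.
do_facet mds1 lctl set_param fail_loc=0x80000145  # delay the mkdir's PDO lock once
$LFS mkdir -i 1 $DIR1/$tfile &                    # remote mkdir on MDT1, in background
PID=$!
sleep 1                                           # let the mkdir reach the MDT and take the lock
$MULTIOP $DIR2/$tfile oO_CREAT:O_EXCL:c && error "create must fail"
wait $PID                                         # reap the backgrounded lfs mkdir
```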
| Comments |
| Comment by Alex Zhuravlev [ 28/Jan/19 ] |
|
https://testing.whamcloud.com/test_sessions/b9d6b421-a589-4a3c-8daa-a9e2e08e5b31 |
| Comment by Patrick Farrell (Inactive) [ 31/Jan/19 ] |
|
https://testing.whamcloud.com/test_sessions/bc2f83f6-83cc-4baa-8854-25a3f29ecb0c |
| Comment by Patrick Farrell (Inactive) [ 31/Jan/19 ] |
|
So what I learned after looking at these logs for a bit: the DNE2 protocol is complex and I don't fully understand it, but we are hanging such that the lfs mkdir does not run until after the multiop has started:

$LFS mkdir -i 1 $DIR1/$tfile &

The lfs mkdir hangs waiting on a lock (not the intended PDO lock, I think - much earlier...?). The sequence of events is a bit beyond me right now, but it involves cancelling locks on MDT1, which are presumably held because of previous tests. Somewhere in there, something gets stuck, and the lock cancellation does not complete (on MDT0) until after the create operation has started, hence the mkdir failing with EEXIST. |
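To make that ordering concrete, here is the race as I read it (a sketch of the sequence, not taken verbatim from the logs):

```bash
# Intended ordering:
#   t0  lfs mkdir -i 1 $DIR1/$tfile &   # takes the PDO lock, held up by the fail_loc
#   t1  multiop create on $DIR2/$tfile  # queues behind the PDO lock
#   t2  mkdir completes; the create then fails with EEXIST     -> test passes
#
# Ordering in the failing runs:
#   t0  lfs mkdir blocks much earlier, waiting on cancellation of a lock left
#       over from a previous test (the cancellation involves MDT0)
#   t1  multiop create runs first and succeeds                 -> "create must fail"
#   t2  lfs mkdir finally reaches the MDT and gets EEXIST      -> "File exists"
```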
| Comment by Patrick Farrell (Inactive) [ 04/Feb/19 ] |
|
I can't prove it, but I have a guess on this. The lock it's waiting for was created by/during a previous test, and I think there is some uncommitted state under it that is getting synced. In that case, 'sleep 1' simply isn't long enough for that to happen, due to our lack of the ZIL and the long commit intervals on ZFS (once per second).
I'm going to push a patch to change the sleeps to 2 seconds. |
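In test terms the change is small; a minimal sketch, assuming the sleep sits between the backgrounded remote mkdir and the create:

```bash
$LFS mkdir -i 1 $DIR1/$tfile &
sleep 2   # was "sleep 1": give a full ZFS txg commit (~1/s, no ZIL) time to
          # finish cancelling the stale lock before the create is issued
```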
| Comment by Patrick Farrell (Inactive) [ 04/Feb/19 ] |
|
Discussed with bzzz; he's going to take a look and see what we can do. Maybe we can do better than just increasing the sleep (which isn't 100% reliable, but we also want to avoid syncs...). |
| Comment by Andreas Dilger [ 10/May/19 ] |
|
What about just cancelling all of the MDC locks at the start of the test, so that we aren't waiting on that in the middle of the test? That avoids having to make the sleep long enough for the 1% case, without wasting time by making it too long for the other 99% of runs. We still need a small sleep due to fork/exec, but that can be sleep 0.2 or similar. |
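A minimal sketch of that idea, assuming the usual client-side LRU clear is enough to cancel the stale MDC locks (I believe the test framework's cancel_lru_locks mdc helper does essentially this):

```bash
# Cancel cached MDC DLM locks up front so the backgrounded mkdir does not stall
# on an old lock cancellation mid-race; then only a token delay is needed.
lctl set_param -n "ldlm.namespaces.*mdc*.lru_size=clear"
$LFS mkdir -i 1 $DIR1/$tfile &
sleep 0.2   # only needs to cover the fork/exec of the backgrounded lfs
```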
| Comment by Patrick Farrell (Inactive) [ 10/May/19 ] |
|
That makes a lot more sense than my suggestion. Andreas, has this failed recently? I ignored it because I saw it a few times, then hadn't seen it since. |
| Comment by Andreas Dilger [ 10/May/19 ] |
|
PS: I'm posting here because it seems like there has been an uptick in the frequency of test failures:
There were a number of patches landed on 2019-05-08, so it is likely one of them is involved (possibly just increasing the size of the race window by making some piece of code/test slower). |
| Comment by Patrick Farrell (Inactive) [ 10/May/19 ] |
|
Presumably: https://review.whamcloud.com/4392, which modifies these |
| Comment by Andreas Dilger [ 10/May/19 ] |
|
It looks like the recent uptick in failures is caused by the landing on 2019-05-10 of patch https://review.whamcloud.com/4392 " |
| Comment by Gerrit Updater [ 10/May/19 ] |
|
Patrick Farrell (pfarrell@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34848 |
| Comment by Gerrit Updater [ 12/May/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34848/ |
| Comment by Gerrit Updater [ 13/May/19 ] |
|
Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34853 |
| Comment by James A Simmons [ 22/May/19 ] |
|
This is causing about 1/3 of test suite runs to fail. |
| Comment by Gerrit Updater [ 22/May/19 ] |
|
Andreas Dilger (adilger@whamcloud.com) merged in patch https://review.whamcloud.com/34853/ |
| Comment by Gerrit Updater [ 01/Jul/19 ] |
|
Patrick Farrell (pfarrell@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35399 |
| Comment by Patrick Farrell (Inactive) [ 08/Jul/19 ] |
|
OK, figured it out. Wondering if we can just land it to b2_12? |
| Comment by Andreas Dilger [ 09/Jul/19 ] |
|
If this patch is landed to b2_12, will it then cause interop testing problems for 2.12 vs. 2.10? Not that I'm totally against this, since we run far more testing on master than b2_12, but ideally we should just add a version interop check, or just skip this test for interop testing on b2_12 so that there aren't gratuitous errors that need to be looked at. |
| Comment by Jian Yu [ 06/Aug/19 ] |
|
The failure still occurred on the master branch: |
| Comment by Patrick Farrell (Inactive) [ 06/Aug/19 ] |
|
There are a number of bugs for this, unfortunately. Here's the one we're using now, which has an unlanded fix: |
| Comment by Jian Yu [ 07/Aug/19 ] |
|
Thank you, Patrick. |