[LU-12470] sanityn test_47b: create isn't blocked Created: 25/Jun/19 Updated: 23/Jun/22 Resolved: 23/Jan/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.13.0, Lustre 2.14.0 |
| Fix Version/s: | Lustre 2.14.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Maloo | Assignee: | Andreas Dilger |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Chris Horn <hornc@cray.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/a0a32158-9756-11e9-8262-52540065bddc

test_47b failed with the following error:

    create isn't blocked

    == sanityn test 47b: pdirops: remote mkdir vs create ================================================= 10:59:51 (1561460391)
    CMD: trevis-38vm9 lctl set_param fail_loc=0x80000145
    fail_loc=0x80000145
    CMD: trevis-38vm9 lctl set_param fail_loc=0
    fail_loc=0
    No conflict
    lfs mkdir: dirstripe error on '/mnt/lustre/f47b.sanityn': stripe already set
    lfs setdirstripe: cannot create dir '/mnt/lustre/f47b.sanityn': File exists
    sanityn test_47b: @@@@@@ FAIL: create isn't blocked

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV |
| Comments |
| Comment by Patrick Farrell (Inactive) [ 25/Jun/19 ] |
|
Ah, I get it. This is the old "delay because ZFS sync is slow" problem, where we're cancelling a pre-existing lock. But we're clearing the client mdc locks first, so...

    00010000:00010000:1.0:1561460391.953615:0:9926:0:(ldlm_lock.c:662:ldlm_add_bl_work_item()) ### lock incompatible; sending blocking AST. ns: mdt-lustre-MDT0000_UUID lock: ffff9a5196ab66c0/0x2d41583a090c8900 lrc: 2/0,0 mode: EX/EX res: [0x200000007:0x1:0x0].0x0 bits 0x2/0x0 rrc: 8 type: IBT flags: 0x40000001000000 nid: 10.9.3.146@tcp remote: 0x7614c41d0ce57ab0 expref: 8 pid: 3617 timeout: 0 lvb_type: 0

This is the lock that MDT0 is cancelling... And the owner:

    nid: 10.9.3.146@tcp

Is another MDT. So we also have to clear MDT-MDT locks. Easy enough. |
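For anyone triaging similar failures, the owner of a conflicting lock can be pulled out of the debug log mechanically. The snippet below is only a sketch: it assumes the log line has the same "key: value" layout as the one quoted in the comment above, and `parse_lock_line` is a hypothetical helper name, not part of any Lustre tooling.

```python
import re

# Sample line copied verbatim from the debug log quoted in this ticket.
LINE = (
    "00010000:00010000:1.0:1561460391.953615:0:9926:0:"
    "(ldlm_lock.c:662:ldlm_add_bl_work_item()) "
    "### lock incompatible; sending blocking AST. "
    "ns: mdt-lustre-MDT0000_UUID "
    "lock: ffff9a5196ab66c0/0x2d41583a090c8900 lrc: 2/0,0 mode: EX/EX "
    "res: [0x200000007:0x1:0x0].0x0 bits 0x2/0x0 rrc: 8 type: IBT "
    "flags: 0x40000001000000 nid: 10.9.3.146@tcp "
    "remote: 0x7614c41d0ce57ab0 expref: 8 pid: 3617 timeout: 0 lvb_type: 0"
)

# Most fields are "key: value"; "bits" is the one field without a colon
# in this sample, so it is matched separately below.
FIELD_RE = re.compile(r"(\w+): (\S+)")
BITS_RE = re.compile(r"\bbits (\S+)")


def parse_lock_line(line):
    """Split an ldlm debug line (layout assumed from the sample) into
    the free-text message and a dict of lock fields."""
    _, _, tail = line.partition("### ")
    message, _, rest = tail.partition(" ns: ")
    fields = dict(FIELD_RE.findall("ns: " + rest))
    bits = BITS_RE.search(rest)
    if bits:
        fields["bits"] = bits.group(1)
    return message, fields


message, fields = parse_lock_line(LINE)
print(message)
print("namespace:", fields["ns"])
print("mode:", fields["mode"], "owner nid:", fields["nid"])
```

The parser only reports the NID. Deciding whether that NID belongs to a client or to another MDT still requires knowing the test cluster's node layout.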
| Comment by Gerrit Updater [ 25/Jun/19 ] |
|
Patrick Farrell (pfarrell@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35321 |
| Comment by Andreas Dilger [ 24/Jul/19 ] |
|
Hopefully my change to the Maloo auto-vet works for this ticket. There were about 10 different tickets open for apparently identical failures on different subtests. I picked this one as the prime target because it has a potential fix. |
| Comment by Cory Spitz [ 03/Aug/19 ] |
|
"There are about 10 different tickets open...", does that mean that this test fix could dramatically improve the pass rate of auto test? Is there any relation to the failures that prompted LU-12210 ? |
| Comment by Patrick Farrell (Inactive) [ 06/Aug/19 ] |
|
No, these are all very low incidence, I think the highest is well below 1%, maybe below 0.1%. But it's somewhere between 10 and 30 tests failing (I'd have to dig to be sure), so a lot of tickets get opened. |
| Comment by Patrick Farrell (Inactive) [ 06/Aug/19 ] |
|
Oh, and no, these tests are not among those that I noticed affected by LU-12210 when I looked into it. There's just a pile of tests that all do the same thing (testing blocking/not blocking behavior of different metadata ops, in DNE and non-DNE configs) and when Alex moved the wait time in these tests from "obscenely long" to "reasonable" many of them started getting a few different low probability failures, some common to the full set, some limited to DNE. I don't really know if the patch I've got here is the last problem, but I'm hopeful. Anyway, even though there's a lot of different tests, the aggregate failure rate is not that high, and we're currently passing autotest pretty reliably. |
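To put rough numbers on the "aggregate failure rate" point, here is a back-of-the-envelope sketch. It assumes the subtests fail independently and all share the same per-test rate, which is a simplification. The rates and counts plugged in are the rough estimates from the comments above (0.1% to 1% per test, 10 to 30 tests), not measured values, and `aggregate_failure_rate` is a name made up for this illustration.

```python
def aggregate_failure_rate(per_test_rate, num_tests):
    """Probability that at least one of num_tests tests fails in a run,
    assuming independent failures with the same per-test rate."""
    return 1.0 - (1.0 - per_test_rate) ** num_tests


# Rough bounds quoted in the discussion, used here only as inputs.
for num_tests in (10, 30):
    for rate in (0.001, 0.01):
        combined = aggregate_failure_rate(rate, num_tests)
        print(f"{num_tests} tests at {rate:.1%} each: "
              f"{combined:.2%} chance of at least one failure per run")
```

At the low end of those estimates the combined rate stays around one percent per run. It only becomes large if every test sits at the upper bound, which the comments suggest is not the case.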
| Comment by Gerrit Updater [ 09/Aug/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35321/ |
| Comment by Peter Jones [ 09/Aug/19 ] |
|
Landed for 2.13 |
| Comment by Gerrit Updater [ 22/Aug/19 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35865 |
| Comment by Chris Horn [ 23/Aug/19 ] |
|
Looks like this issue is not totally resolved - https://testing.whamcloud.com/test_sets/5cafca60-c5a9-11e9-98c8-52540065bddc

pfarrell, re-open this ticket or open a new one? |
| Comment by Chris Horn [ 23/Aug/19 ] |
|
Opened https://jira.whamcloud.com/browse/LU-12689 to track new failure |
| Comment by Andreas Dilger [ 26/Aug/19 ] |
|
It seems that there are now two different failure modes for these tests:
The testing interop problem may relate to a patch in 2.13 on how the LRU is cleared (always waiting for locks to be cancelled?), but I can't find the patch in question. I think it makes sense to reopen this patch and not mark it resolved for 2.13, and either revert the patch (since it breaks testing interop), or figure out what is causing the testing interop issues and include the fix to 2.12.3 (though this is sub-optimal for some reasons). |
| Comment by Gerrit Updater [ 12/Sep/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35865/ |
| Comment by Emoly Liu [ 27/Sep/19 ] |
|
+1 on master with the patch: https://testing.whamcloud.com/test_sets/60b8ee10-e03a-11e9-a197-52540065bddc |
| Comment by Chris Horn [ 24/Oct/19 ] |
|
+1 on master https://testing.whamcloud.com/test_sessions/3f181300-d3f8-465a-88de-95756bf58f3c |
| Comment by Chris Horn [ 31/Oct/19 ] |
|
+1 on master https://testing.whamcloud.com/test_sessions/7a906b8b-f8ee-48b5-91bc-d3628a032560 |
| Comment by Emoly Liu [ 20/Nov/19 ] |
|
+1 on master: https://testing.whamcloud.com/test_sets/bca0aec6-0abe-11ea-b934-52540065bddc |
| Comment by Arshad Hussain [ 30/Dec/19 ] |
|
Seen on master: https://testing.whamcloud.com/sub_tests/863eeb94-2b1c-11ea-b0f4-52540065bddc |
| Comment by Gerrit Updater [ 22/Jan/20 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37304 |
| Comment by Gerrit Updater [ 23/Jan/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37304/ |
| Comment by James Nunez (Inactive) [ 08/Jun/21 ] |
|
I'm going to open a new ticket for this, but thought I'd leave a note here. We are still seeing this issue for interop testing between master clients and 2.12.6 servers. One such failure is at https://testing.whamcloud.com/test_sets/d3a83e0b-15be-4799-b8ba-c264835472e2 . |
| Comment by Andreas Dilger [ 26/Oct/21 ] |
|
... and that bug is LU-14746 |