[LU-12470] sanityn test_47b: create isn't blocked Created: 25/Jun/19  Updated: 23/Jun/22  Resolved: 23/Jan/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0, Lustre 2.14.0
Fix Version/s: Lustre 2.14.0

Type: Bug Priority: Major
Reporter: Maloo Assignee: Andreas Dilger
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
duplicates LU-12407 sanityn test_42d: unlink isn't blocked Resolved
is duplicated by LU-10870 sanityn test 40a, 40b, 40c, 40d, 40e ... Open
is duplicated by LU-14746 Interop: sanityn test_41b fails with ... Open
is duplicated by LU-4575 Test failure sanityn test_41g: getatt... Resolved
is duplicated by LU-6641 sanityn test_46i- FAIL: remote mkdir ... Resolved
is duplicated by LU-8759 sanityn test_41c: link isn't blocked Resolved
is duplicated by LU-10874 sanityn test_40b: @@@@@@ FAIL: create... Resolved
is duplicated by LU-12408 sanityn test_44g: getattr isn't blocked Resolved
is duplicated by LU-12435 sanityn: test_44i: @@@@@@ FAIL: remo... Resolved
is duplicated by LU-12437 sanityn test_44f: rename isn't blocked Resolved
is duplicated by LU-12449 sanityn test_47b: create must fail Resolved
is duplicated by LU-12466 sanityn test_45a fails with: mkdir is... Resolved
is duplicated by LU-12551 sanityn test_46e: rename isn't blocked Resolved
is duplicated by LU-12576 sanityn: test_43d 'unlink isn't blocked' Resolved
is duplicated by LU-13097 sanityn test_47b: create must fail Resolved
Related
is related to LU-12689 sanityn test_42b: create isn't blocked Resolved
is related to LU-10754 sanityn test 47b fails with 'create m... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Chris Horn <hornc@cray.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/a0a32158-9756-11e9-8262-52540065bddc

test_47b failed with the following error:

create isn't blocked
== sanityn test 47b: pdirops: remote mkdir vs create ================================================= 10:59:51 (1561460391)
CMD: trevis-38vm9 lctl set_param fail_loc=0x80000145
fail_loc=0x80000145
CMD: trevis-38vm9 lctl set_param fail_loc=0
fail_loc=0
No conflict
lfs mkdir: dirstripe error on '/mnt/lustre/f47b.sanityn': stripe already set
lfs setdirstripe: cannot create dir '/mnt/lustre/f47b.sanityn': File exists
 sanityn test_47b: @@@@@@ FAIL: create isn't blocked 

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanityn test_41a - mkdir isn't blocked
sanityn test_41b - create isn't blocked
sanityn test_41c - link isn't blocked
sanityn test_41d - unlink isn't blocked
sanityn test_41e - rename isn't blocked
sanityn test_41f - rename isn't blocked
sanityn test_41g - getattr isn't blocked
sanityn test_42a - mkdir isn't blocked
sanityn test_42b - create isn't blocked
sanityn test_42c - link isn't blocked
sanityn test_42d - unlink isn't blocked
sanityn test_42e - rename isn't blocked
sanityn test_42f - rename isn't blocked
sanityn test_42g - getattr isn't blocked
sanityn test_43a - mkdir isn't blocked
sanityn test_43b - create isn't blocked
sanityn test_43c - link isn't blocked
sanityn test_43d - unlink isn't blocked
sanityn test_43e - rename isn't blocked
sanityn test_43f - rename isn't blocked
sanityn test_43g - getattr isn't blocked
sanityn test_43i - remote mkdir isn't blocked
sanityn test_44a - mkdir isn't blocked
sanityn test_44b - create isn't blocked
sanityn test_44c - link isn't blocked
sanityn test_44d - unlink isn't blocked
sanityn test_44e - rename isn't blocked
sanityn test_44f - rename isn't blocked
sanityn test_44g - getattr isn't blocked
sanityn test_44i - remote mkdir isn't blocked
sanityn test_45a - mkdir isn't blocked
sanityn test_45b - create isn't blocked
sanityn test_45c - link isn't blocked
sanityn test_45d - unlink isn't blocked
sanityn test_45e - rename isn't blocked
sanityn test_45f - rename isn't blocked
sanityn test_45g - getattr isn't blocked
sanityn test_46a - mkdir isn't blocked
sanityn test_46b - create isn't blocked
sanityn test_46c - link isn't blocked
sanityn test_46d - unlink isn't blocked
sanityn test_46e - rename isn't blocked
sanityn test_46f - rename isn't blocked
sanityn test_46g - getattr isn't blocked
sanityn test_46i - remote mkdir isn't blocked
sanityn test_47a - mkdir isn't blocked
sanityn test_47b - create isn't blocked
sanityn test_47c - link isn't blocked
sanityn test_47d - unlink isn't blocked
sanityn test_47e - rename isn't blocked
sanityn test_47f - rename isn't blocked
sanityn test_47g - getattr isn't blocked



 Comments   
Comment by Patrick Farrell (Inactive) [ 25/Jun/19 ]

Ah, I get it.

This is the old "delay because ZFS sync is slow" problem, where we're cancelling a pre-existing lock.  But we're clearing the client mdc locks first, so...

00010000:00010000:1.0:1561460391.953615:0:9926:0:(ldlm_lock.c:662:ldlm_add_bl_work_item()) ### lock incompatible; sending blocking AST. ns: mdt-lustre-MDT0000_UUID lock: ffff9a5196ab66c0/0x2d41583a090c8900 lrc: 2/0,0 mode: EX/EX res: [0x200000007:0x1:0x0].0x0 bits 0x2/0x0 rrc: 8 type: IBT flags: 0x40000001000000 nid: 10.9.3.146@tcp remote: 0x7614c41d0ce57ab0 expref: 8 pid: 3617 timeout: 0 lvb_type: 0 

This is the lock that MDT0 is cancelling, and its owner (nid: 10.9.3.146@tcp) is another MDT.

So we also have to clear MDT-MDT locks.  Easy enough.
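
For illustration only (this is not the patch itself), a minimal shell sketch of clearing both sets of locks before a pdirops subtest. The `*mdt*` parameter pattern is the one that appears in the sanityn log quoted later in this ticket; the `*mdc*` pattern for the client side, the helper names, and the `LCTL` override are assumptions made for the example.

```shell
# Hypothetical helpers; LCTL can be overridden (e.g. LCTL=echo) for a dry run.
LCTL=${LCTL:-lctl}

# Run on each client: drop unused locks held in the client's mdc namespaces.
clear_mdc_locks() {
	"$LCTL" set_param -n 'ldlm.namespaces.*mdc*.lru_size=clear'
}

# Run on each MDS node: drop unused locks in namespaces matching *mdt*,
# which is where a lock owned by another MDT would show up.
clear_mdt_locks() {
	"$LCTL" set_param -n 'ldlm.namespaces.*mdt*.lru_size=clear'
}
```

Calling `clear_mdc_locks` on the clients and `clear_mdt_locks` on the MDS nodes covers both cases described above. With `LCTL=echo` the functions only print the command they would run.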

Comment by Gerrit Updater [ 25/Jun/19 ]

Patrick Farrell (pfarrell@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35321
Subject: LU-12470 tests: clear MDT-MDT locks for pdo tests
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 4397884527e437439005606690489010b247d0dd

Comment by Andreas Dilger [ 24/Jul/19 ]

Hopefully my change to the Maloo auto-vet works for this ticket. There were about 10 different tickets open for apparently identical failures on different subtests. I picked this one as the prime target because it has a potential fix.

Comment by Cory Spitz [ 03/Aug/19 ]

"There are about 10 different tickets open...", does that mean that this test fix could dramatically improve the pass rate of auto test? Is there any relation to the failures that prompted LU-12210 ?

Comment by Patrick Farrell (Inactive) [ 06/Aug/19 ]

spitzcor,

No, these are all very low incidence; I think the highest is well below 1%, maybe below 0.1%.  But it's somewhere between 10 and 30 tests failing (I'd have to dig to be sure), so a lot of tickets get opened.

Comment by Patrick Farrell (Inactive) [ 06/Aug/19 ]

Oh, and no, these tests are not among those that I noticed were affected by LU-12210 when I looked into it.  There's just a pile of tests that all do the same thing (testing blocking/not blocking behavior of different metadata ops, in DNE and non-DNE configs), and when Alex moved the wait time in these tests from "obscenely long" to "reasonable", many of them started getting a few different low-probability failures, some common to the full set, some limited to DNE.

I don't really know if the patch I've got here is the last problem, but I'm hopeful.  Anyway, even though there's a lot of different tests, the aggregate failure rate is not that high, and we're currently passing autotest pretty reliably.

Comment by Gerrit Updater [ 09/Aug/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35321/
Subject: LU-12470 tests: clear MDT-MDT locks for pdo tests
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 43ed7101e10e395839f9406bead6a5ac4fb02997

Comment by Peter Jones [ 09/Aug/19 ]

Landed for 2.13

Comment by Gerrit Updater [ 22/Aug/19 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35865
Subject: LU-12470 tests: clear MDT-MDT locks for pdo tests
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 5c668c55ce8e44520f4eec7b1ae35eceeaa3a75b

Comment by Chris Horn [ 23/Aug/19 ]

Looks like this issue is not totally resolved - https://testing.whamcloud.com/test_sets/5cafca60-c5a9-11e9-98c8-52540065bddc

pfarrell re-open this ticket or open a new one?

Comment by Chris Horn [ 23/Aug/19 ]

Opened https://jira.whamcloud.com/browse/LU-12689 to track new failure

Comment by Andreas Dilger [ 26/Aug/19 ]

It seems that there are now two different failure modes for these tests:

  • occasional review test failures for patches, similar to what happened before the LU-12470 patch landed, so it isn't clear whether this patch really fixed anything.
  • complete failure of subtests in the 40-46 range when running tests with 2.12.2 MDS/OSS and reporting locks not being canceled by lru_size=clear:
    == sanityn test 41b: pdirops: create vs create ======================================================= 15:25:02 (1565882702)
    CMD: trevis-20vm12 /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
    CMD: trevis-20vm12 /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
    ldlm.namespaces.mdt-lustre-MDT0000_UUID.lock_count=43
    

The testing interop problem may relate to a patch in 2.13 on how the LRU is cleared (always waiting for locks to be cancelled?), but I can't find the patch in question.
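
As a rough aid for checking whether locks are still held after `lru_size=clear`, the `get_param` output quoted above can be reduced to a single number. This is only a sketch: the function name `sum_lock_count` is hypothetical, and it assumes the `name=value` output format shown in the log.

```shell
# Sum every "...lock_count=N" value read from stdin.
# Lines for other parameters (e.g. lock_unused_count) are ignored.
sum_lock_count() {
	awk -F= '/\.lock_count=/ { total += $2 } END { print total + 0 }'
}
```

For example, piping the output of `lctl get_param ldlm.namespaces.*mdt*.lock_count` into `sum_lock_count` on the MDS would print 43 for the log line above; a nonzero result after the clear means some locks were not cancelled.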

I think it makes sense to reopen this ticket and not mark it resolved for 2.13, and either revert the patch (since it breaks testing interop), or figure out what is causing the testing interop issues and include the fix in 2.12.3 (though this is sub-optimal for some reasons).

Comment by Gerrit Updater [ 12/Sep/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35865/
Subject: LU-12470 tests: clear MDT-MDT locks for pdo tests
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: cf3506ccb25f4a4f57150e51e7a8bde4cee80a26

Comment by Emoly Liu [ 27/Sep/19 ]

+1 on master with the patch: https://testing.whamcloud.com/test_sets/60b8ee10-e03a-11e9-a197-52540065bddc

Comment by Chris Horn [ 24/Oct/19 ]

+1 on master https://testing.whamcloud.com/test_sessions/3f181300-d3f8-465a-88de-95756bf58f3c

Comment by Chris Horn [ 31/Oct/19 ]

+1 on master https://testing.whamcloud.com/test_sessions/7a906b8b-f8ee-48b5-91bc-d3628a032560

Comment by Emoly Liu [ 20/Nov/19 ]

+1 on master: https://testing.whamcloud.com/test_sets/bca0aec6-0abe-11ea-b934-52540065bddc

Comment by Arshad Hussain [ 30/Dec/19 ]

Seen on master: https://testing.whamcloud.com/sub_tests/863eeb94-2b1c-11ea-b0f4-52540065bddc

Comment by Gerrit Updater [ 22/Jan/20 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37304
Subject: LU-12470 tests: increase pdirops timeout
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a42f438a4acac3145cd5b7b178a737dbacc43d32

Comment by Gerrit Updater [ 23/Jan/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37304/
Subject: LU-12470 tests: increase pdirops timeout
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: b35f50c96c608ba650a5b3cf29fa129e01025549

Comment by James Nunez (Inactive) [ 08/Jun/21 ]

I'm going to open a new ticket for this, but thought I'd leave a note here. We are still seeing this issue for interop testing between master clients and 2.12.6 servers.

One such failure is at https://testing.whamcloud.com/test_sets/d3a83e0b-15be-4799-b8ba-c264835472e2 .

Comment by Andreas Dilger [ 26/Oct/21 ]

... and that bug is LU-14746
