[LU-10870] sanityn test 40a, 40b, 40c, 40d, 40e fail with 'create is blocked' Created: 01/Apr/18  Updated: 22/Jan/24

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0, Lustre 2.13.0, Lustre 2.14.0, Lustre 2.12.4, Lustre 2.12.5, Lustre 2.12.6
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: Yang Sheng
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Duplicate
duplicates LU-12470 sanityn test_47b: create isn't blocked Resolved
duplicates LU-13097 sanityn test_47b: create must fail Resolved
is duplicated by LU-14746 Interop: sanityn test_41b fails with ... Open
is duplicated by LU-16686 sanityn test_47b: create must fail Open
is duplicated by LU-15313 sanityn test_46h: readdir isn't blocked Resolved
Related
is related to LU-10874 sanityn test_40b: @@@@@@ FAIL: create... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

sanityn test_40a, 40b, 40c, 40d, and 40e all fail with

'create is blocked'

 

The output for all these tests in the test_log is

== sanityn test 40a: pdirops: create vs others ======================================================= 05:31:57 (1522474317)
CMD: trevis-50vm10 lctl set_param fail_loc=0x80000145
fail_loc=0x80000145
Conflict
 sanityn test_40a: @@@@@@ FAIL: create is blocked
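
For readers unfamiliar with the fail_loc value in the output above, here is a minimal sketch of how 0x80000145 splits into a flag part and a fail-site id. The mask and flag values are recalled from the libcfs fail-injection header and should be treated as assumptions, not something this ticket states. The site id 0x145 matches the "cfs_fail_timeout id 145" messages in the MDS log, and I believe it corresponds to OBD_FAIL_MDS_PDO_LOCK, though that name does not appear in the ticket.

```python
# Assumed values, recalled from the libcfs fail-injection header.
CFS_FAIL_MASK_LOC = 0x0000FFFF  # low 16 bits carry the fail-site id
CFS_FAIL_FLAGS = {
    0x80000000: "CFS_FAIL_ONCE",
    0x40000000: "CFS_FAILED",
    0x20000000: "CFS_FAIL_SKIP",
    0x10000000: "CFS_FAIL_SOME",
    0x08000000: "CFS_FAIL_RAND",
}


def decode_fail_loc(value):
    """Split a fail_loc value into (site id, list of flag names)."""
    site = value & CFS_FAIL_MASK_LOC
    flags = [name for bit, name in CFS_FAIL_FLAGS.items() if value & bit]
    return site, flags


site, flags = decode_fail_loc(0x80000145)
print(hex(site), flags)
```

Under these assumptions the value means "trigger fail site 0x145 once".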

 

The most relevant console and dmesg output is in the MDS logs:

[18200.405863] Lustre: DEBUG MARKER: == sanityn test 40a: pdirops: create vs others ======================================================= 05:31:57 (1522474317)

[18200.592175] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000145

[18200.744375] LustreError: 1331:0:(fail.c:129:__cfs_fail_timeout_set()) cfs_fail_timeout id 145 sleeping for 15000ms

[18200.745639] LustreError: 1331:0:(fail.c:129:__cfs_fail_timeout_set()) Skipped 2 previous similar messages

[18215.746700] LustreError: 1331:0:(fail.c:133:__cfs_fail_timeout_set()) cfs_fail_timeout id 145 awake

[18215.747866] LustreError: 1331:0:(fail.c:133:__cfs_fail_timeout_set()) Skipped 2 previous similar messages

[18216.960004] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanityn test_40a: @@@@@@ FAIL: create is blocked

[18217.152621] Lustre: DEBUG MARKER: sanityn test_40a: @@@@@@ FAIL: create is blocked
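
As a quick check that the injected delay ran for its full duration, the following sketch pairs the "sleeping" and "awake" messages from the MDS log above and compares the requested delay with the elapsed time. The helper name sleep_intervals and the regular expression are illustrative only and are not part of any Lustre tooling.

```python
import re

# Two lines copied from the MDS log in this ticket.
LOG = """\
[18200.744375] LustreError: 1331:0:(fail.c:129:__cfs_fail_timeout_set()) cfs_fail_timeout id 145 sleeping for 15000ms
[18215.746700] LustreError: 1331:0:(fail.c:133:__cfs_fail_timeout_set()) cfs_fail_timeout id 145 awake
"""

PATTERN = re.compile(
    r"^\[\s*(\d+\.\d+)\].*cfs_fail_timeout id (\w+) (sleeping for (\d+)ms|awake)"
)


def sleep_intervals(text):
    """Return (fail id, requested ms, elapsed ms) for each sleep/awake pair."""
    started = {}
    results = []
    for line in text.splitlines():
        match = PATTERN.match(line)
        if not match:
            continue
        stamp, fail_id = float(match.group(1)), match.group(2)
        if match.group(4):
            # "sleeping for <N>ms": remember when the delay began.
            started[fail_id] = (stamp, int(match.group(4)))
        elif fail_id in started:
            # "awake": close the interval opened by the matching sleep.
            begin, requested_ms = started.pop(fail_id)
            results.append((fail_id, requested_ms, round((stamp - begin) * 1000)))
    return results


print(sleep_intervals(LOG))
```

For the two lines above, the elapsed time is within a few milliseconds of the requested 15000 ms, so the delay itself behaved as configured.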

 

Logs for this failure are at

2.11.0-RC3 Ubuntu clients - https://testing.hpdd.intel.com/test_sets/a680d104-3543-11e8-95c0-52540065bddc

2.10.1 Ubuntu clients - https://testing.hpdd.intel.com/test_sets/289b3946-ce96-11e7-9840-52540065bddc

2.10.3 el7/ZFS - https://testing.hpdd.intel.com/test_sets/e427ccbc-2cf4-11e8-b74b-52540065bddc

 



 Comments   
Comment by Peter Jones [ 01/Apr/18 ]

Yang Sheng

Could you please investigate?

Thanks

Peter

Comment by Yang Sheng [ 02/Apr/18 ]

From the log:

 

00010000:00010000:1.0:1521525055.419920:0:15546:0:(ldlm_lockd.c:1239:ldlm_handle_enqueue0()) ### server-side enqueue handler START
00010000:00010000:1.0:1521525055.419924:0:15546:0:(ldlm_lockd.c:1319:ldlm_handle_enqueue0()) ### server-side enqueue handler, new lock created ns: mdt-lustre-MDT0000_UUID lock: ffff88005f262800/0x401f5317adeb9f2e lrc: 2/0,0 mode: --/CR res: [0x200000007:0x1:0x0].0x0 bits 0x0 rrc: 4 type: IBT flags: 0x40000000000000 nid: local remote: 0x4c150bb57b22c4ec expref: -99 pid: 15546 timeout: 0 lvb_type: 0
00010000:00010000:1.0:1521525055.419941:0:15546:0:(ldlm_lock.c:743:ldlm_lock_addref_internal_nolock()) ### ldlm_lock_addref(PR) ns: mdt-lustre-MDT0000_UUID lock: ffff88005f263400/0x401f5317adeb9f35 lrc: 3/1,0 mode: --/PR res: [0x200000007:0x1:0x0].0x0 bits 0x0 rrc: 5 type: IBT flags: 0x40000000000000 nid: local remote: 0x0 expref: -99 pid: 15546 timeout: 0 lvb_type: 0
00010000:00010000:1.0:1521525055.419949:0:15546:0:(ldlm_lock.c:659:ldlm_add_bl_work_item()) ### lock incompatible; sending blocking AST. ns: mdt-lustre-MDT0000_UUID lock: ffff88005f262000/0x401f5317adeb9f20 lrc: 2/0,1 mode: CW/CW res: [0x200000007:0x1:0x0].0x0 bits 0x2 rrc: 5 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 12416 timeout: 0 lvb_type: 0
00010000:00010000:1.0:1521525055.419953:0:15546:0:(ldlm_resource.c:1551:ldlm_resource_add_lock()) ### About to add this lock ns: mdt-lustre-MDT0000_UUID lock: ffff88005f263400/0x401f5317adeb9f35 lrc: 4/1,0 mode: --/PR res: [0x200000007:0x1:0x0].0x0 bits 0x13 rrc: 5 type: IBT flags: 0x50210000000000 nid: local remote: 0x0 expref: -99 pid: 15546 timeout: 0 lvb_type: 0
00010000:00010000:1.0:1521525055.419959:0:15546:0:(ldlm_request.c:357:ldlm_blocking_ast_nocheck()) ### Lock still has references, will be cancelled later ns: mdt-lustre-MDT0000_UUID lock: ffff88005f262000/0x401f5317adeb9f20 lrc: 3/0,1 mode: CW/CW res: [0x200000007:0x1:0x0].0x0 bits 0x2 rrc: 5 type: IBT flags: 0x40210400000020 nid: local remote: 0x0 expref: -99 pid: 12416 timeout: 0 lvb_type: 0

It looks like the MDS_INODELOCK_UPDATE bit was set when 'touch DIR2/$tfile' enqueued its lock, so this is very likely a duplicate of LU-6570. However, nothing changed after the LU-6570 patch landed. I wonder why the logs were not collected with the -1 flag.

Thanks,
YangSheng
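
To make the conflict in the log excerpt above easier to follow, here is a minimal sketch of why the new PR lock (bits 0x13) triggers a blocking AST on the granted CW lock (bits 0x2). It assumes that inodebits locks conflict only when their modes are incompatible and their bit masks overlap. The bit names and the mode compatibility table are recalled from the Lustre headers and should be treated as assumptions; the helper names are hypothetical.

```python
# Assumed MDS_INODELOCK_* bit values, recalled from the Lustre headers.
INODELOCK_BITS = {
    0x01: "LOOKUP",
    0x02: "UPDATE",
    0x04: "OPEN",
    0x08: "LAYOUT",
    0x10: "PERM",
    0x20: "XATTR",
}

# Assumed LDLM mode compatibility: each mode maps to the modes it can coexist with.
COMPAT = {
    "NL": {"NL", "CR", "CW", "PR", "PW", "EX"},
    "CR": {"NL", "CR", "CW", "PR", "PW"},
    "CW": {"NL", "CR", "CW"},
    "PR": {"NL", "CR", "PR"},
    "PW": {"NL", "CR"},
    "EX": {"NL"},
}


def decode_bits(bits):
    """Return the names of the inodebits set in a lock's 'bits' field."""
    return [name for bit, name in INODELOCK_BITS.items() if bits & bit]


def conflicts(mode_a, bits_a, mode_b, bits_b):
    """Two IBITS locks conflict if modes are incompatible and bits overlap."""
    return mode_b not in COMPAT[mode_a] and bool(bits_a & bits_b)


print(decode_bits(0x13))                 # requested PR lock
print(decode_bits(0x2))                  # granted CW lock
print(conflicts("PR", 0x13, "CW", 0x2))  # overlap on UPDATE, PR vs CW
```

Under these assumptions the two locks overlap only on the UPDATE bit, which is consistent with the "lock incompatible; sending blocking AST" message.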

Comment by James Nunez (Inactive) [ 18/Apr/18 ]

Some recent failures at:
2018-03-27 - review-dne-part-1 - https://testing.hpdd.intel.com/test_sets/21f1282a-318e-11e8-b6a0-52540065bddc
2018-03-19 - review-dne-part-1 - https://testing.hpdd.intel.com/test_sets/e9d4cd10-2b3f-11e8-b3c6-52540065bddc
2018-03-17 - review-dne-zfs-part-1 - https://testing.hpdd.intel.com/test_sets/d62a14a6-29b3-11e8-9e0e-52540065bddc

Comment by James Nunez (Inactive) [ 16/Sep/19 ]

We are seeing this issue again or something similar: https://testing.whamcloud.com/test_sets/29e74e80-d6db-11e9-a25b-52540065bddc

Comment by Chris Horn [ 24/Oct/19 ]

+1 on master https://testing.whamcloud.com/test_sessions/3f181300-d3f8-465a-88de-95756bf58f3c

Comment by Jian Yu [ 10/Feb/20 ]

+1 on master: https://testing.whamcloud.com/test_sets/75e0b4fa-4ba4-11ea-b69a-52540065bddc

Comment by Emoly Liu [ 07/Jul/20 ]

+1 on master: https://testing.whamcloud.com/test_sets/abe98c66-5293-41c4-a72b-c317b11bb2e2

Comment by Nikitas Angelinas [ 05/Aug/20 ]

+1 on master https://testing.whamcloud.com/test_sets/7f698e8e-3861-4ff7-b36a-fdd9b23bb69f for test_40b.

Is it possible that test_40a, which fails in the same test run with "link is blocked", is caused by the same issue, or should I open a separate ticket?

Comment by Etienne Aujames [ 14/Feb/22 ]

+1 on b2_12 (2.12.8 - ZFS): https://testing.whamcloud.com/test_sets/1b0b4890-5b93-49ea-92de-881cd6d714fe
"link is blocked"

Generated at Sat Feb 10 02:38:55 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.