[LU-14203] sanityn test 40a fails with 'parallel operation is blocked' Created: 09/Dec/20  Updated: 09/Dec/20

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.6
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

Description

sanityn test_40a fails with 'parallel operation is blocked'.

We’ve seen this failure three times:
18 AUG 2020 DNE/ZFS 2.12.5.20 for ticket/patch LU-13471/39576 - https://testing.whamcloud.com/test_sets/15350d90-9f69-4900-b036-9fc3f73140e5
07 NOV 2020 DNE/ZFS 2.12.5.83 for branch testing - https://testing.whamcloud.com/test_sets/bbd2f426-8f1b-4e0d-a27f-951e75976ab6
07 DEC 2020 DNE/ldiskfs 2.12.6 RC2 branch testing - https://testing.whamcloud.com/test_sets/bc0cbb1c-6e57-40df-9eef-47d8b7b8962d

Looking at the failure from RC2 testing, the suite_log shows the following output for the test:

== sanityn test 40a: pdirops: create vs others ======================================================= 17:18:20 (1607361500)
CMD: trevis-52vm5,trevis-52vm6 /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
CMD: trevis-52vm5,trevis-52vm6 /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
ldlm.namespaces.mdt-lustre-MDT0000_UUID.lock_count=57
ldlm.namespaces.mdt-lustre-MDT0001_UUID.lock_count=1
CMD: trevis-52vm5 lctl set_param fail_loc=0x80000145
fail_loc=0x80000145
No conflict
No conflict
No conflict
No conflict
No conflict
No conflict
Conflict
 sanityn test_40a: @@@@@@ FAIL: parallel operation is blocked 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:5907:error()
  = /usr/lib64/lustre/tests/sanityn.sh:1517:test_40a()
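
The test arms MDS fault injection with fail_loc=0x80000145, and the MDS console log in this ticket reports "cfs_fail_timeout id 145". The short Python sketch below shows how that value splits into a fail-point id and a fire-once flag. It assumes the libcfs convention that the high bit 0x80000000 (CFS_FAIL_ONCE) means "trigger once" and that the fail-point id sits in the low 16 bits. Both constants are assumptions and should be checked against the libcfs headers of the branch under test.

```python
# Assumed flag value, modelled on libcfs's CFS_FAIL_ONCE; verify against the
# libcfs fail-point header of the branch being tested.
CFS_FAIL_ONCE = 0x80000000
# Assumption: the fail-point id is carried in the low 16 bits of fail_loc.
FAIL_ID_MASK = 0x0000FFFF


def decode_fail_loc(value: int) -> tuple[int, bool]:
    """Split a fail_loc value into (fail-point id, fire-once flag)."""
    return value & FAIL_ID_MASK, bool(value & CFS_FAIL_ONCE)


fail_id, once = decode_fail_loc(0x80000145)
# Format the id in hex, the way the console message "id 145" appears.
print(f"id {fail_id:x}, once={once}")  # prints "id 145, once=True"
```

Under these assumptions, the "id 145" in the console message is the low part of the fail_loc value the test set.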

Although we also see these ‘errors’ in the MDS console when 40a passes, they are the only interesting messages in the console logs. The console log for MDS1/3 (vm5) shows:

[31916.082787] Lustre: DEBUG MARKER: == sanityn test 40a: pdirops: create vs others ======================================================= 17:18:20 (1607361500)
[31916.531769] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n ldlm.namespaces.*mdt*.lru_size=clear
[31917.009765] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param ldlm.namespaces.*mdt*.lock_unused_count ldlm.namespaces.*mdt*.lock_count
[31917.449680] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000145
[31917.647130] LustreError: 8427:0:(fail.c:129:__cfs_fail_timeout_set()) cfs_fail_timeout id 145 sleeping for 15000ms
[31917.648962] LustreError: 8427:0:(fail.c:129:__cfs_fail_timeout_set()) Skipped 4 previous similar messages
[31932.650062] LustreError: 8427:0:(fail.c:133:__cfs_fail_timeout_set()) cfs_fail_timeout id 145 awake
[31932.651716] LustreError: 8427:0:(fail.c:133:__cfs_fail_timeout_set()) Skipped 4 previous similar messages
[31933.827926] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanityn test_40a: @@@@@@ FAIL: parallel operation is blocked 
[31934.090878] Lustre: DEBUG MARKER: sanityn test_40a: @@@@@@ FAIL: parallel operation is blocked
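
The timing in the vm5 console log can be checked with a small script. This sketch copies the timestamps (seconds since MDS boot) from the lines above. It measures how long the logged fail timeout slept and how soon after it woke the FAIL marker appeared. It also converts the epoch value in the suite_log header to UTC. Because of console rate limiting ("Skipped 4 previous similar messages"), only some of the fail timeouts are printed, so this times only the logged instance.

```python
from datetime import datetime, timezone

# Console timestamps (seconds since MDS boot) copied from the vm5 log.
events = {
    "fail_loc set": 31917.449680,
    "cfs_fail_timeout sleeping": 31917.647130,
    "cfs_fail_timeout awake": 31932.650062,
    "FAIL marker": 31933.827926,
}

# Duration of the logged fail timeout and the gap until the test failed.
sleep_s = events["cfs_fail_timeout awake"] - events["cfs_fail_timeout sleeping"]
gap_s = events["FAIL marker"] - events["cfs_fail_timeout awake"]

# The suite_log header "17:18:20 (1607361500)" carries the epoch test start.
start_utc = datetime.fromtimestamp(1607361500, tz=timezone.utc).isoformat()

print(f"fail timeout slept {sleep_s:.3f}s")        # fail timeout slept 15.003s
print(f"FAIL marker {gap_s:.3f}s after wake-up")   # FAIL marker 1.178s after wake-up
print(f"test started {start_utc}")                 # test started 2020-12-07T17:18:20+00:00
```

The logged timeout matches the 15000ms requested by the fail point, and the test reported "Conflict" about 1.2 s after that sleeper woke.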

Generated at Sat Feb 10 03:07:44 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.