[LU-13328] sanityn test 35 is broken Created: 04/Mar/20  Updated: 05/Mar/20

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Oleg Drokin Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-13329 Catch improperly reset fail_race cond... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

It looks like the race condition is rarely hit. Here's a typical run:

[  751.745601] Lustre: DEBUG MARKER: == sanityn test 35: -EINTR cp_ast vs. bl_ast race does not evict client ============================== 04:48:12 (1583315292)
[  752.624986] Lustre: DEBUG MARKER: Race attempt 0
[  754.244032] Lustre: DEBUG MARKER: Wait for 16439 16445 for 60 sec...
[  759.855738] Lustre: lustre-OST0001-osc-ffff8800a08ce800: disconnect after 21s idle
[  816.516229] Lustre: DEBUG MARKER: == sanityn test 36: handle ESTALE/open-unlink correctly 

But sometimes it does hit, and in that case the thread hangs and stays hung forever, causing test cleanup to time out.

[  699.235637] Lustre: DEBUG MARKER: == sanityn test 35: -EINTR cp_ast vs. bl_ast race does not evict client ============================== 17:30:03 (1583274603)
[  700.065742] Lustre: DEBUG MARKER: Race attempt 0
[  701.154904] Lustre: *** cfs_fail_loc=317, val=2147484440***
[  701.155724] LustreError: 17592:0:(libcfs_fail.h:169:cfs_race()) cfs_race id 318 sleeping
[  701.579880] Lustre: DEBUG MARKER: Wait for 17549 17565 for 60 sec...
[  763.731071] Lustre: DEBUG MARKER: == sanityn test 36: handle ESTALE/open-unlink correctly 

Note how the race hit (the thread went to sleep), but it never actually woke.

We can confirm this by analyzing the core file; the ll_sa (statahead) thread is still stuck in the cfs_race() sleep inside ldlm_lock_decref_internal:

PID: 17592  TASK: ffff88009dfc8440  CPU: 2   COMMAND: "ll_sa_17565"
 #0 [ffff8800c25276e8] __schedule+0x3ee at ffffffff817d889e
 #1 [ffff8800c2527778] schedule+0x29 at ffffffff817d8ea9
 #2 [ffff8800c2527788] ldlm_lock_decref_internal+0x96e at ffffffffa05b09ce [ptlrpc]
 #3 [ffff8800c25277e8] failed_lock_cleanup+0x98 at ffffffffa05cb928 [ptlrpc]
 #4 [ffff8800c2527808] ldlm_cli_enqueue_fini+0x16c at ffffffffa05cddac [ptlrpc]
 #5 [ffff8800c25278b0] ldlm_cli_enqueue+0x441 at ffffffffa05d1ad1 [ptlrpc]

crash> l *0xffffffffa05b09ce
0xffffffffa05b09ce is in ldlm_lock_decref_internal (/home/green/git/lustre-release/libcfs/include/libcfs/libcfs_fail.h:170).
165		if (CFS_FAIL_PRECHECK(id)) {
166			if (unlikely(__cfs_fail_check_set(id, 0, CFS_FAIL_LOC_NOSET))) {
167				int rc;
168				cfs_race_state = 0;
169				CERROR("cfs_race id %x sleeping\n", id);
170				rc = wait_event_interruptible(cfs_race_waitq,
171							      cfs_race_state != 0);
172				CERROR("cfs_fail_race id %x awake: rc=%d\n", id, rc);
173			} else {
174				CERROR("cfs_fail_race id %x waking\n", id);

So, in addition to fixing this, I wonder if it makes sense to clear the active race with a very visible dmesg message?
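For illustration, a minimal sketch of what that could look like: a variant of the cfs_race() shown above with a bounded wait. The 300-second timeout, the message wording, and the reconstruction of the waking branch (which the crash listing cuts off) are my assumptions here, not the actual tree or a proposed patch:

static inline void cfs_race_bounded(__u32 id)
{
	if (CFS_FAIL_PRECHECK(id)) {
		if (unlikely(__cfs_fail_check_set(id, 0, CFS_FAIL_LOC_NOSET))) {
			long rc;

			cfs_race_state = 0;
			CERROR("cfs_race id %x sleeping\n", id);
			/* Bounded wait: if the partner thread never takes the
			 * waking branch below, give up loudly after a while
			 * instead of hanging the thread (and test cleanup)
			 * forever.  Timeout value is an assumption. */
			rc = wait_event_interruptible_timeout(cfs_race_waitq,
							      cfs_race_state != 0,
							      cfs_time_seconds(300));
			if (rc == 0)
				CERROR("cfs_fail_race id %x: partner never fired, clearing stale race\n",
				       id);
			else
				CERROR("cfs_fail_race id %x awake: rc=%ld\n",
				       id, rc);
		} else {
			/* Waking side, reconstructed from the visible else
			 * branch: release whoever is sleeping above. */
			CERROR("cfs_fail_race id %x waking\n", id);
			cfs_race_state = 1;
			wake_up(&cfs_race_waitq);
		}
	}
}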

Here's an example of the failure case: http://testing.linuxhacker.ru:3333/lustre-reports/7130/testresults/sanityn-ldiskfs-DNE-centos7_x86_64-centos7_x86_64/

