Details
- Type: Bug
- Resolution: Unresolved
- Priority: Major
- Affects Version: Lustre 2.14.0
- Severity: 3
Description
It looks like the race condition is rarely hit. Here's a typical run:
[ 751.745601] Lustre: DEBUG MARKER: == sanityn test 35: -EINTR cp_ast vs. bl_ast race does not evict client ============================== 04:48:12 (1583315292)
[ 752.624986] Lustre: DEBUG MARKER: Race attempt 0
[ 754.244032] Lustre: DEBUG MARKER: Wait for 16439 16445 for 60 sec...
[ 759.855738] Lustre: lustre-OST0001-osc-ffff8800a08ce800: disconnect after 21s idle
[ 816.516229] Lustre: DEBUG MARKER: == sanityn test 36: handle ESTALE/open-unlink correctly
But sometimes it does hit, and in that case the thread hangs and remains hung forever, making cleanup time out.
[ 699.235637] Lustre: DEBUG MARKER: == sanityn test 35: -EINTR cp_ast vs. bl_ast race does not evict client ============================== 17:30:03 (1583274603)
[ 700.065742] Lustre: DEBUG MARKER: Race attempt 0
[ 701.154904] Lustre: *** cfs_fail_loc=317, val=2147484440***
[ 701.155724] LustreError: 17592:0:(libcfs_fail.h:169:cfs_race()) cfs_race id 318 sleeping
[ 701.579880] Lustre: DEBUG MARKER: Wait for 17549 17565 for 60 sec...
[ 763.731071] Lustre: DEBUG MARKER: == sanityn test 36: handle ESTALE/open-unlink correctly
Note how the race was hit ("cfs_race id 318 sleeping"), but the sleeping thread never actually woke.
We can confirm this by analyzing the core file:
PID: 17592  TASK: ffff88009dfc8440  CPU: 2  COMMAND: "ll_sa_17565"
 #0 [ffff8800c25276e8] __schedule+0x3ee at ffffffff817d889e
 #1 [ffff8800c2527778] schedule+0x29 at ffffffff817d8ea9
 #2 [ffff8800c2527788] ldlm_lock_decref_internal+0x96e at ffffffffa05b09ce [ptlrpc]
 #3 [ffff8800c25277e8] failed_lock_cleanup+0x98 at ffffffffa05cb928 [ptlrpc]
 #4 [ffff8800c2527808] ldlm_cli_enqueue_fini+0x16c at ffffffffa05cddac [ptlrpc]
 #5 [ffff8800c25278b0] ldlm_cli_enqueue+0x441 at ffffffffa05d1ad1 [ptlrpc]

crash> l *0xffffffffa05b09ce
0xffffffffa05b09ce is in ldlm_lock_decref_internal (/home/green/git/lustre-release/libcfs/include/libcfs/libcfs_fail.h:170).
165		if (CFS_FAIL_PRECHECK(id)) {
166			if (unlikely(__cfs_fail_check_set(id, 0, CFS_FAIL_LOC_NOSET))) {
167				int rc;
168				cfs_race_state = 0;
169				CERROR("cfs_race id %x sleeping\n", id);
170				rc = wait_event_interruptible(cfs_race_waitq,
171						cfs_race_state != 0);
172				CERROR("cfs_fail_race id %x awake: rc=%d\n", id, rc);
173			} else {
174				CERROR("cfs_fail_race id %x waking\n", id);
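To make the hang mechanism concrete, here is a minimal userspace model of the cfs_race() sleep/wake pairing, using pthreads instead of the kernel waitqueue. The names (race_state, race_sleep, race_wake) are illustrative stand-ins, not actual Lustre symbols: the sleeper clears the state and blocks until a paired waker sets it, which is exactly why the thread in the backtrace above is stuck forever when the "waking" side never runs.

```c
/* Userspace sketch of the cfs_race() protocol (assumption: names are
 * illustrative; the kernel uses cfs_race_state/cfs_race_waitq and
 * wait_event_interruptible()). */
#include <pthread.h>

static pthread_mutex_t race_lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  race_waitq = PTHREAD_COND_INITIALIZER;
static int race_state;

/* Sleeper side: clear the state, then block until a waker sets it.
 * If no waker ever runs, this blocks forever -- the bug in this ticket. */
void race_sleep(void)
{
	pthread_mutex_lock(&race_lock);
	race_state = 0;
	while (race_state == 0)
		pthread_cond_wait(&race_waitq, &race_lock);
	pthread_mutex_unlock(&race_lock);
}

/* Waker side: mirrors the "cfs_fail_race id %x waking" branch. */
void race_wake(void)
{
	pthread_mutex_lock(&race_lock);
	race_state = 1;
	pthread_cond_broadcast(&race_waitq);
	pthread_mutex_unlock(&race_lock);
}
```

When both sides fire, the sleeper returns promptly; the hang in this ticket corresponds to race_sleep() running with no matching race_wake() ever issued.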
So, in addition to fixing this, I wonder if it makes sense to clear the active race with a very visible dmesg message?
Here's an example of the failure case: http://testing.linuxhacker.ru:3333/lustre-reports/7130/testresults/sanityn-ldiskfs-DNE-centos7_x86_64-centos7_x86_64/
Issue Links
- is related to LU-13329 Catch improperly reset fail_race condition (Open)