Details
-
Bug
-
Resolution: Fixed
-
Minor
-
Lustre 2.4.0
-
None
-
3
-
4630
Description
The following bug occurred:
> c0-0c0s5n1 LustreError: 20262:0:(ldlm_lock.c:213:ldlm_lock_add_to_lru_nolock()) ASSERTION(lock->l_resource->lr_type != LDLM_FLOCK) failed
> c0-0c0s5n1 LustreError: 20262:0:(ldlm_lock.c:213:ldlm_lock_add_to_lru_nolock()) LBUG
> c0-0c0s5n1 Pid: 20262, comm: fcntl17
> c0-0c0s5n1 Call Trace:
> c0-0c0s5n1 [<ffffffff81007a89>] try_stack_unwind+0x149/0x190
> c0-0c0s5n1 [<ffffffff81006420>] dump_trace+0x90/0x300
> c0-0c0s5n1 [<ffffffffa0132992>] libcfs_debug_dumpstack+0x52/0x80 [libcfs]
> c0-0c0s5n1 [<ffffffffa0132f01>] lbug_with_loc+0x71/0xe0 [libcfs]
> c0-0c0s5n1 [<ffffffffa013c461>] libcfs_assertion_failed+0x61/0x70 [libcfs]
> c0-0c0s5n1 [<ffffffffa0261348>] ldlm_lock_add_to_lru_nolock+0xd8/0xe0 [ptlrpc]
> c0-0c0s5n1 [<ffffffffa02619d9>] ldlm_lock_add_to_lru+0x49/0x100 [ptlrpc]
> c0-0c0s5n1 [<ffffffffa0266d28>] ldlm_lock_decref_internal+0x2e8/0x860 [ptlrpc]
> c0-0c0s5n1 [<ffffffffa027d288>] failed_lock_cleanup+0x58/0x100 [ptlrpc]
> c0-0c0s5n1 [<ffffffffa027d4e6>] ldlm_cli_enqueue_fini+0x1b6/0xbb0 [ptlrpc]
> c0-0c0s5n1 [<ffffffffa0282541>] ldlm_cli_enqueue+0x1a1/0x760 [ptlrpc]
> c0-0c0s5n1 [<ffffffffa04a876b>] ll_file_flock+0x47b/0x690 [lustre]
> c0-0c0s5n1 [<ffffffff81122dee>] vfs_lock_file+0x1e/0x40
> c0-0c0s5n1 [<ffffffff81123027>] fcntl_setlk+0x167/0x320
> c0-0c0s5n1 [<ffffffff810f6661>] sys_fcntl+0x321/0x540
> c0-0c0s5n1 [<ffffffff81002eab>] system_call_fastpath+0x16/0x1b
> c0-0c0s5n1 [<00002aaaadd7f702>] 0x2aaaadd7f702
It looks like the problem is the following race.
The ldlm_cb thread calls ldlm_run_cp_ast_work():
lock_res_and_lock(lock);
list_del_init(&lock->l_cp_ast);
LASSERT(lock->l_flags & LDLM_FL_CP_REQD);
/* save l_completion_ast since it can be changed by
mds_intent_policy(), see bug 14225 */
completion_callback = lock->l_completion_ast;
lock->l_flags &= ~LDLM_FL_CP_REQD;
unlock_res_and_lock(lock);
Meanwhile, the thread that originally waits for the lock receives a signal.
The signal callback ldlm_flock_interrupted_wait() does
lock->l_flags |= LDLM_FL_CBPENDING;
without holding the lock taken by lock_res_and_lock().
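l_wait_event() then exits with an error (signal occurred), and failed_lock_cleanup() fails on the assertion because LDLM_FL_CBPENDING was cleared by ldlm_run_cp_ast_work(): its update of l_flags is a read-modify-write, so a bit set concurrently without the lock can be overwritten by the stale value it writes back.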
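
The lost update described above can be replayed deterministically in a single thread. The following is only a minimal sketch, not Lustre code: the flag values, the l_flags variable and both function names are hypothetical stand-ins, and the serialized variant merely assumes that the signal callback would take the same lock as ldlm_run_cp_ast_work(); it does not claim to be the fix that was committed.

```c
#include <stdint.h>

/* Illustrative bit values only; NOT the real Lustre flag definitions. */
#define FL_CBPENDING 0x1u
#define FL_CP_REQD   0x2u

/*
 * Replays the racy interleaving step by step. "Thread A" stands for the
 * ldlm_cb thread, "thread B" for the signal callback that updates the
 * flags without holding the lock.
 */
uint32_t replay_unlocked_race(void)
{
    uint32_t l_flags = FL_CP_REQD;

    /* Thread A: "l_flags &= ~FL_CP_REQD" starts by loading l_flags. */
    uint32_t a_tmp = l_flags;

    /* Thread B: sets its bit in between, unaware of thread A. */
    l_flags |= FL_CBPENDING;

    /* Thread A: clears its bit in the stale copy and stores it back. */
    a_tmp &= ~FL_CP_REQD;
    l_flags = a_tmp;

    return l_flags; /* FL_CBPENDING has been overwritten */
}

/*
 * Same two updates, but each read-modify-write runs to completion before
 * the other starts, as it would if both sides held the same lock.
 * b_first selects which side wins the lock first.
 */
uint32_t replay_serialized(int b_first)
{
    uint32_t l_flags = FL_CP_REQD;

    if (b_first) {
        l_flags |= FL_CBPENDING;  /* thread B, under the lock */
        l_flags &= ~FL_CP_REQD;   /* thread A, under the lock */
    } else {
        l_flags &= ~FL_CP_REQD;   /* thread A, under the lock */
        l_flags |= FL_CBPENDING;  /* thread B, under the lock */
    }
    return l_flags; /* FL_CBPENDING survives in either order */
}
```

In the unlocked replay the final flags contain neither bit, which corresponds to the state failed_lock_cleanup() then trips over. In the serialized replay FL_CBPENDING is kept regardless of which side runs first.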