Details
Bug
Resolution: Fixed
Major
Lustre 2.6.0
3
14756
Description
I see this running vanilla single node racer with memory allocation fault injection.
[ 169.793670] LustreError: 8024:0:(ldlm_lock.c:852:ldlm_lock_decref_internal_nolock()) ASSERTION( lock->l_readers > 0 ) failed:
[ 169.793681] LustreError: 8024:0:(ldlm_lock.c:852:ldlm_lock_decref_internal_nolock()) LBUG
[ 169.793687] Pid: 8024, comm: setfattr
[ 169.793690]
[ 169.793691] Call Trace:
[ 169.793731] [<ffffffffa02be8c5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[ 169.793757] [<ffffffffa02beec7>] lbug_with_loc+0x47/0xb0 [libcfs]
[ 169.793848] [<ffffffffa0643842>] ldlm_lock_decref_internal_nolock+0xd2/0x180 [ptlrpc]
[ 169.793923] [<ffffffffa0646d40>] ldlm_lock_decref_internal+0x50/0xae0 [ptlrpc]
[ 169.793993] [<ffffffffa0438b7e>] ? class_handle2object+0x3e/0x1d0 [obdclass]
[ 169.794052] [<ffffffffa06481b9>] ldlm_lock_decref+0x39/0x90 [ptlrpc]
[ 169.794088] [<ffffffffa0e31b6f>] ll_intent_drop_lock+0xaf/0x150 [lustre]
[ 169.794113] [<ffffffffa0e31c51>] ll_intent_release+0x41/0x1d0 [lustre]
[ 169.794150] [<ffffffffa0e7e9c8>] ll_lookup_nd+0x108/0x4a0 [lustre]
[ 169.794158] [<ffffffff811b29b5>] do_lookup+0x1a5/0x230
[ 169.794163] [<ffffffff811b2fc4>] __link_path_walk+0x584/0x840
[ 169.794168] [<ffffffff811b398a>] path_walk+0x6a/0xe0
[ 169.794172] [<ffffffff811b3b9b>] filename_lookup+0x6b/0xc0
[ 169.794177] [<ffffffff811b4cc7>] user_path_at+0x57/0xa0
[ 169.794182] [<ffffffff8119f6c3>] ? sys_close+0x43/0x120
[ 169.794187] [<ffffffff8119f6c3>] ? sys_close+0x43/0x120
[ 169.794192] [<ffffffff811cb418>] sys_setxattr+0x48/0xe0
[ 169.794200] [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
[ 169.794203]
This was triggered by an allocation failure in mdc_finish_enqueue(), but the underlying issue is at the bottom of mdc_enqueue():
        rc = mdc_finish_enqueue(exp, req, einfo, it, lockh, rc);
        if (rc < 0) {
                if (lustre_handle_is_used(lockh)) {
                        ldlm_lock_decref(lockh, einfo->ei_mode);
                        memset(lockh, 0, sizeof(*lockh));
                }
                ptlrpc_req_finished(req);
        }
        RETURN(rc);
}
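We should clear it_lock_handle and it_lock_mode as well. Otherwise the intent still records the lock after this error path has dropped its reference, and the later ll_intent_release() -> ll_intent_drop_lock() call seen in the trace decrements l_readers a second time.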
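To illustrate the double decref and the proposed clearing, here is a minimal standalone model in C. It is only a sketch, not the actual patch: every toy_* name is hypothetical, the single global lock stands in for the LDLM lock table, and it assumes that ll_intent_drop_lock() only drops a reference when it_lock_mode is non-zero.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy stand-ins for the Lustre types; all toy_* names are hypothetical. */
struct toy_handle {
	uint64_t cookie;          /* 0 means "unused" */
};

struct toy_intent {
	uint64_t it_lock_handle;  /* copy of the handle cookie kept in the intent */
	int it_lock_mode;         /* non-zero while the intent records a lock */
};

/* One lock only; any non-zero cookie refers to it. */
static int toy_readers;           /* models lock->l_readers */

void toy_decref(uint64_t cookie)
{
	assert(cookie != 0);
	assert(toy_readers > 0);  /* models the ASSERTION that fired */
	toy_readers--;
}

/* Models the error path at the bottom of mdc_enqueue(). */
int toy_enqueue_error_path(struct toy_intent *it, struct toy_handle *lockh,
			   int rc)
{
	if (rc < 0 && lockh->cookie != 0) {
		toy_decref(lockh->cookie);
		memset(lockh, 0, sizeof(*lockh));
		/* Proposed addition: forget the lock in the intent too. */
		it->it_lock_handle = 0;
		it->it_lock_mode = 0;
	}
	return rc;
}

/* Models ll_intent_drop_lock(): decref only if the intent records a lock. */
void toy_intent_drop_lock(struct toy_intent *it)
{
	if (it->it_lock_mode != 0) {
		toy_decref(it->it_lock_handle);
		it->it_lock_mode = 0;
	}
}

/* Replays the failing sequence and returns the reader count afterwards. */
int toy_run_failure_case(void)
{
	struct toy_handle lockh = { .cookie = 42 };
	struct toy_intent it = { .it_lock_handle = 42, .it_lock_mode = 1 };

	toy_readers = 1;
	toy_enqueue_error_path(&it, &lockh, -12 /* -ENOMEM */);
	toy_intent_drop_lock(&it);  /* no-op once the intent is cleared */
	return toy_readers;
}
```

With the two clearing assignments removed from toy_enqueue_error_path(), the second toy_decref() would trip the readers assertion, which mirrors the LBUG in the trace.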
More generally, mdc_enqueue() should not have a *lockh parameter at all, but to fix this we probably need to split md_enqueue() into md_enqueue() and md_flock().