Details

Type: Bug
Resolution: Won't Fix
Priority: Minor
Affects Versions: Lustre 2.1.0, Lustre 2.3.0, Lustre 1.8.8
Environment: Originally it happened with FEFS, based on Lustre 1.8.5, in the RIKEN K computer environment (MDS x1, OSS x2592, OST x5184, Client x84672)
Description
Hi, I believe I found some bugs in mgc_enqueue().
I'm sure about the 1.8.x case because I've actually run into it often,
but I'm not sure about the 2.x case, because that is based only on my
source code inspection, not on actually hitting the problem.
So could someone verify whether or not it's true in the 2.x case?
An overview of the bugs is given below.
1)
mgc_blocking_ast() is never called when mgc_enqueue() fails to allocate
memory for a new ldlm_lock for the cld, because there is no ldlm_lock
holding the callback functions after the allocation failure.
As a result the cld remains, and that causes the MGC obd_device to remain
forever, even after the umount command is executed.
IMHO, in order to fix the problem we should call config_log_put() when
that allocation failure happens. But we have no way to detect just that
failure, because ldlm_cli_enqueue() returns only -ENOMEM to the caller,
and there are other functions that can also return -ENOMEM.
Locally I've fixed it by adding one more argument (though my case is
FEFS, based on Lustre 1.8.5), but I don't think that is a good approach,
so I'd like to discuss how to fix it properly.
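
To make the leak mechanism concrete, here is a self-contained toy model
in plain userspace C (not Lustre code). config_log_get()/config_log_put()
and the toy_* functions are only stand-ins for the real ones, and the
lock_alloc_ok flag is purely hypothetical; the point is just that the
reference taken for the enqueue is dropped nowhere but in the blocking
AST, and that the caller sees nothing more specific than -ENOMEM.

#include <stdio.h>

/* Stand-in for struct config_llog_data: only the refcount matters here. */
struct config_llog_data {
        int cld_refcount;
};

static void config_log_get(struct config_llog_data *cld)
{
        cld->cld_refcount++;
}

static void config_log_put(struct config_llog_data *cld)
{
        cld->cld_refcount--;
}

/* Stand-in for mgc_blocking_ast(): normally the only place where the
 * reference taken for the enqueue is dropped. */
static void toy_blocking_ast(struct config_llog_data *cld)
{
        config_log_put(cld);
}

/* Stand-in for ldlm_cli_enqueue(): if the lock allocation fails there is
 * no ldlm_lock to carry the blocking callback, so toy_blocking_ast()
 * never runs.  The caller only sees -ENOMEM and cannot tell which
 * allocation failed. */
static int toy_enqueue(struct config_llog_data *cld, int lock_alloc_ok)
{
        if (!lock_alloc_ok)
                return -12;             /* -ENOMEM */
        toy_blocking_ast(cld);          /* lock granted, later cancelled */
        return 0;
}

int main(void)
{
        struct config_llog_data cld = { .cld_refcount = 1 };

        config_log_get(&cld);           /* reference taken for the enqueue */
        toy_enqueue(&cld, 0);           /* lock allocation fails */

        /* The reference never drops back to 1, so the cld (and with it
         * the MGC obd_device) can never be released. */
        printf("cld_refcount = %d\n", cld.cld_refcount);
        return 0;
}

Building and running it prints cld_refcount = 2, i.e. the reference taken
for the failed enqueue is never released.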
2)
When mgc_enqueue() fails to allocate memory for a new ptlrpc_request for
the cld, failed_lock_cleanup() is called, and the call path follows the
list below:

failed_lock_cleanup()
  -> ldlm_lock_decref_internal()
  -> ldlm_handle_bl_callback()
  -> mgc_blocking_ast()
  -> ldlm_cli_cancel()
  -> ldlm_cli_cancel_local()

It then reaches ldlm_cli_cancel_local() and goes into the else statement
below, because lock->l_conn_export has not been set yet.
But there it is caught by the LBUG() because it is a client-side lock.
static int ldlm_cli_cancel_local(struct ldlm_lock *lock)
{
        int rc = LDLM_FL_LOCAL_ONLY;
        ENTRY;

        if (lock->l_conn_export) {                          /* <---- NULL in this case */
                ....
        } else {
                if (ns_is_client(ldlm_lock_to_ns(lock))) {  /* <---- client-side lock, so this is true */
                        LDLM_ERROR(lock, "Trying to cancel local lock");  /* <---- caught here */
                        LBUG();
                }
                LDLM_DEBUG(lock, "server-side local cancel");
                ldlm_lock_cancel(lock);
                ldlm_reprocess_all(lock->l_resource);
        }

        RETURN(rc);
}
So I believe ldlm_cli_enqueue() should be modified as in the pseudo code below.
+       lock->l_conn_export = exp;
+       lock->l_export = NULL;
+       lock->l_blocking_ast = einfo->ei_cb_bl;

        if (reqp == NULL || *reqp == NULL) {
                req = ptlrpc_request_alloc_pack(class_exp2cliimp(exp),
                                                &RQF_LDLM_ENQUEUE,
                                                LUSTRE_DLM_VERSION,
                                                LDLM_ENQUEUE);
                if (req == NULL) {
                        failed_lock_cleanup(ns, lock, einfo->ei_mode);
                        LDLM_LOCK_RELEASE(lock);
                        RETURN(-ENOMEM);
                }
                req_passed_in = 0;
                if (reqp)
                        *reqp = req;
        } else {
                int len;

                req = *reqp;
                len = req_capsule_get_size(&req->rq_pill, &RMF_DLM_REQ,
                                           RCL_CLIENT);
                LASSERTF(len >= sizeof(*body), "buflen[%d] = %d, not %d\n",
                         DLM_LOCKREQ_OFF, len, (int)sizeof(*body));
        }

-       lock->l_conn_export = exp;
-       lock->l_export = NULL;
-       lock->l_blocking_ast = einfo->ei_cb_bl;
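
For what it's worth, here is a self-contained toy model in plain
userspace C of why the ordering matters (again not Lustre code:
toy_cancel_local() stands in for the failed_lock_cleanup() -> ... ->
ldlm_cli_cancel_local() chain, the assert() stands in for the LBUG(),
and the set_export_first and req_alloc_ok flags are purely
hypothetical). With the assignments moved before the request
allocation, the cleanup path sees a non-NULL l_conn_export and never
reaches the LBUG() branch.

#include <assert.h>
#include <stdio.h>

/* Stand-in for struct ldlm_lock: only the field the cleanup path checks. */
struct toy_lock {
        void *l_conn_export;            /* NULL until the caller fills it in */
};

/* Stand-in for ldlm_cli_cancel_local(): on a client namespace, a NULL
 * l_conn_export sends us into the "server-side local cancel" branch,
 * which LBUG()s for a client-side lock. */
static void toy_cancel_local(struct toy_lock *lock, int ns_is_client)
{
        if (lock->l_conn_export) {
                printf("normal client-side cancel via the export\n");
        } else {
                /* Stand-in for LDLM_ERROR(...); LBUG(); */
                assert(!ns_is_client && "Trying to cancel local lock");
                printf("server-side local cancel\n");
        }
}

/* Stand-in for ldlm_cli_enqueue() when the ptlrpc_request allocation fails. */
static int toy_enqueue(struct toy_lock *lock, void *exp,
                       int set_export_first, int req_alloc_ok)
{
        if (set_export_first)
                lock->l_conn_export = exp;      /* proposed order */

        if (!req_alloc_ok) {
                /* failed_lock_cleanup() -> ... -> ldlm_cli_cancel_local() */
                toy_cancel_local(lock, /* ns_is_client = */ 1);
                return -12;                     /* -ENOMEM */
        }

        lock->l_conn_export = exp;              /* current order: only reached on success */
        return 0;
}

int main(void)
{
        struct toy_lock lock = { NULL };
        int exp;                                /* dummy export */

        toy_enqueue(&lock, &exp, /* set_export_first = */ 1,
                    /* req_alloc_ok = */ 0);
        printf("survived: the export was set before the request allocation\n");
        return 0;
}

Flipping set_export_first to 0 makes the assert (the LBUG stand-in) fire
instead, which mirrors the crash described above.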