[LU-1682] Memory allocation failure in mgc_enqueue() tends to cause LBUG and refcount issue Created: 27/Jul/12 Updated: 08/Feb/18 Resolved: 08/Feb/18 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.0, Lustre 2.3.0, Lustre 1.8.8 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Hiroya Nozaki | Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | patch | ||
| Environment: |
Originally it happened with FEFS, based on Lustre 1.8.5, in the RIKEN K computer environment |
||
| Severity: | 3 |
| Rank (Obsolete): | 4517 |
| Description |
|
Hi, I believe I found some bugs in mgc_enqueue(). Could someone verify whether it's also true in the 2.x case?

1) IMHO, in order to fix the problem, we should call config_log_put() when the enqueue fails. Locally, I've fixed it by adding one more argument, though my case … So I'd like to discuss the way to fix it.

2) When the request allocation fails inside ldlm_cli_enqueue(), failed_lock_cleanup() is called before lock->l_conn_export has been set, and the call chain is:

failed_lock_cleanup()
  ldlm_lock_decref_internal()
    ldlm_handle_bl_callback()
      mgc_blocking_ast()
        ldlm_cli_cancel()
          ldlm_cli_cancel_local()
Then it will reach ldlm_cli_cancel_local() and take the branch below:

static int ldlm_cli_cancel_local(struct ldlm_lock *lock)
{
        int rc = LDLM_FL_LOCAL_ONLY;
        ENTRY;

        if (lock->l_conn_export) {      /* <---- NULL in this case */
                ....
        } else {
                if (ns_is_client(ldlm_lock_to_ns(lock))) {      /* <---- this is a client-side lock, so this is true */
                        LDLM_ERROR(lock, "Trying to cancel local lock");        /* <---- caught here */
                        LBUG();
                }
                LDLM_DEBUG(lock, "server-side local cancel");
                ldlm_lock_cancel(lock);
                ldlm_reprocess_all(lock->l_resource);
        }
        RETURN(rc);
}
So I believe ldlm_cli_enqueue() should be modified as in the pseudo code below: moving the lock field initialization before the request allocation means the lock already has l_conn_export set when failed_lock_cleanup() cancels it.

+       lock->l_conn_export = exp;
+       lock->l_export = NULL;
+       lock->l_blocking_ast = einfo->ei_cb_bl;

        if (reqp == NULL || *reqp == NULL) {
                req = ptlrpc_request_alloc_pack(class_exp2cliimp(exp),
                                                &RQF_LDLM_ENQUEUE,
                                                LUSTRE_DLM_VERSION,
                                                LDLM_ENQUEUE);
                if (req == NULL) {
                        failed_lock_cleanup(ns, lock, einfo->ei_mode);
                        LDLM_LOCK_RELEASE(lock);
                        RETURN(-ENOMEM);
                }
                req_passed_in = 0;
                if (reqp)
                        *reqp = req;
        } else {
                int len;

                req = *reqp;
                len = req_capsule_get_size(&req->rq_pill, &RMF_DLM_REQ,
                                           RCL_CLIENT);
                LASSERTF(len >= sizeof(*body), "buflen[%d] = %d, not %d\n",
                         DLM_LOCKREQ_OFF, len, (int)sizeof(*body));
        }

-       lock->l_conn_export = exp;
-       lock->l_export = NULL;
-       lock->l_blocking_ast = einfo->ei_cb_bl;
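
For clarity, this is how the start of that block would read with the move applied (a sketch assembled from the excerpt above, not an actual committed patch; the rest of ldlm_cli_enqueue() is unchanged):

        /* initialize the client lock state before anything that can fail,
         * so that failed_lock_cleanup() -> mgc_blocking_ast() ->
         * ldlm_cli_cancel() sees l_conn_export != NULL and takes the
         * client path instead of hitting the LBUG() shown above */
        lock->l_conn_export = exp;
        lock->l_export = NULL;
        lock->l_blocking_ast = einfo->ei_cb_bl;

        if (reqp == NULL || *reqp == NULL) {
                req = ptlrpc_request_alloc_pack(class_exp2cliimp(exp),
                                                &RQF_LDLM_ENQUEUE,
                                                LUSTRE_DLM_VERSION,
                                                LDLM_ENQUEUE);
                if (req == NULL) {
                        failed_lock_cleanup(ns, lock, einfo->ei_mode);
                        LDLM_LOCK_RELEASE(lock);
                        RETURN(-ENOMEM);
                }
                ....
        }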
|
| Comments |
| Comment by Oleg Drokin [ 27/Jul/12 ] |
|
Thanks for uncovering this problem. I see that the code in 2.x is different from 1.8: the cld get is done after the mgc_enqueue call in 2.x. In 1.8 you can check that the lockh is not 0 after ldlm_cli_enqueue in mgc_enqueue and, if it is still 0 (well, again, the not-valid function), drop the cld refcount. I think this approach should help for issue #1.

For issue #2, 2.x is unaffected in this particular case because we allocate the request before calling ldlm_cli_enqueue, but otherwise the bug is real and could hit in some other scenarios. As such, please submit a separate patch just following your pseudocode for master (and b1_8 too). Thanks again! |
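
A minimal sketch of that check for issue #1 (an illustration only, not the actual patch; it assumes the b1_8 mgc_enqueue() locals lockh and cld, and that config_log_put() is the matching put for the reference being leaked):

        rc = ldlm_cli_enqueue(....);    /* arguments elided, as called from mgc_enqueue() */
        if (!lustre_handle_is_used(lockh)) {
                /* no lock was ever granted, so mgc_blocking_ast() will never
                 * run for it; drop the config log reference here instead */
                config_log_put(cld);
        }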
| Comment by Hiroya Nozaki [ 27/Jul/12 ] |
|
Hi, Oleg. It's been a while, and thank you for your quick response!!

1) I got it, it sounds good. I'll try it from now on. Thanks a lot!!

2) OK, so I'll create and submit the new patch soon. |
| Comment by Oleg Drokin [ 27/Jul/12 ] |
Sorry, what I mean is that while config_log_get() is called after mgc_enqueue in master, it's still called regardless of success. |
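
A possible shape of the corresponding master-side fix, sketched only for illustration (the call site and arguments are assumptions; config_log_get()/config_log_put() are the mgc refcounting helpers this ticket refers to):

        rc = mgc_enqueue(....);         /* arguments elided */
        if (rc == 0)
                /* take the reference only when the enqueue succeeded, so a
                 * failed enqueue does not leave an extra cld refcount behind */
                config_log_get(cld);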
| Comment by Hiroya Nozaki [ 27/Jul/12 ] |
Wow, that's true. I seem to have jumped to conclusions ... OK, so what I'm supposed to do is try to fix the #1 bug on both the 1.8 and master branches. |
| Comment by Oleg Drokin [ 27/Jul/12 ] |
|
Yes, it would be great if you can submit patches for both issues for both branches. |
| Comment by Hiroya Nozaki [ 27/Jul/12 ] |
|
For the #2 bug, b1_8 branch. |
| Comment by Hiroya Nozaki [ 27/Jul/12 ] |
|
For the #2 bug, master branch. |
| Comment by Hiroya Nozaki [ 27/Jul/12 ] |
|
For the #1 bug, b1_8 branch. |
| Comment by Hiroya Nozaki [ 27/Jul/12 ] |
|
For the #1 bug, master branch. I especially want you all to review this patch. |
| Comment by Hiroya Nozaki [ 07/Aug/12 ] |
|
I've been investigating a weird dump of FEFS based on Lustre-1.8.5. Do you know of any similar phenomena? Anything would be helpful for me to proceed. |
| Comment by Jinshan Xiong (Inactive) [ 08/Feb/18 ] |
|
close old tickets |