[LU-2807] lockup in server completion ast -> lu_object_find_at Created: 13/Feb/13 Updated: 16/Oct/13 Resolved: 16/Apr/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Oleg Drokin | Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | MB |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 6794 |
| Description |
|
Running racer I hit a problem multiple times where, in the completion AST, the callback gets stuck looking for some object.
[175924.328073] INFO: task ptlrpc_hr01_003:16414 blocked for more than 120 seconds.
[175924.328610] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[175924.329108] ptlrpc_hr01_0 D 0000000000000006 3952 16414 2 0x00000000
[175924.329432] ffff880076a19920 0000000000000046 0000000000000040 0000000000000286
[175924.329950] ffff880076a198a0 0000000000000286 0000000000000286 ffffc9000376b040
[175924.330457] ffff8800573a67b8 ffff880076a19fd8 000000000000fba8 ffff8800573a67b8
[175924.330950] Call Trace:
[175924.331191] [<ffffffffa0743c36>] ? htable_lookup+0x1a6/0x1c0 [obdclass]
[175924.331505] [<ffffffffa041e79e>] cfs_waitq_wait+0xe/0x10 [libcfs]
[175924.331807] [<ffffffffa0744243>] lu_object_find_at+0xb3/0x360 [obdclass]
[175924.332104] [<ffffffff81057d60>] ? default_wake_function+0x0/0x20
[175924.332403] [<ffffffffa07413df>] ? keys_fill+0x6f/0x190 [obdclass]
[175924.332746] [<ffffffffa0744506>] lu_object_find+0x16/0x20 [obdclass]
[175924.333035] [<ffffffffa0549ea6>] mdt_object_find+0x56/0x170 [mdt]
[175924.333398] [<ffffffffa0586e63>] mdt_lvbo_fill+0x2f3/0x800 [mdt]
[175924.333715] [<ffffffffa0845c1a>] ldlm_server_completion_ast+0x18a/0x640 [ptlrpc]
[175924.334204] [<ffffffffa0845a90>] ? ldlm_server_completion_ast+0x0/0x640 [ptlrpc]
[175924.334655] [<ffffffffa081bbdc>] ldlm_work_cp_ast_lock+0xcc/0x200 [ptlrpc]
[175924.334976] [<ffffffffa085c18f>] ptlrpc_set_wait+0x6f/0x880 [ptlrpc]
[175924.335264] [<ffffffff81090154>] ? __init_waitqueue_head+0x24/0x40
[175924.335559] [<ffffffffa041e8a5>] ? cfs_waitq_init+0x15/0x20 [libcfs]
[175924.335867] [<ffffffffa085876e>] ? ptlrpc_prep_set+0x11e/0x300 [ptlrpc]
[175924.336134] [<ffffffffa081bb10>] ? ldlm_work_cp_ast_lock+0x0/0x200 [ptlrpc]
[175924.336444] [<ffffffffa081e19b>] ldlm_run_ast_work+0x1db/0x460 [ptlrpc]
[175924.336767] [<ffffffffa081eda4>] ldlm_reprocess_all+0x114/0x300 [ptlrpc]
[175924.337067] [<ffffffffa08372e3>] ldlm_cli_cancel_local+0x2b3/0x470 [ptlrpc]
[175924.337445] [<ffffffffa083bbab>] ldlm_cli_cancel+0x5b/0x360 [ptlrpc]
[175924.337719] [<ffffffffa083bf42>] ldlm_blocking_ast_nocheck+0x92/0x320 [ptlrpc]
[175924.338177] [<ffffffffa0819070>] ? lock_res_and_lock+0x30/0x50 [ptlrpc]
[175924.338464] [<ffffffffa0549d40>] mdt_blocking_ast+0x190/0x2a0 [mdt]
[175924.338759] [<ffffffffa042e401>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
[175924.339051] [<ffffffff814faf3e>] ? _spin_unlock+0xe/0x10
[175924.339339] [<ffffffffa083f950>] ldlm_handle_bl_callback+0x130/0x400 [ptlrpc]
[175924.339814] [<ffffffffa0820cc6>] ldlm_lock_decref_internal+0x426/0xc80 [ptlrpc]
[175924.340282] [<ffffffff814faf3e>] ? _spin_unlock+0xe/0x10
[175924.340614] [<ffffffffa0712217>] ? class_handle2object+0x97/0x170 [obdclass]
[175924.341175] [<ffffffffa0821f49>] ldlm_lock_decref+0x39/0x90 [ptlrpc]
[175924.341527] [<ffffffffa087112b>] ptlrpc_hr_main+0x39b/0x760 [ptlrpc]
[175924.341824] [<ffffffff81057d60>] ? default_wake_function+0x0/0x20
[175924.342141] [<ffffffffa0870d90>] ? ptlrpc_hr_main+0x0/0x760 [ptlrpc]
[175924.342444] [<ffffffff8100c14a>] child_rip+0xa/0x20
[175924.342734] [<ffffffffa0870d90>] ? ptlrpc_hr_main+0x0/0x760 [ptlrpc]
[175924.343068] [<ffffffffa0870d90>] ? ptlrpc_hr_main+0x0/0x760 [ptlrpc]
[175924.343376] [<ffffffff8100c140>] ? child_rip+0x0/0x20
|
| Comments |
| Comment by Oleg Drokin [ 13/Feb/13 ] |
|
I dumped a core from this node just in case it's useful: |
| Comment by Bruno Faccini (Inactive) [ 28/Feb/13 ] |
|
I went through the crash dump, and I agree: this thread and another one ("mdt01_008") are stuck in lu_object_find_at() waiting for an object's end-of-life, which means htable_lookup() returned -EAGAIN. Digging deeper into the crash dump I did not find any object linked into the related hash-bucket descriptor (as far as my manual hash computations located the right one ...), but both hung threads do have their cfs_waitlink_t linked onto the lsb_marche_funebre list, so these threads are stuck there forever. Could be related to
BTW, are there any known locking issues/holes around the hash-bucket descriptor's lsb_marche_funebre waitq-list management that could explain this scenario of orphaned threads?
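For illustration, a simplified sketch of the wait pattern just described (a paraphrase, not the verbatim lu_object.c source; lookup_or_allocate() is a stand-in name for the internal htable_lookup()/allocate step):

static struct lu_object *find_or_wait(const struct lu_env *env,
                                      struct lu_device *dev,
                                      const struct lu_fid *fid,
                                      const struct lu_object_conf *conf)
{
        struct lu_object *obj;
        cfs_waitlink_t    wait;  /* gets linked onto the bucket's
                                  * lsb_marche_funebre waitq */

        for (;;) {
                /* assumed to return ERR_PTR(-EAGAIN) when the cached
                 * object is found but dying */
                obj = lookup_or_allocate(env, dev, fid, conf, &wait);
                if (obj != ERR_PTR(-EAGAIN))
                        return obj;      /* found, allocated, or error */
                /* the dying object is supposed to wake us once it is
                 * finally freed; if its last reference is never dropped,
                 * as in the stacks above, we sleep here forever */
                cfs_waitq_wait(&wait, CFS_TASK_UNINT);
        }
}
|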
| Comment by Bruno Faccini (Inactive) [ 08/Mar/13 ] |
|
In fact, this situation looks like another possible consequence of the bug described in |
| Comment by Oleg Drokin [ 08/Mar/13 ] |
|
I can confirm I still hit this almost daily with current master. |
| Comment by Bruno Faccini (Inactive) [ 13/Mar/13 ] |
|
Bobijam agrees that there may be some more work needed to complement the patch he already provided for |
| Comment by Bruno Faccini (Inactive) [ 13/Mar/13 ] |
|
More crash-dump digging (or disappointments ...) indicates that:
_ both hung threads in lu_object_find_at() have been in this state for ages (more than 100000s, i.e. over a day)
Thus, since the problem is "easily" reproducible, I would like to get a new crash dump from a run with a debug patch that prints the dying object reference. Oleg, can you give me more details on the platform and racer configs you use when you trigger the problem? The debug patch is at http://review.whamcloud.com/5706. |
| Comment by Oleg Drokin [ 13/Mar/13 ] |
|
details: I will try to start your patch later today. |
| Comment by Oleg Drokin [ 14/Mar/13 ] |
|
Reproduced with your patch. |
| Comment by Bruno Faccini (Inactive) [ 15/Mar/13 ] |
|
Thanks Oleg, I am working on the new crash dump using the object reference that the debug message printed. Interestingly, the "dying object found but still hashed" condition was triggered earlier during your tests on the same node but did not induce any thread hangs; this is likely a consequence of Bobijam's patch for
Also interesting in our case is that it always seems to be 2 threads that trigger the same dying-object condition, and with the help of the debug patch trace I found that the concerned dying object is still hashed:
[ 8468.524706] Lustre: 17062:0:(lu_object.c:565:htable_lookup()) found object ffff880029715ea8 dying but still hashed.
[ 8468.540255] Lustre: 15574:0:(lu_object.c:565:htable_lookup()) found object ffff880029715ea8 dying but still hashed.
PID: 17062 TASK: ffff88001cd863c0 CPU: 6 COMMAND: "mdt01_001"
#0 [ffff8800b574da50] schedule at ffffffff814f7c98
#1 [ffff8800b574db18] cfs_waitq_wait at ffffffffa0a4a79e [libcfs]
#2 [ffff8800b574db28] lu_object_find_at at ffffffffa0d0af83 [obdclass]
#3 [ffff8800b574dbe8] lu_object_find at ffffffffa0d0b246 [obdclass]
#4 [ffff8800b574dbf8] mdt_object_find at ffffffffa060eea6 [mdt]
#5 [ffff8800b574dc28] mdt_lvbo_fill at ffffffffa064c3e3 [mdt]
#6 [ffff8800b574dcb8] ldlm_handle_enqueue0 at ffffffffa0eb2f47 [ptlrpc]
#7 [ffff8800b574dd28] mdt_enqueue at ffffffffa0622dd6 [mdt]
#8 [ffff8800b574dd48] mdt_handle_common at ffffffffa0617f18 [mdt]
#9 [ffff8800b574dd98] mds_regular_handle at ffffffffa06504c5 [mdt]
#10 [ffff8800b574dda8] ptlrpc_server_handle_request at ffffffffa0ee2de3 [ptlrpc]
#11 [ffff8800b574dea8] ptlrpc_main at ffffffffa0ee58ad [ptlrpc]
#12 [ffff8800b574df48] kernel_thread at ffffffff8100c14a
PID: 15574 TASK: ffff880058d08340 CPU: 2 COMMAND: "ptlrpc_hr00_001"
#0 [ffff8800affef850] schedule at ffffffff814f7c98
#1 [ffff8800affef918] cfs_waitq_wait at ffffffffa0a4a79e [libcfs]
#2 [ffff8800affef928] lu_object_find_at at ffffffffa0d0af83 [obdclass]
#3 [ffff8800affef9e8] lu_object_find at ffffffffa0d0b246 [obdclass]
#4 [ffff8800affef9f8] mdt_object_find at ffffffffa060eea6 [mdt]
#5 [ffff8800affefa28] mdt_lvbo_fill at ffffffffa064c3e3 [mdt]
#6 [ffff8800affefab8] ldlm_server_completion_ast at ffffffffa0eb3b8a [ptlrpc]
#7 [ffff8800affefb28] ldlm_work_cp_ast_lock at ffffffffa0e89bdc [ptlrpc]
#8 [ffff8800affefb58] ptlrpc_set_wait at ffffffffa0eca0ff [ptlrpc]
#9 [ffff8800affefbf8] ldlm_run_ast_work at ffffffffa0e8c19b [ptlrpc]
#10 [ffff8800affefc28] ldlm_reprocess_all at ffffffffa0e8cda4 [ptlrpc]
#11 [ffff8800affefc78] ldlm_cli_cancel_local at ffffffffa0ea5303 [ptlrpc]
#12 [ffff8800affefca8] ldlm_cli_cancel at ffffffffa0ea9bd0 [ptlrpc]
#13 [ffff8800affefce8] ldlm_blocking_ast_nocheck at ffffffffa0ea9f67 [ptlrpc]
#14 [ffff8800affefd18] mdt_blocking_ast at ffffffffa060ed40 [mdt]
#15 [ffff8800affefd98] ldlm_handle_bl_callback at ffffffffa0ead970 [ptlrpc]
#16 [ffff8800affefdc8] ldlm_lock_decref_internal at ffffffffa0e8ecc6 [ptlrpc]
#17 [ffff8800affefe28] ldlm_lock_decref at ffffffffa0e8ff49 [ptlrpc]
#18 [ffff8800affefe58] ptlrpc_hr_main at ffffffffa0edf11b [ptlrpc]
#19 [ffff8800affeff48] kernel_thread at ffffffff8100c14a
*(struct lu_object_header *)0xffff880029715ea8
$4 = {
loh_flags = 0x1,
loh_ref = {
counter = 0x1
},
loh_fid = {
f_seq = 0x200000401,
f_oid = 0x42,
f_ver = 0x0
},
loh_attr = 0x8001,
loh_hash = {
next = 0x0,
pprev = 0xffffc90005c96b08
},
loh_lru = {
next = 0xffff880029715ee0,
prev = 0xffff880029715ee0
},
loh_layers = {
next = 0xffff880029715f18,
prev = 0xffff880082dc0f08
},
loh_reference = {<No data fields>}
}
|
| Comment by Bruno Faccini (Inactive) [ 18/Mar/13 ] |
|
Hmm, in fact it seems there is a 3rd thread involved here (stuck since about the same time, and holding a pointer/reference to the object the first 2 threads wait for), and it is probably the cause of the remaining reference to the dying object:
PID: 18000 TASK: ffff88004519a500 CPU: 2 COMMAND: "mdt00_004"
#0 [ffff88002e71d820] schedule at ffffffff814f7c98
#1 [ffff88002e71d8e8] cfs_waitq_wait at ffffffffa0a4a79e [libcfs]
#2 [ffff88002e71d8f8] ldlm_completion_ast at ffffffffa0eabdba [ptlrpc]
#3 [ffff88002e71d9a8] ldlm_cli_enqueue_local at ffffffffa0eab478 [ptlrpc]
#4 [ffff88002e71da38] mdt_object_lock0 at ffffffffa0611889 [mdt]
#5 [ffff88002e71dae8] mdt_object_lock at ffffffffa0612104 [mdt]
#6 [ffff88002e71daf8] mdt_getattr_name_lock at ffffffffa0624ef9 [mdt]
#7 [ffff88002e71dbb8] mdt_intent_getattr at ffffffffa0625ced [mdt]
#8 [ffff88002e71dc18] mdt_intent_policy at ffffffffa06228fe [mdt]
#9 [ffff88002e71dc58] ldlm_lock_enqueue at ffffffffa0e8c70a [ptlrpc]
#10 [ffff88002e71dcb8] ldlm_handle_enqueue0 at ffffffffa0eb2df7 [ptlrpc]
#11 [ffff88002e71dd28] mdt_enqueue at ffffffffa0622dd6 [mdt]
#12 [ffff88002e71dd48] mdt_handle_common at ffffffffa0617f18 [mdt]
#13 [ffff88002e71dd98] mds_regular_handle at ffffffffa06504c5 [mdt]
#14 [ffff88002e71dda8] ptlrpc_server_handle_request at ffffffffa0ee2de3 [ptlrpc]
#15 [ffff88002e71dea8] ptlrpc_main at ffffffffa0ee58ad [ptlrpc]
#16 [ffff88002e71df48] kernel_thread at ffffffff8100c14a
After some disassembly to retrieve and resolve where the pointer to the dying object comes from in this thread, I was able to determine that it is the parent pointer in:
/*
* UPDATE lock should be taken against parent, and be release before exit;
* child_bits lock should be taken against child, and be returned back:
* (1)normal request should release the child lock;
* (2)intent request will grant the lock to client.
*/
static int mdt_getattr_name_lock(struct mdt_thread_info *info,
struct mdt_lock_handle *lhc,
__u64 child_bits,
struct ldlm_reply *ldlm_rep)
{
struct ptlrpc_request *req = mdt_info_req(info);
struct mdt_body *reqbody = NULL;
struct mdt_object *parent = info->mti_object; <<<<<<<<<
....
|
| Comment by Bruno Faccini (Inactive) [ 20/Mar/13 ] |
|
Forget the 2nd part of my last update (I might be tired ...): the pointer to the dying object in the 3rd thread (pid 18000) is NOT the parent but the child!!! So it is definitely the one that bumped the object's loh_ref/ref-count, and very probably the reason the 2 other threads have not been woken up/broadcast to. |
| Comment by Bruno Faccini (Inactive) [ 20/Mar/13 ] |
|
So my understanding of the situation now is that it is a deadlock between pid 18000, which took a reference on the object and is now waiting for the lock completion AST on it, and pid 15574, which is running the completion but is stuck waiting for the object to die, which will never happen since the reference is still held. Can this be fixed by cancelling the lock during object death?? |
| Comment by Bruno Faccini (Inactive) [ 25/Mar/13 ] |
|
The deadlock comes from the fact that thread 18000, which owns the last object reference, is waiting for a lock completion that is itself handled by thread 15574. But 15574 cannot deliver the completion for this lock because, while walking its l_cp_ast list, it is already running the completion for another lock on the same object; the object was marked dying in between, and because of the LVB an object lookup became necessary, so it is waiting for the object's full death, which can never happen. I am currently reviewing the relevant source code to propose a fix.
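To make the shape of this wait cycle concrete, here is a small standalone illustration (plain pthreads, not Lustre code): thread A stands in for pid 18000 (holds the last reference, waits for the completion) and thread B for pid 15574 (must deliver the completion, but first waits for the reference to be dropped). The program deadlocks by construction:

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int refcount = 1;        /* thread A owns the last reference */
static int completion_done = 0; /* only thread B can set this */

static void *thread_a(void *arg) /* stands in for mdt00_004 (pid 18000) */
{
        (void)arg;
        pthread_mutex_lock(&lock);
        while (!completion_done)        /* waits for the completion AST */
                pthread_cond_wait(&cond, &lock);
        refcount--;                     /* the reference would be dropped here */
        pthread_cond_broadcast(&cond);
        pthread_mutex_unlock(&lock);
        return NULL;
}

static void *thread_b(void *arg) /* stands in for ptlrpc_hr00_001 (pid 15574) */
{
        (void)arg;
        pthread_mutex_lock(&lock);
        while (refcount > 0)            /* waits for the object to die */
                pthread_cond_wait(&cond, &lock);
        completion_done = 1;            /* the completion would run here */
        pthread_cond_broadcast(&cond);
        pthread_mutex_unlock(&lock);
        return NULL;
}

int main(void)
{
        pthread_t a, b;

        pthread_create(&a, NULL, thread_a, NULL);
        pthread_create(&b, NULL, thread_b, NULL);
        pthread_join(a, NULL);          /* never returns: A waits on B, B waits on A */
        pthread_join(b, NULL);
        puts("unreachable");
        return 0;
}
|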
| Comment by Bruno Faccini (Inactive) [ 27/Mar/13 ] |
|
The object lookup that is done because of the LVB and causes the deadlock is the result of the recent integration of the layout-swap patches, mainly from
I'll ask Jinshan if this scenario sounds familiar to him and if he has any idea how to fix it without breaking layout-swap. |
| Comment by Jinshan Xiong (Inactive) [ 27/Mar/13 ] |
|
I think this is a race between unlink and getattr. Let's make up a test case for this race, say:
If this is the case, we can work out a lu_object_lookup(), and if the object is already killed or does not exist, -ENOENT should be returned; then -ENOENT should be returned to the getattr intent request too. |
| Comment by Bruno Faccini (Inactive) [ 28/Mar/13 ] |
|
Thanks Jinshan, so for you the problem has not been introduced by the LVB/layout-swap changes but only exposed by them. And the fix you suggest is to give getattr the means to detect that an unlink occurred and the object is dying, via a new lu_object_lookup() method called just after it acquires the inodebit DLM lock, and to return -ENOENT if the object is dying? |
| Comment by Jinshan Xiong (Inactive) [ 28/Mar/13 ] |
|
I mean we can declare a new function, say mdt_object_lookup(), which will look up the hash table and make sure the object exists in the cache. mdt_object_lookup() also calls lu_object_find(), but with a new flag in lu_object_conf, say LOC_F_LOOKUP. With this flag, lu_object_find() will look up the hash table only, and of course, if the object is dying it will return -ENOENT. This assumes that the object must already be referenced by someone; for a getattr intent request this is true, but we need to check the other code paths to make sure.
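For illustration, a rough sketch of what that proposal might look like (the flag, the conf usage and the function body are the proposal itself, not existing code; mdt_obj() is assumed to be the usual lu_object-to-mdt_object conversion helper):

static struct mdt_object *mdt_object_lookup(const struct lu_env *env,
                                            struct mdt_device *mdt,
                                            const struct lu_fid *fid)
{
        struct lu_object_conf conf = { .loc_flags = LOC_F_LOOKUP };
        struct lu_object     *o;

        /* with LOC_F_LOOKUP, lu_object_find() would probe the hash table
         * only: a dying or absent object yields -ENOENT instead of
         * sleeping on lsb_marche_funebre */
        o = lu_object_find(env, &mdt->mdt_md_dev.md_lu_dev, fid, &conf);
        if (IS_ERR(o))
                return (struct mdt_object *)o;  /* typically -ENOENT */
        return mdt_obj(o);
}
|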
| Comment by Jinshan Xiong (Inactive) [ 01/Apr/13 ] |
|
The patch is at: http://review.whamcloud.com/5911 |
| Comment by Jinshan Xiong (Inactive) [ 05/Apr/13 ] |
|
I'm going to fix this issue by finding a field in ldlm_lock, say l_tree_node, to store the mdt_object when it is an intent operation that finds the object first and then requests the DLM lock. Then mdt_lvbo_fill() only calls mdt_object_find() if that field is NULL.
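A minimal sketch of that idea (illustrative only; reusing l_ast_data as the cache field is an assumption made for the sketch, the actual patch may introduce a dedicated field as suggested above):

static struct mdt_object *mdt_lvb_object(const struct lu_env *env,
                                         struct mdt_device *mdt,
                                         struct ldlm_lock *lock,
                                         const struct lu_fid *fid)
{
        /* object cached by the intent path, which already holds a reference */
        struct mdt_object *obj = lock->l_ast_data;

        if (obj != NULL)
                return obj;     /* no lookup, so no risk of waiting on a
                                 * dying object in lu_object_find_at() */
        /* fall back to the current behaviour: a blocking find by FID */
        return mdt_object_find(env, mdt, fid);
}
|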
| Comment by Bruno Faccini (Inactive) [ 05/Apr/13 ] |
|
So finally you changed your mind and will fix it on the LVB/layout-swap side, as we were discussing before? |
| Comment by Alex Zhuravlev [ 05/Apr/13 ] |
|
Frankly, I can't say this is a very nice solution... and I don't think one more RPC to fetch the LOV after a data restore is such a big problem. |
| Comment by Jinshan Xiong (Inactive) [ 08/Apr/13 ] |
This is not an issue about layout-swap or something. Maybe I missed something in our previous conversation |
| Comment by Bruno Faccini (Inactive) [ 08/Apr/13 ] |
|
No, I did not mean the problem is in layout-swap/LVB, only that it was brought to the surface by it. We already agreed it is an old/known race between unlink and getattr. I just wanted to point out that you now think the best place to handle and fix this is mdt_lvbo_fill(), which is where I was pointing out that the extra lookup causing the hang happens. |
| Comment by Jinshan Xiong (Inactive) [ 16/Apr/13 ] |
|
Will be fixed in |