[LU-2807] lockup in server completion ast -> lu_object_find_at Created: 13/Feb/13  Updated: 16/Oct/13  Resolved: 16/Apr/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Oleg Drokin Assignee: Jinshan Xiong (Inactive)
Resolution: Duplicate Votes: 0
Labels: MB

Issue Links:
Related
is related to LU-3124 bad interaction between layout lock a... Resolved
is related to LU-4106 racer test hang Resolved
Severity: 3
Rank (Obsolete): 6794

 Description   

Running racer, I hit a problem multiple times where the completion AST callback gets stuck looking up some object.
Alex thinks it's a not-fully-fixed race against object deletion of some sort.
The stack trace looks like this:

[175924.328073] INFO: task ptlrpc_hr01_003:16414 blocked for more than 120 seconds.
[175924.328610] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[175924.329108] ptlrpc_hr01_0 D 0000000000000006  3952 16414      2 0x00000000
[175924.329432]  ffff880076a19920 0000000000000046 0000000000000040 0000000000000286
[175924.329950]  ffff880076a198a0 0000000000000286 0000000000000286 ffffc9000376b040
[175924.330457]  ffff8800573a67b8 ffff880076a19fd8 000000000000fba8 ffff8800573a67b8
[175924.330950] Call Trace:
[175924.331191]  [<ffffffffa0743c36>] ? htable_lookup+0x1a6/0x1c0 [obdclass]
[175924.331505]  [<ffffffffa041e79e>] cfs_waitq_wait+0xe/0x10 [libcfs]
[175924.331807]  [<ffffffffa0744243>] lu_object_find_at+0xb3/0x360 [obdclass]
[175924.332104]  [<ffffffff81057d60>] ? default_wake_function+0x0/0x20
[175924.332403]  [<ffffffffa07413df>] ? keys_fill+0x6f/0x190 [obdclass]
[175924.332746]  [<ffffffffa0744506>] lu_object_find+0x16/0x20 [obdclass]
[175924.333035]  [<ffffffffa0549ea6>] mdt_object_find+0x56/0x170 [mdt]
[175924.333398]  [<ffffffffa0586e63>] mdt_lvbo_fill+0x2f3/0x800 [mdt]
[175924.333715]  [<ffffffffa0845c1a>] ldlm_server_completion_ast+0x18a/0x640 [ptlrpc]
[175924.334204]  [<ffffffffa0845a90>] ? ldlm_server_completion_ast+0x0/0x640 [ptlrpc]
[175924.334655]  [<ffffffffa081bbdc>] ldlm_work_cp_ast_lock+0xcc/0x200 [ptlrpc]
[175924.334976]  [<ffffffffa085c18f>] ptlrpc_set_wait+0x6f/0x880 [ptlrpc]
[175924.335264]  [<ffffffff81090154>] ? __init_waitqueue_head+0x24/0x40
[175924.335559]  [<ffffffffa041e8a5>] ? cfs_waitq_init+0x15/0x20 [libcfs]
[175924.335867]  [<ffffffffa085876e>] ? ptlrpc_prep_set+0x11e/0x300 [ptlrpc]
[175924.336134]  [<ffffffffa081bb10>] ? ldlm_work_cp_ast_lock+0x0/0x200 [ptlrpc]
[175924.336444]  [<ffffffffa081e19b>] ldlm_run_ast_work+0x1db/0x460 [ptlrpc]
[175924.336767]  [<ffffffffa081eda4>] ldlm_reprocess_all+0x114/0x300 [ptlrpc]
[175924.337067]  [<ffffffffa08372e3>] ldlm_cli_cancel_local+0x2b3/0x470 [ptlrpc]
[175924.337445]  [<ffffffffa083bbab>] ldlm_cli_cancel+0x5b/0x360 [ptlrpc]
[175924.337719]  [<ffffffffa083bf42>] ldlm_blocking_ast_nocheck+0x92/0x320 [ptlrpc]
[175924.338177]  [<ffffffffa0819070>] ? lock_res_and_lock+0x30/0x50 [ptlrpc]
[175924.338464]  [<ffffffffa0549d40>] mdt_blocking_ast+0x190/0x2a0 [mdt]
[175924.338759]  [<ffffffffa042e401>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
[175924.339051]  [<ffffffff814faf3e>] ? _spin_unlock+0xe/0x10
[175924.339339]  [<ffffffffa083f950>] ldlm_handle_bl_callback+0x130/0x400 [ptlrpc]
[175924.339814]  [<ffffffffa0820cc6>] ldlm_lock_decref_internal+0x426/0xc80 [ptlrpc]
[175924.340282]  [<ffffffff814faf3e>] ? _spin_unlock+0xe/0x10
[175924.340614]  [<ffffffffa0712217>] ? class_handle2object+0x97/0x170 [obdclass]
[175924.341175]  [<ffffffffa0821f49>] ldlm_lock_decref+0x39/0x90 [ptlrpc]
[175924.341527]  [<ffffffffa087112b>] ptlrpc_hr_main+0x39b/0x760 [ptlrpc]
[175924.341824]  [<ffffffff81057d60>] ? default_wake_function+0x0/0x20
[175924.342141]  [<ffffffffa0870d90>] ? ptlrpc_hr_main+0x0/0x760 [ptlrpc]
[175924.342444]  [<ffffffff8100c14a>] child_rip+0xa/0x20
[175924.342734]  [<ffffffffa0870d90>] ? ptlrpc_hr_main+0x0/0x760 [ptlrpc]
[175924.343068]  [<ffffffffa0870d90>] ? ptlrpc_hr_main+0x0/0x760 [ptlrpc]
[175924.343376]  [<ffffffff8100c140>] ? child_rip+0x0/0x20


 Comments   
Comment by Oleg Drokin [ 13/Feb/13 ]

I dumped a core from this node just in case it's useful:
/exports/crashdumps/lu2807/vmcore and modules there too.

Comment by Bruno Faccini (Inactive) [ 28/Feb/13 ]

I went through the crash-dump, and I agree: this thread and another one ("mdt01_008") are stuck in lu_object_find_at() waiting for an object's end-of-life. This means that htable_lookup() returned -EAGAIN.

Going deeper into the crash-dump, I did not find any object linked into the related hash-bucket descriptor (assuming my manual hash computation found the right one ...), but both hung threads do have their cfs_waitlink_t linked onto the lsb_marche_funebre list, so these threads are stuck there forever.

This could be related to LU-1017, but I still have to study that ticket to understand how its fix relates to the original crash that was reported ...

BTW, are there known locking issues/holes around the hash-bucket descriptor's lsb_marche_funebre waitq-list management that may explain this scenario of orphaned threads?
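
For reference, the wait path these threads sit in looks roughly like this (a simplified paraphrase of the lu_object.c code of that era, not the literal source): htable_lookup() returns -EAGAIN for a dying-but-still-hashed object, and lu_object_find_at() then sleeps on the bucket's lsb_marche_funebre waitqueue, which is only broadcast once the object is actually freed:

static struct lu_object *lu_object_find_at(const struct lu_env *env,
                                           struct lu_device *dev,
                                           const struct lu_fid *f,
                                           const struct lu_object_conf *conf)
{
        struct lu_site_bkt_data *bkt;
        struct lu_object        *obj;
        cfs_waitlink_t           wait;

        while (1) {
                /* lookup or allocate; if the cached object is dying,
                 * the lookup links &wait onto bkt->lsb_marche_funebre
                 * and -EAGAIN is returned */
                obj = lu_object_find_try(env, dev, f, conf, &wait);
                if (obj != ERR_PTR(-EAGAIN))
                        return obj;
                /* sleep until the dying object is freed and the
                 * marche_funebre queue is broadcast -- this is where
                 * ptlrpc_hr01_003 and mdt01_008 are parked */
                cfs_waitq_wait(&wait, CFS_TASK_UNINT);
                bkt = lu_site_bkt_from_fid(dev->ld_site, (void *)f);
                cfs_waitq_del(&bkt->lsb_marche_funebre, &wait);
        }
}

If the last reference on the dying object is never dropped, lu_object_free() never runs, the broadcast never happens, and the waiters stay on lsb_marche_funebre forever, which matches what the dump shows.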

Comment by Bruno Faccini (Inactive) [ 08/Mar/13 ]

In fact this situation looks like another possible consequence of the bug described in LU-1640/LU-2492. There is a fix for these tickets (http://review.whamcloud.com/3439), and it was cherry-picked on Jan 8th, 2013 (as 6f34d6b85466bc8cddb8de1816734528af9da09b), so I first need to confirm it was in at the time of the crash-dump, and then either mark this as a duplicate of LU-2492 or dig further into its fix.

Comment by Oleg Drokin [ 08/Mar/13 ]

I can confirm I still hit this almost daily with current master.

Comment by Bruno Faccini (Inactive) [ 13/Mar/13 ]

Bobijam agrees that there may be some more work needed to complement the patch he already provided for LU-2942. I am working on this and on the difference from LU-2492 that I found in the crash-dump, where no object was still hashed.

Comment by Bruno Faccini (Inactive) [ 13/Mar/13 ]

More crash-dump analysis (or disappointments ...) indicates that:

_ both hung threads in lu_object_find_at() have been in this state for ages (more than 100000s, i.e. over a day)
_ thus there are no traces of them left in the Lustre debug-trace buffer
_ since I am not absolutely sure about my manual hash computation, I tried to find a reference to the dying object left in their stacks, but memory and register re-use cleared it
_ there is also a bunch of racer processes/commands stuck since about the same second, but I cannot find a common direct cause ...

Thus, and since the problem is "easily" reproducible, I would like to get a new crash-dump while running with a debug patch that prints the dying object's reference.

Oleg, can you give me more details about the platform and racer configuration you use when you trigger the problem?

Debug patch is at http://review.whamcloud.com/5706.

Comment by Oleg Drokin [ 13/Mar/13 ]

details:
lustre from master, single node, 3-4G ram, 8 contended cpu cores.
run this for a day or two: while :; do rm -rf /tmp/* ; SLOW=yes REFORMAT=yes DURATION=$((900*3)) PTLDEBUG="vfstrace rpctrace dlmtrace neterror ha config ioctl super cache" DEBUG_SIZE=100 sh racer.sh ; sh llmountcleanup.sh ; done

I will try to start your patch later today.

Comment by Oleg Drokin [ 14/Mar/13 ]

reproduced with your patch.
crashdump is in /exports/crashdumps/t3/lu2807-1.dmp and the modules are in the same dir if needed.
I am leaving the node (centos6-10) in this state in case you want to poke at it (you can attach gdb at localhost:1210 if you wish) until at least Saturday since I am flying most of tomorrow (Friday).

Comment by Bruno Faccini (Inactive) [ 15/Mar/13 ]

Thanks Oleg, I am working on the new crash-dump, with the object reference that the debug message printed.

Interestingly, the "dying object found but still hashed" condition was triggered earlier during your tests on the same node but did not induce thread hangs. This is likely a consequence of Bobijam's patch for LU-2942.

Also interesting in our case is that it seems 2 threads always trigger the same dying-object condition, and with the help of the debug-patch trace I found that the dying object concerned is still hashed:

[ 8468.524706] Lustre: 17062:0:(lu_object.c:565:htable_lookup()) found object ffff880029715ea8 dying but still hashed.
[ 8468.540255] Lustre: 15574:0:(lu_object.c:565:htable_lookup()) found object ffff880029715ea8 dying but still hashed.

PID: 17062  TASK: ffff88001cd863c0  CPU: 6   COMMAND: "mdt01_001"
 #0 [ffff8800b574da50] schedule at ffffffff814f7c98
 #1 [ffff8800b574db18] cfs_waitq_wait at ffffffffa0a4a79e [libcfs]
 #2 [ffff8800b574db28] lu_object_find_at at ffffffffa0d0af83 [obdclass]
 #3 [ffff8800b574dbe8] lu_object_find at ffffffffa0d0b246 [obdclass]
 #4 [ffff8800b574dbf8] mdt_object_find at ffffffffa060eea6 [mdt]
 #5 [ffff8800b574dc28] mdt_lvbo_fill at ffffffffa064c3e3 [mdt]
 #6 [ffff8800b574dcb8] ldlm_handle_enqueue0 at ffffffffa0eb2f47 [ptlrpc]
 #7 [ffff8800b574dd28] mdt_enqueue at ffffffffa0622dd6 [mdt]
 #8 [ffff8800b574dd48] mdt_handle_common at ffffffffa0617f18 [mdt]
 #9 [ffff8800b574dd98] mds_regular_handle at ffffffffa06504c5 [mdt]
#10 [ffff8800b574dda8] ptlrpc_server_handle_request at ffffffffa0ee2de3 [ptlrpc]
#11 [ffff8800b574dea8] ptlrpc_main at ffffffffa0ee58ad [ptlrpc]
#12 [ffff8800b574df48] kernel_thread at ffffffff8100c14a

PID: 15574  TASK: ffff880058d08340  CPU: 2   COMMAND: "ptlrpc_hr00_001"
 #0 [ffff8800affef850] schedule at ffffffff814f7c98
 #1 [ffff8800affef918] cfs_waitq_wait at ffffffffa0a4a79e [libcfs]
 #2 [ffff8800affef928] lu_object_find_at at ffffffffa0d0af83 [obdclass]
 #3 [ffff8800affef9e8] lu_object_find at ffffffffa0d0b246 [obdclass]
 #4 [ffff8800affef9f8] mdt_object_find at ffffffffa060eea6 [mdt]
 #5 [ffff8800affefa28] mdt_lvbo_fill at ffffffffa064c3e3 [mdt]
 #6 [ffff8800affefab8] ldlm_server_completion_ast at ffffffffa0eb3b8a [ptlrpc]
 #7 [ffff8800affefb28] ldlm_work_cp_ast_lock at ffffffffa0e89bdc [ptlrpc]
 #8 [ffff8800affefb58] ptlrpc_set_wait at ffffffffa0eca0ff [ptlrpc]
 #9 [ffff8800affefbf8] ldlm_run_ast_work at ffffffffa0e8c19b [ptlrpc]
#10 [ffff8800affefc28] ldlm_reprocess_all at ffffffffa0e8cda4 [ptlrpc]
#11 [ffff8800affefc78] ldlm_cli_cancel_local at ffffffffa0ea5303 [ptlrpc]
#12 [ffff8800affefca8] ldlm_cli_cancel at ffffffffa0ea9bd0 [ptlrpc]
#13 [ffff8800affefce8] ldlm_blocking_ast_nocheck at ffffffffa0ea9f67 [ptlrpc]
#14 [ffff8800affefd18] mdt_blocking_ast at ffffffffa060ed40 [mdt]
#15 [ffff8800affefd98] ldlm_handle_bl_callback at ffffffffa0ead970 [ptlrpc]
#16 [ffff8800affefdc8] ldlm_lock_decref_internal at ffffffffa0e8ecc6 [ptlrpc]
#17 [ffff8800affefe28] ldlm_lock_decref at ffffffffa0e8ff49 [ptlrpc]
#18 [ffff8800affefe58] ptlrpc_hr_main at ffffffffa0edf11b [ptlrpc]
#19 [ffff8800affeff48] kernel_thread at ffffffff8100c14a

*(struct lu_object_header *)0xffff880029715ea8
$4 = {
  loh_flags = 0x1,
  loh_ref = {
    counter = 0x1
  },
  loh_fid = {
    f_seq = 0x200000401,
    f_oid = 0x42,
    f_ver = 0x0
  },
  loh_attr = 0x8001,
  loh_hash = {
    next = 0x0,
    pprev = 0xffffc90005c96b08
  },
  loh_lru = {
    next = 0xffff880029715ee0,
    prev = 0xffff880029715ee0
  },
  loh_layers = {
    next = 0xffff880029715f18,
    prev = 0xffff880082dc0f08
  },
  loh_reference = {<No data fields>}
}
Comment by Bruno Faccini (Inactive) [ 18/Mar/13 ]

Hmm, in fact it seems there is a 3rd thread involved here (stuck since about the same time and holding a pointer/reference to the object the 2 first threads wait for), and it is probably the cause of the remaining reference to the dying object:

PID: 18000  TASK: ffff88004519a500  CPU: 2   COMMAND: "mdt00_004"
 #0 [ffff88002e71d820] schedule at ffffffff814f7c98
 #1 [ffff88002e71d8e8] cfs_waitq_wait at ffffffffa0a4a79e [libcfs]
 #2 [ffff88002e71d8f8] ldlm_completion_ast at ffffffffa0eabdba [ptlrpc]
 #3 [ffff88002e71d9a8] ldlm_cli_enqueue_local at ffffffffa0eab478 [ptlrpc]
 #4 [ffff88002e71da38] mdt_object_lock0 at ffffffffa0611889 [mdt]
 #5 [ffff88002e71dae8] mdt_object_lock at ffffffffa0612104 [mdt]
 #6 [ffff88002e71daf8] mdt_getattr_name_lock at ffffffffa0624ef9 [mdt]
 #7 [ffff88002e71dbb8] mdt_intent_getattr at ffffffffa0625ced [mdt]
 #8 [ffff88002e71dc18] mdt_intent_policy at ffffffffa06228fe [mdt]
 #9 [ffff88002e71dc58] ldlm_lock_enqueue at ffffffffa0e8c70a [ptlrpc]
#10 [ffff88002e71dcb8] ldlm_handle_enqueue0 at ffffffffa0eb2df7 [ptlrpc]
#11 [ffff88002e71dd28] mdt_enqueue at ffffffffa0622dd6 [mdt]
#12 [ffff88002e71dd48] mdt_handle_common at ffffffffa0617f18 [mdt]
#13 [ffff88002e71dd98] mds_regular_handle at ffffffffa06504c5 [mdt]
#14 [ffff88002e71dda8] ptlrpc_server_handle_request at ffffffffa0ee2de3 [ptlrpc]
#15 [ffff88002e71dea8] ptlrpc_main at ffffffffa0ee58ad [ptlrpc]
#16 [ffff88002e71df48] kernel_thread at ffffffff8100c14a

and after some disassembly to retrieve and resolve where the pointer to the dying object comes from for this thread, I was able to determine it is the parent pointer in:

/*
 * UPDATE lock should be taken against parent, and be release before exit;
 * child_bits lock should be taken against child, and be returned back:
 *            (1)normal request should release the child lock;
 *            (2)intent request will grant the lock to client.
 */
static int mdt_getattr_name_lock(struct mdt_thread_info *info,
                                 struct mdt_lock_handle *lhc,
                                 __u64 child_bits,
                                 struct ldlm_reply *ldlm_rep)
{
        struct ptlrpc_request  *req       = mdt_info_req(info);
        struct mdt_body        *reqbody   = NULL;
        struct mdt_object      *parent    = info->mti_object;  <<<<<<<<<
....

Comment by Bruno Faccini (Inactive) [ 20/Mar/13 ]

Forget the 2nd part of my last update (I might be tired ...): the pointer to the dying object in the 3rd thread (pid 18000) is NOT the parent but the child!!! So it is definitely the one that bumped the object's loh_ref/ref-count, and very probably the reason the 2 other threads have not been woken/broadcast.

Comment by Bruno Faccini (Inactive) [ 20/Mar/13 ]

Thus my current understanding of the situation: it is a deadlock between pid 18000, which took a reference on the object and is now waiting for the lock completion AST on it, and pid 15574, which is running the completion but is stuck waiting for the object to die, which will never happen since the reference count is held.

Can this be fixed by canceling the lock during object death ??

Comment by Bruno Faccini (Inactive) [ 25/Mar/13 ]

The deadlock comes from the fact that thread 18000, which owns the last object reference, is waiting for a lock completion that is itself handled by thread 15574. But 15574 cannot deliver the completion for this lock because, while walking its l_cp_ast list, it is already running the completion for another lock on the same object; the object was marked dying in between, and because of the LVB an object lookup became necessary, causing it to wait for the object's full death, which can never happen.

I am currently reviewing the source code concerned in order to propose a fix.

Comment by Bruno Faccini (Inactive) [ 27/Mar/13 ]

The object lookup that is done for the LVB and that causes the deadlock is the result of the recent integration of the layout-swap patches, mainly from LU-1876.

I'll ask Jinshan if this scenario sounds familiar to him and if he has some idea on how to fix it without breaking layout-swap.

Comment by Jinshan Xiong (Inactive) [ 27/Mar/13 ]

I think this is a race between unlink and getattr. Let's make up a test case for this race, say:
1. client1's unlink reaches the MDT;
2. before unlink enqueues its lock, client2 sends a getattr intent request;
3. unlink acquires the inodebits dlm lock;
4. before unlink releases the lock, getattr comes to acquire it and is blocked;
5. unlink finishes and releases the lock; getattr's completion_ast is then invoked;
6. this problem should be reproduced.

If this is the case, we can work out a lu_object_lookup(): if the object has already been killed or does not exist, -ENOENT should be returned, and then -ENOENT should be returned to the getattr intent request too.

Comment by Bruno Faccini (Inactive) [ 28/Mar/13 ]

Thanks Jinshan, so in your view the problem was not introduced by the LVB/layout-swap changes, but only exposed by them.

And the fix you suggest is to give getattr the means to detect that an unlink occurred and the object is dying, via a new lu_object_lookup() method called just after it acquired the "inodebits dlm lock", and to return -ENOENT if the object is dying?

Comment by Jinshan Xiong (Inactive) [ 28/Mar/13 ]

I mean we can declare a new function, say mdt_object_lookup(), which will look up the hash table and make sure the object exists in the cache. mdt_object_lookup() would also call lu_object_find(), but with a new flag in lu_object_conf, say LOC_F_LOOKUP. With this flag, lu_object_find() will look up the hash table only, and of course, if the object is dying it will return -ENOENT.

This assumes that the object is already referenced by someone. For the getattr intent request this is true; however, we need to check the other code paths to make sure.
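
A minimal sketch of what that could look like (mdt_object_lookup() and LOC_F_LOOKUP are only the names proposed above and do not exist yet; the device and conversion helpers are shown schematically):

/* hypothetical lookup-only variant: with LOC_F_LOOKUP set in the
 * lu_object_conf, lu_object_find() would search the hash table only and
 * never allocate, so a dying or absent object yields -ENOENT instead of
 * a wait in lu_object_find_at() */
static struct mdt_object *mdt_object_lookup(const struct lu_env *env,
                                            struct mdt_device *mdt,
                                            const struct lu_fid *fid)
{
        struct lu_object_conf conf = { .loc_flags = LOC_F_LOOKUP };
        struct lu_object     *o;

        o = lu_object_find(env, &mdt->mdt_md_dev.md_lu_dev, fid, &conf);
        if (IS_ERR(o))
                return (struct mdt_object *)o;   /* e.g. ERR_PTR(-ENOENT) */

        return mdt_obj(o);
}

The getattr intent path would then propagate the -ENOENT to the client instead of blocking behind the dying object.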

Comment by Jinshan Xiong (Inactive) [ 01/Apr/13 ]

patch is at: http://review.whamcloud.com/5911

Comment by Jinshan Xiong (Inactive) [ 05/Apr/13 ]

I'm going to fix this issue by finding a field in ldlm_lock, say l_tree_node, to store the mdt_object when it's an intent operation that finds the object first and then requests the dlm lock. Then mdt_lvbo_fill() would only call mdt_object_find() if that field is NULL.
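
A rough sketch of that approach (l_tree_node is just the field name suggested above, and the helper names here are illustrative; the eventual fix went through LU-3124 and may differ):

/* intent path: the handler already owns a referenced mdt_object, stash it
 * in the lock so the completion AST can reuse it */
static void mdt_lock_attach_object(struct ldlm_lock *lock,
                                   struct mdt_object *obj)
{
        lock->l_tree_node = obj;    /* reference stays owned by the intent */
}

/* LVB path, reached from ldlm_server_completion_ast(): only fall back to
 * mdt_object_find() -- and so to lu_object_find_at() and its possible wait
 * on a dying object -- when nothing was cached in the lock */
static struct mdt_object *mdt_lvbo_object(const struct lu_env *env,
                                          struct mdt_device *mdt,
                                          struct ldlm_lock *lock,
                                          const struct lu_fid *fid)
{
        if (lock->l_tree_node != NULL)
                return lock->l_tree_node;

        return mdt_object_find(env, mdt, fid);
}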

Comment by Bruno Faccini (Inactive) [ 05/Apr/13 ]

So finally, you changed your mind and will fix it on the LVB/layout-swap side, as we were discussing before?

Comment by Alex Zhuravlev [ 05/Apr/13 ]

Frankly, I can't say this is a very nice solution... and I don't think one more RPC to fetch the LOV after data restore is such a big problem.

Comment by Jinshan Xiong (Inactive) [ 08/Apr/13 ]

So finally, you changed your mind and will fix it on the LVB/layout-swap side, as we were discussing before?

This is not an issue about layout-swap or anything like that. Maybe I missed something in our previous conversation.

Comment by Bruno Faccini (Inactive) [ 08/Apr/13 ]

No, I did not mean the problem is in layout-swap/LVB, but that it was brought to the front by it. We already agreed it is an old/known race between unlink and getattr. I just wanted to note that you now think the best place to handle and fix this is in mdt_lvbo_fill(), which is where I was pointing out that the extra lookup causing the hang occurs.

Comment by Jinshan Xiong (Inactive) [ 16/Apr/13 ]

Will be fixed in LU-3124 with patch http://review.whamcloud.com/6042
