Details
Bug
Resolution: Duplicate
Blocker
None
Lustre 2.4.0
3
6794
Description
While running racer, I repeatedly hit a problem where the completion AST callback gets stuck looking up an object.
Alex thinks it is some kind of race against object deletion that was not fully fixed.
The stack trace looks like this:
[175924.328073] INFO: task ptlrpc_hr01_003:16414 blocked for more than 120 seconds.
[175924.328610] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[175924.329108] ptlrpc_hr01_0 D 0000000000000006 3952 16414 2 0x00000000
[175924.329432] ffff880076a19920 0000000000000046 0000000000000040 0000000000000286
[175924.329950] ffff880076a198a0 0000000000000286 0000000000000286 ffffc9000376b040
[175924.330457] ffff8800573a67b8 ffff880076a19fd8 000000000000fba8 ffff8800573a67b8
[175924.330950] Call Trace:
[175924.331191] [<ffffffffa0743c36>] ? htable_lookup+0x1a6/0x1c0 [obdclass]
[175924.331505] [<ffffffffa041e79e>] cfs_waitq_wait+0xe/0x10 [libcfs]
[175924.331807] [<ffffffffa0744243>] lu_object_find_at+0xb3/0x360 [obdclass]
[175924.332104] [<ffffffff81057d60>] ? default_wake_function+0x0/0x20
[175924.332403] [<ffffffffa07413df>] ? keys_fill+0x6f/0x190 [obdclass]
[175924.332746] [<ffffffffa0744506>] lu_object_find+0x16/0x20 [obdclass]
[175924.333035] [<ffffffffa0549ea6>] mdt_object_find+0x56/0x170 [mdt]
[175924.333398] [<ffffffffa0586e63>] mdt_lvbo_fill+0x2f3/0x800 [mdt]
[175924.333715] [<ffffffffa0845c1a>] ldlm_server_completion_ast+0x18a/0x640 [ptlrpc]
[175924.334204] [<ffffffffa0845a90>] ? ldlm_server_completion_ast+0x0/0x640 [ptlrpc]
[175924.334655] [<ffffffffa081bbdc>] ldlm_work_cp_ast_lock+0xcc/0x200 [ptlrpc]
[175924.334976] [<ffffffffa085c18f>] ptlrpc_set_wait+0x6f/0x880 [ptlrpc]
[175924.335264] [<ffffffff81090154>] ? __init_waitqueue_head+0x24/0x40
[175924.335559] [<ffffffffa041e8a5>] ? cfs_waitq_init+0x15/0x20 [libcfs]
[175924.335867] [<ffffffffa085876e>] ? ptlrpc_prep_set+0x11e/0x300 [ptlrpc]
[175924.336134] [<ffffffffa081bb10>] ? ldlm_work_cp_ast_lock+0x0/0x200 [ptlrpc]
[175924.336444] [<ffffffffa081e19b>] ldlm_run_ast_work+0x1db/0x460 [ptlrpc]
[175924.336767] [<ffffffffa081eda4>] ldlm_reprocess_all+0x114/0x300 [ptlrpc]
[175924.337067] [<ffffffffa08372e3>] ldlm_cli_cancel_local+0x2b3/0x470 [ptlrpc]
[175924.337445] [<ffffffffa083bbab>] ldlm_cli_cancel+0x5b/0x360 [ptlrpc]
[175924.337719] [<ffffffffa083bf42>] ldlm_blocking_ast_nocheck+0x92/0x320 [ptlrpc]
[175924.338177] [<ffffffffa0819070>] ? lock_res_and_lock+0x30/0x50 [ptlrpc]
[175924.338464] [<ffffffffa0549d40>] mdt_blocking_ast+0x190/0x2a0 [mdt]
[175924.338759] [<ffffffffa042e401>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
[175924.339051] [<ffffffff814faf3e>] ? _spin_unlock+0xe/0x10
[175924.339339] [<ffffffffa083f950>] ldlm_handle_bl_callback+0x130/0x400 [ptlrpc]
[175924.339814] [<ffffffffa0820cc6>] ldlm_lock_decref_internal+0x426/0xc80 [ptlrpc]
[175924.340282] [<ffffffff814faf3e>] ? _spin_unlock+0xe/0x10
[175924.340614] [<ffffffffa0712217>] ? class_handle2object+0x97/0x170 [obdclass]
[175924.341175] [<ffffffffa0821f49>] ldlm_lock_decref+0x39/0x90 [ptlrpc]
[175924.341527] [<ffffffffa087112b>] ptlrpc_hr_main+0x39b/0x760 [ptlrpc]
[175924.341824] [<ffffffff81057d60>] ? default_wake_function+0x0/0x20
[175924.342141] [<ffffffffa0870d90>] ? ptlrpc_hr_main+0x0/0x760 [ptlrpc]
[175924.342444] [<ffffffff8100c14a>] child_rip+0xa/0x20
[175924.342734] [<ffffffffa0870d90>] ? ptlrpc_hr_main+0x0/0x760 [ptlrpc]
[175924.343068] [<ffffffffa0870d90>] ? ptlrpc_hr_main+0x0/0x760 [ptlrpc]
[175924.343376] [<ffffffff8100c140>] ? child_rip+0x0/0x20
My current understanding is that this is a deadlock: pid 18000 holds a reference on the object and is waiting for the lock completion AST on it, while pid 15574 is running the completion AST but is stuck waiting for the object to die, which cannot happen while that reference is held.
Can this be fixed by canceling the lock during object death?
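The suspected cycle can be written down as a small wait-for model. This is only a toy sketch to make the reasoning explicit, not Lustre code: the `waiter` struct, `has_wait_cycle()` and `hang_is_deadlock()` are hypothetical names invented here, and the only facts taken from the report are the two pids and what each one is waiting for.

```c
#include <stdbool.h>

/* Toy model (hypothetical, not Lustre code): each waiter is blocked on one
 * condition that only one other waiter can satisfy. */
enum { HOLDER = 0, AST_THREAD = 1, NWAITERS = 2 };

struct waiter {
    const char *name;       /* who is blocked */
    const char *waits_for;  /* what it is blocked on */
    int unblocked_by;       /* index of the waiter that must progress first, -1 if none */
};

/* Follow the "unblocked_by" chain from start; a deadlock exists for start
 * if the chain leads back to it. */
static bool has_wait_cycle(const struct waiter *w, int n, int start)
{
    int cur = start;

    for (int steps = 0; steps < n; steps++) {
        cur = w[cur].unblocked_by;
        if (cur < 0)
            return false;   /* someone in the chain can run freely */
        if (cur == start)
            return true;    /* the chain closed on itself */
    }
    return false;
}

/* Build the scenario from the report. If the holder did not keep its
 * reference while waiting, the object could die and the AST thread would
 * depend on nobody. */
static bool hang_is_deadlock(bool holder_keeps_ref)
{
    struct waiter w[NWAITERS] = {
        [HOLDER]     = { "pid 18000", "completion AST on the lock", AST_THREAD },
        [AST_THREAD] = { "pid 15574", "dying object to go away",
                         holder_keeps_ref ? HOLDER : -1 },
    };

    return has_wait_cycle(w, NWAITERS, HOLDER);
}
```

In this model `hang_is_deadlock(true)` reports a cycle and `hang_is_deadlock(false)` does not, which is why breaking either edge (dropping the reference before waiting, or not waiting for the dying object in the completion path) would be enough to avoid the hang, assuming the analysis above is right.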