[LU-5497] Many MDS service threads blocked in ldlm_completion_ast() Created: 15/Aug/14  Updated: 14/Aug/16  Resolved: 14/Aug/16

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.2
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Ned Bass Assignee: Oleg Drokin
Resolution: Won't Fix Votes: 0
Labels: llnl

Issue Links:
Related: is related to LU-4579 "Timeout system horribly broken" (Resolved)
Severity: 3
Rank (Obsolete): 15338

 Description   

Our production MDS systems occasionally hang with many service threads blocked in ldlm_completion_ast(). The details were described in LU-4579, but that issue was closed when the patch that fixed how timeouts are reported landed.

When this happens, client access hangs and the MDS appears completely idle.
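
The threads are sleeping rather than spinning, which is why the node looks idle. A minimal sketch of how the signature can be spotted on a live MDS, assuming the kernel exposes /proc/<pid>/comm and /proc/<pid>/stack and that the service threads are named mdt* as in the traces below; the script is illustrative only and not part of Lustre:

#!/usr/bin/env python
# Illustrative sketch only, not part of Lustre: count MDT service threads
# currently parked in ldlm_completion_ast() by scanning kernel stacks under
# /proc.  Assumes /proc/<pid>/comm and /proc/<pid>/stack are available and
# that MDT service threads are named mdt*; run as root on the affected MDS.
import glob

stuck = []
for comm_path in glob.glob('/proc/[0-9]*/comm'):
    pid_dir = comm_path.rsplit('/', 1)[0]
    try:
        with open(comm_path) as f:
            comm = f.read().strip()
        if not comm.startswith('mdt'):
            continue
        with open(pid_dir + '/stack') as f:
            stack = f.read()
    except IOError:  # thread exited between the glob and the read
        continue
    if 'ldlm_completion_ast' in stack:
        stuck.append((pid_dir.rsplit('/', 1)[1], comm))

print('%d MDT threads blocked in ldlm_completion_ast' % len(stuck))
for pid, comm in stuck:
    print('  pid %s (%s)' % (pid, comm))

If the count stays high while the MDS shows no load, the node is in the state described here.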



 Comments   
Comment by Ned Bass [ 15/Aug/14 ]

Example stack trace.

2014-02-03 11:49:01 LustreError: dumping log to /tmp/lustre-log.1391456941.15242
2014-02-03 11:49:02 Pid: 14993, comm: mdt00_011
2014-02-03 11:49:02
2014-02-03 11:49:02 Call Trace:
2014-02-03 11:49:02  [<ffffffffa05cc341>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
2014-02-03 11:49:02  [<ffffffffa05bc77e>] cfs_waitq_wait+0xe/0x10 [libcfs]
2014-02-03 11:49:02  [<ffffffffa08914ca>] ldlm_completion_ast+0x57a/0x960 [ptlrpc]
2014-02-03 11:49:02  [<ffffffffa088cb60>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc]
2014-02-03 11:49:02  [<ffffffff81063ba0>] ? default_wake_function+0x0/0x20
2014-02-03 11:49:02  [<ffffffffa0890b78>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc]
2014-02-03 11:49:02  [<ffffffffa0890f50>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
2014-02-03 11:49:02  [<ffffffffa0e42a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
2014-02-03 11:49:02  [<ffffffffa0e48c7b>] mdt_object_lock0+0x33b/0xaf0 [mdt]
2014-02-03 11:49:02  [<ffffffffa0e42a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
2014-02-03 11:49:02  [<ffffffffa0890f50>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
2014-02-03 11:49:02  [<ffffffffa0e494f4>] mdt_object_lock+0x14/0x20 [mdt]
2014-02-03 11:49:02  [<ffffffffa0e586d9>] mdt_getattr_name_lock+0xe09/0x1960 [mdt]
2014-02-03 11:49:02  [<ffffffffa08b9cf5>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
2014-02-03 11:49:02  [<ffffffffa08e2846>] ? __req_capsule_get+0x166/0x700 [ptlrpc]
2014-02-03 11:49:02  [<ffffffffa08bbf84>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc]
2014-02-03 11:49:02  [<ffffffffa0e594cd>] mdt_intent_getattr+0x29d/0x490 [mdt]
2014-02-03 11:49:02  [<ffffffffa0e45f1e>] mdt_intent_policy+0x39e/0x720 [mdt]
2014-02-03 11:49:02  [<ffffffffa08718b1>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
2014-02-03 11:49:02  [<ffffffffa089a9df>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
2014-02-03 11:49:02  [<ffffffffa0e463a6>] mdt_enqueue+0x46/0xe0 [mdt]
2014-02-03 11:49:02  [<ffffffffa0e4d758>] mdt_handle_common+0x648/0x1660 [mdt]
2014-02-03 11:49:02  [<ffffffffa0e86405>] mds_regular_handle+0x15/0x20 [mdt]
2014-02-03 11:49:02  [<ffffffffa08caf88>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
2014-02-03 11:49:02  [<ffffffffa05bc63e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
2014-02-03 11:49:02  [<ffffffffa05cde0f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
2014-02-03 11:49:02  [<ffffffffa08c22e9>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
2014-02-03 11:49:02  [<ffffffffa08cc31e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
2014-02-03 11:49:02  [<ffffffffa08cb850>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
2014-02-03 11:49:02  [<ffffffff8100c10a>] child_rip+0xa/0x20
2014-02-03 11:49:02  [<ffffffffa08cb850>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
2014-02-03 11:49:02  [<ffffffffa08cb850>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
2014-02-03 11:49:02  [<ffffffff8100c100>] ? child_rip+0x0/0x20
2014-02-03 11:49:02
Comment by Peter Jones [ 16/Aug/14 ]

Oleg's initial reaction to this ticket was that this was LU-4584 (for which you are evaluating a fix).

Oleg, would you care to elaborate?

Comment by Oleg Drokin [ 19/Aug/14 ]

This looks like one of the LU-2827 fallouts.

Comment by James A Simmons [ 19/Aug/14 ]

Is Livermore running 2.4 with LU-2827 or LU-4584?

Comment by Christopher Morrone [ 29/Sep/14 ]

We are using LU-4584 with 2.4.

Comment by Oleg Drokin [ 30/Sep/14 ]

Was this with LU-4584 in its original form?
There is not much data here, but I initially observed similar lockups in my own testing using your chaos tree.

Using http://review.whamcloud.com/#/c/6511/ plus the latest form of the LU-4584 patch has a high chance of eliminating this problem too, I think.

Comment by Patrick Farrell (Inactive) [ 20/Jan/15 ]

Oleg - LU-4584 is... long. Is the patch you're referring to this one: http://review.whamcloud.com/#/c/9488/ ?

Also, Chris, James, et al. - Any further updates on this? Have the patches been tried? Are you still seeing the problems? Etc.

Comment by James A Simmons [ 20/Jan/15 ]

We have avoided this problem by moving most of our systems to 2.5. The LU-4584 patch should help with this, but if I remember right, Chris reported issues in LU-5525 due to that patch.

Comment by Christopher Morrone [ 20/Jan/15 ]

We ran Patch Set 9 of change 9488 as part of 2.4.2, but we dropped it in favor of whatever landed on b2_5 now that we are based on 2.5.3.

Comment by Mahmoud Hanafi [ 03/Mar/15 ]

It looks like we have hit this issue 3 times in the past 48 hours: lots of threads stuck in ldlm_completion_ast. We are running 2.4.3. It is not clear from the conversation above which patch is needed. (A rough helper for scanning a console log for this signature is sketched after the trace below.)

LNet: Service thread pid 4390 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Pid: 4390, comm: mdt00_094

Call Trace:
 [<ffffffffa07a58b5>] ? _ldlm_lock_debug+0x2d5/0x660 [ptlrpc]
 [<ffffffff815404c2>] schedule_timeout+0x192/0x2e0
 [<ffffffff81080610>] ? process_timeout+0x0/0x10
 [<ffffffffa050d6d1>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
 [<ffffffffa07c9fed>] ldlm_completion_ast+0x4ed/0x960 [ptlrpc]
 [<ffffffffa07c5780>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc]
 [<ffffffff81063be0>] ? default_wake_function+0x0/0x20
 [<ffffffffa07c9728>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc]
 [<ffffffffa07c9b00>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
 [<ffffffffa0da1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
 [<ffffffffa0da7cbb>] mdt_object_lock0+0x33b/0xaf0 [mdt]
 [<ffffffffa0da1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
 [<ffffffffa07c9b00>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
 [<ffffffffa0da8534>] mdt_object_lock+0x14/0x20 [mdt]
 [<ffffffffa0db77a9>] mdt_getattr_name_lock+0xe19/0x1980 [mdt]
 [<ffffffffa07f2125>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
 [<ffffffffa081a636>] ? __req_capsule_get+0x166/0x700 [ptlrpc]
 [<ffffffffa07f43b4>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc]
 [<ffffffffa0db85b2>] mdt_intent_getattr+0x2a2/0x4b0 [mdt]
 [<ffffffffa0da4f3e>] mdt_intent_policy+0x39e/0x720 [mdt]
 [<ffffffffa07aa831>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
 [<ffffffffa07d11bf>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
 [<ffffffffa0da53c6>] mdt_enqueue+0x46/0xe0 [mdt]
 [<ffffffffa0dabad7>] mdt_handle_common+0x647/0x16d0 [mdt]
 [<ffffffffa0de58f5>] mds_regular_handle+0x15/0x20 [mdt]
 [<ffffffffa08033b8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
 [<ffffffffa050d5de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
 [<ffffffffa051ed5f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
 [<ffffffffa07fa719>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
 [<ffffffff81055813>] ? __wake_up+0x53/0x70
 [<ffffffffa080474e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
 [<ffffffffa0803c80>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
 [<ffffffff8100c0ca>] child_rip+0xa/0x20
 [<ffffffffa0803c80>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
 [<ffffffffa0803c80>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
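
For what it's worth, a rough helper for pulling the same signature out of a saved console or dmesg capture, assuming the dump format shown above ("Pid: <n>, comm: <name>" followed by a "Call Trace:" block); the script and file names are hypothetical and not part of any Lustre tooling:

#!/usr/bin/env python
# Illustrative sketch only: scan a saved console/dmesg capture for dumped
# stack traces ("Pid: <n>, comm: <name>" followed by "Call Trace:") and
# report which threads were blocked in ldlm_completion_ast.  Frames marked
# with "? " are skipped so stale stack entries are not counted.
import re
import sys

pid_re = re.compile(r'Pid:\s*(\d+),\s*comm:\s*(\S+)')

current = None   # (pid, comm) of the trace currently being read
stuck = []

with open(sys.argv[1]) as log:
    for line in log:
        m = pid_re.search(line)
        if m:
            current = (m.group(1), m.group(2))
        elif (current and 'ldlm_completion_ast+' in line
              and '? ldlm' not in line):
            stuck.append(current)
            current = None   # count each dumped trace once

print('%d dumped threads were blocked in ldlm_completion_ast' % len(stuck))
for pid, comm in stuck:
    print('  pid %s (%s)' % (pid, comm))

Run it against a saved console capture, e.g. python count_stuck_traces.py console.log (both names hypothetical).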
Comment by Vinayak (Inactive) [ 13/Jan/16 ]

We have faced this issue on a 2.1.0 build. There are many threads hung in ldlm_completion_ast. Is there any workaround for this?

Oleg, can you please point to the patch which fixes this problem?

PID: 240529  TASK: ffff8810288aab40  CPU: 8   COMMAND: "mdt_46"
#0 [ffff880fc6d1b860] schedule at ffffffff814ea122
#1 [ffff880fc6d1b928] cfs_waitq_wait at ffffffffa041c69e [libcfs]
#2 [ffff880fc6d1b938] ldlm_completion_ast at ffffffffa0709722 [ptlrpc]
#3 [ffff880fc6d1b9e8] ldlm_cli_enqueue_local at ffffffffa0708e11 [ptlrpc]
#4 [ffff880fc6d1ba78] mdt_object_lock at ffffffffa0bf2a8d [mdt]
#5 [ffff880fc6d1bb18] mdt_getattr_name_lock at ffffffffa0bff710 [mdt]
#6 [ffff880fc6d1bbb8] mdt_intent_getattr at ffffffffa0c005cd [mdt]
#7 [ffff880fc6d1bc08] mdt_intent_policy at ffffffffa0c01679 [mdt]
#8 [ffff880fc6d1bc48] ldlm_lock_enqueue at ffffffffa06ea1f9 [ptlrpc]
#9 [ffff880fc6d1bca8] ldlm_handle_enqueue0 at ffffffffa07120ef [ptlrpc]
#10 [ffff880fc6d1bd18] mdt_enqueue at ffffffffa0c01a16 [mdt]
#11 [ffff880fc6d1bd38] mdt_handle_common at ffffffffa0bf4ffa [mdt]
#12 [ffff880fc6d1bd88] mdt_regular_handle at ffffffffa0bf5eb5 [mdt]
#13 [ffff880fc6d1bd98] ptlrpc_main at ffffffffa0742e0e [ptlrpc]
#14 [ffff880fc6d1bee8] kthread at ffffffff81090806
#15 [ffff880fc6d1bf48] kernel_thread at ffffffff8100c14a
Comment by James A Simmons [ 14/Aug/16 ]

Old blocker for an unsupported version.
