[LU-5497] Many MDS service threads blocked in ldlm_completion_ast() Created: 15/Aug/14 Updated: 14/Aug/16 Resolved: 14/Aug/16 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Ned Bass | Assignee: | Oleg Drokin |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | llnl |
| Issue Links: |
| Severity: | 3 |
| Rank (Obsolete): | 15338 |
| Description |
|
Our production MDS systems occasionally end up with many service threads blocked in ldlm_completion_ast(). The details were described in When this happens, client access hangs and the MDS appears completely idle. |
| Comments |
| Comment by Ned Bass [ 15/Aug/14 ] |
|
Example stack trace.
2014-02-03 11:49:01 LustreError: dumping log to /tmp/lustre-log.1391456941.15242
2014-02-03 11:49:02 Pid: 14993, comm: mdt00_011
2014-02-03 11:49:02
2014-02-03 11:49:02 Call Trace:
2014-02-03 11:49:02 [<ffffffffa05cc341>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
2014-02-03 11:49:02 [<ffffffffa05bc77e>] cfs_waitq_wait+0xe/0x10 [libcfs]
2014-02-03 11:49:02 [<ffffffffa08914ca>] ldlm_completion_ast+0x57a/0x960 [ptlrpc]
2014-02-03 11:49:02 [<ffffffffa088cb60>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc]
2014-02-03 11:49:02 [<ffffffff81063ba0>] ? default_wake_function+0x0/0x20
2014-02-03 11:49:02 [<ffffffffa0890b78>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc]
2014-02-03 11:49:02 [<ffffffffa0890f50>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
2014-02-03 11:49:02 [<ffffffffa0e42a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
2014-02-03 11:49:02 [<ffffffffa0e48c7b>] mdt_object_lock0+0x33b/0xaf0 [mdt]
2014-02-03 11:49:02 [<ffffffffa0e42a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
2014-02-03 11:49:02 [<ffffffffa0890f50>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
2014-02-03 11:49:02 [<ffffffffa0e494f4>] mdt_object_lock+0x14/0x20 [mdt]
2014-02-03 11:49:02 [<ffffffffa0e586d9>] mdt_getattr_name_lock+0xe09/0x1960 [mdt]
2014-02-03 11:49:02 [<ffffffffa08b9cf5>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
2014-02-03 11:49:02 [<ffffffffa08e2846>] ? __req_capsule_get+0x166/0x700 [ptlrpc]
2014-02-03 11:49:02 [<ffffffffa08bbf84>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc]
2014-02-03 11:49:02 [<ffffffffa0e594cd>] mdt_intent_getattr+0x29d/0x490 [mdt]
2014-02-03 11:49:02 [<ffffffffa0e45f1e>] mdt_intent_policy+0x39e/0x720 [mdt]
2014-02-03 11:49:02 [<ffffffffa08718b1>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
2014-02-03 11:49:02 [<ffffffffa089a9df>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
2014-02-03 11:49:02 [<ffffffffa0e463a6>] mdt_enqueue+0x46/0xe0 [mdt]
2014-02-03 11:49:02 [<ffffffffa0e4d758>] mdt_handle_common+0x648/0x1660 [mdt]
2014-02-03 11:49:02 [<ffffffffa0e86405>] mds_regular_handle+0x15/0x20 [mdt]
2014-02-03 11:49:02 [<ffffffffa08caf88>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
2014-02-03 11:49:02 [<ffffffffa05bc63e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
2014-02-03 11:49:02 [<ffffffffa05cde0f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
2014-02-03 11:49:02 [<ffffffffa08c22e9>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
2014-02-03 11:49:02 [<ffffffffa08cc31e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
2014-02-03 11:49:02 [<ffffffffa08cb850>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
2014-02-03 11:49:02 [<ffffffff8100c10a>] child_rip+0xa/0x20
2014-02-03 11:49:02 [<ffffffffa08cb850>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
2014-02-03 11:49:02 [<ffffffffa08cb850>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
2014-02-03 11:49:02 [<ffffffff8100c100>] ? child_rip+0x0/0x20
2014-02-03 11:49:02 |
| Comment by Peter Jones [ 16/Aug/14 ] |
|
Oleg's initial reaction to this ticket was that this was Oleg would you care to elaborate? |
| Comment by Oleg Drokin [ 19/Aug/14 ] |
|
This looks like one of |
| Comment by James A Simmons [ 19/Aug/14 ] |
| Comment by Christopher Morrone [ 29/Sep/14 ] |
|
We are using |
| Comment by Oleg Drokin [ 30/Sep/14 ] |
|
Was this with lu4584 in the original form? Using http://review.whamcloud.com/#/c/6511/ + the latest form of lu4584 patch has a high chance of eliminating this problem too, I think. |
| Comment by Patrick Farrell (Inactive) [ 20/Jan/15 ] |
|
Oleg - Also, Chris, James, et al. - Any further updates on this? Have the patches been tried? Are you still seeing the problems? Etc. |
| Comment by James A Simmons [ 20/Jan/15 ] |
|
We have avoided this problem by going to 2.5 for most of our systems. The patch |
| Comment by Christopher Morrone [ 20/Jan/15 ] |
|
We ran Patch Set 9 of change 9488 as part of 2.4.2, but we dropped it in favor of whatever landed on b2_5 now that we are based on 2.5.3. |
| Comment by Mahmoud Hanafi [ 03/Mar/15 ] |
|
Looks like we have hit this issue 3 times in the past 48 hours. Lots of threads are stuck at ldlm_completion_ast. We are running 2.4.3. It is not clear from the conversation above which patch is needed.
LNet: Service thread pid 4390 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Pid: 4390, comm: mdt00_094
Call Trace:
[<ffffffffa07a58b5>] ? _ldlm_lock_debug+0x2d5/0x660 [ptlrpc]
[<ffffffff815404c2>] schedule_timeout+0x192/0x2e0
[<ffffffff81080610>] ? process_timeout+0x0/0x10
[<ffffffffa050d6d1>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
[<ffffffffa07c9fed>] ldlm_completion_ast+0x4ed/0x960 [ptlrpc]
[<ffffffffa07c5780>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc]
[<ffffffff81063be0>] ? default_wake_function+0x0/0x20
[<ffffffffa07c9728>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc]
[<ffffffffa07c9b00>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
[<ffffffffa0da1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
[<ffffffffa0da7cbb>] mdt_object_lock0+0x33b/0xaf0 [mdt]
[<ffffffffa0da1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
[<ffffffffa07c9b00>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
[<ffffffffa0da8534>] mdt_object_lock+0x14/0x20 [mdt]
[<ffffffffa0db77a9>] mdt_getattr_name_lock+0xe19/0x1980 [mdt]
[<ffffffffa07f2125>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
[<ffffffffa081a636>] ? __req_capsule_get+0x166/0x700 [ptlrpc]
[<ffffffffa07f43b4>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc]
[<ffffffffa0db85b2>] mdt_intent_getattr+0x2a2/0x4b0 [mdt]
[<ffffffffa0da4f3e>] mdt_intent_policy+0x39e/0x720 [mdt]
[<ffffffffa07aa831>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
[<ffffffffa07d11bf>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
[<ffffffffa0da53c6>] mdt_enqueue+0x46/0xe0 [mdt]
[<ffffffffa0dabad7>] mdt_handle_common+0x647/0x16d0 [mdt]
[<ffffffffa0de58f5>] mds_regular_handle+0x15/0x20 [mdt]
[<ffffffffa08033b8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
[<ffffffffa050d5de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
[<ffffffffa051ed5f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
[<ffffffffa07fa719>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
[<ffffffff81055813>] ? __wake_up+0x53/0x70
[<ffffffffa080474e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
[<ffffffffa0803c80>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
[<ffffffff8100c0ca>] child_rip+0xa/0x20
[<ffffffffa0803c80>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
[<ffffffffa0803c80>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
[<ffffffff8100c0c0>] ? child_rip+0x0/0x20 |
| Comment by Vinayak (Inactive) [ 13/Jan/16 ] |
|
We have hit this issue in a 2.1.0 build. There are many hung threads in ldlm_completion_ast. Is there any workaround for this? Oleg, can you please point to the patch that fixes this problem?
PID: 240529 TASK: ffff8810288aab40 CPU: 8 COMMAND: "mdt_46"
#0 [ffff880fc6d1b860] schedule at ffffffff814ea122
#1 [ffff880fc6d1b928] cfs_waitq_wait at ffffffffa041c69e [libcfs]
#2 [ffff880fc6d1b938] ldlm_completion_ast at ffffffffa0709722 [ptlrpc]
#3 [ffff880fc6d1b9e8] ldlm_cli_enqueue_local at ffffffffa0708e11 [ptlrpc]
#4 [ffff880fc6d1ba78] mdt_object_lock at ffffffffa0bf2a8d [mdt]
#5 [ffff880fc6d1bb18] mdt_getattr_name_lock at ffffffffa0bff710 [mdt]
#6 [ffff880fc6d1bbb8] mdt_intent_getattr at ffffffffa0c005cd [mdt]
#7 [ffff880fc6d1bc08] mdt_intent_policy at ffffffffa0c01679 [mdt]
#8 [ffff880fc6d1bc48] ldlm_lock_enqueue at ffffffffa06ea1f9 [ptlrpc]
#9 [ffff880fc6d1bca8] ldlm_handle_enqueue0 at ffffffffa07120ef [ptlrpc]
#10 [ffff880fc6d1bd18] mdt_enqueue at ffffffffa0c01a16 [mdt]
#11 [ffff880fc6d1bd38] mdt_handle_common at ffffffffa0bf4ffa [mdt]
#12 [ffff880fc6d1bd88] mdt_regular_handle at ffffffffa0bf5eb5 [mdt]
#13 [ffff880fc6d1bd98] ptlrpc_main at ffffffffa0742e0e [ptlrpc]
#14 [ffff880fc6d1bee8] kthread at ffffffff81090806
#15 [ffff880fc6d1bf48] kernel_thread at ffffffff8100c14a |
| Comment by James A Simmons [ 14/Aug/16 ] |
|
Old blocker for unsupported version |