[LU-7640] stuck mdt thread required reboot of mds Created: 08/Jan/16  Updated: 26/Apr/17  Resolved: 04/Feb/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.3
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Mahmoud Hanafi Assignee: Zhenyu Xu
Resolution: Duplicate Votes: 0
Labels: None

Attachments: Text File lustre-log.1452225773.16286.gz     File messages.gz    
Issue Links:
Related
is related to LU-7372 replay-dual test_26: test failed to r... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The MDS reported stuck mdt threads and dumped the following stack trace:

<code>
Jan 7 20:02:53 nbp8-mds1 kernel: LNet: Service thread pid 16286 was inactive for 464.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Jan 7 20:02:53 nbp8-mds1 kernel: LNet: Skipped 4 previous similar messages
Jan 7 20:02:57 nbp8-mds1 kernel: Pid: 16286, comm: mdt02_020
Jan 7 20:02:57 nbp8-mds1 kernel:
Jan 7 20:02:57 nbp8-mds1 kernel: Call Trace:
Jan 7 20:02:57 nbp8-mds1 kernel: [<ffffffffa04eee01>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
Jan 7 20:02:57 nbp8-mds1 kernel: [<ffffffffa078af70>] ? ldlm_expired_completion_wait+0x0/0x360 [ptlrpc]
Jan 7 20:02:57 nbp8-mds1 kernel: [<ffffffffa078f835>] ldlm_completion_ast+0x545/0x920 [ptlrpc]
Jan 7 20:02:57 nbp8-mds1 kernel: [<ffffffff81061fe0>] ? default_wake_function+0x0/0x20
Jan 7 20:02:57 nbp8-mds1 kernel: [<ffffffffa078ef00>] ldlm_cli_enqueue_local+0x1f0/0x5e0 [ptlrpc]
Jan 7 20:02:57 nbp8-mds1 kernel: [<ffffffffa078f2f0>] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]
Jan 7 20:02:57 nbp8-mds1 kernel: [<ffffffffa0e72de0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Jan 7 20:03:00 nbp8-mds1 kernel: [<ffffffffa0e7cc06>] mdt_object_lock0+0x1b6/0xb30 [mdt]
Jan 7 20:03:00 nbp8-mds1 kernel: [<ffffffffa0e72de0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Jan 7 20:03:00 nbp8-mds1 kernel: [<ffffffffa078f2f0>] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]
Jan 7 20:03:00 nbp8-mds1 kernel: [<ffffffffa0e7d644>] mdt_object_lock+0x14/0x20 [mdt]
Jan 7 20:03:00 nbp8-mds1 kernel: [<ffffffffa0e85b8e>] mdt_getattr_name_lock+0x8fe/0x19d0 [mdt]
Jan 7 20:03:00 nbp8-mds1 kernel: [<ffffffffa07df766>] ? __req_capsule_get+0x166/0x710 [ptlrpc]
Jan 7 20:03:00 nbp8-mds1 kernel: [<ffffffffa07ba7b4>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc]
Jan 7 20:03:00 nbp8-mds1 kernel: [<ffffffffa0e86ef9>] mdt_intent_getattr+0x299/0x480 [mdt]
Jan 7 20:03:00 nbp8-mds1 kernel: [<ffffffffa0e75c3e>] mdt_intent_policy+0x3ae/0x770 [mdt]
Jan 7 20:03:00 nbp8-mds1 kernel: [<ffffffffa076f2c5>] ldlm_lock_enqueue+0x135/0x980 [ptlrpc]
Jan 7 20:03:00 nbp8-mds1 kernel: [<ffffffffa0798ebb>] ldlm_handle_enqueue0+0x51b/0x10c0 [ptlrpc]
Jan 7 20:03:00 nbp8-mds1 kernel: [<ffffffffa0e76106>] mdt_enqueue+0x46/0xe0 [mdt]
Jan 7 20:03:00 nbp8-mds1 kernel: [<ffffffffa0e7aada>] mdt_handle_common+0x52a/0x1470 [mdt]
Jan 7 20:03:00 nbp8-mds1 kernel: [<ffffffffa0eb74a5>] mds_regular_handle+0x15/0x20 [mdt]
Jan 7 20:03:00 nbp8-mds1 kernel: [<ffffffffa07c80c5>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
Jan 7 20:03:00 nbp8-mds1 kernel: [<ffffffffa04f08d5>] ? lc_watchdog_touch+0x65/0x170 [libcfs]
Jan 7 20:03:00 nbp8-mds1 kernel: [<ffffffffa07c0a69>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
Jan 7 20:03:00 nbp8-mds1 kernel: [<ffffffffa07ca89d>] ptlrpc_main+0xafd/0x1780 [ptlrpc]
Jan 7 20:03:00 nbp8-mds1 kernel: [<ffffffff8100c20a>] child_rip+0xa/0x20
Jan 7 20:03:00 nbp8-mds1 kernel: [<ffffffffa07c9da0>] ? ptlrpc_main+0x0/0x1780 [ptlrpc]
Jan 7 20:03:01 nbp8-mds1 kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20
Jan 7 20:03:01 nbp8-mds1 kernel:
Jan 7 20:03:01 nbp8-mds1 kernel: LustreError: dumping log to /tmp/lustre-log.1452225773.16286
</code>

I am attaching /var/log/messages and the Lustre debug dump.

The MDS needed to be rebooted to clear the error state.



 Comments   
Comment by Zhenyu Xu [ 08/Jan/16 ]

It's a duplicate of LU-7372, and there is a patch at http://review.whamcloud.com/17853

Comment by Jay Lan (Inactive) [ 20/Jan/16 ]

What makes you think this is a duplicate of LU-7372? The stack traces do not look alike.

Comment by Zhenyu Xu [ 21/Jan/16 ]

The thread is waiting for a lock to be granted or cancelled (ldlm_completion_ast()), and that never happens. Patch #17853 includes a fix that makes ldlm_expired_completion_wait() return -ETIMEDOUT instead of 0, so the thread won't stay stuck.
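For illustration only, here is a minimal user-space C sketch of the behavior described above; it is not the Lustre source, and the function and variable names are hypothetical. The point it shows: a wait loop whose timeout callback returns 0 simply re-arms the wait forever (the stuck-thread case), whereas a callback returning -ETIMEDOUT lets the caller break out of the wait.

<code>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical timeout callback: mirrors the idea that a
 * 0 return means "keep waiting", while -ETIMEDOUT aborts the wait. */
static int expired_completion_wait(bool abort_on_timeout)
{
	fprintf(stderr, "lock wait timed out\n");
	return abort_on_timeout ? -ETIMEDOUT : 0;
}

/* Hypothetical stand-in for the granted/cancelled check. */
static bool lock_granted_or_cancelled(void)
{
	return false; /* the lock never arrives, as in the reported hang */
}

/* Simplified wait loop: with abort_on_timeout == false it loops
 * forever (stuck thread); with true it returns -ETIMEDOUT. */
static int wait_for_lock(int timeout_secs, bool abort_on_timeout)
{
	for (;;) {
		for (int i = 0; i < timeout_secs; i++) {
			if (lock_granted_or_cancelled())
				return 0;
			sleep(1);
		}
		int rc = expired_completion_wait(abort_on_timeout);
		if (rc != 0)
			return rc; /* -ETIMEDOUT propagates to the caller */
		/* rc == 0: the timeout was swallowed; keep waiting */
	}
}

int main(void)
{
	int rc = wait_for_lock(2, true);
	printf("wait_for_lock() returned %d\n", rc);
	return 0;
}
</code>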

Comment by Jay Lan (Inactive) [ 21/Jan/16 ]

Thank you, Zhenyu!
After a series of problems last night with the MDS/MGS of one of our Lustre filesystems, it was upgraded to run with the LU-7372 patch. We will see.

Comment by Mahmoud Hanafi [ 21/Jan/16 ]

We had a crash after the patch was applied.

Comment by Zhenyu Xu [ 22/Jan/16 ]

What's the crash backtrace?

Comment by Peter Jones [ 04/Feb/16 ]

Duplicate of LU-7692.

Comment by John Hammond [ 26/Apr/17 ]

Just to clarify, recent versions of http://review.whamcloud.com/17853 no longer contain the change to ldlm_expired_completion_wait() mentioned above, and this should not be considered a duplicate of LU-7372.
