Details
- Type: Bug
- Resolution: Duplicate
- Priority: Major
- Fix Version/s: None
- Affects Version/s: Lustre 1.8.6
- Labels: None
- Environment: CentOS 5.3
- Severity: 3
- Rank (Obsolete): 24,450
- Bugzilla ID: 6583
Description
At a key customer site we were, and still are, experiencing MDS thread hangs. They were initially seen under 1.8.4, and when the MDS dumped the threads the only way to recover was to reboot the MDS. The site upgraded to 1.8.6, which includes an at_min patch from bug 23352 that was thought might help the issue. However, they are still seeing the thread hangs; they can now usually get out of them without an MDS reboot, but it remains a serious problem.
The trace looks like:
Call Trace:
[<ffffffff888e8c10>] ldlm_expired_completion_wait+0x0/0x250 [ptlrpc]
[<ffffffff888ea762>] ldlm_completion_ast+0x4c2/0x880 [ptlrpc]
[<ffffffff888cf709>] ldlm_lock_enqueue+0x9d9/0xb20 [ptlrpc]
[<ffffffff8008d07b>] default_wake_function+0x0/0xe
[<ffffffff888cab6a>] ldlm_lock_addref_internal_nolock+0x3a/0x90 [ptlrpc]
[<ffffffff888e92cb>] ldlm_cli_enqueue_local+0x46b/0x520 [ptlrpc]
[<ffffffff88becd7a>] enqueue_ordered_locks+0x26a/0x4d0 [mds]
[<ffffffff888e6bc0>] ldlm_blocking_ast+0x0/0x2a0 [ptlrpc]
[<ffffffff888ea2a0>] ldlm_completion_ast+0x0/0x880 [ptlrpc]
[<ffffffff88bed5c1>] mds_get_parent_child_locked+0x5e1/0x8a0 [mds]
[<ffffffff88c0f394>] mds_open+0xc44/0x35f8 [mds]
[<ffffffff8899c2b6>] kiblnd_post_tx_locked+0x566/0x730 [ko2iblnd]
[<ffffffff888e6d5e>] ldlm_blocking_ast+0x19e/0x2a0 [ptlrpc]
[<ffffffff887dcb38>] upcall_cache_get_entry+0x958/0xa50 [lvfs]
[<ffffffff888eb9b8>] ldlm_handle_bl_callback+0x1c8/0x230 [ptlrpc]
[<ffffffff88be7f49>] mds_reint_rec+0x1d9/0x2b0 [mds]
[<ffffffff88c13c32>] mds_open_unpack+0x312/0x430 [mds]
[<ffffffff88bdae7a>] mds_reint+0x35a/0x420 [mds]
[<ffffffff88bd9d8a>] fixup_handle_for_resent_req+0x5a/0x2c0 [mds]
[<ffffffff88be4bfc>] mds_intent_policy+0x4ac/0xc80 [mds]
[<ffffffff888d18b6>] ldlm_resource_putref+0x1b6/0x3c0 [ptlrpc]
[<ffffffff888ceeb6>] ldlm_lock_enqueue+0x186/0xb20 [ptlrpc]
[<ffffffff888cb7fd>] ldlm_lock_create+0x9bd/0x9f0 [ptlrpc]
[<ffffffff888f3720>] ldlm_server_blocking_ast+0x0/0x83d [ptlrpc]
[<ffffffff888f0849>] ldlm_handle_enqueue+0xbf9/0x1210 [ptlrpc]
[<ffffffff88be3b20>] mds_handle+0x4130/0x4d60 [mds]
[<ffffffff887ffbe5>] lnet_match_blocked_msg+0x375/0x390 [lnet]
[<ffffffff88914705>] lustre_msg_get_conn_cnt+0x35/0xf0 [ptlrpc]
[<ffffffff8006e244>] do_gettimeoffset_tsc+0x19/0x3c
[<ffffffff8891bc37>] ptlrpc_server_handle_request+0xaa7/0x1150 [ptlrpc]
[<ffffffff8008ca80>] __activate_task+0x56/0x6d
[<ffffffff8008c865>] dequeue_task+0x18/0x37
[<ffffffff80062ff8>] thread_return+0x62/0xfe
[<ffffffff8003da91>] lock_timer_base+0x1b/0x3c
[<ffffffff8001cb46>] __mod_timer+0x100/0x10f
[<ffffffff8891f698>] ptlrpc_main+0x1258/0x1420 [ptlrpc]
[<ffffffff8008d07b>] default_wake_function+0x0/0xe
[<ffffffff800b7a9c>] audit_syscall_exit+0x336/0x362
[<ffffffff8005dfb1>] child_rip+0xa/0x11
[<ffffffff8891e440>] ptlrpc_main+0x0/0x1420 [ptlrpc]
[<ffffffff8005dfa7>] child_rip+0x0/0x11
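For context, a minimal sketch of the kind of checks run on the MDS around a hang like this, assuming a stock 1.8.x server where the adaptive-timeout tunables are exposed under /proc/sys/lustre; the paths and the example value are illustrative, not taken from the customer system:

# Inspect the adaptive-timeout floor that the bug 23352 at_min patch relates to
cat /proc/sys/lustre/at_min /proc/sys/lustre/at_max

# Example value only: raise at_min to give slow lock-callback round-trips more headroom
echo 40 > /proc/sys/lustre/at_min

# While the service threads are stuck, capture state for analysis
lctl dk /tmp/lustre-debug.log    # dump the Lustre kernel debug buffer to a file
echo t > /proc/sysrq-trigger     # dump all task stacks to the console/dmesg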
Issue Links
- Trackbacks:
  - Lustre 1.8.x known issues tracker: While testing against Lustre b18 branch, we would hit known bugs which were already reported in Lustre Bugzilla https://bugzilla.lustre.org/. In order to move away from relying on Bugzilla, we would create a JIRA
- duplicate of: LU-1269