[LU-500] MDS threads hang ldlm_expired_completion_wait+ - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Duplicate
Priority: Major
Fix Version/s: None
Affects Version/s: Lustre 1.8.6
Labels:
None
Environment:
CentOS 5.3

Severity:
3
Bugzilla ID:
24,450
Rank (Obsolete):
6583

Description

At a key cutomer site we were and still are experiencing MDS thread hangs. Initially they were seen under 1.8.4 and when the MDS would dump the threads the only way to recover would be to reboot the MDS. The site did upgrade to 1.8.6 which includes a at_min patch from bug 23352 which was thought that might help the issue. However they are still seeing the thread hangs but can usually now get out of it without a MDS reboot but still a serious problem.
The trace looks like:

Call Trace:
[<ffffffff888e8c10>] ldlm_expired_completion_wait+0x0/0x250 [ptlrpc]
[<ffffffff888ea762>] ldlm_completion_ast+0x4c2/0x880 [ptlrpc]
[<ffffffff888cf709>] ldlm_lock_enqueue+0x9d9/0xb20 [ptlrpc]
[<ffffffff8008d07b>] default_wake_function+0x0/0xe
[<ffffffff888cab6a>] ldlm_lock_addref_internal_nolock+0x3a/0x90 [ptlrpc]
[<ffffffff888e92cb>] ldlm_cli_enqueue_local+0x46b/0x520 [ptlrpc]
[<ffffffff88becd7a>] enqueue_ordered_locks+0x26a/0x4d0 [mds]
[<ffffffff888e6bc0>] ldlm_blocking_ast+0x0/0x2a0 [ptlrpc]
[<ffffffff888ea2a0>] ldlm_completion_ast+0x0/0x880 [ptlrpc]
[<ffffffff88bed5c1>] mds_get_parent_child_locked+0x5e1/0x8a0 [mds]
[<ffffffff88c0f394>] mds_open+0xc44/0x35f8 [mds]
[<ffffffff8899c2b6>] kiblnd_post_tx_locked+0x566/0x730 [ko2iblnd]
[<ffffffff888e6d5e>] ldlm_blocking_ast+0x19e/0x2a0 [ptlrpc]
[<ffffffff887dcb38>] upcall_cache_get_entry+0x958/0xa50 [lvfs]
[<ffffffff888eb9b8>] ldlm_handle_bl_callback+0x1c8/0x230 [ptlrpc]
[<ffffffff88be7f49>] mds_reint_rec+0x1d9/0x2b0 [mds]
[<ffffffff88c13c32>] mds_open_unpack+0x312/0x430 [mds]
[<ffffffff88bdae7a>] mds_reint+0x35a/0x420 [mds]
[<ffffffff88bd9d8a>] fixup_handle_for_resent_req+0x5a/0x2c0 [mds]
[<ffffffff88be4bfc>] mds_intent_policy+0x4ac/0xc80 [mds]
[<ffffffff888d18b6>] ldlm_resource_putref+0x1b6/0x3c0 [ptlrpc]
[<ffffffff888ceeb6>] ldlm_lock_enqueue+0x186/0xb20 [ptlrpc]
[<ffffffff888cb7fd>] ldlm_lock_create+0x9bd/0x9f0 [ptlrpc]
[<ffffffff888f3720>] ldlm_server_blocking_ast+0x0/0x83d [ptlrpc]
[<ffffffff888f0849>] ldlm_handle_enqueue+0xbf9/0x1210 [ptlrpc]
[<ffffffff88be3b20>] mds_handle+0x4130/0x4d60 [mds]
[<ffffffff887ffbe5>] lnet_match_blocked_msg+0x375/0x390 [lnet]
[<ffffffff88914705>] lustre_msg_get_conn_cnt+0x35/0xf0 [ptlrpc]
[<ffffffff8006e244>] do_gettimeoffset_tsc+0x19/0x3c
[<ffffffff8891bc37>] ptlrpc_server_handle_request+0xaa7/0x1150 [ptlrpc]
[<ffffffff8008ca80>] __activate_task+0x56/0x6d
[<ffffffff8008c865>] dequeue_task+0x18/0x37
[<ffffffff80062ff8>] thread_return+0x62/0xfe
[<ffffffff8003da91>] lock_timer_base+0x1b/0x3c
[<ffffffff8001cb46>] __mod_timer+0x100/0x10f
[<ffffffff8891f698>] ptlrpc_main+0x1258/0x1420 [ptlrpc]
[<ffffffff8008d07b>] default_wake_function+0x0/0xe
[<ffffffff800b7a9c>] audit_syscall_exit+0x336/0x362
[<ffffffff8005dfb1>] child_rip+0xa/0x11
[<ffffffff8891e440>] ptlrpc_main+0x0/0x1420 [ptlrpc]
[<ffffffff8005dfa7>] child_rip+0x0/0x11

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

lu-500.widow-mds3.20110606.kern.log
84 kB
14/Jul/11 8:14 PM
rpctrace-mds3-20110606.tgz
4.07 MB
14/Jul/11 8:14 PM

Issue Links

Trackbacks

Lustre 1.8.x known issues tracker While testing against Lustre b18 branch, we would hit known bugs which were already reported in Lustre Bugzilla https://bugzilla.lustre.org/. In order to move away from relying on Bugzilla, we would create a JIRA

Activity

People

Assignee:: Oleg Drokin

Reporter:: Steven Woods

Votes:: 0 Vote for this issue

Watchers:: 12 Start watching this issue

Dates

Created:: 12/Jul/11 5:15 PM

Updated:: 29/Mar/12 11:04 AM

Resolved:: 29/Mar/12 11:04 AM