LU-500: MDS threads hang in ldlm_expired_completion_wait

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 1.8.6
    • Components: None
    • Environment: CentOS 5.3
    • Severity: 3
    • Bugzilla ID: 24450
    • 6583

    Description

      At a key customer site we have been, and still are, experiencing MDS thread hangs. They were initially seen under 1.8.4, and when the MDS dumped the threads the only way to recover was to reboot the MDS. The site upgraded to 1.8.6, which includes the at_min patch from bug 23352 that it was thought might help the issue. However, they are still seeing the thread hangs; they can now usually get out of them without an MDS reboot, but it is still a serious problem.
      The trace looks like:

      Call Trace:
      [<ffffffff888e8c10>] ldlm_expired_completion_wait+0x0/0x250 [ptlrpc]
      [<ffffffff888ea762>] ldlm_completion_ast+0x4c2/0x880 [ptlrpc]
      [<ffffffff888cf709>] ldlm_lock_enqueue+0x9d9/0xb20 [ptlrpc]
      [<ffffffff8008d07b>] default_wake_function+0x0/0xe
      [<ffffffff888cab6a>] ldlm_lock_addref_internal_nolock+0x3a/0x90 [ptlrpc]
      [<ffffffff888e92cb>] ldlm_cli_enqueue_local+0x46b/0x520 [ptlrpc]
      [<ffffffff88becd7a>] enqueue_ordered_locks+0x26a/0x4d0 [mds]
      [<ffffffff888e6bc0>] ldlm_blocking_ast+0x0/0x2a0 [ptlrpc]
      [<ffffffff888ea2a0>] ldlm_completion_ast+0x0/0x880 [ptlrpc]
      [<ffffffff88bed5c1>] mds_get_parent_child_locked+0x5e1/0x8a0 [mds]
      [<ffffffff88c0f394>] mds_open+0xc44/0x35f8 [mds]
      [<ffffffff8899c2b6>] kiblnd_post_tx_locked+0x566/0x730 [ko2iblnd]
      [<ffffffff888e6d5e>] ldlm_blocking_ast+0x19e/0x2a0 [ptlrpc]
      [<ffffffff887dcb38>] upcall_cache_get_entry+0x958/0xa50 [lvfs]
      [<ffffffff888eb9b8>] ldlm_handle_bl_callback+0x1c8/0x230 [ptlrpc]
      [<ffffffff88be7f49>] mds_reint_rec+0x1d9/0x2b0 [mds]
      [<ffffffff88c13c32>] mds_open_unpack+0x312/0x430 [mds]
      [<ffffffff88bdae7a>] mds_reint+0x35a/0x420 [mds]
      [<ffffffff88bd9d8a>] fixup_handle_for_resent_req+0x5a/0x2c0 [mds]
      [<ffffffff88be4bfc>] mds_intent_policy+0x4ac/0xc80 [mds]
      [<ffffffff888d18b6>] ldlm_resource_putref+0x1b6/0x3c0 [ptlrpc]
      [<ffffffff888ceeb6>] ldlm_lock_enqueue+0x186/0xb20 [ptlrpc]
      [<ffffffff888cb7fd>] ldlm_lock_create+0x9bd/0x9f0 [ptlrpc]
      [<ffffffff888f3720>] ldlm_server_blocking_ast+0x0/0x83d [ptlrpc]
      [<ffffffff888f0849>] ldlm_handle_enqueue+0xbf9/0x1210 [ptlrpc]
      [<ffffffff88be3b20>] mds_handle+0x4130/0x4d60 [mds]
      [<ffffffff887ffbe5>] lnet_match_blocked_msg+0x375/0x390 [lnet]
      [<ffffffff88914705>] lustre_msg_get_conn_cnt+0x35/0xf0 [ptlrpc]
      [<ffffffff8006e244>] do_gettimeoffset_tsc+0x19/0x3c
      [<ffffffff8891bc37>] ptlrpc_server_handle_request+0xaa7/0x1150 [ptlrpc]
      [<ffffffff8008ca80>] __activate_task+0x56/0x6d
      [<ffffffff8008c865>] dequeue_task+0x18/0x37
      [<ffffffff80062ff8>] thread_return+0x62/0xfe
      [<ffffffff8003da91>] lock_timer_base+0x1b/0x3c
      [<ffffffff8001cb46>] __mod_timer+0x100/0x10f
      [<ffffffff8891f698>] ptlrpc_main+0x1258/0x1420 [ptlrpc]
      [<ffffffff8008d07b>] default_wake_function+0x0/0xe
      [<ffffffff800b7a9c>] audit_syscall_exit+0x336/0x362
      [<ffffffff8005dfb1>] child_rip+0xa/0x11
      [<ffffffff8891e440>] ptlrpc_main+0x0/0x1420 [ptlrpc]
      [<ffffffff8005dfa7>] child_rip+0x0/0x11
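
      Reading the trace bottom-up, the service thread is in mds_handle() -> mds_reint() -> mds_open() -> mds_get_parent_child_locked() -> enqueue_ordered_locks() -> ldlm_cli_enqueue_local(), sleeping in ldlm_completion_ast() until the lock is granted; whenever the wait times out, ldlm_expired_completion_wait() fires and the thread goes back to sleep. A minimal userspace sketch of that wait-loop shape, with illustrative names and timeouts that are not taken from the Lustre sources:

      #include <pthread.h>
      #include <stdbool.h>
      #include <stdio.h>
      #include <time.h>
      #include <unistd.h>

      static pthread_mutex_t lock_mtx = PTHREAD_MUTEX_INITIALIZER;
      static pthread_cond_t  grant_cv = PTHREAD_COND_INITIALIZER;
      static bool granted = false;

      /* Analogue of ldlm_expired_completion_wait(): report, keep waiting. */
      static void expired_completion_wait(void)
      {
          fprintf(stderr, "completion wait expired; thread keeps waiting\n");
      }

      /* Analogue of ldlm_completion_ast(): sleep until the lock is granted. */
      static void completion_ast(int timeout_sec)
      {
          pthread_mutex_lock(&lock_mtx);
          while (!granted) {
              struct timespec deadline;
              clock_gettime(CLOCK_REALTIME, &deadline);
              deadline.tv_sec += timeout_sec;
              /* Timing out does not break the loop, it only logs. */
              if (pthread_cond_timedwait(&grant_cv, &lock_mtx, &deadline) != 0)
                  expired_completion_wait();
          }
          pthread_mutex_unlock(&lock_mtx);
      }

      /* Grants the lock after a delay; a dead client would never run this. */
      static void *granting_thread(void *unused)
      {
          (void)unused;
          sleep(3);
          pthread_mutex_lock(&lock_mtx);
          granted = true;
          pthread_cond_signal(&grant_cv);
          pthread_mutex_unlock(&lock_mtx);
          return NULL;
      }

      int main(void)
      {
          pthread_t t;
          pthread_create(&t, NULL, granting_thread, NULL);
          completion_ast(1);  /* expires a couple of times before the grant */
          puts("lock granted");
          pthread_join(t, NULL);
          return 0;
      }

      If the grant never arrives (for instance because the client holding the conflicting lock is dead), the loop never exits, which matches the observation that the thread never recovers.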

    Attachments

    Issue Links

    Activity

            simmonsja James A Simmons added a comment -

            LU-1269 has those patches ported to the Lustre 1.8-wc branch. If those patches are the solution, then this ticket can be marked as a duplicate of LU-1269. What do you say, Cory?

            spitzcor Cory Spitz added a comment -

            > Which patches from bz 24450?
            The ones that Oracle has landed, namely attachments 33099, 33106, 33137, and 33144.

            simmonsja James A Simmons added a comment -

            Which patches from bz 24450?

            spitzcor Cory Spitz added a comment -

            On 2/Sep/11 I mentioned LU-146. Although that could have been a contributor, the serialization of ptlrpc sets of size PARALLEL_AST_LIMIT causes these threads to trigger the watchdog. I think we need the patches posted to bz 24450 pulled into WC's b1_8 in order to close this ticket.

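            For context on the serialization described above: blocking ASTs go out in ptlrpc sets of at most PARALLEL_AST_LIMIT requests, and the sender waits for each set to complete before starting the next, so one unresponsive client pins its whole set until the timeout. A back-of-the-envelope model of that effect (the constants and latencies below are made up for illustration, not taken from the Lustre sources):

            #include <stdio.h>

            #define PARALLEL_AST_LIMIT 1024  /* illustrative cap per set */
            #define AST_TIMEOUT        100   /* secs before giving up on a reply */

            int main(void)
            {
                int nlocks = 4096;   /* locks needing a blocking AST */
                int dead   = 1500;   /* index of one unresponsive client */
                int total  = 0;

                for (int base = 0; base < nlocks; base += PARALLEL_AST_LIMIT) {
                    int end = base + PARALLEL_AST_LIMIT;
                    if (end > nlocks)
                        end = nlocks;
                    /* A set completes only when its slowest reply arrives. */
                    int set_secs = 1;  /* healthy clients answer in ~1s */
                    if (dead >= base && dead < end)
                        set_secs = AST_TIMEOUT;  /* the whole set waits */
                    printf("set [%4d..%4d): %3ds\n", base, end, set_secs);
                    total += set_secs;
                }
                printf("total: %ds for %d ASTs\n", total, nlocks);
                return 0;
            }

            In this model three of the four sets finish in about a second each, while the set containing the dead client stalls for the full timeout, which is long enough for the service-thread watchdog to fire.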

            simmonsja James A Simmons added a comment -

            Can this bug be closed now?

            spitzcor Cory Spitz added a comment -

            I believe that the 1.8.6 instances of this bug are caused by incorrect lock ordering introduced by the patch from bug 24437 and are being pursued under LU-146.

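            For readers following along, "incorrect lock ordering" hangs threads in the classic AB-BA way: two threads take the same pair of locks in opposite orders, and each ends up holding one lock while waiting forever for the other. A generic, self-contained illustration (deliberately not the bug-24437 code):

            #include <pthread.h>
            #include <stdio.h>
            #include <unistd.h>

            static pthread_mutex_t parent_lk = PTHREAD_MUTEX_INITIALIZER;
            static pthread_mutex_t child_lk  = PTHREAD_MUTEX_INITIALIZER;

            static void *thread_a(void *unused)  /* agreed order: parent, child */
            {
                (void)unused;
                pthread_mutex_lock(&parent_lk);
                usleep(1000);                    /* widen the race window */
                pthread_mutex_lock(&child_lk);
                puts("A got both locks");
                pthread_mutex_unlock(&child_lk);
                pthread_mutex_unlock(&parent_lk);
                return NULL;
            }

            static void *thread_b(void *unused)  /* bad order: child, parent */
            {
                (void)unused;
                pthread_mutex_lock(&child_lk);
                usleep(1000);
                pthread_mutex_lock(&parent_lk);
                puts("B got both locks");
                pthread_mutex_unlock(&parent_lk);
                pthread_mutex_unlock(&child_lk);
                return NULL;
            }

            int main(void)
            {
                pthread_t a, b;
                pthread_create(&a, NULL, thread_a, NULL);
                pthread_create(&b, NULL, thread_b, NULL);
                pthread_join(a, NULL);  /* almost certainly never returns */
                pthread_join(b, NULL);
                return 0;
            }

            The usual cure is a single global acquisition order, which is what a helper like enqueue_ordered_locks() in the trace above exists to provide on the MDS.
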
            green Oleg Drokin added a comment -

            Lukasz, can you share the exact backtrace? Also, how long did you wait for the situation to resolve?
            Usually such things happen due to packet loss (or a node dying at a bad time).

            Additionally, if you are having constant recovery problems, can you please file a separate ticket for it?

            Thanks.

            lflis Lukasz Flis added a comment -

            It seems that the issue is common to 1.8.6 and 2.1.

            I found our MDS (2.1) hanging this morning with plenty of the same errors that Steven reported.
            Our environment consists of 1.8.6-wc1 clients and 2.1 servers.

            In order to recover from it we remounted the MDT resource and recovery began. Unfortunately, we have never seen a
            successful MDS recovery in 2.1. This time the system was able to recover 460/890 clients and then got stuck (transaction numbers kept increasing until the timeout occurred and one client got evicted; this loop could take a long time, so we had to abort it).

            We can provide lustre log files and kernel stack traces if needed.

            green Oleg Drokin added a comment -

            If the thread never recovers, I am interested in seeing a backtrace dump from such a thread.
            Unfortunately the kernel log excerpt is way too small. We see a few backtraces from blocked threads, then a client is evicted, and that's where the log ends. So I have no idea if the thread is doing anything else, reporting anything else, and so on.

            I am not sure how rpctrace is going to help, though.

            woods Steven Woods added a comment -

            AFAIK, initially, before the patch from 23352, things never cleared up until a reboot. After that they can keep running, but I do not believe the thread ever recovers.

            spitzcor Cory Spitz added a comment -

            Oleg, Steve can answer questions about the most recent dumps. We've seen similar behavior on internal Cray systems. In those cases an unresponsive client does eventually get evicted and the MDS recovers, but the service threads that are waiting for the blocking callbacks to complete never seem to unwedge. Also, in these cases the client initially fails for something completely unrelated, e.g. a kernel NULL dereference. I'm not sure these are the same symptoms witnessed in the Spider dumps.

            People

              Assignee: green Oleg Drokin
              Reporter: woods Steven Woods
              Votes: 0
              Watchers: 12
