[LU-1395] MDS hangs after calltrace at ldlm_expired_completion_wait() Created: 10/May/12  Updated: 29/May/17  Resolved: 29/May/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.6
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Shuichi Ihara (Inactive) Assignee: Oleg Drokin
Resolution: Cannot Reproduce Votes: 0
Labels: None

Attachments: File mds_hang.tar.gz    
Severity: 3
Rank (Obsolete): 10343

 Description   

We saw the following call trace on the MDS, and the MDS hung afterwards.

Apr 23 15:58:34 ALPL505 kernel: Call Trace:
Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff88953a00>] ldlm_expired_completion_wait+0x0/0x250 [ptlrpc]
Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff88955542>] ldlm_completion_ast+0x4c2/0x880 [ptlrpc]
Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff8893a709>] ldlm_lock_enqueue+0x9d9/0xb20 [ptlrpc]
Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff8008e421>] default_wake_function+0x0/0xe
Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff88935b6a>] ldlm_lock_addref_internal_nolock+0x3a/0x90 [ptlrpc]
Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff889540bb>] ldlm_cli_enqueue_local+0x46b/0x520 [ptlrpc]
Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff88caa157>] enqueue_ordered_locks+0x387/0x4d0 [mds]
Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff889519a0>] ldlm_blocking_ast+0x0/0x2a0 [ptlrpc]
Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff88955080>] ldlm_completion_ast+0x0/0x880 [ptlrpc]
Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff88caa8e9>] mds_get_parent_child_locked+0x649/0x960 [mds]
Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff88c9b652>] mds_getattr_lock+0x632/0xc90 [mds]
Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff88c96dda>] fixup_handle_for_resent_req+0x5a/0x2c0 [mds]
Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff88ca1d83>] mds_intent_policy+0x623/0xc20 [mds]
Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff8893c270>] ldlm_resource_putref_internal+0x230/0x460 [ptlrpc]
Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff88939eb6>] ldlm_lock_enqueue+0x186/0xb20 [ptlrpc]
Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff889367fd>] ldlm_lock_create+0x9bd/0x9f0 [ptlrpc]
Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff8895e870>] ldlm_server_blocking_ast+0x0/0x83d [ptlrpc]
Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff8895bb39>] ldlm_handle_enqueue+0xc09/0x1210 [ptlrpc]
Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff88ca0b30>] mds_handle+0x40e0/0x4d10 [mds]
Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff800774ed>] smp_send_reschedule+0x4e/0x53
Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff8008ddcd>] enqueue_task+0x41/0x56
Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff8897fd55>] lustre_msg_get_conn_cnt+0x35/0xf0 [ptlrpc]
Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff889896d9>] ptlrpc_server_handle_request+0x989/0xe00 [ptlrpc]
Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff88989e35>] ptlrpc_wait_event+0x2e5/0x310 [ptlrpc]
Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff8008c85d>] __wake_up_common+0x3e/0x68
Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff8898adc6>] ptlrpc_main+0xf66/0x1120 [ptlrpc]
Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff88989e60>] ptlrpc_main+0x0/0x1120 [ptlrpc]
Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11

This might be related to LU-59, but please review it.



 Comments   
Comment by Peter Jones [ 10/May/12 ]

Oleg will look into this one

Comment by Shuichi Ihara (Inactive) [ 24/Jul/12 ]

Hi Peter, Oleg,
could you please take a look at this soon? We have seen similar problems at a couple of sites.

Comment by Oleg Drokin [ 24/Jul/12 ]

This trace is just a sign of a client not responding to a lock cancel request. It should be followed by that client being evicted.
We need the client log to see what was happening there, I guess.
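
A minimal sketch, assuming syslog lines in the same format as the trace in the description, of how the MDS side could be checked for this lock completion timeout path before matching timestamps against client logs. The helper names (parse_frames, is_lock_wait_timeout) are hypothetical and are not Lustre tools; the sample lines are copied from the description.

```python
import re
from collections import Counter

# One stack frame in the syslog format shown in the description, e.g.
#   "... kernel:  [<ffffffff88953a00>] ldlm_expired_completion_wait+0x0/0x250 [ptlrpc]"
# The trailing "[module]" is absent for core kernel symbols.
FRAME_RE = re.compile(
    r"kernel:\s+\[<(?P<addr>[0-9a-f]+)>\]\s+"
    r"(?P<func>[\w.]+)\+0x(?P<off>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def parse_frames(lines):
    """Return (function, offset, module) for every line that is a stack frame."""
    frames = []
    for line in lines:
        m = FRAME_RE.search(line)
        if m:
            module = m.group("module") or "kernel"  # no tag: core kernel symbol
            frames.append((m.group("func"), int(m.group("off"), 16), module))
    return frames


def is_lock_wait_timeout(frames):
    """True if the trace contains the function named in this ticket's title."""
    return any(func == "ldlm_expired_completion_wait" for func, _, _ in frames)


# Sample lines copied from the trace in the description.
sample = [
    "Apr 23 15:58:34 ALPL505 kernel: Call Trace:",
    "Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff88953a00>] ldlm_expired_completion_wait+0x0/0x250 [ptlrpc]",
    "Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff8008e421>] default_wake_function+0x0/0xe",
    "Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff88caa157>] enqueue_ordered_locks+0x387/0x4d0 [mds]",
]

frames = parse_frames(sample)
per_module = Counter(module for _, _, module in frames)

if __name__ == "__main__":
    print("lock wait timeout trace:", is_lock_wait_timeout(frames))
    for module, count in per_module.items():
        print(module, count)
```

Feeding the full MDS syslog through parse_frames and noting the timestamp of each matching trace gives a time window to look at in the client logs. What the client-side messages look like is not covered here.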

Comment by Kit Westneat (Inactive) [ 19/Oct/12 ]

This looks like a duplicate of LU-500 and LU-1269. I think that because LU-1269 is marked as an improvement instead of a bug, it hasn't been getting the attention it should. There appear to be several different ideas for fixing the issue. Can someone take a look at it? We have been hitting this bug regularly, most recently at IU.

Comment by Andreas Dilger [ 29/May/17 ]

Close old ticket.
