[LU-1578] ll_ost_io_xx thread hangs and OST status get into down Created: 28/Jun/12  Updated: 31/Aug/12  Resolved: 31/Aug/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.7
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Shuichi Ihara (Inactive) Assignee: Niu Yawei (Inactive)
Resolution: Duplicate Votes: 0
Labels: None
Environment:

RHEL5.7


Attachments: File messages.tar.gz    
Severity: 3
Rank (Obsolete): 6370

 Description   

At our customer's site, some of the ll_ost_io_xx threads hang and the OSTs' status shows as "DOWN" in "lctl dl".
We saw the following call traces on these OSSs during that time.

Jun 28 16:22:18 nos011i kernel: Pid: 32708, comm: ll_ost_io_188
Jun 28 16:22:18 nos011i kernel: 
Jun 28 16:22:18 nos011i kernel: Call Trace:
Jun 28 16:22:18 nos011i kernel:  [<ffffffff888a914d>] ldlm_cli_enqueue_local+0x4fd/0x520 [ptlrpc]
Jun 28 16:22:18 nos011i kernel:  [<ffffffff800645e3>] __down_write_nested+0x7a/0x92
Jun 28 16:22:18 nos011i kernel:  [<ffffffff88bd7b29>] filter_destroy+0x969/0x1f90 [obdfilter]
Jun 28 16:22:18 nos011i kernel:  [<ffffffff8876ecfd>] libcfs_debug_vmsg2+0x70d/0x970 [libcfs]
Jun 28 16:22:18 nos011i kernel:  [<ffffffff888d2cd2>] lustre_pack_reply_flags+0x8e2/0x950 [ptlrpc]
Jun 28 16:22:18 nos011i kernel:  [<ffffffff888d2d69>] lustre_pack_reply+0x29/0xb0 [ptlrpc]
Jun 28 16:22:18 nos011i kernel:  [<ffffffff88b89070>] ost_destroy+0x660/0x790 [ost]
Jun 28 16:22:18 nos011i kernel:  [<ffffffff888ceef5>] lustre_msg_get_opc+0x35/0xf0 [ptlrpc]
Jun 28 16:22:18 nos011i kernel:  [<ffffffff88b92a46>] ost_handle+0x1556/0x55b0 [ost]
Jun 28 16:22:18 nos011i kernel:  [<ffffffff888de6d9>] ptlrpc_server_handle_request+0x989/0xe00 [ptlrpc]
Jun 28 16:22:18 nos011i kernel:  [<ffffffff8008cc1e>] __wake_up_common+0x3e/0x68
Jun 28 16:22:18 nos011i kernel:  [<ffffffff888dfdc6>] ptlrpc_main+0xf66/0x1120 [ptlrpc]
Jun 28 16:22:18 nos011i kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
Jun 28 16:22:18 nos011i kernel:  [<ffffffff888dee60>] ptlrpc_main+0x0/0x1120 [ptlrpc]
Jun 28 16:22:18 nos011i kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11
Jun 28 16:22:18 nos011i kernel: 
Jun 28 16:22:18 nos011i kernel: Lustre: Service thread pid 8870 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Jun 28 16:22:18 nos011i kernel: LustreError: dumping log to /tmp/lustre-log.1340868138.32708
Jun 28 16:22:18 nos011i kernel: Pid: 8870, comm: ll_ost_io_37
Jun 28 16:22:20 nos011i kernel: 
Jun 28 16:22:20 nos011i kernel: Call Trace:
Jun 28 16:22:21 nos011i kernel:  [<ffffffff800645e3>] __down_write_nested+0x7a/0x92
Jun 28 16:22:21 nos011i kernel:  [<ffffffff88bd7b29>] filter_destroy+0x969/0x1f90 [obdfilter]
Jun 28 16:22:21 nos011i kernel:  [<ffffffff8876ecfd>] libcfs_debug_vmsg2+0x70d/0x970 [libcfs]
Jun 28 16:22:21 nos011i kernel:  [<ffffffff888d2cd2>] lustre_pack_reply_flags+0x8e2/0x950 [ptlrpc]
Jun 28 16:22:21 nos011i kernel:  [<ffffffff888d2d69>] lustre_pack_reply+0x29/0xb0 [ptlrpc]
Jun 28 16:22:21 nos011i kernel:  [<ffffffff88b89070>] ost_destroy+0x660/0x790 [ost]
Jun 28 16:22:21 nos011i kernel:  [<ffffffff888ceef5>] lustre_msg_get_opc+0x35/0xf0 [ptlrpc]
Jun 28 16:22:21 nos011i kernel:  [<ffffffff88b92a46>] ost_handle+0x1556/0x55b0 [ost]
Jun 28 16:22:21 nos011i kernel:  [<ffffffff888de6d9>] ptlrpc_server_handle_request+0x989/0xe00 [ptlrpc]
Jun 28 16:22:21 nos011i kernel:  [<ffffffff8008cc1e>] __wake_up_common+0x3e/0x68
Jun 28 16:22:21 nos011i kernel:  [<ffffffff888dfdc6>] ptlrpc_main+0xf66/0x1120 [ptlrpc]
Jun 28 16:22:21 nos011i kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
Jun 28 16:22:21 nos011i kernel:  [<ffffffff888dee60>] ptlrpc_main+0x0/0x1120 [ptlrpc]
Jun 28 16:22:21 nos011i kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11
Jun 28 16:22:21 nos011i kernel: 


 Comments   
Comment by Shuichi Ihara (Inactive) [ 28/Jun/12 ]

Log files from two OSSs are attached; we saw this situation on both of these OSSs.

Comment by Cliff White (Inactive) [ 29/Jun/12 ]

Are you sure the disks were healthy during this time?
Can you get a stack dump of all threads when you see the OST in 'down' state and attach it to the bug?
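
For reference, one common way to capture such a dump on a RHEL5-era server is a SysRq-T trigger, which makes the kernel write every task's stack trace to the kernel log so it can be collected from dmesg/syslog. Below is a minimal sketch, assuming root privileges and that sysrq is enabled via /proc/sys/kernel/sysrq; it is equivalent to running echo t > /proc/sysrq-trigger.

#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/sysrq-trigger", "w");

    if (!f) {
        perror("open /proc/sysrq-trigger (are you root?)");
        return 1;
    }
    fputc('t', f);   /* 't' asks the kernel to dump all tasks' stack traces */
    fclose(f);
    return 0;
}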

Comment by Bob Glossman (Inactive) [ 29/Jun/12 ]

Could you try upgrading to 1.8.8-wc1? There have been a number of fixes since 1.8.7 that may address your problem.

Comment by Bob Glossman (Inactive) [ 29/Jun/12 ]

By the way, where did the 1.8.7 release you are running come from, Whamcloud directly or some other supplier? Can you upgrade to a Whamcloud release or is that path blocked for you?

Comment by Shuichi Ihara (Inactive) [ 29/Jun/12 ]

Which LU-XX affects this issue? We really need detailed information on whether this ticket is a duplicate of something, or fixed, in 1.8.8.
And yes, vanilla 1.8.7-wc1 is running.

Comment by Bob Glossman (Inactive) [ 29/Jun/12 ]

We currently suspect LU-952 as being related to your issue, but we don't know for sure yet.

Comment by Shuichi Ihara (Inactive) [ 29/Jun/12 ]

We are monitoring all storage devices (not only health checks, but latency, etc.), and there is not much difference from normal.
By the way, I forgot one thing: we already applied the patch for LU-952 to 1.8.7-wc1 at this site, so LU-952 should already be fixed here.

Comment by Shuichi Ihara (Inactive) [ 03/Jul/12 ]

Bob, what is the progress on this?
There were no storage-side issues, and the patches for LU-952 were already applied on top of lustre-1.8.7-wc1.

Comment by Bob Glossman (Inactive) [ 03/Jul/12 ]

There hasn't been much progress. Attempts to reproduce haven't succeeded. We don't yet have a good theory about how this might happen.

Have more instances happened? Were you able to gather the stack dump of all threads that Cliff asked for on 6/29?

Comment by Shuichi Ihara (Inactive) [ 05/Jul/12 ]

Today we had the same problem on a few OSSs, and some OSTs showed as down on some clients because the connections were being refused with "still busy with XX active RPCs". What are those RPCs waiting for, and where?
I uploaded the clients' and OSSs' debug files and syslogs to /uploads/LU-1578.

This happens very often here and is critical. Please investigate deeply.

Comment by Shuichi Ihara (Inactive) [ 10/Jul/12 ]

Bob,

Any progress on this? The customer is waiting for a response. Thanks!

Comment by Niu Yawei (Inactive) [ 11/Jul/12 ]

Hi Ihara,

Is it possible to get the full stack trace of the OSS when the "ll_ost_io_xx thread hangs" happens? That would make it easier for us to figure out whether there is a deadlock. Thanks.

BTW: I found an LBUG in the messages; it looks like LU-1563. Did you try the fix from LU-1563?

Comment by Shuichi Ihara (Inactive) [ 11/Jul/12 ]

Hi Niu,

The patch for LU-1563 has not been applied at this site yet; we will apply these patches during the next maintenance window. So we are still seeing this problem from time to time.

Let me check whether we can get a stack trace when the ll_ost_io threads hang.

Comment by Shuichi Ihara (Inactive) [ 11/Jul/12 ]

Niu,

We got a similar situation on another client, but this time it is ll_mdt_xx hanging and the MDT is in down status (got "still busy with XX active RPCs"). Should I file it as well? In that situation, we could get stack traces when the problem happened.

Comment by Niu Yawei (Inactive) [ 11/Jul/12 ]

Yes, please get the full stack trace. I think it is probably a different problem; let's check the stack trace first. Thanks.

Comment by Shuichi Ihara (Inactive) [ 11/Jul/12 ]

Uploaded nmd01.stack to uploads/LU-1578.

thank you!

Comment by Niu Yawei (Inactive) [ 12/Jul/12 ]

I think it's LU-1563: an MDT thread hit an LBUG in quota_pending_commit() while holding an ldlm lock, and this blocked other operations (a sketch of the resulting pile-up follows the trace below).

ll_mdt_34     D ffff810021004420     0  7589      1          7590  7588 (L-TLB)
 ffff8105fc56f720 0000000000000046 0000000000000000 0000000000000286
 ffff8105fc56f6e0 0000000000000009 ffff8105f72a07e0 ffffffff80316b60
 00023eb511a49dac 000000000000109f ffff8105f72a09c8 000000008003bbca
Call Trace:
 [<ffffffff88767c26>] :libcfs:lbug_with_loc+0xc6/0xd0
 [<ffffffff8876ffc0>] :libcfs:tracefile_init+0x0/0x110
 [<ffffffff889c02ef>] :lquota:quota_pending_commit+0x41f/0x5b0
 [<ffffffff88aff5d1>] :jbd2:jbd2_journal_stop+0x209/0x215
 [<ffffffff88b40317>] :ldiskfs:__ldiskfs_journal_stop+0x67/0x80
 [<ffffffff88c24d8b>] :mds:mds_open+0x30eb/0x3758
 [<ffffffff888d1491>] :ptlrpc:lustre_swab_buf+0x81/0x170
 [<ffffffff8002cc91>] mntput_no_expire+0x19/0x89
 [<ffffffff8000d5a8>] dput+0x2c/0x113
 [<ffffffff88bfb0d5>] :mds:mds_reint_rec+0x365/0x550
 [<ffffffff88c259be>] :mds:mds_update_unpack+0x1fe/0x280
 [<ffffffff88bededa>] :mds:mds_reint+0x35a/0x420
 [<ffffffff88becdea>] :mds:fixup_handle_for_resent_req+0x5a/0x2c0
 [<ffffffff88bf7c1c>] :mds:mds_intent_policy+0x4ac/0xc20
 [<ffffffff88892270>] :ptlrpc:ldlm_resource_putref_internal+0x230/0x460
 [<ffffffff8888feb6>] :ptlrpc:ldlm_lock_enqueue+0x186/0xb20
 [<ffffffff8888c7fd>] :ptlrpc:ldlm_lock_create+0x9bd/0x9f0
 [<ffffffff888b4870>] :ptlrpc:ldlm_server_blocking_ast+0x0/0x83d
 [<ffffffff888b1b39>] :ptlrpc:ldlm_handle_enqueue+0xc09/0x1210
 [<ffffffff88bf6b40>] :mds:mds_handle+0x40e0/0x4d10
 [<ffffffff8008e1a4>] enqueue_task+0x41/0x56
 [<ffffffff8008e20f>] __activate_task+0x56/0x6d
 [<ffffffff888d5d55>] :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
 [<ffffffff888df6d9>] :ptlrpc:ptlrpc_server_handle_request+0x989/0xe00
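
The mechanism can be illustrated with a hypothetical user-space analogy (plain pthreads, not Lustre code; all names below are made up): once a thread stops forever while still holding a write lock, every later writer on the same resource blocks indefinitely, which is the same shape as the ll_ost_io_xx / ll_mdt_xx traces above stuck in __down_write_nested().

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_rwlock_t resource_lock = PTHREAD_RWLOCK_INITIALIZER;

/* Analogue of the thread that hits the LBUG in quota_pending_commit()
 * while still holding the lock: it never releases resource_lock. */
static void *faulting_thread(void *arg)
{
    pthread_rwlock_wrlock(&resource_lock);
    fprintf(stderr, "faulting thread: lock held, simulating LBUG, never unlocking\n");
    for (;;)
        pause();                            /* parked forever, lock still held */
    return NULL;
}

/* Analogue of the other service threads (ll_ost_io_xx / ll_mdt_xx)
 * that end up blocked in __down_write_nested() on the same resource. */
static void *service_thread(void *arg)
{
    sleep(1);                               /* let the faulting thread take the lock first */
    fprintf(stderr, "service thread: waiting for write lock...\n");
    pthread_rwlock_wrlock(&resource_lock);  /* never returns */
    fprintf(stderr, "service thread: acquired lock (never printed)\n");
    return NULL;
}

int main(void)
{
    pthread_t a, b;

    pthread_create(&a, NULL, faulting_thread, NULL);
    pthread_create(&b, NULL, service_thread, NULL);
    pthread_join(b, NULL);                  /* hangs, like the hung service threads */
    return 0;
}

Built with "gcc -o hang hang.c -lpthread", the program prints both messages and then hangs, mirroring the "Service thread ... was inactive for 200.00s" watchdog warnings in the OSS logs.
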
Comment by Shuichi Ihara (Inactive) [ 12/Jul/12 ]

OK, thanks.
I will post a stack trace the next time an ll_ost_io_xx thread hangs.

Comment by Shuichi Ihara (Inactive) [ 13/Jul/12 ]

Niu,

When we evicted the client manually, it reconnected fine.
But we still sometimes see the same problem on another node. When we evict, is the held lock released, or does the client just reconnect through a different thread while the lock is still held, and is that why the client can connect?
But since the root cause, the LBUG, is not fixed, we keep seeing the same issue on other clients. Is this correct?

Comment by Shuichi Ihara (Inactive) [ 19/Jul/12 ]

Hello Niu,

I just uploaded today's MDS syslog to uploads/LU-1578/messages.nmd01. We hit the same problem today, but not only on a single client; we saw down states on many clients. As far as we can see from the log files, it seems to be the same issue.
However, the question is, as I asked before: what exactly happens when the MDS hits an LBUG and ll_mdt_xx hangs?
Please advise!

Comment by Niu Yawei (Inactive) [ 19/Jul/12 ]

Yes, I checked the log, and it seems to be the same problem (LU-1563). If the MDS hits the LBUG, the MDS will not be responsive and the whole cluster will not be functional.
