[LU-1578] ll_ost_io_xx thread hangs and OST status get into down Created: 28/Jun/12 Updated: 31/Aug/12 Resolved: 31/Aug/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 1.8.7 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Shuichi Ihara (Inactive) | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Environment: |
RHEL5.7 |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 6370 |
| Description |
|
At our customer's site, some ll_ost_io_xx threads hang and the OSTs' status goes to "DOWN" in "lctl dl".
Jun 28 16:22:18 nos011i kernel: Pid: 32708, comm: ll_ost_io_188
Jun 28 16:22:18 nos011i kernel:
Jun 28 16:22:18 nos011i kernel: Call Trace:
Jun 28 16:22:18 nos011i kernel: [<ffffffff888a914d>] ldlm_cli_enqueue_local+0x4fd/0x520 [ptlrpc]
Jun 28 16:22:18 nos011i kernel: [<ffffffff800645e3>] __down_write_nested+0x7a/0x92
Jun 28 16:22:18 nos011i kernel: [<ffffffff88bd7b29>] filter_destroy+0x969/0x1f90 [obdfilter]
Jun 28 16:22:18 nos011i kernel: [<ffffffff8876ecfd>] libcfs_debug_vmsg2+0x70d/0x970 [libcfs]
Jun 28 16:22:18 nos011i kernel: [<ffffffff888d2cd2>] lustre_pack_reply_flags+0x8e2/0x950 [ptlrpc]
Jun 28 16:22:18 nos011i kernel: [<ffffffff888d2d69>] lustre_pack_reply+0x29/0xb0 [ptlrpc]
Jun 28 16:22:18 nos011i kernel: [<ffffffff88b89070>] ost_destroy+0x660/0x790 [ost]
Jun 28 16:22:18 nos011i kernel: [<ffffffff888ceef5>] lustre_msg_get_opc+0x35/0xf0 [ptlrpc]
Jun 28 16:22:18 nos011i kernel: [<ffffffff88b92a46>] ost_handle+0x1556/0x55b0 [ost]
Jun 28 16:22:18 nos011i kernel: [<ffffffff888de6d9>] ptlrpc_server_handle_request+0x989/0xe00 [ptlrpc]
Jun 28 16:22:18 nos011i kernel: [<ffffffff8008cc1e>] __wake_up_common+0x3e/0x68
Jun 28 16:22:18 nos011i kernel: [<ffffffff888dfdc6>] ptlrpc_main+0xf66/0x1120 [ptlrpc]
Jun 28 16:22:18 nos011i kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Jun 28 16:22:18 nos011i kernel: [<ffffffff888dee60>] ptlrpc_main+0x0/0x1120 [ptlrpc]
Jun 28 16:22:18 nos011i kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Jun 28 16:22:18 nos011i kernel:
Jun 28 16:22:18 nos011i kernel: Lustre: Service thread pid 8870 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Jun 28 16:22:18 nos011i kernel: LustreError: dumping log to /tmp/lustre-log.1340868138.32708
Jun 28 16:22:18 nos011i kernel: Pid: 8870, comm: ll_ost_io_37
Jun 28 16:22:20 nos011i kernel:
Jun 28 16:22:20 nos011i kernel: Call Trace:
Jun 28 16:22:21 nos011i kernel: [<ffffffff800645e3>] __down_write_nested+0x7a/0x92
Jun 28 16:22:21 nos011i kernel: [<ffffffff88bd7b29>] filter_destroy+0x969/0x1f90 [obdfilter]
Jun 28 16:22:21 nos011i kernel: [<ffffffff8876ecfd>] libcfs_debug_vmsg2+0x70d/0x970 [libcfs]
Jun 28 16:22:21 nos011i kernel: [<ffffffff888d2cd2>] lustre_pack_reply_flags+0x8e2/0x950 [ptlrpc]
Jun 28 16:22:21 nos011i kernel: [<ffffffff888d2d69>] lustre_pack_reply+0x29/0xb0 [ptlrpc]
Jun 28 16:22:21 nos011i kernel: [<ffffffff88b89070>] ost_destroy+0x660/0x790 [ost]
Jun 28 16:22:21 nos011i kernel: [<ffffffff888ceef5>] lustre_msg_get_opc+0x35/0xf0 [ptlrpc]
Jun 28 16:22:21 nos011i kernel: [<ffffffff88b92a46>] ost_handle+0x1556/0x55b0 [ost]
Jun 28 16:22:21 nos011i kernel: [<ffffffff888de6d9>] ptlrpc_server_handle_request+0x989/0xe00 [ptlrpc]
Jun 28 16:22:21 nos011i kernel: [<ffffffff8008cc1e>] __wake_up_common+0x3e/0x68
Jun 28 16:22:21 nos011i kernel: [<ffffffff888dfdc6>] ptlrpc_main+0xf66/0x1120 [ptlrpc]
Jun 28 16:22:21 nos011i kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Jun 28 16:22:21 nos011i kernel: [<ffffffff888dee60>] ptlrpc_main+0x0/0x1120 [ptlrpc]
Jun 28 16:22:21 nos011i kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Jun 28 16:22:21 nos011i kernel: |
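For reference, a minimal sketch of how the state above can be confirmed and the auto-dumped debug log collected on the affected node. "lctl dl" is the command mentioned in the description; "lctl debug_file" converts the binary dump named in the console message to text; the output filename below is only an example.

    # List obd devices and their state; a device stuck in this condition
    # typically shows a status other than "UP" in the status column.
    lctl dl

    # Convert the binary Lustre debug dump referenced in the console message
    # ("dumping log to /tmp/lustre-log.1340868138.32708") into readable text.
    lctl debug_file /tmp/lustre-log.1340868138.32708 /tmp/lustre-log.1340868138.32708.txt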
| Comments |
| Comment by Shuichi Ihara (Inactive) [ 28/Jun/12 ] |
|
Log files from two OSSs are attached; we saw this situation on both of these OSSs. |
| Comment by Cliff White (Inactive) [ 29/Jun/12 ] |
|
Are you sure the disks were healthy during this time? |
| Comment by Bob Glossman (Inactive) [ 29/Jun/12 ] |
|
Could you try upgrading to 1.8.8-wc1? There have been a number of fixes since 1.8.7 that may address your problem. |
| Comment by Bob Glossman (Inactive) [ 29/Jun/12 ] |
|
By the way, where did the 1.8.7 release you are running come from, Whamcloud directly or some other supplier? Can you upgrade to a Whamcloud release or is that path blocked for you? |
| Comment by Shuichi Ihara (Inactive) [ 29/Jun/12 ] |
|
Which LU-XX affects this issue? We really need detailed information on whether this ticket is a duplicate or is already fixed in 1.8.8. |
| Comment by Bob Glossman (Inactive) [ 29/Jun/12 ] |
|
Currently suspect |
| Comment by Shuichi Ihara (Inactive) [ 29/Jun/12 ] |
|
We are monitoring all storage devices (not only health checks, but also latency, etc.), and there is not much difference from normal. |
| Comment by Shuichi Ihara (Inactive) [ 03/Jul/12 ] |
|
Bob, what's the progress on this? |
| Comment by Bob Glossman (Inactive) [ 03/Jul/12 ] |
|
There hasn't been much progress. Attempts to reproduce haven't succeeded. We don't yet have a good theory about how this might happen. Have more instances happened? Were you able to gather the stack dump of all threads that Cliff asked for on 6/29? |
| Comment by Shuichi Ihara (Inactive) [ 05/Jul/12 ] |
|
Today we had the same problem on a few OSSs, and on some clients some OSTs went into a bad status because connections were refused with "still busy with XX active RPCs". Which RPCs are waiting, and where? This happens very often here and is critical. Please investigate deeply. |
| Comment by Shuichi Ihara (Inactive) [ 10/Jul/12 ] |
|
Bob, any progress on this? The customer is waiting for a response. Thanks! |
| Comment by Niu Yawei (Inactive) [ 11/Jul/12 ] |
|
Hi Ihara, is it possible to get the full stack trace of the OSS when the "ll_ost_io_xx thread hangs" happens? That will make it easier for us to figure out whether there is a deadlock. Thanks. BTW: I found there is an LBUG in the messages; it looks like |
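As a side note, one common way to capture the stacks of all threads on the OSS (assuming the magic-sysrq facility is available on this RHEL5 kernel) is the sysrq "t" trigger; the output filename below is only an example.

    # Enable magic sysrq, then dump the stack of every task to the kernel log.
    echo 1 > /proc/sys/kernel/sysrq
    echo t > /proc/sysrq-trigger

    # Save the result so it can be uploaded with the ticket.
    dmesg > /tmp/oss-full-stacks.txt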
| Comment by Shuichi Ihara (Inactive) [ 11/Jul/12 ] |
|
Hi Niu, The patch for
Let me check whether we can get a stack trace when the ll_ost_io threads hang. |
| Comment by Shuichi Ihara (Inactive) [ 11/Jul/12 ] |
|
Niu, we got a similar situation on another client, but this time ll_mdt_xx threads were hanging and the MDT was in the down status (we got "still busy with XX active RPCs"). Should I file it as well? In that situation, we could get stack traces when the problem happened. |
| Comment by Niu Yawei (Inactive) [ 11/Jul/12 ] |
|
Yes, please get the full stack trace. I think it should be a different problem; let's check the stack trace first. Thanks. |
| Comment by Shuichi Ihara (Inactive) [ 11/Jul/12 ] |
|
I uploaded nmd01.stack to uploads/, thank you! |
| Comment by Niu Yawei (Inactive) [ 12/Jul/12 ] |
|
I think it's

ll_mdt_34 D ffff810021004420 0 7589 1 7590 7588 (L-TLB)
ffff8105fc56f720 0000000000000046 0000000000000000 0000000000000286
ffff8105fc56f6e0 0000000000000009 ffff8105f72a07e0 ffffffff80316b60
00023eb511a49dac 000000000000109f ffff8105f72a09c8 000000008003bbca
Call Trace:
[<ffffffff88767c26>] :libcfs:lbug_with_loc+0xc6/0xd0
[<ffffffff8876ffc0>] :libcfs:tracefile_init+0x0/0x110
[<ffffffff889c02ef>] :lquota:quota_pending_commit+0x41f/0x5b0
[<ffffffff88aff5d1>] :jbd2:jbd2_journal_stop+0x209/0x215
[<ffffffff88b40317>] :ldiskfs:__ldiskfs_journal_stop+0x67/0x80
[<ffffffff88c24d8b>] :mds:mds_open+0x30eb/0x3758
[<ffffffff888d1491>] :ptlrpc:lustre_swab_buf+0x81/0x170
[<ffffffff8002cc91>] mntput_no_expire+0x19/0x89
[<ffffffff8000d5a8>] dput+0x2c/0x113
[<ffffffff88bfb0d5>] :mds:mds_reint_rec+0x365/0x550
[<ffffffff88c259be>] :mds:mds_update_unpack+0x1fe/0x280
[<ffffffff88bededa>] :mds:mds_reint+0x35a/0x420
[<ffffffff88becdea>] :mds:fixup_handle_for_resent_req+0x5a/0x2c0
[<ffffffff88bf7c1c>] :mds:mds_intent_policy+0x4ac/0xc20
[<ffffffff88892270>] :ptlrpc:ldlm_resource_putref_internal+0x230/0x460
[<ffffffff8888feb6>] :ptlrpc:ldlm_lock_enqueue+0x186/0xb20
[<ffffffff8888c7fd>] :ptlrpc:ldlm_lock_create+0x9bd/0x9f0
[<ffffffff888b4870>] :ptlrpc:ldlm_server_blocking_ast+0x0/0x83d
[<ffffffff888b1b39>] :ptlrpc:ldlm_handle_enqueue+0xc09/0x1210
[<ffffffff88bf6b40>] :mds:mds_handle+0x40e0/0x4d10
[<ffffffff8008e1a4>] enqueue_task+0x41/0x56
[<ffffffff8008e20f>] __activate_task+0x56/0x6d
[<ffffffff888d5d55>] :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
[<ffffffff888df6d9>] :ptlrpc:ptlrpc_server_handle_request+0x989/0xe00 |
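For anyone reading the uploaded dump, the trace above can be pulled out of nmd01.stack with a plain grep; the context length of 30 lines is only an example.

    # Print the ll_mdt_34 header plus the following 30 lines of its call trace.
    grep -A 30 'll_mdt_34' nmd01.stack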
| Comment by Shuichi Ihara (Inactive) [ 12/Jul/12 ] |
|
OK, thanks. |
| Comment by Shuichi Ihara (Inactive) [ 13/Jul/12 ] |
|
Niu, when we evicted the client manually, it reconnected fine. |
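A rough sketch of how such a manual eviction can be done from the server side, assuming the standard evict_client proc interface on a 1.8 server; the service name and client NID below are placeholders, not values from this site.

    # On the server, drop all exports from the client's NID so it can
    # reconnect cleanly (placeholders: lustre-OST0000 and 192.168.1.10@tcp).
    echo "nid:192.168.1.10@tcp" > /proc/fs/lustre/obdfilter/lustre-OST0000/evict_client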
| Comment by Shuichi Ihara (Inactive) [ 19/Jul/12 ] |
|
Hello Niu, I just uploaded today's MDS syslog to uploads/ |
| Comment by Niu Yawei (Inactive) [ 19/Jul/12 ] |
|
Yes, I checked the log; it seems to be the same problem ( |