[LU-2854] hanging ll_ost_io and refuse connections Created: 22/Feb/13  Updated: 19/Apr/14  Resolved: 19/Apr/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.8
Fix Version/s: Lustre 1.8.9

Type: Bug Priority: Minor
Reporter: Shuichi Ihara (Inactive) Assignee: Hongchao Zhang
Resolution: Fixed Votes: 0
Labels: ptr

Attachments: File oss01_messages     File oss02_messages     File oss03_messages     File oss04_messages    
Severity: 3
Rank (Obsolete): 6915

 Description   

We are seeing ll_ost_io threads hanging, stuck somewhere in the trace below, and new connections are refused until these stuck threads finish. The only way we could recover was to fail over the OSS. We captured backtraces; please have a look at them and see where the problem is.

Feb 20 09:22:46 oss01 kernel: Call Trace:
Feb 20 09:22:46 oss01 kernel:  [<ffffffff80062ff2>] thread_return+0x62/0xfe
Feb 20 09:22:46 oss01 kernel:  [<ffffffff80046c6e>] try_to_wake_up+0x472/0x484
Feb 20 09:22:46 oss01 kernel:  [<ffffffff800649e1>] __down+0x99/0xd8
Feb 20 09:22:46 oss01 kernel:  [<ffffffff8028b1d5>] __down_trylock+0x44/0x4e
Feb 20 09:22:46 oss01 kernel:  [<ffffffff800646c9>] __down_failed+0x35/0x3a
Feb 20 09:22:46 oss01 kernel:  [<ffffffff887fe406>] .text.lock.ldlm_pool+0x37/0x71 [ptlrpc]
Feb 20 09:22:46 oss01 kernel:  [<ffffffff80064614>] __down_read+0x12/0x92
Feb 20 09:22:46 oss01 kernel:  [<ffffffff8003f279>] shrink_slab+0xd0/0x153
Feb 20 09:22:46 oss01 kernel:  [<ffffffff800ce4ce>] zone_reclaim+0x235/0x2cd
Feb 20 09:22:46 oss01 kernel:  [<ffffffff8000985a>] __d_lookup+0xb0/0xff
Feb 20 09:22:46 oss01 kernel:  [<ffffffff8000a939>] get_page_from_freelist+0xbf/0x442
Feb 20 09:22:46 oss01 kernel:  [<ffffffff88b134d2>] filter_fid2dentry+0x512/0x740 [obdfilter]
Feb 20 09:22:46 oss01 kernel:  [<ffffffff8000f46f>] __alloc_pages+0x78/0x308
Feb 20 09:22:46 oss01 kernel:  [<ffffffff888a33ad>] kiblnd_launch_tx+0x16d/0x9d0 [ko2iblnd]
Feb 20 09:22:46 oss01 kernel:  [<ffffffff80025e20>] find_or_create_page+0x32/0x72
Feb 20 09:22:46 oss01 kernel:  [<ffffffff88b27445>] filter_get_page+0x35/0x70 [obdfilter]
Feb 20 09:22:46 oss01 kernel:  [<ffffffff88b27a81>] filter_preprw_read+0x601/0xd30 [obdfilter]
Feb 20 09:22:46 oss01 kernel:  [<ffffffff886ff873>] lnet_send+0x9a3/0x9d0 [lnet]
Feb 20 09:22:46 oss01 kernel:  [<ffffffff886fdaf7>] lnet_prep_send+0x67/0xb0 [lnet]
Feb 20 09:22:46 oss01 kernel:  [<ffffffff88b29f0c>] filter_preprw+0x1d5c/0x1dc0 [obdfilter]
Feb 20 09:22:46 oss01 kernel:  [<ffffffff88816c3a>] lustre_pack_reply_flags+0x86a/0x950 [ptlrpc]
Feb 20 09:22:46 oss01 kernel:  [<ffffffff8880eaa8>] ptlrpc_send_reply+0x5e8/0x600 [ptlrpc]
Feb 20 09:22:46 oss01 kernel:  [<ffffffff88acfac3>] ost_brw_read+0xb33/0x1a70 [ost]
Feb 20 09:22:46 oss01 kernel:  [<ffffffff88812ed5>] lustre_msg_get_opc+0x35/0xf0 [ptlrpc]
Feb 20 09:22:46 oss01 kernel:  [<ffffffff8008e7f9>] default_wake_function+0x0/0xe
Feb 20 09:22:46 oss01 kernel:  [<ffffffff88813088>] lustre_msg_check_version_v2+0x8/0x20 [ptlrpc]
Feb 20 09:22:46 oss01 kernel:  [<ffffffff88ad8363>] ost_handle+0x2e73/0x55b0 [ost]
Feb 20 09:22:46 oss01 kernel:  [<ffffffff80154b81>] __next_cpu+0x19/0x28
Feb 20 09:22:46 oss01 kernel:  [<ffffffff800778c5>] smp_send_reschedule+0x4e/0x53
Feb 20 09:22:46 oss01 kernel:  [<ffffffff888226b9>] ptlrpc_server_handle_request+0x989/0xe00 [ptlrpc]
Feb 20 09:22:46 oss01 kernel:  [<ffffffff88822e15>] ptlrpc_wait_event+0x2e5/0x310 [ptlrpc]
Feb 20 09:22:46 oss01 kernel:  [<ffffffff8008cc1e>] __wake_up_common+0x3e/0x68
Feb 20 09:22:46 oss01 kernel:  [<ffffffff88823da6>] ptlrpc_main+0xf66/0x1120 [ptlrpc]
Feb 20 09:22:46 oss01 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
Feb 20 09:22:46 oss01 kernel:  [<ffffffff88822e40>] ptlrpc_main+0x0/0x1120 [ptlrpc]
Feb 20 09:22:46 oss01 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11
Feb 20 09:22:46 oss01 kernel: 
Feb 20 09:22:46 oss01 kernel: LustreError: dumping log to /tmp/lustre-log.1361319766.23768
Feb 20 09:35:48 oss01 kernel: Lustre: 13106:0:(service.c:808:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-383), not sending early reply
Feb 20 09:35:48 oss01 kernel:   req@ffff8105cc619800 x1420503574224933/t0 o3->28190515-07cf-b016-422b-1b05f42dd534@NET_0x50000c0a803ae_UUID:0/0 lens 448/400 e 4 to 0 dl 1361320553 ref 2 fl Interpret:/0/0 rc 0/0
Feb 20 09:35:48 oss01 kernel: Lustre: 13106:0:(service.c:808:ptlrpc_at_send_early_reply()) Skipped 1 previous similar message
Feb 20 09:35:49 oss01 kernel: Lustre: 8237:0:(service.c:808:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-383), not sending early reply
Feb 20 09:35:49 oss01 kernel:   req@ffff81061e650400 x1420503605573149/t0 o3->554c2172-ea07-8255-881e-847f196c5522@NET_0x50000c0a80399_UUID:0/0 lens 448/400 e 4 to 0 dl 1361320554 ref 2 fl Interpret:/0/0 rc 0/0
Feb 20 09:38:06 oss01 kernel: Lustre: 10022:0:(ldlm_lib.c:574:target_handle_reconnect()) lustre-OST0000: 28190515-07cf-b016-422b-1b05f42dd534 reconnecting
Feb 20 09:38:06 oss01 kernel: Lustre: 10022:0:(ldlm_lib.c:574:target_handle_reconnect()) Skipped 65 previous similar messages
Feb 20 09:38:06 oss01 kernel: Lustre: 10022:0:(ldlm_lib.c:874:target_handle_connect()) lustre-OST0000: refuse reconnection from 28190515-07cf-b016-422b-1b05f42dd534@192.168.3.174@o2ib to 0xffff8102d408d600; still busy with 1 active RPCs
Feb 20 09:38:06 oss01 kernel: Lustre: 10022:0:(ldlm_lib.c:874:target_handle_connect()) Skipped 63 previous similar messages
Feb 20 09:38:06 oss01 kernel: LustreError: 10022:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (-16)  req@ffff810181ffd000 x1420503574335450/t0 o8->28190515-07cf-b016-422b-1b05f42dd534@NET_0x50000c0a803ae_UUID:0/0 lens 368/264 e 0 to 0 dl 1361320786 ref 1 fl Interpret:/0/0 rc -16/0
Feb 20 09:38:06 oss01 kernel: LustreError: 10022:0:(ldlm_lib.c:1919:target_send_reply_msg()) Skipped 63 previous similar messages


 Comments   
Comment by Peter Jones [ 23/Feb/13 ]

Hongchao

Could you please comment on this one?

Thanks

Peter

Comment by Shuichi Ihara (Inactive) [ 26/Feb/13 ]

Hongchao
Have you had a chance to look at the log files?

Comment by Hongchao Zhang [ 27/Feb/13 ]

Hi, this issue is related to ldlm_pools_shrink, and there could be a deadlock on some ldlm_lock. I will need some more time to investigate and will update the status soon.

Comment by Hongchao Zhang [ 04/Mar/13 ]

This issue is somewhat similar to LU-1520, which is caused by contention between concurrent calls of "ldlm_pools_shrink" under low-memory conditions.
Could you please try the patch http://review.whamcloud.com/#change,3270 from LU-1520? Thanks!

The log from oss02 seems different from the others; it could be related to lock contention on inode->i_alloc_sem.

Comment by Shuichi Ihara (Inactive) [ 04/Mar/13 ]

The log from oss02 seems different from the others; it could be related to lock contention on inode->i_alloc_sem.

Do you mean this is not related to LU-1520, but has a different root cause?

Comment by Hongchao Zhang [ 06/Mar/13 ]

Yes, since the stack trace is different. But no cause has been found so far. Is this issue reproducible?
What was the result of testing with the patch from LU-1520?

Comment by Shuichi Ihara (Inactive) [ 19/Apr/14 ]

This patch has landed in 1.8.9 and we have upgraded the system with it. So, please close this ticket.

Comment by Peter Jones [ 19/Apr/14 ]

ok thanks Ihara
