Details
-
Bug
-
Resolution: Fixed
-
Minor
-
Lustre 1.8.8
-
3
-
6915
Description
We see the hanging ll_ost_io threads, stacked something in here. And the new connections are refused until this stacked threads finished up. we only had OSS failover to fix this issue. we got backtrace, please have a look at them if see where is problem.
Feb 20 09:22:46 oss01 kernel: Call Trace: Feb 20 09:22:46 oss01 kernel: [<ffffffff80062ff2>] thread_return+0x62/0xfe Feb 20 09:22:46 oss01 kernel: [<ffffffff80046c6e>] try_to_wake_up+0x472/0x484 Feb 20 09:22:46 oss01 kernel: [<ffffffff800649e1>] __down+0x99/0xd8 Feb 20 09:22:46 oss01 kernel: [<ffffffff8028b1d5>] __down_trylock+0x44/0x4e Feb 20 09:22:46 oss01 kernel: [<ffffffff800646c9>] __down_failed+0x35/0x3a Feb 20 09:22:46 oss01 kernel: [<ffffffff887fe406>] .text.lock.ldlm_pool+0x37/0x71 [ptlrpc] Feb 20 09:22:46 oss01 kernel: [<ffffffff80064614>] __down_read+0x12/0x92 Feb 20 09:22:46 oss01 kernel: [<ffffffff8003f279>] shrink_slab+0xd0/0x153 Feb 20 09:22:46 oss01 kernel: [<ffffffff800ce4ce>] zone_reclaim+0x235/0x2cd Feb 20 09:22:46 oss01 kernel: [<ffffffff8000985a>] __d_lookup+0xb0/0xff Feb 20 09:22:46 oss01 kernel: [<ffffffff8000a939>] get_page_from_freelist+0xbf/0x442 Feb 20 09:22:46 oss01 kernel: [<ffffffff88b134d2>] filter_fid2dentry+0x512/0x740 [obdfilter] Feb 20 09:22:46 oss01 kernel: [<ffffffff8000f46f>] __alloc_pages+0x78/0x308 Feb 20 09:22:46 oss01 kernel: [<ffffffff888a33ad>] kiblnd_launch_tx+0x16d/0x9d0 [ko2iblnd] Feb 20 09:22:46 oss01 kernel: [<ffffffff80025e20>] find_or_create_page+0x32/0x72 Feb 20 09:22:46 oss01 kernel: [<ffffffff88b27445>] filter_get_page+0x35/0x70 [obdfilter] Feb 20 09:22:46 oss01 kernel: [<ffffffff88b27a81>] filter_preprw_read+0x601/0xd30 [obdfilter] Feb 20 09:22:46 oss01 kernel: [<ffffffff886ff873>] lnet_send+0x9a3/0x9d0 [lnet] Feb 20 09:22:46 oss01 kernel: [<ffffffff886fdaf7>] lnet_prep_send+0x67/0xb0 [lnet] Feb 20 09:22:46 oss01 kernel: [<ffffffff88b29f0c>] filter_preprw+0x1d5c/0x1dc0 [obdfilter] Feb 20 09:22:46 oss01 kernel: [<ffffffff88816c3a>] lustre_pack_reply_flags+0x86a/0x950 [ptlrpc] Feb 20 09:22:46 oss01 kernel: [<ffffffff8880eaa8>] ptlrpc_send_reply+0x5e8/0x600 [ptlrpc] Feb 20 09:22:46 oss01 kernel: [<ffffffff88acfac3>] ost_brw_read+0xb33/0x1a70 [ost] Feb 20 09:22:46 oss01 kernel: [<ffffffff88812ed5>] lustre_msg_get_opc+0x35/0xf0 [ptlrpc] Feb 20 09:22:46 oss01 kernel: [<ffffffff8008e7f9>] default_wake_function+0x0/0xe Feb 20 09:22:46 oss01 kernel: [<ffffffff88813088>] lustre_msg_check_version_v2+0x8/0x20 [ptlrpc] Feb 20 09:22:46 oss01 kernel: [<ffffffff88ad8363>] ost_handle+0x2e73/0x55b0 [ost] Feb 20 09:22:46 oss01 kernel: [<ffffffff80154b81>] __next_cpu+0x19/0x28 Feb 20 09:22:46 oss01 kernel: [<ffffffff800778c5>] smp_send_reschedule+0x4e/0x53 Feb 20 09:22:46 oss01 kernel: [<ffffffff888226b9>] ptlrpc_server_handle_request+0x989/0xe00 [ptlrpc] Feb 20 09:22:46 oss01 kernel: [<ffffffff88822e15>] ptlrpc_wait_event+0x2e5/0x310 [ptlrpc] Feb 20 09:22:46 oss01 kernel: [<ffffffff8008cc1e>] __wake_up_common+0x3e/0x68 Feb 20 09:22:46 oss01 kernel: [<ffffffff88823da6>] ptlrpc_main+0xf66/0x1120 [ptlrpc] Feb 20 09:22:46 oss01 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11 Feb 20 09:22:46 oss01 kernel: [<ffffffff88822e40>] ptlrpc_main+0x0/0x1120 [ptlrpc] Feb 20 09:22:46 oss01 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11 Feb 20 09:22:46 oss01 kernel: Feb 20 09:22:46 oss01 kernel: LustreError: dumping log to /tmp/lustre-log.1361319766.23768 Feb 20 09:35:48 oss01 kernel: Lustre: 13106:0:(service.c:808:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-383), not sending early reply Feb 20 09:35:48 oss01 kernel: req@ffff8105cc619800 x1420503574224933/t0 o3->28190515-07cf-b016-422b-1b05f42dd534@NET_0x50000c0a803ae_UUID:0/0 lens 448/400 e 4 to 0 dl 1361320553 ref 2 fl Interpret:/ 0/0 rc 0/0 Feb 20 09:35:48 oss01 kernel: Lustre: 13106:0:(service.c:808:ptlrpc_at_send_early_reply()) Skipped 1 previous similar message Feb 20 09:35:49 oss01 kernel: Lustre: 8237:0:(service.c:808:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-383), not sending early reply Feb 20 09:35:49 oss01 kernel: req@ffff81061e650400 x1420503605573149/t0 o3->554c2172-ea07-8255-881e-847f196c5522@NET_0x50000c0a80399_UUID:0/0 lens 448/400 e 4 to 0 dl 1361320554 ref 2 fl Interpret:/ 0/0 rc 0/0 Feb 20 09:38:06 oss01 kernel: Lustre: 10022:0:(ldlm_lib.c:574:target_handle_reconnect()) lustre-OST0000: 28190515-07cf-b016-422b-1b05f42dd534 reconnecting Feb 20 09:38:06 oss01 kernel: Lustre: 10022:0:(ldlm_lib.c:574:target_handle_reconnect()) Skipped 65 previous similar messages Feb 20 09:38:06 oss01 kernel: Lustre: 10022:0:(ldlm_lib.c:874:target_handle_connect()) lustre-OST0000: refuse reconnection from 28190515-07cf-b016-422b-1b05f42dd534@192.168.3.174@o2ib to 0xffff8102d40 8d600; still busy with 1 active RPCs Feb 20 09:38:06 oss01 kernel: Lustre: 10022:0:(ldlm_lib.c:874:target_handle_connect()) Skipped 63 previous similar messages Feb 20 09:38:06 oss01 kernel: LustreError: 10022:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (-16) req@ffff810181ffd000 x1420503574335450/t0 o8->28190515-07cf-b016-422b-1b05f42dd 534@NET_0x50000c0a803ae_UUID:0/0 lens 368/264 e 0 to 0 dl 1361320786 ref 1 fl Interpret:/0/0 rc -16/0 Feb 20 09:38:06 oss01 kernel: LustreError: 10022:0:(ldlm_lib.c:1919:target_send_reply_msg()) Skipped 63 previous similar messages