
LU-2854: hanging ll_ost_io threads and refused connections

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 1.8.9
    • Affects Version/s: Lustre 1.8.8
    • 3
    • 6915

    Description

      We are seeing hanging ll_ost_io threads, stuck somewhere in the trace below, and new connections are refused until these stuck threads finish. The only way we had to recover was an OSS failover. We captured backtraces; please have a look at them and see where the problem is.

      Feb 20 09:22:46 oss01 kernel: Call Trace:
      Feb 20 09:22:46 oss01 kernel:  [<ffffffff80062ff2>] thread_return+0x62/0xfe
      Feb 20 09:22:46 oss01 kernel:  [<ffffffff80046c6e>] try_to_wake_up+0x472/0x484
      Feb 20 09:22:46 oss01 kernel:  [<ffffffff800649e1>] __down+0x99/0xd8
      Feb 20 09:22:46 oss01 kernel:  [<ffffffff8028b1d5>] __down_trylock+0x44/0x4e
      Feb 20 09:22:46 oss01 kernel:  [<ffffffff800646c9>] __down_failed+0x35/0x3a
      Feb 20 09:22:46 oss01 kernel:  [<ffffffff887fe406>] .text.lock.ldlm_pool+0x37/0x71 [ptlrpc]
      Feb 20 09:22:46 oss01 kernel:  [<ffffffff80064614>] __down_read+0x12/0x92
      Feb 20 09:22:46 oss01 kernel:  [<ffffffff8003f279>] shrink_slab+0xd0/0x153
      Feb 20 09:22:46 oss01 kernel:  [<ffffffff800ce4ce>] zone_reclaim+0x235/0x2cd
      Feb 20 09:22:46 oss01 kernel:  [<ffffffff8000985a>] __d_lookup+0xb0/0xff
      Feb 20 09:22:46 oss01 kernel:  [<ffffffff8000a939>] get_page_from_freelist+0xbf/0x442
      Feb 20 09:22:46 oss01 kernel:  [<ffffffff88b134d2>] filter_fid2dentry+0x512/0x740 [obdfilter]
      Feb 20 09:22:46 oss01 kernel:  [<ffffffff8000f46f>] __alloc_pages+0x78/0x308
      Feb 20 09:22:46 oss01 kernel:  [<ffffffff888a33ad>] kiblnd_launch_tx+0x16d/0x9d0 [ko2iblnd]
      Feb 20 09:22:46 oss01 kernel:  [<ffffffff80025e20>] find_or_create_page+0x32/0x72
      Feb 20 09:22:46 oss01 kernel:  [<ffffffff88b27445>] filter_get_page+0x35/0x70 [obdfilter]
      Feb 20 09:22:46 oss01 kernel:  [<ffffffff88b27a81>] filter_preprw_read+0x601/0xd30 [obdfilter]
      Feb 20 09:22:46 oss01 kernel:  [<ffffffff886ff873>] lnet_send+0x9a3/0x9d0 [lnet]
      Feb 20 09:22:46 oss01 kernel:  [<ffffffff886fdaf7>] lnet_prep_send+0x67/0xb0 [lnet]
      Feb 20 09:22:46 oss01 kernel:  [<ffffffff88b29f0c>] filter_preprw+0x1d5c/0x1dc0 [obdfilter]
      Feb 20 09:22:46 oss01 kernel:  [<ffffffff88816c3a>] lustre_pack_reply_flags+0x86a/0x950 [ptlrpc]
      Feb 20 09:22:46 oss01 kernel:  [<ffffffff8880eaa8>] ptlrpc_send_reply+0x5e8/0x600 [ptlrpc]
      Feb 20 09:22:46 oss01 kernel:  [<ffffffff88acfac3>] ost_brw_read+0xb33/0x1a70 [ost]
      Feb 20 09:22:46 oss01 kernel:  [<ffffffff88812ed5>] lustre_msg_get_opc+0x35/0xf0 [ptlrpc]
      Feb 20 09:22:46 oss01 kernel:  [<ffffffff8008e7f9>] default_wake_function+0x0/0xe
      Feb 20 09:22:46 oss01 kernel:  [<ffffffff88813088>] lustre_msg_check_version_v2+0x8/0x20 [ptlrpc]
      Feb 20 09:22:46 oss01 kernel:  [<ffffffff88ad8363>] ost_handle+0x2e73/0x55b0 [ost]
      Feb 20 09:22:46 oss01 kernel:  [<ffffffff80154b81>] __next_cpu+0x19/0x28
      Feb 20 09:22:46 oss01 kernel:  [<ffffffff800778c5>] smp_send_reschedule+0x4e/0x53
      Feb 20 09:22:46 oss01 kernel:  [<ffffffff888226b9>] ptlrpc_server_handle_request+0x989/0xe00 [ptlrpc]
      Feb 20 09:22:46 oss01 kernel:  [<ffffffff88822e15>] ptlrpc_wait_event+0x2e5/0x310 [ptlrpc]
      Feb 20 09:22:46 oss01 kernel:  [<ffffffff8008cc1e>] __wake_up_common+0x3e/0x68
      Feb 20 09:22:46 oss01 kernel:  [<ffffffff88823da6>] ptlrpc_main+0xf66/0x1120 [ptlrpc]
      Feb 20 09:22:46 oss01 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
      Feb 20 09:22:46 oss01 kernel:  [<ffffffff88822e40>] ptlrpc_main+0x0/0x1120 [ptlrpc]
      Feb 20 09:22:46 oss01 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11
      Feb 20 09:22:46 oss01 kernel: 
      Feb 20 09:22:46 oss01 kernel: LustreError: dumping log to /tmp/lustre-log.1361319766.23768
      Feb 20 09:35:48 oss01 kernel: Lustre: 13106:0:(service.c:808:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-383), not sending early reply
      Feb 20 09:35:48 oss01 kernel:   req@ffff8105cc619800 x1420503574224933/t0 o3->28190515-07cf-b016-422b-1b05f42dd534@NET_0x50000c0a803ae_UUID:0/0 lens 448/400 e 4 to 0 dl 1361320553 ref 2 fl Interpret:/0/0 rc 0/0
      Feb 20 09:35:48 oss01 kernel: Lustre: 13106:0:(service.c:808:ptlrpc_at_send_early_reply()) Skipped 1 previous similar message
      Feb 20 09:35:49 oss01 kernel: Lustre: 8237:0:(service.c:808:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-383), not sending early reply
      Feb 20 09:35:49 oss01 kernel:   req@ffff81061e650400 x1420503605573149/t0 o3->554c2172-ea07-8255-881e-847f196c5522@NET_0x50000c0a80399_UUID:0/0 lens 448/400 e 4 to 0 dl 1361320554 ref 2 fl Interpret:/0/0 rc 0/0
      Feb 20 09:38:06 oss01 kernel: Lustre: 10022:0:(ldlm_lib.c:574:target_handle_reconnect()) lustre-OST0000: 28190515-07cf-b016-422b-1b05f42dd534 reconnecting
      Feb 20 09:38:06 oss01 kernel: Lustre: 10022:0:(ldlm_lib.c:574:target_handle_reconnect()) Skipped 65 previous similar messages
      Feb 20 09:38:06 oss01 kernel: Lustre: 10022:0:(ldlm_lib.c:874:target_handle_connect()) lustre-OST0000: refuse reconnection from 28190515-07cf-b016-422b-1b05f42dd534@192.168.3.174@o2ib to 0xffff8102d408d600; still busy with 1 active RPCs
      Feb 20 09:38:06 oss01 kernel: Lustre: 10022:0:(ldlm_lib.c:874:target_handle_connect()) Skipped 63 previous similar messages
      Feb 20 09:38:06 oss01 kernel: LustreError: 10022:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (-16)  req@ffff810181ffd000 x1420503574335450/t0 o8->28190515-07cf-b016-422b-1b05f42dd534@NET_0x50000c0a803ae_UUID:0/0 lens 368/264 e 0 to 0 dl 1361320786 ref 1 fl Interpret:/0/0 rc -16/0
      Feb 20 09:38:06 oss01 kernel: LustreError: 10022:0:(ldlm_lib.c:1919:target_send_reply_msg()) Skipped 63 previous similar messages
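      The trace above appears to show the service thread blocking inside memory allocation: filter_preprw_read() -> filter_get_page() -> __alloc_pages() enters direct reclaim (shrink_slab), which calls the ldlm pool shrinker, and the thread then sleeps in __down() on a lock in the ldlm_pool code. The following userspace sketch is NOT Lustre code; the names shrink_ldlm_pools, alloc_page_under_pressure and ost_io_thread are invented for illustration. It only models how many I/O service threads can pile up behind one slow shrinker lock and appear hung at the same time.

      /* Minimal userspace model of the pile-up seen in the backtrace:
       * every I/O service thread that allocates pages under memory
       * pressure drops into direct reclaim and serializes on a single
       * shrinker lock, so all service threads appear hung at once.
       * All names are illustrative stand-ins, not real kernel symbols. */
      #include <pthread.h>
      #include <stdio.h>
      #include <unistd.h>

      #define NTHREADS 8

      static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;

      /* Stand-in for the ldlm pool shrinker: a slow scan under one global lock. */
      static void shrink_ldlm_pools(void)
      {
              pthread_mutex_lock(&pool_lock);
              sleep(2);               /* scanning/cancelling unused locks is slow */
              pthread_mutex_unlock(&pool_lock);
      }

      /* Stand-in for filter_get_page() -> __alloc_pages(): under memory
       * pressure every allocation enters direct reclaim and hits the shrinker. */
      static void alloc_page_under_pressure(void)
      {
              shrink_ldlm_pools();
      }

      static void *ost_io_thread(void *arg)
      {
              long id = (long)arg;

              printf("thread %ld: handling OST_READ, allocating pages\n", id);
              alloc_page_under_pressure();    /* all threads queue on pool_lock */
              printf("thread %ld: done\n", id);
              return NULL;
      }

      int main(void)
      {
              pthread_t tid[NTHREADS];
              long i;

              for (i = 0; i < NTHREADS; i++)
                      pthread_create(&tid[i], NULL, ost_io_thread, (void *)i);
              for (i = 0; i < NTHREADS; i++)
                      pthread_join(tid[i], NULL);
              return 0;
      }

      Built with "gcc -pthread", the "done" messages trickle out one every two seconds, which matches the externally visible symptom here: the threads are not dead, just serialized behind the shrinker while reconnect attempts are refused.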
      

      Attachments

        1. oss01_messages
          934 kB
        2. oss02_messages
          999 kB
        3. oss03_messages
          812 kB
        4. oss04_messages
          931 kB

        Activity

          [LU-2854] hanging ll_ost_io threads and refused connections
          pjones Peter Jones added a comment -

          ok thanks Ihara


          ihara Shuichi Ihara (Inactive) added a comment -

          These patches have landed in 1.8.9 and we have upgraded the system with them, so please close this ticket.

          hongchao.zhang Hongchao Zhang added a comment -

          Yes, the stack trace is different, but no cause has been found so far. Is this issue reproducible?
          What is the result of testing with the patch from LU-1520?

          ihara Shuichi Ihara (Inactive) added a comment -

          > The log on OSS02 seems to be different from the others; it could be related to lock contention on inode->i_alloc_sem.

          You mean this is not related to LU-1520, but has another root cause?

          hongchao.zhang Hongchao Zhang added a comment -

          This issue is a little similar to LU-1520, which is caused by contention between different callers of "ldlm_pools_shrink" under low-memory conditions.
          Could you please try the patch http://review.whamcloud.com/#change,3270 from LU-1520? Thanks!

          The log on OSS02 seems to be different from the others; it could be related to lock contention on inode->i_alloc_sem.
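          For reference, the usual way this kind of shrinker contention is mitigated is to let concurrent callers back off instead of sleeping on the pool lock. The sketch below only illustrates that trylock pattern in userspace C; it is not the contents of the LU-1520 patch (http://review.whamcloud.com/#change,3270), and the function and lock names are invented for the example.

          /* Userspace sketch of the back-off pattern: only one caller scans
           * the pools, concurrent callers return immediately instead of
           * blocking the allocating thread.  Illustrative only; not the
           * actual change from http://review.whamcloud.com/#change,3270. */
          #include <pthread.h>
          #include <stdio.h>
          #include <unistd.h>

          static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;

          static int shrink_ldlm_pools(int nr_to_scan)
          {
                  int freed = 0;

                  /* If another thread is already shrinking, do not sleep on
                   * the lock; report "nothing freed" so reclaim moves on to
                   * other caches instead of hanging the service thread. */
                  if (pthread_mutex_trylock(&pool_lock) != 0)
                          return 0;

                  sleep(1);               /* pretend to cancel unused locks */
                  freed = nr_to_scan;
                  pthread_mutex_unlock(&pool_lock);
                  return freed;
          }

          static void *reclaim_thread(void *arg)
          {
                  printf("thread %ld freed %d objects\n",
                         (long)arg, shrink_ldlm_pools(128));
                  return NULL;
          }

          int main(void)
          {
                  pthread_t a, b;

                  pthread_create(&a, NULL, reclaim_thread, (void *)1L);
                  pthread_create(&b, NULL, reclaim_thread, (void *)2L);
                  pthread_join(a, NULL);
                  pthread_join(b, NULL);
                  return 0;
          }

          With a blocking lock both threads would wait their turn; with trylock the second caller returns at once, which is the behaviour a non-blocking shrinker would give the piled-up ll_ost_io threads.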

          hongchao.zhang Hongchao Zhang added a comment -

          Hi, this issue is related to ldlm_pools_shrink and there could be a deadlock on some ldlm_lock. I will need some more time to investigate it and will update the status soon.

          ihara Shuichi Ihara (Inactive) added a comment -

          Hongchao,
          have you had a chance to look at the log files?
          pjones Peter Jones added a comment -

          Hongchao

          Could you please comment on this one?

          Thanks

          Peter


          People

            Assignee:
            hongchao.zhang Hongchao Zhang
            Reporter:
            ihara Shuichi Ihara (Inactive)
            Votes:
            0
            Watchers:
            4

            Dates

              Created:
              Updated:
              Resolved: