[LU-1219] The connection is refused due to still busy with 1 active RPCs

Details

    • Bug
    • Resolution: Incomplete
    • Minor
    • None
    • Lustre 1.8.7
    • None
    • server : lustre-1.8.7, client : lustre-1.8.4.ddn2.2
    • 3
    • 6434

    Description

      We saw the following call trace at the customer site, and one OST is refusing connections because it is "still busy with 1 active RPCs".

      Mar 15 00:14:14 t2s007059 kernel: Pid: 13147, comm: ll_ost_io_229
      Mar 15 00:14:14 t2s007059 kernel: 
      Mar 15 00:14:14 t2s007059 kernel: Call Trace:
      Mar 15 00:14:14 t2s007059 kernel:  [<ffffffff8002e024>] __wake_up+0x38/0x4f
      Mar 15 00:14:14 t2s007059 kernel:  [<ffffffff889fc7f3>] jbd2_log_wait_commit+0xa3/0xf5 [jbd2]
      Mar 15 00:14:14 t2s007059 kernel:  [<ffffffff800a2dff>] autoremove_wake_function+0x0/0x2e
      Mar 15 00:14:14 t2s007059 kernel:  [<ffffffff88b1390b>] fsfilt_ldiskfs_commit_wait+0xab/0xd0 [fsfilt_ldiskfs]
      Mar 15 00:14:14 t2s007059 kernel:  [<ffffffff88b54144>] filter_commitrw_write+0x1e14/0x2dd0 [obdfilter]
      Mar 15 00:14:14 t2s007059 kernel:  [<ffffffff886da3a6>] lnet_ni_send+0x96/0xe0 [lnet]
      Mar 15 00:14:14 t2s007059 kernel:  [<ffffffff88aeeede>] ost_checksum_bulk+0x36e/0x5a0 [ost]
      Mar 15 00:14:14 t2s007059 kernel:  [<ffffffff88af5d09>] ost_brw_write+0x1c99/0x2480 [ost]
      Mar 15 00:14:14 t2s007059 kernel:  [<ffffffff887ebac8>] ptlrpc_send_reply+0x5e8/0x600 [ptlrpc]
      Mar 15 00:14:14 t2s007059 kernel:  [<ffffffff887b68b0>] target_committed_to_req+0x40/0x120 [ptlrpc]
      Mar 15 00:14:14 t2s007059 kernel:  [<ffffffff8008e7f9>] default_wake_function+0x0/0xe
      Mar 15 00:14:14 t2s007059 kernel:  [<ffffffff887f00a8>] lustre_msg_check_version_v2+0x8/0x20 [ptlrpc]
      Mar 15 00:14:14 t2s007059 kernel:  [<ffffffff88af909e>] ost_handle+0x2bae/0x55b0 [ost]
      Mar 15 00:14:14 t2s007059 kernel:  [<ffffffff88741d00>] class_handle2object+0xe0/0x170 [obdclass]
      Mar 15 00:14:14 t2s007059 kernel:  [<ffffffff887aa19a>] lock_res_and_lock+0xba/0xd0 [ptlrpc]
      Mar 15 00:14:14 t2s007059 kernel:  [<ffffffff887af168>] __ldlm_handle2lock+0x2f8/0x360 [ptlrpc]
      Mar 15 00:14:14 t2s007059 kernel:  [<ffffffff887ff6d9>] ptlrpc_server_handle_request+0x989/0xe00 [ptlrpc]
      Mar 15 00:14:14 t2s007059 kernel:  [<ffffffff887ffe35>] ptlrpc_wait_event+0x2e5/0x310 [ptlrpc]
      Mar 15 00:14:14 t2s007059 kernel:  [<ffffffff8008cc1e>] __wake_up_common+0x3e/0x68
      Mar 15 00:14:14 t2s007059 kernel:  [<ffffffff88800dc6>] ptlrpc_main+0xf66/0x1120 [ptlrpc]
      Mar 15 00:14:14 t2s007059 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
      Mar 15 00:14:14 t2s007059 kernel:  [<ffffffff887ffe60>] ptlrpc_main+0x0/0x1120 [ptlrpc]
      Mar 15 00:14:14 t2s007059 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11
      Mar 15 00:14:14 t2s007059 kernel: 
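
When triaging a trace like the one above, it can help to reduce each log line to its symbol and module and to check whether the thread was waiting on a journal commit. The following is a minimal sketch that assumes the syslog frame format shown in this ticket. The names `FRAME_RE`, `parse_frames` and `waits_on_journal_commit` are illustrative helpers, not part of any Lustre tool, and the sample holds only the first few frames copied from the trace.

```python
import re

# One stack-frame line from the log above looks like:
#   "... kernel:  [<ffffffff889fc7f3>] jbd2_log_wait_commit+0xa3/0xf5 [jbd2]"
# The trailing "[module]" is absent for symbols built into the kernel.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+"
    r"(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>[^\]]+)\])?"
)


def parse_frames(lines):
    """Return (function, module) pairs in the order they appear in the log."""
    frames = []
    for line in lines:
        match = FRAME_RE.search(line)
        if match:
            frames.append((match.group("func"), match.group("module") or "kernel"))
    return frames


def waits_on_journal_commit(frames):
    """True if any frame is jbd2_log_wait_commit, as in the trace above."""
    return any(func == "jbd2_log_wait_commit" for func, _module in frames)


# First frames of the trace in this ticket, copied verbatim.
SAMPLE = [
    "Mar 15 00:14:14 t2s007059 kernel: Call Trace:",
    "Mar 15 00:14:14 t2s007059 kernel:  [<ffffffff8002e024>] __wake_up+0x38/0x4f",
    "Mar 15 00:14:14 t2s007059 kernel:  [<ffffffff889fc7f3>] jbd2_log_wait_commit+0xa3/0xf5 [jbd2]",
    "Mar 15 00:14:14 t2s007059 kernel:  [<ffffffff88b1390b>] fsfilt_ldiskfs_commit_wait+0xab/0xd0 [fsfilt_ldiskfs]",
    "Mar 15 00:14:14 t2s007059 kernel:  [<ffffffff88b54144>] filter_commitrw_write+0x1e14/0x2dd0 [obdfilter]",
]

frames = parse_frames(SAMPLE)
for func, module in frames:
    print(f"{module:16s} {func}")
print("waits on jbd2 commit:", waits_on_journal_commit(frames))
```

Traces printed by kernels of this era may include stale addresses left on the stack, so the list is best read as "symbols seen on the stack" rather than a strict call chain.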
      


          Activity


            adilger Andreas Dilger added a comment - Close this old issue until there is more information available.

            hongchao.zhang Hongchao Zhang added a comment -

            Hi Shuichi,

            Sorry for the delayed response.
            Please capture the thread info of all tasks via sysrq (echo "t" > /proc/sysrq-trigger) on the node where the OST resides.
            It would also help to get the current running address of the "kjournald2" process at that moment. Thanks!
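
The collection step requested above could be scripted roughly as follows. This is a sketch under stated assumptions, not a tested procedure for this site: the function names are hypothetical, the script must run as root on the server hosting the OST, and it assumes the `dmesg` command is available to read the kernel ring buffer afterwards.

```python
import subprocess
from pathlib import Path


def trigger_task_dump(trigger_path="/proc/sysrq-trigger"):
    """Ask the kernel to log the state and stack of every task.

    Equivalent to: echo "t" > /proc/sysrq-trigger (requires root).
    The path is a parameter only so the function can be exercised
    against an ordinary file.
    """
    with open(trigger_path, "w") as trigger:
        trigger.write("t\n")


def save_kernel_log(out_path):
    """Save the current kernel ring buffer (dmesg output) to out_path."""
    result = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=True
    )
    Path(out_path).write_text(result.stdout)
    return out_path


def collect(out_path, trigger_path="/proc/sysrq-trigger"):
    """Trigger the task dump, then save the kernel log for later analysis."""
    trigger_task_dump(trigger_path)
    return save_kernel_log(out_path)
```

Calling `collect()` with an output path of your choice before rebooting would preserve the dump. On a busy server the dump can be larger than the ring buffer, so the copy that syslog writes to /var/log/messages is worth keeping as well. The "kjournald2" entry in the saved output is the part of interest here.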

            ihara Shuichi Ihara (Inactive) added a comment -

            Hi,
            The customer is waiting to hear what they should collect the next time the same problem happens.
            Please let me know exactly what we should do.

            ihara Shuichi Ihara (Inactive) added a comment -

            Hongchao,

            Unfortunately, we only have /var/log/messages and /tmp/lustre-log.<timestamp>.

            If the problem happens again, we can collect all the information you want.
            Please let us know which commands we should run before rebooting the servers.

            hongchao.zhang Hongchao Zhang added a comment -

            The journal is stuck committing the previous transaction. Is the process info from this node available?
            The stack trace of "kjournald2" should give some information about where it was stuck. Thanks!
            green Oleg Drokin added a comment -

            The underlying issue is that the write is somehow stuck in jbd. LU-793 would not fix this; it just papers over some of the symptoms.

            adilger Andreas Dilger added a comment -

            This is possibly a duplicate of LU-793, for which Oleg already has a patch.
            pjones Peter Jones added a comment -

            Hongchao

            Could you please help with this one?

            Thanks

            Peter


            People

              Assignee: hongchao.zhang Hongchao Zhang
              Reporter: ihara Shuichi Ihara (Inactive)