[LU-5569] Recreating a reverse import produces various failures.

    Description

      Don't allocate a new reverse import for each client reconnect.
      Disconnecting the reverse import on each client reconnect opens
      several races in the request-sending code (mostly for ASTs).

      First problem: a send_rpc vs. class_destroy_import() race. If an
      RPC is sent (or resent) after class_destroy_import() has been
      called, the send fails because of the import generation check.
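The generation check behind the first problem can be illustrated with a small model. This is only a sketch under stated assumptions: the `sketch_*` types and functions are hypothetical stand-ins, not the Lustre structures or the code from the patch. It assumes that destroying an import advances its generation and that a request prepared under an older generation is rejected at send time.

```c
#include <errno.h>

/* Hypothetical, simplified stand-in for an import. */
struct sketch_import {
    int generation;          /* advanced when the import is destroyed */
};

/* Hypothetical, simplified stand-in for a request. */
struct sketch_request {
    int import_generation;   /* generation seen when the request was prepared */
};

/* Record the import generation at the time the request is prepared. */
static void sketch_prepare(struct sketch_request *req,
                           const struct sketch_import *imp)
{
    req->import_generation = imp->generation;
}

/* Model of destroying the import: older requests become stale. */
static void sketch_destroy_import(struct sketch_import *imp)
{
    imp->generation++;
}

/* Return 0 if the request may be sent, -EIO if its generation is stale. */
static int sketch_check_send(const struct sketch_request *req,
                             const struct sketch_import *imp)
{
    if (req->import_generation < imp->generation)
        return -EIO;
    return 0;
}
```

In this model, a request prepared before `sketch_destroy_import()` but checked afterwards is rejected, which is the kind of outcome the "req wrong generation" log line reports.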

      Second problem: target_handle_connect() stops updating the
      connection information for the older reverse import. As a result,
      an RPC cannot be delivered from the server to the client because
      of wrong connection information or a changed security flavor.

      Third problem: connection flags are not updated atomically for an
      import. target_handle_connect() links the new import before the
      message header flags are set, so an RPC sent at the same time will
      carry the wrong flags.
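One general way to avoid the ordering issue in the third problem is to finish initializing the import before making it visible to senders. The sketch below uses C11 atomics for that; it is an illustration only, not the approach taken in the patch. All `sketch_*` names and the field layout are hypothetical.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical import carrying the flags a sender copies into an RPC. */
struct sketch_import {
    uint64_t connect_flags;
    uint32_t msghdr_flags;
};

/* Hypothetical export holding the published reverse import. */
struct sketch_export {
    _Atomic(struct sketch_import *) reverse_import;
};

/*
 * Set every flag first, then publish the pointer with release ordering,
 * so a sender that observes the pointer also observes the flags.
 */
static void sketch_link_import(struct sketch_export *exp,
                               struct sketch_import *imp,
                               uint64_t connect_flags,
                               uint32_t msghdr_flags)
{
    imp->connect_flags = connect_flags;
    imp->msghdr_flags = msghdr_flags;
    atomic_store_explicit(&exp->reverse_import, imp, memory_order_release);
}

/* Sender side: acquire the pointer, then read the flags it guards. */
static uint32_t sketch_sender_flags(struct sketch_export *exp)
{
    struct sketch_import *imp =
        atomic_load_explicit(&exp->reverse_import, memory_order_acquire);

    return imp ? imp->msghdr_flags : 0;
}
```

Reversing the two steps in `sketch_link_import()` (publish first, set flags second) would reproduce the window described above, in which a concurrent sender sees the import but not yet its flags.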

      Fourth problem: when a client reconnects after a network flap, no
      wakeup event is sent to the RPCs in the import queues. This adds a
      noticeable delay: the original request has to time out before the
      server can resend it.
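The cost of the missing wakeup in the fourth problem can be shown with a toy model. It is a sketch under assumptions, not Lustre code: `sketch_req` is a hypothetical request with a timeout deadline, and the array stands in for the sending list of the reverse import. Without a wakeup the retry happens at the deadline; with a wakeup on reconnect it can happen immediately.

```c
#include <stddef.h>

/* Hypothetical in-flight request waiting for a reply. */
struct sketch_req {
    long deadline;   /* time at which the request would time out */
    int  resend;     /* set when the request has been woken for resend */
};

/* On reconnect, wake every waiting request; returns how many were woken. */
static size_t sketch_wake_on_reconnect(struct sketch_req *sending, size_t n)
{
    size_t woken = 0;

    for (size_t i = 0; i < n; i++) {
        if (!sending[i].resend) {
            sending[i].resend = 1;
            woken++;
        }
    }
    return woken;
}

/* Earliest retry time: now if woken, otherwise the timeout deadline. */
static long sketch_retry_time(const struct sketch_req *req, long now)
{
    if (req->resend)
        return now;
    return req->deadline > now ? req->deadline : now;
}
```

For a request with a deadline of 100 and a reconnect at time 60, the model retries at 100 without the wakeup and at 60 with it.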

      Some examples:

      00000100:00100000:1.0:1407845348.937766:0:62024:0:(service.c:1929:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ll_ost_419:4960df0f-75ed-07a2-cee7-063090dc59cd+4:19257:x1475700821793316:12345-1748@gni1:8 Request procesed in 55us (106us total) trans 0 rc 0/0
      
      00000100:00020000:1.0:1407845393.600747:0:81897:0:(client.c:1115:ptlrpc_import_delay_req()) @@@ req wrong generation:  req@ffff880304e39800 x1475078782385806/t0(0) o105->snx11063-OST0070@1748@gni1:15/16 lens 344/192 e 0 to 1 dl 1407845389 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
      

          Activity

            shadow Alexey Lyashkov added a comment -

            Chris,

            you have a correct description. Thanks for the rephrase!

            hornc Chris Horn added a comment -

            Fourth problem, client reconnecting after network flap have result
            none wakeup event send to a RPC in import queues. That situation adds
            noticeable timeout in case server don't send request before network
            flap.

            Alexey, I'm having trouble understanding your description of this fourth problem. Is the following description correct?

            When a client reconnects after a network flap, we do not currently wake up any RPCs in the (reverse) import queue (specifically the imp_sending_list of the reverse import). This means we need to wait for the original request to time out before the server can resend the request.

            shadow Alexey Lyashkov added a comment - http://review.whamcloud.com/11750

            johann Johann Lombardi (Inactive) added a comment -

            I talked to Alexey on Skype to get an answer to the question above. The problem is that ldlm_handle_ast_error() doesn't evict the client in some error cases (he mentioned EIO & EPROTO) where the AST wasn't delivered or properly processed by the client. The server then cancels the lock locally and grants the conflicting lock while the client still thinks it owns a valid lock and might continue writing to the file.

            johann Johann Lombardi (Inactive) added a comment -

            Alexey, in gerrit 9335, you mentioned a "data corruption" problem. Could you please elaborate and explain why this issue only shows up with the AST resend patch? Thanks in advance.

            shadow Alexey Lyashkov added a comment -

            Patch to test for such bugs:

            http://review.whamcloud.com/11724

            People

              yujian Jian Yu
              shadow Alexey Lyashkov