Lustre / LU-5569

Recreating a reverse import produces various failures.

Details


    Description

      Don't reallocate a new reverse import on each client reconnect.
      Disconnecting the reverse import on every client reconnect opens
      several races in the request-sending (mostly AST) code.

      The first problem is a race between send_rpc and
      class_destroy_import(). If an RPC send (or resend) is issued after
      class_destroy_import() has been called, the send will fail the
      import generation check.

      The second problem: target_handle_connect() stops updating the
      connection information for the older reverse import, so an RPC
      cannot be delivered from the server to the client because the
      connection information is stale or the security flavor has
      changed.

      The third problem: connection flags are not updated atomically for
      an import. target_handle_connect() links the new import before the
      message-header flags are set, so an RPC sent at the same time can
      carry the wrong flags.

      The fourth problem: when a client reconnects after a network flap,
      no wakeup event is sent to the RPCs waiting in the import queues.
      This adds a noticeable timeout in the case where the server did
      not send the request before the network flap.

      Some example log lines:

      00000100:00100000:1.0:1407845348.937766:0:62024:0:(service.c:1929:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ll_ost_419:4960df0f-75ed-07a2-cee7-063090dc59cd+4:19257:x1475700821793316:12345-1748@gni1:8 Request procesed in 55us (106us total) trans 0 rc 0/0
      
      00000100:00020000:1.0:1407845393.600747:0:81897:0:(client.c:1115:ptlrpc_import_delay_req()) @@@ req wrong generation:  req@ffff880304e39800 x1475078782385806/t0(0) o105->snx11063-OST0070@1748@gni1:15/16 lens 344/192 e 0 to 1 dl 1407845389 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
      


          Activity

            adilger Andreas Dilger added a comment - - edited

            This patch caused a regression on master. See LU-7221 for details.

            pjones Peter Jones added a comment -

            Landed for 2.8


            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/11750/
            Subject: LU-5569 ptlrpc: change reverse import life cycle
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 892078e3b566c04471e7dcf2c28e66f2f3584f93


            simmonsja James A Simmons added a comment -

            Alex, email me your latest LU-4134 patch and I will push it for you.

            shadow Alexey Lyashkov added a comment -

            Oleg's backtraces are related to a different bug: the wrong obd
            device release process, where we may hold an export alive after
            the obd device is freed, aka LU-4134.
            A patch exists at
            http://review.whamcloud.com/#/c/8045/

            but I am still not able to update it, since I lack access to
            Intel Gerrit with a Google account and password login is blocked
            by the Intel admins.

            simmonsja James A Simmons added a comment -

            It failed Oleg's review process. He listed the backtraces he got.
            spitzcor Cory Spitz added a comment -

            http://review.whamcloud.com/#/c/11750 still needs help getting through review.

            green Oleg Drokin added a comment -

            Ok. So this is a problem that has been there for quite a while, I see?
            In that case I imagine there's no super-critical rush to get this into 2.7, as there is still some discussion about this patch and the surrounding areas. The reality is that the 2.7 code freeze is right around the corner and we cannot drag this ticket out indefinitely.
            I don't think Cray or anybody else plans to jump to 2.7 the second it gets released. So if we need more time to hash out a solution that satisfies everyone, and as a result it slips into 2.8 (and, if 2.7 happens to be a maintenance release, is also then backported to 2.7.1 or something like that, and to 2.5.5 or whatever the going thing is then, and Cray will backport it to their tree anyway no matter where it comes from...), that should be totally ok in my view?

            In other words, I guess I am asking: does anybody have a compelling reason why this should be treated differently than what's described above (i.e. as a blocker, because I am overlooking something and it's a new disastrous failure of epic proportions instead)?

            hornc Chris Horn added a comment -

            We discovered the bugs addressed by this change while investigating some non-POSIX compliant behavior exhibited by Lustre. Below is a description of the problem based on my own understanding and the fixes that were proposed to address it.

            Higher layers of Lustre (CLIO) generally rely on lower layers to enforce POSIX compliance. In this case, the Lustre Distributed Lock Manager (LDLM) and ptlrpc layers are interacting in such a way that results in inappropriate errors being returned to the client. The interaction revolves around a pair of clients performing I/O.

            One client (the writer) creates and writes data to a single file striped across eight OSTs. A second client (the reader) reads the data written by the writer. The reader requests a protected read lock for each stripe of the file from the corresponding OST. Upon receipt of the lock enqueue request, the OSS notes that it has already granted a conflicting lock to the writer. As a result the server sends a blocking AST (BL AST) to the writer. This is a request that the writer cancel its lock so that it may be granted to the reader. Upon receipt of the BL AST the writer should first reply that it has received the BL AST, and then, after flushing any pages to the server, send a lock cancel request back to the server. When the server receives both the reply to the BL AST and the lock cancel request it can then grant the lock to the reader via a completion AST (CP AST). The server sends a CP AST to the reader who must then acknowledge receipt of the CP AST before it can use the resource covered by the requested lock.

            In Lustre, different components communicate via import and export pairs. An import is for sending requests and receiving replies, and an export is for receiving requests and sending replies. However, it is not possible to send a request via an export. As a result, servers utilize a reverse import to send AST requests to clients. A reverse import converts an import and export pair into a corresponding export and import pair. Currently, a new reverse import is created whenever the server creates an export for a client (re)connection. This prevents us from being able to re-send requests that were linked to the old reverse import. Historically this has not been problematic as servers did not have the ability to resend ASTs. With LU-5520 servers are now able to resend ASTs.

            This particular bug (there are other potential flavors) arises if the reader reconnects to an OST granting the lock while the OSS is trying to deliver the CP AST (or, equivalently, if the client is unable to acknowledge receipt of the CP AST). Based on Lustre trace data we determined that an OSS was unable to deliver a CP AST to the reader. While the OSS was waiting for the CP AST acknowledgement the reader reconnected to the OST granting the lock. As mentioned above, this created a new reverse import for this client. When the OSS attempted to resend the CP AST (after a timeout) it found that the old import for the reader had been destroyed. It was thus unable to re-send the request, and the request was immediately failed with a status of -EIO. When LDLM interpreted the failed request, it did not handle the -EIO request status appropriately. LDLM converted the -EIO error into -EAGAIN which was then returned to the client.

            Two fixes are proposed to address different aspects of this bug:
            1. Server-side: evict clients returning errors on ASTs

            • Fix tracked by LU-5581
            • This immediately fixes the POSIX non-compliance issue and helps prevent potential data-corruption cases.

            2. Server-side: change reverse import life cycle:

            • Fix tracked by LU-5569
            • Fixes a race between send_rpc and class_destroy_import() that results in the import generation mismatch.
            • Properly update connection information for “older” reverse import in the event that client nid changes.
            • Ensure connection flags on the reverse import are updated prior to sending any RPCs on the reverse import after reconnect.
            • Wakeup (resend) requests on the reverse import sending list when a client reconnects rather than waiting for the original requests to timeout.

            shadow Alexey Lyashkov added a comment -

            Oleg,

            which patch are you pointing to? The first patch (test only)
            should fail in the newly added tests; the second patch should
            not fail, since it fixes the issue.

            People

              yujian Jian Yu
              shadow Alexey Lyashkov
