[LU-793] Reconnections should not be refused when there is a request in progress from this client.

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.6.0
    • Affects Version/s: Lustre 2.1.0, Lustre 2.2.0, Lustre 2.4.0, Lustre 1.8.6
    • 3
    • 5914

    Description

      While originally this was a useful workaround, it created a lot of other unintended problems.

      This code must be disabled; instead, we should just prevent several duplicate requests from being handled at the same time.
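One way to picture the proposal in the description is a per-client table of requests that are currently being processed: a resend of a request that is still in progress is turned away, while the connection itself is left alone. The following is only a minimal sketch under stated assumptions. The names `client_state`, `request_start`, `request_finish`, the use of a numeric request id, the table size, and the choice of `-EBUSY` as the return code are all hypothetical and are not taken from the Lustre source or from the patch.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_IN_FLIGHT 8 /* arbitrary table size for this sketch */

/* Hypothetical per-client bookkeeping; not a real Lustre structure. */
struct client_state {
    uint64_t in_flight[MAX_IN_FLIGHT]; /* ids of requests being processed */
    size_t count;                      /* number of valid entries */
};

/* Return true if a request with this id is already being processed. */
static bool request_in_progress(const struct client_state *c, uint64_t id)
{
    for (size_t i = 0; i < c->count; i++)
        if (c->in_flight[i] == id)
            return true;
    return false;
}

/*
 * Begin processing a request.
 * Returns 0 on success, -EBUSY if the same request is already in
 * progress (a duplicate), or -ENOSPC if the sketch's table is full.
 */
static int request_start(struct client_state *c, uint64_t id)
{
    if (request_in_progress(c, id))
        return -EBUSY;
    if (c->count == MAX_IN_FLIGHT)
        return -ENOSPC;
    c->in_flight[c->count++] = id;
    return 0;
}

/* Mark a request as finished so that a later resend can be accepted. */
static void request_finish(struct client_state *c, uint64_t id)
{
    for (size_t i = 0; i < c->count; i++) {
        if (c->in_flight[i] == id) {
            /* Fill the hole with the last entry; order is irrelevant. */
            c->in_flight[i] = c->in_flight[--c->count];
            return;
        }
    }
}
```

In this sketch only the duplicate is refused; other requests from the same client, including a reconnect, are not affected by the check.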

          Activity

            morrone Christopher Morrone (Inactive) added a comment - The patch http://review.whamcloud.com/#/c/9211/ was determined not to be needed for b2_4.

            manish Manish Patel (Inactive) added a comment -

            Hi,

            We are seeing similar issues on the Lustre 2.1.6 release. Is this patch compatible with the 2.1.x releases and, if so, can it be backported to branch b2_1?

            Thank you,
            Manish
            pjones Peter Jones added a comment - Backports to b2_4 http://review.whamcloud.com/#/c/9209/ http://review.whamcloud.com/#/c/9210/ http://review.whamcloud.com/#/c/9211/

            tappro Mikhail Pershin added a comment - Peter, both LU-793 and LU-4349 are needed.
            pjones Peter Jones added a comment -

            Mike

            Could you please clarify what LLNL would need to port in order to use this fix on b2_4?

            Thanks

            Peter


            adilger Andreas Dilger added a comment - I think this patch introduced a timeout in conf-sanity (LU-4349), so that needs to be addressed before this patch is introduced into the 2.4 release.

            tappro Mikhail Pershin added a comment -

            The behavior for bulks is almost the same as before. Before the patch, all pending bulks are aborted when a new reconnect arrives from the client, and the reconnect is refused with -EBUSY until there are no more active requests. With this patch, the reconnect is accepted even if there are active requests, and all bulks from the previous connection are aborted. Basically it is the same behavior as before, but now the connection count is checked instead of a specific flag.

            Yes, the client will resend the aborted bulks. The client is also always able to reconnect, but a resent bulk may get stuck behind the original bulk until the original is aborted.
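The difference described in this comment can be sketched in a few lines of C. This is an illustration only, not the code from the patch: the structures and the functions `reconnect_old`, `reconnect_new`, `bulk_is_stale`, and `abort_stale_bulks` are hypothetical stand-ins. The pre-patch flag mentioned in the comment is not modeled, so the old path here shows only the -EBUSY refusal.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not the real Lustre structures. */
struct export_state {
    unsigned int conn_cnt; /* bumped on every accepted (re)connect */
    int active_requests;   /* requests still being processed */
};

struct bulk_request {
    unsigned int conn_cnt; /* export's connection count when it arrived */
    bool aborted;
};

/* Pre-patch behavior as described: refuse while requests are active. */
static int reconnect_old(struct export_state *exp)
{
    if (exp->active_requests > 0)
        return -EBUSY;
    exp->conn_cnt++;
    return 0;
}

/* Patched behavior as described: the reconnect is always accepted. */
static int reconnect_new(struct export_state *exp)
{
    exp->conn_cnt++;
    return 0;
}

/* A bulk is stale if it was issued under an earlier connection. */
static bool bulk_is_stale(const struct export_state *exp,
                          const struct bulk_request *req)
{
    return req->conn_cnt < exp->conn_cnt;
}

/* Abort every stale bulk that is still live; return how many were aborted. */
static size_t abort_stale_bulks(const struct export_state *exp,
                                struct bulk_request *reqs, size_t n)
{
    size_t aborted = 0;

    for (size_t i = 0; i < n; i++) {
        if (!reqs[i].aborted && bulk_is_stale(exp, &reqs[i])) {
            reqs[i].aborted = true;
            aborted++;
        }
    }
    return aborted;
}
```

Comparing counters means no separate per-export flag has to be set and cleared: a bulk resent after the reconnect carries the new count and is not treated as stale.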

            morrone Christopher Morrone (Inactive) added a comment -

            Unfortunately, no, we won't be able to find out if it helps for some time. We are doing a major upgrade to Lustre 2.4 over the next 2-3 weeks on the SCF machines, but this patch missed the window for inclusion in that distribution. We will have to work it into the pipeline for the next upgrade.

            Can you explain a little more about what the patch will do? I see "Bulk requests are aborted upon reconnection by comparing connection count of request and export." in the patch comment. What happens when the bulk requests are aborted? Will the client transparently resend them?

            Also, what happens if there is more than one RPC outstanding? Is the client able to reconnect in that case?

            tappro Mikhail Pershin added a comment - The patch was updated again, and I hope it addresses all cases, including bulk requests. It no longer changes the protocol. Chris, I expect this patch to land on master soon; can you try it and see whether it helps?

            tappro Mikhail Pershin added a comment - The patch is refreshed; it now handles all requests, including bulk. That requires protocol changes and works only with new clients. Old clients will be handled as before: the server returns -EBUSY on a connect request if another request is being processed.

            People

              tappro Mikhail Pershin
              green Oleg Drokin
              Votes: 0
              Watchers: 24
