Reconnect server->client connection

Details

    • 3,622
    • 8049

    Description

      Local tracking bug for 3622.

          Activity

            [LU-7] Reconnect server->client connection

            bkorb Bruce Korb (Inactive) added a comment -

            Ping? We have a customer bothered by this, too.

            yong.fan nasf (Inactive) added a comment -

            It is still in the work queue. Because of other higher-priority tasks, there is no definite release date. Sorry about that. Thanks for keeping track of this ticket; any updates will be posted here.

            nedbass Ned Bass (Inactive) added a comment -

            We are still hitting this issue fairly frequently on our production 1.8.5 clusters. Is anyone still working on the proposed fix?

            yong.fan nasf (Inactive) added a comment -

            An updated version is available (it processes the ldlm callback resend in the ptlrpc layer):

            http://review.whamcloud.com/#change,125
            rread Robert Read added a comment -

            I have asked Di and Bobi Jam to inspect this patch instead of me.

            yong.fan nasf (Inactive) added a comment -

            According to the normal process, such a patch should first be inspected internally, but I am not sure Robert has enough time to do that soon, so for now it is better to have the customer verify it, if possible.

            dferber Dan Ferber (Inactive) added a comment -

            Nasf, do you recommend that Chris test the patch http://review.whamcloud.com/#change in his environment now?

            yong.fan nasf (Inactive) added a comment -

            I have updated the patch on Whamcloud Gerrit for internal review, according to bug 3622 comment #18.
            yong.fan nasf (Inactive) added a comment - - edited

            I have posted an initial version of the patch, with test cases, to Gerrit. It is a workaround patch, but according to the latest comment on bug 3622 (comment #18), some mechanisms need to be adjusted to match the requirements from the original discussion. There may be more technical discussion about it on bug 3622.

            http://review.whamcloud.com/#change,125

            yong.fan nasf (Inactive) added a comment - - edited

            I have made a patch for the reconnection-denial issue and the eviction issue, but the two fixes are intertwined and not easy to split into two parts. One might think we could speed up reconnection by aborting the active RPC(s) that belong to the client's old connection, but that is not easy, because we do not yet know clearly which RPCs (or which types of RPC) block the reconnection. Oracle's engineers investigated bug 18674 for about half a year and tried to fix some cases (such as bulk RPCs), but that was not enough, since LLNL can still reproduce the problem after applying the related patches. We need more logs to understand the issue (which RPCs are blocked, where, and on what). I have added some debug messages in my patch and hope they are helpful.

            On the other hand, the moment at which the server evicts the client depends on the ldlm lock callback timeout, which the client does not control. So no matter how fast the reconnection is, it cannot guarantee that the client will not be evicted. It is therefore quite necessary to prevent immediate eviction under such a router failure, to give the client more chance to reconnect. In fact, even if we do nothing, as long as enough time passes and the active RPCs are not blocked by server-local issues (a semaphore or another exclusive resource), the existing timeout mechanism will also abort those active RPCs, and then reconnection will succeed (the client will keep trying forever as long as it is not evicted). So doing nothing may be the simplest, though not an efficient, way to resolve the reconnection-denial issue. But we will not make the server wait for the reconnection forever or for a very long time; there should be some balance.

            I am verifying the patch locally. Since it is difficult to reproduce the issues in my virtual environment, I need to design some test cases to simulate various kinds of failures. If possible, I hope these test cases can be part of the patch. Once it passes local testing, I will push it to Gerrit for internal inspection.
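The timing argument in the comment above can be illustrated with a toy model. This is only a sketch and is not taken from the patch or from Lustre code: the names `Scenario`, `simulate`, and `eviction_grace` are hypothetical, and the model assumes a single client, a fixed retry interval, and that the server denies reconnection until the old connection's RPCs have timed out.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    """Hypothetical timing parameters, all in seconds."""
    rpc_timeout: int            # when the server aborts RPCs stuck on the old connection
    lock_callback_timeout: int  # when an unanswered lock callback triggers eviction
    eviction_grace: int = 0     # assumed extra delay before evicting (the "balance")
    retry_interval: int = 5     # how often the client retries the reconnect


def simulate(s: Scenario, horizon: int = 3600) -> str:
    """Walk the client's reconnect attempts and report which event wins."""
    evict_at = s.lock_callback_timeout + s.eviction_grace
    for t in range(0, horizon + 1, s.retry_interval):
        # Eviction is driven by the server's timer, not by the client.
        if t >= evict_at:
            return f"evicted at t={evict_at}"
        # Reconnect is denied while old-connection RPCs are still active.
        if t >= s.rpc_timeout:
            return f"reconnected at t={t}"
    return "undecided"


# Without a grace period the lock callback timeout fires first.
print(simulate(Scenario(rpc_timeout=100, lock_callback_timeout=60)))
# With an assumed 60 s grace period the client reconnects in time.
print(simulate(Scenario(rpc_timeout=100, lock_callback_timeout=60, eviction_grace=60)))
```

Under these assumptions, a faster reconnect alone does not help when the lock callback timeout is shorter than the RPC timeout, which is why the comment treats delayed eviction and reconnection as one problem.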


            dferber Dan Ferber (Inactive) added a comment -

            Yong Fan, I suggest posting a short comment in BZ bug 3622 saying that you are working on this, with a reference to this Jira bug.

            People

              yong.fan nasf (Inactive)
              rread Robert Read
              Votes: 0
              Watchers: 18
