
Reconnect server->client connection

Details

    • 3,622
    • 8049

    Description

      Local tracking bug for 3622.

          Activity

            [LU-7] Reconnect server->client connection
            vitaly_fertman Vitaly Fertman added a comment - BL AST resend: http://review.whamcloud.com/9335
            spitzcor Cory Spitz added a comment -

            Re the last comment, that would be LU-1239, http://review.whamcloud.com/2355.

            Also, LU-793, 'Reconnections should not be refused when there is a request in progress from this client', http://review.whamcloud.com/#change,1616, would improve the situation by allowing clients to reconnect with RPCs in progress.

            nrutman Nathan Rutman added a comment -

            One aspect of this problem is the following case:
            1. The MDS is overloaded with enqueues, which consume all the threads on the MDS_REQUEST portal.
            2. An RPC times out on a client, causing it to reconnect. But this client has some locks to cancel, and the MDS is waiting for them.
            3. The client sends MDS_CONNECT, but there is no free thread to handle it.
            4. Additionally, other clients are waiting for their enqueue completions; they try to ping the MDS, but PING is also sent to the MDS_REQUEST portal. Pings are supposed to be high-priority RPCs, but since this service has no srv_hpreq_handler, we let other low-priority RPCs take the last thread, potentially preventing future high-priority requests from being serviced.
            We've got a patch addressing 3 & 4 in inspection (MRP-455).
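
            The starvation described in points 3 and 4 can be illustrated with a toy model. This is a minimal sketch, not Lustre code and not the MRP-455 patch: the `Service` class, the `reserve_for_hp` flag, and the thread and request counts are all hypothetical. It only shows why letting low-priority requests take the last free thread leaves nothing for a later ping or connect, and how holding one thread back changes the outcome.

```python
from dataclasses import dataclass


@dataclass
class Service:
    """Toy model of a request service with a fixed thread pool (hypothetical)."""
    total_threads: int
    reserve_for_hp: bool  # hypothetical flag: keep the last free thread for high-priority requests
    busy: int = 0

    def try_dispatch(self, high_priority: bool) -> bool:
        """Return True if a thread picks up the request, False if it has to wait."""
        free = self.total_threads - self.busy
        if free == 0:
            return False
        # With the reservation on, a low-priority request may not take the last free thread.
        if self.reserve_for_hp and not high_priority and free == 1:
            return False
        self.busy += 1
        return True


def simulate(reserve_for_hp: bool) -> bool:
    """Flood the service with blocking enqueues, then send one ping.

    Returns True if the ping was given a thread.
    """
    svc = Service(total_threads=4, reserve_for_hp=reserve_for_hp)
    for _ in range(10):  # enqueues that never complete (waiting on lock cancels)
        svc.try_dispatch(high_priority=False)
    return svc.try_dispatch(high_priority=True)


if __name__ == "__main__":
    print("ping served without reservation:", simulate(False))
    print("ping served with reservation:   ", simulate(True))
```

            In this model the enqueues occupy every thread when nothing is reserved, so the ping is never picked up; with one thread held back, the ping is served even though the low-priority queue is saturated.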
            rread Robert Read added a comment -

            This is not a priority for us right now, but we'd be happy to take a look at a patch if you have one.

            bkorb Bruce Korb (Inactive) added a comment -

            Ping? We have a customer nuisanced by this, too.

            yong.fan nasf (Inactive) added a comment -

            It is still in the work queue. Because of other priority tasks, there is no definite release date. Sorry about that. Thanks for keeping track of this ticket; any updates will be posted here.

            nedbass Ned Bass (Inactive) added a comment -

            We are still hitting this issue fairly frequently on our production 1.8.5 clusters. Is anyone still working on the proposed fix?

            yong.fan nasf (Inactive) added a comment -

            An updated version is available (it processes the ldlm callback resend in the ptlrpc layer):

            http://review.whamcloud.com/#change,125
            rread Robert Read added a comment -

            I have asked Di and Bobi Jam to inspect this patch instead of me.

            yong.fan nasf (Inactive) added a comment -

            According to the normal process, such a patch should be inspected internally first, but I am not sure Robert has enough time to do that promptly, so for now it would be better to have it verified by the customer if possible.

            dferber Dan Ferber (Inactive) added a comment -

            Nasf, do you recommend Chris test the patch http://review.whamcloud.com/#change in his environment now?

            People

              yong.fan nasf (Inactive)
              rread Robert Read
              Votes: 0
              Watchers: 18
