[LU-7] Reconnect server->client connection

Details

    • 3,622
    • 8049

    Description

      Local tracking bug for 3622.

          Activity

            adilger Andreas Dilger added a comment -

            Closing this old issue.

            All of the sub-issues have been closed, and patches for the related problems have landed. Servers should resend RPCs to clients in appropriate circumstances before evicting them.

            simmonsja James A Simmons added a comment - What is left for this work?

            vitaly_fertman Vitaly Fertman added a comment - BL AST re-send is moved to LU-5520

            morrone Christopher Morrone (Inactive) added a comment - No, it might work. Haven't had a chance to try it yet.
            spitzcor Cory Spitz added a comment -

            Chris, are you saying that the patches from LU-793 are not sufficient to fix your issue?

            morrone Christopher Morrone (Inactive) added a comment - Well, no, not my issue. Although my issue may have been lumped in with a bunch of other things. We have specifically been waiting years for issue 3 to be solved.

            vitaly_fertman Vitaly Fertman added a comment -

            Chris,

            In fact, your issue is a client eviction that happens because the cancel is not delivered to the server. There may be several different reasons for that:

            1. The cancel is lost and must be resent - fixed by LU-1565.

            2. The BL AST is lost and must be resent - fixed by this patch.

            3. The CANCEL cannot be sent because there is no connection, and the re-CONNECT fails while an RPC is in progress - fixed by LU-793.

            4. The CONNECT cannot be handled by the server because all of the handling threads are stuck with other RPCs in progress - fixed by LU-1239.

            5. The PING cannot be handled by the server because all of the handling threads are stuck with other RPCs, so the client cannot even start a re-CONNECT - fixed by LU-1239.
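Items 1 and 2 in the list above both come down to resending a lost message instead of evicting the client after the first failure. The following is only a minimal sketch of that idea, not Lustre code: the function name, the `send` hook, and the `max_attempts` limit are hypothetical and are not taken from any of the patches mentioned in this ticket.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ResendResult:
    delivered: bool  # True if the client acknowledged the message
    attempts: int    # number of send attempts actually made
    evicted: bool    # True only if every attempt failed


def resend_before_evict(send: Callable[[], bool], max_attempts: int = 3) -> ResendResult:
    """Try to deliver a message up to max_attempts times before evicting.

    `send` is a hypothetical transport hook: it returns True when the
    client acknowledged the message and False when it was lost or timed out.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        if send():
            # Delivered on this attempt, so there is no reason to evict.
            return ResendResult(delivered=True, attempts=attempt, evicted=False)
    # Every attempt failed; only now is the client evicted.
    return ResendResult(delivered=False, attempts=max_attempts, evicted=True)
```

With a transport that loses the first message and delivers the second, the client is kept; it is evicted only when every attempt fails.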

            morrone Christopher Morrone (Inactive) added a comment - Vitaly, I don't understand what that has to do with this ticket. Please expand the explanation, or start a new ticket.
            vitaly_fertman Vitaly Fertman added a comment - BL AST resend: http://review.whamcloud.com/9335
            spitzcor Cory Spitz added a comment -

            Re the last comment, that would be LU-1239, http://review.whamcloud.com/2355.

            LU-793, 'Reconnections should not be refused when there is a request in progress from this client', http://review.whamcloud.com/#change,1616 would also improve the situation by allowing clients to reconnect with RPCs in progress.

            nrutman Nathan Rutman added a comment -

            One aspect of this problem is the following case:
            1. The MDS is overloaded with enqueues, which consume all of the threads on the MDS_REQUEST portal.
            2. An RPC times out on a client, leading to its reconnection. But this client has some locks to cancel, and the MDS is waiting for them.
            3. The client sends MDS_CONNECT, but there is no free thread to handle it.
            4. Additionally, other clients are waiting for their enqueue completions; they try to ping the MDS, but PING is also sent to the MDS_REQUEST portal. Pings are supposed to be high-priority RPCs, but since this service has no srv_hqreq_handler, we let other low-priority RPCs take the last thread, potentially preventing future high-priority requests from being serviced.
            We've got a patch addressing 3 & 4 in inspection (MRP-455).
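The starvation described in steps 3 and 4 can be illustrated with a toy model. This is a sketch under stated assumptions, not Lustre code and not the MRP-455 patch: the `Service` class and its `reserve_for_hp` parameter are hypothetical names. It shows one general way to avoid this kind of starvation, which is to keep some handler threads available only to high-priority requests.

```python
from collections import deque


class Service:
    """Toy request service with a fixed pool of handler threads.

    `reserve_for_hp` is a hypothetical knob: the number of threads that
    only high-priority requests are allowed to occupy.
    """

    def __init__(self, threads: int, reserve_for_hp: int = 0):
        if not 0 <= reserve_for_hp <= threads:
            raise ValueError("reserve_for_hp must be between 0 and threads")
        self.threads = threads
        self.reserve_for_hp = reserve_for_hp
        self.busy = 0
        self.queued = deque()

    def submit(self, name: str, high_priority: bool = False) -> bool:
        """Return True if a thread picks up the request, False if it is queued."""
        free = self.threads - self.busy
        # A normal request must leave the reserved threads untouched;
        # a high-priority request only needs one free thread.
        needed_free = 1 if high_priority else self.reserve_for_hp + 1
        if free >= needed_free:
            self.busy += 1
            return True
        self.queued.append(name)
        return False


# Without a reservation, enqueues take every thread and the ping is stuck.
plain = Service(threads=4)
plain_results = [plain.submit("enqueue") for _ in range(4)]
ping_plain = plain.submit("ping", high_priority=True)

# With one reserved thread, the fourth enqueue waits and the ping is served.
reserved = Service(threads=4, reserve_for_hp=1)
reserved_results = [reserved.submit("enqueue") for _ in range(4)]
ping_reserved = reserved.submit("ping", high_priority=True)
```

In this model, `ping_plain` is queued behind the enqueues, while `ping_reserved` gets a thread because the last one was never handed to a low-priority request.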

            People

              yong.fan nasf (Inactive)
              rread Robert Read