[LU-7] Reconnect server->client connection


Description

Local tracking bug for 3622.

Activity


simmonsja James A Simmons added a comment - What is left for this work?

vitaly_fertman Vitaly Fertman added a comment - BL AST re-send is moved to LU-5520

morrone Christopher Morrone (Inactive) added a comment - No, it might work. Haven't had a chance to try it yet.
spitzcor Cory Spitz added a comment - Chris, are you saying that the patches from LU-793 are not sufficient to fix your issue?

morrone Christopher Morrone (Inactive) added a comment - Well, no, not my issue. Although my issue may have been lumped in with a bunch of other things. We have specifically been waiting years for issue 3 to be solved.

vitaly_fertman Vitaly Fertman added a comment -

Chris,

in fact, your issue is client eviction because a cancel is not delivered to the server. It may have several different reasons:

1. The cancel is lost. It needs to be resent - fixed by LU-1565.

2. The BL AST is lost. The BL AST needs to be resent - fixed by this patch.

3. The CANCEL cannot be sent due to an absent connection, and the re-CONNECT fails with an RPC in progress - fixed by LU-793.

4. The CONNECT cannot be handled by the server because all the handling threads are stuck with other RPCs in progress - fixed by LU-1239.

5. The PING cannot be handled by the server because all the handling threads are stuck with other RPCs, so the client cannot even start a re-CONNECT - fixed by LU-1239.
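To make reasons 1 and 2 above concrete, here is a minimal, self-contained sketch in C (not Lustre code; send_rpc(), RESEND_LIMIT, and the loss model are invented for illustration) of why a bounded resend loop survives a lost CANCEL or BL AST where a single fire-and-forget send ends in eviction:

/*
 * Minimal sketch, not Lustre code: send_rpc(), RESEND_LIMIT and the loss
 * model are invented for illustration.  A one-shot cancel/BL AST that is
 * silently dropped leads to eviction; a bounded resend loop survives
 * transient loss.
 */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define RESEND_LIMIT 5                  /* hypothetical cap before giving up */

/* Pretend transport: drops the message about half of the time. */
static bool send_rpc(const char *msg)
{
        bool delivered = (rand() % 2) == 0;

        printf("send %-6s -> %s\n", msg, delivered ? "delivered" : "lost");
        return delivered;
}

/* Resend until the peer acknowledges or the retry cap is hit. */
static bool send_with_resend(const char *msg)
{
        int attempt;

        for (attempt = 1; attempt <= RESEND_LIMIT; attempt++) {
                if (send_rpc(msg))
                        return true;    /* ack received, no eviction */
                printf("  no ack for attempt %d of %d\n", attempt, RESEND_LIMIT);
        }
        return false;                   /* peer would evict this client */
}

int main(void)
{
        srand(42);
        if (send_with_resend("CANCEL"))
                printf("cancel delivered: no eviction\n");
        else
                printf("cancel never delivered: client evicted\n");
        return 0;
}

The real resend handling in the patches referenced above lives in ptlrpc and tracks per-request state; the sketch only shows the retry-until-acknowledged shape.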

morrone Christopher Morrone (Inactive) added a comment - Vitaly, I don't understand what that has to do with this ticket. Please expand the explanation, or start a new ticket.
vitaly_fertman Vitaly Fertman added a comment - BL AST resend: http://review.whamcloud.com/9335
spitzcor Cory Spitz added a comment -

Re the last comment, that would be LU-1239, http://review.whamcloud.com/2355.

Also, LU-793, 'Reconnections should not be refused when there is a request in progress from this client', http://review.whamcloud.com/#change,1616, would improve the situation by allowing clients to reconnect with RPCs in progress.

nrutman Nathan Rutman added a comment -

One aspect of this problem is the following case:

1. The MDS is overloaded with enqueues, consuming all the threads on the MDS_REQUEST portal.
2. An RPC times out on a client, leading to its reconnection. But this client has some locks to cancel, and the MDS is waiting for them.
3. The client sends MDS_CONNECT, but there is no free thread to handle it.
4. Additionally, other clients are waiting for their enqueue completions; they try to ping the MDS, but PING is also sent to the MDS_REQUEST portal. Pings are supposed to be high-priority RPCs, but since this service has no srv_hqreq_handler we let other low-priority RPCs take the last thread, potentially preventing future high-priority requests from being serviced.

We've got a patch addressing 3 & 4 in inspection (MRP-455).
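As an illustration of the srv_hqreq_handler point in item 4 above, here is a small standalone C sketch (not the actual ptlrpc service code; the names, thread count, and reservation policy are hypothetical) of holding back the last service thread for high-priority requests such as PING and CONNECT:

/*
 * Standalone sketch, not the actual ptlrpc service code: the names, thread
 * count and reservation policy are hypothetical.  If low-priority RPCs may
 * take the last service thread, reserving a slot for high-priority requests
 * (PING/CONNECT) keeps the server reachable.
 */
#include <stdbool.h>
#include <stdio.h>

#define SERVICE_THREADS 4               /* hypothetical MDS_REQUEST thread count */
#define HP_RESERVED     1               /* slots held back for high-priority RPCs */

static int busy_threads;

/* Decide whether an incoming request may take a thread right now. */
static bool can_dispatch(bool high_prio)
{
        int free_threads = SERVICE_THREADS - busy_threads;

        if (free_threads <= 0)
                return false;           /* nothing available at all */
        if (!high_prio && free_threads <= HP_RESERVED)
                return false;           /* keep the last slot for PING/CONNECT */
        return true;
}

int main(void)
{
        int i;

        /* Three slow enqueues fill the pool down to the reserved slot... */
        for (i = 0; i < 3; i++)
                if (can_dispatch(false))
                        busy_threads++;

        /* ...a fourth enqueue is deferred, but a PING still gets a thread. */
        printf("enqueue dispatched: %s\n", can_dispatch(false) ? "yes" : "no");
        printf("ping dispatched:    %s\n", can_dispatch(true) ? "yes" : "no");
        return 0;
}

With a reservation like this, a flood of slow enqueues cannot starve the one slot that keeps the server reachable.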
rread Robert Read added a comment -

This is not a priority for us right now, but we'd be happy to take a look at the patch if you have one.

People

yong.fan nasf (Inactive)
rread Robert Read

Votes: 0
Watchers: 18
