Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-793

Reconnections should not be refused when there is a request in progress from this client.

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.6.0
    • Lustre 2.1.0, Lustre 2.2.0, Lustre 2.4.0, Lustre 1.8.6
    • 3
    • 5914

    Description

      While originally this was a useful workaround, it created a lot of other unintended problems.

      This code must be disabled and instead we just should disable handling several duplicate requests at the same time.

      Attachments

        Issue Links

          Activity

            [LU-793] Reconnections should not be refused when there is a request in progress from this client.

            Unfortunately, no, we won't be able to find out it it helps for some time. We are doing a major upgrade to Lustre 2.4 over the next 2-3 weeks on the SCF machines, but this patch missed the window for inclusion in that distribution. We will have to work it into the pipeline for the next upgrade.

            Can you explain a little more about what the patch will do? I see "Bulk requests are aborted upon reconnection by comparing connection count of request and export." in the patch comment. What happens when the bulk requests are aborted? Will the client transparently resend them?

            Also, what happens if there is more than one rpc outstanding? Is the client able to reconnect in that case?

            morrone Christopher Morrone (Inactive) added a comment - Unfortunately, no, we won't be able to find out it it helps for some time. We are doing a major upgrade to Lustre 2.4 over the next 2-3 weeks on the SCF machines, but this patch missed the window for inclusion in that distribution. We will have to work it into the pipeline for the next upgrade. Can you explain a little more about what the patch will do? I see "Bulk requests are aborted upon reconnection by comparing connection count of request and export." in the patch comment. What happens when the bulk requests are aborted? Will the client transparently resend them? Also, what happens if there is more than one rpc outstanding? Is the client able to reconnect in that case?

            Patch was updated again and I hope it addresses all cases including bulk requests. It doesn't change protocol now. Cris, I expect this patch will be landed soon to the master, can you try it and see how it helps?

            tappro Mikhail Pershin added a comment - Patch was updated again and I hope it addresses all cases including bulk requests. It doesn't change protocol now. Cris, I expect this patch will be landed soon to the master, can you try it and see how it helps?

            Patch is refreshed, now it handles all request including bulk. That requires protocol changes and works only with new clients, old clients will be handled as before - returning -EBUSY on connect request if there is another request in processing

            tappro Mikhail Pershin added a comment - Patch is refreshed, now it handles all request including bulk. That requires protocol changes and works only with new clients, old clients will be handled as before - returning -EBUSY on connect request if there is another request in processing
            pjones Peter Jones added a comment -

            This is still a support priority, but we need to finalize the fix before we consider it for inclusion in a release

            pjones Peter Jones added a comment - This is still a support priority, but we need to finalize the fix before we consider it for inclusion in a release

            patch was refreshed http://review.whamcloud.com/#/c/4960/. I doesn't handle bulk request for now, this will be solved in the following patch.

            tappro Mikhail Pershin added a comment - patch was refreshed http://review.whamcloud.com/#/c/4960/ . I doesn't handle bulk request for now, this will be solved in the following patch.
            tappro Mikhail Pershin added a comment - - edited

            This patch doesn't work properly with bulk resends because bulks use new XID always, even for RESENT case. Therefore we cannot match original request that might be processed on server at the same time. It is not clear how to solve this in simple way in context of this patch, looks like patch should be reworked and will be more complex, e.g. we might need to change protocol and store original XID in bulk along with new one.
            Btw, the comment about bulk XID and replated code:

            int ptlrpc_register_bulk(struct ptlrpc_request *req)
            {
            ...
            
            	/* An XID is only used for a single request from the client.
            	 * For retried bulk transfers, a new XID will be allocated in
            	 * in ptlrpc_check_set() if it needs to be resent, so it is not
            	 * using the same RDMA match bits after an error.
            	 */
            ...
            }
            
            and
            
            void ptlrpc_resend_req(struct ptlrpc_request *req)
            {
                    DEBUG_REQ(D_HA, req, "going to resend");
                    lustre_msg_set_handle(req->rq_reqmsg, &(struct lustre_handle){ 0 });
                    req->rq_status = -EAGAIN;
            
            	spin_lock(&req->rq_lock);
                    req->rq_resend = 1;
                    req->rq_net_err = 0;
                    req->rq_timedout = 0;
                    if (req->rq_bulk) {
                            __u64 old_xid = req->rq_xid;
            
                            /* ensure previous bulk fails */
                            req->rq_xid = ptlrpc_next_xid();
                            CDEBUG(D_HA, "resend bulk old x"LPU64" new x"LPU64"\n",
                                   old_xid, req->rq_xid);
                    }
                    ptlrpc_client_wake_req(req);
            	spin_unlock(&req->rq_lock);
            }
            
            tappro Mikhail Pershin added a comment - - edited This patch doesn't work properly with bulk resends because bulks use new XID always, even for RESENT case. Therefore we cannot match original request that might be processed on server at the same time. It is not clear how to solve this in simple way in context of this patch, looks like patch should be reworked and will be more complex, e.g. we might need to change protocol and store original XID in bulk along with new one. Btw, the comment about bulk XID and replated code: int ptlrpc_register_bulk(struct ptlrpc_request *req) { ... /* An XID is only used for a single request from the client. * For retried bulk transfers, a new XID will be allocated in * in ptlrpc_check_set() if it needs to be resent, so it is not * using the same RDMA match bits after an error. */ ... } and void ptlrpc_resend_req(struct ptlrpc_request *req) { DEBUG_REQ(D_HA, req, "going to resend" ); lustre_msg_set_handle(req->rq_reqmsg, &(struct lustre_handle){ 0 }); req->rq_status = -EAGAIN; spin_lock(&req->rq_lock); req->rq_resend = 1; req->rq_net_err = 0; req->rq_timedout = 0; if (req->rq_bulk) { __u64 old_xid = req->rq_xid; /* ensure previous bulk fails */ req->rq_xid = ptlrpc_next_xid(); CDEBUG(D_HA, "resend bulk old x" LPU64 " new x" LPU64 "\n" , old_xid, req->rq_xid); } ptlrpc_client_wake_req(req); spin_unlock(&req->rq_lock); }

            retriggered

            tappro Mikhail Pershin added a comment - retriggered
            spitzcor Cory Spitz added a comment -

            Change #4960 failed autotest. Do you just need to resubmit?

            spitzcor Cory Spitz added a comment - Change #4960 failed autotest. Do you just need to resubmit?

            Since Andreas recommended to merge Mikhail patch and mine I updated http://review.whamcloud.com/#change,4960 with improvement request.

            igolovach Iurii Golovach (Inactive) added a comment - Since Andreas recommended to merge Mikhail patch and mine I updated http://review.whamcloud.com/#change,4960 with improvement request.

            Hi gents,

            here is a patch version for this issue from our team:
            http://review.whamcloud.com/#change,5054

            igolovach Iurii Golovach (Inactive) added a comment - Hi gents, here is a patch version for this issue from our team: http://review.whamcloud.com/#change,5054

            Cliff, the refused-reconnection problem is the after-effect of whatever problem happened, and there are many ways it can be triggered. You should open a separate bug report for the initial problem of the server going unresponsive, or whatever happened.

            morrone Christopher Morrone (Inactive) added a comment - - edited Cliff, the refused-reconnection problem is the after-effect of whatever problem happened, and there are many ways it can be triggered. You should open a separate bug report for the initial problem of the server going unresponsive, or whatever happened.

            People

              tappro Mikhail Pershin
              green Oleg Drokin
              Votes:
              0 Vote for this issue
              Watchers:
              24 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: