[LU-793] Reconnections should not be refused when there is a request in progress from this client.

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version: Lustre 2.6.0
    • Affects Versions: Lustre 2.1.0, Lustre 2.2.0, Lustre 2.4.0, Lustre 1.8.6
    Description

      While this was originally a useful workaround, it created a lot of other unintended problems.

      This code should be disabled; instead, we should simply avoid handling several duplicate requests at the same time.
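
      As a rough sketch of that direction (illustrative only; the types and names below are invented, not taken from the actual patch), the server would notice that an incoming resend carries the XID of a request still being processed and drop the duplicate instead of refusing the reconnection:

          #include <stdbool.h>
          #include <stddef.h>

          /* Invented, simplified stand-in for a queued request. */
          struct req {
                  unsigned long long xid;   /* client-assigned transfer id */
                  struct req *next;
          };

          /* Return true if a request with this XID is already being
           * processed; the caller would then ignore the resent duplicate
           * rather than refuse the client's reconnection with -EBUSY. */
          static bool is_duplicate_in_flight(const struct req *in_flight,
                                             unsigned long long xid)
          {
                  for (; in_flight != NULL; in_flight = in_flight->next)
                          if (in_flight->xid == xid)
                                  return true;
                  return false;
          }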

    Activity

            tappro Mikhail Pershin added a comment -

            Peter, both LU-793 and LU-4349 are needed.
            pjones Peter Jones added a comment -

            Mike

            Could you please clarify what LLNL would need to port in order to use this fix on b2_4?

            Thanks

            Peter

            adilger Andreas Dilger added a comment -

            I think this patch introduced a timeout in conf-sanity (LU-4349), so that needs to be addressed before this patch is introduced into the 2.4 release.

            tappro Mikhail Pershin added a comment -

            The behavior is almost the same as before for bulks. Currently all pending bulks are aborted if a new reconnect arrives from the client, and the reconnect is refused with -EBUSY until there are no more active requests; this is how it was handled before the patch. With this patch we accept the reconnect even if there are active requests, and all bulks from the last connection are aborted. Basically it is the same behavior as before, but now the connection count is checked instead of a specific flag.
            The client will resend aborted bulks, yes. Also, the client is now always able to reconnect, but a resent bulk may get stuck behind the original bulk until the original is aborted.
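
            For illustration only (a minimal sketch with invented types, not the patch itself), the connection-count check amounts to comparing the count a request was sent under against the export's current count, and aborting bulks that predate the reconnect:

             #include <stdio.h>

             /* Invented, simplified stand-ins for the export and request. */
             struct export {
                     unsigned int exp_conn_cnt;   /* bumped on every (re)connect */
             };

             struct bulk_req {
                     unsigned int rq_conn_cnt;    /* count the request was sent under */
             };

             /* A bulk is stale if it was sent under an older connection than
              * the export currently has; such bulks are aborted on reconnect
              * and later resent by the client. */
             static int bulk_is_stale(const struct export *exp,
                                      const struct bulk_req *req)
             {
                     return req->rq_conn_cnt < exp->exp_conn_cnt;
             }

             int main(void)
             {
                     struct export exp = { .exp_conn_cnt = 2 };  /* one reconnect done */
                     struct bulk_req old = { .rq_conn_cnt = 1 }; /* sent pre-reconnect */
                     struct bulk_req cur = { .rq_conn_cnt = 2 }; /* sent post-reconnect */

                     printf("old stale: %d\n", bulk_is_stale(&exp, &old)); /* prints 1 */
                     printf("cur stale: %d\n", bulk_is_stale(&exp, &cur)); /* prints 0 */
                     return 0;
             }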

            morrone Christopher Morrone (Inactive) added a comment -

            Unfortunately, no, we won't be able to find out if it helps for some time. We are doing a major upgrade to Lustre 2.4 over the next 2-3 weeks on the SCF machines, but this patch missed the window for inclusion in that distribution. We will have to work it into the pipeline for the next upgrade.

            Can you explain a little more about what the patch will do? I see "Bulk requests are aborted upon reconnection by comparing connection count of request and export." in the patch comment. What happens when the bulk requests are aborted? Will the client transparently resend them?

            Also, what happens if there is more than one RPC outstanding? Is the client able to reconnect in that case?

            tappro Mikhail Pershin added a comment -

            The patch was updated again and I hope it addresses all cases, including bulk requests. It doesn't change the protocol now. Chris, I expect this patch will be landed on master soon; can you try it and see how it helps?

            tappro Mikhail Pershin added a comment -

            The patch is refreshed; now it handles all requests, including bulk. That requires protocol changes and works only with new clients. Old clients will be handled as before, returning -EBUSY on a connect request if there is another request in processing.
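
            As a generic sketch of that version-gated handling (the flag name below is invented for illustration; the real patch would use Lustre connect flags negotiated at connect time):

             #include <errno.h>

             /* Invented feature bit: set when the client understands the new
              * reconnect handling. */
             #define CONNECT_FLAG_RECONNECT_OK 0x1ULL

             /* Accept reconnects from new clients even with requests in
              * flight; keep the old -EBUSY behavior for old clients. */
             static int handle_reconnect(unsigned long long client_flags,
                                         int active_requests)
             {
                     if (client_flags & CONNECT_FLAG_RECONNECT_OK)
                             return 0;                     /* accept; stale bulks aborted */
                     return active_requests ? -EBUSY : 0;  /* old client behavior */
             }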
            pjones Peter Jones added a comment -

            This is still a support priority, but we need to finalize the fix before we consider it for inclusion in a release.

            tappro Mikhail Pershin added a comment -

            The patch was refreshed: http://review.whamcloud.com/#/c/4960/. It doesn't handle bulk requests for now; this will be solved in a following patch.
            tappro Mikhail Pershin added a comment - - edited

            This patch doesn't work properly with bulk resends because bulks always use a new XID, even in the RESENT case. Therefore we cannot match the original request that might be being processed on the server at the same time. It is not clear how to solve this in a simple way in the context of this patch; it looks like the patch should be reworked and will be more complex, e.g. we might need to change the protocol and store the original XID in the bulk along with the new one.
            Btw, the comment about the bulk XID and the related code:

             int ptlrpc_register_bulk(struct ptlrpc_request *req)
             {
             ...

                     /* An XID is only used for a single request from the client.
                      * For retried bulk transfers, a new XID will be allocated
                      * in ptlrpc_check_set() if it needs to be resent, so it is
                      * not using the same RDMA match bits after an error.
                      */
             ...
             }
            
            and
            
            void ptlrpc_resend_req(struct ptlrpc_request *req)
            {
                    DEBUG_REQ(D_HA, req, "going to resend");
                    lustre_msg_set_handle(req->rq_reqmsg, &(struct lustre_handle){ 0 });
                    req->rq_status = -EAGAIN;
            
                     spin_lock(&req->rq_lock);
                    req->rq_resend = 1;
                    req->rq_net_err = 0;
                    req->rq_timedout = 0;
                    if (req->rq_bulk) {
                            __u64 old_xid = req->rq_xid;
            
                            /* ensure previous bulk fails */
                            req->rq_xid = ptlrpc_next_xid();
                            CDEBUG(D_HA, "resend bulk old x"LPU64" new x"LPU64"\n",
                                   old_xid, req->rq_xid);
                    }
                    ptlrpc_client_wake_req(req);
                     spin_unlock(&req->rq_lock);
            }
            
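            As a sketch of the protocol-change idea mentioned above (field names invented for illustration; this is not the actual patch), the bulk descriptor would keep the XID of the first attempt alongside the current one, so the server could still match a resend to the original request:

             /* Invented, simplified bulk descriptor. */
             struct bulk_desc {
                     unsigned long long bd_xid;      /* current RDMA match bits */
                     unsigned long long bd_orig_xid; /* XID of the first attempt */
             };

             /* On resend, remember the original XID once, then switch to a new
              * one so the failed transfer cannot match the new bulk buffer. */
             static void bulk_set_resend_xid(struct bulk_desc *bd,
                                             unsigned long long new_xid)
             {
                     if (bd->bd_orig_xid == 0)
                             bd->bd_orig_xid = bd->bd_xid;
                     bd->bd_xid = new_xid;
             }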

            tappro Mikhail Pershin added a comment -

            Retriggered.

            People

              Assignee: tappro Mikhail Pershin
              Reporter: green Oleg Drokin
              Votes: 0
              Watchers: 24
