
Reconnect server->client connection


Description

Local tracking bug for 3622.


Activity

[LU-7] Reconnect server->client connection

morrone Christopher Morrone (Inactive) added a comment -

Well, no, not my issue. Although my issue may have been lumped in with a bunch of other things. We have specifically been waiting years for issue 3 to be solved.

vitaly_fertman Vitaly Fertman added a comment -

Chris,

in fact, your issue is client eviction because a cancel is not delivered to the server.
It may have several different causes:

1. The CANCEL is lost. The cancel needs to be resent - fixed by LU-1565

2. The BL AST is lost. The BL AST needs to be resent - fixed by this patch.

3. The CANCEL cannot be sent due to an absent connection, and re-CONNECT fails with an RPC in progress - fixed by LU-793

4. The CONNECT cannot be handled by the server because all the handling threads are stuck with other RPCs in progress - fixed by LU-1239

5. A PING cannot be handled by the server because all the handling threads are stuck with other RPCs, so the client cannot even start a re-CONNECT - fixed by LU-1239
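For illustration, a minimal sketch of the resend idea behind items 1 and 2 above: when a one-shot message such as a CANCEL or BL AST can be lost on the wire, the sender keeps retrying until it gets an answer or a bounded number of attempts is exhausted, and only then evicts the client. All names below (bl_ast_msg, send_once, and so on) are hypothetical; this is not the actual Lustre code.

#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical illustration of resending a blocking AST (BL AST) until the
 * client acknowledges it, instead of evicting the client after the first
 * lost message.  None of these names come from the Lustre tree.
 */

struct bl_ast_msg {
	unsigned long	lock_cookie;	/* which lock the client must cancel */
	int		attempts;	/* how many times we have sent it    */
};

/* Pretend network send; returns true if an ACK (the CANCEL) came back. */
static bool send_once(const struct bl_ast_msg *msg)
{
	printf("sending BL AST for lock %#lx (attempt %d)\n",
	       msg->lock_cookie, msg->attempts);
	/* A real server would post the RPC and wait for the reply here;
	 * this stub simulates two lost messages before one gets through. */
	return msg->attempts >= 3;
}

/*
 * Resend loop: rather than treating the first timeout as grounds for
 * eviction, retry a bounded number of times and only evict the client
 * when all attempts have been exhausted.
 */
static int bl_ast_send_with_resend(struct bl_ast_msg *msg, int max_attempts)
{
	while (msg->attempts < max_attempts) {
		msg->attempts++;
		if (send_once(msg))
			return 0;	/* client cancelled the lock */
		sleep(1);		/* back off before resending */
	}
	return -1;			/* give up: evict the client */
}

int main(void)
{
	struct bl_ast_msg msg = { .lock_cookie = 0xdeadbeef, .attempts = 0 };

	if (bl_ast_send_with_resend(&msg, 5) == 0)
		printf("lock %#lx cancelled, no eviction needed\n", msg.lock_cookie);
	else
		printf("no answer, evicting client\n");
	return 0;
}

The point of the sketch is only that a single lost message no longer translates directly into an eviction.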

morrone Christopher Morrone (Inactive) added a comment -

Vitaly, I don't understand what that has to do with this ticket. Please expand the explanation, or start a new ticket.
vitaly_fertman Vitaly Fertman added a comment -

BL AST resend: http://review.whamcloud.com/9335
spitzcor Cory Spitz added a comment -

Re the last comment, that would be LU-1239, http://review.whamcloud.com/2355.

Also, LU-793, 'Reconnections should not be refused when there is a request in progress from this client', http://review.whamcloud.com/#change,1616 would improve the situation by allowing clients to reconnect while RPCs are in progress.

nrutman Nathan Rutman added a comment -

One aspect of this problem is the following case:
1. The MDS is overloaded with enqueues, consuming all the threads on the MDS_REQUEST portal.
2. An RPC times out on a client, leading to its reconnection. But this client has some locks to cancel, and the MDS is waiting for them.
3. The client sends MDS_CONNECT, but there is no free thread to handle it.
4. Additionally, other clients are waiting for their enqueue completions; they try to ping the MDS, but PING is also sent to the MDS_REQUEST portal. Pings are supposed to be high-priority RPCs, but since this service has no srv_hqreq_handler we let other low-priority RPCs take the last thread, thus potentially preventing future high-priority requests from being serviced.
We have a patch addressing 3 & 4 in inspection (MRP-455).
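A rough sketch of the thread-reservation idea from point 4: a service that can classify high-priority requests should not let a normal-priority request occupy its last free thread, so that a later CONNECT or PING can still be served. The types and names below are illustrative only and do not reflect the ptlrpc service structures.

#include <stdbool.h>
#include <stdio.h>

/*
 * Illustrative model of reserving the last service thread for
 * high-priority requests (CONNECT, PING, ...).  Names are made up.
 */

enum req_prio { PRIO_NORMAL, PRIO_HIGH };

struct service {
	int	threads_total;
	int	threads_busy;
	bool	has_hp_handler;	/* does this service classify HP requests? */
};

/*
 * Decide whether a request may take a thread right now.  A normal-priority
 * request may not take the very last idle thread if the service knows how
 * to recognize high-priority requests; that thread is kept in reserve so a
 * reconnect or ping arriving later is not starved.
 */
static bool may_take_thread(const struct service *svc, enum req_prio prio)
{
	int idle = svc->threads_total - svc->threads_busy;

	if (idle <= 0)
		return false;
	if (prio == PRIO_HIGH)
		return true;
	if (svc->has_hp_handler && idle == 1)
		return false;	/* keep the last thread for HP requests */
	return true;
}

int main(void)
{
	struct service mds = { .threads_total = 4, .threads_busy = 3,
			       .has_hp_handler = true };

	printf("enqueue (normal) gets thread: %s\n",
	       may_take_thread(&mds, PRIO_NORMAL) ? "yes" : "no");
	printf("connect (high)   gets thread: %s\n",
	       may_take_thread(&mds, PRIO_HIGH) ? "yes" : "no");
	return 0;
}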
rread Robert Read added a comment -

This is not a priority for us right now, but we'd be happy to take a look at a patch if you have one.

bkorb Bruce Korb (Inactive) added a comment -

Ping? We have a customer bothered by this, too.

yong.fan nasf (Inactive) added a comment -

It is still in the working queue. Because of other priority tasks, there is no definite release time. Sorry for that. Thanks for keeping track of this ticket; any updates will be posted here.

nedbass Ned Bass (Inactive) added a comment -

We are still hitting this issue fairly frequently on our production 1.8.5 clusters. Is anyone still working on the proposed fix?

yong.fan nasf (Inactive) added a comment -

An updated version is available (process LDLM callback resend in the ptlrpc layer):

http://review.whamcloud.com/#change,125
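As a hedged illustration of what handling an LDLM callback resend in the ptlrpc layer could look like in principle (the actual patch is in the review linked above and is considerably more involved): the receiver remembers which request identifiers it has already processed and, when a duplicate arrives, avoids re-executing it. The names below are made up for the example.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Toy model of resend detection: remember the XIDs of recently handled
 * requests and, when a duplicate arrives (a resend), skip re-executing it.
 * Hypothetical names; the real ptlrpc layer is far more involved.
 */

#define SEEN_MAX 16

static uint64_t seen_xids[SEEN_MAX];
static int      seen_count;

static bool already_handled(uint64_t xid)
{
	for (int i = 0; i < seen_count; i++)
		if (seen_xids[i] == xid)
			return true;
	return false;
}

static void handle_callback(uint64_t xid)
{
	if (already_handled(xid)) {
		printf("xid %llu: resend detected, replaying old reply\n",
		       (unsigned long long)xid);
		return;
	}
	if (seen_count < SEEN_MAX)
		seen_xids[seen_count++] = xid;
	printf("xid %llu: handling callback for the first time\n",
	       (unsigned long long)xid);
}

int main(void)
{
	handle_callback(101);	/* original request */
	handle_callback(101);	/* network resend of the same request */
	return 0;
}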

People

yong.fan nasf (Inactive)
rread Robert Read
Votes: 0
Watchers: 18
