[LU-7] Reconnect server->client connection - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Major
Fix Version/s: Lustre 2.7.0, Lustre 2.5.5
Affects Version/s: None
Labels:
- llnl

Bugzilla ID:
3622
Rank (Obsolete):
8049

Description

Local tracking bug for 3622.

Attachments

Issue Links

is duplicated by

LU-793 Reconnections should not be refused when there is a request in progress from this client.

Resolved

is related to

LU-793 Reconnections should not be refused when there is a request in progress from this client.

Resolved

LU-5520 BL AST resend

Resolved

LU-1239 cascading client evictions

Resolved

LU-1565 lost LDLM_CANCEL RPCs

Resolved

Activity

Bruce Korb (Inactive) added a comment - 15/Feb/12 6:41 PM

Ping? We have a customer nuisanced by this, too.

nasf (Inactive) added a comment - 10/Oct/11 4:12 AM

It is still in the work queue. Because of other higher-priority tasks, there is no definite release date. Sorry about that. Thanks for keeping track of this ticket; any updates will be posted here.

Ned Bass (Inactive) added a comment - 12/Aug/11 2:27 PM

We are still hitting this issue fairly frequently on our production 1.8.5 clusters. Is anyone still working on the proposed fix?

nasf (Inactive) added a comment - 01/Dec/10 9:30 AM

An updated version is available (it processes the ldlm callback resend in the ptlrpc layer):

http://review.whamcloud.com/#change,125

Robert Read added a comment - 22/Nov/10 1:17 PM

I have asked Di and Bobi Jam to inspect this patch instead of me.

nasf (Inactive) added a comment - 17/Nov/10 6:10 AM

Under the normal process, a patch like this should be inspected internally first, but I am not sure Robert has enough time to do that promptly, so for now it would be better to have the customer verify it if possible.

Dan Ferber (Inactive) added a comment - 17/Nov/10 4:44 AM

Nasf, do you recommend Chris test the patch http://review.whamcloud.com/#change in his environment now?

nasf (Inactive) added a comment - 16/Nov/10 9:41 PM

I have updated the patch on the Whamcloud Gerrit for internal review, following bug 3622 comment #18.

nasf (Inactive) added a comment - 13/Nov/10 8:32 PM - edited

I have posted an initial version of the patch, with test cases, to Gerrit. It is a workaround patch, but according to the latest comment on bug 3622 (comment #18), some mechanisms need to be adjusted to match the requirements that came out of the original discussion. There may be more technical discussion about it on bug 3622.

http://review.whamcloud.com/#change,125

nasf (Inactive) added a comment - 10/Nov/10 7:14 AM - edited

I have made a patch for both the reconnection-refusal issue and the eviction issue, but the two fixes are intertwined and not easy to split into two parts. You might think we could speed up reconnection by aborting the active RPC(s) that belong to the client's old connection, but that is not easy, because we are still not entirely clear which RPCs (or which types of RPC) block the reconnection. Oracle's engineers spent about half a year investigating bug 18674 and tried to fix some cases (such as bulk RPCs), but that was not enough, since LLNL can still reproduce the problem after applying the related patches. We need more logs to pin this down (which RPCs, blocked where, waiting for what). I have added some debug messages in my patch and hope they will help.
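
As an illustration only, the refusal described above can be modelled in a few lines. This is a toy sketch, not Lustre code and not the patch under review: the `Export` and `ActiveRpc` classes, their fields, and the opcode and `blocked_on` labels are all hypothetical names chosen for the example. It shows a reconnect being refused while a request from the old connection is still in progress, together with the kind of per-request debug line (which RPC, blocked where) the comment asks for.

```python
from dataclasses import dataclass, field


@dataclass
class ActiveRpc:
    """One request still being processed for a client (hypothetical model)."""
    xid: int          # request identifier
    opcode: str       # illustrative label only, e.g. "OST_WRITE"
    blocked_on: str   # where the request is waiting, for debugging


@dataclass
class Export:
    """Toy stand-in for the server's per-client connection state."""
    client: str
    active: list = field(default_factory=list)

    def try_reconnect(self):
        """Return (accepted, debug_lines) for one reconnect attempt.

        The reconnect is refused while any request from the old
        connection is still in progress; each blocker is reported.
        """
        if not self.active:
            return True, []
        lines = [
            f"{self.client}: reconnect refused, xid={rpc.xid} "
            f"op={rpc.opcode} blocked_on={rpc.blocked_on}"
            for rpc in self.active
        ]
        return False, lines


exp = Export("client-a", [ActiveRpc(101, "OST_WRITE", "bulk transfer")])
print(exp.try_reconnect())   # refused, with one debug line

exp.active.clear()           # the old request finished or was aborted
print(exp.try_reconnect())   # accepted
```

In this model the only way to unblock the client is for the old requests to drain, which is why knowing exactly which requests are stuck matters.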

On the other hand, when the server evicts the client depends on the ldlm lock callback timeout, which the client does not control. However fast the reconnection is, that alone cannot guarantee the client will not be evicted. So it is quite necessary to prevent immediate eviction under this kind of router failure and give the client more of a chance to reconnect. In fact, even if we do nothing, as long as enough time passes and the active RPCs are not blocked by server-local issues (a semaphore or another exclusive resource), the existing timeout mechanism will eventually abort those active RPCs, and then reconnection will succeed (the client keeps trying forever as long as it is not evicted). That may be the simplest, though not the most efficient, way to resolve the reconnection-refusal issue. But we cannot make the server wait for the reconnection forever, or for a very long time; there has to be some balance.
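
The race between the two timers can be sketched as a small calculation. Again this is only a toy model under stated assumptions: the function `outcome`, its parameters, the `grace` knob, and all the numbers are hypothetical and are not taken from Lustre or from the patch. It assumes the client retries at a fixed interval starting at time 0, that a retry succeeds only once the old requests have been aborted, and that a tie goes to eviction.

```python
def outcome(callback_timeout, rpc_abort_at, grace=0, retry_every=5):
    """Return ("reconnected", t) or ("evicted", t), times in seconds.

    callback_timeout: when the server's lock callback timer expires
    rpc_abort_at:     when existing timeouts abort the old requests
    grace:            extra wait before eviction (hypothetical knob)
    retry_every:      client reconnect retry interval
    """
    evict_at = callback_timeout + grace
    # First retry at or after the moment the old requests are gone
    # (integer ceiling division).
    first_ok_retry = -(-rpc_abort_at // retry_every) * retry_every
    if first_ok_retry < evict_at:
        return ("reconnected", first_ok_retry)
    return ("evicted", evict_at)


# No grace: the callback timer fires before the old requests drain.
print(outcome(callback_timeout=20, rpc_abort_at=30))
# A bounded grace period lets the reconnect win the race.
print(outcome(callback_timeout=20, rpc_abort_at=30, grace=15))
# The grace is finite, so requests stuck for much longer still end in eviction.
print(outcome(callback_timeout=20, rpc_abort_at=300, grace=15))
```

The third case is the balance mentioned above: the server waits longer than it does today, but not indefinitely.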

I am verifying the patch locally. Since it is difficult to reproduce the issues in my virtual environment, I need to design some test cases to simulate various kinds of failures. If possible, I hope these test cases can be part of the patch. Once it passes local testing, I will push it to Gerrit for internal inspection.

Dan Ferber (Inactive) added a comment - 09/Nov/10 4:12 PM

Yong Fan, I suggest posting a short comment in BZ bug 3622 that you are working on this, with a reference to this Jira bug.

People

Assignee: nasf (Inactive)

Reporter: Robert Read

Votes: 0

Watchers: 18

Dates

Created: 22/Oct/10 4:07 PM

Updated: 20/Apr/16 5:57 AM

Resolved: 20/Apr/16 5:57 AM