[LU-7] Reconnect server->client connection Created: 22/Oct/10 Updated: 20/Apr/16 Resolved: 20/Apr/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.7.0, Lustre 2.5.5 |
| Type: | Bug | Priority: | Major |
| Reporter: | Robert Read (Inactive) | Assignee: | nasf (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | llnl |
| Issue Links: |
|
| Bugzilla ID: | 3622 |
| Rank (Obsolete): | 8049 |
| Description |
|
Local tracking bug for 3622. |
| Comments |
| Comment by Robert Read (Inactive) [ 22/Oct/10 ] |
|
Bug 20997 added a feature that might make implementing this fairly straightforward. There is now an rq_delay_limit field on the request that limits how long ptlrpc will attempt to send (or resend) it. This is a time limit (rather than a retry count limit as originally suggested), but there are some advantages to using time instead of a retry count. In particular, the request will still age even while the import is not connected and no send attempts are being made. |
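Below is a minimal sketch, in plain C with invented field and helper names (not the actual ptlrpc structures), of the idea described above: giving a request a wall-clock budget rather than a retry count, so that it keeps aging even while the import is disconnected and no sends are attempted.

```c
#include <stdbool.h>
#include <time.h>

/* Illustrative stand-in for a ptlrpc request; the field names are
 * assumptions made for this sketch, not the real struct layout. */
struct fake_request {
	time_t rq_first_sent;   /* when the request was first queued */
	time_t rq_delay_limit;  /* max seconds to keep trying; 0 = no limit */
};

/* Decide whether the request has exceeded its time budget.  Because the
 * age is measured from the first attempt, time spent waiting for the
 * import to reconnect still counts against the budget. */
static bool request_expired(const struct fake_request *req, time_t now)
{
	if (req->rq_delay_limit == 0)
		return false;   /* unlimited: old "retry forever" behaviour */
	return now - req->rq_first_sent > req->rq_delay_limit;
}
```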
| Comment by Christopher Morrone [ 26/Oct/10 ] |
|
Perhaps bug 3622 was not the best starting point for explaining this bug. Please also see bug 22893, which was the real problem that we had. Let me restate: with an LNet routed configuration and a fairly sizable network, there are times when the client will think that it cannot talk to the server and attempt to reconnect. When this happens, the reconnect can be blocked because the server already has RPCs outstanding to that client. Right now, this is a fatal situation. Eventually the server will time out the outstanding RPCs and evict the client. Evictions mean applications get errors. The server should really recognize that when the client reconnects, there is very little hope that any of the RPCs outstanding to that client will ever complete. It should cancel the RPCs, allow the client to reconnect, and then replay the RPCs. Since that would be fairly difficult to implement, Andreas suggested that resending the RPCs a number of times would be a cheap solution. We would still need to wait for the first round of RPCs to time out, but after they do, the client could reconnect. It would then be able to handle the next round of RPC retries and therefore avoid eviction. |
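A rough sketch of the "cheap" workaround suggested above, in plain C with invented names (this is not the real server code): rather than evicting the client after the first callback RPC times out, resend it a bounded number of times so a client that has managed to reconnect in the meantime can still answer.

```c
#include <stdbool.h>

#define MAX_CALLBACK_RESENDS 3

/* Stand-in for sending one callback RPC and waiting for the reply;
 * returns true if the client answered.  Here we pretend the client only
 * answers once it has reconnected, e.g. from the second attempt on. */
static bool send_callback_once(int attempt)
{
	return attempt >= 1;
}

/* Returns true if the client must be evicted (it never replied). */
static bool callback_with_retries(void)
{
	int attempt;

	for (attempt = 0; attempt <= MAX_CALLBACK_RESENDS; attempt++) {
		if (send_callback_once(attempt))
			return false;   /* replied: no eviction needed */
		/* Timed out: by the next attempt the client may have
		 * reconnected over a different router. */
	}
	return true;                    /* exhausted all resends */
}
```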
| Comment by Robert Read (Inactive) [ 27/Oct/10 ] |
|
I believe failing the reconnect because of queued RPCs is a relatively recent change; I don't know why it was done, but I'm sure there is a good reason. Another option here is to change the server to not evict the client when it knows that the client is still alive (because it is trying to reconnect). Eventually, the network will stabilize and allow the RPCs to be sent, but if that never happens then that is also a bug. |
| Comment by Dan Ferber (Inactive) [ 27/Oct/10 ] |
|
Based on talking with Chris and on his additional comments and references posted here, I changed the type from improvement to bug. The next step is to get an effort analysis, starting from the suggestions Robert has made here. |
| Comment by Dan Ferber (Inactive) [ 27/Oct/10 ] |
|
Asked Fan Yong to take a look at this, just to estimate the effort and skill set it would take to close, or to recommend someone else to do this initial look. |
| Comment by nasf (Inactive) [ 28/Oct/10 ] |
|
I have checked bug 22893; it is not a simple duplicate of bug 3622. There are two different issues: 1) As comment #0 of bug 22893 describes, the OST refused the first reconnection from the client because an RPC from the same client (on the previous connection) was still being processed. After applying the patches from bug 18674, such an RPC should be aborted at once, and the second reconnection from that client should succeed. I hope this issue has been fixed by those patches, and I believe you have applied them, but I need to confirm that with you. 2) This is the issue described in bug 3622. The pinger, or some other RPC from client to server, will cause the client to reconnect if another router is available, but that cannot guarantee the server will not evict it before the reconnection succeeds. So, assuming the first issue has been fixed (see bug 18674), what we should do is fix the second issue. The basic idea is that before evicting the client (failing the related export) for a lock callback timeout, the server checks whether there has been a reconnection attempt from that client. If yes, it does not evict the client; instead it gives the client another chance to reconnect and then resends the related RPC. If there has been no reconnection attempt before the eviction check, that means either the network is partitioned (equivalent to the only router being down) or the client is not responding to the RPC, whether it is alive or not; in that case the server just evicts it. So it is important to guarantee that the client can try to reconnect before the eviction check. On the other hand, to prevent a malicious client from holding locks without releasing them, we should limit the number of such reconnection chances. |
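A minimal sketch of the proposal above, again with invented names (not the actual Lustre export or ldlm code): on a lock callback timeout, check whether the client has attempted to reconnect before evicting it, and cap the number of chances so a client cannot hold locks indefinitely.

```c
#include <stdbool.h>

#define MAX_RECONNECT_CHANCES 3

/* Illustrative stand-in for a client export; field names are assumptions. */
struct fake_export {
	bool reconnect_pending;   /* client has sent a new connect request */
	int  reconnect_chances;   /* chances already granted */
};

/* Called when a lock callback to this client times out.
 * Returns true if the client should be evicted now. */
static bool callback_timeout_should_evict(struct fake_export *exp)
{
	if (exp->reconnect_pending &&
	    exp->reconnect_chances < MAX_RECONNECT_CHANCES) {
		/* The client is alive and trying to reconnect: let the
		 * reconnect complete and resend the callback instead. */
		exp->reconnect_chances++;
		return false;
	}
	/* No reconnection attempt (network partition, or the client is
	 * ignoring the callback), or too many chances used: evict. */
	return true;
}
```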
| Comment by Christopher Morrone [ 29/Oct/10 ] |
|
Yes, I think you understand the problem now. However, the first issue is not fixed by bug 18674. In bug 22893 we were running something based on 1.8.2, and all of the patches on bug 18674 appear to have landed as of 1.8.2. Now we are running a version of Lustre based on 1.8.3 and still seeing the problem. In particular, we are running LLNL's own version 1.8.3.0-5chaos. My guess is that 18674 only addressed the glimpse problem and did not fix the problem for all RPCs. Bug 22893 was marked as a duplicate of bug 3622 because the solution proposed there, having the server retry the RPC several times, would likely eliminate the evictions that we are seeing. Granted, it is not a good solution, because the client still needs to wait for the timeout instead of reconnecting immediately. But at least the eviction would be avoided. |
| Comment by nasf (Inactive) [ 31/Oct/10 ] |
|
The real situation is somewhat worse than I expected. In fact, the patches in bug 18674 are not aimed at the glimpse callback; they abort other active RPC(s) (belonging to the former connection) at reconnect time, and it is exactly those active RPC(s) that prevented the reconnection. I will check your branch for that. As for the eviction issue, with the current LNet mechanism the PTLRPC layer knows nothing about router state (down/up); the client has to wait for the RPC timeout and then try to reconnect, rather than reconnecting immediately after the router goes down. So the current solution for bug 3622 is just a workaround, not a perfect one. |
| Comment by nasf (Inactive) [ 31/Oct/10 ] |
|
It is quite possible that the old active RPCs (which belong to the former connection) hang in "ptlrpc_abort_bulk()" when the reconnection arrives and tries to abort them. The reconnection path hits a check of the form: else if (desc->bd_export->exp_abort_active_req) { ... }, so the reconnection cannot succeed until all of the old active RPCs have been aborted. Further investigation shows that "ptlrpc_abort_bulk()" => "LNetMDUnlink()" may not work as expected: either no "md" is found for the given handle, or no event callback arrives to clear "bd_network_rw". The following patch gives a partial solution, but I am not sure it is enough. https://bugzilla.lustre.org/show_bug.cgi?id=21760 |
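The failure mode described here can be illustrated with a small sketch (plain C, invented helper names; not the real ptlrpc/LNet code): aborting an in-flight bulk means unlinking its memory descriptor and then waiting for the event callback to clear the network-busy flag, and if the unlink finds no MD or the callback never fires, the wait never completes and the reconnection stays blocked.

```c
#include <stdbool.h>

/* Illustrative stand-in for a bulk descriptor; field names are assumptions. */
struct fake_bulk_desc {
	bool bd_network_rw;   /* set while the network may still touch buffers */
	int  bd_md_handle;    /* stand-in for the LNet MD handle */
};

/* Stand-in for LNetMDUnlink(): returns 0 on success, nonzero if no MD
 * was found for the handle. */
static int fake_md_unlink(int handle)
{
	(void)handle;
	return 0;
}

/* Stand-in for waiting on the unlink event callback that is supposed to
 * clear bd_network_rw; returns false if it never arrives in time. */
static bool fake_wait_unlink_event(struct fake_bulk_desc *desc, int timeout)
{
	(void)timeout;
	desc->bd_network_rw = false;
	return true;
}

static int fake_abort_bulk(struct fake_bulk_desc *desc)
{
	if (!desc->bd_network_rw)
		return 0;                           /* nothing in flight */

	if (fake_md_unlink(desc->bd_md_handle) != 0)
		return -1;   /* no MD found: the flag will never be cleared */

	/* If the event callback never fires, this wait times out and the
	 * old RPC keeps blocking the client's reconnection. */
	if (!fake_wait_unlink_event(desc, 30))
		return -1;

	return 0;
}
```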
| Comment by Dan Ferber (Inactive) [ 01/Nov/10 ] |
|
Nasf, would the next step be for the customer to test this patch, or do you want to test it further yourself first? Thanks, – |
| Comment by nasf (Inactive) [ 01/Nov/10 ] |
|
In fact, I have tested the patch locally, but I cannot verify much since I have never reproduced the denied-reconnection issue in my small virtual cluster environment. On the other hand, the patch is still waiting to be inspected by Oracle, and I am not sure whether it will be approved to land. So if the customer can reproduce the denied-reconnection issue easily, it would be better to verify the patch in advance; if it fails again, the logs will be helpful for further investigation. At the same time, as I said above, the patch may not be enough; I will continue to investigate these issues and try to provide a more suitable solution. |
| Comment by Christopher Morrone [ 02/Nov/10 ] |
|
It is not obvious to me that Oracle bug 21760 is related to this bug. That seems to be a solution to a problem where the outstanding RPC can never complete. That is not the case here since the RPCs always time out like they should. Maybe I don't know enough about these code paths, but I do not see what ost_brw_read() has to do with an outstanding ldlm_cp_callback. As far as I can tell, bug 18674 only appears to deal with bulk requests, not all RPCs. Yes, target_handle_connect() will set the exp_abort_active_req flag, but that appears to be used in very few places. I am skeptical that it is really actively involved in aborting the ldlm_cp_callback RPC. |
| Comment by nasf (Inactive) [ 02/Nov/10 ] |
|
There seems to be some misunderstanding about the issue of the reconnection being denied because of the old active RPC(s). The basic view of the issue is as follows:
We are not sure when "ptlrpc_abort_bulk()" will finish, but if the lock callback times out before that, the client will be evicted. So resolving the denied-reconnection issue should be the first step toward resolving the client eviction caused by router failure. Another thing to note is that in "1)" we assume there is some bulk RPC between the client and the OST on the old connection, because we found such an RPC in the logs for bug 18674. But could there be other types of RPC(s) as well? I think you may be skeptical about that, right? That is my concern too, since your branch has applied the patches from bug 18674. I am studying the code to try to find out whether other unfinished RPC types could cause the denied-reconnection issue. |
| Comment by Christopher Morrone [ 03/Nov/10 ] |
|
There are definitely other RPCs for which it is a problem. It is not bulk RPCs that are currently a problem for us; specifically, we are seeing rejected reconnects because of outstanding lock callback RPCs. This is documented in bug 22893. |
| Comment by nasf (Inactive) [ 05/Nov/10 ] |
|
I am making a patch for the denied-reconnection and router-failure-recovery issues. |
| Comment by Dan Ferber (Inactive) [ 09/Nov/10 ] |
|
Yong Fan is currently verifying the patch in his local environment. Since it is difficult to reproduce the LLNL issues in his virtual environment, he needs to design some test cases to simulate these kinds of failures. Once his patch passes these local tests, he will push it to gerrit for internal review and approval. |
| Comment by Dan Ferber (Inactive) [ 09/Nov/10 ] |
|
Yong Fan, I suggest posting a short comment in BZ bug 3622 noting that you are working on this, with a reference to this Jira bug. |
| Comment by nasf (Inactive) [ 10/Nov/10 ] |
|
I have made a patch for the denied-reconnection issue and the eviction issue, but they are mixed together and not easy to divide into two parts. You may think we can make the reconnection happen more quickly by aborting the active RPC(s) that belong to the old connection from that client, but this is not easy: we are still not entirely clear which RPC types block the reconnection, even after roughly half a year of investigation of bug 18674 by Oracle's engineers, who fixed some cases (such as bulk RPCs) but not all of them (since LLNL can still reproduce the problem after applying the related patches). We need more logs to clarify the issue (which RPCs, blocked where, and why); I have added some debug messages in my patch and hope they are helpful. On the other hand, whether the server evicts the client depends on the ldlm lock callback timeout, which is not controlled by the client, so no matter how fast the reconnection is, we cannot guarantee the client will not be evicted. So preventing the immediate eviction under such a router failure, to give the client more chances to reconnect, is quite necessary. In fact, even if we do nothing, as long as enough time passes and the active RPCs are not blocked by server-local issues (a semaphore or some other exclusive resource), the existing timeout mechanism will eventually abort those active RPCs and the reconnection will succeed (the client will keep trying forever as long as it is not evicted). That may be the simplest, though not the most efficient, way to resolve the denied-reconnection issue, but we should not make the server wait for the reconnection forever or for a very long time; there should be some balance. I am verifying the patch locally. Since it is difficult to reproduce the issues in my virtual environment, I need to design some test cases to simulate these kinds of failures. If possible, I hope these test cases can be part of the patch. Once it passes local testing, I will push it to gerrit for internal inspection. |
| Comment by nasf (Inactive) [ 13/Nov/10 ] |
|
I have posted an initial version of the patch with test cases to gerrit. It is a workaround patch, but according to the latest comment on bug 3622 (comment #18), some of the mechanism needs to be adjusted to match the requirements from the original discussion. There may be more technical discussion about it on bug 3622. |
| Comment by nasf (Inactive) [ 16/Nov/10 ] |
|
I have updated the patch on whamcloud gerrit for internal review according to bug 3622 comment #18. |
| Comment by Dan Ferber (Inactive) [ 17/Nov/10 ] |
|
Nasf, do you recommend Chris test the patch http://review.whamcloud.com/#change in his environment now? |
| Comment by nasf (Inactive) [ 17/Nov/10 ] |
|
According to the normal process, such a patch should be inspected internally first, but I am not sure Robert has enough time to do that promptly, so for now it would be better for the customer to verify it if possible. |
| Comment by Robert Read (Inactive) [ 22/Nov/10 ] |
|
I have requested Di and Bobi Jam to inspect this patch, instead of me. |
| Comment by nasf (Inactive) [ 01/Dec/10 ] |
|
An updated version is available: (process ldlm callback resend in ptlrpc layer) |
| Comment by Ned Bass [ 12/Aug/11 ] |
|
We are still hitting this issue fairly frequently on our production 1.8.5 clusters. Is anyone still working on the proposed fix? |
| Comment by nasf (Inactive) [ 10/Oct/11 ] |
|
It is still in the work queue. Because of other higher-priority tasks, there is no definite release time; sorry for that. Thanks for keeping track of this ticket; any updates will be posted here. |
| Comment by Bruce Korb (Inactive) [ 15/Feb/12 ] |
|
Ping? We have a customer troubled by this, too. |
| Comment by Robert Read (Inactive) [ 16/Feb/12 ] |
|
This is not a priority for us right now, but we'd be happy to take a look at a patch if you have one. |
| Comment by Nathan Rutman [ 07/Mar/12 ] |
|
One aspect of this problem is in the following case: |
| Comment by Cory Spitz [ 01/Jun/12 ] |
|
Re the last comment, that would be Also, |
| Comment by Vitaly Fertman [ 20/Feb/14 ] |
|
BL AST resend: http://review.whamcloud.com/9335 |
| Comment by Christopher Morrone [ 28/May/14 ] |
|
Vitaly, I don't understand what that has to do with this ticket. Please expand the explanation, or start a new ticket. |
| Comment by Vitaly Fertman [ 16/Jun/14 ] |
|
Chris, in fact, your issue is client eviction because a cancel is not delivered to the server. 1. The cancel is lost; it is to be resent - fixed by 2. The BL AST is lost; the BL AST is to be resent - fixed by this patch. 3. The CANCEL cannot be sent due to an absent connection, and the re-CONNECT fails with an RPC in progress - fixed by 4. The CONNECT cannot be handled by the server because all the handling threads are stuck with other RPCs in progress - fixed by 5. The PING cannot be handled by the server because all the handling threads are stuck with other RPCs, and the client cannot even start a re-CONNECT - fixed by |
| Comment by Christopher Morrone [ 23/Jun/14 ] |
|
Well, no, not my issue. Although my issue may have been lumped in with a bunch of other things. We have specifically been waiting years for issue 3 to be solved. |
| Comment by Cory Spitz [ 24/Jun/14 ] |
|
Chris, are you saying that the patches from |
| Comment by Christopher Morrone [ 24/Jun/14 ] |
|
No, it might work. Haven't had a chance to try it yet. |
| Comment by Vitaly Fertman [ 20/Aug/14 ] |
|
BL AST re-send is moved to |
| Comment by James A Simmons [ 17/Mar/15 ] |
|
What is left for this work? |
| Comment by Andreas Dilger [ 20/Apr/16 ] |
|
Closing this old issue. All of the sub-issues have been closed and patches have landed for the related problems. Servers should resend RPCs to clients in appropriate circumstances before evicting them. |