[LU-7] Reconnect server->client connection Created: 22/Oct/10  Updated: 20/Apr/16  Resolved: 20/Apr/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.7.0, Lustre 2.5.5

Type: Bug Priority: Major
Reporter: Robert Read (Inactive) Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: llnl

Issue Links:
Duplicate
is duplicated by LU-793 Reconnections should not be refused w... Resolved
Related
is related to LU-793 Reconnections should not be refused w... Resolved
is related to LU-5520 BL AST resend Resolved
is related to LU-1239 cascading client evictions Resolved
is related to LU-1565 lost LDLM_CANCEL RPCs Resolved
Bugzilla ID: 3,622
Rank (Obsolete): 8049

 Description   

Local tracking bug for 3622.



 Comments   
Comment by Robert Read (Inactive) [ 22/Oct/10 ]

Bug 20997 added a feature that might make implementing this fairly straightforward. There is now an rq_delay_limit on the request that limits how long ptlrpc will attempt to send (or resend) the request. This is a time limit (rather than a retry count limit as originally suggested), but there are some advantages to using a time limit instead of a retry count. In particular, the request will still age even while the import is not connected and no send attempts are being made.
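
As a rough illustration of the difference (a standalone sketch, not ptlrpc code; the type and field names fake_request, delay_limit and resend_budget are invented for the example):

#include <stdbool.h>
#include <stdio.h>
#include <time.h>

struct fake_request {
        time_t first_send;      /* when we first tried to send */
        int    delay_limit;     /* seconds we will keep trying */
        int    resend_budget;   /* alternative: remaining retries */
};

/* Time-based: the request keeps "aging" even while the import is
 * disconnected and no send attempts are being made. */
static bool may_resend_by_time(const struct fake_request *req, time_t now)
{
        return now - req->first_send < req->delay_limit;
}

/* Count-based: the budget only shrinks when a send is actually attempted,
 * so a long disconnected period does not consume it. */
static bool may_resend_by_count(struct fake_request *req)
{
        return req->resend_budget-- > 0;
}

int main(void)
{
        struct fake_request req = {
                .first_send    = time(NULL) - 120,  /* first sent 2 minutes ago */
                .delay_limit   = 60,
                .resend_budget = 3,
        };

        printf("time-based:  %s\n",
               may_resend_by_time(&req, time(NULL)) ? "resend" : "give up");
        printf("count-based: %s\n",
               may_resend_by_count(&req) ? "resend" : "give up");
        return 0;
}

With the time-based check, a request that first went out two minutes ago is already past its 60-second window even though no resend was ever attempted while the import was disconnected.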

Comment by Christopher Morrone [ 26/Oct/10 ]

Perhaps bug 3622 was not the best starting point for explanation of this bug. Please also see bug 22893, which was the real problem that we had. Let me restate:

With an lnet routed configuration and a fairly sizable network, there are times when the client will think that it cannot talk to the server and attempt to reconnect. When this happens, the reconnect can be blocked because the server already has RPCs outstanding to that client.

Right now, this is a fatal situation. Eventually the server will time out the outstanding RPCs and evict the client. Evictions mean applications get errors.

The server should really recognize that when the client reconnects, there is very little hope that any of the RPCs outstanding to that client will ever complete. It should really cancel the RPCs, allow the client to reconnect, and then replay the RPCs.

Since that would be fairly difficult to implement, Andreas suggested that resending the RPCs a number of times would be a cheap solution. We would still need to wait for the first round of RPCs to time out, but after they do the client could reconnect. It would then be able to handle the next round of RPC retries, and therefore avoid eviction.
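
A standalone toy model of that timeline (not Lustre code; MAX_RESENDS, send_callback_rpc and client_connected are made-up names): the first attempt is lost on the broken connection, but by the time it has timed out the client has reconnected, so the resent callback is handled and no eviction is needed.

#include <stdbool.h>
#include <stdio.h>

#define MAX_RESENDS 3

static bool client_connected;   /* old route is broken at first */

/* The callback RPC can only be delivered once the client has reconnected. */
static bool send_callback_rpc(int attempt)
{
        printf("attempt %d: %s\n", attempt,
               client_connected ? "delivered" : "timed out");
        return client_connected;
}

int main(void)
{
        bool delivered = false;

        for (int attempt = 1; attempt <= MAX_RESENDS && !delivered; attempt++) {
                delivered = send_callback_rpc(attempt);
                if (!delivered)
                        /* while the server waited out the timeout, the client
                         * had time to reconnect through another router */
                        client_connected = true;
        }

        puts(delivered ? "client kept, RPC handled after reconnect"
                       : "resend budget exhausted: evict client");
        return 0;
}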

Comment by Robert Read (Inactive) [ 27/Oct/10 ]

I believe the failure when trying to reconnect because of queued RPCs is a relatively recent change. I don't know why it was done, but I'm sure there is a good reason. Another option here is to change the server to not evict the client when it knows that it is still alive (because it is trying to reconnect). Eventually the network will stabilize and allow the RPCs to be sent, but if that never happens then this is also a bug.

Comment by Dan Ferber (Inactive) [ 27/Oct/10 ]

Based on talking with Chris and his additional comments and references posted here, changed the type from improvement to bug. Next step is to get an effort analysis, starting from the suggestions Robert has here.

Comment by Dan Ferber (Inactive) [ 27/Oct/10 ]

Asked Fan Yong to take a look at this, just to estimate the effort and skill set it would take to close it, or to recommend someone else to do this initial look.

Comment by nasf (Inactive) [ 28/Oct/10 ]

I have checked bug 22893; it is not a simple duplicate of bug 3622. There are two different issues:

1) As comment #0 for bug 22893 describes:
>Here are the console messages on the OSS demonstrating the problem:
>
>2010-05-12 08:18:18 Lustre: lsa-OST00b9: Client
>2ee2068f-7647-9abe-8de8-9f92d3f17975 (at 192.168.122.60@o2ib1) refused
>reconnection, still busy with 1 active RPCs
>2010-05-12 08:18:18 Lustre: lsa-OST00b9: Client
>2ee2068f-7647-9abe-8de8-9f92d3f17975 (at 192.168.122.60@o2ib1) refused
>reconnection, still busy with 1 active RPCs
>2010-05-12 08:18:31 Lustre: lsa-OST00b9: Request ldlm_cp_callback sent 38s ago
>to 192.168.122.60@o2ib1 has timed out (limit 38s)
>2010-05-12 08:18:31 LustreError: 138-a: lsa-OST00b9: A client on nid
>192.168.122.60@o2ib1 was evicted due to a lock completion callback to
>192.168.122.60@o2ib1 timed out: rc -107

That means the OST refused the first reconnection from the client because one RPC from the same client (on the previous connection) was still being processed. But after applying the patches from bug 18674, such an RPC should be aborted at once, and the second reconnection from that client should succeed. I hope this issue has been fixed by those patches, and I think you have applied them, but I need to confirm that with you.

2) The issue described in bug 3622. The pinger or another RPC from the client to the server will cause the client to reconnect to the server if another router is available, but that cannot guarantee the server does not evict the client before the reconnection succeeds.

So here we assume that the first issue has been fixed (see bug 18674); what we should do is fix the second issue. The basic idea is that before evicting the client (failing the related export) for a lock callback timeout, we check whether there has been a reconnection attempt from that client. If yes, do not evict it; instead, give the client another chance to reconnect, and then resend the related RPC to it. If there has been no reconnection attempt before the eviction check, it means either a network partition (the same as the only router being down) or the client not responding to the RPC, whether it is alive or not; in that case, just evict it. So it is important to guarantee that the client can try to reconnect before the eviction check. On the other hand, to prevent a malicious client from holding lock(s) without releasing them, we should restrict the number of such reconnection chances.
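
A hypothetical sketch of that decision, using invented names (fake_export, reconnect_seen, resend_chances) rather than real Lustre structures: on a lock callback timeout the server resends only if the client has shown a reconnection attempt and still has chances left, otherwise it evicts.

#include <stdbool.h>
#include <stdio.h>

struct fake_export {
        bool reconnect_seen;    /* has this client tried to reconnect? */
        int  resend_chances;    /* bounded, so a malicious client cannot
                                 * hold its locks forever */
};

enum timeout_action { EVICT_CLIENT, RESEND_CALLBACK };

static enum timeout_action on_lock_callback_timeout(struct fake_export *exp)
{
        if (exp->reconnect_seen && exp->resend_chances > 0) {
                exp->resend_chances--;
                exp->reconnect_seen = false;    /* must reconnect again before
                                                 * the next timeout */
                return RESEND_CALLBACK;
        }
        /* no reconnection attempt (network partition or dead client),
         * or all chances used up: fall back to eviction */
        return EVICT_CLIENT;
}

int main(void)
{
        struct fake_export exp = { .reconnect_seen = true, .resend_chances = 2 };

        while (on_lock_callback_timeout(&exp) == RESEND_CALLBACK) {
                printf("resend callback, %d chance(s) left\n", exp.resend_chances);
                exp.reconnect_seen = true;      /* pretend the client reconnected */
        }
        puts("evict client");
        return 0;
}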

Comment by Christopher Morrone [ 29/Oct/10 ]

Yes, I think you understand the problem now.

However, the first issue is not fixed by bug 18674.

In bug 22893 we were running something based on 1.8.2, and all of the patches on bug 18674 appear to be landed as of 1.8.2. Now we are running a version of lustre based on 1.8.3, and still seeing the problem. In particular, we are running llnl's own version 1.8.3.0-5chaos.

My guess is that 18674 only addressed the glimpse problem, and did not fix the problem for all RPCs.

Bug 22893 was marked as a duplicate of bug 3622 because the solution proposed there, having the server retry the RPC several times, would likely eliminate the evictions that we are seeing. Granted, it is not a good solution, because the client still needs to wait for the timeout instead of reconnecting immediately. But at least the eviction would be avoided.

Comment by nasf (Inactive) [ 31/Oct/10 ]

The real case is somewhat worse than I expected. In fact, the patches in bug 18674 are not aimed at the glimpse callback, but at aborting other active RPC(s) (belonging to the former connection) on reconnect; it is exactly such active RPC(s) that prevented the reconnection. I will check your branch for that.

As for the eviction issue: with the current LNET mechanism the PTLRPC layer knows nothing about the routers (down/up), so the client has to wait for an RPC timeout and then try to reconnect, rather than reconnecting immediately after the router goes down. So the current solution for bug 3622 is just a workaround, not a perfect one.

Comment by nasf (Inactive) [ 31/Oct/10 ]

It is quite possible that the old active RPCs (which belong to the former connection) hung in "ptlrpc_abort_bulk()" when the reconnection arrived and tried to abort them, as follows:

For reconnection:
int target_handle_connect(struct ptlrpc_request *req, svc_handler_t handler)
{
        ...
        if (req->rq_export->exp_conn_cnt <
            lustre_msg_get_conn_cnt(req->rq_reqmsg))
                /* try to abort active requests */
===>            req->rq_export->exp_abort_active_req = 1;

        spin_unlock(&export->exp_lock);
        GOTO(out, rc = -EBUSY);
        ...
}

For the old active RPC(s):

static int ost_brw_write(struct ptlrpc_request *req, struct obd_trans_info *oti)
{
        ...
        rc = l_wait_event(desc->bd_waitq,
                          !ptlrpc_server_bulk_active(desc) ||
                          desc->bd_export->exp_failed ||
                          desc->bd_export->exp_abort_active_req,
                          &lwi);
        LASSERT(rc == 0 || rc == -ETIMEDOUT);
        ...
        else if (desc->bd_export->exp_abort_active_req) {
                DEBUG_REQ(D_ERROR, req, "Reconnect on bulk GET");
                /* we don't reply anyway */
                rc = -ETIMEDOUT;
===>            ptlrpc_abort_bulk(desc);
                ...
        }
        ...
}

So the reconnection cannot succeed until all the old active RPCs have been aborted. Further investigation shows that "ptlrpc_abort_bulk()" => "LNetMDUnlink()" may not work as expected: either no "md" is found for the given handle, or no event callback arrives to clear "bd_network_rw". The following patch gives a partial solution, but I am not sure it is enough.

https://bugzilla.lustre.org/show_bug.cgi?id=21760
https://bugzilla.lustre.org/attachment.cgi?id=32032
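
To make the suspected failure mode more concrete, here is a small standalone model (not the real ptlrpc/LNet code; all names are illustrative): the bulk thread waits for a flag that only an unlink event callback would clear, so if that event is never delivered it gives up only when its own deadline expires, and every reconnection attempt in the meantime is refused.

#include <stdbool.h>
#include <stdio.h>

static bool network_rw = true;          /* stands in for bd_network_rw */
static bool unlink_event_delivered;     /* in the bad case this stays false */

static void fake_unlink_event(void)
{
        network_rw = false;             /* would normally wake the waiter */
}

/* One simulated second per iteration; returns 0 on wakeup, -1 on timeout. */
static int wait_for_bulk_idle(int timeout_sec)
{
        for (int t = 0; t < timeout_sec; t++) {
                if (unlink_event_delivered)
                        fake_unlink_event();
                if (!network_rw)
                        return 0;
                printf("t=%ds: bulk still active, reconnection refused\n", t);
        }
        return -1;
}

int main(void)
{
        /* unlink_event_delivered is never set: the waiter runs out the clock */
        if (wait_for_bulk_idle(5) != 0)
                puts("gave up on timeout; only now can the client reconnect");
        return 0;
}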

Comment by Dan Ferber (Inactive) [ 01/Nov/10 ]

Nasf,

Would the next step be for the customer to test this patch, or do you want
to do some testing yourself first? Let me know what you think the best next
step is.

Thanks,
Dan


Comment by nasf (Inactive) [ 01/Nov/10 ]

In fact, I have tested the patch locally, but I can verify nothing since I have never reproduced the refused-reconnection issue in my small virtual cluster environment. On the other hand, the patch is waiting to be inspected by Oracle, and I am not sure whether it will be approved to land yet. So if the customer can reproduce the refused-reconnection issue easily, it would be better to verify the patch in advance; if it fails again, the logs will be helpful for further investigation. At the same time, as I said above, the patch may not be enough; I will continue to investigate these issues and try to provide a more suitable solution.

Comment by Christopher Morrone [ 02/Nov/10 ]

It is not obvious to me that Oracle bug 21760 is related to this bug. That seems to be a solution to a problem where the outstanding RPC can never complete. That is not the case here since the RPCs always time out like they should.

Maybe I don't know enough about these code paths, but I do not see what ost_brw_read() has to do with an outstanding ldlm_cp_callback. As far as I can tell, bug 18674 only appears to deal with bulk requests, not all RPCs.

Yes, target_handle_connect() will set the exp_abort_active_req flag, but that appears to be used in very few places. I am skeptical that it is really actively involved in aborting the ldlm_cp_callback RPC.

Comment by nasf (Inactive) [ 02/Nov/10 ]

There seems to be some misunderstanding about the issue of reconnection being refused because of old active RPC(s). The basic view of the issue is as follows:

  1. There is some bulk RPC between the client and OST on the old connection.
  2. Some operation triggers a lock callback RPC from the OST to that client; it can be a "cp" callback triggered by the same client, or maybe a "bl" or "gl" callback triggered by others.
  3. At that point the old connection fails (maybe because of a router or some other reason). Because the server cannot trigger a reconnect from the OST to the client (even if it knows the old connection is broken), it has to wait for a reconnection from the client.
  4. When the client detects the old connection failure, it tries to reconnect to the OST.
  5. When the reconnect request arrives at the OST, the service thread finds there is still some old active RPC being processed, so it sets the "exp_abort_active_req" flag to abort that old RPC.
  6. The service thread for the bulk RPC is woken up by the "exp_abort_active_req" flag, but before finishing the RPC it should call "ptlrpc_abort_bulk()" to do some cleanup. It is just at this last step that it may be blocked as described above. If so, subsequent reconnections from that client will always be refused until "ptlrpc_abort_bulk()" finishes.

We are not sure when "ptlrpc_abort_bulk()" will finish, but if the lock callback times out before that, the client will be evicted. So resolving the refused-reconnection issue should be the first step in resolving the client evictions caused by router failure.

Another thing to note: in step 1 we assume there is some bulk RPC between the client and the OST on the old connection, because we found such an RPC in the log for bug 18674. But could there be other types of RPC(s)? I think you may be skeptical about that, right? That is my concern too, since your branch has applied the patches from bug 18674. I am studying the code to try to find out whether other types of unfinished RPC(s) could cause this refused-reconnection issue.

Comment by Christopher Morrone [ 03/Nov/10 ]

There are definitely other RPCs for which it is a problem. It is not bulk RPCs that are currently a problem for us.

Specifically, we are seeing rejected reconnects because of outstanding lock callback rpcs. This is documented in bug 22893.

Comment by nasf (Inactive) [ 05/Nov/10 ]

I am making a patch for the refused-reconnection and router failure recovery issues.

Comment by Dan Ferber (Inactive) [ 09/Nov/10 ]

Yong Fan is currently verifying the patch in his local environment. Since it is difficult to reproduce the LLNL issues in his virtual environment, he needs to design some test cases to simulate these kinds of failures. Once his patch passes these local tests, he will push it to gerrit for internal review and approval.

Comment by Dan Ferber (Inactive) [ 09/Nov/10 ]

Yong Fan, I suggest posting a short comment in BZ bug 3622 that you are working on this, with a reference to this Jira bug.

Comment by nasf (Inactive) [ 10/Nov/10 ]

I have made a patch for the refused-reconnection issue and the eviction issue, but they are mixed together and not easy to split into two parts. You may think we can make the reconnection faster by aborting the active RPC(s) belonging to the old connection from that client, but that is not easy because, as you know, we are still not totally clear about which RPCs (or RPC types) blocked the reconnection, even after about half a year of investigation of bug 18674 by Oracle's engineers, who tried to fix some cases (like bulk RPCs), but not enough (since LLNL can reproduce it after applying the related patches). We need more logs to clarify the issue (which RPCs, blocked where, and why); I have added some debug messages in my patch, and I hope they will be helpful.

On the other hand, when the server evicts the client depends on the LDLM lock callback timeout, which is not controlled by the client, so no matter how fast the reconnection is, we cannot guarantee the client will not be evicted. So preventing the immediate eviction under such a router failure, to give the client more chance to reconnect, is quite necessary. In fact, even if we do nothing, as long as enough time passes and the active RPCs are not blocked by local server issues (a semaphore or other exclusive resource), the existing timeout mechanism will also abort those active RPCs, and then the reconnection will succeed (the client will keep trying as long as it is not evicted). So that may be the simplest, though not an efficient, way to resolve the refused-reconnection issue. But we should not make the server wait forever, or for a very long time, for the reconnection; there should be some balance.

I am verifying the patch locally. Since it is difficult to reproduce the issues in my virtual environment, I need to design some test cases to simulate these kinds of failures. If possible, I hope these test cases can be part of the patch. Once it passes local testing, I will push it to gerrit for internal inspection.

Comment by nasf (Inactive) [ 13/Nov/10 ]

I have posted an initial version of the patch, with test cases, to gerrit. It is a workaround patch, but according to the latest comment on bug 3622 (comment #18), some mechanisms need to be adjusted to match the requirements from the original discussion. There may be more technical discussion about it on bug 3622.

http://review.whamcloud.com/#change,125

Comment by nasf (Inactive) [ 16/Nov/10 ]

I have updated the patch on whamcloud gerrit for internal review according to bug 3622 comment #18.

Comment by Dan Ferber (Inactive) [ 17/Nov/10 ]

Nasf, do you recommend Chris test the patch http://review.whamcloud.com/#change in his environment now?

Comment by nasf (Inactive) [ 17/Nov/10 ]

According to the normal process, the patch should be inspected internally first, but I am not sure Robert has enough time to do that promptly, so for now it would be better for the customer to verify it if possible.

Comment by Robert Read (Inactive) [ 22/Nov/10 ]

I have requested Di and Bobi Jam to inspect this patch, instead of me.

Comment by nasf (Inactive) [ 01/Dec/10 ]

An updated version is available (it handles ldlm callback resend in the ptlrpc layer):

http://review.whamcloud.com/#change,125

Comment by Ned Bass [ 12/Aug/11 ]

We are still hitting this issue fairly frequently on our production 1.8.5 clusters. Is anyone still working on the proposed fix?

Comment by nasf (Inactive) [ 10/Oct/11 ]

It is still in the work queue. Because of other priority tasks, there is no definite release time; sorry for that. Thanks for keeping track of this ticket; any updates will be posted here.

Comment by Bruce Korb (Inactive) [ 15/Feb/12 ]

Ping? We have a customer bothered by this, too.

Comment by Robert Read (Inactive) [ 16/Feb/12 ]

This is not a priority for us right now, but we'd be happy to take a look at a patch if you have one.

Comment by Nathan Rutman [ 07/Mar/12 ]

One aspect of this problem is in the following case:
1. MDS is overloaded with enqueues, consuming all the threads on MDS_REQUEST portal.
2. An RPC times out on a client, leading to its reconnection. But this client has some locks to cancel, and the MDS is waiting for them.
3. The client sends MDS_CONNECT, but there is no empty thread to handle it.
4. Additionally, other clients are waiting for their enqueue completions; they try to ping the MDS, but PING is also sent to the MDS_REQUEST portal. Pings are supposed to be high-priority RPCs, but since this service has no srv_hqreq_handler we let other low-priority RPCs take the last thread, thus potentially preventing future high-priority requests from being serviced (see the sketch below).
We've got a patch addressing 3 & 4 in inspection (MRP-455).
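
For points 3 and 4, the general idea can be sketched as a simple admission check that never hands the last idle service thread to a normal-priority request, so a PING or CONNECT can always find a thread (a standalone toy example; HP_RESERVED, admit_request and idle_threads are invented names, not the actual MDS service code):

#include <stdbool.h>
#include <stdio.h>

#define HP_RESERVED 1   /* threads held back for high-priority requests */

static int idle_threads = 2;

static bool admit_request(bool high_priority)
{
        /* a normal request must leave HP_RESERVED threads untouched */
        int needed = high_priority ? 1 : 1 + HP_RESERVED;

        if (idle_threads < needed)
                return false;   /* queue it instead of taking the last thread */
        idle_threads--;
        return true;
}

int main(void)
{
        printf("enqueue (normal):  %s\n", admit_request(false) ? "run" : "wait");
        printf("enqueue (normal):  %s\n", admit_request(false) ? "run" : "wait");
        printf("PING (high-prio):  %s\n", admit_request(true)  ? "run" : "wait");
        return 0;
}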

Comment by Cory Spitz [ 01/Jun/12 ]

Re the last comment, that would be LU-1239, http://review.whamcloud.com/2355.

LU-793, 'Reconnections should not be refused when there is a request in progress from this client', http://review.whamcloud.com/#change,1616 would also improve the situation by allowing clients to reconnect while RPCs are in progress.

Comment by Vitaly Fertman [ 20/Feb/14 ]

BL AST resend: http://review.whamcloud.com/9335

Comment by Christopher Morrone [ 28/May/14 ]

Vitaly, I don't understand what that has to do with this ticket. Please expand the explanation, or start a new ticket.

Comment by Vitaly Fertman [ 16/Jun/14 ]

Chris,

In fact, your issue is client eviction because a cancel is not delivered to the server.
It may have several different reasons:

1. CANCEL is lost. It is to be resent - fixed by LU-1565

2. BL AST is lost. BL AST is to be resent - fixed by this patch.

3. CANCEL cannot be sent due to absent connection, re-CONNECT fails with rpc in progress - fixed by LU-793

4. CONNECT cannot be handled by server as all the handling threads are stuck with other RPCs in progress - fixed by LU-1239

5. PING cannot be handled by server as all the handling threads are stuck with other RPCs and client cannot even start re-CONNECT - fixed by LU-1239

Comment by Christopher Morrone [ 23/Jun/14 ]

Well, no, not my issue. Although my issue may have been lumped in with a bunch of other things. We have specifically been waiting years for issue 3 to be solved.

Comment by Cory Spitz [ 24/Jun/14 ]

Chris, are you saying that the patches from LU-793 are not sufficient to fix your issue?

Comment by Christopher Morrone [ 24/Jun/14 ]

No, it might work. Haven't had a chance to try it yet.

Comment by Vitaly Fertman [ 20/Aug/14 ]

BL AST re-send is moved to LU-5520

Comment by James A Simmons [ 17/Mar/15 ]

What is left for this work?

Comment by Andreas Dilger [ 20/Apr/16 ]

Closing this old issue.

All of the sub issues have been closed and patches landed for related problems. Servers should resend RPCs to clients in appropriate circumstances before evicting them.
