<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:02:49 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-7] Reconnect server-&gt;client connection</title>
                <link>https://jira.whamcloud.com/browse/LU-7</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Local tracking bug for &lt;a href=&quot;https://bugzilla.lustre.org/show_bug.cgi?id=3622&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;3622&lt;/a&gt;. &lt;/p&gt;</description>
                <environment></environment>
        <key id="10080">LU-7</key>
            <summary>Reconnect server-&gt;client connection</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="yong.fan">nasf</assignee>
                                    <reporter username="rread">Robert Read</reporter>
                        <labels>
                            <label>llnl</label>
                    </labels>
                <created>Fri, 22 Oct 2010 16:07:40 +0000</created>
                <updated>Wed, 20 Apr 2016 05:57:55 +0000</updated>
                            <resolved>Wed, 20 Apr 2016 05:57:55 +0000</resolved>
                                                    <fixVersion>Lustre 2.7.0</fixVersion>
                    <fixVersion>Lustre 2.5.5</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>18</watches>
                                                                            <comments>
                            <comment id="10073" author="rread" created="Fri, 22 Oct 2010 16:40:33 +0000"  >&lt;p&gt;Bug &lt;a href=&quot;https://bugzilla.lustre.org/show_bug.cgi?id=20997&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;20997&lt;/a&gt; added a feature that might make implementing this fairly straightforward.  There is now an &lt;tt&gt;rq_delay_limit&lt;/tt&gt; on the request that limits how long ptlrpc will attempt to send (or resend) the request.  This is a time limit (rather than a retry count limit as originally suggested), but there are some advantages to using time instead of retry. In particular, the request will still age even while the import is not connected and no attempts are being made. &lt;/p&gt;</comment>
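<!--
The time-based resend limit that the comment above describes (rq_delay_limit from bug 20997) can be sketched in miniature. This is a hedged illustration only: fake_request and request_may_resend are invented names, not Lustre code. The point is that the check depends only on elapsed time, so the request keeps aging even while the import is disconnected and no send attempts are possible.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Invented stand-in for a ptlrpc request with a time-based send limit,
 * illustrating the rq_delay_limit idea described in the comment above. */
struct fake_request {
    time_t sent_at;      /* when the request was first queued */
    time_t delay_limit;  /* max seconds to keep trying; 0 means no limit */
};

/* May the request still be sent (or resent) at time 'now'?  The check
 * depends only on elapsed time, so the request "ages" even while the
 * import is disconnected and no attempts are being made. */
static bool request_may_resend(const struct fake_request *req, time_t now)
{
    if (req->delay_limit == 0)
        return true;
    return (now - req->sent_at) < req->delay_limit;
}
```

A retry-count limit, by contrast, would stand still while the import is disconnected; the elapsed-time check above is what gives the aging behaviour the comment points out.
-->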
                            <comment id="10080" author="morrone" created="Tue, 26 Oct 2010 12:42:44 +0000"  >&lt;p&gt;Perhaps bug &lt;a href=&quot;https://bugzilla.lustre.org/show_bug.cgi?id=3622&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;3622&lt;/a&gt; was not the best starting point for explanation of this bug.  Please also see bug &lt;a href=&quot;https://bugzilla.lustre.org/show_bug.cgi?id=22893&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;22893&lt;/a&gt;, which was the real problem that we had.  Let me restate:&lt;/p&gt;

&lt;p&gt;With an lnet routed configuration and a fairly sizable network, there are times when the client will think that it cannot talk to the server and attempt to reconnect.  When this happens, the reconnect can be blocked because the server already has RPCs outstanding to that client.&lt;/p&gt;

&lt;p&gt;Right now, this is a fatal situation.  Eventually the server will time out the outstanding RPCs and evict the client.  Evictions mean applications get errors.&lt;/p&gt;

&lt;p&gt;The server should really recognize that when the client reconnects, there is very little hope that any of the RPCs outstanding to that client will ever complete.  It should really cancel the RPCs, allow the client to reconnect, and then replay the RPCs.&lt;/p&gt;

&lt;p&gt;Since that would be fairly difficult to implement, Andreas suggested that resending the RPCs a number of times would be a cheap solution.  We would still need to wait for the first round of RPCs to time out, but after they do the client could reconnect.  It would then be able to handle the next round of RPC retries, and therefore avoid eviction.&lt;/p&gt;</comment>
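<!--
The "cheap solution" described above (resend the RPCs a bounded number of times, so that after the first round times out a later round can land on the reconnected client) can be modelled in a few lines. Everything here is an invented illustration of the retry idea, not Lustre code.

```c
#include <assert.h>
#include <stdbool.h>

/* Outcome of one send attempt. */
enum rpc_result { RPC_OK, RPC_TIMEOUT };

/* One attempt; 'reconnected_at_try' models the round at which the
 * client has managed to reconnect and can answer again. */
static enum rpc_result try_send(int attempt, int reconnected_at_try)
{
    return attempt >= reconnected_at_try ? RPC_OK : RPC_TIMEOUT;
}

/* Retry up to 'max_tries'; return true if the RPC eventually completed,
 * i.e. the client would not be evicted. */
static bool send_with_retries(int max_tries, int reconnected_at_try)
{
    for (int attempt = 0; attempt < max_tries; attempt++)
        if (try_send(attempt, reconnected_at_try) == RPC_OK)
            return true;
    return false;   /* every round timed out: the client would be evicted */
}
```

With max_tries of 1 this degenerates to the status quo (one timeout, then eviction); any larger bound gives the reconnected client a later round to answer.
-->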
                            <comment id="10085" author="rread" created="Wed, 27 Oct 2010 01:08:02 +0000"  >&lt;p&gt;I believe the failure when trying to reconnect because of queued RPCs is a relatively recent change. I don&apos;t know why it was done, but I&apos;m sure there is a good reason.  Another option here is to change the server to not evict the client when it knows that the client is still alive (because it is trying to reconnect). Eventually, the network will stabilize and allow the RPCs to be sent; if that never happens, then that is also a bug.&lt;/p&gt;
</comment>
                            <comment id="10089" author="dferber" created="Wed, 27 Oct 2010 06:33:16 +0000"  >&lt;p&gt;Based on talking with Chris, and on his additional comments and references posted here, I changed the type from improvement to bug. The next step is to get an effort analysis, starting from Robert&apos;s suggestions here.  &lt;/p&gt;</comment>
                            <comment id="10091" author="dferber" created="Wed, 27 Oct 2010 06:41:48 +0000"  >&lt;p&gt;Asked Fan Yong to take a look at this, just to estimate the effort and skill set it would take to close, or to recommend someone else to do this initial look. &lt;/p&gt;</comment>
                            <comment id="10118" author="yong.fan" created="Thu, 28 Oct 2010 10:58:51 +0000"  >&lt;p&gt;I have checked bug 22893; it is not a simple duplicate of bug 3622. There are two different issues:&lt;/p&gt;

&lt;p&gt;1) As the comment #0 for bug 22893 described:&lt;br/&gt;
&amp;gt;Here are the console messages on the OSS demonstrating the problem:&lt;br/&gt;
&amp;gt;&lt;br/&gt;
&amp;gt;2010-05-12 08:18:18 Lustre: lsa-OST00b9: Client&lt;br/&gt;
&amp;gt;2ee2068f-7647-9abe-8de8-9f92d3f17975 (at 192.168.122.60@o2ib1) refused&lt;br/&gt;
&amp;gt;reconnection, still busy with 1 active RPCs&lt;br/&gt;
&amp;gt;2010-05-12 08:18:18 Lustre: lsa-OST00b9: Client&lt;br/&gt;
&amp;gt;2ee2068f-7647-9abe-8de8-9f92d3f17975 (at 192.168.122.60@o2ib1) refused&lt;br/&gt;
&amp;gt;reconnection, still busy with 1 active RPCs&lt;br/&gt;
&amp;gt;2010-05-12 08:18:31 Lustre: lsa-OST00b9: Request ldlm_cp_callback sent 38s ago&lt;br/&gt;
&amp;gt;to 192.168.122.60@o2ib1 has timed out (limit 38s)&lt;br/&gt;
&amp;gt;2010-05-12 08:18:31 LustreError: 138-a: lsa-OST00b9: A client on nid&lt;br/&gt;
&amp;gt;192.168.122.60@o2ib1 was evicted due to a lock completion callback to&lt;br/&gt;
&amp;gt;192.168.122.60@o2ib1 timed out: rc -107&lt;/p&gt;

&lt;p&gt;That means the OST refused the first reconnection from the client because one RPC from the same client (on the previous connection) was still being processed. But after applying the patches from bug 18674, such an RPC should be aborted at once, and the second reconnection from the client should succeed. I hope this issue has been fixed by those patches; I believe you have applied them, but I need to confirm that with you. &lt;/p&gt;

&lt;p&gt;2) It is the issue described in bug 3622. The pinger or another RPC from client to server will cause the client to reconnect to the server if another router is available, but this cannot guarantee that the server will not evict the client before the reconnection succeeds.&lt;/p&gt;

&lt;p&gt;So, assuming the first issue has been fixed (see bug 18674), what we should do is fix the second issue. The basic idea is that before evicting the client (failing the related export) for a lock callback timeout, the server should check whether a reconnection from that client has arrived. If so, it should not evict the client; instead it should give the client another chance to reconnect, and then resend the related RPC. If no reconnection arrived before the eviction check, that means either a network partition (equivalent to the only router being down) or a client that does not respond to the RPC whether alive or not; in that case, just evict it. So it is important to guarantee that the client can try to reconnect before the eviction check. On the other hand, to prevent a malicious client from holding lock(s) without releasing them, we should limit the number of such reconnection chances.&lt;/p&gt;</comment>
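<!--
The eviction policy proposed in the comment above (skip eviction at lock-callback timeout when a reconnection has been seen, but bound the number of chances so a malicious client cannot hold locks forever) can be sketched as follows. fake_export and the function name are invented for illustration; this is not Lustre code.

```c
#include <assert.h>
#include <stdbool.h>

/* Invented stand-in for the per-client export state this policy needs. */
struct fake_export {
    bool reconnect_seen;   /* did a reconnection arrive from the client? */
    int  chances_used;     /* reconnection chances already granted */
    int  chances_max;      /* upper bound on such chances */
};

/* Decide, at lock-callback timeout, whether to evict the client. */
static bool should_evict_on_callback_timeout(struct fake_export *exp)
{
    if (!exp->reconnect_seen)
        return true;   /* network partition or dead client: evict */
    if (exp->chances_used >= exp->chances_max)
        return true;   /* chances exhausted: treat as malicious, evict */
    exp->chances_used++;
    exp->reconnect_seen = false;  /* must reconnect again before next check */
    return false;      /* allow the reconnect and resend the callback */
}
```

The flag reset is the important detail: each granted chance requires a fresh reconnection before the next timeout check, so a silent client still gets evicted.
-->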
                            <comment id="10120" author="morrone" created="Fri, 29 Oct 2010 13:42:26 +0000"  >&lt;p&gt;Yes, I think you understand the problem now.&lt;/p&gt;

&lt;p&gt;However, the first issue is &lt;em&gt;not&lt;/em&gt; fixed by bug &lt;a href=&quot;https://bugzilla.lustre.org/show_bug.cgi?id=18674&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;18674&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In bug &lt;a href=&quot;https://bugzilla.lustre.org/show_bug.cgi?id=22893&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;22893&lt;/a&gt; we were running something based on 1.8.2, and all of the patches on bug &lt;a href=&quot;https://bugzilla.lustre.org/show_bug.cgi?id=18674&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;18674&lt;/a&gt; appear to be landed as of 1.8.2.  Now we are running a version of lustre based on 1.8.3, and still seeing the problem.  In particular, we are running llnl&apos;s own version &lt;a href=&quot;http://github.com/morrone/lustre/tree/1.8.3.0-5chaos&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;1.8.3.0-5chaos&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;My guess is that &lt;a href=&quot;https://bugzilla.lustre.org/show_bug.cgi?id=18674&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;18674&lt;/a&gt; only addressed the glimpse problem, and did not fix the problem for all RPCs.&lt;/p&gt;

&lt;p&gt;Bug &lt;a href=&quot;https://bugzilla.lustre.org/show_bug.cgi?id=22893&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;22893&lt;/a&gt; was marked as a duplicate of bug &lt;a href=&quot;https://bugzilla.lustre.org/show_bug.cgi?id=3622&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;3622&lt;/a&gt; because the solution proposed there, having the server retry the RPC several times, would likely eliminate the evictions that we are seeing.  Granted, it is not a &lt;em&gt;good&lt;/em&gt; solution, because the client still needs to wait for the timeout instead of reconnecting immediately.  But at least the eviction would be avoided.&lt;/p&gt;</comment>
                            <comment id="10121" author="yong.fan" created="Sun, 31 Oct 2010 09:29:38 +0000"  >&lt;p&gt;The real case is somewhat worse than I expected. In fact, the patches in bug 18674 are not aimed at the glimpse callback, but at aborting other active RPC(s) (belonging to the former connection) on reconnect; it is just such active RPC(s) that prevented the reconnection. I will check your branch for that.&lt;/p&gt;

&lt;p&gt;As for the eviction issue: with the current LNET mechanism, the PTLRPC layer knows nothing about router state (down/up), so the client has to wait for an RPC timeout and then try to reconnect, rather than reconnecting immediately after the router goes down. So the current solution for bug 3622 is just a workaround, not a perfect one.&lt;/p&gt;</comment>
                            <comment id="10124" author="yong.fan" created="Sun, 31 Oct 2010 20:30:06 +0000"  >&lt;p&gt;It is quite possible that the old active RPCs (which belong to the former connection) hung in &quot;ptlrpc_abort_bulk()&quot; when the reconnection arrived and tried to abort them, as follows:&lt;/p&gt;

&lt;p&gt;For reconnection:&lt;/p&gt;

&lt;pre&gt;
int target_handle_connect(struct ptlrpc_request *req, svc_handler_t handler)
{
...
                if (req-&amp;gt;rq_export-&amp;gt;exp_conn_cnt &amp;lt;
                    lustre_msg_get_conn_cnt(req-&amp;gt;rq_reqmsg))
                        /* try to abort active requests */
===&amp;gt;                    req-&amp;gt;rq_export-&amp;gt;exp_abort_active_req = 1;
                spin_unlock(&amp;amp;export-&amp;gt;exp_lock);
                GOTO(out, rc = -EBUSY);
...
}
&lt;/pre&gt;

&lt;p&gt;For the old active RPC(s):&lt;/p&gt;

&lt;pre&gt;
static int ost_brw_write(struct ptlrpc_request *req, struct obd_trans_info *oti)
{
...
                        rc = l_wait_event(desc-&amp;gt;bd_waitq,
                                          !ptlrpc_server_bulk_active(desc) ||
                                          desc-&amp;gt;bd_export-&amp;gt;exp_failed ||
                                          desc-&amp;gt;bd_export-&amp;gt;exp_abort_active_req,
                                          &amp;amp;lwi);
                        LASSERT(rc == 0 || rc == -ETIMEDOUT);
...
                } else if (desc-&amp;gt;bd_export-&amp;gt;exp_abort_active_req) {
                        DEBUG_REQ(D_ERROR, req, &quot;Reconnect on bulk GET&quot;);
                        /* we don&apos;t reply anyway */
                        rc = -ETIMEDOUT;
===&amp;gt;                    ptlrpc_abort_bulk(desc);
...
}
&lt;/pre&gt;

&lt;p&gt;So the reconnection cannot succeed until all the old active RPCs have been aborted. Further investigation shows that &quot;ptlrpc_abort_bulk()&quot; =&amp;gt; &quot;LNetMDUnlink()&quot; may not work as expected: either no &quot;md&quot; is found for the given handle, or there is no event callback to clear &quot;bd_network_rw&quot;. The following patch offers a possible solution, but I am not sure it is enough.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://bugzilla.lustre.org/show_bug.cgi?id=21760&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://bugzilla.lustre.org/show_bug.cgi?id=21760&lt;/a&gt;&lt;br/&gt;
&lt;a href=&quot;https://bugzilla.lustre.org/attachment.cgi?id=32032&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://bugzilla.lustre.org/attachment.cgi?id=32032&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="10125" author="dferber" created="Mon, 1 Nov 2010 02:24:27 +0000"  >
&lt;p&gt;Nasf,&lt;/p&gt;

&lt;p&gt;Would the next step be for the customer to test this patch, or do you want&lt;br/&gt;
to do some testing yourself first? Let me know what you think the best next&lt;br/&gt;
step is. &lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
Dan&lt;/p&gt;






&lt;p&gt;&amp;#8211; &lt;br/&gt;
Dan Ferber&lt;br/&gt;
Customer Service and Business Development Manager&lt;br/&gt;
Whamcloud, Inc. &lt;br/&gt;
Office: +1 651-344-1846&lt;br/&gt;
Cell: +1 651-344-1846&lt;br/&gt;
Email:  dferber@whamcloud.com&lt;/p&gt;



</comment>
                            <comment id="10127" author="yong.fan" created="Mon, 1 Nov 2010 05:49:14 +0000"  >&lt;p&gt;In fact, I have tested the patch locally, but I could not verify anything since I have never reproduced this denied-reconnection issue in my small virtual cluster environment. On the other hand, the patch is still waiting to be inspected by Oracle; I am not sure whether it will be approved to land yet. So if the customer can reproduce the denied-reconnection issue easily, it would be better to verify the patch in advance; if it fails again, the log will be helpful for further investigation. At the same time, as I said above, the patch may not be enough; I will continue to investigate these issues and try to provide a more suitable solution.&lt;/p&gt;</comment>
                            <comment id="10136" author="morrone" created="Tue, 2 Nov 2010 18:19:57 +0000"  >&lt;p&gt;It is not obvious to me that Oracle bug &lt;a href=&quot;https://bugzilla.lustre.org/show_bug.cgi?id=21760&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;21760&lt;/a&gt; is related to this bug.  That seems to be a solution to a problem where the outstanding RPC can never complete.  That is not the case here since the RPCs always time out like they should.&lt;/p&gt;

&lt;p&gt;Maybe I don&apos;t know enough about these code paths, but I do not see what ost_brw_read() has to do with an outstanding ldlm_cp_callback.  As far as I can tell, bug 18674 only appears to deal with bulk requests, not all RPCs.&lt;/p&gt;

&lt;p&gt;Yes, target_handle_connect() will set the exp_abort_active_req flag, but that appears to be used in very few places.  I am skeptical that it is really actively involved in aborting the ldlm_cp_callback RPC.&lt;/p&gt;</comment>
                            <comment id="10138" author="yong.fan" created="Tue, 2 Nov 2010 19:35:18 +0000"  >&lt;p&gt;There seems to be some misunderstanding about the issue of reconnection being denied because of old active RPC(s). The basic view of the issue is as follows:&lt;/p&gt;
&lt;ol&gt;
	&lt;li&gt;There is some bulk RPC between the client and the OST on the old connection.&lt;/li&gt;
	&lt;li&gt;Some operation triggers a lock callback RPC from the OST to that client; it can be a &quot;cp&quot; callback triggered by the same client, or a &quot;bl&quot; or &quot;gl&quot; callback triggered by others.&lt;/li&gt;
	&lt;li&gt;At this point the old connection fails (perhaps because of a router or some other reason). Because the server cannot trigger a reconnect from the OST to the client (even if it knows the old connection is broken), it has to wait for the client to reconnect.&lt;/li&gt;
	&lt;li&gt;When the client notices the old connection has failed, it tries to reconnect to the OST.&lt;/li&gt;
	&lt;li&gt;When the reconnect request arrives at the OST, the service thread finds some old active RPC still being processed, so it sets the &quot;exp_abort_active_req&quot; flag to abort the old RPC.&lt;/li&gt;
	&lt;li&gt;The service thread for the bulk RPC is woken up by the &quot;exp_abort_active_req&quot; flag, but before finishing the RPC it must call &quot;ptlrpc_abort_bulk()&quot; to clean up. It is at this last step that it may be blocked, as described above. If so, subsequent reconnections from the client will always be denied until &quot;ptlrpc_abort_bulk()&quot; finishes.&lt;/li&gt;
&lt;/ol&gt;


&lt;p&gt;We are not sure when &quot;ptlrpc_abort_bulk()&quot; will finish, but if the lock callback times out before it does, the client will be evicted. So resolving the denied-reconnection issue should be the first step toward resolving the client evictions caused by router failure.&lt;/p&gt;

&lt;p&gt;Another thing to notice: in &quot;1)&quot; we assume there is some bulk RPC between the client and the OST on the old connection, because we found such an RPC in the log for bug 18674. But might there be other types of RPC(s) as well? I think you may be skeptical of that, right? That is my concern too, since your branch has applied the patches from bug 18674. I am studying the code to try to find out whether other types of unfinished RPC(s) could cause this denied-reconnection issue.&lt;/p&gt;</comment>
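<!--
The denial sequence enumerated above (reconnect refused while an old RPC is active, abort flag raised on the first refusal, later reconnects still refused while the bulk cleanup is stuck) can be modelled in miniature. All names here are invented stand-ins for the real target_handle_connect / ptlrpc_abort_bulk paths, not Lustre code.

```c
#include <assert.h>
#include <stdbool.h>

#define EBUSY_ERR 16   /* stand-in for the kernel's EBUSY */

/* Invented per-client export state for this model. */
struct fake_export {
    int  active_rpcs;       /* old RPCs still being processed */
    bool abort_active_req;  /* set to ask those RPCs to abort */
};

/* Handle one reconnect attempt: 0 on success, negative EBUSY_ERR if
 * refused because old RPCs are still active (steps 5 and 6 above). */
static int handle_reconnect(struct fake_export *exp)
{
    if (exp->active_rpcs > 0) {
        exp->abort_active_req = true;  /* ask the old RPCs to bail out */
        return -EBUSY_ERR;
    }
    return 0;
}

/* The old RPC's service thread notices the flag and finishes, unless
 * its bulk cleanup (ptlrpc_abort_bulk in the real code) is stuck. */
static void old_rpc_progress(struct fake_export *exp, bool bulk_cleanup_stuck)
{
    if (exp->abort_active_req && !bulk_cleanup_stuck && exp->active_rpcs > 0)
        exp->active_rpcs = exp->active_rpcs - 1;
}
```

Running the model: while the cleanup is stuck every reconnect attempt returns the busy error, which is exactly the repeated "refused reconnection, still busy with 1 active RPCs" pattern quoted from the OSS console log earlier in this ticket.
-->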
                            <comment id="10143" author="morrone" created="Wed, 3 Nov 2010 13:36:59 +0000"  >&lt;p&gt;There are &lt;em&gt;definitely&lt;/em&gt; other RPCs for which it is a problem.  It is &lt;em&gt;not&lt;/em&gt; bulk RPCs that are currently a problem for us.&lt;/p&gt;

&lt;p&gt;Specifically, we are seeing rejected reconnects because of outstanding lock callback rpcs.  This is documented in bug &lt;a href=&quot;https://bugzilla.lustre.org/show_bug.cgi?id=22893&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;22893&lt;/a&gt;.&lt;/p&gt;</comment>
                            <comment id="10152" author="yong.fan" created="Fri, 5 Nov 2010 19:06:58 +0000"  >&lt;p&gt;I am making patch for the deny of reconnection and router failure recovery issues.&lt;/p&gt;</comment>
                            <comment id="10161" author="dferber" created="Tue, 9 Nov 2010 03:04:26 +0000"  >&lt;p&gt;Yong Fan is currently verifying the patch in his local environment. Since it is difficult to reproduce the LLNL issues in his virtual environment, he needs to design some test cases to simulate these kinds of failures. Once his patch passes these local tests, he will push it to gerrit for internal review and approval.&lt;/p&gt;</comment>
                            <comment id="10165" author="dferber" created="Tue, 9 Nov 2010 16:12:01 +0000"  >&lt;p&gt;Yong Fan, I suggest posting a short comment in BZ bug 3622 that you are working on this, with a reference to this Jira bug. &lt;/p&gt;</comment>
                            <comment id="10169" author="yong.fan" created="Wed, 10 Nov 2010 07:14:26 +0000"  >&lt;p&gt;I have made a patch for the denied-reconnection issue and the eviction issue, but the two fixes are mixed together and not easy to split into two parts. You may think we can make the reconnection faster by aborting the active RPC(s) belonging to the old connection from the client, but that is not easy: we are still not completely clear which RPCs (or which types of RPC) blocked the reconnection, even after about half a year of investigation of bug 18674 by Oracle&apos;s engineers, who tried to fix some cases (like bulk RPCs) but not enough of them (since LLNL can reproduce the problem after applying the related patches). We need more logs to clarify the issue (which RPCs, blocked where, and why); I have added some debug messages in my patch, and I hope they are helpful.&lt;/p&gt;

&lt;p&gt;On the other hand, when the server evicts the client depends on the ldlm lock callback timeout, which is not controlled by the client; no matter how fast the reconnection is, that cannot guarantee the client will not be evicted. So preventing immediate eviction under such a router failure, to give the client more chances to reconnect, is quite necessary. In fact, even if we do nothing, as long as enough time passes and the active RPCs are not blocked by server-local issues (a semaphore or some other exclusive resource), the existing timeout mechanism will abort the active RPCs too, and then the reconnection will succeed (the client will keep trying as long as it is not evicted). So that may be the simplest, though not the most efficient, way to resolve the denied-reconnection issue. But we should not make the server wait forever, or for a very long time, for the reconnection; there should be some balance.&lt;/p&gt;

&lt;p&gt;I am verifying the patch locally. Since it is difficult to reproduce the issues in my virtual environment, I need to design some test cases to simulate these kinds of failures. If possible, I hope these test cases can be part of the patch. Once it passes local testing, I will push it to gerrit for internal inspection.&lt;/p&gt;</comment>
                            <comment id="10180" author="yong.fan" created="Sat, 13 Nov 2010 20:32:18 +0000"  >&lt;p&gt;I have posted an initial version of the patch, with test cases, to gerrit. It is a workaround patch, but according to the latest comment on bug 3622 (comment #18), some mechanisms need to be adjusted to match the requirements from the original discussion, and there may be more technical discussion about it on bug 3622.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://review.whamcloud.com/#change,125&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#change,125&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="10182" author="yong.fan" created="Tue, 16 Nov 2010 21:41:19 +0000"  >&lt;p&gt;I have updated the patch on whamcloud gerrit for internal review according to bug 3622 comment #18.&lt;/p&gt;</comment>
                            <comment id="10183" author="dferber" created="Wed, 17 Nov 2010 04:44:13 +0000"  >&lt;p&gt;Nasf, do you recommend Chris test the patch &lt;a href=&quot;http://review.whamcloud.com/#change&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#change&lt;/a&gt; in his environment now?&lt;/p&gt;</comment>
                            <comment id="10185" author="yong.fan" created="Wed, 17 Nov 2010 06:10:45 +0000"  >&lt;p&gt;According to the normal process, the patch should be inspected internally first, but I am not sure Robert has enough time to do that promptly, so for now it is better for the customer to verify it, if possible.&lt;/p&gt;</comment>
                            <comment id="10220" author="rread" created="Mon, 22 Nov 2010 13:17:07 +0000"  >&lt;p&gt;I have requested Di and Bobi Jam to inspect this patch, instead of me. &lt;/p&gt;</comment>
                            <comment id="10263" author="yong.fan" created="Wed, 1 Dec 2010 09:30:02 +0000"  >&lt;p&gt;An updated version is available (it processes ldlm callback resend in the ptlrpc layer):&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://review.whamcloud.com/#change,125&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#change,125&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="19189" author="nedbass" created="Fri, 12 Aug 2011 14:27:03 +0000"  >&lt;p&gt;We are still hitting this issue fairly frequently on our production 1.8.5 clusters.  Is anyone still working on the proposed fix?&lt;/p&gt;</comment>
                            <comment id="20998" author="yong.fan" created="Mon, 10 Oct 2011 04:12:16 +0000"  >&lt;p&gt;It is still in the work queue. Because of other higher-priority tasks, there is no definite release time; sorry for that. Thanks for keeping track of this ticket; any updates will be posted here.&lt;/p&gt;</comment>
                            <comment id="28802" author="bkorb" created="Wed, 15 Feb 2012 18:41:05 +0000"  >&lt;p&gt;Ping?  We have a customer nuisanced by this, too.&lt;/p&gt;</comment>
                            <comment id="29068" author="rread" created="Thu, 16 Feb 2012 14:32:57 +0000"  >&lt;p&gt;This is not a priority for us right now, but we&apos;d be happy to take a look at a patch if you have one. &lt;/p&gt;</comment>
                            <comment id="30656" author="nrutman" created="Wed, 7 Mar 2012 16:40:05 +0000"  >&lt;p&gt;One aspect of this problem is in the following case:&lt;br/&gt;
1. MDS is overloaded with enqueues, consuming all the threads on MDS_REQUEST portal. &lt;br/&gt;
2. An rpc times out on a client, leading to its reconnection. But this client has some locks to cancel, and the MDS is waiting for them. &lt;br/&gt;
3. The client sends MDS_CONNECT, but there is no empty thread to handle it. &lt;br/&gt;
4. Additionally, other clients are waiting for their enqueue completions; they try to ping the MDS, but PING is also sent to the MDS_REQUEST portal. Pings are supposed to be high-priority RPCs, but since this service has no srv_hqreq_handler we let other low-priority RPCs take the last thread, thus potentially preventing future HP reqs from being serviced. &lt;br/&gt;
We&apos;ve got a patch addressing 3 &amp;amp; 4 in inspection (MRP-455). &lt;/p&gt;</comment>
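<!--
The thread-reservation idea behind points 3 and 4 above (never let a low-priority RPC consume the last free service thread, so that a PING or CONNECT can still be handled) reduces to a one-line admission check. This is a hypothetical sketch of the idea only, not the actual MRP-455 patch.

```c
#include <assert.h>
#include <stdbool.h>

/* May an incoming request take a service thread?  High-priority requests
 * (PING, CONNECT) may use any free thread; low-priority requests must
 * leave at least one thread free for future high-priority work. */
static bool may_take_thread(int free_threads, bool high_priority)
{
    if (high_priority)
        return free_threads > 0;
    return free_threads > 1;   /* reserve the last thread for HP requests */
}
```

With this check the scenario in point 3 cannot arise: even when enqueues have consumed every other thread, the reserved one can still service the client's MDS_CONNECT.
-->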
                            <comment id="39853" author="spitzcor" created="Fri, 1 Jun 2012 17:49:30 +0000"  >&lt;p&gt;Re the last comment, that would be &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-1239&quot; title=&quot;cascading client evictions&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-1239&quot;&gt;&lt;del&gt;LU-1239&lt;/del&gt;&lt;/a&gt;, &lt;a href=&quot;http://review.whamcloud.com/2355&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/2355&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Also, &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-793&quot; title=&quot;Reconnections should not be refused when there is a request in progress from this client.&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-793&quot;&gt;&lt;del&gt;LU-793&lt;/del&gt;&lt;/a&gt;, &apos;Reconnections should not be refused when there is a request in progress from this client&apos;, &lt;a href=&quot;http://review.whamcloud.com/#change,1616&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#change,1616&lt;/a&gt; would also improve the situation by allowing clients to reconnect with RPCs in process.&lt;/p&gt;</comment>
                            <comment id="77557" author="vitaly_fertman" created="Thu, 20 Feb 2014 23:42:02 +0000"  >&lt;p&gt;BL AST resend: &lt;a href=&quot;http://review.whamcloud.com/9335&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/9335&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="85084" author="morrone" created="Wed, 28 May 2014 23:42:06 +0000"  >&lt;p&gt;Vitaly, I don&apos;t understand what that has to do with this ticket.  Please expand the explanation, or start a new ticket.&lt;/p&gt;
</comment>
                            <comment id="86691" author="vitaly_fertman" created="Mon, 16 Jun 2014 15:10:26 +0000"  >&lt;p&gt;Chris,&lt;/p&gt;

&lt;p&gt;In fact, your issue is client eviction because a cancel is not delivered to the server.&lt;br/&gt;
it may have several different reasons:&lt;/p&gt;

&lt;p&gt;1. cancel is lost. it is to be resent - fixed by &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-1565&quot; title=&quot;lost LDLM_CANCEL RPCs&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-1565&quot;&gt;&lt;del&gt;LU-1565&lt;/del&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2. BL AST is lost. BL AST is to be resent - fixed by this patch.&lt;/p&gt;

&lt;p&gt;3. CANCEL cannot be sent due to absent connection, re-CONNECT fails with rpc in progress  - fixed by &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-793&quot; title=&quot;Reconnections should not be refused when there is a request in progress from this client.&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-793&quot;&gt;&lt;del&gt;LU-793&lt;/del&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;4. CONNECT cannot be handled by server as all the handling threads are stuck with other RPCs in progress - fixed by &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-1239&quot; title=&quot;cascading client evictions&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-1239&quot;&gt;&lt;del&gt;LU-1239&lt;/del&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;5. PING cannot be handled  by server as all the handling threads are stuck with other RPCs and client cannot even start re-CONNECT - fixed by &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-1239&quot; title=&quot;cascading client evictions&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-1239&quot;&gt;&lt;del&gt;LU-1239&lt;/del&gt;&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="87313" author="morrone" created="Mon, 23 Jun 2014 20:30:11 +0000"  >&lt;p&gt;Well, no, not &lt;em&gt;my&lt;/em&gt; issue.  Although my issue may have been lumped in with a bunch of other things.  We have specifically been waiting years for issue 3 to be solved.&lt;/p&gt;</comment>
                            <comment id="87367" author="spitzcor" created="Tue, 24 Jun 2014 14:23:42 +0000"  >&lt;p&gt;Chris, are you saying that the patches from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-793&quot; title=&quot;Reconnections should not be refused when there is a request in progress from this client.&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-793&quot;&gt;&lt;del&gt;LU-793&lt;/del&gt;&lt;/a&gt; are not sufficient to fix your issue?&lt;/p&gt;</comment>
                            <comment id="87379" author="morrone" created="Tue, 24 Jun 2014 15:56:47 +0000"  >&lt;p&gt;No, it might work.  Haven&apos;t had a chance to try it yet.&lt;/p&gt;</comment>
                            <comment id="92027" author="vitaly_fertman" created="Wed, 20 Aug 2014 11:09:15 +0000"  >&lt;p&gt;BL AST re-send is moved to &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5520&quot; title=&quot;BL AST resend&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5520&quot;&gt;&lt;del&gt;LU-5520&lt;/del&gt;&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="109927" author="simmonsja" created="Tue, 17 Mar 2015 21:44:05 +0000"  >&lt;p&gt;What is left for this work?&lt;/p&gt;</comment>
                            <comment id="149515" author="adilger" created="Wed, 20 Apr 2016 05:57:55 +0000"  >&lt;p&gt;Closing this old issue.&lt;/p&gt;

&lt;p&gt;All of the sub-issues have been closed and patches landed for related problems.  Servers should resend RPCs to clients in appropriate circumstances before evicting them.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10010">
                    <name>Duplicate</name>
                                                                <inwardlinks description="is duplicated by">
                                        <issuelink>
            <issuekey id="12244">LU-793</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="12244">LU-793</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="26087">LU-5520</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="13626">LU-1239</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="15040">LU-1565</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                    <customfield id="customfield_10020" key="com.atlassian.jira.plugin.system.customfieldtypes:float">
                        <customfieldname>Bugzilla ID</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>3622.0</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzvptb:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>8049</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                </customfields>
    </item>
</channel>
</rss>