<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:52:37 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-5569] recreating a reverse import produces various failures</title>
                <link>https://jira.whamcloud.com/browse/LU-5569</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Don&apos;t reallocate a new reverse import for each client reconnect.&lt;br/&gt;
Disconnecting the reverse import on every client reconnect opens&lt;br/&gt;
several races in the request-sending (mostly AST) code.&lt;/p&gt;

&lt;p&gt;First problem: a race between send_rpc and class_destroy_import(). If an&lt;br/&gt;
RPC send (or resend) is issued after class_destroy_import() has been&lt;br/&gt;
called, the send fails the import generation check.&lt;/p&gt;
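A toy model of the generation check involved (Python, with invented names; the real check lives in ptlrpc_import_delay_req, as the example log below shows):

```python
# Toy model of the import-generation check (hypothetical names, not the Lustre API).
class Import:
    def __init__(self):
        self.generation = 1

    def destroy_and_recreate(self):
        # class_destroy_import() plus reallocation bumps the generation.
        self.generation += 1

class Request:
    def __init__(self, imp):
        self.imp = imp
        self.generation = imp.generation  # generation captured at creation time

def send_rpc(req):
    # Mirrors the "req wrong generation" failure: a (re)send issued after
    # the reverse import was reallocated is rejected instead of going out.
    if req.generation != req.imp.generation:
        return -5  # modeled as -EIO: request dropped, never reaches the wire
    return 0

imp = Import()
req = Request(imp)
assert send_rpc(req) == 0      # normal send succeeds
imp.destroy_and_recreate()     # client reconnect recreates the reverse import
assert send_rpc(req) == -5     # resend of the old request now fails
```

The point of the sketch: the request remembers the generation it was born under, so any recreation of the import strands in-flight requests.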

&lt;p&gt;Second problem: the target_handle_connect function stops updating the&lt;br/&gt;
connection information for the older reverse import. So RPCs cannot be&lt;br/&gt;
delivered from the server to the client, because the connection&lt;br/&gt;
information is stale or the security flavor has changed.&lt;/p&gt;

&lt;p&gt;Third problem: connection flags are not updated atomically for an&lt;br/&gt;
import. The target_handle_connect function links the new import before&lt;br/&gt;
the message header flags are set, so an RPC sent at the same time will&lt;br/&gt;
carry the wrong flags.&lt;/p&gt;

&lt;p&gt;Fourth problem: when a client reconnects after a network flap, no&lt;br/&gt;
wakeup event is sent to the RPCs waiting in the import queues. That adds&lt;br/&gt;
a noticeable timeout when the server did not get its request out before&lt;br/&gt;
the network flap.&lt;/p&gt;

&lt;p&gt;Some example log messages:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000100:00100000:1.0:1407845348.937766:0:62024:0:(service.c:1929:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ll_ost_419:4960df0f-75ed-07a2-cee7-063090dc59cd+4:19257:x1475700821793316:12345-1748@gni1:8 Request procesed in 55us (106us total) trans 0 rc 0/0

00000100:00020000:1.0:1407845393.600747:0:81897:0:(client.c:1115:ptlrpc_import_delay_req()) @@@ req wrong generation:  req@ffff880304e39800 x1475078782385806/t0(0) o105-&amp;gt;snx11063-OST0070@1748@gni1:15/16 lens 344/192 e 0 to 1 dl 1407845389 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
</description>
                <environment></environment>
        <key id="26254">LU-5569</key>
            <summary>recreating a reverse import produces various failures</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="2" iconUrl="https://jira.whamcloud.com/images/icons/priorities/critical.svg">Critical</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="yujian">Jian Yu</assignee>
                                    <reporter username="shadow">Alexey Lyashkov</reporter>
                        <labels>
                            <label>HB</label>
                            <label>patch</label>
                            <label>ptlrpc</label>
                    </labels>
                <created>Tue, 2 Sep 2014 05:09:41 +0000</created>
                <updated>Fri, 15 Mar 2019 17:17:21 +0000</updated>
                            <resolved>Sat, 19 Sep 2015 05:29:23 +0000</resolved>
                                                    <fixVersion>Lustre 2.8.0</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>13</watches>
                                                                            <comments>
                            <comment id="92954" author="shadow" created="Tue, 2 Sep 2014 13:29:01 +0000"  >&lt;p&gt;Patch adding tests for these bugs:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://review.whamcloud.com/11724&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/11724&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="92967" author="johann" created="Tue, 2 Sep 2014 16:04:57 +0000"  >&lt;p&gt;Alexey, in gerrit 9335, you mentioned a &quot;data corruption&quot; problem. Could you please elaborate and explain why this issue only shows up with the AST resend patch? Thanks in advance&lt;/p&gt;</comment>
                            <comment id="93093" author="johann" created="Wed, 3 Sep 2014 08:39:41 +0000"  >&lt;p&gt;I talked to Alexey on Skype to get an answer to the question above. The problem is that ldlm_handle_ast_error() doesn&apos;t evict the client in some error cases (he mentioned EIO &amp;amp; EPROTO) where the AST wasn&apos;t delivered or properly processed by the client. The server then cancels the lock locally and grants the conflicting lock while the client still thinks it owns a valid lock and might continue writing to the file.&lt;/p&gt;</comment>
                            <comment id="93181" author="shadow" created="Thu, 4 Sep 2014 11:28:17 +0000"  >&lt;p&gt;&lt;a href=&quot;http://review.whamcloud.com/11750&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/11750&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="93244" author="hornc" created="Thu, 4 Sep 2014 22:02:42 +0000"  >&lt;blockquote&gt;
&lt;p&gt;Fourth problem, client reconnecting after network flap have result&lt;br/&gt;
none wakeup event send to a RPC in import queues. That situation adds&lt;br/&gt;
noticeable timeout in case server don&apos;t send request before network&lt;br/&gt;
flap.&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Alexey, I&apos;m having trouble understanding your description of this fourth problem. Is the following description correct?&lt;/p&gt;

&lt;p&gt;When a client reconnects after a network flap we do not currently wakeup any RPCs in the (reverse) import queue (specifically the imp_sending_list of the reverse import). This means we need to wait for the original request to timeout before the server can resend the request.&lt;/p&gt;</comment>
                            <comment id="93309" author="shadow" created="Fri, 5 Sep 2014 13:16:36 +0000"  >&lt;p&gt;Chris,&lt;/p&gt;

&lt;p&gt;Your description is correct. Thanks for the rephrasing!&lt;/p&gt;</comment>
                            <comment id="93351" author="green" created="Fri, 5 Sep 2014 18:01:13 +0000"  >&lt;p&gt;I want to note here too that the test_10d added by one of the patches fails 100% of the time in testing, which implies that it&apos;s either incorrect, or the actual fix fails at fixing the issue at hand.&lt;/p&gt;</comment>
                            <comment id="93362" author="shadow" created="Fri, 5 Sep 2014 18:42:49 +0000"  >&lt;p&gt;Oleg,&lt;/p&gt;

&lt;p&gt;Which patch are you pointing to? The first patch (test-only) is expected to fail its newly added tests; the second patch should not fail, since it fixes the issue.&lt;/p&gt;</comment>
                            <comment id="93375" author="hornc" created="Fri, 5 Sep 2014 20:26:54 +0000"  >&lt;p&gt;We discovered the bugs addressed by this change while investigating some non-POSIX compliant behavior exhibited by Lustre. Below is a description of the problem based on my own understanding and the fixes that were proposed to address it.&lt;/p&gt;

&lt;p&gt;Higher layers of Lustre (CLIO) generally rely on lower layers to enforce POSIX compliance. In this case, the Lustre Distributed Lock Manager (LDLM) and ptlrpc layers are interacting in such a way that results in inappropriate errors being returned to the client. The interaction revolves around a pair of clients performing I/O.&lt;/p&gt;

&lt;p&gt;One client (the writer) creates and writes data to a single file striped across eight OSTs. A second client (the reader) reads the data written by the writer. The reader requests a protected read lock for each stripe of the file from the corresponding OST. Upon receipt of the lock enqueue request, the OSS notes that it has already granted a conflicting lock to the writer. As a result the server sends a blocking AST (BL AST) to the writer. This is a request that the writer cancel its lock so that it may be granted to the reader. Upon receipt of the BL AST the writer should first reply that it has received the BL AST, and then, after flushing any pages to the server, send a lock cancel request back to the server. When the server receives both the reply to the BL AST and the lock cancel request it can then grant the lock to the reader via a completion AST (CP AST). The server sends a CP AST to the reader who must then acknowledge receipt of the CP AST before it can use the resource covered by the requested lock.&lt;/p&gt;
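The handoff described above can be sketched as an ordered message sequence (an illustration only; the step labels are informal, not the actual LDLM wire protocol):

```python
# Illustrative sequence of the PR-lock handoff between reader, OSS, and writer.
def lock_handoff():
    events = []
    events.append("reader -> OSS : enqueue PR lock")  # conflicts with writer's lock
    events.append("OSS -> writer : BL AST")           # ask the writer to cancel
    events.append("writer -> OSS : BL AST reply")     # writer acks receipt
    events.append("writer -> OSS : lock cancel")      # after flushing dirty pages
    events.append("OSS -> reader : CP AST")           # server grants the lock
    events.append("reader -> OSS : CP AST ack")       # reader may now use the resource
    return events

seq = lock_handoff()
assert len(seq) == 6
assert "CP AST" in seq[4]
```

The bug discussion below concerns what happens when step 5 (the CP AST) cannot be delivered or resent.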

&lt;p&gt;In Lustre, different components communicate via import and export pairs. An import is for sending requests and receiving replies, and an export is for receiving requests and sending replies. However, it is not possible to send a request via an export. As a result, servers utilize a reverse import to send AST requests to clients. A reverse import converts an import and export pair into a corresponding export and import pair. Currently, a new reverse import is created whenever the server creates an export for a client (re)connection. This prevents us from being able to re-send requests that were linked to the old reverse import. Historically this has not been problematic as servers did not have the ability to resend ASTs. With &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5520&quot; title=&quot;BL AST resend&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5520&quot;&gt;&lt;del&gt;LU-5520&lt;/del&gt;&lt;/a&gt; servers are now able to resend ASTs.&lt;/p&gt;
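A minimal sketch of the import/export pairing and the reconnect behavior described here (Python, with invented names; not the actual Lustre structures):

```python
# Toy model of import/export pairs and the reverse import (invented names).
class Export:
    """Receives requests and sends replies; holds the peer connection."""
    def __init__(self, connection):
        self.connection = connection
        self.reverse_import = None

class Import:
    """Sends requests and receives replies over a given connection."""
    def __init__(self, connection):
        self.connection = connection

def client_connect(server_exports, client_conn):
    # Current behavior described above: every (re)connect builds a new
    # export with a new reverse import, orphaning any AST requests that
    # were linked to the old reverse import.
    exp = Export(client_conn)
    exp.reverse_import = Import(client_conn)  # server-to-client request path
    server_exports.append(exp)
    return exp

server_exports = []
e1 = client_connect(server_exports, "nid1@tcp")
e2 = client_connect(server_exports, "nid1@tcp")   # same client reconnects
assert e1.reverse_import is not e2.reverse_import  # old ASTs cannot be resent
```

The proposed fix changes this life cycle so the reverse import survives reconnects instead of being reallocated each time.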

&lt;p&gt;This particular bug (there are other potential flavors) arises if the reader reconnects to an OST granting the lock while the OSS is trying to deliver the CP AST (or, equivalently, if the client is unable to acknowledge receipt of the CP AST). Based on Lustre trace data we determined that an OSS was unable to deliver a CP AST to the reader. While the OSS was waiting for the CP AST acknowledgement the reader reconnected to the OST granting the lock. As mentioned above, this created a new reverse import for this client. When the OSS attempted to resend the CP AST (after a timeout) it found that the old import for the reader had been destroyed. It was thus unable to re-send the request, and the request was immediately failed with a status of -EIO. When LDLM interpreted the failed request, it did not handle the -EIO request status appropriately. LDLM converted the -EIO error into -EAGAIN which was then returned to the client.&lt;/p&gt;
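The failure path can be modeled in miniature (hypothetical function names; only the -EIO to -EAGAIN translation is taken from the description above):

```python
# Toy model of the failed CP AST resend and the LDLM error translation.
EIO, EAGAIN = 5, 11

def resend_cp_ast(import_destroyed):
    # Resending against a destroyed reverse import fails immediately.
    return -EIO if import_destroyed else 0

def ldlm_interpret(rc):
    # The problematic behavior: -EIO is converted into -EAGAIN and
    # returned to the client, instead of evicting the client on the server.
    if rc == -EIO:
        return -EAGAIN
    return rc

assert resend_cp_ast(False) == 0
assert ldlm_interpret(resend_cp_ast(True)) == -EAGAIN
```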

&lt;p&gt;Two fixes are proposed to address different aspects of this bug:&lt;br/&gt;
1. Server-side: evict clients returning errors on ASTs&lt;/p&gt;
&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;Fix tracked by &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5581&quot; title=&quot;blocking ast error handling lack eviction for a local errors and some remote.&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5581&quot;&gt;&lt;del&gt;LU-5581&lt;/del&gt;&lt;/a&gt;&lt;/li&gt;
	&lt;li&gt;This immediately fixes the POSIX non-compliance issue, and helps to prevent potential data corruption cases.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;2. Server-side: change reverse import life cycle:&lt;/p&gt;
&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;Fix tracked by &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5569&quot; title=&quot;recreating a reverse import produce a various fails.&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5569&quot;&gt;&lt;del&gt;LU-5569&lt;/del&gt;&lt;/a&gt;&lt;/li&gt;
	&lt;li&gt;Fixes a race between send_rpc and class_destroy_import() that results in the import generation mismatch.&lt;/li&gt;
	&lt;li&gt;Properly update connection information for &#8220;older&#8221; reverse import in the event that client nid changes.&lt;/li&gt;
	&lt;li&gt;Ensure connection flags on the reverse import are updated prior to sending any rpcs on the reverse import after reconnect.&lt;/li&gt;
	&lt;li&gt;Wakeup (resend) requests on the reverse import sending list when a client reconnects rather than waiting for the original requests to timeout.&lt;/li&gt;
&lt;/ul&gt;
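The fourth bullet, waking requests on the sending list at reconnect instead of waiting for their timeouts, can be sketched as follows (hypothetical names):

```python
# Sketch of waking queued requests when the client reconnects (invented names).
class Req:
    def __init__(self, xid):
        self.xid = xid              # transfer id, for illustration only
        self.resend_now = False     # set when the request should go out again

def reconnect_wakeup(sending_list):
    # Instead of letting each request wait out its own timeout, mark every
    # request on the reverse import's sending list for immediate resend
    # as soon as the client reconnects.
    woken = 0
    for req in sending_list:
        if not req.resend_now:
            req.resend_now = True
            woken += 1
    return woken

queue = [Req(1), Req(2)]
assert reconnect_wakeup(queue) == 2   # both pending requests woken
assert reconnect_wakeup(queue) == 0   # idempotent on a second pass
```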
</comment>
                            <comment id="105714" author="green" created="Wed, 4 Feb 2015 19:16:08 +0000"  >&lt;p&gt;Ok. So this is a problem that has been there for quite a while I see?&lt;br/&gt;
In this case I imagine there&apos;s no super critical rush to get this into 2.7 as there are still some discussion about this patch and surrounding areas. The reality is such that 2.7 code freeze is right around the corner and we cannot drag this ticket indefinitely.&lt;br/&gt;
I don&apos;t think Cray or anybody else plans to jump to 2.7 right the second it gets released. So if we need more time to hash it out to get a solution that satisfies everyone and as the result it slips into 2.8 (and if 2.7 happens to be a maintenance release, also then backported to 2.7.1 or something like that and 2.5.5 or whatever is the going on thing then and Cray will backport it to their tree anyway no matter where it comes from....) that should be totally ok in my view?&lt;/p&gt;

&lt;p&gt;In other words, I guess I am asking: does anybody have a compelling reason why this should be treated differently than what&apos;s described above (i.e. as a blocker, because I am overlooking something and it&apos;s a new disastrous failure of epic proportions instead)?&lt;/p&gt;</comment>
                            <comment id="124529" author="spitzcor" created="Wed, 19 Aug 2015 03:36:29 +0000"  >&lt;p&gt;&lt;a href=&quot;http://review.whamcloud.com/#/c/11750&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/11750&lt;/a&gt; still needs help getting through review.&lt;/p&gt;</comment>
                            <comment id="127566" author="simmonsja" created="Wed, 16 Sep 2015 21:46:28 +0000"  >&lt;p&gt;It failed Oleg&apos;s review process. He listed the backtrace he got.&lt;/p&gt;</comment>
                            <comment id="127610" author="shadow" created="Thu, 17 Sep 2015 06:15:39 +0000"  >&lt;p&gt;Oleg&apos;s backtraces relate to a different bug: the flawed obd device release process, where an export may be held live after the obd device has been freed, aka &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4134&quot; title=&quot;obdfilter-suvery bugs and panics (ioctl API isn&amp;#39;t protected over shutdown/setup property). &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4134&quot;&gt;&lt;del&gt;LU-4134&lt;/del&gt;&lt;/a&gt;.&lt;br/&gt;
A patch exists at&lt;br/&gt;
&lt;a href=&quot;http://review.whamcloud.com/#/c/8045/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/8045/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But I am still unable to update it, because I lack access to Intel Gerrit: Google-account login does not work for me, and password login has been blocked by the Intel admins.&lt;/p&gt;</comment>
                            <comment id="127620" author="simmonsja" created="Thu, 17 Sep 2015 13:35:37 +0000"  >&lt;p&gt;Alex, email me your latest &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4134&quot; title=&quot;obdfilter-suvery bugs and panics (ioctl API isn&amp;#39;t protected over shutdown/setup property). &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4134&quot;&gt;&lt;del&gt;LU-4134&lt;/del&gt;&lt;/a&gt; patch and I will push it for you.&lt;/p&gt;</comment>
                            <comment id="127891" author="gerrit" created="Sat, 19 Sep 2015 03:50:12 +0000"  >&lt;p&gt;Oleg Drokin (oleg.drokin@intel.com) merged in patch &lt;a href=&quot;http://review.whamcloud.com/11750/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/11750/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5569&quot; title=&quot;recreating a reverse import produce a various fails.&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5569&quot;&gt;&lt;del&gt;LU-5569&lt;/del&gt;&lt;/a&gt; ptlrpc: change reverse import life cycle&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 892078e3b566c04471e7dcf2c28e66f2f3584f93&lt;/p&gt;</comment>
                            <comment id="127893" author="pjones" created="Sat, 19 Sep 2015 05:29:23 +0000"  >&lt;p&gt;Landed for 2.8&lt;/p&gt;</comment>
                            <comment id="131610" author="adilger" created="Mon, 26 Oct 2015 21:28:15 +0000"  >&lt;p&gt;This patch caused a regression on master.  See &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7221&quot; title=&quot;replay-ost-single test_3: ASSERTION( __v &amp;gt; 0 &amp;amp;&amp;amp; __v &amp;lt; ((int)0x5a5a5a5a5a5a5a5a) ) failed: value: 0&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7221&quot;&gt;&lt;del&gt;LU-7221&lt;/del&gt;&lt;/a&gt; for details.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10120">
                    <name>Blocker</name>
                                            <outwardlinks description="is blocking">
                                        <issuelink>
            <issuekey id="26087">LU-5520</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                            <issuelinktype id="10010">
                    <name>Duplicate</name>
                                                                <inwardlinks description="is duplicated by">
                                        <issuelink>
            <issuekey id="26227">LU-5559</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="32355">LU-7221</issuekey>
        </issuelink>
                            </outwardlinks>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="26383">LU-5590</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="26302">LU-5581</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzwv27:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>15532</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>