<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:12:18 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-14733] brw_bulk_ready() BRW bulk READ failed for RPC from 12345-192.168.128.126@o2ib18: -103</title>
                <link>https://jira.whamcloud.com/browse/LU-14733</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;lnet_selftest fails between two nodes over Omnipath&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;dk.opal63.llnl.gov.7:00000001:00020000:43.0:1622598261.714620:0:129525:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk READ failed for RPC from 12345-192.168.128.126@o2ib18: -103 &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Bulk transfers work over Infiniband (although in that test one of the nodes was running RHEL 7.9 and an earlier Lustre patch stack).&#160; Bulk transfers also work over tcp using ksocklnd.&lt;/p&gt;

&lt;p&gt;lctl pings work fine between the same two nodes.&lt;/p&gt;

&lt;p&gt;mpibench and other MPI applications also work fine over Omnipath between two nodes.&lt;/p&gt;

&lt;p&gt;See &lt;a href=&quot;https://github.com/LLNL/lustre/releases/tag/2.12.6_9.llnl&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://github.com/LLNL/lustre/releases/tag/2.12.6_9.llnl&lt;/a&gt; for the patch stack&lt;/p&gt;</description>
                <environment>lustre-2.12.6_9.llnl client&lt;br/&gt;
kernel-4.18.0-305.0.0.1toss.t4.x86_64&lt;br/&gt;
RHEL84</environment>
        <key id="64515">LU-14733</key>
            <summary>brw_bulk_ready() BRW bulk READ failed for RPC from 12345-192.168.128.126@o2ib18: -103</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="2" iconUrl="https://jira.whamcloud.com/images/icons/priorities/critical.svg">Critical</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="ssmirnov">Serguei Smirnov</assignee>
                                    <reporter username="ofaaland">Olaf Faaland</reporter>
                        <labels>
                            <label>LTS12</label>
                            <label>llnl</label>
                    </labels>
                <created>Thu, 3 Jun 2021 15:12:03 +0000</created>
                <updated>Fri, 18 Mar 2022 21:15:27 +0000</updated>
                            <resolved>Sat, 24 Jul 2021 00:23:53 +0000</resolved>
                                                    <fixVersion>Lustre 2.12.8</fixVersion>
                    <fixVersion>Lustre 2.15.0</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>11</watches>
                                                                            <comments>
                            <comment id="303480" author="ofaaland" created="Thu, 3 Jun 2021 15:55:14 +0000"  >&lt;p&gt;These nodes were RHEL 8.3 based and have been upgraded to RHEL 8.4.&#160; At the same time, they got an updated Lustre client with the following OS-compat patches:&lt;/p&gt;

&lt;ul&gt;
	&lt;li&gt;12d13f33c0 (tag: 2.12.6_9.llnl) &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13783&quot; title=&quot;Support for linux kernel version 5.8&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13783&quot;&gt;&lt;del&gt;LU-13783&lt;/del&gt;&lt;/a&gt; osc: handle removal of NR_UNSTABLE_NFS&lt;/li&gt;
	&lt;li&gt;3cdb219927 &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12355&quot; title=&quot;Support for linux kernel version 5.0&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12355&quot;&gt;&lt;del&gt;LU-12355&lt;/del&gt;&lt;/a&gt; llite: MS_* flags and SB_* flags split&lt;/li&gt;
	&lt;li&gt;03e48854cb &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12355&quot; title=&quot;Support for linux kernel version 5.0&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12355&quot;&gt;&lt;del&gt;LU-12355&lt;/del&gt;&lt;/a&gt; llite: totalram_pages changed to atomic_long_t&lt;/li&gt;
	&lt;li&gt;b055fe9f7a (tag: 2.12.6_8.llnl) &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14690&quot; title=&quot;RHEL8.4 support&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14690&quot;&gt;&lt;del&gt;LU-14690&lt;/del&gt;&lt;/a&gt; kernel: new kernel [RHEL 8.4 4.18.0-305.el8]&lt;/li&gt;
	&lt;li&gt;335e03049d (tag: 2.12.6_7.llnl) &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14673&quot; title=&quot;panic: crc32-table: crc32 alg self test failed in fips mode!&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14673&quot;&gt;&lt;del&gt;LU-14673&lt;/del&gt;&lt;/a&gt; sec: annotate algorithms taking optional key&lt;/li&gt;
&lt;/ul&gt;

</comment>
                            <comment id="303483" author="ofaaland" created="Thu, 3 Jun 2021 15:56:29 +0000"  >&lt;p&gt;See also &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14690&quot; title=&quot;RHEL8.4 support&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14690&quot;&gt;&lt;del&gt;LU-14690&lt;/del&gt;&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="303493" author="pjones" created="Thu, 3 Jun 2021 18:24:34 +0000"  >&lt;p&gt;Serguei&lt;/p&gt;

&lt;p&gt;Can you please advise?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="303642" author="ofaaland" created="Fri, 4 Jun 2021 19:57:13 +0000"  >&lt;p&gt;For my reference, my local ticket is TOSS-5228&lt;/p&gt;</comment>
                            <comment id="304383" author="ofaaland" created="Mon, 14 Jun 2021 06:42:58 +0000"  >&lt;p&gt;Hi Serguei, do you have any update or questions on this?  Thanks&lt;/p&gt;</comment>
                            <comment id="304462" author="ssmirnov" created="Mon, 14 Jun 2021 17:21:55 +0000"  >&lt;p&gt;Olaf,&lt;/p&gt;

&lt;p&gt;So far it looks like it is possible there&apos;s some sort of incompatibility in how ib_post_send is called, but I don&apos;t have anything concrete in this direction yet. Do you have any config logs (from building lustre)? I&apos;m not that familiar with Omnipath either. Which version of Omnipath are you using? Basically I want to make sure we end up calling&#160;ib_post_send correctly.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;/p&gt;

&lt;p&gt;Serguei.&lt;/p&gt;</comment>
                            <comment id="304472" author="ofaaland" created="Mon, 14 Jun 2021 18:10:43 +0000"  >&lt;p&gt;Hi Serguei,&lt;br/&gt;
 I&apos;ve attached &lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/39055/39055_build.txt&quot; title=&quot;build.txt attached to LU-14733&quot;&gt;build.txt&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt; - not the config.log, but at least the stdout from ./configure.   I&apos;ll look into Omnipath info.&lt;br/&gt;
thanks,&lt;br/&gt;
Olaf&lt;/p&gt;</comment>
                            <comment id="304630" author="ssmirnov" created="Tue, 15 Jun 2021 22:20:31 +0000"  >&lt;p&gt;Hi Olaf,&lt;/p&gt;

&lt;p&gt;Here&apos;s some more detail regarding what I&apos;d like to try with the OPA build:&lt;/p&gt;

&lt;p&gt;lnet/autoconf/lustre-lnet.m4 has a check for&#160;ib_post_send() and ib_post_recv() to see if they require const ptr parameters. Could you please try removing this check and build without it? (See attached diff file). I suspect there may be an issue with using the wrong header file when building. The kernel code for&#160;4.18.0 appears to define these functions without the &quot;const&quot; and I think that&apos;s what we should be using for OPA, but the stdout you provided indicates that as a result of the configure the &quot;const&quot; version is used.&lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/39075/39075_diff.txt&quot; title=&quot;diff.txt attached to LU-14733&quot;&gt;diff.txt&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Thanks,&lt;/p&gt;

&lt;p&gt;Serguei.&lt;/p&gt;</comment>
                            <comment id="304743" author="ofaaland" created="Thu, 17 Jun 2021 02:26:27 +0000"  >&lt;p&gt;Hi Serguei,&lt;/p&gt;

&lt;p&gt;Sorry, I didn&apos;t see your message for some reason.  My test was slightly different, but I think still produced the result you need.  With the config check sabotaged so HAVE_IB_POST_SEND_RECV_CONST is not defined, the build fails with&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;/g/g0/faaland1/rpmbuild/BUILD/lustre-2.12.6_9.llnl_2_gd3006c6/lnet/klnds/o2iblnd/o2iblnd_cb.c:1002:46: error: passing argument 3 of &apos;ib_post_send&apos; from incompatible pointer type [-Werror=incompatible-pointer-types]
    rc = ib_post_send(conn-&amp;gt;ibc_cmid-&amp;gt;qp, wr, &amp;amp;bad);
                                              ^~~~
In file included from /usr/src/kernels/4.18.0-305.0.0.1toss.t4.x86_64/include/rdma/ib_addr.h:20,
                 from /usr/src/kernels/4.18.0-305.0.0.1toss.t4.x86_64/include/rdma/rdma_cm.h:12,
                 from /g/g0/faaland1/rpmbuild/BUILD/lustre-2.12.6_9.llnl_2_gd3006c6/lnet/klnds/o2iblnd/o2iblnd.h:71,
                 from /g/g0/faaland1/rpmbuild/BUILD/lustre-2.12.6_9.llnl_2_gd3006c6/lnet/klnds/o2iblnd/o2iblnd_cb.c:37:
/usr/src/kernels/4.18.0-305.0.0.1toss.t4.x86_64/include/rdma/ib_verbs.h:3799:37: note: expected &apos;const struct ib_send_wr **&apos; but argument is of type &apos;struct ib_send_wr **&apos;
           const struct ib_send_wr **bad_send_wr)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;My change was&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;diff --git a/lnet/autoconf/lustre-lnet.m4 b/lnet/autoconf/lustre-lnet.m4
index 6d4a05d4ef..98490ea9ae 100644
--- a/lnet/autoconf/lustre-lnet.m4
+++ b/lnet/autoconf/lustre-lnet.m4
@@ -539,6 +539,8 @@ AS_IF([test $ENABLEO2IB != &quot;no&quot;], [
        # In MOFED 4.6, the second and third parameters for
        # ib_post_send() and ib_post_recv() are declared with
        # &apos;const&apos;.
+       #
+       # SABOTAGE: force this to fail with extra argument to ib_post_send
        tmp_flags=&quot;$EXTRA_KCFLAGS&quot;
        EXTRA_KCFLAGS=&quot;-Werror&quot;
        LB_CHECK_COMPILE([if &apos;ib_post_send() and ib_post_recv()&apos; have const parameters],
@@ -555,7 +557,7 @@ AS_IF([test $ENABLEO2IB != &quot;no&quot;], [
                #include &amp;lt;rdma/ib_verbs.h&amp;gt;
        ],[
                ib_post_send(NULL, (const struct ib_send_wr *)NULL,
-                            (const struct ib_send_wr **)NULL);
+                            (const struct ib_send_wr **)NULL, NULL);
        ],[
                AC_DEFINE(HAVE_IB_POST_SEND_RECV_CONST, 1,
                        [ib_post_send and ib_post_recv have const parameters])


&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;thanks,&lt;br/&gt;
Olaf&lt;/p&gt;</comment>
                            <comment id="304749" author="ssmirnov" created="Thu, 17 Jun 2021 05:34:11 +0000"  >&lt;p&gt;Olaf,&lt;/p&gt;

&lt;p&gt;I suppose this means we need to look for a different reason why ib_post_send would fail with EINVAL (22) error code.&lt;/p&gt;

&lt;p&gt;So far I wasn&apos;t able to find the source code for this function provided by Omnipath for RH8.4. Oddly, I also failed to locate Omnipath release notes for RH8.4. Perhaps someone from CN can comment.&lt;/p&gt;

&lt;p&gt;Btw, although this is unlikely to be related to this issue, perhaps you want to port&#160;&lt;del&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13182&quot; title=&quot;MAP_POPULATE hangs with Linux 5.4&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13182&quot;&gt;&lt;del&gt;LU-13182&lt;/del&gt;&lt;/a&gt;&lt;/del&gt;&#160;as well? (see last couple of comments in&#160;&#160;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14690&quot; title=&quot;RHEL8.4 support&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14690&quot;&gt;&lt;del&gt;LU-14690&lt;/del&gt;&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;Thanks,&lt;/p&gt;

&lt;p&gt;Serguei.&lt;/p&gt;</comment>
                            <comment id="304810" author="ofaaland" created="Thu, 17 Jun 2021 16:51:03 +0000"  >&lt;p&gt;Hi Serguei, I&apos;m told:&lt;/p&gt;

&lt;p&gt;&quot;OPA hardware does not have hardware verbs support, so it uses the rdmavt layer to provide software translation.  You want to look at rvt_post_send() in drivers/infiniband/sw/rdmavt/qp.c.&quot;&lt;/p&gt;

&lt;p&gt;We&apos;re using the in-kernel OPA driver, so in this case the one in linux-4.18.0-305.el8&lt;/p&gt;

&lt;p&gt;thanks&lt;/p&gt;</comment>
                            <comment id="304844" author="ssmirnov" created="Thu, 17 Jun 2021 23:42:27 +0000"  >&lt;p&gt;Olaf,&lt;/p&gt;

&lt;p&gt;Have you tried building lustre master and testing if it has the same problem on your RH8.4 OPA machine? There&apos;s a chance that one of the o2iblnd patches there may help. Otherwise I&apos;m thinking I&apos;m going to have to add some debug messages to LLNL patch stack so we can use that to get more data.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;/p&gt;

&lt;p&gt;Serguei.&lt;/p&gt;</comment>
                            <comment id="304845" author="ofaaland" created="Thu, 17 Jun 2021 23:42:44 +0000"  >&lt;p&gt;Just to document this where it&apos;s visible to everyone:&lt;/p&gt;

&lt;p&gt;Ran perftest (perftest-4.4-37.0.t4.x86_64) from opal63 (compute), used both opal64 (compute) and opal187 (router) as servers. These are the same nodes, running the same OS, where lnet_selftest fails.&lt;/p&gt;

&lt;p&gt;All these tests succeeded:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;ib_write_bw opal64-hsi0
ib_read_bw opal64-hsi0
ib_send_bw opal64-hsi0
ib_atomic_bw opal64-hsi0
ib_write_lat opal187-hsi0
ib_send_bw -a -b -R -F opal187-hsi0
ib_send_bw -a -b -R -F opal64-hsi0
ib_send_bw -a -b -R -F -q 3 opal64-hsi0
ib_send_bw -a -b -R -F -q 3 opal187-hsi0
ib_read_bw -a -b -R -F -q 3 opal187-hsi0
ib_atomic_bw -b -R -F -q 3 opal187-hsi0
ib_send_bw -a -b -R -F -q 3 opal187-hsi0
ib_write_bw -a -b -R -F -q 3 opal187-hsi0
ib_read_bw -a -b -R -F -q 3 opal187-hsi0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Those arguments mean:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt; -R, --rdma_cm Connect QPs with rdma_cm and run test on those QPs
 -z, --com_rdma_cm Communicate with rdma_cm module to exchange data - use regular QPs
 -m, --mtu=&amp;lt;mtu&amp;gt; QP Mtu size (default: active_mtu from ibv_devinfo)
 -c, --connection=&amp;lt;type&amp;gt; Connection type RC/UC/UD/XRC/DC/SRD (default RC).
 -d, --ib-dev=&amp;lt;dev&amp;gt; Use IB device &amp;lt;dev&amp;gt; (default: first device found)
 -i, --ib-port=&amp;lt;port&amp;gt; Use network port &amp;lt;port&amp;gt; of IB device (default: 1)
 -s, --size=&amp;lt;size&amp;gt; Size of message to exchange (default: 1)
 -a, --all Run sizes from 2 till 2^23
 -n, --iters=&amp;lt;iters&amp;gt; Number of exchanges (at least 100, default: 1000)
 -x, --gid-index=&amp;lt;index&amp;gt; Test uses GID with GID index taken from command
 -V, --version Display version number
 -e, --events Sleep on CQ events (default poll)
 -F, --CPU-freq Do not fail even if cpufreq_ondemand module
 -I, --inline_size=&amp;lt;size&amp;gt; Max size of message to be sent in inline mode
 -u, --qp-timeout=&amp;lt;timeout&amp;gt; QP timeout = (4 uSec)*(2^timeout) (default: 14)
 -S, --sl=&amp;lt;sl&amp;gt; Service Level (default 0) 
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="304846" author="ofaaland" created="Thu, 17 Jun 2021 23:46:51 +0000"  >&lt;p&gt;Hi Serguei,&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;Have you tried building lustre master and testing if it has the same problem on your RH8.4 OPA machine?&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;We were in the process of setting that up when we lost power to the machine.&#160; As soon as it&apos;s back up, we will test that.&lt;/p&gt;</comment>
                            <comment id="304962" author="defazio" created="Fri, 18 Jun 2021 23:01:14 +0000"  >&lt;p&gt;We&apos;ve now tried lnet_selftest on the OPA compute nodes opal63 and opal64 using lustre-2.14.0_2.llnl (&lt;a href=&quot;https://github.com/LLNL/lustre/tree/2.14.0_2.llnl&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://github.com/LLNL/lustre/tree/2.14.0_2.llnl&lt;/a&gt;). The results were the same as with lustre-2.12.6_9.llnl: the test failed, the reported results were bad (lots of 0s), and the kernel dumps showed the same error as before:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
(o2iblnd_cb.c:957:kiblnd_post_tx_locked()) Error -22 posting transmit to 192.168.128.5@o2ib18
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;We&apos;re in the process of repeating the same test using lustre master.&lt;/p&gt;</comment>
                            <comment id="304971" author="ofaaland" created="Sat, 19 Jun 2021 01:36:00 +0000"  >&lt;p&gt;The test with master hit a list corruption BUG, so it was inconclusive.&#160; I won&apos;t be able to track it down now, I&apos;m leaving for vacation soon and need to prepare.&lt;/p&gt;</comment>
                            <comment id="305052" author="dalessandro" created="Mon, 21 Jun 2021 16:11:31 +0000"  >&lt;p&gt;Can someone post instructions or a test script so we can take a look? I downloaded the LLNL lustre tarball and built it on a RHEL 8.4 box (stock kernel) and would like to give it a shot, but I am not very familiar with LNet selftest.&lt;/p&gt;</comment>
                            <comment id="305057" author="ssmirnov" created="Mon, 21 Jun 2021 16:23:49 +0000"  >&lt;p&gt;Dennis,&lt;/p&gt;

&lt;p&gt;There are instructions here: &lt;a href=&quot;https://wiki.lustre.org/LNET_Selftest&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://wiki.lustre.org/LNET_Selftest&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Basically, two nodes with LNet running are needed&lt;/p&gt;

&lt;p&gt;1) Make sure selftest is loaded on both nodes: modprobe lnet_selftest&lt;/p&gt;

&lt;p&gt;2) Make sure LNet is configured on both nodes. Run &quot;lnetctl net show&quot; to list local nids on each.&lt;/p&gt;

&lt;p&gt;3) Make the nodes &quot;discover&quot; each other: &quot;lnetctl discover &amp;lt;peer nid&amp;gt;&quot;&lt;/p&gt;

&lt;p&gt;4) Copy the wrapper script to one of the nodes. Use the primary NIDs to fill in &quot;TO&quot; and &quot;FROM&quot;.&lt;/p&gt;

&lt;p&gt;5) Run the script&lt;/p&gt;
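&lt;p&gt;The preparation in steps 1 to 3 can be sketched as a small Python helper that only builds the command lines. The function name selftest_prep_commands is hypothetical, and the peer NID is just the one from the error message in this ticket, used as an example. Steps 4 and 5 depend on the wrapper script from the wiki page and are not covered.&lt;/p&gt;

```python
import shlex


def selftest_prep_commands(peer_nid):
    """Return the preparation commands from steps 1-3, in order."""
    return [
        ["modprobe", "lnet_selftest"],      # step 1: load the selftest module
        ["lnetctl", "net", "show"],         # step 2: list the local NIDs
        ["lnetctl", "discover", peer_nid],  # step 3: discover the peer
    ]


# Example peer NID taken from the error message; substitute your own.
commands = selftest_prep_commands("192.168.128.126@o2ib18")
for argv in commands:
    # Print only; run each line as root on both nodes, or pass argv to subprocess.run.
    print(shlex.join(argv))
```

&lt;p&gt;Printing the commands rather than executing them keeps the sketch safe to run on a machine without LNet installed.&lt;/p&gt;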

</comment>
                            <comment id="305108" author="mmarcini2" created="Mon, 21 Jun 2021 21:25:52 +0000"  >&lt;p&gt;The 305 kernel may be the first RHEL kernel with fmr removed...&lt;/p&gt;</comment>
                            <comment id="305234" author="dalessandro" created="Tue, 22 Jun 2021 20:55:33 +0000"  >&lt;p&gt;I am able to reproduce an EINVAL error internally. It looks like some of the post_send calls work, but the ones from the kiblnd_sd threads always fail. Here is the kprobe that I used to dump the return value of rvt_post_send():&lt;/p&gt;

&lt;p&gt;r:testprobe rdmavt:rvt_post_send $retval&lt;/p&gt;

&lt;p&gt;These threads always succeed:&lt;/p&gt;

&lt;p&gt;lst_t_00_24-7492 [039] d... 88520.827471: testprobe: (kiblnd_post_tx_locked+0x857/0xa50 [ko2iblnd] &amp;lt;- rvt_post_send) arg1=0x0&lt;/p&gt;

&lt;p&gt;kiblnd_connd-7547 [000] d... 88520.827888: testprobe: (ib_send_mad+0x235/0x420 [ib_core] &amp;lt;- rvt_post_send) arg1=0x0&lt;/p&gt;

&lt;p&gt;monitor_thread-7556 [032] d... 88521.503931: testprobe: (kiblnd_post_tx_locked+0x857/0xa50 [ko2iblnd] &amp;lt;- rvt_post_send) arg1=0x0&lt;/p&gt;

&lt;p&gt;These threads always fail:&lt;/p&gt;

&lt;p&gt;kiblnd_sd_00_01-7549 [014] d... 88520.827678: testprobe: (kiblnd_post_tx_locked+0x857/0xa50 [ko2iblnd] &amp;lt;- rvt_post_send) arg1=0xffffffea&lt;/p&gt;

&lt;p&gt;Note 0xffffffea decodes to -22 in decimal.&lt;/p&gt;
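&lt;p&gt;The decoding can be checked with a short Python sketch. The helper names decode_retval and describe are hypothetical; the sketch assumes the kprobe prints the return value as an unsigned 32-bit number and that the host uses the usual errno numbering, where 22 is EINVAL.&lt;/p&gt;

```python
import errno


def decode_retval(raw):
    """Reinterpret an unsigned 32-bit value as a signed integer."""
    return int.from_bytes(raw.to_bytes(4, "big"), "big", signed=True)


def describe(raw):
    """Render a probe return value as a signed number plus errno name."""
    value = decode_retval(raw)
    if value == 0:
        return "0 (success)"
    name = errno.errorcode.get(abs(value), "unknown")
    return f"{value} ({name})"


print(describe(0x0))         # value seen for the threads that succeed
print(describe(0xFFFFFFEA))  # value seen for the kiblnd_sd threads
```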

</comment>
                            <comment id="305467" author="mmarcini2" created="Thu, 24 Jun 2021 21:13:09 +0000"  >&lt;p&gt;The -EINVAL is happening because a post_send of an IB_WR_LOCAL_INV operation is failing. So this is an issue with the fastreg stuff, as I expected.&lt;/p&gt;

&lt;p&gt;The attached kprobes-off.sh, kprobes.sh, and &lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/39289/39289_trace1.txt&quot; title=&quot;trace1.txt attached to LU-14733&quot;&gt;trace1.txt&lt;/a&gt;&lt;/span&gt; contain the kprobes used and the tracing.&lt;/p&gt;

&lt;p&gt;I&apos;m trying to get more details now.&lt;/p&gt;

</comment>
                            <comment id="305468" author="mmarcini2" created="Thu, 24 Jun 2021 21:27:35 +0000"  >&lt;p&gt;I should point out that our upstream CI testing validates NFS/RDMA, iSER, and SRP use of the fast reg feature and has been passing without issue.&lt;/p&gt;</comment>
                            <comment id="305637" author="mmarcini2" created="Sat, 26 Jun 2021 19:17:13 +0000"  >&lt;p&gt;I have attached refined kprobes and trace2.txt.&lt;/p&gt;</comment>
                            <comment id="305638" author="mmarcini2" created="Sat, 26 Jun 2021 19:25:26 +0000"  >&lt;p&gt;Here is the invalidate code:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
&lt;span class=&quot;code-object&quot;&gt;int&lt;/span&gt; rvt_invalidate_rkey(struct rvt_qp *qp, u32 rkey)
{
&amp;lt;snip&amp;gt;
        mr = rcu_dereference(
                rkt-&amp;gt;table[(rkey &amp;gt;&amp;gt; (32 - dev-&amp;gt;dparms.lkey_table_size))]);
        &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (unlikely(!mr || mr-&amp;gt;lkey != rkey || qp-&amp;gt;ibqp.pd != mr-&amp;gt;pd)) &lt;span class=&quot;code-comment&quot;&gt;/* fails here: rkey != mr-&amp;gt;lkey */&lt;/span&gt;
                &lt;span class=&quot;code-keyword&quot;&gt;goto&lt;/span&gt; bail;
&amp;lt;snip&amp;gt;
bail:
        rcu_read_unlock();
        &lt;span class=&quot;code-keyword&quot;&gt;return&lt;/span&gt; -EINVAL;
}
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Here is an excerpt from the trace:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;*** focusing on keys that begin with 0x16f700 ***
     lst_t_00_21-7489  [011] d... 426264.174874: rvt_invalidate_rkey_p: (rvt_invalidate_rkey+0x0/0x60 [rdmavt]) qpn=0xe0 pd=0xffffa0ab4831b780 rkey=0x16f7000
     lst_t_00_21-7489  [011] d.Z. 426264.174879: rvt_invalidate_rkey_1: (rvt_invalidate_rkey+0x26/0x60 [rdmavt]) mr=0xffffc0dfca0d3b78 table_ptr=0xffffc0dfca0d3b78
     lst_t_00_21-7489  [011] d... 426264.174881: rvt_invalidate_rkey_2: (rvt_invalidate_rkey+0x29/0x60 [rdmavt]) rkey=0x16f7000 mr_lkey=0x16f7000
     lst_t_00_21-7489  [011] d... 426264.174882: rvt_invalidate_rkey: (rvt_post_send+0x525/0x800 [rdmavt] &amp;lt;- rvt_invalidate_rkey) ret=0x0
*** 0x16f7000 has been invalidated ***
     lst_t_00_21-7489  [011] d... 426264.174884: rvt_fast_reg_mr_p: (rvt_fast_reg_mr+0x0/0x70 [rdmavt]) qpn=0xe0 ibmr=0xffffa0a316a16a00 pd=0xffffa0ab4831b780 key=0x16f7001
     lst_t_00_21-7489  [011] d... 426264.174886: rvt_fast_reg_mr: (rvt_post_send+0x1a3/0x800 [rdmavt] &amp;lt;- rvt_fast_reg_mr) ret=0x0
*** 0x16f7001 has been written into ibmr.mr keys because of the above fast reg ***
*** then an invalidate is posted for 0x16f7000
 kiblnd_sd_00_03-7551  [015] d... 426264.175096: rvt_invalidate_rkey_p: (rvt_invalidate_rkey+0x0/0x60 [rdmavt]) qpn=0xe0 pd=0xffffa0ab4831b780 rkey=0x16f7000
 kiblnd_sd_00_03-7551  [015] d.Z. 426264.175100: rvt_invalidate_rkey_1: (rvt_invalidate_rkey+0x26/0x60 [rdmavt]) mr=0xffffc0dfca0d3b78 table_ptr=0xffffc0dfca0d3b78
*** the key in the mr is the one fastreg&apos;ed from 426264.174884
 kiblnd_sd_00_03-7551  [015] d... 426264.175101: rvt_invalidate_rkey_2: (rvt_invalidate_rkey+0x29/0x60 [rdmavt]) rkey=0x16f7000 mr_lkey=0x16f7001
 kiblnd_sd_00_03-7551  [015] d... 426264.175103: rvt_invalidate_rkey: (rvt_post_send+0x525/0x800 [rdmavt] &amp;lt;- rvt_invalidate_rkey) ret=0xffffffea
 kiblnd_sd_00_03-7551  [015] d.Z. 426264.175104: rvt_post_send_err1: (rvt_post_send+0x475/0x800 [rdmavt]) wr=0xffffa0a2e7da7ed0 wr_opcode=7 err=-22
*** and the post send fails ***
 kiblnd_sd_00_03-7551  [015] d... 426264.175105: rvt_post_send: (kiblnd_post_tx_locked+0x857/0xa50 [ko2iblnd] &amp;lt;- rvt_post_send) ret=0xffffffea
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
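&lt;p&gt;The sequence in the trace above can be modelled with a minimal Python sketch. It is a simplification and not the rdmavt code: Mregion, slot, and the table dictionary are hypothetical stand-ins, LKEY_TABLE_SIZE = 16 is an assumed value, and fast_reg_mr only rewrites the stored key. The key values are the ones from the trace.&lt;/p&gt;

```python
import errno

LKEY_TABLE_SIZE = 16  # assumed number of index bits; the real value is a driver parameter


class Mregion:
    """Stand-in for struct rvt_mregion: only the fields the check reads."""

    def __init__(self, lkey, pd):
        self.lkey = lkey
        self.pd = pd
        self.lkey_invalid = False


def slot(key):
    # Same index as shifting the key right by (32 - lkey_table_size) bits.
    return key // 2 ** (32 - LKEY_TABLE_SIZE)


def invalidate_rkey(table, qp_pd, rkey):
    """Mimic the check quoted above: the key stored in the region must match."""
    if rkey == 0:
        return -errno.EINVAL
    mr = table.get(slot(rkey))
    if mr is None or mr.lkey != rkey or mr.pd != qp_pd:
        return -errno.EINVAL
    mr.lkey_invalid = True
    return 0


def fast_reg_mr(table, key):
    """Simplified fast registration: store the new key in the region."""
    mr = table[slot(key)]
    mr.lkey = key
    mr.lkey_invalid = False
    return 0


pd = "pd0"  # placeholder for the protection domain pointer
table = {slot(0x16F7000): Mregion(0x16F7000, pd)}
results = [
    invalidate_rkey(table, pd, 0x16F7000),  # first invalidate: keys match
    fast_reg_mr(table, 0x16F7001),          # region now holds the bumped key
    invalidate_rkey(table, pd, 0x16F7000),  # stale key: rejected
]
print(results)
```

&lt;p&gt;Both keys map to the same table slot because they differ only in the low byte, so the second invalidate finds the region but fails the key comparison, as in the kiblnd_sd_00_03 lines of the trace.&lt;/p&gt;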
&lt;p&gt;It looks to me like Lustre is losing track of the keys for a particular MR?&lt;/p&gt;</comment>
                            <comment id="305714" author="ofaaland" created="Mon, 28 Jun 2021 16:12:03 +0000"  >&lt;p&gt;Thanks, Mike and Dennis.&lt;/p&gt;

&lt;p&gt;Serguei, please let us know when you (or someone) are working on this new information Cornelis came up with.&#160; Thank you.&lt;/p&gt;</comment>
                            <comment id="305741" author="mmarcini2" created="Mon, 28 Jun 2021 19:33:18 +0000"  >&lt;p&gt;It looks like Lustre is sending a gratuitous invalidate because of this code fragment:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
                /* There appears to be a bug in MLX5 code where you must
                 * invalidate the rkey of a &lt;span class=&quot;code-keyword&quot;&gt;new&lt;/span&gt; FastReg pool before first
                 * using it. Thus, I am marking the FRD invalid here. */
                frd-&amp;gt;frd_valid = &lt;span class=&quot;code-keyword&quot;&gt;false&lt;/span&gt;;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;This is not wrong, but it is different from what other ULPs do.&lt;/p&gt;

&lt;p&gt;The following code is then executed before any fast reg has happened:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
                                &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (!frd-&amp;gt;frd_valid) {
                                        struct ib_rdma_wr *inv_wr;
                                        __u32 key = is_rx ? mr-&amp;gt;rkey : mr-&amp;gt;lkey;

                                        inv_wr = &amp;amp;frd-&amp;gt;frd_inv_wr;
                                        memset(inv_wr, 0, sizeof(*inv_wr));

                                        inv_wr-&amp;gt;wr.opcode = IB_WR_LOCAL_INV;
                                        inv_wr-&amp;gt;wr.wr_id  = IBLND_WID_MR;
                                        inv_wr-&amp;gt;wr.ex.invalidate_rkey = key;

                                        &lt;span class=&quot;code-comment&quot;&gt;/* Bump the key */&lt;/span&gt;
                                        key = ib_inc_rkey(key); 
                                        &lt;span class=&quot;code-comment&quot;&gt;/* NOTE: this updates the keys in ib_mr, but not the rvt_mregion keys */&lt;/span&gt;
                                        ib_update_fast_reg_key(mr, key);
                                }
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The following code uses &lt;b&gt;struct rvt_mregion keys&lt;/b&gt; to validate and doesn&apos;t see the above key change in the ibmr and fails the invalidate.&#160; &#160;The rkey is correct, but the mr-&amp;gt;lkey hasn&apos;t changed to match until the next&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
&lt;span class=&quot;code-object&quot;&gt;int&lt;/span&gt; rvt_invalidate_rkey(struct rvt_qp *qp, u32 rkey)
{
        struct rvt_dev_info *dev = ib_to_rvt(qp-&amp;gt;ibqp.device);
        struct rvt_lkey_table *rkt = &amp;amp;dev-&amp;gt;lkey_table;
        struct rvt_mregion *mr;

        &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (rkey == 0)
                &lt;span class=&quot;code-keyword&quot;&gt;return&lt;/span&gt; -EINVAL;

        rcu_read_lock();
        mr = rcu_dereference(
                rkt-&amp;gt;table[(rkey &amp;gt;&amp;gt; (32 - dev-&amp;gt;dparms.lkey_table_size))]);
        &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (unlikely(!mr || mr-&amp;gt;lkey != rkey || qp-&amp;gt;ibqp.pd != mr-&amp;gt;pd))
                &lt;span class=&quot;code-keyword&quot;&gt;goto&lt;/span&gt; bail;

        atomic_set(&amp;amp;mr-&amp;gt;lkey_invalid, 1);
        rcu_read_unlock();
        &lt;span class=&quot;code-keyword&quot;&gt;return&lt;/span&gt; 0;

bail:
        rcu_read_unlock();
        &lt;span class=&quot;code-keyword&quot;&gt;return&lt;/span&gt; -EINVAL;
}
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;I&apos;m working on the following fix:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;diff --git a/drivers/infiniband/sw/rdmavt/mr.c b/drivers/infiniband/sw/rdmavt/mr.c
index 601d18dd..528727f 100644
--- a/drivers/infiniband/sw/rdmavt/mr.c
+++ b/drivers/infiniband/sw/rdmavt/mr.c
@@ -691,6 +691,7 @@ int rvt_invalidate_rkey(struct rvt_qp *qp, u32 rkey)
        struct rvt_dev_info *dev = ib_to_rvt(qp-&amp;gt;ibqp.device);
        struct rvt_lkey_table *rkt = &amp;amp;dev-&amp;gt;lkey_table;
        struct rvt_mregion *mr;
+       struct rvt_mr *rmr;

        if (rkey == 0)
                return -EINVAL;
@@ -698,7 +699,11 @@ int rvt_invalidate_rkey(struct rvt_qp *qp, u32 rkey)
        rcu_read_lock();
        mr = rcu_dereference(
                rkt-&amp;gt;table[(rkey &amp;gt;&amp;gt; (32 - dev-&amp;gt;dparms.lkey_table_size))]);
-       if (unlikely(!mr || mr-&amp;gt;lkey != rkey || qp-&amp;gt;ibqp.pd != mr-&amp;gt;pd))
+       if (unlikely(!mr || qp-&amp;gt;ibqp.pd != mr-&amp;gt;pd))
+               goto bail;
+       /* isolate parent */
+       rmr = container_of(mr, struct rvt_mr, mr);
+       if (rmr-&amp;gt;ibmr.type != IB_MR_TYPE_MEM_REG || rmr-&amp;gt;ibmr.rkey != rkey)
                goto bail;

        atomic_set(&amp;amp;mr-&amp;gt;lkey_invalid, 1);
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="305747" author="ssmirnov" created="Mon, 28 Jun 2021 22:28:10 +0000"  >&lt;p&gt;Hi,&lt;/p&gt;

&lt;p&gt;To me this looks like a very good candidate to fix the issue. Thanks for taking the time to look into this!&lt;/p&gt;

&lt;p&gt;Thanks,&lt;/p&gt;

&lt;p&gt;Serguei.&lt;/p&gt;</comment>
                            <comment id="305777" author="mmarcini2" created="Tue, 29 Jun 2021 13:16:54 +0000"  >&lt;p&gt;&amp;gt; To me this looks like a very good candidate to fix the issue.&lt;/p&gt;

&lt;p&gt;Another potential fix is to delete the MLX5 hack.&#160; &#160;That should work as well.&lt;/p&gt;</comment>
                            <comment id="305788" author="mmarcini2" created="Tue, 29 Jun 2021 15:01:18 +0000"  >&lt;p&gt;The patch didn&apos;t work.&lt;/p&gt;

&lt;p&gt;I need to do more analysis.&lt;/p&gt;</comment>
                            <comment id="305797" author="ofaaland" created="Tue, 29 Jun 2021 15:39:07 +0000"  >&lt;p&gt;Hi Serguei,&lt;/p&gt;

&lt;p&gt;Do you (or Amir, or anyone else you have easy access to) know if the MLX issue that prompted that hack has been fixed?  The JIRA issue was  &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-8752&quot; title=&quot;mlx5_warn:mlx5_0:dump_cqe:257:&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-8752&quot;&gt;&lt;del&gt;LU-8752&lt;/del&gt;&lt;/a&gt; and the commit was:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;commit 783428b60a98874b4783f8da48c66019d68d84d6
Author: Doug Oucharek &amp;lt;doug.s.oucharek@intel.com&amp;gt;
Date: &#160; Mon Dec 12 09:31:37 2016 -0800


&#160; &#160; LU-8752 lnet: Stop MLX5 triggering a dump_cqe
&#160;&#160; &#160;
&#160; &#160; We have found that MLX5 will trigger a dump_cqe if we don&apos;t
&#160; &#160; invalidate the rkey on a newly alloated MR for FastReg usage.
&#160;&#160; &#160;
&#160; &#160; This fix just tags the MR as invalid on its creation if we are
&#160; &#160; using FastReg and that will force it to do an invalidate of the
&#160; &#160; rkey on first usage.
&#160;&#160; &#160;
...

diff --git a/lnet/klnds/o2iblnd/o2iblnd.c b/lnet/klnds/o2iblnd/o2iblnd.c
index e919008d44..ee5a01f9fa 100644
--- a/lnet/klnds/o2iblnd/o2iblnd.c
+++ b/lnet/klnds/o2iblnd/o2iblnd.c
@@ -1536,7 +1536,10 @@ static int kiblnd_alloc_freg_pool(kib_fmr_poolset_t *fps, kib_fmr_pool_t *fpo)
&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; goto out_middle;
&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; }
 
- &#160; &#160; &#160; &#160; &#160; &#160; &#160; frd-&amp;gt;frd_valid = true;
+ &#160; &#160; &#160; &#160; &#160; &#160; &#160; /* There appears to be a bug in MLX5 code where you must
+&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; * invalidate the rkey of a new FastReg pool before first
+&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; * using it. Thus, I am marking the FRD invalid here. */
+ &#160; &#160; &#160; &#160; &#160; &#160; &#160; frd-&amp;gt;frd_valid = false;
 
&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; list_add_tail(&amp;amp;frd-&amp;gt;frd_list, &amp;amp;fpo-&amp;gt;fast_reg.fpo_pool_list);
&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; fpo-&amp;gt;fast_reg.fpo_pool_size++;

 &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="305839" author="ofaaland" created="Wed, 30 Jun 2021 01:50:52 +0000"  >&lt;p&gt;I reverted&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;LU-8752 lnet: Stop MLX5 triggering a dump_cqe&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;and tested. Initially I had nonzero bandwidth, which is different from what I recall seeing before. After a few seconds the recorded bandwidth went to 0. lctl dk shows:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000800:00020000:16.0F:1625017325.639837:0:514295:0:(o2iblnd_cb.c:1031:kiblnd_post_tx_locked()) Error -22 posting transmit to 192.168.128.3@o2ib18
00000800:00020000:6.0F:1625017325.639911:0:514296:0:(o2iblnd_cb.c:1031:kiblnd_post_tx_locked()) Error -22 posting transmit to 192.168.128.3@o2ib18
00000800:00000100:6.0:1625017325.639916:0:514296:0:(o2iblnd_cb.c:2101:kiblnd_close_conn_locked()) Closing conn to 192.168.128.3@o2ib18: error -22(waiting)
00000400:00000100:8.0F:1625017325.639943:0:514293:0:(rpc.c:1418:srpc_lnet_ev_handler()) LNet event status -22 type 5, RPC errors 1
00000400:00000100:6.0:1625017325.639947:0:514296:0:(rpc.c:1418:srpc_lnet_ev_handler()) LNet event status -22 type 5, RPC errors 2
00000001:00020000:7.0F:1625017325.639951:0:514856:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk READ failed for RPC from 12345-192.168.128.3@o2ib18: -22
00000001:00020000:53.0:1625017325.639966:0:514858:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk READ failed for RPC from 12345-192.168.128.3@o2ib18: -22
00000400:00000100:53.0:1625017325.639968:0:514858:0:(rpc.c:905:srpc_server_rpc_done()) Server RPC 000000007f1b43fb done: service brw_test, peer 12345-192.168.128.3@o2ib18, status SWI_STATE_BULK_STARTED:-5
00000001:00020000:53.0:1625017325.639971:0:514858:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer to 12345-192.168.128.3@o2ib18 has failed: -5
00000800:00000100:17.0:1625017325.640281:0:514294:0:(o2iblnd_cb.c:3734:kiblnd_complete()) RDMA (tx: 0000000005d7d115) failed: 5
00000800:00020000:6.0:1625017325.640283:0:514296:0:(o2iblnd_cb.c:1031:kiblnd_post_tx_locked()) Error -22 posting transmit to 192.168.128.3@o2ib18
00000800:00000100:6.0:1625017325.640284:0:514296:0:(o2iblnd_cb.c:2101:kiblnd_close_conn_locked()) Closing conn to 192.168.128.3@o2ib18: error -22(waiting)
00000400:00000100:8.0:1625017325.640285:0:514293:0:(lib-msg.c:698:lnet_attempt_msg_resend()) msg 0@&amp;lt;0:0&amp;gt;-&amp;gt;192.168.128.3@o2ib18 exceeded retry count 0
00000800:00020000:17.0:1625017325.640286:0:514294:0:(o2iblnd_cb.c:1031:kiblnd_post_tx_locked()) Error -22 posting transmit to 192.168.128.3@o2ib18
00000400:00000100:8.0:1625017325.640287:0:514293:0:(rpc.c:1418:srpc_lnet_ev_handler()) LNet event status -5 type 5, RPC errors 3
00000800:00000100:6.0:1625017325.640288:0:514296:0:(o2iblnd_cb.c:3734:kiblnd_complete()) RDMA (tx: 0000000029f85d1a) failed: 5
00000400:00000100:17.0:1625017325.640290:0:514294:0:(lib-msg.c:698:lnet_attempt_msg_resend()) msg 0@&amp;lt;0:0&amp;gt;-&amp;gt;192.168.128.3@o2ib18 exceeded retry count 0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="305871" author="mmarcini2" created="Wed, 30 Jun 2021 12:22:45 +0000"  >&lt;p&gt;I do know that the proposed patch is just wrong.&#160; &#160; Testing against the &lt;b&gt;struct rvt_mregion lkey&lt;/b&gt; &lt;em&gt;should&lt;/em&gt; be correct.&lt;/p&gt;

&lt;p&gt;I need to add another kprobe with the original kernel to look at the keys for all ib_alloc_mr() allocations from birth and track from that point to failures.&lt;/p&gt;

&lt;p&gt;A portion of the comment says:&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;This fix just tags the MR as invalid on its creation if we are&lt;br/&gt;
 using FastReg and that will force it to do an invalidate of the&lt;br/&gt;
 rkey on first usage.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;The implication is that the invalidate only happens on first MR use, but I don&apos;t see anything that sets &lt;b&gt;frd_valid&lt;/b&gt; to true?&#160; &#160;It looks like it will happen all the time for all MRs.&lt;/p&gt;</comment>
                            <comment id="305935" author="mmarcini2" created="Wed, 30 Jun 2021 18:50:15 +0000"  >&lt;p&gt;It is starting to look to me like there is a concurrency issue where somehow the old key is subsequently passed to an invalidate.&lt;/p&gt;

&lt;p&gt;Here is an invalidate for rkey 0x100:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;     lst_t_00_09-3288  [046] d...  6592.029697: rvt_invalidate_rkey_p: (rvt_invalidate_rkey+0x0/0x60 [rdmavt]) qpn=0x8 pd=0xffff9ebe1a48a200 rkey=0x100
     lst_t_00_09-3288  [046] d...  6592.029700: rvt_invalidate_rkey_1: (rvt_invalidate_rkey+0x29/0x60 [rdmavt]) table_ptr=0xffffada64a4d1000
     lst_t_00_09-3288  [046] d.Z.  6592.029702: rvt_invalidate_rkey_2: (rvt_invalidate_rkey+0x36/0x60 [rdmavt]) mr_lkey=0x100 ib_mr_type=0 ib_mr_rkey=0x101 ib_mr_lkey=0x101
     lst_t_00_09-3288  [046] d...  6592.029705: rvt_invalidate_rkey: (rvt_post_send+0x525/0x800 [rdmavt] &amp;lt;- rvt_invalidate_rkey) ret=0x0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Here is the one that fails:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;kiblnd_sd_00_00-3359  [036] d...  6592.029947: rvt_invalidate_rkey_p: (rvt_invalidate_rkey+0x0/0x60 [rdmavt]) qpn=0x8 pd=0xffff9ebe1a48a200 rkey=0x100
     lst_t_00_13-3292  [050] d...  6592.029947: rvt_lkey_ok_p: (rvt_lkey_ok+0x0/0x380 [rdmavt]) pd=0xffff9ebe1a48a200 sge_lkey=0x0
     lst_t_00_18-3297  [001] d.Z.  6592.029948: rvt_invalidate_rkey_2: (rvt_invalidate_rkey+0x36/0x60 [rdmavt]) mr_lkey=0x60700 ib_mr_type=0 ib_mr_rkey=0x60701 ib_mr_lkey=0x60701
     lst_t_00_13-3292  [050] d...  6592.029948: rvt_lkey_ok: (rvt_post_send+0x2dc/0x800 [rdmavt] &amp;lt;- rvt_lkey_ok) ret=0x1
 kiblnd_sd_00_00-3359  [036] d...  6592.029950: rvt_invalidate_rkey_1: (rvt_invalidate_rkey+0x29/0x60 [rdmavt]) table_ptr=0xffffada64a4d1000
     lst_t_00_18-3297  [001] d...  6592.029951: rvt_invalidate_rkey: (rvt_post_send+0x525/0x800 [rdmavt] &amp;lt;- rvt_invalidate_rkey) ret=0x0
 kiblnd_sd_00_00-3359  [036] d.Z.  6592.029952: rvt_invalidate_rkey_2: (rvt_invalidate_rkey+0x36/0x60 [rdmavt]) mr_lkey=0x101 ib_mr_type=0 ib_mr_rkey=0x101 ib_mr_lkey=0x101
     lst_t_00_13-3292  [050] dN..  6592.029952: rvt_post_send: (kiblnd_post_tx_locked+0x857/0xa50 [ko2iblnd] &amp;lt;- rvt_post_send) ret=0x0
     lst_t_00_18-3297  [001] d...  6592.029952: rvt_fast_reg_mr_p: (rvt_fast_reg_mr+0x0/0x70 [rdmavt]) qpn=0x4 ibmr=0xffff9ebe3fd0e600 pd=0xffff9ebe1a48a200 key=0x60701
          &amp;lt;idle&amp;gt;-0     [010] d.h.  6592.029952: rvt_lkey_ok_p: (rvt_lkey_ok+0x0/0x380 [rdmavt]) pd=0xffff9ebe1a48a200 sge_lkey=0x0
          &amp;lt;idle&amp;gt;-0     [010] d.h.  6592.029953: rvt_lkey_ok: (rvt_get_rwqe+0x2c8/0x450 [rdmavt] &amp;lt;- rvt_lkey_ok) ret=0x1
     lst_t_00_12-3291  [049] dN..  6592.029953: rvt_invalidate_rkey_p: (rvt_invalidate_rkey+0x0/0x60 [rdmavt]) qpn=0xa pd=0xffff9ebe1a48a200 rkey=0x70800
     lst_t_00_18-3297  [001] d...  6592.029954: rvt_fast_reg_mr: (rvt_post_send+0x1a3/0x800 [rdmavt] &amp;lt;- rvt_fast_reg_mr) ret=0x0
 kiblnd_sd_00_00-3359  [036] d...  6592.029955: rvt_invalidate_rkey: (rvt_post_send+0x525/0x800 [rdmavt] &amp;lt;- rvt_invalidate_rkey) ret=0xffffffea
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Note the different CPU and task name.&lt;/p&gt;</comment>
                            <comment id="306384" author="ofaaland" created="Tue, 6 Jul 2021 21:05:55 +0000"  >&lt;p&gt;Hi Mike and Serguei,&lt;/p&gt;

&lt;p&gt;Do you have any update on this issue?&lt;/p&gt;

&lt;p&gt;I&apos;m expecting to get my OPA test cluster back again today, at which point I plan to more thoroughly compare the behavior of lnet_selftest both with and without the patch to&#160;invalidate the rkey before the mr is used.&lt;/p&gt;

&lt;p&gt;thanks,&lt;/p&gt;

&lt;p&gt;Olaf&lt;/p&gt;</comment>
                            <comment id="306391" author="mmarcini2" created="Tue, 6 Jul 2021 21:59:14 +0000"  >&lt;p&gt;There are two issues with the existing code and hfi1.&lt;/p&gt;

&lt;p&gt;First there is what I suspect is a misplaced racy assignment:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
void
kiblnd_fmr_pool_unmap(struct kib_fmr *fmr, &lt;span class=&quot;code-object&quot;&gt;int&lt;/span&gt; status)
{
&amp;lt;snip&amp;gt; This code returns the frd to the list
                &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (frd) {
                        frd-&amp;gt;frd_valid = &lt;span class=&quot;code-keyword&quot;&gt;false&lt;/span&gt;;
                        spin_lock(&amp;amp;fps-&amp;gt;fps_lock);
                        list_add_tail(&amp;amp;frd-&amp;gt;frd_list, &amp;amp;fpo-&amp;gt;fast_reg.fpo_pool_list);
                        spin_unlock(&amp;amp;fps-&amp;gt;fps_lock);
                        fmr-&amp;gt;fmr_frd = NULL; &amp;lt;- I think &lt;span class=&quot;code-keyword&quot;&gt;this&lt;/span&gt; should be before add
                }
&amp;lt;snip&amp;gt; I suspect the NULL should be before adding to the list to avoid a race with the map
&amp;lt;snip&amp;gt; pulling the descriptor from the list
}
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The mapping looks like:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
&lt;span class=&quot;code-object&quot;&gt;int&lt;/span&gt; kiblnd_fmr_pool_map(struct kib_fmr_poolset *fps, struct kib_tx *tx,
                        struct kib_rdma_desc *rd, u32 nob, u64 iov,
                        struct kib_fmr *fmr)
{
&amp;lt;snip&amp;gt;This code dequeues the kib_fast_reg_descriptor from a list
                                struct kib_fast_reg_descriptor *frd;
#ifdef HAVE_IB_MAP_MR_SG
                                struct ib_reg_wr *wr;
                                &lt;span class=&quot;code-object&quot;&gt;int&lt;/span&gt; n;
#&lt;span class=&quot;code-keyword&quot;&gt;else&lt;/span&gt;
                                struct ib_rdma_wr *wr;
                                struct ib_fast_reg_page_list *frpl;
#endif
                                struct ib_mr *mr;

                                frd = list_first_entry(&amp;amp;fpo-&amp;gt;fast_reg.fpo_pool_list,
                                                        struct kib_fast_reg_descriptor,
                                                        frd_list);
                                list_del(&amp;amp;frd-&amp;gt;frd_list);
                                spin_unlock(&amp;amp;fps-&amp;gt;fps_lock);
&amp;lt;snip&amp;gt; This code sets up the invalidate operation embedded in the kib_fast_reg_descriptor
                                &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (!frd-&amp;gt;frd_valid) {
                                        struct ib_rdma_wr *inv_wr;
                                        __u32 key = is_rx ? mr-&amp;gt;rkey : mr-&amp;gt;lkey;

                                        inv_wr = &amp;amp;frd-&amp;gt;frd_inv_wr;
                                        memset(inv_wr, 0, sizeof(*inv_wr));

                                        inv_wr-&amp;gt;wr.opcode = IB_WR_LOCAL_INV;
                                        inv_wr-&amp;gt;wr.wr_id  = IBLND_WID_MR;
                                        inv_wr-&amp;gt;wr.ex.invalidate_rkey = key;

                                        &lt;span class=&quot;code-comment&quot;&gt;/* Bump the key */&lt;/span&gt;
                                        key = ib_inc_rkey(key);
                                        ib_update_fast_reg_key(mr, key);
                                }
&amp;lt;snip&amp;gt; The code goes on to register the pages in ib_map_mr_sg and sets up the
&amp;lt;snip&amp;gt; frd_fastreg_wr embedded in the kib_fast_reg_descriptor
&amp;lt;snip&amp;gt;
&amp;lt;snip&amp;gt; The code then fuses the kib_fast_reg_descriptor with the kib_fmr in the kib_tx
                                fmr-&amp;gt;fmr_key  = is_rx ? mr-&amp;gt;rkey : mr-&amp;gt;lkey;
                                fmr-&amp;gt;fmr_frd  = frd; &amp;lt;--- here
                                fmr-&amp;gt;fmr_pool = fpo;
                                &lt;span class=&quot;code-keyword&quot;&gt;return&lt;/span&gt; 0;
&amp;lt;snip&amp;gt; At &lt;span class=&quot;code-keyword&quot;&gt;this&lt;/span&gt; point no posts have been done and they are deferred to
&amp;lt;snip&amp;gt; kiblnd_post_tx_locked().
}
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The post send looks like:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
&lt;span class=&quot;code-keyword&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;code-object&quot;&gt;int&lt;/span&gt;
kiblnd_post_tx_locked(struct kib_conn *conn, struct kib_tx *tx, &lt;span class=&quot;code-object&quot;&gt;int&lt;/span&gt; credit)
__must_hold(&amp;amp;conn-&amp;gt;ibc_lock)
{
&amp;lt;snip&amp;gt; This code sees frd from above and prepends WRs from the kib_fast_reg_descriptor
                &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (frd != NULL) {
                        &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (!frd-&amp;gt;frd_valid) {
                                wr = &amp;amp;frd-&amp;gt;frd_inv_wr.wr;
                                wr-&amp;gt;next = &amp;amp;frd-&amp;gt;frd_fastreg_wr.wr;
                        } &lt;span class=&quot;code-keyword&quot;&gt;else&lt;/span&gt; {
                                wr = &amp;amp;frd-&amp;gt;frd_fastreg_wr.wr;
                        }
                        frd-&amp;gt;frd_fastreg_wr.wr.next = &amp;amp;tx-&amp;gt;tx_wrq[0].wr;
                        will_post = &lt;span class=&quot;code-keyword&quot;&gt;true&lt;/span&gt;;
                }
&amp;lt;snip&amp;gt; The post is here
               &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (lnet_send_error_simulation(tx-&amp;gt;tx_lntmsg[0], &amp;amp;tx-&amp;gt;tx_hstatus))
                        rc = -EINVAL;
                &lt;span class=&quot;code-keyword&quot;&gt;else&lt;/span&gt;
#ifdef HAVE_IB_POST_SEND_RECV_CONST
                        rc = ib_post_send(conn-&amp;gt;ibc_cmid-&amp;gt;qp, wr,
                                          (&lt;span class=&quot;code-keyword&quot;&gt;const&lt;/span&gt; struct ib_send_wr **)&amp;amp;bad);
#&lt;span class=&quot;code-keyword&quot;&gt;else&lt;/span&gt;
                        rc = ib_post_send(conn-&amp;gt;ibc_cmid-&amp;gt;qp, wr, &amp;amp;bad);
#endif

        conn-&amp;gt;ibc_last_send = ktime_get();

        &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (rc == 0)
                &lt;span class=&quot;code-keyword&quot;&gt;return&lt;/span&gt; 0; &amp;lt;-- &lt;span class=&quot;code-keyword&quot;&gt;return&lt;/span&gt; here with the WRs from the mapping complete
&amp;lt;snip&amp;gt; At &lt;span class=&quot;code-keyword&quot;&gt;this&lt;/span&gt; point there is a tx with everything ready to go, BUT
&amp;lt;snip&amp;gt; all posts using the tx until it is unmapped will send the invalidate and fast reg
&amp;lt;snip&amp;gt; and the invalidate has the OLD key forever since nothing has been done to
&amp;lt;snip&amp;gt; remember and disable the invalid WRs &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; the next post using the tx.
}
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Every time the &lt;b&gt;kib_tx&lt;/b&gt; is used after the first, there will be a superfluous &lt;b&gt;OLD&lt;/b&gt; invalidate followed by an &lt;b&gt;OLD&lt;/b&gt; fast reg that sets the key to what it already is.&#160;&lt;/p&gt;

&lt;p&gt;For hfi1 the second invalidate will get a &lt;b&gt;-EINVAL&lt;/b&gt; return code because the keys don&apos;t match.&lt;/p&gt;
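&lt;p&gt;The failure sequence can be replayed in a toy model. This is only a sketch: toy_mr, toy_inc_key, toy_invalidate and toy_post are hypothetical names, the pd check is omitted, and the key bump is written with plain arithmetic. It assumes the fast reg installs the bumped key as the value later invalidates are checked against, as the trace earlier in this thread suggests.&lt;/p&gt;

```c
/* Toy model only: names and layout are illustrative, not the rdmavt or o2iblnd code. */
typedef unsigned int u32;

#define TOY_EINVAL 22

struct toy_mr {
        u32 lkey;       /* key that invalidates are checked against */
};

/* Bump the low 8 bits of a key and keep the upper 24 bits unchanged. */
u32 toy_inc_key(u32 key)
{
        return (key / 256u) * 256u + (key + 1u) % 256u;
}

/* Simplified key check: fail when the supplied key is zero or stale. */
int toy_invalidate(const struct toy_mr *mr, u32 rkey)
{
        if (rkey == 0 || mr->lkey != rkey)
                return -TOY_EINVAL;
        return 0;
}

/* One post of a tx whose invalidate and fast reg WRs were built once at map time. */
int toy_post(struct toy_mr *mr, u32 inv_key, u32 reg_key)
{
        int rc = toy_invalidate(mr, inv_key);

        if (rc != 0)
                return rc;
        mr->lkey = reg_key;     /* the fast reg installs the bumped key */
        return 0;
}
```

&lt;p&gt;Starting from lkey 0x100, posting twice with the same prebuilt keys (invalidate 0x100, fast reg 0x101) succeeds the first time and returns -22 the second time, which is the ret=0xffffffea seen in the earlier trace.&lt;/p&gt;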

&lt;p&gt;There are two possible fixes:&lt;/p&gt;
&lt;ol&gt;
	&lt;li&gt;Add state to the&#160;kib_fast_reg_descriptor that tracks whether the fast reg WRs have been posted, and patch the post logic to post the WRs only if they have not already been posted.&lt;/li&gt;
	&lt;li&gt;Make the rdmavt invalidate code allow the OLD invalidate and fast reg by only comparing the key bits above 7.&lt;/li&gt;
&lt;/ol&gt;
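&lt;p&gt;Both options can be sketched in a few lines. This is a rough illustration under stated assumptions, not either patch: toy_frd, its posted flag, toy_wrs_to_prepend, toy_unmap and toy_keys_match are hypothetical names, and where the real code would set or clear such state is not shown here.&lt;/p&gt;

```c
/* Toy model only: hypothetical names, not the o2iblnd or rdmavt structures. */
typedef unsigned int u32;

struct toy_frd {
        int valid;      /* plays the role of frd_valid */
        int posted;     /* hypothetical: registration WRs already posted */
};

/*
 * Option 1: decide how many registration WRs to prepend to this post.
 * 2 = invalidate + fast reg, 1 = fast reg only, 0 = none (tx reused).
 */
int toy_wrs_to_prepend(struct toy_frd *frd)
{
        int n;

        if (frd->posted)
                return 0;
        n = frd->valid ? 1 : 2;
        frd->posted = 1;
        return n;
}

/* Returning the descriptor to the pool re-arms it for the next mapping. */
void toy_unmap(struct toy_frd *frd)
{
        frd->valid = 0;
        frd->posted = 0;
}

/* Option 2: compare only the key bits above bit 7. */
int toy_keys_match(u32 mr_key, u32 wr_key)
{
        return (mr_key >> 8) == (wr_key >> 8);
}
```

&lt;p&gt;With option 1, the first post of a freshly mapped descriptor prepends both WRs and every later post of the same tx prepends none until the descriptor is unmapped. With option 2, a stale key such as 0x100 still matches a current key of 0x101, while a key for a different MR such as 0x60701 does not.&lt;/p&gt;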


&lt;p&gt;I&apos;m about to attach the two patches for 1.&#160; &#160;The patch seems to fix the issue and the lnet_selftest works fine.&lt;/p&gt;

&lt;p&gt;I&apos;m getting ready to test the invalidate patch, but I should point out our current code works with SRP, iSer, and NFS RDMA as is.&lt;/p&gt;</comment>
                            <comment id="306443" author="mmarcini2" created="Wed, 7 Jul 2021 11:48:23 +0000"  >&lt;p&gt;&lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/39531/39531_linux-kernel-test.patch&quot; title=&quot;linux-kernel-test.patch attached to LU-14733&quot;&gt;linux-kernel-test.patch&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt;&#160;contains a potential upstream patch.&lt;/p&gt;

&lt;p&gt;Older versions attached were incorrect.&lt;/p&gt;</comment>
                            <comment id="306460" author="mmarcini2" created="Wed, 7 Jul 2021 14:00:58 +0000"  >&lt;p&gt;&lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/39531/39531_linux-kernel-test.patch&quot; title=&quot;linux-kernel-test.patch attached to LU-14733&quot;&gt;linux-kernel-test.patch&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt;&#160;solves the issue, at the expense of extra processing whenever the kib_tx is reused.&lt;/p&gt;</comment>
                            <comment id="306511" author="ssmirnov" created="Wed, 7 Jul 2021 18:03:03 +0000"  >&lt;p&gt;Mike,&lt;/p&gt;

&lt;p&gt;I tried the LND change you suggested on my setup using MOFED, and it&#160;appears to be fine.&#160;&lt;/p&gt;

&lt;p&gt;Please submit the LND patch for review, it will be easier to track with your ownership.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;/p&gt;

&lt;p&gt;Serguei.&lt;/p&gt;</comment>
                            <comment id="306512" author="mmarcini2" created="Wed, 7 Jul 2021 18:06:59 +0000"  >&lt;blockquote&gt;
&lt;p&gt;Please submit the LND patch for review, it will be easier to track with your ownership.&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Which one?  There are two patches.   One is a bug fix in unmap.   The other is the fix to ensure duplicate WRs are not sent after the initial posts.&lt;/p&gt;</comment>
                            <comment id="306513" author="pjones" created="Wed, 7 Jul 2021 18:06:59 +0000"  >&lt;p&gt;Mike&lt;/p&gt;

&lt;p&gt;If you need help getting your gerrit account sorted out then it&apos;s best to email me directly rather than using JIRA &lt;img class=&quot;emoticon&quot; src=&quot;https://jira.whamcloud.com/images/icons/emoticons/smile.png&quot; height=&quot;16&quot; width=&quot;16&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="306514" author="ssmirnov" created="Wed, 7 Jul 2021 18:16:29 +0000"  >&lt;p&gt;Mike,&#160;&lt;/p&gt;

&lt;p&gt;I was referring to the bug fix in the&#160;kiblnd_fmr_pool_unmap.&#160;&lt;/p&gt;

&lt;p&gt;I have no way of testing the upstream patch with OPA. Is it possible that the&#160;kiblnd_fmr_pool_unmap fix is sufficient by itself?&lt;/p&gt;

&lt;p&gt;Thanks,&lt;/p&gt;

&lt;p&gt;Serguei.&lt;/p&gt;</comment>
                            <comment id="306516" author="mmarcini2" created="Wed, 7 Jul 2021 18:30:08 +0000"  >&lt;blockquote&gt;&lt;p&gt;I have no way of testing the upstream patch with OPA. Is it possible that the kiblnd_fmr_pool_unmap fix is sufficient by itself?&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;I tested the rdmavt upstream fix by rebuilding the 8.4 GA kernel from srpm adding the upstream patch.&#160; &#160;The unmap fix is not sufficient by itself.&#160; I determined that early on.&#160;&lt;/p&gt;

&lt;p&gt;We either need to get an upstream patch pulled by Jason and point Red Hat to it, or use the second Lustre patch that avoids the superfluous WRs that trigger the EINVAL.&lt;/p&gt;

&lt;p&gt;It seems to me that a Lustre fix might be quicker?&lt;/p&gt;

&lt;p&gt;Do we need a separate Jira for the map/unmap race fix?&lt;/p&gt;</comment>
                            <comment id="306518" author="ssmirnov" created="Wed, 7 Jul 2021 18:49:03 +0000"  >&lt;p&gt;I don&apos;t think we need a separate ticket for the map/unmap race fix. Peter can correct me if he disagrees.&lt;/p&gt;</comment>
                            <comment id="306537" author="ofaaland" created="Wed, 7 Jul 2021 23:34:02 +0000"  >&lt;p&gt;Mike,&lt;/p&gt;

&lt;p&gt;Thanks for tracking this down.&#160;&#160;Do you know why we saw this first with RHEL 8.4, but not RHEL 8.3?&lt;/p&gt;</comment>
                            <comment id="306610" author="mmarcini2" created="Thu, 8 Jul 2021 13:46:42 +0000"  >&lt;blockquote&gt;
&lt;p&gt;Do you know why we saw this first with RHEL 8.4, but not RHEL 8.3?&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Upstream removed FMR as of 5.8 and it looks like the 8.4 RDMA code took that in.&lt;/p&gt;

&lt;p&gt;8.3 still has the FMR and it appears that this code will prefer it:&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
#ifdef HAVE_FMR_POOL_API
        &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (dev-&amp;gt;ibd_dev_caps &amp;amp; IBLND_DEV_CAPS_FMR_ENABLED)
                rc = kiblnd_alloc_fmr_pool(fps, fpo);
        &lt;span class=&quot;code-keyword&quot;&gt;else&lt;/span&gt;
#endif &lt;span class=&quot;code-comment&quot;&gt;/* HAVE_FMR_POOL_API */&lt;/span&gt;
                rc = kiblnd_alloc_freg_pool(fps, fpo, dev-&amp;gt;ibd_dev_caps);
        &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (rc)
                &lt;span class=&quot;code-keyword&quot;&gt;goto&lt;/span&gt; out_fpo;

        fpo-&amp;gt;fpo_deadline = ktime_get_seconds() + IBLND_POOL_DEADLINE;
        fpo-&amp;gt;fpo_owner = fps;
        *pp_fpo = fpo;

        &lt;span class=&quot;code-keyword&quot;&gt;return&lt;/span&gt; 0;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I don&apos;t see any override when both are available?&lt;/p&gt;</comment>
                            <comment id="306611" author="mmarcini2" created="Thu, 8 Jul 2021 13:49:06 +0000"  >&lt;p&gt;Here are the patches for Lustre:&lt;/p&gt;
&lt;ol&gt;
	&lt;li&gt;&lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/39552/39552_01-move_null.patch&quot; title=&quot;01-move_null.patch attached to LU-14733&quot;&gt;01-move_null.patch&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt;&lt;/li&gt;
	&lt;li&gt;&lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/39551/39551_02-post_state.patch&quot; title=&quot;02-post_state.patch attached to LU-14733&quot;&gt;02-post_state.patch&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;
</comment>
                            <comment id="306650" author="gerrit" created="Thu, 8 Jul 2021 18:43:54 +0000"  >&lt;p&gt;Mike Marciniszyn (mike.marciniszyn@cornelisnetworks.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/44189&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/44189&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14733&quot; title=&quot;brw_bulk_ready() BRW bulk READ failed for RPC from 12345-192.168.128.126@o2ib18: -103&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14733&quot;&gt;&lt;del&gt;LU-14733&lt;/del&gt;&lt;/a&gt; o2iblnd: Move racy NULL assignment&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 071c8c7faa322be1a42d424ff0d83c3113c80140&lt;/p&gt;</comment>
                            <comment id="306651" author="gerrit" created="Thu, 8 Jul 2021 18:43:55 +0000"  >&lt;p&gt;Mike Marciniszyn (mike.marciniszyn@cornelisnetworks.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/44190&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/44190&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14733&quot; title=&quot;brw_bulk_ready() BRW bulk READ failed for RPC from 12345-192.168.128.126@o2ib18: -103&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14733&quot;&gt;&lt;del&gt;LU-14733&lt;/del&gt;&lt;/a&gt; o2iblnd: Avoid double posting invalidate&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 778962a825eeee5b4754664a6baabf61e982aa38&lt;/p&gt;</comment>
                            <comment id="306655" author="mmarcini2" created="Thu, 8 Jul 2021 19:52:49 +0000"  >&lt;p&gt;Currently I have unit tested both bulk read and write with OPA cards and RHEL 8.4.&lt;/p&gt;

&lt;p&gt;I&apos;m trying to find an MLX card to test with that as well.&lt;/p&gt;</comment>
                            <comment id="306996" author="gerrit" created="Mon, 12 Jul 2021 18:45:40 +0000"  >&lt;p&gt;Oleg Drokin (green@whamcloud.com) merged in patch &lt;a href=&quot;https://review.whamcloud.com/44189/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/44189/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14733&quot; title=&quot;brw_bulk_ready() BRW bulk READ failed for RPC from 12345-192.168.128.126@o2ib18: -103&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14733&quot;&gt;&lt;del&gt;LU-14733&lt;/del&gt;&lt;/a&gt; o2iblnd: Move racy NULL assignment&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 023113fb8946f3565529e7327fdcd90ab9db3ba3&lt;/p&gt;</comment>
                            <comment id="306997" author="gerrit" created="Mon, 12 Jul 2021 18:45:53 +0000"  >&lt;p&gt;Oleg Drokin (green@whamcloud.com) merged in patch &lt;a href=&quot;https://review.whamcloud.com/44190/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/44190/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14733&quot; title=&quot;brw_bulk_ready() BRW bulk READ failed for RPC from 12345-192.168.128.126@o2ib18: -103&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14733&quot;&gt;&lt;del&gt;LU-14733&lt;/del&gt;&lt;/a&gt; o2iblnd: Avoid double posting invalidate&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 5930576791e864529e6ef9b46f3e09cc4b635fc2&lt;/p&gt;</comment>
                            <comment id="307017" author="ofaaland" created="Mon, 12 Jul 2021 19:01:54 +0000"  >&lt;p&gt;I saw those two Lustre patches were merged to master.&#160; Has someone been able to test them on MLX to confirm they don&apos;t cause new issues there?&#160; Thanks&lt;/p&gt;</comment>
                            <comment id="307020" author="ssmirnov" created="Mon, 12 Jul 2021 19:20:49 +0000"  >&lt;p&gt;Before Mike pushed them, I tried these patches on my local setup that uses LTS MOFED 4.9 and cx-2 cards. Ran lnet_selftest, didn&apos;t see any issues.&lt;/p&gt;</comment>
                            <comment id="307023" author="mmarcini2" created="Mon, 12 Jul 2021 19:54:09 +0000"  >&lt;blockquote&gt;
&lt;p&gt;Before Mike pushed them, I tried these patches on my local setup that uses LTS MOFED 4.9 and cx-2 cards. Ran lnet_selftest, didn&apos;t see any issues.&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;I will also test on some MLX cards. I will use the stock RHEL 8.4 kernel rather than MOFED, though.&lt;/p&gt;</comment>
                            <comment id="307048" author="ofaaland" created="Mon, 12 Jul 2021 21:47:28 +0000"  >&lt;p&gt;Great, thank you both.&lt;/p&gt;</comment>
                            <comment id="307050" author="gerrit" created="Mon, 12 Jul 2021 22:33:20 +0000"  >&lt;p&gt;Minh Diep (mdiep@whamcloud.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/44216&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/44216&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14733&quot; title=&quot;brw_bulk_ready() BRW bulk READ failed for RPC from 12345-192.168.128.126@o2ib18: -103&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14733&quot;&gt;&lt;del&gt;LU-14733&lt;/del&gt;&lt;/a&gt; o2iblnd: Move racy NULL assignment&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_12&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: d33a2dcf34fcea3adaf46d1a806405a9c334adc0&lt;/p&gt;</comment>
                            <comment id="307051" author="gerrit" created="Mon, 12 Jul 2021 22:33:21 +0000"  >&lt;p&gt;Minh Diep (mdiep@whamcloud.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/44217&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/44217&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14733&quot; title=&quot;brw_bulk_ready() BRW bulk READ failed for RPC from 12345-192.168.128.126@o2ib18: -103&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14733&quot;&gt;&lt;del&gt;LU-14733&lt;/del&gt;&lt;/a&gt; o2iblnd: Avoid double posting invalidate&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_12&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 8c70059776f2bdc7a68a27e35bb5dc36763d3dd6&lt;/p&gt;</comment>
                            <comment id="307059" author="mmarcini2" created="Mon, 12 Jul 2021 23:40:51 +0000"  >&lt;blockquote&gt;
&lt;p&gt;05:00.0 Infiniband controller: Mellanox Technologies MT27700 Family &lt;span class=&quot;error&quot;&gt;&amp;#91;ConnectX-4&amp;#93;&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;I was able to run lnet_selftest on a vanilla RHEL 8.4 install:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@ioperf-05 ~]# ./lnet_wrapper_read
LST_SESSION = 6780
SESSION: lstread FEATURES: 1 TIMEOUT: 300 FORCE: No
192.168.25.9@o2ib1 are added to session
192.168.25.10@o2ib1 are added to session
Test was added successfully
bulk_read is running now
Capturing statistics for 30 secs [LNet Rates of lfrom]
[R] Avg: 21275    RPC/s Min: 21275    RPC/s Max: 21275    RPC/s
[W] Avg: 10638    RPC/s Min: 10638    RPC/s Max: 10638    RPC/s
[LNet Bandwidth of lfrom]
[R] Avg: 10639.30 MiB/s Min: 10639.30 MiB/s Max: 10639.30 MiB/s
[W] Avg: 1.62     MiB/s Min: 1.62     MiB/s Max: 1.62     MiB/s
[LNet Rates of lto]
[R] Avg: 10637    RPC/s Min: 10637    RPC/s Max: 10637    RPC/s
[W] Avg: 21275    RPC/s Min: 21275    RPC/s Max: 21275    RPC/s
[LNet Bandwidth of lto]
[R] Avg: 1.62     MiB/s Min: 1.62     MiB/s Max: 1.62     MiB/s
[W] Avg: 10639.10 MiB/s Min: 10639.10 MiB/s Max: 10639.10 MiB/s
[LNet Rates of lfrom]
[R] Avg: 21294    RPC/s Min: 21294    RPC/s Max: 21294    RPC/s
[W] Avg: 10647    RPC/s Min: 10647    RPC/s Max: 10647    RPC/s
[LNet Bandwidth of lfrom]
[R] Avg: 10647.90 MiB/s Min: 10647.90 MiB/s Max: 10647.90 MiB/s
[W] Avg: 1.62     MiB/s Min: 1.62     MiB/s Max: 1.62     MiB/s
[LNet Rates of lto]
[R] Avg: 10647    RPC/s Min: 10647    RPC/s Max: 10647    RPC/s
[W] Avg: 21294    RPC/s Min: 21294    RPC/s Max: 21294    RPC/s
[LNet Bandwidth of lto]
[R] Avg: 1.62     MiB/s Min: 1.62     MiB/s Max: 1.62     MiB/s
[W] Avg: 10647.90 MiB/s Min: 10647.90 MiB/s Max: 10647.90 MiB/s
[LNet Rates of lfrom]
[R] Avg: 21275    RPC/s Min: 21275    RPC/s Max: 21275    RPC/s
[W] Avg: 10637    RPC/s Min: 10637    RPC/s Max: 10637    RPC/s
[LNet Bandwidth of lfrom]
[R] Avg: 10639.50 MiB/s Min: 10639.50 MiB/s Max: 10639.50 MiB/s
[W] Avg: 1.62     MiB/s Min: 1.62     MiB/s Max: 1.62     MiB/s
[LNet Rates of lto]
[R] Avg: 10637    RPC/s Min: 10637    RPC/s Max: 10637    RPC/s
[W] Avg: 21276    RPC/s Min: 21276    RPC/s Max: 21276    RPC/s
[LNet Bandwidth of lto]
[R] Avg: 1.62     MiB/s Min: 1.62     MiB/s Max: 1.62     MiB/s
[W] Avg: 10639.50 MiB/s Min: 10639.50 MiB/s Max: 10639.50 MiB/s
[LNet Rates of lfrom]
[R] Avg: 21297    RPC/s Min: 21297    RPC/s Max: 21297    RPC/s
[W] Avg: 10649    RPC/s Min: 10649    RPC/s Max: 10649    RPC/s
[LNet Bandwidth of lfrom]
[R] Avg: 10649.22 MiB/s Min: 10649.22 MiB/s Max: 10649.22 MiB/s
[W] Avg: 1.62     MiB/s Min: 1.62     MiB/s Max: 1.62     MiB/s
[LNet Rates of lto]
[R] Avg: 10648    RPC/s Min: 10648    RPC/s Max: 10648    RPC/s
[W] Avg: 21294    RPC/s Min: 21294    RPC/s Max: 21294    RPC/s
[LNet Bandwidth of lto]
[R] Avg: 1.62     MiB/s Min: 1.62     MiB/s Max: 1.62     MiB/s
[W] Avg: 10649.22 MiB/s Min: 10649.22 MiB/s Max: 10649.22 MiB/s
[LNet Rates of lfrom]
[R] Avg: 21304    RPC/s Min: 21304    RPC/s Max: 21304    RPC/s
[W] Avg: 10652    RPC/s Min: 10652    RPC/s Max: 10652    RPC/s
[LNet Bandwidth of lfrom]
[R] Avg: 10653.70 MiB/s Min: 10653.70 MiB/s Max: 10653.70 MiB/s
[W] Avg: 1.63     MiB/s Min: 1.63     MiB/s Max: 1.63     MiB/s
[LNet Rates of lto]
[R] Avg: 10652    RPC/s Min: 10652    RPC/s Max: 10652    RPC/s
[W] Avg: 21305    RPC/s Min: 21305    RPC/s Max: 21305    RPC/s
[LNet Bandwidth of lto]
[R] Avg: 1.63     MiB/s Min: 1.63     MiB/s Max: 1.63     MiB/s
[W] Avg: 10653.70 MiB/s Min: 10653.70 MiB/s Max: 10653.70 MiB/s

lfrom:
Total 0 error nodes in lfrom
lto:
Total 0 error nodes in lto
1 batch in stopping
Batch is stopped
session is ended
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The only issue I ran into was a panic when trying to reboot. The servers required a power cycle.&lt;/p&gt;


&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Jul 12 19:09:14 ioperf-06 kernel: reboot          D    0 10092   9865 0x00004080
Jul 12 19:09:14 ioperf-06 kernel: Call Trace:
Jul 12 19:09:14 ioperf-06 kernel: __schedule+0x2c4/0x700
Jul 12 19:09:14 ioperf-06 kernel: ? __switch_to_asm+0x35/0x70
Jul 12 19:09:14 ioperf-06 kernel: ? __switch_to_asm+0x35/0x70
Jul 12 19:09:14 ioperf-06 kernel: schedule+0x38/0xa0
Jul 12 19:09:14 ioperf-06 kernel: schedule_timeout+0x246/0x2f0
Jul 12 19:09:14 ioperf-06 kernel: ? __switch_to_asm+0x41/0x70
Jul 12 19:09:14 ioperf-06 kernel: ? __switch_to+0x10c/0x480
Jul 12 19:09:14 ioperf-06 kernel: ? __schedule+0x2cc/0x700
Jul 12 19:09:14 ioperf-06 kernel: wait_for_completion+0x97/0x100
Jul 12 19:09:14 ioperf-06 kernel: cma_remove_one+0x23f/0x310 [rdma_cm]
Jul 12 19:09:14 ioperf-06 kernel: remove_client_context+0x8b/0xd0 [ib_core]
Jul 12 19:09:14 ioperf-06 kernel: disable_device+0x8c/0x130 [ib_core]
Jul 12 19:09:14 ioperf-06 kernel: __ib_unregister_device+0x35/0xa0 [ib_core]
Jul 12 19:09:14 ioperf-06 kernel: ib_unregister_device+0x21/0x30 [ib_core]
Jul 12 19:09:14 ioperf-06 kernel: __mlx5_ib_remove+0x38/0x60 [mlx5_ib]
Jul 12 19:09:14 ioperf-06 kernel: mlx5_detach_device+0xb2/0xc0 [mlx5_core]
Jul 12 19:09:14 ioperf-06 kernel: mlx5_unload_one+0x80/0x120 [mlx5_core]
Jul 12 19:09:14 ioperf-06 kernel: shutdown+0x144/0x1d0 [mlx5_core]
Jul 12 19:09:14 ioperf-06 kernel: pci_device_shutdown+0x34/0x60
Jul 12 19:09:14 ioperf-06 kernel: device_shutdown+0x161/0x212
Jul 12 19:09:14 ioperf-06 kernel: kernel_restart+0xe/0x30
Jul 12 19:09:14 ioperf-06 kernel: __do_sys_reboot+0x1d2/0x210
Jul 12 19:09:14 ioperf-06 kernel: ? syscall_trace_enter+0x1d3/0x2c0
Jul 12 19:09:14 ioperf-06 kernel: ? __audit_syscall_exit+0x249/0x2a0
Jul 12 19:09:14 ioperf-06 kernel: do_syscall_64+0x5b/0x1a0
Jul 12 19:09:14 ioperf-06 kernel: entry_SYSCALL_64_after_hwframe+0x65/0xca
Jul 12 19:09:14 ioperf-06 kernel: RIP: 0033:0x7f73aa2825f7
Jul 12 19:09:14 ioperf-06 kernel: Code: Unable to access opcode bytes at RIP 0x7f73aa2825cd.
Jul 12 19:09:14 ioperf-06 kernel: RSP: 002b:00007ffed2874d28 EFLAGS: 00000246 ORIG_RAX: 00000000000000a9
Jul 12 19:09:14 ioperf-06 kernel: RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f73aa2825f7
Jul 12 19:09:14 ioperf-06 kernel: RDX: 0000000001234567 RSI: 0000000028121969 RDI: 00000000fee1dead
Jul 12 19:09:14 ioperf-06 kernel: RBP: 00007ffed2874d70 R08: 0000000000000002 R09: 0000000000000000
Jul 12 19:09:14 ioperf-06 kernel: R10: 000000000000004b R11: 0000000000000246 R12: 0000000000000001
Jul 12 19:09:14 ioperf-06 kernel: R13: 00000000fffffffe R14: 0000000000000006 R15: 0000000000000000
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="307266" author="gerrit" created="Tue, 13 Jul 2021 18:33:06 +0000"  >&lt;p&gt;Gian-Carlo DeFazio (defazio1@llnl.gov) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/44295&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/44295&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14733&quot; title=&quot;brw_bulk_ready() BRW bulk READ failed for RPC from 12345-192.168.128.126@o2ib18: -103&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14733&quot;&gt;&lt;del&gt;LU-14733&lt;/del&gt;&lt;/a&gt; o2iblnd: Move racy NULL assignment&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_14&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 8db93a26f7e99f140ccd4a0fd3e35f4e9f71b8ec&lt;/p&gt;</comment>
                            <comment id="307267" author="gerrit" created="Tue, 13 Jul 2021 18:33:07 +0000"  >&lt;p&gt;Gian-Carlo DeFazio (defazio1@llnl.gov) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/44296&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/44296&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14733&quot; title=&quot;brw_bulk_ready() BRW bulk READ failed for RPC from 12345-192.168.128.126@o2ib18: -103&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14733&quot;&gt;&lt;del&gt;LU-14733&lt;/del&gt;&lt;/a&gt; o2iblnd: Avoid double posting invalidate&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_14&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 1105bfbfad5002bc673ff568c0720bcebc5d095a&lt;/p&gt;</comment>
                            <comment id="307745" author="ofaaland" created="Mon, 19 Jul 2021 15:47:17 +0000"  >&lt;p&gt;Mike, Serguei, this works on our test system.&lt;/p&gt;</comment>
                            <comment id="308284" author="pjones" created="Sat, 24 Jul 2021 00:23:53 +0000"  >&lt;p&gt;Landed for 2.15&lt;/p&gt;</comment>
                            <comment id="309727" author="gerrit" created="Tue, 10 Aug 2021 06:35:14 +0000"  >&lt;p&gt;&quot;Oleg Drokin &amp;lt;green@whamcloud.com&amp;gt;&quot; merged in patch &lt;a href=&quot;https://review.whamcloud.com/44216/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/44216/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14733&quot; title=&quot;brw_bulk_ready() BRW bulk READ failed for RPC from 12345-192.168.128.126@o2ib18: -103&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14733&quot;&gt;&lt;del&gt;LU-14733&lt;/del&gt;&lt;/a&gt; o2iblnd: Move racy NULL assignment&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_12&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 173d60a3274d19bc1d9811b6e1b09aac2b25f221&lt;/p&gt;</comment>
                            <comment id="311475" author="gerrit" created="Sat, 28 Aug 2021 07:03:01 +0000"  >&lt;p&gt;&quot;Oleg Drokin &amp;lt;green@whamcloud.com&amp;gt;&quot; merged in patch &lt;a href=&quot;https://review.whamcloud.com/44217/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/44217/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14733&quot; title=&quot;brw_bulk_ready() BRW bulk READ failed for RPC from 12345-192.168.128.126@o2ib18: -103&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14733&quot;&gt;&lt;del&gt;LU-14733&lt;/del&gt;&lt;/a&gt; o2iblnd: Avoid double posting invalidate&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_12&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 96d7dcf4e773e6026a590e4596ef30ac8a4a5061&lt;/p&gt;</comment>
                            <comment id="318198" author="gerrit" created="Sun, 14 Nov 2021 03:06:48 +0000"  >&lt;p&gt;&quot;Andreas Dilger &amp;lt;adilger@whamcloud.com&amp;gt;&quot; merged in patch &lt;a href=&quot;https://review.whamcloud.com/44295/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/44295/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14733&quot; title=&quot;brw_bulk_ready() BRW bulk READ failed for RPC from 12345-192.168.128.126@o2ib18: -103&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14733&quot;&gt;&lt;del&gt;LU-14733&lt;/del&gt;&lt;/a&gt; o2iblnd: Move racy NULL assignment&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_14&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 380be07fcca1f76564d1f29e58f2d8d5f8f530c8&lt;/p&gt;</comment>
                            <comment id="318199" author="gerrit" created="Sun, 14 Nov 2021 03:07:54 +0000"  >&lt;p&gt;&quot;Andreas Dilger &amp;lt;adilger@whamcloud.com&amp;gt;&quot; merged in patch &lt;a href=&quot;https://review.whamcloud.com/44296/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/44296/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14733&quot; title=&quot;brw_bulk_ready() BRW bulk READ failed for RPC from 12345-192.168.128.126@o2ib18: -103&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14733&quot;&gt;&lt;del&gt;LU-14733&lt;/del&gt;&lt;/a&gt; o2iblnd: Avoid double posting invalidate&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_14&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 29da7cba3e7b3461d895010c7f7284b9649aba52&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10010">
                    <name>Duplicate</name>
                                                                <inwardlinks description="is duplicated by">
                                        <issuelink>
            <issuekey id="60888">LU-13976</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="66690">LU-15116</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="39552" name="01-move_null.patch" size="1385" author="mmarcini2" created="Thu, 8 Jul 2021 13:55:39 +0000"/>
                            <attachment id="39551" name="02-post_state.patch" size="3591" author="mmarcini2" created="Thu, 8 Jul 2021 13:55:39 +0000"/>
                            <attachment id="39055" name="build.txt" size="274363" author="ofaaland" created="Mon, 14 Jun 2021 18:09:03 +0000"/>
                            <attachment id="39075" name="diff.txt" size="1357" author="ssmirnov" created="Tue, 15 Jun 2021 22:20:01 +0000"/>
                            <attachment id="38932" name="dk.opal188.llnl.gov.7.txt" size="1075488" author="ofaaland" created="Thu, 3 Jun 2021 15:17:09 +0000"/>
                            <attachment id="38931" name="dk.opal63.llnl.gov.7.txt" size="775009" author="ofaaland" created="Thu, 3 Jun 2021 15:17:09 +0000"/>
                            <attachment id="38934" name="dmesg.opal188.txt" size="151018" author="ofaaland" created="Thu, 3 Jun 2021 15:23:50 +0000"/>
                            <attachment id="38933" name="dmesg.opal63.txt" size="142640" author="ofaaland" created="Thu, 3 Jun 2021 15:23:50 +0000"/>
                            <attachment id="39314" name="kprobes-off.sh" size="2284" author="mmarcini2" created="Sat, 26 Jun 2021 19:16:26 +0000"/>
                            <attachment id="39316" name="kprobes.sh" size="5617" author="mmarcini2" created="Sat, 26 Jun 2021 19:16:26 +0000"/>
                            <attachment id="39531" name="linux-kernel-test.patch" size="1697" author="mmarcini2" created="Wed, 7 Jul 2021 11:45:49 +0000"/>
                            <attachment id="39526" name="move_null.patch" size="783" author="mmarcini2" created="Tue, 6 Jul 2021 21:59:53 +0000"/>
                            <attachment id="39527" name="post_state.patch" size="2598" author="mmarcini2" created="Tue, 6 Jul 2021 21:59:53 +0000"/>
                            <attachment id="39289" name="trace1.txt" size="36651" author="mmarcini2" created="Thu, 24 Jun 2021 21:10:22 +0000"/>
                            <attachment id="39315" name="trace2.txt" size="52571" author="mmarcini2" created="Sat, 26 Jun 2021 19:16:26 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i01w2n:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>