<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:53:54 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-5718] RDMA too fragmented with router</title>
                <link>https://jira.whamcloud.com/browse/LU-5718</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Got an IOR failure on the soak cluster with the following errors:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Oct  7 21:54:01 lola-23 kernel: LNetError: 3613:0:(o2iblnd_cb.c:1134:kiblnd_init_rdma()) RDMA too fragmented for 192.168.1.115@o2ib100 (256): 128/256 src 128/256 dst frags
Oct  7 21:54:01 lola-23 kernel: LNetError: 3618:0:(o2iblnd_cb.c:428:kiblnd_handle_rx()) Can&apos;t setup rdma for PUT to 192.168.1.114@o2ib100: -90
Oct  7 21:54:01 lola-23 kernel: LNetError: 3618:0:(o2iblnd_cb.c:428:kiblnd_handle_rx()) Skipped 7 previous similar messages
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Liang told me that this is a known issue with routing. That said, the IOR process is not killable and the only option is to reboot the client node. We should at least fail &quot;gracefully&quot; by returning the error to the application.&lt;/p&gt;</description>
                <environment></environment>
        <key id="26914">LU-5718</key>
            <summary>RDMA too fragmented with router</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="2" iconUrl="https://jira.whamcloud.com/images/icons/priorities/critical.svg">Critical</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="doug">Doug Oucharek</assignee>
                                    <reporter username="johann">Johann Lombardi</reporter>
                        <labels>
                            <label>llnl</label>
                    </labels>
                <created>Wed, 8 Oct 2014 18:43:06 +0000</created>
                <updated>Fri, 14 Jun 2019 16:41:51 +0000</updated>
                            <resolved>Wed, 3 May 2017 18:05:37 +0000</resolved>
                                    <version>Lustre 2.7.0</version>
                    <version>Lustre 2.8.0</version>
                    <version>Lustre 2.9.0</version>
                                    <fixVersion>Lustre 2.10.0</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>38</watches>
                                                                            <comments>
                            <comment id="97661" author="liang" created="Tue, 28 Oct 2014 04:11:04 +0000"  >&lt;p&gt;patch is here: &lt;a href=&quot;http://review.whamcloud.com/12451&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/12451&lt;/a&gt;&lt;br/&gt;
it&apos;s not tested yet, I need to test it.&lt;/p&gt;</comment>
                            <comment id="97730" author="hornc" created="Tue, 28 Oct 2014 18:16:45 +0000"  >&lt;p&gt;Johann/Liang, any tips for reproducing this issue?&lt;/p&gt;</comment>
                            <comment id="97796" author="liang" created="Wed, 29 Oct 2014 03:08:28 +0000"  >&lt;p&gt;I think Johann hit this while running some mixed workloads with routers. I will patch lnet_selftest to support brw with an offset, which should be able to reproduce this issue.&lt;/p&gt;</comment>
                            <comment id="97914" author="shadow" created="Thu, 30 Oct 2014 06:29:08 +0000"  >&lt;p&gt;I&apos;m not sure the patch is correct.&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;Oct  7 21:54:01 lola-23 kernel: LNetError: 3613:0:(o2iblnd_cb.c:1134:kiblnd_init_rdma()) RDMA too fragmented &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; 192.168.1.115@o2ib100 (256): 128/256 src 128/256 dst frags
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I think the main reason for it is an incorrect calculation in the osc/ptlrpc layer, which is already responsible for checking the number of fragments for a bulk transfer.&lt;/p&gt;</comment>
                            <comment id="98461" author="hornc" created="Wed, 5 Nov 2014 20:01:19 +0000"  >&lt;p&gt;Liang, do you have an LU ticket for the lnet_selftest enhancement you mentioned?&lt;/p&gt;</comment>
                            <comment id="98504" author="liang" created="Thu, 6 Nov 2014 04:15:11 +0000"  >&lt;p&gt;Hi Chris, I didn&apos;t create another ticket for selftest, but I have posted patch for it: &lt;a href=&quot;http://review.whamcloud.com/#/c/12496/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/12496/&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="113149" author="hornc" created="Wed, 22 Apr 2015 18:26:53 +0000"  >&lt;p&gt;We had a site report seeing an error with this patch when they set peer_credits &amp;gt; 16:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;LNetError: 2641:0:(o2iblnd.c:872:kiblnd_create_conn()) Can&apos;t create QP: -12, send_wr: 16191, recv_wr: 254, send_sge: 2, recv_sge: 1
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="113183" author="liang" created="Thu, 23 Apr 2015 02:44:36 +0000"  >&lt;p&gt;Chris, I don&apos;t think this is an issue from this patch because it does not consume extra memory. I suspect connd may reconnect aggressively when there is a connection race; I will post a patch for this.&lt;/p&gt;</comment>
                            <comment id="113223" author="hornc" created="Thu, 23 Apr 2015 15:50:53 +0000"  >&lt;p&gt;Thanks, Liang. FWIW, they only see that error with this patch applied, and when they set &quot;options ko2iblnd wrq_sge=1&quot; the error goes away...&lt;/p&gt;</comment>
                            <comment id="113231" author="isaac" created="Thu, 23 Apr 2015 16:41:08 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=liang&quot; class=&quot;user-hover&quot; rel=&quot;liang&quot;&gt;liang&lt;/a&gt; I think the patch could cause increased memory overhead in OFED and the layers beneath it, since init_qp_attr-&amp;gt;cap.max_send_sge is doubled.&lt;/p&gt;</comment>
                            <comment id="113237" author="shadow" created="Thu, 23 Apr 2015 17:12:55 +0000"  >&lt;p&gt;Isaac,&lt;/p&gt;

&lt;p&gt;do you remember my comments about additional memory issues with that patch?...&lt;/p&gt;</comment>
                            <comment id="113244" author="isaac" created="Thu, 23 Apr 2015 17:29:17 +0000"  >&lt;p&gt;Alexey, that&apos;s the price to pay - there&apos;s no free lunch.&lt;/p&gt;</comment>
                            <comment id="113442" author="gerrit" created="Mon, 27 Apr 2015 02:32:38 +0000"  >&lt;p&gt;Liang Zhen (liang.zhen@intel.com) uploaded a new patch: &lt;a href=&quot;http://review.whamcloud.com/14600&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/14600&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5718&quot; title=&quot;RDMA too fragmented with router&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5718&quot;&gt;&lt;del&gt;LU-5718&lt;/del&gt;&lt;/a&gt; o2iblnd: avoid intensive reconnecting&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 5ec48e1f63befe9c361ddac6d8baa38aa83edd34&lt;/p&gt;</comment>
                            <comment id="113455" author="liang" created="Mon, 27 Apr 2015 05:07:56 +0000"  >&lt;p&gt;Isaac, indeed, thanks for pointing out. &lt;br/&gt;
Chris, could you try this patch and see if it can help.&lt;/p&gt;</comment>
                            <comment id="113491" author="shadow" created="Mon, 27 Apr 2015 15:52:19 +0000"  >&lt;p&gt;Per discussion with Mellanox people, they are not happy with increasing the number of fragments for their own IB cards,&lt;br/&gt;
because it needs a large array allocated via kmalloc. With Cray default settings it&apos;s a 128k allocation per connection, so it is easy to hit a problem with any new connection. With this patch the allocation will double, so a 256k allocation via kmalloc is needed.&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;                qp-&amp;gt;sq.wrid  = kmalloc(qp-&amp;gt;sq.wqe_cnt * sizeof (u64), GFP_KERNEL);
                qp-&amp;gt;rq.wrid  = kmalloc(qp-&amp;gt;rq.wqe_cnt * sizeof (u64), GFP_KERNEL);
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;I agree with Isaac, there is no free lunch - but with that patch you may stop working with a large number of connections, such as router &amp;lt;&amp;gt; client links.&lt;/p&gt;

&lt;p&gt;#define IBLND_SEND_WRS(v)          ((IBLND_RDMA_FRAGS(v) + 1) * IBLND_CONCURRENT_SENDS(v))&lt;/p&gt;
</comment>
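A quick sanity check of the IBLND_SEND_WRS macro quoted above against the QP-creation failure reported earlier in this ticket (send_wr: 16191). This is a hedged sketch, not Lustre code: the parameter values rdma_frags=256 and concurrent_sends=63 are assumptions taken from the tuning values discussed in this ticket's comments.

```python
# Hedged sketch of the IBLND_SEND_WRS arithmetic quoted above.
# rdma_frags=256 and concurrent_sends=63 are assumed values from the
# ticket discussion, not read from any configuration.
def iblnd_send_wrs(rdma_frags, concurrent_sends):
    # mirrors: ((IBLND_RDMA_FRAGS(v) + 1) * IBLND_CONCURRENT_SENDS(v))
    return (rdma_frags + 1) * concurrent_sends

print(iblnd_send_wrs(256, 63))  # 16191
```

The result matches the "send_wr: 16191" in the failed kiblnd_create_conn() log, which suggests the QP failure follows directly from this macro once the fragment count is raised.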
                            <comment id="115330" author="hornc" created="Thu, 14 May 2015 15:27:16 +0000"  >&lt;p&gt;Liang, I haven&apos;t had a chance to reproduce the QP allocation failure internally, so I haven&apos;t tested your patch. I agree with Alexey that I think a big part of our problem is the large kmallocs we&apos;re doing. The site that hit this issue is using ConnectIB cards with the mlx5 drivers (I only have access to ConnectX/mlx4 cards internally). I haven&#8217;t looked at the driver code before, but it looks to me like we&apos;re not just doing the one 256k allocation noted by Alexey (I&apos;m pretty sure the qp-&amp;gt;rq.wrid kmalloc is for just 2048 bytes), but it looks like we&apos;re doing four of them:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;qp-&amp;gt;rq.wqe_cnt = 256
qp-&amp;gt;sq.wqe_cnt = 32768

         qp-&amp;gt;sq.wrid = kmalloc(qp-&amp;gt;sq.wqe_cnt * sizeof(*qp-&amp;gt;sq.wrid), GFP_KERNEL); // 262144 bytes
         qp-&amp;gt;sq.wr_data = kmalloc(qp-&amp;gt;sq.wqe_cnt * sizeof(*qp-&amp;gt;sq.wr_data), GFP_KERNEL); // 262144 bytes
         qp-&amp;gt;rq.wrid = kmalloc(qp-&amp;gt;rq.wqe_cnt * sizeof(*qp-&amp;gt;rq.wrid), GFP_KERNEL); // 2048 bytes
         qp-&amp;gt;sq.w_list = kmalloc(qp-&amp;gt;sq.wqe_cnt * sizeof(*qp-&amp;gt;sq.w_list), GFP_KERNEL); // 262144 bytes
         qp-&amp;gt;sq.wqe_head = kmalloc(qp-&amp;gt;sq.wqe_cnt * sizeof(*qp-&amp;gt;sq.wqe_head), GFP_KERNEL); // 262144 bytes
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The reason we have such large allocations is that we set peer_credits=126 and concurrent_sends=63 in order to deal with the huge amount of small messages generated by Lustre client pings at large scale (see &lt;a href=&quot;https://cug.org/proceedings/attendee_program_cug2012/includes/files/pap166.pdf&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://cug.org/proceedings/attendee_program_cug2012/includes/files/pap166.pdf&lt;/a&gt; for details). The site that reported the QP allocation failure did try different values of peer_credits, and they found that the only values that worked were peer_credits=8 and peer_credits=16. This was on a small TDS system with just two LNet routers (I&#8217;m still waiting to find out the total number of IB peers).&lt;/p&gt;

&lt;p&gt;Interestingly, we&apos;ve deployed the multiple SGEs patch at another (very) large site that uses ConnectX/mlx4 drivers, and they have not seen this issue. So I&apos;m wondering if there&apos;s a difference in the driver code that is making this more likely.&lt;/p&gt;</comment>
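The allocation sizes quoted in the comment above can be reproduced with simple arithmetic. This is a hedged sketch: the wqe counts (sq=32768, rq=256) and the 8-byte element size come from the annotated driver snippet above, and the helper name kmalloc_bytes is hypothetical.

```python
# Hedged sketch: verify the per-QP kmalloc sizes annotated in the
# comment above. Inputs (wqe counts, 8-byte u64 elements) come from
# that comment; kmalloc_bytes is an illustrative helper name.
SIZEOF_U64 = 8

def kmalloc_bytes(wqe_cnt, elem_size=SIZEOF_U64):
    return wqe_cnt * elem_size

sq_array = kmalloc_bytes(32768)  # 262144 bytes, as noted for sq.wrid
rq_array = kmalloc_bytes(256)    # 2048 bytes, as noted for rq.wrid
# four sq-sized arrays plus one rq-sized array per connection
total = 4 * sq_array + rq_array  # roughly 1 MiB of contiguous memory
```

Each of the four large arrays is a single physically contiguous kmalloc, which is why these allocations become fragile per connection at scale.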
                            <comment id="126339" author="gerrit" created="Fri, 4 Sep 2015 05:15:31 +0000"  >&lt;p&gt;Oleg Drokin (oleg.drokin@intel.com) merged in patch &lt;a href=&quot;http://review.whamcloud.com/14600/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/14600/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5718&quot; title=&quot;RDMA too fragmented with router&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5718&quot;&gt;&lt;del&gt;LU-5718&lt;/del&gt;&lt;/a&gt; o2iblnd: avoid intensive reconnecting&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 5dcc6f68d6ebba0be4e2a7d132d4e28da7a8361e&lt;/p&gt;</comment>
                            <comment id="126388" author="shadow" created="Fri, 4 Sep 2015 15:37:57 +0000"  >&lt;p&gt;Can you explain why you closed this ticket with an unrelated patch?&lt;/p&gt;</comment>
                            <comment id="126389" author="shadow" created="Fri, 4 Sep 2015 15:38:55 +0000"  >&lt;p&gt;The reconnect problem is a completely different problem and needs its own ticket; that patch never addressed the wrong alignment for the router buffers.&lt;br/&gt;
Please reopen the ticket.&lt;/p&gt;</comment>
                            <comment id="126398" author="jgmitter" created="Fri, 4 Sep 2015 16:29:28 +0000"  >&lt;p&gt;Hi Alexey,&lt;br/&gt;
this was in error - my apologies.&lt;/p&gt;</comment>
                            <comment id="128593" author="heckes" created="Mon, 28 Sep 2015 09:47:24 +0000"  >&lt;p&gt;The error still happens during soak testing of 2_7_59 + debug patch&lt;br/&gt;
(see &lt;a href=&quot;https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20150914&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20150914&lt;/a&gt;)&lt;br/&gt;
while running IOR (single-shared-file mode) on a single client node.&lt;br/&gt;
The job hangs and the IOR process can&apos;t be killed.&lt;/p&gt;

&lt;p&gt;There are 173 messages of the form:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Sep 27 09:58:24 lola-27 kernel: Lustre: 3698:0:(client.c:2040:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1443372961/real 1443372961]  req@ffff880588fbd0c0 x1513236214216272/t0(0) o4-&amp;gt;soaked-OST0008-osc-ffff880818748800@192.168.1.102@o2ib10:6/4 lens 608/448 e 2 to 1 dl 1443373079 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
Sep 27 09:58:24 lola-27 kernel: Lustre: 3698:0:(client.c:2040:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Sep 27 09:58:24 lola-27 kernel: Lustre: soaked-OST0008-osc-ffff880818748800: Connection to soaked-OST0008 (at 192.168.1.102@o2ib10) was lost; in progress operations using this service will wait for recovery to complete
Sep 27 09:58:24 lola-27 kernel: Lustre: soaked-OST0008-osc-ffff880818748800: Connection restored to soaked-OST0008 (at 192.168.1.102@o2ib10)
Sep 27 09:58:24 lola-27 kernel: LNetError: 3675:0:(o2iblnd_cb.c:1139:kiblnd_init_rdma()) RDMA too fragmented for 192.168.1.114@o2ib100 (256): 128/233 src 128/233 dst frags
Sep 27 09:58:24 lola-27 kernel: LNetError: 3675:0:(o2iblnd_cb.c:435:kiblnd_handle_rx()) Can&apos;t setup rdma for PUT to 192.168.1.114@o2ib100: -90
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;which seems to correlate with the same number of errors on the OSS node (&lt;tt&gt;lola-2&lt;/tt&gt;):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Sep 27 09:57:41 lola-2 kernel: LustreError: 8847:0:(ldlm_lib.c:3017:target_bulk_io()) @@@ timeout on bulk WRITE after 100+0s  req@ffff8801c9e476c0 x1513236214216272/t0(0) o4-&amp;gt;076bba0c-23e4-e9cc-96e8-bd39615184cd@192.168.1.127@o2ib100:318/0 lens 608/448 e 2 to 0 dl 1443373078 ref 1 fl Interpret:H/0/0 rc 0/0
Sep 27 09:57:41 lola-2 kernel: Lustre: soaked-OST0008: Bulk IO write error with 076bba0c-23e4-e9cc-96e8-bd39615184cd (at 192.168.1.127@o2ib100), client will retry: rc -110
Sep 27 09:58:24 lola-2 kernel: Lustre: soaked-OST0008: Client 076bba0c-23e4-e9cc-96e8-bd39615184cd (at 192.168.1.127@o2ib100) reconnecting
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="137101" author="gerrit" created="Mon, 21 Dec 2015 21:42:53 +0000"  >&lt;p&gt;Doug Oucharek (doug.s.oucharek@intel.com) uploaded a new patch: &lt;a href=&quot;http://review.whamcloud.com/17699&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/17699&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5718&quot; title=&quot;RDMA too fragmented with router&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5718&quot;&gt;&lt;del&gt;LU-5718&lt;/del&gt;&lt;/a&gt; o2iblnd: Revert original fix&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 682a15bf7319907cbd281021ea9af85d160cdf94&lt;/p&gt;</comment>
                            <comment id="137107" author="simmonsja" created="Mon, 21 Dec 2015 22:16:56 +0000"  >&lt;p&gt;Please don&apos;t revert it; it really did help relieve our router memory pressure. I really think &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7210&quot; title=&quot;ASSERTION( peer-&amp;gt;ibp_connecting == 0 )&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7210&quot;&gt;&lt;del&gt;LU-7210&lt;/del&gt;&lt;/a&gt; and &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7569&quot; title=&quot;IB leaf switch caused LNet routers to crash&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7569&quot;&gt;&lt;del&gt;LU-7569&lt;/del&gt;&lt;/a&gt; will relieve these problems.&lt;/p&gt;</comment>
                            <comment id="137119" author="adilger" created="Mon, 21 Dec 2015 23:19:52 +0000"  >&lt;blockquote&gt;
&lt;p&gt;Please don&apos;t revert it; it really did help relieve our router memory pressure. I really think &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7210&quot; title=&quot;ASSERTION( peer-&amp;gt;ibp_connecting == 0 )&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7210&quot;&gt;&lt;del&gt;LU-7210&lt;/del&gt;&lt;/a&gt; and &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7569&quot; title=&quot;IB leaf switch caused LNet routers to crash&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7569&quot;&gt;&lt;del&gt;LU-7569&lt;/del&gt;&lt;/a&gt; will relieve these problems.&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;James, as yet this patch is not landed, even the reversion needs to go through build and test since it is so old.  Are the fixes on top of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5718&quot; title=&quot;RDMA too fragmented with router&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5718&quot;&gt;&lt;del&gt;LU-5718&lt;/del&gt;&lt;/a&gt; well enough understood and tested that they present a better path forward than reverting to a state that was working for many years before it landed?  I don&apos;t have much information on it, as I&apos;m not LNet-savvy enough to make a final decision myself, but my understanding is that the current situation is worse than before the &lt;a href=&quot;http://review.whamcloud.com/14600&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/14600&lt;/a&gt; patch.&lt;/p&gt;</comment>
                            <comment id="137266" author="liang" created="Wed, 23 Dec 2015 13:13:47 +0000"  >&lt;p&gt;James, I think it is better to revert it for the time being; this patch is in the right direction but it is faulty. It opened a few race windows. Instead of adding fixes for it, I think it&apos;s better to just revert it and do a better implementation. I will work out another patch for the memory issue based on this patch and &lt;a href=&quot;http://review.whamcloud.com/17527&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/17527&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Andreas, I agree the situation is worse than w/o 14600 because it is faulty, sorry for that. But it is very helpful for the memory issue that people met for years, so I will rework the patch. &lt;/p&gt;</comment>
                            <comment id="138303" author="gerrit" created="Fri, 8 Jan 2016 13:33:11 +0000"  >&lt;p&gt;Oleg Drokin (oleg.drokin@intel.com) merged in patch &lt;a href=&quot;http://review.whamcloud.com/17699/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/17699/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5718&quot; title=&quot;RDMA too fragmented with router&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5718&quot;&gt;&lt;del&gt;LU-5718&lt;/del&gt;&lt;/a&gt; o2iblnd: Revert original fix&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 3efb7683679ab2d18b4d2b256acd462596324d9c&lt;/p&gt;</comment>
                            <comment id="138325" author="doug" created="Fri, 8 Jan 2016 16:52:35 +0000"  >&lt;p&gt;Given the true fix will be done in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7569&quot; title=&quot;IB leaf switch caused LNet routers to crash&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7569&quot;&gt;&lt;del&gt;LU-7569&lt;/del&gt;&lt;/a&gt;, I&apos;m closing this ticket as a duplicate of that ticket.&lt;/p&gt;</comment>
                            <comment id="138329" author="hornc" created="Fri, 8 Jan 2016 16:58:38 +0000"  >&lt;p&gt;Is &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7569&quot; title=&quot;IB leaf switch caused LNet routers to crash&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7569&quot;&gt;&lt;del&gt;LU-7569&lt;/del&gt;&lt;/a&gt; really a duplicate? Can you briefly explain how the patch there resolves the &quot;RDMA too fragmented&quot; issue?&lt;/p&gt;</comment>
                            <comment id="138394" author="doug" created="Fri, 8 Jan 2016 22:37:47 +0000"  >&lt;p&gt;Hmm...I was assuming that &lt;a href=&quot;http://review.whamcloud.com/#/c/14600/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/14600/&lt;/a&gt; was the fix for this issue and we had to revert it as it caused other problems.  That patch is being redone under &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7569&quot; title=&quot;IB leaf switch caused LNet routers to crash&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7569&quot;&gt;&lt;del&gt;LU-7569&lt;/del&gt;&lt;/a&gt; which is why I wanted to close this ticket. &lt;/p&gt;

&lt;p&gt;As I look at the history, I&apos;m not convinced that &lt;a href=&quot;http://review.whamcloud.com/#/c/14600/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/14600/&lt;/a&gt; was addressing the original problem.  Does anyone know what the state of the original issue is?  I fear we have been trying to tackle too many items here.&lt;/p&gt;</comment>
                            <comment id="138396" author="hornc" created="Fri, 8 Jan 2016 22:42:03 +0000"  >&lt;p&gt;AFAIK, the original fragmentation issue still exists. The 14600 patch was, IMO, inappropriately linked to this ticket, and never addressed the fragmentation error. Hence this ticket remained open even though the 14600 patch had landed.&lt;/p&gt;</comment>
                            <comment id="138397" author="simmonsja" created="Fri, 8 Jan 2016 22:46:32 +0000"  >&lt;p&gt;The reason for 14600&apos;s creation was to fix the huge memory pressure that came from the other patch for this ticket, 12451. Patch 12451 was never merged but 14600 was. Also, there has been debate about whether 12451 was fixing the issue in the right way, which is why it was never merged. See the comment history here. This still needs to be investigated.&lt;/p&gt;</comment>
                            <comment id="138402" author="doug" created="Fri, 8 Jan 2016 22:56:29 +0000"  >&lt;p&gt;Ok.  That being the case, I am re-opening this ticket to address the fragmentation of memory issue.  Let&apos;s not do any more reconnection fixes here :^).&lt;/p&gt;</comment>
                            <comment id="163435" author="doug" created="Mon, 29 Aug 2016 17:32:54 +0000"  >&lt;p&gt;I&apos;m starting to believe that the fix for this issue is the same as for &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7385&quot; title=&quot;Bulk IO write error&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7385&quot;&gt;&lt;del&gt;LU-7385&lt;/del&gt;&lt;/a&gt;.  That is assuming the fragmentation error occurs due to an offset in the IOVs.&lt;/p&gt;</comment>
                            <comment id="163436" author="simmonsja" created="Mon, 29 Aug 2016 17:34:12 +0000"  >&lt;p&gt;So we have two not so hot solutions &lt;img class=&quot;emoticon&quot; src=&quot;https://jira.whamcloud.com/images/icons/emoticons/sad.png&quot; height=&quot;16&quot; width=&quot;16&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/p&gt;</comment>
                            <comment id="163440" author="doug" created="Mon, 29 Aug 2016 17:45:19 +0000"  >&lt;p&gt;My big question is: why do we have an offset?  Is this caused by partial reads/writes in the file system?&lt;/p&gt;

&lt;p&gt;James: Is this going to result not in an error, but a crash after your change under &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7650&quot; title=&quot;ko2iblnd map_on_demand can&amp;#39;t negotitate when page sizes are different between nodes.&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7650&quot;&gt;&lt;del&gt;LU-7650&lt;/del&gt;&lt;/a&gt;?  You removed this fragment check in place of an overall sizing check before the loop.  Not sure if that will catch a problem when an offset is applied.&lt;/p&gt;</comment>
                            <comment id="163457" author="simmonsja" created="Mon, 29 Aug 2016 18:45:45 +0000"  >&lt;p&gt;It shouldn&apos;t crash since all allocations are 1 + IBLND_MAX_RDMA_FRAG to work around this issue. I know it&apos;s an ugly solution but it will hold us over until we move to the netlink API.&lt;/p&gt;</comment>
                            <comment id="163487" author="doug" created="Mon, 29 Aug 2016 21:22:22 +0000"  >&lt;p&gt;I don&apos;t think just adding 1 to the MAX_RDMA_FRAGS is enough.  Here is what I think is happening, and I really need others to either tell me my understanding is wrong or agree, so we can quickly move to fix this.   We have customers adding LNet routers and running into bulk RDMA failures due to this issue, so fixing this has just become a very high priority.&lt;/p&gt;

&lt;p&gt;1. A bulk operation is sent to the LNet router where the first fragment has an offset so the full 4k (assuming 4k pages) is not used in the first RDMA buffer.&lt;br/&gt;
2. In the code ko2iblnd_init_rdma(), when it is configuring the work queue for going from the source to the destination, it will have the source with a first fragment less than 4k and a destination with the first fragment ready for 4k.&lt;br/&gt;
3. 1st iteration of the loop setting things up, it will set the 1st work queue item to transfer &amp;lt;4k (size of source 1st fragment).&lt;br/&gt;
4.  When the consume routines are called, the source will advance to fragment index 2 (transferred all bytes it has) but the destination will not advance as it has space in its 1st fragment.&lt;br/&gt;
5. 2nd iteration of the loop will set up work element 2 to transfer just the number of bytes the destination has left in its 1st fragment.&lt;br/&gt;
6. When the consume routines are called, the source will not advance because it has not transferred all bytes of its 2nd fragment, but the destination will advance to index 2.&lt;br/&gt;
7.  This will continue until both source and destination indexes are 128.  At this point we will have used 256 work queue items which is the max.  It will be detected that we have used up all work queue items but are not done.  That causes the &quot;RDMA too fragmented&quot; error message.&lt;/p&gt;

&lt;p&gt;So, this issue seems to be caused by the fact that the 1st fragment of the source is &amp;lt; 4k while the destination is 4k.  It causes us to use twice as many work queue items as fragments.  It would seem that a proper solution would be to shift the offset of the destination forward to match the source so both source and destination have the same sized 1st fragments.  I have no clue how to do that and am open to suggestions.&lt;/p&gt;

&lt;p&gt;Another solution is to have 512 work queue items on LNet routers to accommodate this particular situation.  Not sure we can do that given all the funky FMR/fast reg code out there.&lt;/p&gt;

&lt;p&gt;Yet another solution is to do what was done in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7385&quot; title=&quot;Bulk IO write error&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7385&quot;&gt;&lt;del&gt;LU-7385&lt;/del&gt;&lt;/a&gt; and use one very big fragment buffer on the routers when RDMA is in play.  All the source fragments will nicely fit into the big destination fragment so we don&apos;t end up needing twice the number of work queue items as source fragments.&lt;/p&gt;

&lt;p&gt;Thoughts from anyone?&lt;/p&gt;</comment>
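The seven steps above can be simulated to show exactly how the offset doubles work-queue usage. This is a hedged sketch, not the kernel code: rdma_work_items is a hypothetical helper that mirrors the fragment-matching loop described for kiblnd_init_rdma(), and the 4 KiB page size and 512-byte offset are illustrative assumptions.

```python
# Hedged sketch of the fragment-matching loop described in the steps
# above: each side advances only once its current fragment is fully
# consumed, and every partial transfer costs one work-queue item.
def rdma_work_items(src_frags, dst_frags):
    """Count work-queue items needed to copy src fragments (byte sizes)
    into dst fragments; totals on both sides must be equal."""
    wrs = 0
    si = di = 0
    src_left, dst_left = src_frags[0], dst_frags[0]
    while True:
        chunk = min(src_left, dst_left)  # one work-queue item per chunk
        wrs += 1
        src_left -= chunk
        dst_left -= chunk
        if src_left == 0:                # source fragment consumed
            si += 1
            if si == len(src_frags):
                return wrs
            src_left = src_frags[si]
        if dst_left == 0:                # destination fragment consumed
            di += 1
            dst_left = dst_frags[di]

PAGE = 4096
# page-aligned transfer: fragments pair up one-to-one
aligned = rdma_work_items([PAGE] * 128, [PAGE] * 128)   # 128 items
# 512-byte offset on the source side: every later fragment straddles
# two destination fragments, doubling the work-queue usage
shifted = rdma_work_items([PAGE - 512] + [PAGE] * 127 + [512],
                          [PAGE] * 128)                 # 256 items
```

With a 256-entry work queue, the shifted case consumes the maximum exactly, so any transfer even slightly more fragmented overflows it, matching the "RDMA too fragmented ... 128/256 src 128/256 dst frags" failures (rc -90) in the logs.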
                            <comment id="163490" author="doug" created="Mon, 29 Aug 2016 22:10:24 +0000"  >&lt;p&gt;Another possible solution is to break the assumption that we need to fill up each destination fragment before advancing to the next destination index.  If we always advance the destination index when we advance the source index, this problem would go away.  However, it would mean that the destination fragments have to be the same size as the source.  But I believe Jame&apos;s has found that this must be true anyway for multiple reasons.  The code is just not in shape to have different fragment sizes.&lt;/p&gt;</comment>
                            <comment id="163492" author="simmonsja" created="Mon, 29 Aug 2016 22:54:15 +0000"  >&lt;p&gt;Correct. The fragment sizes must match on both sides. Also important is the max fragment count, which is transmitted over the wire. The thing is that we allocate all our buffers for the worst-case scenario at ko2iblnd initialization. We really should be allocating them dynamically based on what the remote connection can support. Anyway, it&apos;s acceptable that we handle the problem as you described. Currently I can&apos;t duplicate this problem. Are there known configurations/setups that expose this? Do you need a specific workload for this to show up?&lt;/p&gt;</comment>
                            <comment id="163496" author="doug" created="Mon, 29 Aug 2016 23:16:34 +0000"  >&lt;p&gt;I don&apos;t have a profile yet.  Working on it.  Trying to get the file system folks to describe the need for the offset.&lt;/p&gt;

&lt;p&gt;Seagate added an &quot;offset&quot; parameter to the lnet-selftest command set so you can reproduce this issue.  See &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7385&quot; title=&quot;Bulk IO write error&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7385&quot;&gt;&lt;del&gt;LU-7385&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;</comment>
                            <comment id="163498" author="doug" created="Mon, 29 Aug 2016 23:25:37 +0000"  >&lt;p&gt;Also, I don&apos;t understand why this is only happening with LNet routers and not direct RDMA operations.  In theory, it should happen everywhere.   I&apos;m really missing something and the code is not making it obvious.&lt;/p&gt;</comment>
                            <comment id="163629" author="doug" created="Tue, 30 Aug 2016 18:04:01 +0000"  >&lt;p&gt;The one scenario given to me which could cause an offset is using O_DIRECT read or write on a 512-byte sector boundary.  Possibly you also have to mix this with non-O_DIRECT operations (not sure).&lt;/p&gt;</comment>
                            <comment id="163671" author="doug" created="Wed, 31 Aug 2016 04:00:36 +0000"  >&lt;p&gt;Another, preferable, option is to fix the original patch by Liang, &lt;a href=&quot;http://review.whamcloud.com/12451&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/12451&lt;/a&gt;.  In that patch, was having peer_credits &amp;gt; 16 triggering too many send_wr&apos;s?  &lt;/p&gt;</comment>
                            <comment id="163695" author="olaf" created="Wed, 31 Aug 2016 09:19:46 +0000"  >&lt;blockquote&gt;&lt;p&gt;Also, I don&apos;t understand why this is only happening with LNet routers and not direct RDMA operations. In theory, it should happen everywhere. I&apos;m really missing something and the code is not making it obvious.&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;The explanation for this difference in behaviour is likely that the source and target both use the offset because it corresponds to (say) an offset in a file. A router, on the other hand, only needs to buffer the message for forwarding, and doesn&apos;t need to replicate the offset in its buffer. Note the 0 for the offset parameter in lnet_ni_recv() below.&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeHeader panelHeader&quot; style=&quot;border-bottom-width: 1px;&quot;&gt;&lt;b&gt;lnet_parse()&lt;/b&gt;&lt;/div&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;        &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (!for_me) {
                rc = lnet_parse_forward_locked(ni, msg);
                lnet_net_unlock(cpt);

                &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (rc &amp;lt; 0)
                        &lt;span class=&quot;code-keyword&quot;&gt;goto&lt;/span&gt; free_drop;

                &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (rc == LNET_CREDIT_OK) {
                        lnet_ni_recv(ni, msg-&amp;gt;msg_private, msg, 0,
                                     0, payload_length, payload_length);
                }
                &lt;span class=&quot;code-keyword&quot;&gt;return&lt;/span&gt; 0;
        }
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The easiest approach to making RDMA work better here might be to use the offset when buffering routed messages. This should cost at most one extra page per message buffer and result in one extra fragment. If we really cannot afford to spend the extra page, we could try to use the fact that the partial page at the start + partial page at end &amp;lt;= one page, so in principle we can store both fragments in a single page. (There is still the extra fragment to deal with, and we may end up having to debug RDMA engines if it turns out they don&apos;t like putting two non-overlapping fragments into the same page.)&lt;/p&gt;

&lt;p&gt;To find where this may be coming from, if you have some kind of reproducer, consider putting a &lt;tt&gt;WARN_ON()&lt;/tt&gt; in &lt;tt&gt;lnet_md_build()&lt;/tt&gt; that triggers when &lt;tt&gt;umd-&amp;gt;start&lt;/tt&gt; isn&apos;t a multiple of the page size. You&apos;ll probably want to limit that to the &lt;tt&gt;LNET_MD_IOVEC&lt;/tt&gt; and &lt;tt&gt;LNET_MD_KIOV&lt;/tt&gt; cases, because you&apos;d likely get a warning for each LNet ping otherwise. If you can get to the point where the warnings are only triggered by the cases of interest, but the traces don&apos;t provide enough information by themselves, you can change them to a BUG_ON and dig through the core to get the actual function parameters.&lt;/p&gt;</comment>
                            <comment id="164638" author="doug" created="Thu, 1 Sep 2016 16:42:34 +0000"  >&lt;p&gt;I have given up trying to reproduce this issue from the file system level.  Having no luck.  Instead, I have updated the lnet-selftest patch which adds an offset parameter.  Using that, I have been able to reproduce the issue.&lt;/p&gt;

&lt;p&gt;Note: before the patch for &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7650&quot; title=&quot;ko2iblnd map_on_demand can&amp;#39;t negotitate when page sizes are different between nodes.&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7650&quot;&gt;&lt;del&gt;LU-7650&lt;/del&gt;&lt;/a&gt; I get an error message and failed bulk operation.  After &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7650&quot; title=&quot;ko2iblnd map_on_demand can&amp;#39;t negotitate when page sizes are different between nodes.&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7650&quot;&gt;&lt;del&gt;LU-7650&lt;/del&gt;&lt;/a&gt; I get a crash (see my comment above about this).  So fixing this issue has become much more important now if we don&apos;t want to revert &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7650&quot; title=&quot;ko2iblnd map_on_demand can&amp;#39;t negotitate when page sizes are different between nodes.&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7650&quot;&gt;&lt;del&gt;LU-7650&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;What Olaf has suggested sounds good, but I need to provide a production system a patch ASAP and don&apos;t really have the time to investigate that approach.  Instead, I&apos;m going to take Liang&apos;s original fix, &lt;a href=&quot;http://review.whamcloud.com/12451&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/12451&lt;/a&gt;, and see if I can resolve the problem found with peer_credits.&lt;/p&gt;</comment>
                            <comment id="164676" author="shadow" created="Thu, 1 Sep 2016 18:07:59 +0000"  >&lt;p&gt;Olaf,&lt;/p&gt;

&lt;p&gt;The router doesn&apos;t have that offset information, because the sender doesn&apos;t put it in the message. The information exists at the OSC protocol level, where the first KIOV isn&apos;t page aligned. A short-term solution is to allocate the whole large buffer on the router as a single alloc_pages() call, but that creates problems for TCP &amp;lt;&amp;gt; IB routing, since LNet would need to be adjusted to send a large buffer to socklnd.&lt;/p&gt;</comment>
                            <comment id="164686" author="simmonsja" created="Thu, 1 Sep 2016 18:49:06 +0000"  >&lt;p&gt;Doug, the original patch for this ticket was integrated into our default Cray 2.5 clients. On our systems it broke our routers unless sge_wqe=1 was set. Since you are under pressure, it seems logical to use it as a band-aid for the site currently suffering from this issue.&lt;/p&gt;</comment>
                            <comment id="164687" author="doug" created="Thu, 1 Sep 2016 18:59:26 +0000"  >&lt;p&gt;I&apos;m wondering if a recent change, which keeps retrying QP creation while lowering the number of send_wr&apos;s each iteration, will fix/mask the problem Cray found.&lt;/p&gt;</comment>
                            <comment id="164722" author="doug" created="Thu, 1 Sep 2016 22:59:41 +0000"  >&lt;p&gt;James: I&apos;m finding the original patch is only needed on the clients as we have not seen this problem with servers rdma&apos;ing to the routers or routers rdma&apos;ing to the clients.  So, you really only need wrq_sge=2 (default) on the clients and set it to 1 on the routers and servers.  &lt;/p&gt;</comment>
                            <comment id="164792" author="shadow" created="Fri, 2 Sep 2016 09:40:07 +0000"  >&lt;p&gt;Doug,&lt;/p&gt;

&lt;p&gt;it looks like your investigation is off.&lt;br/&gt;
The problem is in the loop on the router side, where the first unaligned chunk is mapped onto an aligned one, which then requires twice as many segments per work completion to send a single page.&lt;br/&gt;
A simple workaround is &lt;a href=&quot;https://github.com/Xyratex/lustre-stable/commit/f9895e2423ad76147bfbb6c4974c58439782180f&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://github.com/Xyratex/lustre-stable/commit/f9895e2423ad76147bfbb6c4974c58439782180f&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;but it will cause a problem with TCP &amp;lt;&amp;gt; IB routing.&lt;/p&gt;</comment>
                            <comment id="164799" author="olaf" created="Fri, 2 Sep 2016 13:44:37 +0000"  >&lt;p&gt;Looking a bit closer, the offset into the first page is present at the LND level (as opposed to LNet level) for the o2ib and (I think) gni LNDs. The sock LND does not have it. So there is a problem when data is routed from a TCP network to IB or GNI. It would be possible to extend the sock LND to carry this data (easier than an LNet protocol change) but some less invasive option might be preferable.&lt;/p&gt;</comment>
                            <comment id="164827" author="doug" created="Fri, 2 Sep 2016 16:11:53 +0000"  >&lt;p&gt;The Xyratex fix was put into Gerrit under: &lt;a href=&quot;http://review.whamcloud.com/16141/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/16141/&lt;/a&gt;.   I made some significant changes to the patch to make it work with ksocklnd and to make the change more adaptable (controlling the larger fragments by making them a new pool).  It has not been reviewed/landed yet.&lt;/p&gt;

&lt;p&gt;What I have found when this problem occurs in production is that an offset is only applied when a client is doing an rdma write through a router.  When the client is setting up the work queue elements to rdma write to the router, it runs out of elements, since the out-of-sync fragments force it to use two elements per fragment (see my description of the problem above).  So the client reports the &quot;Too fragmented&quot; error and aborts the rdma operation.  I have never seen the &quot;Too fragmented&quot; error reported on servers or routers.&lt;/p&gt;

&lt;p&gt;This can be fixed in two ways:&lt;/p&gt;

&lt;p&gt;1- as Liang&apos;s original fix does, just have more work queue elements via doubling the sge so the client can rdma an offset buffer into a non-offset buffer in the router.&lt;br/&gt;
2- as Xyratex has done and have the router&apos;s buffer be one single big fragment so each fragment is appended into the router&apos;s buffer.  This prevents the number of work queue elements needed from doubling.&lt;/p&gt;

&lt;p&gt;Fix 1 needs to be applied to the clients (and servers, if we believe an offset can ever happen from there...no evidence of that yet).  Fix 2 needs to be applied to the routers only.&lt;/p&gt;

&lt;p&gt;Which one is best?  That seems to be an ongoing discussion here.&lt;/p&gt;</comment>
                            <comment id="164841" author="simmonsja" created="Fri, 2 Sep 2016 17:29:13 +0000"  >&lt;p&gt;Ugh. Neither is great, since they both involve increasing the memory footprint. In our experience Cray routers tend to be very memory constrained, so I would go for the client fix option. I have looked at the Xyratex solution and never understood why a new buffer is needed. Couldn&apos;t we just expand on the large buffers that already exist?&lt;/p&gt;

&lt;p&gt;On the other hand, down the road when we move to netlink, the upper-layer problems go away. These workarounds in the LND driver would have to be cleaned up in the future.&lt;/p&gt;</comment>
                            <comment id="164850" author="doug" created="Fri, 2 Sep 2016 18:31:18 +0000"  >&lt;p&gt;I wanted to let users control the use of large RDMA buffers.  Allocating a large number of 1M buffers made up of contiguous pages can be challenging if the system&apos;s memory has become fragmented.  Depending on how much memory the router has, a customer may want to control the allocation of these buffers, falling back to fragmented large buffers when they run out of contiguous ones.  Having a separate pool makes this easier to configure and adapt to each unique situation.&lt;/p&gt;</comment>
                            <comment id="164863" author="simmonsja" created="Fri, 2 Sep 2016 20:44:31 +0000"  >&lt;p&gt;Oh, I see what Alyona is doing. It&apos;s just that I&apos;m used to seeing contiguous pages allocated using alloc_contig_range() or CMA. Since this is the case, I would recommend you rename some of the RDMA labels to CMA so it makes sense to external reviewers. I need to look at these older kernels everyone uses to see what APIs are available.&lt;/p&gt;</comment>
                            <comment id="164873" author="doug" created="Fri, 2 Sep 2016 22:54:36 +0000"  >&lt;p&gt;Any kernel wisdom you can bring to the solution will be appreciated.  I have no idea what CMA is.  Out of the loop.&lt;/p&gt;</comment>
                            <comment id="164881" author="shadow" created="Sat, 3 Sep 2016 13:16:59 +0000"  >&lt;p&gt;James,&lt;/p&gt;

&lt;p&gt;the problem is simple: it&apos;s a bug (or feature) in o2iblnd that was copied to GNI, in the&lt;br/&gt;
kiblnd_init_rdma() function.&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;        &lt;span class=&quot;code-keyword&quot;&gt;while&lt;/span&gt; (resid &amp;gt; 0) {
                &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (srcidx &amp;gt;= srcrd-&amp;gt;rd_nfrags) {
                        CERROR(&lt;span class=&quot;code-quote&quot;&gt;&quot;Src buffer exhausted: %d frags\n&quot;&lt;/span&gt;, srcidx);
                        rc = -EPROTO;
                        &lt;span class=&quot;code-keyword&quot;&gt;break&lt;/span&gt;;
                }

                &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (dstidx == dstrd-&amp;gt;rd_nfrags) {
                        CERROR(&lt;span class=&quot;code-quote&quot;&gt;&quot;Dst buffer exhausted: %d frags\n&quot;&lt;/span&gt;, dstidx);
                        rc = -EPROTO;
                        &lt;span class=&quot;code-keyword&quot;&gt;break&lt;/span&gt;;
                }

                &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (tx-&amp;gt;tx_nwrq &amp;gt;= conn-&amp;gt;ibc_max_frags) {
                        CERROR(&lt;span class=&quot;code-quote&quot;&gt;&quot;RDMA has too many fragments &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; peer %s (%d), &quot;&lt;/span&gt;
                               &lt;span class=&quot;code-quote&quot;&gt;&quot;src idx/frags: %d/%d dst idx/frags: %d/%d\n&quot;&lt;/span&gt;,
                               libcfs_nid2str(conn-&amp;gt;ibc_peer-&amp;gt;ibp_nid),
                               conn-&amp;gt;ibc_max_frags,
                               srcidx, srcrd-&amp;gt;rd_nfrags,
                               dstidx, dstrd-&amp;gt;rd_nfrags);
                        rc = -EMSGSIZE;
                        &lt;span class=&quot;code-keyword&quot;&gt;break&lt;/span&gt;;
                }

                wrknob = MIN(MIN(kiblnd_rd_frag_size(srcrd, srcidx),
                                 kiblnd_rd_frag_size(dstrd, dstidx)), resid); &amp;lt;&amp;lt;&amp;lt; this line is the problem.

                sge = &amp;amp;tx-&amp;gt;tx_sge[tx-&amp;gt;tx_nwrq];
                sge-&amp;gt;addr   = kiblnd_rd_frag_addr(srcrd, srcidx);
                sge-&amp;gt;lkey   = kiblnd_rd_frag_key(srcrd, srcidx);
                sge-&amp;gt;length = wrknob;

                wrq = &amp;amp;tx-&amp;gt;tx_wrq[tx-&amp;gt;tx_nwrq];

                wrq-&amp;gt;next       = wrq + 1;
                wrq-&amp;gt;wr_id      = kiblnd_ptr2wreqid(tx, IBLND_WID_RDMA);
                wrq-&amp;gt;sg_list    = sge;
                wrq-&amp;gt;num_sge    = 1;
                wrq-&amp;gt;opcode     = IB_WR_RDMA_WRITE;
                wrq-&amp;gt;send_flags = 0;

                wrq-&amp;gt;wr.rdma.remote_addr = kiblnd_rd_frag_addr(dstrd, dstidx);
                wrq-&amp;gt;wr.rdma.rkey        = kiblnd_rd_frag_key(dstrd, dstidx);

                srcidx = kiblnd_rd_consume_frag(srcrd, srcidx, wrknob);
                dstidx = kiblnd_rd_consume_frag(dstrd, dstidx, wrknob);

                resid -= wrknob;

                tx-&amp;gt;tx_nwrq++;
                wrq++;
                sge++;
        }
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;So a source transfer with fragment sizes {128, 4096} and destination segments {4096}, {4096} will be mapped into {128}, {4096-128}, {128} segments. In general this needs twice as many SGEs/WRs to send the same amount of data, since one source segment now needs two destination segments on the router.&lt;br/&gt;
A single large buffer, however, would hold all the source fragments without any concern about their sizes.&lt;/p&gt;

&lt;p&gt;Perhaps someone can find a better solution by just fixing this loop, which would avoid the problem entirely without any protocol or memory pool changes.&lt;/p&gt;
</comment>
                            <comment id="166120" author="thomas.stibor" created="Thu, 15 Sep 2016 11:40:43 +0000"  >&lt;p&gt;Hi there,&lt;/p&gt;

&lt;p&gt;we encountered the &lt;tt&gt;RDMA too fragmented&lt;/tt&gt; problem without LNet routers on nearly all clients where a rather strange function call pattern was executed.&lt;br/&gt;
Here is an example from one of the clients.&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;...
Sep  2 06:05:28 lxbk0101 kernel: [2569481.266678] LNetError: 1407:0:(o2iblnd_cb.c:1140:kiblnd_init_rdma()) RDMA too fragmented &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; 10.20.0.250@o2ib1 (256): 241/256 src 241/256 dst frags
Sep  2 06:05:28 lxbk0101 kernel: [2569481.269159] LNetError: 1407:0:(o2iblnd_cb.c:1140:kiblnd_init_rdma()) Skipped 1 previous similar message
Sep  2 06:05:28 lxbk0101 kernel: [2569481.270325] LNetError: 1407:0:(o2iblnd_cb.c:1690:kiblnd_reply()) Can&apos;t setup rdma &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; GET from 10.20.0.250@o2ib1: -90
Sep  2 06:05:28 lxbk0101 kernel: [2569481.271498] LNetError: 1407:0:(o2iblnd_cb.c:1690:kiblnd_reply()) Skipped 1 previous similar message
....
....
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;As a consequence of the error, the Lustre client lost the corresponding OST with the message (lfs check osts): &lt;tt&gt;Resource temporarily unavailable (11)&lt;/tt&gt;&lt;br/&gt;
However, this behavior occurred only when the process was close to reaching the soft/hard quota. Switching quota completely off, or staying at least 50% away&lt;br/&gt;
from the soft/hard quota, did not trigger the problem.&lt;/p&gt;

&lt;p&gt;MDS/OSS are with Lustre 2.5.3, Clients are with Lustre 2.6&lt;/p&gt;

&lt;p&gt;The (strange) function call pattern is:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;.....
write(2, &lt;span class=&quot;code-quote&quot;&gt;&quot;Info in &amp;lt;CbmPlutoGenerator::Read&quot;&lt;/span&gt;..., 96) = 96
write(2, &lt;span class=&quot;code-quote&quot;&gt;&quot;Info in &amp;lt;CbmPlutoGenerator::Read&quot;&lt;/span&gt;..., 75) = 75
write(2, &lt;span class=&quot;code-quote&quot;&gt;&quot;Info in &amp;lt;CbmPlutoGenerator::Read&quot;&lt;/span&gt;..., 96) = 96
write(1, &lt;span class=&quot;code-quote&quot;&gt;&quot;BoxGen: kf=1000010020, p=(0.20, &quot;&lt;/span&gt;..., 285) = 285
write(1, &lt;span class=&quot;code-quote&quot;&gt;&quot; GTREVE_ROOT : Transporting prim&quot;&lt;/span&gt;..., 8151) = 8151
lseek(18, 580977933, SEEK_SET)          = 580977933
rt_sigaction(SIGINT, {SIG_IGN, [], SA_RESTORER, 0x7f0bfe5448d0}, {0x7f0bffd39980, [], SA_RESTORER|SA_RESTART, 0x7f0bfe5448d0}, 8) = 0
write(18, &lt;span class=&quot;code-quote&quot;&gt;&quot;\0\0005]\3\354\0.\353\24VO*\353\0V\0&amp;lt;\0\0\0\0\&quot;&lt;/span&gt;\241\5\r\0\0\0\0\0\0&quot;..., 13661) = 13661
rt_sigaction(SIGINT, {0x7f0bffd39980, [], SA_RESTORER|SA_RESTART, 0x7f0bfe5448d0}, NULL, 8) = 0
lseek(18, 580991594, SEEK_SET)          = 580991594
rt_sigaction(SIGINT, {SIG_IGN, [], SA_RESTORER, 0x7f0bfe5448d0}, {0x7f0bffd39980, [], SA_RESTORER|SA_RESTART, 0x7f0bfe5448d0}, 8) = 0
write(18, &lt;span class=&quot;code-quote&quot;&gt;&quot;\0\0;1\3\354\0.\353\24VO*\353\0R\0&amp;lt;\0\0\0\0\&quot;&lt;/span&gt;\241:j\0\0\0\0\0\0&quot;..., 15153) = 15153
rt_sigaction(SIGINT, {0x7f0bffd39980, [], SA_RESTORER|SA_RESTART, 0x7f0bfe5448d0}, NULL, 8) = 0
lseek(18, 581006747, SEEK_SET)          = 581006747
rt_sigaction(SIGINT, {SIG_IGN, [], SA_RESTORER, 0x7f0bfe5448d0}, {0x7f0bffd39980, [], SA_RESTORER|SA_RESTART, 0x7f0bfe5448d0}, 8) = 0
write(18, &lt;span class=&quot;code-quote&quot;&gt;&quot;\0\2\3Y\3\354\0.\353\24VO*\353\0W\0&amp;lt;\0\0\0\0\&quot;&lt;/span&gt;\241u\233\0\0\0\0\0\0&quot;..., 131929) = 131929
rt_sigaction(SIGINT, {0x7f0bffd39980, [], SA_RESTORER|SA_RESTART, 0x7f0bfe5448d0}, NULL, 8) = 0
lseek(18, 581138676, SEEK_SET)          = 581138676
rt_sigaction(SIGINT, {SIG_IGN, [], SA_RESTORER, 0x7f0bfe5448d0}, {0x7f0bffd39980, [], SA_RESTORER|SA_RESTART, 0x7f0bfe5448d0}, 8) = 0
write(18, &lt;span class=&quot;code-quote&quot;&gt;&quot;\0\3\4\16\3\354\0.\353\24VO*\353\0U\0&amp;lt;\0\0\0\0\&quot;&lt;/span&gt;\243x\364\0\0\0\0\0\0&quot;..., 197646) = 197646
rt_sigaction(SIGINT, {0x7f0bffd39980, [], SA_RESTORER|SA_RESTART, 0x7f0bfe5448d0}, NULL, 8) = 0
lseek(18, 581336322, SEEK_SET)          = 581336322
...
...
...
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We were able to simulate the pattern with a simple Ruby script and to reproduce the &lt;tt&gt;RDMA too fragmented&lt;/tt&gt; error:&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;#!/usr/bin/env ruby

# Generate a 1MB buffer
data = &lt;span class=&quot;code-object&quot;&gt;String&lt;/span&gt;.&lt;span class=&quot;code-keyword&quot;&gt;new&lt;/span&gt;
100000.times &lt;span class=&quot;code-keyword&quot;&gt;do&lt;/span&gt; |i|
  data &amp;lt;&amp;lt; &lt;span class=&quot;code-quote&quot;&gt;&apos;0123456789&apos;&lt;/span&gt;
end

File.open(ARGV.first, &lt;span class=&quot;code-quote&quot;&gt;&apos;w+&apos;&lt;/span&gt;) &lt;span class=&quot;code-keyword&quot;&gt;do&lt;/span&gt; |f|
  512.times &lt;span class=&quot;code-keyword&quot;&gt;do&lt;/span&gt; |i|
    offset = i * 1000000
    puts offset
    f.seek(offset, IO::SEEK_SET) # SEEK_SET seeks from the beginning of the file
    f.write(data) # the write call already sets the file cursor where we will seek to in the next cycle
  end
end
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The &lt;tt&gt;RDMA too fragmented&lt;/tt&gt; error was triggered after the 5th, 6th, or sometimes the 7th loop iteration.&lt;/p&gt;

&lt;p&gt;We are currently running more investigations on dedicated Lustre client machines. However, it looks like setting &lt;tt&gt;max_pages_per_rpc=64&lt;/tt&gt; =&amp;gt; 4 * 64 = 256, thus&lt;br/&gt;
matching it with the&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;./lnet/include/lnet/types.h:#define LNET_MAX_IOV    256
klnds/o2iblnd/o2iblnd.h:#define IBLND_MAX_RDMA_FRAGS         LNET_MAX_IOV           &lt;span class=&quot;code-comment&quot;&gt;/* max # of fragments supported */&lt;/span&gt;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;means the problem no longer occurs within the routine&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;&lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (tx-&amp;gt;tx_nwrq == IBLND_RDMA_FRAGS(conn-&amp;gt;ibc_version)) {
                        CERROR(&lt;span class=&quot;code-quote&quot;&gt;&quot;RDMA too fragmented &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; %s (%d): &quot;&lt;/span&gt;
                               &lt;span class=&quot;code-quote&quot;&gt;&quot;%d/%d src %d/%d dst frags\n&quot;&lt;/span&gt;,
                               libcfs_nid2str(conn-&amp;gt;ibc_peer-&amp;gt;ibp_nid),
                               IBLND_RDMA_FRAGS(conn-&amp;gt;ibc_version),
                               srcidx, srcrd-&amp;gt;rd_nfrags,
                               dstidx, dstrd-&amp;gt;rd_nfrags);
                        rc = -EMSGSIZE;
                        &lt;span class=&quot;code-keyword&quot;&gt;break&lt;/span&gt;;
                }
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;So far the problem seems to be gone after setting &lt;tt&gt;max_pages_per_rpc=64&lt;/tt&gt;.&lt;/p&gt;</comment>
                            <comment id="166623" author="simmonsja" created="Tue, 20 Sep 2016 20:43:58 +0000"  >&lt;p&gt;Setting max_pages_per_rpc=64 means you are below the 256-page limit, so you will never hit that issue.&lt;/p&gt;

&lt;p&gt;Doug I tried the new lnet selftest patch with offsets of 64 and 256 and I didn&apos;t see any fragmentation issues. Have you been able to reproduce the problem? &lt;/p&gt;</comment>
                            <comment id="166665" author="shadow" created="Wed, 21 Sep 2016 04:37:06 +0000"  >&lt;p&gt;James,&lt;/p&gt;

&lt;p&gt;did you use a 1MB transfer size for lnet selftest with offset?&lt;/p&gt;

&lt;p&gt;P.S. max_pages_per_rpc=128 should be enough to avoid the problem, as there are twice as many fragments on the router as pages.&lt;/p&gt;</comment>
                            <comment id="166979" author="doug" created="Thu, 22 Sep 2016 21:47:31 +0000"  >&lt;p&gt;Thomas: I have checked, and so far everyone who has seen this issue in production was using the quotas feature.  You may be on to something.  One theory is that quotas can trigger a syncio, which may cause a page misalignment.  I&apos;m going to look more into this.&lt;/p&gt;

&lt;p&gt;I have also run into two examples where the &quot;RDMA too fragmented&quot; error was encountered when a router was not used.&lt;/p&gt;

&lt;p&gt;Olaf: are you certain that the destination will match the same starting offset when a router is not present?  If so, we should never see this error without a router.&lt;/p&gt;

&lt;p&gt;If this issue can happen without a router, then the fix which uses large, contiguous buffers on routers won&apos;t cover all cases.  The original fix by Liang, however, will.&lt;/p&gt;

&lt;p&gt;I&apos;m proposing that we land that fix but make sge_wqe=1 the default, so it is off by default.  Then, anyone running into this error can turn on the fix on the systems which exhibit it.&lt;/p&gt;

&lt;p&gt;In the meantime, I will see if I can better understand how/why quotas are triggering this and see if that can be resolved.&lt;/p&gt;</comment>
                            <comment id="166980" author="doug" created="Thu, 22 Sep 2016 21:50:07 +0000"  >&lt;p&gt;Note: one example of this issue when routers are not present was on the discussion board: &lt;a href=&quot;https://www.mail-archive.com/lustre-discuss@lists.lustre.org/msg12963.html&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://www.mail-archive.com/lustre-discuss@lists.lustre.org/msg12963.html&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="166982" author="adilger" created="Thu, 22 Sep 2016 22:19:49 +0000"  >&lt;p&gt;When quotas are enabled, it is possible that a user hitting the quota limit will force sync writes from the client. That&apos;s similar to what happens when the client is doing O_DIRECT writes. &lt;/p&gt;

&lt;p&gt;It is a bit confusing, however, since I can&apos;t imagine why this would cause unaligned writes: the pages should have previously been fully fetched to the client, so the whole page should be written in this case and the write should be page aligned. AFAIK, only O_DIRECT should be able to generate partial-page writes; anything else is a bug IMHO. &lt;/p&gt;

&lt;p&gt;Rather than transferring all of the pages misaligned, my strong preference would be to fix the handling of the first page, and then send the rest of the pages properly. Is the inability to send the first partial page a problem at the Lustre level or the LNet level? If it is a bug in the way Lustre generates the bulk requests, then this could (also) be fixed, even if a temporary fix is needed in LNet as well. &lt;/p&gt;</comment>
                            <comment id="168295" author="gerrit" created="Wed, 5 Oct 2016 03:51:11 +0000"  >&lt;p&gt;Oleg Drokin (oleg.drokin@intel.com) merged in patch &lt;a href=&quot;http://review.whamcloud.com/12496/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/12496/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5718&quot; title=&quot;RDMA too fragmented with router&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5718&quot;&gt;&lt;del&gt;LU-5718&lt;/del&gt;&lt;/a&gt; lnet: add offset for selftest brw&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: efcef00cb3043b5f8661174fd80626b3dc0edc50&lt;/p&gt;</comment>
                            <comment id="186051" author="morrone" created="Thu, 23 Feb 2017 22:28:34 +0000"  >&lt;p&gt;Just a &quot;me too&quot;: We are hitting this with 2.8.0 on an OmniPath network.&#160; Client gets stuck with &quot;RDMA has too many fragments for peer&quot; sending to a router node.&#160; If it matters, we have one client isolated right now in this state.&#160; Hopefully the problem is understood at this point, but if you need some information gathering let us know what you want us to look for.&lt;/p&gt;

</comment>
                            <comment id="186071" author="simmonsja" created="Fri, 24 Feb 2017 00:13:56 +0000"  >&lt;p&gt;The problem is that the proposed patch breaks our systems. It causes our clients to go into a reconnect storm.&lt;/p&gt;</comment>
                            <comment id="186072" author="doug" created="Fri, 24 Feb 2017 00:44:22 +0000"  >&lt;p&gt;The summary of this issue is thus:&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;The patch associated with this ticket solves the problem on all node types, but as James is reporting above, is causing problems with his clients (PPC-based).&lt;/li&gt;
	&lt;li&gt;The patch associated with &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7385&quot; title=&quot;Bulk IO write error&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7385&quot;&gt;&lt;del&gt;LU-7385&lt;/del&gt;&lt;/a&gt; has been tested and works, but only for LNet routers. &#160;If you see this issue on other node types, that patch will not help you.&lt;/li&gt;
	&lt;li&gt;If neither of the above two patches works, then we will need to devise a third option, which does not yet exist.&lt;/li&gt;
	&lt;li&gt;Because there is no universal agreement on how to fix this problem, nothing has landed yet, so customers are running with patches. &#160;As such, we now have a bit of a &quot;grab-bag&quot; with regard to this issue.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Ultimately, it would be good to get some agreement on how to fix this and land something.&lt;/p&gt;</comment>
                            <comment id="186075" author="morrone" created="Fri, 24 Feb 2017 00:57:57 +0000"  >&lt;p&gt;This issue is explicitly about a client problem.&#160; So &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7385&quot; title=&quot;Bulk IO write error&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7385&quot;&gt;&lt;del&gt;LU-7385&lt;/del&gt;&lt;/a&gt; doesn&apos;t apply.&#160; Got it.&lt;/p&gt;

&lt;p&gt;&quot;The fix&quot;, I assume, refers to &lt;a href=&quot;https://review.whamcloud.com/#/c/12451&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;change 12451&lt;/a&gt;?&lt;/p&gt;

&lt;p&gt;We too have PPC clients.  Not sure if any of them will get Lustre 2.8+.   But I would not particularly like to gamble on a patch that Intel hasn&apos;t committed to yet.&lt;/p&gt;</comment>
                            <comment id="186077" author="doug" created="Fri, 24 Feb 2017 01:01:17 +0000"  >&lt;p&gt;Yes, the patch for this ticket is 12451. &#160;It is &quot;off&quot; by default and needs a module parameter to turn it on. &#160;As such, it should be safe.&lt;/p&gt;

&lt;p&gt;James: if this patch is off, do you have an issue with your clients? &#160;If so, then we have to debug that. &#160;If not, then technically we can land the patch, and those who want to use it just have to turn it on.&lt;/p&gt;</comment>
                            <comment id="186079" author="morrone" created="Fri, 24 Feb 2017 01:35:40 +0000"  >&lt;p&gt;Speaking as a customer, a &quot;fix&quot; that requires me to manually go in and change a configuration to tell Lustre &quot;no, really, please don&apos;t be broken&quot; is not a terribly satisfactory solution.  Please work on a solution that will make Lustre work out of the box.&lt;/p&gt;</comment>
                            <comment id="187375" author="simmonsja" created="Tue, 7 Mar 2017 19:41:16 +0000"  >&lt;p&gt;I attached my router logs that show the problem with this patch.&lt;/p&gt;</comment>
                            <comment id="187679" author="doug" created="Thu, 9 Mar 2017 18:23:59 +0000"  >&lt;p&gt;James: I looked at your router logs. &#160;All but rtr5 have no o2iblnd logs. &#160;Only gnilnd. &#160;Rtr5 has some new logs you must have added to debug this issue. &#160;Stuff like &quot;(130)++&quot; seems to be a counter you are keeping track of. &#160;What is it? &#160;Is it counting queue depth? &#160;Is that the problem you are running into?&lt;/p&gt;</comment>
                            <comment id="187698" author="simmonsja" created="Thu, 9 Mar 2017 19:29:22 +0000"  >&lt;p&gt;That is from&#160;kiblnd_conn_addref(), which is an inline function defined in o2iblnd.h.&lt;/p&gt;</comment>
                            <comment id="187738" author="doug" created="Fri, 10 Mar 2017 00:43:05 +0000"  >&lt;p&gt;I&apos;m not seeing the client reconnect storm in those logs. &#160;Is neterr logging turned off?&lt;/p&gt;</comment>
                            <comment id="191543" author="doug" created="Tue, 11 Apr 2017 16:47:57 +0000"  >&lt;p&gt;James: give the latest patch version, 10, a try on PPC. &#160;I believe I fixed the PPC issue with the patch.&lt;/p&gt;</comment>
                            <comment id="191764" author="sthiell" created="Wed, 12 Apr 2017 22:13:12 +0000"  >&lt;p&gt;Hi,&lt;/p&gt;

&lt;p&gt;We just hit this problem on brand new 2.9 clients, only&#160;on a bigmem node, leading to deadlocked writes on our /scratch. We are using EE3 servers with lnet routers (they are all already patched for this, see DELL-221).&lt;/p&gt;

&lt;p&gt;We think this is a fairly basic use case: only a few&#160;processes were reading from a single file and writing to multiple files, apparently doing (nice) 4M I/Os before the deadlock occurred. We took a crash dump, which is available for download at the following link:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://stanford.box.com/s/d37761k3ywukxh7im9mq8mgp9m2gkpga&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://stanford.box.com/s/d37761k3ywukxh7im9mq8mgp9m2gkpga&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It shows deadlocked writes after the RDMA too fragmented errors.&lt;/p&gt;

&lt;p&gt;Kernel version is&#160;3.10.0-514.10.2.el7.x86_64 on el7&lt;/p&gt;

&lt;p&gt;Hope this helps...&lt;/p&gt;

&lt;p&gt;Stephane&lt;/p&gt;

</comment>
                            <comment id="191903" author="simmonsja" created="Thu, 13 Apr 2017 17:25:45 +0000"  >&lt;p&gt;Currently my test system that has this problem is down until the middle of next week. As soon as it is back I will test it.&lt;/p&gt;</comment>
                            <comment id="193502" author="gerrit" created="Wed, 26 Apr 2017 03:36:42 +0000"  >&lt;p&gt;Oleg Drokin (oleg.drokin@intel.com) merged in patch &lt;a href=&quot;https://review.whamcloud.com/12451/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/12451/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5718&quot; title=&quot;RDMA too fragmented with router&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5718&quot;&gt;&lt;del&gt;LU-5718&lt;/del&gt;&lt;/a&gt; o2iblnd: multiple sges for work request&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: fda19c748016c9f57f71278b597fd8a651268f66&lt;/p&gt;</comment>
                            <comment id="193548" author="pjones" created="Wed, 26 Apr 2017 10:57:10 +0000"  >&lt;p&gt;Landed for 2.10&lt;/p&gt;</comment>
                            <comment id="193603" author="simmonsja" created="Wed, 26 Apr 2017 15:33:01 +0000"  >&lt;p&gt;Just to let you know, I&apos;m in the process of testing this patch, and the latest version seems to be holding up. Good work, Doug.&lt;/p&gt;</comment>
                            <comment id="194350" author="dmiter" created="Wed, 3 May 2017 18:00:58 +0000"  >&lt;p&gt;After the last patch landed I got the following error:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[4020251.265904] LNetError: 95052:0:(o2iblnd_cb.c:1086:kiblnd_init_rdma()) RDMA is too large for peer 192.168.213.235@o2ib (131072), src size: 1048576 dst size: 1048576
[4020251.265941] LNetError: 95050:0:(o2iblnd_cb.c:1720:kiblnd_reply()) Can&apos;t setup rdma for GET from 192.168.213.235@o2ib: -90
[4020251.265948] LustreError: 95050:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff8816e0754c00
...
[4020251.267492] Lustre: 95098:0:(client.c:2115:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1493833318/real 1493833318]  req@ffff8817e9e48000 x1566397691863184/t0(0) o4-&amp;gt;
[4020251.267503] Lustre: nvmelfs-OST000b-osc-ffff881c8e362000: Connection to nvmelfs-OST000b (at 192.168.213.236@o2ib) was lost; in progress operations using this service will wait for recovery to complete
...
[4020251.267965] LustreError: 95050:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff880223361400
[4020251.268058] Lustre: nvmelfs-OST000b-osc-ffff881c8e362000: Connection restored to 192.168.213.236@o2ib (at 192.168.213.236@o2ib)
...
[4020256.133400] LNetError: 95052:0:(o2iblnd_cb.c:1086:kiblnd_init_rdma()) RDMA is too large for peer 192.168.213.235@o2ib (131072), src size: 1048576 dst size: 1048576
[4020256.133561] LNetError: 95049:0:(o2iblnd_cb.c:1720:kiblnd_reply()) Can&apos;t setup rdma for GET from 192.168.213.235@o2ib: -90
[4020256.133564] LNetError: 95049:0:(o2iblnd_cb.c:1720:kiblnd_reply()) Skipped 159 previous similar messages
[4020256.133569] LustreError: 95049:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff88192932fe00
[4020256.133630] Lustre: 95125:0:(client.c:2115:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1493833323/real 1493833323]  req@ffff882031360300 x1566397691866144/t0(0) o4-&amp;gt;
[4020256.133634] Lustre: 95125:0:(client.c:2115:ptlrpc_expire_one_request()) Skipped 39 previous similar messages
[4020256.133654] Lustre: nvmelfs-OST000e-osc-ffff881c8e362000: Connection to nvmelfs-OST000e (at 192.168.213.235@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[4020256.133656] Lustre: Skipped 39 previous similar messages
[4020256.134200] Lustre: nvmelfs-OST000e-osc-ffff881c8e362000: Connection restored to 192.168.213.235@o2ib (at 192.168.213.235@o2ib)
[4020256.134202] Lustre: Skipped 39 previous similar messages
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The system is partially working. I&apos;m able to see the list of files and open small files. But large bulk transfers don&apos;t work.&lt;/p&gt;</comment>
                            <comment id="194352" author="doug" created="Wed, 3 May 2017 18:05:21 +0000"  >&lt;p&gt;This was addressed by a patch to &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-9420&quot; title=&quot;Bad Check slipped into repo&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-9420&quot;&gt;&lt;del&gt;LU-9420&lt;/del&gt;&lt;/a&gt;. &#160;I would have pulled this patch to fix it under this ticket, but the patch took 2 years to land and I was not about to pull it for fear it would take another 2 years to re-land :^(.&lt;/p&gt;</comment>
                            <comment id="194355" author="dmiter" created="Wed, 3 May 2017 18:11:44 +0000"  >&lt;p&gt;Uff. It looks like I was lucky to be using a build without this fix. &lt;img class=&quot;emoticon&quot; src=&quot;https://jira.whamcloud.com/images/icons/emoticons/sad.png&quot; height=&quot;16&quot; width=&quot;16&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/p&gt;</comment>
                            <comment id="197115" author="sthiell" created="Thu, 25 May 2017 19:56:48 +0000"  >&lt;p&gt;Hi,&lt;br/&gt;
Could you please explain what is required to make the patches that landed work? We have tried 2.9 FE + patches from both &lt;del&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5718&quot; title=&quot;RDMA too fragmented with router&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5718&quot;&gt;&lt;del&gt;LU-5718&lt;/del&gt;&lt;/a&gt;&lt;/del&gt; and &lt;del&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-9420&quot; title=&quot;Bad Check slipped into repo&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-9420&quot;&gt;&lt;del&gt;LU-9420&lt;/del&gt;&lt;/a&gt;&lt;/del&gt; but are still seeing the problem on the routers. We have set wrq_sge=2 on the clients, and left the default wrq_sge=1 on the routers. We are not able to patch the servers at the moment (running IEEL3), see DELL-221.&lt;/p&gt;

&lt;p&gt;On the router with wrq_sge=1 (10.210.34.213@o2ib1 is an unpatched OSS):&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;[ 1111.504575] LNetError: 8688:0:(o2iblnd_cb.c:1093:kiblnd_init_rdma()) RDMA has too many fragments for peer 10.210.34.213@o2ib1 (256), src idx/frags: 128/147 dst idx/frags: 128/147
[ 1111.522352] LNetError: 8688:0:(o2iblnd_cb.c:430:kiblnd_handle_rx()) Can&apos;t setup rdma for PUT to 10.210.34.213@o2ib1: -90
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Clients and routers are using mlx5, servers are using mlx4.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
Stephane&lt;/p&gt;</comment>
                            <comment id="197116" author="hornc" created="Thu, 25 May 2017 20:00:09 +0000"  >&lt;p&gt;You need to set wrq_sge=2 on the routers, too.&lt;/p&gt;</comment>
                            <comment id="197117" author="doug" created="Thu, 25 May 2017 20:01:18 +0000"  >&lt;p&gt;Your router needs wrq_sge=2. &lt;/p&gt;</comment>
                            <comment id="197118" author="srcc" created="Thu, 25 May 2017 20:05:49 +0000"  >&lt;p&gt;Ah! Thanks for the clarification, Chris and Doug! I was a bit lost, as the parameters changed over the course of the work on this ticket. We&apos;ll test this right away.&lt;br/&gt;
All the best,&lt;br/&gt;
Stephane&lt;/p&gt;</comment>
                            <comment id="198320" author="spitzcor" created="Tue, 6 Jun 2017 15:47:26 +0000"  >&lt;p&gt;Looks like we should have opened a LUDOC ticket to document wrq_sge.&lt;/p&gt;</comment>
                            <comment id="198789" author="spitzcor" created="Fri, 9 Jun 2017 20:25:32 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LUDOC-378&quot; title=&quot;Document wrq_sge as an o2iblnd parameter&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LUDOC-378&quot;&gt;&lt;del&gt;LUDOC-378&lt;/del&gt;&lt;/a&gt; is linked to this issue.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10010">
                    <name>Duplicate</name>
                                            <outwardlinks description="duplicates">
                                        <issuelink>
            <issuekey id="32997">LU-7385</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="18907">LU-3322</issuekey>
        </issuelink>
                            </outwardlinks>
                                                                <inwardlinks description="is duplicated by">
                                                        </inwardlinks>
                                    </issuelinktype>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="32327">LU-7210</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="33736">LU-7569</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="34042">LU-7650</issuekey>
        </issuelink>
                            </outwardlinks>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="45781">LU-9420</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="49357">LU-10252</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="33033">LU-7401</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="46539">LUDOC-378</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="55913">LU-12419</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="25750" name="4james.tgz" size="5835421" author="simmonsja" created="Tue, 7 Mar 2017 19:40:50 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10490" key="com.atlassian.jira.plugin.system.customfieldtypes:datepicker">
                        <customfieldname>End date</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>Wed, 23 Dec 2015 18:43:06 +0000</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                            <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzwy3z:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>16043</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                        <customfield id="customfield_10493" key="com.atlassian.jira.plugin.system.customfieldtypes:datepicker">
                        <customfieldname>Start date</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>Wed, 8 Oct 2014 18:43:06 +0000</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                    </customfields>
    </item>
</channel>
</rss>