<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:38:05 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-10775] (sec.c:2363:sptlrpc_svc_unwrap_bulk()) @@@ truncated bulk GET 1048576(2097152)</title>
                <link>https://jira.whamcloud.com/browse/LU-10775</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Running IOR on a locally built Lustre b2_10 branch at commit&#160;0f6c448, the first couple of data transfers work, but they quickly start to fail with server-side messages like:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;(sec.c:2363:sptlrpc_svc_unwrap_bulk()) @@@ truncated bulk GET 1048576(4194304) &#160;req@ffff880f052d8050 x1593867370500512/t0(0) o4-&amp;gt;d0c9fb64-cf93-52c4-8daf-a80ac8484f6b@194.1.0.2@o2ib4:76/0 lens 608/448 e 0 to 0 dl 1520037046 ref 1 fl Interpret:H/2/0 rc 0/0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
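&lt;p&gt;A minimal sketch for reading the message above. It assumes the first number is the byte count actually transferred and the parenthesized number is the byte count the server expected, so this line reports 1 MiB received out of 4 MiB. The names decode_truncated_bulk and TRUNCATED_RE are hypothetical helpers, not part of Lustre.&lt;/p&gt;
&lt;pre&gt;
```python
import re

# Assumed message shape: "truncated bulk GET TRANSFERRED(EXPECTED)".
TRUNCATED_RE = re.compile(r"truncated bulk GET (\d+)\((\d+)\)")
MIB = 1024 * 1024


def decode_truncated_bulk(line):
    """Return (transferred_bytes, expected_bytes), or None if no match."""
    m = TRUNCATED_RE.search(line)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))


if __name__ == "__main__":
    log = ("(sec.c:2363:sptlrpc_svc_unwrap_bulk()) @@@ truncated bulk GET "
           "1048576(4194304)")
    got, want = decode_truncated_bulk(log)
    # Express both byte counts in MiB for readability.
    print(f"transferred {got / MIB:g} MiB of {want / MIB:g} MiB")
```
&lt;/pre&gt;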
&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;Configure argument: --disable-gss&#160;&lt;/p&gt;

&lt;p&gt;Module options are all defaults on both sides. Perhaps something needs to be changed for the ARM client?&lt;/p&gt;

&lt;p&gt;The server has the MDT and 3 OSTs on one node for testing; there are no LNet routers.&lt;/p&gt;

&lt;p&gt;Connections are InfiniBand (mlx5).&lt;/p&gt;</description>
                <environment>RHEL 7.4 ARM client vs x86 server</environment>
        <key id="51118">LU-10775</key>
            <summary>(sec.c:2363:sptlrpc_svc_unwrap_bulk()) @@@ truncated bulk GET 1048576(2097152)</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="simmonsja">James A Simmons</assignee>
                                    <reporter username="ruth.klundt@gmail.com">Ruth Klundt</reporter>
                        <labels>
                    </labels>
                <created>Mon, 5 Mar 2018 19:05:53 +0000</created>
                <updated>Wed, 19 Dec 2018 20:53:59 +0000</updated>
                            <resolved>Mon, 16 Apr 2018 13:56:49 +0000</resolved>
                                    <version>Lustre 2.10.3</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>7</watches>
                    <comments>
                            <comment id="222474" author="ruth.klundt@gmail.com" created="Tue, 6 Mar 2018 01:15:29 +0000"  >&lt;p&gt;It appears that the page size is 64K on the ARM client, so the workaround of reducing max_pages_per_rpc to 16 gets rid of this problem (16 pages of 64 KiB is 1 MiB, the same as 256 pages of 4 KiB).&#160;&lt;/p&gt;</comment>
                            <comment id="222478" author="simmonsja" created="Tue, 6 Mar 2018 02:05:08 +0000"  >&lt;p&gt;I see the exact same errors. I thought this issue was something else. I have an early patch to resolve this, but it&apos;s not complete. Let me finish up another thing I&apos;m working on and then I will look into it.&lt;/p&gt;</comment>
                            <comment id="222649" author="simmonsja" created="Tue, 6 Mar 2018 23:06:42 +0000"  >&lt;p&gt;Give patch &lt;a href=&quot;https://review.whamcloud.com/#/c/31559&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/31559&lt;/a&gt; a try&lt;/p&gt;</comment>
                            <comment id="222704" author="ruth.klundt@gmail.com" created="Wed, 7 Mar 2018 15:57:24 +0000"  >&lt;p&gt;Thanks James, will do.&lt;/p&gt;</comment>
                            <comment id="224052" author="ruth.klundt@gmail.com" created="Tue, 20 Mar 2018 17:28:33 +0000"  >&lt;p&gt;Sorry for the delay. The cluster has moved ahead to a newer MOFED version (MLNX_OFED_LINUX-4.2-1.4.6.0), and building and loading ko2iblnd with insmod is now problematic. I&apos;m not sure whether my build is wrong: pointing configure away from /usr/src/ofa-kernel to the actual kernel rpmbuild directory lets Lustre configure and build, but insmod then produces lots of these:&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;170344.847903&amp;#93;&lt;/span&gt; ko2iblnd: disagrees about version of symbol ib_create_cq&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;170344.847905&amp;#93;&lt;/span&gt; ko2iblnd: Unknown symbol ib_create_cq (err -22)&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;After restoring the configure default for o2ib (/usr/src/ofa-kernel), &apos;make&apos; works and the resulting ko2iblnd.ko loads. However, lctl ping generates a server-side error:&lt;/p&gt;

&lt;p&gt;LNet: 14703:0:(o2iblnd_cb.c:2355:kiblnd_passive_connect()) Can&apos;t accept conn from 194.1.0.2@o2ib4 (version 12): max_frags 16 incompatible without FMR pool (256 wanted)&lt;/p&gt;

&lt;p&gt;The server is running a 2.10-ish commit, 2f379be, without your patch. I guess I should have patched both sides.&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;</comment>
                            <comment id="224216" author="simmonsja" created="Wed, 21 Mar 2018 19:53:31 +0000"  >&lt;p&gt;Sadly, Lustre 2.10 is missing the patches needed to make this work properly. A bunch of fixes went into 2.11 to make Lustre work with newer OFED stacks and newer kernel IB stacks. I have been testing 2.11 with my one additional patch.&lt;/p&gt;</comment>
                            <comment id="224218" author="ruth.klundt@gmail.com" created="Wed, 21 Mar 2018 20:35:12 +0000"  >&lt;p&gt;I was at 2.11 RC1 with the patch on the client side, 2.10 without the patch on the server.&#160;&lt;/p&gt;

&lt;p&gt;After removing the patch, the client connects fine. I&apos;m leaving all module options at their defaults.&#160;&lt;/p&gt;

&lt;p&gt;This seems like an interop issue to me.&#160;&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;</comment>
                            <comment id="224225" author="simmonsja" created="Wed, 21 Mar 2018 21:53:12 +0000"  >&lt;p&gt;That is unexpected, considering x86 sends 1 MB packets with and without the patch. It&apos;s ARM/Power8 that sends 16 MB packets. I can tell you that the patch on x86 platforms works with x86 systems without the patch: I have run the upstream client, which lacks the patch, against patched servers. So we have:&lt;/p&gt;

&lt;p&gt;patched x86 &amp;lt;-&amp;gt; patched x86 works&lt;/p&gt;

&lt;p&gt;unpatched x86 &amp;lt;-&amp;gt; unpatched x86 works&lt;/p&gt;

&lt;p&gt;unpatched x86 &amp;lt;-&amp;gt; patched x86 works&lt;/p&gt;

&lt;p&gt;patched x86 &amp;lt;-&amp;gt; unpatched x86 untested, should work&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;patched ARM &amp;lt;-&amp;gt; patched x86 works&lt;/p&gt;

&lt;p&gt;unpatched ARM &amp;lt;-&amp;gt; unpatched x86 fails&lt;/p&gt;

&lt;p&gt;unpatched ARM &amp;lt;-&amp;gt; patched x86 untested, should fail since ARM is not addressed&lt;/p&gt;

&lt;p&gt;patched ARM &amp;lt;-&amp;gt; unpatched x86 fails&lt;/p&gt;

&lt;p&gt;Did you try the server side with the patch to see if it works?&lt;/p&gt;</comment>
                            <comment id="224302" author="simmonsja" created="Thu, 22 Mar 2018 18:34:51 +0000"  >&lt;p&gt;Actually, I realized I have been testing with an unpatched 2.11 server, and it does work. The problem is that Lustre 2.10 is missing a bunch of fixes needed to properly support newer MOFED stacks. Things like queue pair management and map_on_demand have changed dramatically. Amir, can you put together a list of the patches 2.10 is missing to make this work?&lt;/p&gt;</comment>
                            <comment id="224315" author="ruth.klundt@gmail.com" created="Thu, 22 Mar 2018 19:36:37 +0000"  >&lt;p&gt;Whoa, thanks, but there&apos;s no need to patch 2.10 to make this work. I&apos;m fine with moving the server to 2.11; it&apos;s a tiny toy file system with no routers.&lt;/p&gt;

&lt;p&gt;I&apos;m far from grasping the whole map_on_demand thing, but maybe I just needed to set it to 256; I don&apos;t think I did that.&lt;/p&gt;

&lt;p&gt;P.S. The test cluster is being worked on again, so the next try will be a while.&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;</comment>
                            <comment id="224584" author="ashehata" created="Mon, 26 Mar 2018 22:26:19 +0000"  >&lt;p&gt;Ruth, is the server side running RHEL 7.2 or earlier?&lt;/p&gt;

&lt;p&gt;Looking through the code the reason you&apos;d get:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;LNet: 14703:0:(o2iblnd_cb.c:2355:kiblnd_passive_connect()) Can&apos;t accept conn from 194.1.0.2@o2ib4 (version 12): max_frags 16 incompatible without FMR pool (256 wanted) &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;is because you&apos;re not using FMR. This would occur if&#160;HAVE_IB_GET_DMA_MR is defined. I believe this is defined for RHEL 7.2 and earlier.&lt;/p&gt;

&lt;p&gt;You should be able to avoid this issue by setting map_on_demand to 16 on the server side as well.&lt;/p&gt;
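&lt;p&gt;The failure and the workaround can be modeled in a few lines. This is a simplified, hypothetical sketch of the check implied by the log message, not the o2iblnd source: it assumes the server wants 256 fragments by default (or the map_on_demand value when one is set) and rejects a peer that offers fewer fragments when no FMR pool is available. The real kiblnd_passive_connect() may apply additional checks, and wanted_frags() and accepts() are illustrative names only.&lt;/p&gt;
&lt;pre&gt;
```python
# Hypothetical model of the o2iblnd max_frags check described in this
# ticket. Names and logic are illustrative assumptions, not Lustre code.
DEFAULT_WANTED_FRAGS = 256  # the "256 wanted" in the server-side log


def wanted_frags(map_on_demand=0):
    """Fragments the modeled server expects: map_on_demand if set, else 256."""
    return map_on_demand if map_on_demand else DEFAULT_WANTED_FRAGS


def accepts(peer_max_frags, map_on_demand=0, have_fmr_pool=False):
    """Return True if the modeled server accepts the peer's max_frags."""
    if wanted_frags(map_on_demand) > peer_max_frags and not have_fmr_pool:
        # Mirrors "max_frags 16 incompatible without FMR pool (256 wanted)".
        return False
    return True


if __name__ == "__main__":
    print(accepts(16))                    # 64K-page client, server defaults
    print(accepts(16, map_on_demand=16))  # server with map_on_demand=16
```
&lt;/pre&gt;
&lt;p&gt;In this model the 16-fragment client is rejected by a default server without FMR and accepted once the server sets map_on_demand=16, which matches the result reported later in this ticket.&lt;/p&gt;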

&lt;p&gt;Can you try that and see if it resolves the issue?&lt;/p&gt;

&lt;p&gt;James, I consider the map-on-demand changes to be a mini-feature. I&apos;m not sure backporting them to 2.10 is the best decision.&lt;/p&gt;

&lt;p&gt;However, we might consider porting the patch below to 2.10, because it fixes a bug:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;LU-10213 lnd: calculate qp max_send_wrs properly &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="224847" author="simmonsja" created="Thu, 29 Mar 2018 23:45:10 +0000"  >&lt;p&gt;Setting map_on_demand to 16 is not going to help; I have tried it before. We are going to need the map-on-demand changes for b2_10. As pointed out, using the 64K page patch that is slated for 2.12 will break interop with x86 2.10 servers when using ARM clients, since 2.10 lacks all the changes needed to make it possible. So we have a choice here: state that ARM clients require at least a 2.11 server, or backport a bunch of o2iblnd patches to make it possible. Also, many of the changes missing from 2.10 are what make newer MOFED stacks usable. Do we say you have to stay on a MOFED 3.x version for 2.10?&lt;/p&gt;</comment>
                            <comment id="224882" author="ruth.klundt@gmail.com" created="Fri, 30 Mar 2018 19:00:00 +0000"  >&lt;p&gt;Amir, the server side is RHEL 7.4, and I built 2.10 at commit&#160;0f6c448. The OFED version is&#160;MLNX_OFED_LINUX-4.2-1.0.0.0.&lt;/p&gt;

&lt;p&gt;configure reports &apos;yes&apos; when checking whether &apos;ib_get_dma_mr&apos; exists, but it also prints:&lt;/p&gt;

&lt;p&gt;WARNING: &quot;ib_get_dma_mr&quot; &lt;span class=&quot;error&quot;&gt;&amp;#91;/build_area/lustre-release/build/conftest.ko&amp;#93;&lt;/span&gt; undefined!&lt;/p&gt;

&lt;p&gt;&amp;gt; nm /lib/modules/3.10.0-693.el7.x86_64/extra/mlnx-ofa_kernel/drivers/infiniband/core/ib_core.ko | grep ib_get_dma&lt;/p&gt;

&lt;p&gt;0000000000006e50 T ib_get_dma_mr&lt;/p&gt;

&lt;p&gt;Setting map_on_demand=16 on the server works and traffic is moving. Thanks! (I guess that would not work if other clients with a different setting were mounting, though.)&lt;/p&gt;

&lt;p&gt;The client side is now:&lt;/p&gt;

&lt;p&gt;2.11.0_RC2 + &apos;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-10157&quot; title=&quot;LNET_MAX_IOV hard coded to 256&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-10157&quot;&gt;&lt;del&gt;LU-10157&lt;/del&gt;&lt;/a&gt; lnet: make LNET_MAX_IOV dependent on page size&apos;&lt;/p&gt;

&lt;p&gt;&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; +&#160;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-10560&quot; title=&quot;Fixes for 4.14 kernel&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-10560&quot;&gt;&lt;del&gt;LU-10560&lt;/del&gt;&lt;/a&gt; libcfs: Use kernel_write when appropriate&lt;/p&gt;

&lt;p&gt;rhel7.5, kernel 4.14.0-49.el7a.aarch64 and MLNX_OFED_LINUX-4.3-1.0.1.0&#160;&lt;/p&gt;
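&lt;p&gt;The arithmetic behind LU-10157 and the earlier max_pages_per_rpc=16 workaround can be sketched as follows. This assumes a 1 MiB LNet payload limit and that the patch sizes LNET_MAX_IOV as that limit divided by the page size; lnet_max_iov() and rpc_bytes() are illustrative helpers, not Lustre functions. With 4 KiB x86 pages the limit works out to 256 fragments, and with this client&apos;s 64 KiB pages it is 16, which matches the max_frags values (16 offered, 256 wanted) in the server-side message earlier in the ticket.&lt;/p&gt;
&lt;pre&gt;
```python
# Illustrative sizing arithmetic; LNET_MTU = 1 MiB is an assumption here.
LNET_MTU = 1024 * 1024


def lnet_max_iov(page_size):
    """Pages (fragments) needed to carry one LNet MTU."""
    return LNET_MTU // page_size


def rpc_bytes(max_pages_per_rpc, page_size):
    """Bulk RPC size implied by a max_pages_per_rpc setting."""
    return max_pages_per_rpc * page_size


if __name__ == "__main__":
    for arch, page_size in (("x86_64", 4096), ("aarch64 (64K pages)", 65536)):
        print(arch,
              "max_iov:", lnet_max_iov(page_size),
              "256 pages:", rpc_bytes(256, page_size),
              "16 pages:", rpc_bytes(16, page_size))
```
&lt;/pre&gt;
&lt;p&gt;On the 64 KiB client, 256 pages per RPC yields 16 MiB RPCs (the 16 MB packets mentioned above), while 16 pages brings the RPC back to 1 MiB.&lt;/p&gt;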

&lt;p&gt;&#160;&lt;/p&gt;</comment>
                            <comment id="225220" author="simmonsja" created="Thu, 5 Apr 2018 17:25:30 +0000"  >&lt;p&gt;Ruth can you join the LWG call today?&lt;/p&gt;</comment>
                            <comment id="225221" author="ruth.klundt@gmail.com" created="Thu, 5 Apr 2018 17:27:13 +0000"  >&lt;p&gt;yes I&apos;ll be on&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;</comment>
                            <comment id="226078" author="simmonsja" created="Mon, 16 Apr 2018 13:56:55 +0000"  >&lt;p&gt;Now that &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-10157&quot; title=&quot;LNET_MAX_IOV hard coded to 256&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-10157&quot;&gt;&lt;del&gt;LU-10157&lt;/del&gt;&lt;/a&gt; has landed, this ticket can be closed. As a note, we need to document on the wiki that ARM/Power8 clients require map_on_demand to be set to 16 on the back-end x86 servers running Lustre 2.10.&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10010">
                    <name>Duplicate</name>
                                                                <inwardlinks description="is duplicated by">
                                        <issuelink>
            <issuekey id="48924">LU-10157</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                    <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzztvz:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>