<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:48:33 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-11972] Lustre IB client always hangs when memory size is smaller than 40GB</title>
                <link>https://jira.whamcloud.com/browse/LU-11972</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;I created a VM with IB and mounted the Lustre client successfully.&lt;/p&gt;

&lt;p&gt;We tested Lustre client I/O in the VM (dd if=/dev/zero of=/mnt/lustre/testfile bs=1M), &lt;br/&gt;
but the Lustre client over IB always hangs when the VM has only 40GB of memory.&lt;/p&gt;

&lt;p&gt;This issue cannot be reproduced when the VM memory size is 80GB.&lt;/p&gt;

&lt;p&gt;We tested with mlx5_core 4.4-2.0.7 / 4.5-1.0.1.0 and Lustre 2.10.5 / 2.10.6; &lt;br/&gt;
all of them show the same problem when the VM has only 40GB of memory.&lt;/p&gt;

&lt;p&gt;The VM syslog prints these messages (see attached file &lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/32007/32007_ib_err.txt&quot; title=&quot;ib_err.txt attached to LU-11972&quot;&gt;ib_err.txt&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt;):&lt;br/&gt;
mlx5_0:dump_cqe:285:(pid 1854): dump error cqe&lt;/p&gt;

&lt;p&gt;.....&lt;/p&gt;

&lt;p&gt;LustreError: 1854:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff8bafb2976400&lt;br/&gt;
Jan 21 17:15:14 slurm-vm-1 kernel: Lustre: 1860:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error:&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;We received the following response from Mellanox:&lt;/p&gt;

&lt;p&gt;&amp;gt;&amp;gt;Our RnD&#160;review the syslog&#160;and checked code, they give conclusion below, FYI. I think &amp;gt;&amp;gt;this issue is related with uplevel&#160;Lustre&#160;design, you can open defect for their&#160;&lt;font color=&quot;#3e3e3c&quot;&gt;community to fix.&#160;&lt;/font&gt;&lt;br/&gt;
&lt;font color=&quot;#3e3e3c&quot;&gt;&amp;gt;&amp;gt;Parsing CQE shows that it is&#160;&#160;local protection error &#8211; ERR_EXE_BIND_GAHER_TPT.&lt;/font&gt;&lt;br/&gt;
&lt;font color=&quot;#3e3e3c&quot;&gt;&amp;gt;&amp;gt;If an application gets a local protection error, in most cases&#160;it means that it is using wrong/insufficient mkey in the WR.&lt;/font&gt;&lt;br/&gt;
&lt;font color=&quot;#3e3e3c&quot;&gt;&amp;gt;&amp;gt;In this case the application is Lustre - which is open source and maintained by the community, not Mellanox.&lt;/font&gt; &lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;</description>
                <environment>[VM] &lt;br/&gt;
CentOS Linux release 7.5.1804 (Core) &lt;br/&gt;
3.10.0-862.14.4.el7.x86_64 &lt;br/&gt;
CPU: Intel Xeon Processor (Skylake) 6 cores &lt;br/&gt;
MemTotal : 40GB &lt;br/&gt;
mlx5_core: 4.4-2.0.7 / 4.5-1.0.1.0 &lt;br/&gt;
Lustre : 2.10.5 / 2.10.6&lt;br/&gt;
</environment>
        <key id="54898">LU-11972</key>
            <summary>Lustre IB client always hangs when memory size is smaller than 40GB</summary>
                <type id="9" iconUrl="https://jira.whamcloud.com/images/icons/issuetypes/undefined.png">Question/Request</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="5">Cannot Reproduce</resolution>
                                        <assignee username="pjones">Peter Jones</assignee>
                                    <reporter username="sebg-crd-pm">sebg-crd-pm</reporter>
                        <labels>
                    </labels>
                <created>Fri, 15 Feb 2019 02:45:20 +0000</created>
                <updated>Mon, 1 Apr 2019 03:52:06 +0000</updated>
                            <resolved>Mon, 1 Apr 2019 03:52:06 +0000</resolved>
                                                                        <due></due>
                            <votes>0</votes>
                                    <watches>2</watches>
                                                                            <comments>
                            <comment id="242089" author="pfarrell" created="Fri, 15 Feb 2019 18:25:00 +0000"  >&lt;p&gt;In order to understand the error, we&apos;d like to get some Lustre debug logs with appropriate tracing.&lt;/p&gt;

&lt;p&gt;Please run these commands on the client:&lt;/p&gt;

&lt;p&gt;lctl set_param debug=+rpctrace; lctl set_param debug=+net; lctl clear&lt;/p&gt;

&lt;p&gt;lctl mark &quot;debug start&quot;&lt;/p&gt;

&lt;p&gt;Run your dd test:&lt;/p&gt;


&lt;p&gt;dd if=/dev/zero of=/mnt/lustre/testfile bs=1M&lt;/p&gt;

&lt;p&gt;lctl mark &quot;debug finish&quot;&lt;/p&gt;

&lt;p&gt;lctl set_param debug=-rpctrace; lctl set_param debug=-net&lt;/p&gt;

&lt;p&gt;Write out the log file:&lt;/p&gt;


&lt;p&gt;lctl dk &amp;gt; /tmp/log&lt;/p&gt;
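
&lt;p&gt;If it is easier, the steps above can be wrapped in one script. This is only a minimal sketch, not a supported tool: the names run, collect_debug_log, APPLY, DD_OUT and LOG_FILE are made up for illustration, and it assumes that passing the file name to lctl dk as an argument is equivalent to redirecting its output. By default it only prints the commands it would run.&lt;/p&gt;

```shell
#!/bin/sh
# Sketch only: wraps the debug-log collection steps from this comment.
# APPLY, DD_OUT and LOG_FILE are hypothetical variables, not Lustre settings.
APPLY="${APPLY:-0}"                        # 0 = print commands only, 1 = execute them
DD_OUT="${DD_OUT:-/mnt/lustre/testfile}"   # dd target used in this ticket
LOG_FILE="${LOG_FILE:-/tmp/log}"           # where the debug log is written

# Execute a command when APPLY=1, otherwise just show it.
run() {
    if [ "$APPLY" = "1" ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

collect_debug_log() {
    # Enable the extra tracing and clear the old debug buffer.
    run lctl set_param debug=+rpctrace
    run lctl set_param debug=+net
    run lctl clear
    run lctl mark "debug start"
    # Reproduce the problem with the same dd test as in the report.
    run dd if=/dev/zero of="$DD_OUT" bs=1M
    run lctl mark "debug finish"
    # Turn the extra tracing off again.
    run lctl set_param debug=-rpctrace
    run lctl set_param debug=-net
    # Assumption: a file name argument here behaves like the redirect above.
    run lctl dk "$LOG_FILE"
}

collect_debug_log
```

&lt;p&gt;Run it once as-is to review the printed commands, then run it as root on the client with APPLY=1 set in the environment to actually collect the log.&lt;/p&gt;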

&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;Please attach the log file to this ticket (you may need to compress it first).&#160; This will give us more info to go on.&lt;/p&gt;</comment>
                            <comment id="242302" author="sebg-crd-pm" created="Wed, 20 Feb 2019 00:36:07 +0000"  >&lt;p&gt;Update:&lt;/p&gt;

&lt;p&gt;1. This issue now also occurs when the VM memory size is 80GB.&lt;/p&gt;

&lt;p&gt;2. It seems easier to reproduce after we added an LNet router node.&lt;/p&gt;

&lt;p&gt;3. Log attached: &lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/32030/32030_dbg1.tgz&quot; title=&quot;dbg1.tgz attached to LU-11972&quot;&gt;dbg1.tgz&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;</comment>
                            <comment id="242303" author="sebg-crd-pm" created="Wed, 20 Feb 2019 00:37:20 +0000"  >&lt;p&gt;&lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/32030/32030_dbg1.tgz&quot; title=&quot;dbg1.tgz attached to LU-11972&quot;&gt;dbg1.tgz&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;</comment>
                            <comment id="242789" author="sebg-crd-pm" created="Tue, 26 Feb 2019 08:34:12 +0000"  >&lt;p&gt;Any suggestions?&lt;/p&gt;</comment>
                            <comment id="244962" author="sebg-crd-pm" created="Mon, 1 Apr 2019 01:54:18 +0000"  >&lt;p&gt;The issue cannot be reproduced on another server.&lt;/p&gt;

&lt;p&gt;It looks like a hardware issue, so you can close this ticket. Thanks.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="32030" name="dbg1.tgz" size="1093789" author="sebg-crd-pm" created="Wed, 20 Feb 2019 00:37:18 +0000"/>
                            <attachment id="32007" name="ib_err.txt" size="17617" author="sebg-crd-pm" created="Fri, 15 Feb 2019 02:42:35 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i00bq7:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                </customfields>
    </item>
</channel>
</rss>