<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:19:08 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-15534] failed to ping 172.19.1.27@o2ib100: Input/output error</title>
                <link>https://jira.whamcloud.com/browse/LU-15534</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Since installing TOSS 4.3-X (a RHEL 8.5 derivative) on our Lustre servers, we&apos;ve had an issue with LNet.&lt;/p&gt;

&lt;p&gt;lnetctl ping between nodes fails when InfiniBand is the underlying network. There is no indication of a problem with IB itself:&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;&quot;ping&quot; (the unix utility) between the two nodes via IPoIB is successful, in either direction&lt;/li&gt;
	&lt;li&gt;ib_write_bw between the two nodes via the IB network is successful, in either direction&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;When LNet starts, it begins reporting the following on the console:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;LNetError: 24429:0:(lib-move.c:3756:lnet_handle_recovery_reply()) peer NI (172.19.1.54@o2ib100) recovery failed with -110
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Eventually, we see the following on the console:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;INFO: task kworker/u128:2:5350 blocked for more than 120 seconds.
&#160; &#160; &#160; Tainted: P &#160; &#160; &#160; &#160; &#160; OE&#160; &#160; --------- -t - 4.18.0-348.7.1.1toss.t4.x86_64 #1
&quot;echo 0 &amp;gt; /proc/sys/kernel/hung_task_timeout_secs&quot; disables this message.
task:kworker/u128:2&#160; state:D stack:&#160; &#160; 0 pid: 5350 ppid: &#160; &#160; 2 flags:0x80004080
Workqueue: rdma_cm cma_work_handler [rdma_cm]
Call Trace:
 __schedule+0x2c0/0x770
 schedule+0x4c/0xc0
 schedule_preempt_disabled+0x11/0x20
 __mutex_lock.isra.6+0x343/0x550
 rdma_connect+0x1e/0x40 [rdma_cm]
 kiblnd_cm_callback+0x14ee/0x2230 [ko2iblnd]
 ? __switch_to_asm+0x41/0x70
 cma_cm_event_handler+0x25/0xf0 [rdma_cm]
 cma_work_handler+0x5a/0xb0 [rdma_cm]
 process_one_work+0x1ae/0x3a0
 worker_thread+0x3c/0x3c0
 ? create_worker+0x1a0/0x1a0
 kthread+0x12f/0x150
 ? kthread_flush_work_fn+0x10/0x10
 ret_from_fork+0x1f/0x40 &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</description>
                <environment>TOSS 4.3 (based on RHEL 8.5)&lt;br/&gt;
4.18.0-348.7.1.1toss.t4.x86_64&lt;br/&gt;
lustre 2.14.0_10.llnl&lt;br/&gt;
</environment>
        <key id="68540">LU-15534</key>
            <summary>failed to ping 172.19.1.27@o2ib100: Input/output error</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="6" iconUrl="https://jira.whamcloud.com/images/icons/statuses/closed.png" description="The issue is considered finished, the resolution is correct. Issues which are closed can be reopened.">Closed</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="ssmirnov">Serguei Smirnov</assignee>
                                    <reporter username="defazio">Gian-Carlo Defazio</reporter>
                        <labels>
                            <label>llnl</label>
                    </labels>
                <created>Tue, 8 Feb 2022 00:33:55 +0000</created>
                <updated>Fri, 11 Feb 2022 23:27:06 +0000</updated>
                            <resolved>Thu, 10 Feb 2022 23:30:33 +0000</resolved>
                                    <version>Lustre 2.14.0</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>4</watches>
                                                                            <comments>
                            <comment id="325511" author="defazio" created="Tue, 8 Feb 2022 00:41:19 +0000"  >&lt;p&gt;For my notes the local ticket is at &lt;a href=&quot;https://lc.llnl.gov/jira/browse/TOSS-5521&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://lc.llnl.gov/jira/browse/TOSS-5521&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="325514" author="defazio" created="Tue, 8 Feb 2022 00:52:31 +0000"  >&lt;p&gt;So far we&apos;ve seen this issue only with the RHEL 8.5 kernel and lustre 2.14.&lt;/p&gt;

&lt;p&gt;The previous version of TOSS, TOSS 4.2-4 is based on RHEL 8.4 and doesn&apos;t have this issue. We also haven&apos;t seen it on any TOSS 3 systems which are based on RHEL 7.X and running lustre 2.12 or 2.10.&lt;/p&gt;</comment>
                            <comment id="325515" author="defazio" created="Tue, 8 Feb 2022 01:08:29 +0000"  >&lt;p&gt;This was first noticed on our new storage hardware, which includes the garter cluster.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@garter1:~]# ibstat
CA &apos;mlx5_0&apos;
&#160; &#160; &#160; &#160; CA type: MT4119
&#160; &#160; &#160; &#160; Number of ports: 1
&#160; &#160; &#160; &#160; Firmware version: 16.31.1014
&#160; &#160; &#160; &#160; Hardware version: 0
&#160; &#160; &#160; &#160; Node GUID: 0x0c42a103008ee90a
&#160; &#160; &#160; &#160; System image GUID: 0x0c42a103008ee90a
&#160; &#160; &#160; &#160; Port 1:
&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; State: Active
&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; Physical state: LinkUp
&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; Rate: 100
&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; Base lid: 391
&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; LMC: 0
&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; SM lid: 363
&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; Capability mask: 0x2659e848
&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; Port GUID: 0x0c42a103008ee90a
&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; Link layer: InfiniBand
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The subnet manager listed, orelic1, is correct.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@garter1:~]# ibnetdiscover | grep &quot;lid 363&quot;
[2] &#160; &#160; &quot;H-506b4b0300da6764&quot;[1](506b4b0300da6764) &#160; &#160; &#160; &#160; &#160; &#160; &#160; # &quot;orelic1 mlx5_0&quot; lid 363 4xEDR
[1](506b4b0300da6764) &#160; &quot;S-248a0703006d13c0&quot;[2] &#160; &#160; &#160; &#160; # lid 363 lmc 0 &quot;SwitchIB Mellanox Technologies&quot; lid 352 4xEDR
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The issue is also present on the boa cluster, which has the same hardware as garter.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@boai:defazio1]# ibstat
CA &apos;mlx5_0&apos;
&#160; &#160; &#160; &#160; CA type: MT4119
&#160; &#160; &#160; &#160; Number of ports: 1
&#160; &#160; &#160; &#160; Firmware version: 16.31.1014
&#160; &#160; &#160; &#160; Hardware version: 0
&#160; &#160; &#160; &#160; Node GUID: 0x0c42a10300dace36
&#160; &#160; &#160; &#160; System image GUID: 0x0c42a10300dace36
&#160; &#160; &#160; &#160; Port 1:
&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; State: Active
&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; Physical state: LinkUp
&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; Rate: 100
&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; Base lid: 228
&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; LMC: 0
&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; SM lid: 5
&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; Capability mask: 0x2659e848
&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; Port GUID: 0x0c42a10300dace36
&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; Link layer: InfiniBand
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;It also has the correct subnet manager, zrelic1.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@boai:defazio1]# ibnetdiscover | grep &quot;lid 5 &quot;
[5] &#160; &#160; &quot;H-7cfe9003000f382e&quot;[1](7cfe9003000f382e) &#160; &#160; &#160; &#160; &#160; &#160; &#160; # &quot;zrelic1 mlx5_0&quot; lid 5 4xEDR
[1](7cfe9003000f382e) &#160; &quot;S-7cfe900300b67590&quot;[5] &#160; &#160; &#160; &#160; # lid 5 lmc 0 &quot;SwitchIB Mellanox Technologies&quot; lid 23 4xEDR
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;An older cluster, slag, has the same issue as well.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@slag3:~]# ibstat
CA &apos;mlx5_0&apos;
&#160; &#160; &#160; &#160; CA type: MT4115
&#160; &#160; &#160; &#160; Number of ports: 1
&#160; &#160; &#160; &#160; Firmware version: 12.28.2006
&#160; &#160; &#160; &#160; Hardware version: 0
&#160; &#160; &#160; &#160; Node GUID: 0x506b4b0300c23712
&#160; &#160; &#160; &#160; System image GUID: 0x506b4b0300c23712
&#160; &#160; &#160; &#160; Port 1:
&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; State: Active
&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; Physical state: LinkUp
&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; Rate: 100
&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; Base lid: 359
&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; LMC: 0
&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; SM lid: 363
&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; Capability mask: 0x2659e848
&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; Port GUID: 0x506b4b0300c23712
&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; Link layer: InfiniBand
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="325516" author="ofaaland" created="Tue, 8 Feb 2022 01:56:32 +0000"  >&lt;p&gt;We do &lt;em&gt;not&lt;/em&gt; see this issue with &lt;br/&gt;
RHEL 8.5&lt;br/&gt;
kernel 4.18.0-348.7.1.1toss.t4.x86_64&lt;br/&gt;
lustre-2.12.8_1.llnl-1.t4.x86_64&lt;/p&gt;</comment>
                            <comment id="325581" author="pjones" created="Tue, 8 Feb 2022 15:17:44 +0000"  >&lt;p&gt;Serguei&lt;/p&gt;

&lt;p&gt;Could you please assist with this one?&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="325619" author="ssmirnov" created="Tue, 8 Feb 2022 18:11:21 +0000"  >&lt;p&gt;Hi,&lt;/p&gt;

&lt;p&gt;Could you please provide net debug for the failing ping test?&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
lctl set_param debug=+net
&amp;lt;--- run test ---&amp;gt;
lctl dk &amp;gt; log.txt
lctl set_param debug=-net&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Also, could you please provide the configuration script log?&lt;/p&gt;

&lt;p&gt;Thanks,&lt;/p&gt;

&lt;p&gt;Serguei.&lt;/p&gt;</comment>
                            <comment id="325628" author="defazio" created="Tue, 8 Feb 2022 19:18:25 +0000"  >&lt;p&gt;Hi Serguei,&lt;/p&gt;

&lt;p&gt;I&apos;ve uploaded some files.&lt;/p&gt;

&lt;p&gt;2.14.0_10.llnl-x86_64-build.log.gz is the full build log; 2.14.0_10.llnl-x86_64-config.log.gz is from the same build, trimmed to just the configure portion.&lt;/p&gt;

&lt;p&gt;I ran an lnetctl ping from garter5 (172.19.1.137@o2ib100) to garter6 (172.19.1.138@o2ib100). The debug logs from both nodes are in garter5_ping-send_2022-02-08_10-53-41 and garter6_ping-receive_2022-02-08_10-53-48.&lt;/p&gt;</comment>
                            <comment id="325641" author="ssmirnov" created="Tue, 8 Feb 2022 22:23:46 +0000"  >&lt;p&gt;This looks very similar to &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14488&quot; title=&quot;Support rdma_connect_locked()&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14488&quot;&gt;&lt;del&gt;LU-14488&lt;/del&gt;&lt;/a&gt;, the fix for which appears in 2.12.7.&lt;/p&gt;

&lt;p&gt;Which MOFED version are you using?&lt;/p&gt;

&lt;p&gt;Thanks,&lt;/p&gt;

&lt;p&gt;Serguei&lt;/p&gt;</comment>
                            <comment id="325646" author="defazio" created="Tue, 8 Feb 2022 23:31:25 +0000"  >&lt;p&gt;We are using OFED.&lt;/p&gt;</comment>
                            <comment id="325649" author="defazio" created="Wed, 9 Feb 2022 00:01:27 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14488&quot; title=&quot;Support rdma_connect_locked()&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14488&quot;&gt;&lt;del&gt;LU-14488&lt;/del&gt;&lt;/a&gt; looks promising.&lt;/p&gt;

&lt;p&gt;Looking at the source for the 2 kernels, I do not see rdma_connect_locked() in the 4.18.0-305.19.1.el8_4 kernel used to build TOSS 4.2-4, but I do see it in the 4.18.0-348.2.1.el8_5 kernel used to build TOSS 4.3-1.&lt;/p&gt;</comment>
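<!--
The header check described in the comment above (whether a kernel source tree
provides rdma_connect_locked()) can be sketched as follows. This is a minimal,
hypothetical illustration and not the method used in the ticket: the function
names declares_symbol and tree_declares_symbol are assumptions introduced here,
and the regex only looks for "symbol(" text, so it cannot tell a declaration
from a call or a mention inside a C comment.

```python
import re
from pathlib import Path


def declares_symbol(header_text: str, symbol: str) -> bool:
    """Return True if header_text contains `symbol` followed by '('.

    The word boundary keeps rdma_connect from matching inside a longer
    identifier, and requiring '(' keeps rdma_connect from matching
    rdma_connect_locked.
    """
    pattern = re.compile(r"\b" + re.escape(symbol) + r"\s*\(")
    return pattern.search(header_text) is not None


def tree_declares_symbol(include_dir: str, symbol: str) -> bool:
    """Scan every .h file under include_dir (a hypothetical path) for symbol."""
    for header in Path(include_dir).rglob("*.h"):
        text = header.read_text(errors="ignore")
        if declares_symbol(text, symbol):
            return True
    return False
```

A possible use, assuming the headers for each kernel are unpacked locally, is
to call tree_declares_symbol(path_to_kernel_include_dir, "rdma_connect_locked")
once per kernel and compare the two results.
-->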
                            <comment id="325956" author="defazio" created="Thu, 10 Feb 2022 23:28:07 +0000"  >&lt;p&gt;Applying &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14488&quot; title=&quot;Support rdma_connect_locked()&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14488&quot;&gt;&lt;del&gt;LU-14488&lt;/del&gt;&lt;/a&gt; to our local 2.14 branch solved the issue, so this does appear to be the same problem as &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14488&quot; title=&quot;Support rdma_connect_locked()&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14488&quot;&gt;&lt;del&gt;LU-14488&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Thanks!&lt;/p&gt;</comment>
                            <comment id="325957" author="defazio" created="Thu, 10 Feb 2022 23:30:33 +0000"  >&lt;p&gt;The issue was fixed by an existing patch that was landed for 2.15.&lt;/p&gt;

&lt;p&gt;Our local 2.12 branch already had the b2_12 backport of the patch and never experienced this issue.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="42263" name="2.14.0_10.llnl-x86_64-build.log.gz" size="61858" author="defazio" created="Tue, 8 Feb 2022 19:13:41 +0000"/>
                            <attachment id="42264" name="2.14.0_10.llnl-x86_64-config.log.gz" size="7678" author="defazio" created="Tue, 8 Feb 2022 19:13:50 +0000"/>
                            <attachment id="42261" name="garter5_ping-send_2022-02-08_10-53-41" size="316422" author="defazio" created="Tue, 8 Feb 2022 18:59:00 +0000"/>
                            <attachment id="42262" name="garter6_ping-receive_2022-02-08_10-53-48" size="270809" author="defazio" created="Tue, 8 Feb 2022 18:59:06 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i02hkn:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>