<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:42:05 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-11230] QIB route to OPA LNet drops / selftest fail</title>
                <link>https://jira.whamcloud.com/browse/LU-11230</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Hi Folks,&lt;/p&gt;

&lt;p&gt;Looking for some assistance on this one. We&apos;re having trouble with reliable LNet routing between Qlogic and OPA clients. Basically, we see long pauses in I/O when moving data between the two fabric types. Testing with lnet_selftest has shown that over many hours, some tests (300-second runs) will randomly fail.&lt;/p&gt;

&lt;p&gt;In recent testing over the last few nights it seems to fail more often when LTO = OPA and LFROM = QIB. As far as I can tell, buffers, lnetctl stats, etc. look fine during transfers; then suddenly msgs_alloc and the /proc/sys/lnet/peers queue drop to zero, right when lnet_selftest starts showing zero-sized transfers.&lt;/p&gt;

&lt;p&gt;For LNet settings: with mismatched settings (i.e. ko2iblnd settings that aren&apos;t the same), LNet router OPA &amp;lt;-&amp;gt; compute/storage node OPA would basically always give me errors. With matched and &apos;intel optimized&apos; settings I&apos;ve not yet seen it fail. Ethernet routing to OPA also seems to work fine.&lt;/p&gt;

&lt;p&gt;We have the QIB nodes&apos; LNet configuration set to the same as the other nodes on the QIB fabric. I&apos;ll attach the config to this ticket in case we have some settings incorrectly applied to one of the IB nets.&lt;/p&gt;
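&lt;p&gt;For reference, a quick sketch of how the settings actually in effect could be compared node-by-node (assuming the ko2iblnd module is loaded; commands are for illustration, not our exact procedure):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# Dump the live ko2iblnd module parameters for side-by-side comparison
grep -H . /sys/module/ko2iblnd/parameters/*
# Show the LNet NI tunables as LNet sees them
lnetctl net show -v
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;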

&lt;p&gt;Are there any special settings we need to apply when trying routing between old &amp;amp; new &apos;Truescale&apos; fabrics? &lt;/p&gt;

&lt;p&gt;Shortened example of failed selftest:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@gstar057 ~]# TM=300 LTO=192.168.44.199@o2ib44 LFROM=192.168.55.77@o2ib /lfs/data0/lst-bench.sh
LST_SESSION = 755
SESSION: lstread FEATURES: 1 TIMEOUT: 300 FORCE: No
192.168.55.77@o2ib are added to session
192.168.44.199@o2ib44 are added to session
Test was added successfully
bulk_read is running now
Capturing statistics for 300 secs [LNet Rates of lfrom]
[R] Avg: 3163     RPC/s Min: 3163     RPC/s Max: 3163     RPC/s
[W] Avg: 1580     RPC/s Min: 1580     RPC/s Max: 1580     RPC/s
[LNet Bandwidth of lfrom]
[R] Avg: 1581.81  MiB/s Min: 1581.81  MiB/s Max: 1581.81  MiB/s
[W] Avg: 0.24     MiB/s Min: 0.24     MiB/s Max: 0.24     MiB/s

etc...

[LNet Bandwidth of lfrom]
[R] Avg: 0.01     MiB/s Min: 0.01     MiB/s Max: 0.01     MiB/s
[W] Avg: 0.01     MiB/s Min: 0.01     MiB/s Max: 0.01     MiB/s
[LNet Rates of lto]
[R] Avg: 0        RPC/s Min: 0        RPC/s Max: 0        RPC/s
[W] Avg: 0        RPC/s Min: 0        RPC/s Max: 0        RPC/s
[LNet Bandwidth of lto]
[R] Avg: 0.00     MiB/s Min: 0.00     MiB/s Max: 0.00     MiB/s
[W] Avg: 0.00     MiB/s Min: 0.00     MiB/s Max: 0.00     MiB/s

lfrom:
12345-192.168.55.77@o2ib: [Session 32 brw errors, 0 ping errors] [RPC: 18 errors, 0 dropped, 94 expired]
Total 1 error nodes in lfrom
lto:
Total 0 error nodes in lto
Batch is stopped
session is ended
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;A bit of help with this would be really appreciated. Let me know which logs would be most helpful; e.g. repeating tests with debug flags enabled can be done if that helps. I could certainly have made a configuration error - if something doesn&apos;t look right with the lnet.conf let me know. We can&apos;t seem to find any ko2iblnd settings that are reliable.&lt;/p&gt;

&lt;p&gt;Cheers,&lt;br/&gt;
Simon&lt;/p&gt;
</description>
                <environment>OPA, lustre 2.10.4, Qlogic QDR IB, Centos 7.5</environment>
        <key id="52936">LU-11230</key>
            <summary>QIB route to OPA LNet drops / selftest fail</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="ashehata">Amir Shehata</assignee>
                                    <reporter username="scadmin">SC Admin</reporter>
                        <labels>
                            <label>LNet</label>
                            <label>lnet-testing</label>
                    </labels>
                <created>Thu, 9 Aug 2018 13:27:58 +0000</created>
                <updated>Mon, 1 Oct 2018 12:19:54 +0000</updated>
                                            <version>Lustre 2.10.4</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>4</watches>
                                                                            <comments>
                            <comment id="231710" author="pjones" created="Thu, 9 Aug 2018 13:44:27 +0000"  >&lt;p&gt;Amir&lt;/p&gt;

&lt;p&gt;Could you please help here?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="232414" author="scadmin" created="Wed, 22 Aug 2018 07:50:45 +0000"  >&lt;p&gt;Hi Guys,&lt;/p&gt;

&lt;p&gt;To update this: I went through all the scenarios, doing a 5 min selftest for each combination of eth/qdr/opa via our routers. This included tests between a node of each fabric type and the router&apos;s respective HCA/NIC, and between nodes on different fabrics. The common factor in each failure event is the Qlogic HCA. We cannot reliably route between Qlogic and Ethernet or OPA. We can route fine between Ethernet and OPA / Ethernet. Failed selftests show up like this in dmesg or the message logs:&lt;/p&gt;

&lt;p&gt;Eg.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;QDR &amp;lt;-&amp;gt; OPA Test 2:
LTO - OPA Compute node
LFROM - Qlogic Compute node

cmdline# TM=300 LTO=192.168.44.199@o2ib44 LFROM=192.168.55.78@o2ib /opt/lustre/bin/lst-bench.sh

..snip.
[LNet Bandwidth of lfrom]
[R] Avg: 0.01     MiB/s Min: 0.01     MiB/s Max: 0.01     MiB/s
[W] Avg: 0.01     MiB/s Min: 0.01     MiB/s Max: 0.01     MiB/s
[LNet Rates of lto]
[R] Avg: 2        RPC/s Min: 2        RPC/s Max: 2        RPC/s
[W] Avg: 2        RPC/s Min: 2        RPC/s Max: 2        RPC/s
[LNet Bandwidth of lto]
[R] Avg: 0.00     MiB/s Min: 0.00     MiB/s Max: 0.00     MiB/s
[W] Avg: 0.00     MiB/s Min: 0.00     MiB/s Max: 0.00     MiB/s

lfrom:
12345-192.168.55.78@o2ib: [Session 32 brw errors, 0 ping errors] [RPC: 1 errors, 0 dropped, 31 expired]
Total 1 error nodes in lfrom
lto:
Total 0 error nodes in lto
Batch is stopped
session is ended
[root@john99 ~]#



LFROM node dmesg:

LustreError: 1512:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-192.168.44.199@o2ib44 failed with -103
LNet: 1514:0:(rpc.c:1069:srpc_client_rpc_expired()) Client RPC expired: service 11, peer 12345-192.168.44.199@o2ib44, timeout 64.
LustreError: 1509:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-192.168.44.199@o2ib44 failed with -110
LustreError: 1510:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-192.168.44.199@o2ib44 failed with -110
LustreError: 1509:0:(brw_test.c:344:brw_client_done_rpc()) Skipped 29 previous similar messages
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Or..&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Eth &amp;lt;-&amp;gt; Qlogic Test 2:
LTO - Qlogic Compute node
LFROM - VM with Mellanox 100G NIC

cmdline# TM=300 LTO=192.168.55.78@o2ib LFROM=10.8.49.155@tcp201 /opt/lustre/bin/lst-bench.sh

..snip.
[LNet Bandwidth of lfrom]
[R] Avg: 0.00     MiB/s Min: 0.00     MiB/s Max: 0.00     MiB/s
[W] Avg: 0.00     MiB/s Min: 0.00     MiB/s Max: 0.00     MiB/s
[LNet Rates of lto]
[R] Avg: 0        RPC/s Min: 0        RPC/s Max: 0        RPC/s
[W] Avg: 0        RPC/s Min: 0        RPC/s Max: 0        RPC/s
[LNet Bandwidth of lto]
[R] Avg: 0.00     MiB/s Min: 0.00     MiB/s Max: 0.00     MiB/s
[W] Avg: 0.00     MiB/s Min: 0.00     MiB/s Max: 0.00     MiB/s

lfrom:
12345-10.8.49.155@tcp201: [Session 32 brw errors, 0 ping errors] [RPC: 0 errors, 0 dropped, 64 expired]
Total 1 error nodes in lfrom
lto:
12345-192.168.55.78@o2ib: [Session 0 brw errors, 0 ping errors] [RPC: 1 errors, 0 dropped, 63 expired]
Total 1 error nodes in lto
Batch is stopped
session is ended
[root@john99 ~]# 


LTO node dmesg:

[Tue Aug 21 22:26:11 2018] LNet: 25532:0:(rpc.c:1069:srpc_client_rpc_expired()) Client RPC expired: service 11, peer 12345-192.168.55.78@o2ib, timeout 64.
[Tue Aug 21 22:26:11 2018] LNet: 25532:0:(rpc.c:1069:srpc_client_rpc_expired()) Skipped 31 previous similar messages
[Tue Aug 21 22:26:11 2018] LustreError: 25512:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-192.168.55.78@o2ib failed with -110
[Tue Aug 21 22:26:11 2018] LustreError: 25512:0:(brw_test.c:344:brw_client_done_rpc()) Skipped 31 previous similar messages
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Summary of passed test:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;QDR &amp;lt;-&amp;gt; QDR Test 1:
LTO - Qlogic Compute node
LFROM - Qlogic Lnet router HCA

QDR &amp;lt;-&amp;gt; QDR Test 2:
LTO - Qlogic Lnet router HCA
LFROM - Qlogic Compute node

OPA &amp;lt;-&amp;gt; OPA Test 1:
LTO - OPA Compute node
LFROM - OPA Lnet router HCA

OPA &amp;lt;-&amp;gt; OPA Test 2:
LTO - OPA Lnet router HCA
LFROM - OPA Compute node

Ethernet &amp;lt;-&amp;gt; Ethernet Test 1:
LTO - VM with Mellanox 100G NIC
LFROM - Lnet router with Mellanox 100G NIC

Ethernet &amp;lt;-&amp;gt; Ethernet Test 2:
LTO - Lnet router with Mellanox 100G NIC
LFROM - VM with Mellanox 100G NIC

QDR &amp;lt;-&amp;gt; OPA Test 1:
LTO - Qlogic Compute node
LFROM - OPA Compute node

Eth &amp;lt;-&amp;gt; OPA Test 1:
LTO - VM with Mellanox 100G NIC
LFROM - OPA Compute node

Eth &amp;lt;-&amp;gt; OPA Test 2:
LTO - VM with Mellanox 100G NIC
LFROM - OPA Compute node
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Summary of failed tests:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;QDR &amp;lt;-&amp;gt; OPA Test 2:
LTO - OPA Compute node
LFROM - Qlogic Compute node

Eth &amp;lt;-&amp;gt; Qlogic Test 1:
LTO - VM with Mellanox 100G NIC
LFROM - Qlogic Compute node

Eth &amp;lt;-&amp;gt; Qlogic Test 2:
LTO - Qlogic Compute node
LFROM - VM with Mellanox 100G NIC
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;I modified one of our compute nodes today and re-configured the Qlogic HCA&apos;s on that node (as well as the Qlogic HCA on the router). Running either of the following lnetctl net configurations for the Qlogic HCA showed the same failed results as above. Selftests within Qlogic only, on either of these configs, work without fail; the problems are only between Qlogic and some other fabric type.&lt;/p&gt;

&lt;p&gt;Config 1:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;    - net type: o2ib
      local NI(s):
        - nid: 192.168.55.231@o2ib
          status: up
          interfaces:
              0: ib0
          tunables:
              peer_timeout: 180
              peer_credits: 128
              peer_buffer_credits: 0
              credits: 1024
          lnd tunables:
              peercredits_hiw: 64
              map_on_demand: 32
              concurrent_sends: 256
              fmr_pool_size: 2048
              fmr_flush_trigger: 512
              fmr_cache: 1
              ntx: 2048
              conns_per_peer: 4
          tcp bonding: 0
          dev cpt: 1
          CPT: &quot;[0,1]&quot;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Config 2:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;    - net type: o2ib
      local NI(s):
        - nid: 192.168.55.231@o2ib
          status: up
          interfaces:
              0: ib0
          tunables:
              peer_timeout: 180
              peer_credits: 8
              peer_buffer_credits: 0
              credits: 256
          lnd tunables:
              peercredits_hiw: 4
              map_on_demand: 0
              concurrent_sends: 8
              fmr_pool_size: 512
              fmr_flush_trigger: 384
              fmr_cache: 1
              ntx: 512
              conns_per_peer: 1
          tcp bonding: 0
          dev cpt: 1
          CPT: &quot;[0,1]&quot;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/30916/30916_lnet-tests-21_aug_2018.txt&quot; title=&quot;lnet-tests-21_aug_2018.txt attached to LU-11230&quot;&gt;lnet-tests-21_aug_2018.txt&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Any thoughts on what we should be looking at?&lt;/p&gt;

&lt;p&gt;Cheers,&lt;br/&gt;
 Simon&lt;/p&gt;</comment>
                            <comment id="232519" author="ashehata" created="Thu, 23 Aug 2018 17:02:05 +0000"  >&lt;p&gt;Hi Simon,&lt;/p&gt;

&lt;p&gt;If you can get me the following info that would be great:&lt;/p&gt;
&lt;ol&gt;
	&lt;li&gt;Configuration from OPA node, router node and QLogic node (lnetctl export &amp;gt; config.yaml). Would be great if each one is in a separate file.&lt;/li&gt;
	&lt;li&gt;Are you able to ping from the OPA -&amp;gt; QLOGIC and from QLOGIC -&amp;gt; OPA with no problem? (lnetctl ping &amp;lt;NID&amp;gt;). If you&apos;re encountering a failure with simple ping, let&apos;s turn on and capture the logging: lctl set_param debug=+&quot;net neterror&quot; THEN run ping test THEN lctl dk &amp;gt; log.dk.&lt;/li&gt;
	&lt;li&gt;If the problem is not reproducible via ping, then turn on debugging as above, run a short selftest (which would contain errors), and then capture the logging.&lt;/li&gt;
&lt;/ol&gt;
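&lt;p&gt;(As a concrete sketch of the capture sequence in step 2; the NIDs here are examples taken from the failing test, stand them in for &amp;lt;NID&amp;gt; as appropriate:)&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# Enable net/neterror debug logging
lctl set_param debug=+&quot;net neterror&quot;
# Ping in both directions (example NIDs from the failing test)
lnetctl ping 192.168.44.199@o2ib44
lnetctl ping 192.168.55.77@o2ib
# Dump the kernel debug log for attachment to the ticket
lctl dk &gt; log.dk
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;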


&lt;p&gt;thanks&lt;/p&gt;

&lt;p&gt;amir&lt;/p&gt;</comment>
                            <comment id="232547" author="scadmin" created="Thu, 23 Aug 2018 23:15:03 +0000"  >&lt;p&gt;Hi Amir,&lt;/p&gt;

&lt;p&gt;I should add: there are no issues we can see with routes being marked down on either side, or with lctl pings failing.&#160; In general, everything appears OK. I wasn&apos;t sure if a really short test would capture it, so I ran the standard 5 min test, which failed maybe 30 seconds to a minute in. I&apos;ve attached the three configs and the dk log as requested.&lt;/p&gt;

&lt;p&gt;Cheers,&lt;/p&gt;

&lt;p&gt;Simon&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;</comment>
                            <comment id="232614" author="ashehata" created="Mon, 27 Aug 2018 06:15:45 +0000"  >&lt;p&gt;Hi Simon,&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
peer:
    - primary nid: 192.168.44.21@o2ib44
      Multi-Rail: False
      peer ni:
        - nid: 192.168.44.21@o2ib44
          min_tx_credits: -4815
    - primary nid: 192.168.44.22@o2ib44
      Multi-Rail: False
      peer ni:
        - nid: 192.168.44.22@o2ib44
          min_tx_credits: -4868
    - primary nid: 192.168.44.51@o2ib44
      Multi-Rail: False
      peer ni:
        - nid: 192.168.44.51@o2ib44
          state: NA
          min_tx_credits: -10849
    - primary nid: 192.168.44.52@o2ib44
      Multi-Rail: False
      peer ni:
        - nid: 192.168.44.52@o2ib44
          min_tx_credits: -12366
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The above is from the export-opa config file. The min tx credits are quite low, which indicates a lot of queuing is happening on these peers. Are these peers relevant to the test you&apos;re running? They appear to be on the OPA network (o2ib44).&lt;/p&gt;

&lt;p&gt;I didn&apos;t see any relevant errors in the log file you sent me. Are there any other errors in /var/log/messages besides the ones you pasted?&lt;/p&gt;

&lt;p&gt;Would you also be able to share the lnet-selftest script you&apos;re using? &lt;/p&gt;

&lt;p&gt;Also for the QIB I see that you tried both of these configs:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
              peercredits_hiw: 64
              map_on_demand: 32
              concurrent_sends: 256
              fmr_pool_size: 2048
              fmr_flush_trigger: 512
              fmr_cache: 1
              ntx: 2048
              conns_per_peer: 4 &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;and&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
              peercredits_hiw: 64
              map_on_demand: 0
              concurrent_sends: 256
              fmr_pool_size: 2048
              fmr_flush_trigger: 512
              fmr_cache: 1
              ntx: 2048
              conns_per_peer: 1&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;If you run lnet_selftest from the router to the QLOGIC node, do you get any errors? I&apos;m trying to see if the problem is restricted between the router under test and the node.&lt;/p&gt;

&lt;p&gt;My preference, though, is to stick with conns_per_peer: 1 for QLOGIC; conns_per_peer: 4 was intended for OPA interfaces only.&lt;/p&gt;

&lt;p&gt;Finally, would we be able to setup a live debug session?&lt;/p&gt;

&lt;p&gt;thanks&lt;/p&gt;

&lt;p&gt;amir&lt;/p&gt;</comment>
                            <comment id="232637" author="scadmin" created="Tue, 28 Aug 2018 03:17:57 +0000"  >&lt;p&gt;Hi Amir,&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&amp;gt; The above is from the export-opa config file. The min tx credits are quite low, which indicates a lot of queuing is happening on these peers. Are these peers relevant to the test you&apos;re running? They appear to be on the OPA network (o2ib44).&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;These peers are not relevant for the purposes of the lnet_selftest (from my understanding). They are, however, important for actual file transfers, which is why we&apos;re going back to basic lnet_selftests to verify the network between fabrics.&lt;/p&gt;

&lt;p&gt;The below peers are (respectively) MDS1, MDS2, OSS1 for home &amp;amp; apps etc., and OSS2 for home &amp;amp; apps etc. There are another 8 OSSs for the main large filesystem too, not mentioned here, which use the IPs 192.168.44.13&amp;#91;1-8&amp;#93;@o2ib44:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;peer:
 - primary nid: 192.168.44.21@o2ib44
 - primary nid: 192.168.44.22@o2ib44
 - primary nid: 192.168.44.51@o2ib44
 - primary nid: 192.168.44.52@o2ib44
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;



&lt;p&gt;&lt;em&gt;&amp;gt; I didn&apos;t see any relevant errors in the log file you sent me. Are there any other errors in /var/log/messages besides the ones you pasted?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Yeah, dmesg and /var/log/messages are really light on errors. The only errors that appeared during the test period were what I pasted in, e.g. the &quot;failed with -103&quot; and &quot;failed with -110&quot; examples.&lt;/p&gt;


&lt;p&gt;&lt;em&gt;&amp;gt; Would you also be able to share the lnet-selftest script you&apos;re using?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Yup. It&apos;s a pretty standard one:&#160;&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
#!/bin/sh
#
# Simple wrapper script for LNET Selftest
#

# Parameters are supplied as environment variables.
# The defaults are reasonable for quick verification.
# For in-depth benchmarking, increase the time (TM)
# variable to e.g. 60 seconds, and iterate over
# concurrency to find optimal values.
#
# Reference: http://wiki.lustre.org/LNET_Selftest

# Concurrency
CN=${CN:-32}
# Size
SZ=${SZ:-1M}
# Length of time to run test (secs)
TM=${TM:-10}
# Which BRW test to run (read or write)
BRW=${BRW:-&quot;read&quot;}
# Checksum calculation (simple or full)
CKSUM=${CKSUM:-&quot;simple&quot;}

# The LST &quot;from&quot; list -- e.g. Lustre clients. Space separated list of NIDs.
# LFROM=&quot;10.10.2.21@tcp&quot;
LFROM=${LFROM:?ERROR: the LFROM variable is not set}
# The LST &quot;to&quot; list -- e.g. Lustre servers. Space separated list of NIDs.
# LTO=&quot;10.10.2.22@tcp&quot;
LTO=${LTO:?ERROR: the LTO variable is not set}

### End of customisation.

export LST_SESSION=$$
echo LST_SESSION = ${LST_SESSION}
lst new_session lst${BRW}
lst add_group lfrom ${LFROM}
lst add_group lto ${LTO}
lst add_batch bulk_${BRW}
lst add_test --batch bulk_${BRW} --from lfrom --to lto brw ${BRW} \
  --concurrency=${CN} check=${CKSUM} size=${SZ}
lst run bulk_${BRW}
echo -n &quot;Capturing statistics for ${TM} secs &quot;
lst stat lfrom lto &amp;
LSTPID=$!
# Delay loop with interval markers displayed every 5 secs.
# Test time is rounded up to the nearest 5 seconds.
i=1
j=$((${TM}/5))
if [ $((${TM}%5)) -ne 0 ]; then let j++; fi
while [ $i -le $j ]; do
  sleep 5
  let i++
done
kill ${LSTPID} &amp;&amp; wait ${LSTPID} &gt;/dev/null 2&gt;&amp;1
echo
lst show_error lfrom lto
lst stop bulk_${BRW}
lst end_session
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;&amp;gt; If you run lnet_selftest from the router to the QLOGIC node, do you get any errors? I&apos;m trying to see if the problem is restricted between the router under test and the node.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In my testing I found that a Qlogic compute node to the Qlogic interface on the lnet router proved to be working reliably. The same goes for OPA compute nodes to the OPA interface on the lnet router - they worked just fine. In both cases though (now this is testing my memory!), if I had mismatched the ko2iblnd settings between a compute node&apos;s and the router&apos;s respective fabric interfaces then I would get issues (depending on which settings were mismatched), but having them matched works just fine.&lt;/p&gt;

&lt;p&gt;Apart from the two Qlogic configs you just mentioned, I&apos;d also tested this configuration, which also gave poor results when routing between fabric types. This is &lt;em&gt;actually&lt;/em&gt; our current lnet setup on all Qlogic compute nodes, with the exception of my test host / lnet router, where I&apos;ve been going through changing the parameters to try and figure this all out:&lt;/p&gt;


&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;    - net type: o2ib
      local NI(s):
        - nid: 192.168.55.75@o2ib
          status: up
          interfaces:
              0: ib0
          tunables:
              peer_timeout: 180
              peer_credits: 8
              peer_buffer_credits: 0
              credits: 256
          lnd tunables:
              peercredits_hiw: 4
              map_on_demand: 0
              concurrent_sends: 8
              fmr_pool_size: 512
              fmr_flush_trigger: 384
              fmr_cache: 1
              ntx: 512
              conns_per_peer: 1
          tcp bonding: 0
          dev cpt: 1
          CPT: &quot;[0,1]&quot;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;


&lt;p&gt;&lt;em&gt;&amp;gt; Finally, would we be able to setup a live debug session?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Not a problem at all. We&apos;re on the east coast of Australia; I can set up a live session to help debug this if you want to pick a time that suits us both.&lt;/p&gt;

&lt;p&gt;Cheers,&lt;br/&gt;
Simon&lt;/p&gt;



</comment>
                            <comment id="232773" author="ashehata" created="Wed, 29 Aug 2018 21:49:12 +0000"  >&lt;p&gt;Does 4pm PST, 9AM (your time) work? If so, let me know the date that works for you. Would need to be able to share screens or something of that sort to debug further.&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;</comment>
                            <comment id="233036" author="scadmin" created="Wed, 5 Sep 2018 06:30:15 +0000"  >&lt;p&gt;Hi Amir,&lt;/p&gt;

&lt;p&gt;Yep. That time will work. I&apos;ll email you through some details for a meeting with prospective dates.&lt;/p&gt;

&lt;p&gt;Cheers,&lt;br/&gt;
Simon&lt;/p&gt;</comment>
                            <comment id="233573" author="pjones" created="Sat, 15 Sep 2018 12:02:22 +0000"  >&lt;p&gt;Has this proposed meeting taken place yet?&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="30931" name="dk.log.xz" size="6536644" author="scadmin" created="Thu, 23 Aug 2018 23:14:52 +0000"/>
                            <attachment id="30916" name="lnet-tests-21_aug_2018.txt" size="12344" author="scadmin" created="Wed, 22 Aug 2018 07:46:33 +0000"/>
                            <attachment id="30719" name="lnet.conf" size="2403" author="scadmin" created="Thu, 9 Aug 2018 13:14:34 +0000"/>
                            <attachment id="30933" name="lnetctl_export_lnet-router.txt" size="16576" author="scadmin" created="Thu, 23 Aug 2018 23:14:51 +0000"/>
                            <attachment id="30934" name="lnetctl_export_opa.txt" size="9977" author="scadmin" created="Thu, 23 Aug 2018 23:14:51 +0000"/>
                            <attachment id="30932" name="lnetctl_export_qlogic.txt" size="3621" author="scadmin" created="Thu, 23 Aug 2018 23:14:51 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10040" key="com.atlassian.jira.plugin.system.customfieldtypes:labels">
                        <customfieldname>Epic</customfieldname>
                        <customfieldvalues>
                                        <label>lnet</label>
            <label>performance</label>
    
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10030" key="com.atlassian.jira.plugin.system.customfieldtypes:labels">
                        <customfieldname>Epic/Theme</customfieldname>
                        <customfieldvalues>
                                        <label>lnet</label>
    
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i000if:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>