<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:47:23 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-11840] Multi rail dynamic discovery prevent mounting filesystem when some NIC is unreachable</title>
                <link>https://jira.whamcloud.com/browse/LU-11840</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;In recent Lustre releases, some filesystems cannot be mounted because of a communication error between clients and servers, depending on the LNet configuration.&lt;/p&gt;

&lt;p&gt;Consider a filesystem running on a host with 2 interfaces, say tcp0 and tcp1, where the devices are set up to reply on both interfaces (formatted with --servicenode IP1@tcp0,IP2@tcp1).&lt;/p&gt;

&lt;p&gt;If a client is connected only to tcp0 and tries to mount this filesystem, the mount fails with an I/O error because the client tries to connect using the tcp1 interface.&lt;/p&gt;

&lt;p&gt;Mount failed:&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# mount -t lustre x.y.z.a@tcp:/lustre /mnt/lustre
mount.lustre: mount x.y.z.a@tcp:/lustre at /mnt/client failed: Input/output error
Is the MGS running?
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;dmesg shows that communication fails because the wrong IP is used:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[422880.743179] LNetError: 19787:0:(lib-move.c:1714:lnet_select_pathway()) no route to a.b.c.d@tcp1&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
# lnetctl peer show
peer:
    - primary nid: a.b.c.d@tcp1
      Multi-Rail: False
      peer ni:
        - nid: x.y.z.a@tcp
          state: NA
        - nid: 0@&amp;lt;0:0&amp;gt;
          state:&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Ping is OK though:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# lctl ping x.y.z.a@tcp
12345-0@lo
12345-a.b.c.d@tcp1
12345-x.y.z.a@tcp&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;This was tested with 2.10.5 and 2.12 as server versions and 2.10, 2.11 and 2.12 as client versions.&lt;/p&gt;

&lt;p&gt;Only the 2.10 client is able to mount the filesystem properly with this configuration.&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;I git-bisected the regression down to &lt;tt&gt;0f1aaad &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-9480&quot; title=&quot;LNet Dynamic Discovery&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-9480&quot;&gt;&lt;del&gt;LU-9480&lt;/del&gt;&lt;/a&gt; lnet: implement Peer Discovery&lt;/tt&gt;&lt;/p&gt;

&lt;p&gt;Looking at the debug log, the client:&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;sets up the peer with the proper NI&lt;/li&gt;
	&lt;li&gt;then pings the peer&lt;/li&gt;
	&lt;li&gt;updates the local peer info with the wrong NI coming from the ping reply&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;The data in the reply seems to announce the tcp1 IP as the primary NID.&lt;/p&gt;

&lt;p&gt;The client will then use this NI to contact the server even if it has no direct connection to it (tcp1) and has a correct one for the same peer (tcp0).&lt;/p&gt;</description>
                <environment></environment>
        <key id="54452">LU-11840</key>
            <summary>Multi rail dynamic discovery prevent mounting filesystem when some NIC is unreachable</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="ashehata">Amir Shehata</assignee>
                                    <reporter username="degremoa">Aurelien Degremont</reporter>
                        <labels>
                    </labels>
                <created>Tue, 8 Jan 2019 13:50:33 +0000</created>
                <updated>Wed, 25 Nov 2020 15:03:48 +0000</updated>
                                            <version>Lustre 2.11.0</version>
                    <version>Lustre 2.12.0</version>
                                                        <due></due>
                            <votes>1</votes>
                                    <watches>17</watches>
                                                                            <comments>
                            <comment id="239545" author="degremoa" created="Tue, 8 Jan 2019 16:34:39 +0000"  >&lt;p&gt;I made more tests.&lt;/p&gt;

&lt;p&gt;2.12 client and 2.12 server: OK&lt;/p&gt;

&lt;p&gt;2.10 client and 2.10 server: OK&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;2.12 client with 2.10.5 server: broken&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;It looks like this is related to the MR-capable/discovery-capable feature.&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;</comment>
                            <comment id="239627" author="ashehata" created="Wed, 9 Jan 2019 19:23:32 +0000"  >&lt;p&gt;Aurelien, do you see this with 2.11-&amp;gt; 2.10.5?&lt;/p&gt;

&lt;p&gt;I suspect a timeout length issue. If you could verify the above, that would confirm it for me.&lt;/p&gt;</comment>
                            <comment id="239732" author="degremoa" created="Thu, 10 Jan 2019 10:11:13 +0000"  >&lt;p&gt;2.11 client and 2.10.5 server: broken&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;Moreover, I can work around the problem if I add the peer first:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;lnetctl peer add --prim_nid x.y.z.a@tcp
mount -t lustre x.y.z.a@tcp:/fsx /mnt/client/&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Timeout is a good lead as it seems to be what I see in the logs.&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;Looks like if the server is 2.12, all clients (2.10, 2.11 and 2.12) successfully mount the FS.&lt;/p&gt;</comment>
                            <comment id="239818" author="degremoa" created="Fri, 11 Jan 2019 14:38:34 +0000"  >&lt;p&gt;If you have some understanding of the problem and leads or ideas I can follow or dig into, let me know.&lt;/p&gt;</comment>
                            <comment id="239831" author="ashehata" created="Fri, 11 Jan 2019 17:49:06 +0000"  >&lt;p&gt;It seems like there is a misunderstanding around the primary_nid. So when you add the peer explicitly it works, but if you don&apos;t, then discovery depends on the list of NIDs which come back in the ping response. The first NID in that list is considered the primary_nid. Based on this:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
# lctl ping x.y.z.a@tcp
12345-0@lo
12345-a.b.c.d@tcp1
12345-x.y.z.a@tcp&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The tcp1 interface is first. So LNet thinks it&apos;s the primary NID of the node and tries to send messages to it, but there are no routes.&lt;/p&gt;

&lt;p&gt;If you set up the server in such a way that the tcp interface is first, do you resolve the problem?&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
# lctl ping x.y.z.a@tcp
12345-0@lo
12345-x.y.z.a@tcp 
12345-a.b.c.d@tcp1
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="239834" author="degremoa" created="Fri, 11 Jan 2019 18:15:14 +0000"  >&lt;p&gt;Wooh! It works! Thanks a lot! I will work on a workaround based on that.&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;But I thought, especially since MR was introduced, that LNet would select the best available route. In this case, even if the primary NID is not reachable, there is another one which is. Did I misunderstand the MR features?&lt;/p&gt;</comment>
                            <comment id="239836" author="ashehata" created="Fri, 11 Jan 2019 18:49:49 +0000"  >&lt;p&gt;LNet Health, which is in 2.12, should be able to do that. You&apos;ll need to enable it.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://build.whamcloud.com/job/lustre-manual/lastSuccessfulBuild/artifact/lustre_manual.xhtml#dbdoclet.mrhealth&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://build.whamcloud.com/job/lustre-manual/lastSuccessfulBuild/artifact/lustre_manual.xhtml#dbdoclet.mrhealth&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Although, I&apos;m a bit confused by the test matrix you outlined. When you tested 2.12-&amp;gt;2.12 it should still be the same problem. But you&apos;re saying it works. When you have 2.12 on the servers, was the order of the NIDs still the same? Or in these test runs the order of the NIDs was:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
 # lctl ping x.y.z.a@tcp
12345-0@lo
12345-x.y.z.a@tcp 
12345-a.b.c.d@tcp1
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="239908" author="degremoa" created="Mon, 14 Jan 2019 17:16:46 +0000"  >&lt;p&gt;I confirm that with a 2.12 server, where tcp1 is declared first, 2.12 client mount is still ok.&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;That&apos;s also why I&apos;m thinking there is some MR magic in action here. Looks like the peer state in 2.12 client is not the same if the server supports &apos;push&apos; or not. It seems you&apos;re right that in the simple case, it will just use the first NID that is returned when being pinged.&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;The problem is that there is still a regression between 2.10 and 2.12 in my opinion. A setup that did not look unsupported worked with a 2.10 client and no longer works with a 2.12 client.&lt;/p&gt;

&lt;p&gt;I think the 2.12 client relies on a server-side feature that was introduced after 2.10 to properly set up its peer state and use the correct NID when mounting. If the server is a 2.10 one, the push/discovery feature does not exist and the client does not try to do something smart with all the available IPs. It just tries the first one and fails.&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;</comment>
                            <comment id="239933" author="degremoa" created="Mon, 14 Jan 2019 20:15:39 +0000"  >&lt;p&gt;More test results, with a 2.12 client and 2.10 server, and NIDs being set in non-optimal order (ping returns tcp1 as first nid):&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;Add peer before mount: &lt;tt&gt;lnetctl peer add --prim_nid x.y.z.a@tcp&lt;/tt&gt;: Mount is OK&lt;/li&gt;
	&lt;li&gt;Add peer, declared as non_mr, before mount: &lt;tt&gt;lnetctl peer add --prim_nid x.y.z.a@tcp --non_mr&lt;/tt&gt;: Error&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;&#160;&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;Add peer, bad nid as prim_nid, 2 nids declared: &lt;tt&gt;lnetctl peer add --prim_nid a.b.c.d@tcp1 --nid x.y.z.a@tcp,a.b.c.d@tcp1&lt;/tt&gt;: OK&lt;/li&gt;
	&lt;li&gt;Add peer, bad nid as prim_nid, 2 nids declared as non_MR: &lt;tt&gt;lnetctl peer add --prim_nid a.b.c.d@tcp1 --nid x.y.z.a@tcp,a.b.c.d@tcp1 --non_mr&lt;/tt&gt;: Error&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;So that means that to have it working, we need either:&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;tcp0 as the first NID returned by server&lt;/li&gt;
	&lt;li&gt;the peer being explicitly declared as MultiRail capable, whatever the prim_nid is.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Hope that helps you understand the problem.&lt;/p&gt;</comment>
                            <comment id="240077" author="ashehata" created="Wed, 16 Jan 2019 06:40:54 +0000"  >&lt;p&gt;As discussed today, the workaround where you configure the tcp NID to be primary on the server will work in your case.&lt;/p&gt;

&lt;p&gt;In the meantime I&apos;ve been looking at a way to resolve the incompatibility between a discovery-enabled node and a non-discovery-capable node (i.e. 2.10.x), and I have hit a snag.&lt;/p&gt;

&lt;p&gt;I&apos;m testing two different scenarios&lt;/p&gt;
&lt;ol&gt;
	&lt;li&gt;OST(2.12) MDT(2.10.x) Client(2.12)&lt;/li&gt;
	&lt;li&gt;OST(2.10.x) MDT(2.10.x) Client (2.12)&lt;/li&gt;
&lt;/ol&gt;


&lt;p&gt;Unfortunately, Lustre does its own NID lookup without using LNet to pull the NID information in both scenarios, particularly here:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
/**
 * Retrieve MDT nids from the client log, then start the lwp device.
 * there are only two scenarios which would include mdt nid.
 * 1.
 * marker   5 (flags=0x01, v2.1.54.0) lustre-MDTyyyy  &lt;span class=&quot;code-quote&quot;&gt;&apos;add mdc&apos;&lt;/span&gt; xxx-
 * add_uuid  nid=192.168.122.162@tcp(0x20000c0a87aa2)  0:  1:192.168.122.162@tcp
 * attach    0:lustre-MDTyyyy-mdc  1:mdc  2:lustre-clilmv_UUID
 * setup     0:lustre-MDTyyyy-mdc  1:lustre-MDTyyyy_UUID  2:192.168.122.162@tcp
 * add_uuid  nid=192.168.172.1@tcp(0x20000c0a8ac01)  0:  1:192.168.172.1@tcp
 * add_conn  0:lustre-MDTyyyy-mdc  1:192.168.172.1@tcp
 * modify_mdc_tgts add 0:lustre-clilmv  1:lustre-MDTyyyy_UUID xxxx
 * marker   5 (flags=0x02, v2.1.54.0) lustre-MDTyyyy  &lt;span class=&quot;code-quote&quot;&gt;&apos;add mdc&apos;&lt;/span&gt; xxxx-
 * 2.
 * marker   7 (flags=0x01, v2.1.54.0) lustre-MDTyyyy  &lt;span class=&quot;code-quote&quot;&gt;&apos;add failnid&apos;&lt;/span&gt; xxxx-
 * add_uuid  nid=192.168.122.2@tcp(0x20000c0a87a02)  0:  1:192.168.122.2@tcp
 * add_conn  0:lustre-MDTyyyy-mdc  1:192.168.122.2@tcp
 * marker   7 (flags=0x02, v2.1.54.0) lustre-MDTyyyy  &lt;span class=&quot;code-quote&quot;&gt;&apos;add failnid&apos;&lt;/span&gt; xxxx-
 **/
&lt;span class=&quot;code-keyword&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;code-object&quot;&gt;int&lt;/span&gt; client_lwp_config_process(&lt;span class=&quot;code-keyword&quot;&gt;const&lt;/span&gt; struct lu_env *env,
                                     struct llog_handle *handle,
                                     struct llog_rec_hdr *rec, void *data)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Lustre tries to retrieve the MDT nids from the client log and it looks at the first NID in the list. In both cases the OST is unable to mount the MGS, because it&apos;s using the tcp1 NID to get the peer and ends in this error:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
(events.c:543:ptlrpc_uuid_to_peer()) 192.168.122.117@tcp1-&amp;gt;12345-&amp;lt;?&amp;gt;
(client.c:97:ptlrpc_uuid_to_connection()) cannot find peer 192.168.122.117@tcp1!
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;This error is independent from the backwards compatibility issue. My config looks like:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
OST:
----
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: tcp
      local NI(s):
        - nid: 192.168.122.114@tcp
          status: up
          interfaces:
              0: eth0
        - nid: 192.168.122.115@tcp
          status: up
          interfaces:
              0: eth1

MDT:
----
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: tcp1
      local NI(s):
        - nid: 192.168.122.117@tcp1
          status: up
          interfaces:
              0: eth0
    - net type: tcp
      local NI(s):
        - nid: 192.168.122.118@tcp
          status: up
          interfaces:
              0: eth1
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;I&apos;m curious how you set up your OSTs so that you don&apos;t run into the problem above.&lt;/p&gt;</comment>
                            <comment id="240154" author="degremoa" created="Wed, 16 Jan 2019 17:31:32 +0000"  >&lt;p&gt;My LNet setup looks like the MDT one. There are 2 LNDs, tcp0 and tcp1, and only one interface for each of them.&lt;/p&gt;

&lt;p&gt;We did the test together on a simple system where both MDT and OST were on the same server, but I do not think this makes a difference here.&lt;/p&gt;

&lt;p&gt;Looking at my MGS client llog, it looks rather like the case #1.&lt;/p&gt;

&lt;p&gt;Devices were formatted specifying a simple service node option (see ticket description).&lt;/p&gt;</comment>
                            <comment id="241773" author="degremoa" created="Tue, 12 Feb 2019 17:52:45 +0000"  >&lt;p&gt;@ashehata Did you make any progress on this topic?&lt;/p&gt;

&lt;p&gt;I&apos;m facing a similar issue with a pure 2.10.5 configuration.&lt;/p&gt;

&lt;p&gt;Lustre servers have both tcp0 and tcp1 NIDs. MDT/OSTs are set up to use both of them. But Lustre servers will try to communicate using only the first configured interface. If that fails (timeout), they never try the second one.&lt;/p&gt;

&lt;p&gt;Do you have any clue?&lt;/p&gt;</comment>
                            <comment id="241799" author="ashehata" created="Tue, 12 Feb 2019 19:49:14 +0000"  >&lt;p&gt;Yes. I believe that&apos;s the issue I pointed to here: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11840?focusedCommentId=240077&amp;amp;page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-240077&quot; class=&quot;external-link&quot; rel=&quot;nofollow&quot;&gt;https://jira.whamcloud.com/browse/LU-11840?focusedCommentId=240077&amp;amp;page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-240077&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Lustre (not LNet) does its own NID lookup based on logs. The assumption inherent in the code is that there is only one NID per node, which is not right.&lt;/p&gt;</comment>
                            <comment id="241807" author="ashehata" created="Tue, 12 Feb 2019 21:27:49 +0000"  >&lt;p&gt;I&apos;m working on a solution. Will update the ticket when I have a patch to test.&lt;/p&gt;</comment>
                            <comment id="241871" author="degremoa" created="Wed, 13 Feb 2019 15:40:56 +0000"  >&lt;p&gt;Thanks a lot! Do you have a rough idea if this is days or weeks of work?&lt;/p&gt;</comment>
                            <comment id="242007" author="ashehata" created="Thu, 14 Feb 2019 20:50:04 +0000"  >&lt;p&gt;I don&apos;t think that it&apos;s a huge amount of work but I am focused on 2.13 feature work ATM so have not looked at it in much detail yet&lt;/p&gt;</comment>
                            <comment id="242079" author="degremoa" created="Fri, 15 Feb 2019 16:11:21 +0000"  >&lt;p&gt;OK, understood.&lt;/p&gt;

&lt;p&gt;A simple question based on your config output: should we declare 0@lo in an lnet.conf file that is used with lnetctl import?&lt;/p&gt;

&lt;p&gt;I could not find a clear statement on that looking at different places.&lt;/p&gt;</comment>
                            <comment id="242080" author="ashehata" created="Fri, 15 Feb 2019 16:31:47 +0000"  >&lt;p&gt;No. 0@lo will always get ignored because it&apos;s created implicitly. So you don&apos;t have to have it in the lnet.conf file.&lt;/p&gt;

&lt;p&gt;There is actually a patch&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-10452&quot; title=&quot;lnetctl export/import suggested improvements&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-10452&quot;&gt;&lt;del&gt;LU-10452&lt;/del&gt;&lt;/a&gt; lnet: cleanup YAML output&lt;/p&gt;

&lt;p&gt;which allows you to use a &quot;--backup&quot; option to print a YAML block with only the elements needed to reconfigure a system.&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
lnetctl net show --backup 

# also, when you export, the backup feature is set automatically

lnetctl export &amp;gt; lnet.conf&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="242081" author="degremoa" created="Fri, 15 Feb 2019 16:41:38 +0000"  >&lt;p&gt;Really helpful. Thank you!&lt;/p&gt;</comment>
                            <comment id="247578" author="degremoa" created="Thu, 23 May 2019 09:28:26 +0000"  >&lt;p&gt;For the record, disabling LNet discovery seems to work around the issue&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;lnetctl set discovery 0&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;before mounting the Lustre client.&lt;/p&gt;</comment>
                            <comment id="251476" author="degremoa" created="Tue, 16 Jul 2019 15:27:25 +0000"  >&lt;p&gt;I realized I could not reproduce this bug with the latest master branch code.&lt;/p&gt;

&lt;p&gt;I tracked this behavior change down to between 2.12.54 and 2.12.55, so it is very likely related to the MR Routing feature landing. I did not track down the precise patch.&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;Lustre 2.12 is still impacted though.&lt;/p&gt;</comment>
                            <comment id="254363" author="sebg-crd-pm" created="Mon, 9 Sep 2019 01:18:34 +0000"  >&lt;p&gt;Hi, I have a similar problem when the client, server, and router are all running Lustre 2.12.&lt;/p&gt;

&lt;p&gt;The Lustre client tries to communicate using only the first configured interface of the server; it never tries the second one.&lt;/p&gt;

&lt;p&gt;clientA:&lt;br/&gt;
options lnet networks=&quot;tcp(eno1)&quot; routes=&quot;o2ib 172.26.1.222@tcp&quot;&lt;/p&gt;

&lt;p&gt;router:&lt;br/&gt;
options lnet networks=&quot;tcp(eno1),o2ib(ib0)&quot; &quot;forwarding=enabled&quot;&lt;/p&gt;

&lt;p&gt;server:&lt;br/&gt;
options lnet networks=&quot;tcp2(eno1),o2ib0(ib0)&quot; routes=&quot;tcp 172.20.0.222@o2ib&quot; &lt;br/&gt;
=&amp;gt; clientA fails to mount the server over o2ib&lt;br/&gt;
options lnet networks=&quot;o2ib0(ib0),tcp2(eno1)&quot; routes=&quot;tcp 172.20.0.222@o2ib&quot; &lt;br/&gt;
=&amp;gt; clientA mounts the server over o2ib successfully&lt;/p&gt;</comment>
                            <comment id="272077" author="knweiss" created="Fri, 5 Jun 2020 15:36:00 +0000"  >&lt;p&gt;AFAICS I also have this issue:&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;server (lustre-2.10.4-1.el7.x86_64):
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
options lnet networks=tcp0(en0),o2ib0(in0)&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;ul&gt;
	&lt;li&gt;client (lustre-2.12.5-RC1-0.el7.x86_64):
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
options lnet networks=&lt;span class=&quot;code-quote&quot;&gt;&quot;o2ib(ib0)&quot;&lt;/span&gt;&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;These two workarounds seem to work (only &lt;b&gt;very&lt;/b&gt; limited testing so far):&lt;/p&gt;
&lt;ol&gt;
	&lt;li&gt;Configuring LNET tcp on the client (although I actually only want to use IB):
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
options lnet networks=&lt;span class=&quot;code-quote&quot;&gt;&quot;o2ib(ib0),tcp(enp3s0f0)&quot;&lt;/span&gt; &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;&lt;/li&gt;
	&lt;li&gt;Executing this before the actual Lustre mount:
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
lnetctl set discovery 0&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;&lt;/li&gt;
&lt;/ol&gt;
</comment>
                            <comment id="272088" author="ashehata" created="Fri, 5 Jun 2020 17:45:38 +0000"  >&lt;p&gt;Krasten,&lt;/p&gt;

&lt;p&gt;There is a known incompatibility between discovery-enabled Lustre and older Lustre versions, i.e. 2.10.&lt;/p&gt;

&lt;p&gt;We have a ticket open for it and we&apos;re looking at how we can resolve this:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13548&quot; class=&quot;external-link&quot; rel=&quot;nofollow&quot;&gt;https://jira.whamcloud.com/browse/LU-13548&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Currently disabling discovery on 2.12 is the best workaround.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="59153">LU-13548</issuekey>
        </issuelink>
                            </outwardlinks>
                                                                <inwardlinks description="is related to">
                                                        </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i008zz:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>