<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:47:48 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-11888] Unreachable client NID confusing Lustre 2.12</title>
                <link>https://jira.whamcloud.com/browse/LU-11888</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Just wanted to report this, although probably not critical. During testing of 2.12.0 on IB only (o2ib with routers), we mistakenly set up a client with two NIDs, one on tcp0:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@sh-06-33 ~]# lctl list_nids
10.10.6.33@tcp
10.8.6.33@o2ib6
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;This confused the Lustre servers a LOT:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[663974.083382] LNetError: 124939:0:(peer.c:2480:lnet_peer_merge_data()) Error deleting NID 10.10.6.33@tcp from peer 10.10.6.33@tcp: -16
[663974.095393] Lustre: MGS: Connection restored to 49c3dc53-716e-1689-4b27-666ee552afb7 (at 10.10.6.33@tcp)
[663981.577418] LNetError: 127396:0:(lib-move.c:1980:lnet_handle_find_routed_path()) no route to 10.10.6.33@tcp from &amp;lt;?&amp;gt;
[663981.588032] LNetError: 127396:0:(lib-move.c:1980:lnet_handle_find_routed_path()) Skipped 6171721 previous similar messages
[663981.599239] Lustre: 127396:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1548442345/real 1548442345]  req@ffff9149c37fb600 x1623580872768976/t0(0) o104-&amp;gt;fir-MDT0000@10.10.6.33@tcp:15/16 lens 296/224 e 0 to 1 dl 1548442356 ref 1 fl Rpc:eX/0/ffffffff rc 0/-1
[663981.626855] Lustre: 127396:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 6167666 previous similar messages
[664132.508056] LustreError: 127396:0:(ldlm_lockd.c:682:ldlm_handle_ast_error()) ### client (nid 10.10.6.33@tcp) failed to reply to blocking AST (req@ffff9149c37fb600 x1623580872768976 status 0 rc -110), evict it ns: mdt-fir-MDT0000_UUID lock: ffff9129d5ad1f80/0xffdc3423e2d5331a lrc: 4/0,0 mode: PR/PR res: [0x200000007:0x1:0x0].0x0 bits 0x13/0x0 rrc: 10 type: IBT flags: 0x60200400000020 nid: 10.10.6.33@tcp remote: 0x12093504df377bf7 expref: 109 pid: 125204 timeout: 664264 lvb_type: 0
[664132.550562] LustreError: 138-a: fir-MDT0000: A client on nid 10.10.6.33@tcp was evicted due to a lock blocking callback time out: rc -110
[664132.563014] LustreError: 125084:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 151s: evicting client at 10.10.6.33@tcp  ns: mdt-fir-MDT0000_UUID lock: ffff9129d5ad1f80/0xffdc3423e2d5331a lrc: 3/0,0 mode: PR/PR res: [0x200000007:0x1:0x0].0x0 bits 0x13/0x0 rrc: 10 type: IBT flags: 0x60200400000020 nid: 10.10.6.33@tcp remote: 0x12093504df377bf7 expref: 110 pid: 125204 timeout: 0 lvb_type: 0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;But the main problem is that, even after the client &lt;b&gt;rebooted&lt;/b&gt; &lt;ins&gt;without&lt;/ins&gt; the tcp0 NID, the server was still logging things like:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[664150.993807] Lustre: fir-MDT0000: Connection restored to 49c3dc53-716e-1689-4b27-666ee552afb7 (at 10.10.6.33@tcp)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Re: this last line, it was logged after the client had rebooted. It looks like the server only prints the first client NID, but in this case it remembered the client&apos;s previous tcp0 NID, which is weird...&lt;/p&gt;

&lt;p&gt;The servers are using o2ib only:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@fir-md1-s1 ~]# lctl list_nids
10.0.10.51@o2ib7
[root@fir-md1-s1 ~]# lctl route_list
net              o2ib4 hops 4294967295 gw                10.0.10.210@o2ib7 up pri 0
net              o2ib4 hops 4294967295 gw                10.0.10.209@o2ib7 up pri 0
net              o2ib4 hops 4294967295 gw                10.0.10.211@o2ib7 up pri 0
net              o2ib4 hops 4294967295 gw                10.0.10.212@o2ib7 up pri 0
net              o2ib6 hops 4294967295 gw                10.0.10.202@o2ib7 up pri 0
net              o2ib6 hops 4294967295 gw                10.0.10.204@o2ib7 up pri 0
net              o2ib6 hops 4294967295 gw                10.0.10.201@o2ib7 up pri 0
net              o2ib6 hops 4294967295 gw                10.0.10.203@o2ib7 up pri 0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;We were wondering how this is even possible. To fix this in a timely manner, we had to restart the Lustre servers.&lt;/p&gt;

&lt;p&gt;Stephane&lt;/p&gt;</description>
                <environment>CentOS 7.6</environment>
        <key id="54666">LU-11888</key>
            <summary>Unreachable client NID confusing Lustre 2.12</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="sharmaso">Sonia Sharma</assignee>
                                    <reporter username="sthiell">Stephane Thiell</reporter>
                        <labels>
                    </labels>
                <created>Fri, 25 Jan 2019 19:19:47 +0000</created>
                <updated>Mon, 2 Nov 2020 18:23:12 +0000</updated>
                                            <version>Lustre 2.12.0</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>6</watches>
                                                                            <comments>
                            <comment id="240736" author="pjones" created="Fri, 25 Jan 2019 23:22:34 +0000"  >&lt;p&gt;Sonia&lt;/p&gt;

&lt;p&gt;Could you please investigate?&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="240815" author="ashehata" created="Mon, 28 Jan 2019 17:16:07 +0000"  >&lt;p&gt;This looks similar to this issue:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11840&quot; class=&quot;external-link&quot; rel=&quot;nofollow&quot;&gt;https://jira.whamcloud.com/browse/LU-11840&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="283969" author="mrb" created="Mon, 2 Nov 2020 14:03:44 +0000"  >&lt;p&gt;Hello,&lt;br/&gt;
I believe I just ran into this same issue as well.&lt;br/&gt;
Both clients and servers are RHEL 7.8, running 2.12.5, MOFED 4.9.&lt;/p&gt;

&lt;p&gt;We had a single client refusing to mount the filesystem. This client has NID:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@cpu-p-10 ~]# lctl list_nids
10.44.161.10@o2ib2
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;However, the server logged the following, similar to what Stephane reported:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Nov 02 13:29:53 rds-mds7 kernel: LNetError: 168122:0:(peer.c:2453:lnet_peer_merge_data()) Error deleting NID 10.43.161.10@tcp from peer 10.43.161.10@tcp: -16
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;That IP address belongs to the Ethernet interface on this node (it differs only in the second octet). This error likely started when LNet was previously misconfigured, or when no config was given to lnetctl, so it used the first TCP interface on the node.&lt;/p&gt;

&lt;p&gt;As in Stephane&apos;s case, this error persisted across reboots. Fortunately, however, I could fix it by manually deleting the peer entry on the server:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# Before
[root@rds-mds7 ~]# lnetctl peer show --nid 10.43.161.10@tcp
peer:
    - primary nid: 10.43.161.10@tcp
      Multi-Rail: True
      peer ni:
        - nid: 10.44.161.10@o2ib2
          state: NA
        - nid: 10.43.161.10@tcp
          state: NA

[root@rds-mds7 ~]# lnetctl peer del --prim_nid 10.43.161.10@tcp
[root@rds-mds7 ~]# lnetctl peer show --nid 10.43.161.10@tcp
show:
    - peer:
          errno: -2
          descr: &quot;cannot get peer information: No such file or directory&quot;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Now the mount works correctly.&lt;/p&gt;

&lt;p&gt;I don&apos;t think this adds anything, but just wanted to +1 this ticket. First time we&apos;ve seen this, so will keep an eye out if this happens again.&lt;/p&gt;

&lt;p&gt;I took a look at the linked &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11840&quot; title=&quot;Multi rail dynamic discovery prevent mounting filesystem when some NIC is unreachable&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11840&quot;&gt;LU-11840&lt;/a&gt;, but the workaround described there (disabling discovery on the client) didn&apos;t fix this for me.&lt;/p&gt;

&lt;p&gt;Cheers,&lt;br/&gt;
Matt&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10010">
                    <name>Duplicate</name>
                                                                <inwardlinks description="is duplicated by">
                                        <issuelink>
            <issuekey id="54800">LU-11936</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="61484">LU-14107</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i00ab3:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                </customfields>
    </item>
</channel>
</rss>