<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:05:01 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-13883] LNetPrimaryNID assumes lpni for a particular NID will not change through discovery but lnet_peer_add_nid() may allocate a new one for it.</title>
                <link>https://jira.whamcloud.com/browse/LU-13883</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;After a server reboot (failover/failback test), the client connects to the server using its non-primary NID 192.168.2.38@tcp2. The server calls LNetPrimaryNID() to set up the export.&lt;/p&gt;

&lt;p&gt;A new peer is set up and queued for discovery:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000020:00000040:0.0:1596142526.904195:0:3881:0:(genops.c:1009:class_export_get()) GETting export ffff9efa080c4000 : new refcount 6
00000100:00000040:0.0:1596142526.904198:0:3881:0:(service.c:1054:ptlrpc_request_change_export()) RPC GETting export ffff9efa080c4000 : new rpc_count 1
00000400:00000200:0.0:1596142526.904210:0:3881:0:(peer.c:285:lnet_peer_alloc()) ffff9efa0c071c00 nid 192.168.2.38@tcp2
00000400:00000200:0.0:1596142526.904215:0:3881:0:(peer.c:220:lnet_peer_net_alloc()) ffff9efa22389a40 net tcp2
00000400:00000200:0.0:1596142526.904222:0:3881:0:(peer.c:202:lnet_peer_ni_alloc()) ffff9efa0c550000 nid 192.168.2.38@tcp2
00000400:00000200:0.0:1596142526.904228:0:3881:0:(peer.c:1329:lnet_peer_attach_peer_ni()) peer 192.168.2.38@tcp2 NID 192.168.2.38@tcp2 flags 0x0
00000400:00000200:0.0:1596142526.904233:0:3881:0:(lib-lnet.h:96:lnet_peer_set_state()) Peer 192.168.2.38@tcp2(ffff9efa0c071c00) 8192 state 0x2000
00000400:00000200:0.0:1596142526.904236:0:3881:0:(lib-lnet.h:96:lnet_peer_set_state()) Peer 192.168.2.38@tcp2(ffff9efa0c071c00) 16384 state 0x6000
00000400:00000200:0.0:1596142526.904241:0:3881:0:(lib-lnet.h:96:lnet_peer_set_state()) Peer 192.168.2.38@tcp2(ffff9efa0c071c00) 64 state 0x6040
00000400:00000200:0.0:1596142526.904249:0:3881:0:(peer.c:1931:lnet_peer_queue_for_discovery()) Queue peer 192.168.2.38@tcp2: 0
00000400:00000200:0.0:1596142526.904251:0:3881:0:(peer.c:2244:lnet_discover_peer_locked()) Discovery attempt # 1
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The discovery ping is sent:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000400:00000200:0.0:1596142526.904351:0:2860:0:(peer.c:3342:lnet_peer_discovery_wait_for_work()) woken: 0
00000400:00000200:0.0:1596142526.904355:0:2860:0:(peer.c:3457:lnet_peer_discovery()) peer 192.168.2.38@tcp2(ffff9efa0c071c00) state 0x6040
00000400:00000200:0.0:1596142526.904359:0:2860:0:(peer.c:3057:lnet_peer_send_ping()) peer 192.168.2.38@tcp2(ffff9efa0c071c00)
00000400:00000200:0.0:1596142526.904362:0:2860:0:(lib-lnet.h:96:lnet_peer_set_state()) Peer 192.168.2.38@tcp2(ffff9efa0c071c00) 512 state 0x6240
00000400:00000200:0.0:1596142526.904365:0:2860:0:(lib-lnet.h:104:lnet_peer_clear_state()) Peer 192.168.2.38@tcp2(ffff9efa0c071c00) 8192 state 0x4240
00000400:00000200:0.0:1596142526.904369:0:2860:0:(peer.c:3008:lnet_peer_select_nid()) peer 192.168.2.38@tcp2(ffff9efa0c071c00)
00000400:00000200:0.0:1596142526.904381:0:2860:0:(lib-move.c:5126:LNetGet()) LNetGet -&amp;gt; 12345-192.168.2.38@tcp2
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The ping reply is processed:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000400:00000200:0.0:1596142526.905282:0:2860:0:(peer.c:3342:lnet_peer_discovery_wait_for_work()) woken: 0
00000400:00000200:0.0:1596142526.905287:0:2860:0:(peer.c:3457:lnet_peer_discovery()) peer 192.168.2.38@tcp2(ffff9efa0c071c00) state 0x40c1
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The discovery thread looks up the primary NID in the ping buffer and finds an existing lpni for it (another request from the same client had already arrived from the client&apos;s 192.168.2.38@tcp99 NID).&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000400:00000200:0.0:1596142526.905291:0:2860:0:(peer.c:2851:lnet_peer_data_present()) peer 192.168.2.38@tcp2(ffff9efa0c071c00)
00000400:00000200:0.0:1596142526.905308:0:2860:0:(peer.c:2771:lnet_peer_set_primary_data()) peer 192.168.2.38@tcp99(ffff9ef9de8dce00)
00000400:00000200:0.0:1596142526.905311:0:2860:0:(peer.c:1931:lnet_peer_queue_for_discovery()) Queue peer 192.168.2.38@tcp99: -114
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;After reconciling the two peers in lnet_peer_set_primary_data(), we merge the info in pbuf with the peer object for primary NID 192.168.2.38@tcp99.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000400:00000200:0.0:1596142526.905323:0:2860:0:(peer.c:2622:lnet_peer_merge_data()) peer 192.168.2.38@tcp99(ffff9ef9de8dce00)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;When we call lnet_peer_add_nid() for 192.168.2.38@tcp2, that function sees the existing lpni for that NID and sees that the NID is &quot;primary&quot; for its peer object, so it deletes the peer and the peer NI, then creates a new lpni for the 192.168.2.38@tcp2 NID.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000400:00000200:0.0:1596142526.905333:0:2860:0:(peer.c:457:lnet_peer_del_locked()) peer 192.168.2.38@tcp2
00000400:00000200:0.0:1596142526.905337:0:2860:0:(peer.c:377:lnet_peer_detach_peer_ni_locked()) peer 192.168.2.38@tcp2 NID 192.168.2.38@tcp2
00000400:00000200:0.0:1596142526.905341:0:2860:0:(peer.c:202:lnet_peer_ni_alloc()) ffff9efa0c550800 nid 192.168.2.38@tcp2
00000400:00000200:0.0:1596142526.905344:0:2860:0:(peer.c:220:lnet_peer_net_alloc()) ffff9efa0d340d40 net tcp2
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;However, LNetPrimaryNID() still holds a reference on the original lpni for the 192.168.2.38@tcp2 NID, and it will use that lpni to look up the primary NID of the peer when discovery completes.&lt;/p&gt;

&lt;p&gt;Below we can see that after discovery completes, LNetPrimaryNID() releases its reference on the original lpni, which allows the peer NI, peer net and peer objects to be freed. LNetPrimaryNID() then returns the wrong NID for this peer (192.168.2.38@tcp2 rather than the primary NID 192.168.2.38@tcp99).&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;...
00000400:00000200:0.0:1596142526.905414:0:2860:0:(peer.c:1949:lnet_peer_discovery_complete()) Discovery complete. Dequeue peer 192.168.2.38@tcp2
...
00000400:00000200:0.0:1596142526.905518:0:3881:0:(peer.c:2292:lnet_discover_peer_locked()) peer 192.168.2.38@tcp2 NID 192.168.2.38@tcp2: -113. discovery complete
00000400:00000200:0.0:1596142526.905522:0:3881:0:(peer.c:1721:lnet_destroy_peer_ni_locked()) ffff9efa0c550000 nid 192.168.2.38@tcp2
00000400:00000200:0.0:1596142526.905526:0:3881:0:(peer.c:230:lnet_destroy_peer_net_locked()) ffff9efa22389a40 net tcp2
00000400:00000200:0.0:1596142526.905529:0:3881:0:(peer.c:293:lnet_destroy_peer_locked()) ffff9efa0c071c00 nid 192.168.2.38@tcp2
00000400:00000200:0.0:1596142526.905535:0:3881:0:(peer.c:1232:LNetPrimaryNID()) NID 192.168.2.38@tcp2 primary NID 192.168.2.38@tcp2 rc -113
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Here is the relevant code in LNetPrimaryNID():&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;                rc = lnet_discover_peer_locked(lpni, cpt, true);
                if (rc)
                        goto out_decref;
                lp = lpni-&amp;gt;lpni_peer_net-&amp;gt;lpn_peer;

                /* Only try once if discovery is disabled */
                if (lnet_is_discovery_disabled(lp))
                        break;
        }
        primary_nid = lp-&amp;gt;lp_primary_nid;
out_decref:
        lnet_peer_ni_decref_locked(lpni);
out_unlock:
        lnet_net_unlock(cpt);

        CDEBUG(D_NET, &quot;NID %s primary NID %s rc %d\n&quot;, libcfs_nid2str(nid),
               libcfs_nid2str(primary_nid), rc);
        return primary_nid;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
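
&lt;p&gt;The hazard in the snippet above can be modelled in user space. The following is a minimal sketch, not LNet code: every toy_* name, the slot table and the relookup flag are hypothetical. The flag stands in for the idea of looking the peer NI up again by NID once discovery has finished, instead of trusting the entry found beforehand.&lt;/p&gt;

```c
/* Toy user-space model of the stale-lpni hazard. Not LNet code:
 * all toy_* names and the slot table are hypothetical. */
#include "string.h"	/* standard header, quoted include form */

#define TOY_MAX_NI	8
#define TOY_NID_LEN	32

struct toy_peer_ni {
	int  in_use;			/* slot is attached to a peer */
	char nid[TOY_NID_LEN];		/* NID of this peer NI */
	char primary_nid[TOY_NID_LEN];	/* primary NID of the owning peer */
};

static struct toy_peer_ni toy_table[TOY_MAX_NI];

void toy_reset(void)
{
	memset(toy_table, 0, sizeof(toy_table));
}

/* Return the slot currently attached for nid, or -1 if there is none. */
static int toy_find(const char *nid)
{
	int i;

	for (i = 0; i != TOY_MAX_NI; i++) {
		if (toy_table[i].in_use) {
			if (strcmp(toy_table[i].nid, nid) == 0)
				return i;
		}
	}
	return -1;
}

/* Attach nid to the peer whose primary NID is primary; return the slot. */
static int toy_attach(const char *nid, const char *primary)
{
	int i;

	if (strlen(nid) > TOY_NID_LEN - 1 || strlen(primary) > TOY_NID_LEN - 1)
		return -1;
	for (i = 0; i != TOY_MAX_NI; i++) {
		if (!toy_table[i].in_use) {
			toy_table[i].in_use = 1;
			strcpy(toy_table[i].nid, nid);
			strcpy(toy_table[i].primary_nid, primary);
			return i;
		}
	}
	return -1;
}

/* Model of discovery: a fresh entry replaces the old one for nid. The old
 * slot is detached but keeps its contents, like a still-referenced object. */
static void toy_discover(const char *nid, const char *primary)
{
	int old = toy_find(nid);

	/* Attach first so the replacement lands in a different slot. */
	if (toy_attach(nid, primary) == -1)
		return;			/* table full: leave the old entry alone */
	if (old != -1)
		toy_table[old].in_use = 0;
}

/* relookup == 0: reuse the slot found before discovery (the hazard).
 * relookup != 0: find the slot again by NID after discovery. */
const char *toy_primary_nid(const char *nid, const char *discovered,
			    int relookup)
{
	int slot = toy_find(nid);

	if (slot == -1)
		slot = toy_attach(nid, nid);	/* new peer: NID is its own primary */
	if (slot == -1)
		return nid;

	toy_discover(nid, discovered);

	if (relookup) {
		slot = toy_find(nid);
		if (slot == -1)
			return nid;
	}
	return toy_table[slot].primary_nid;
}
```

&lt;p&gt;Calling toy_primary_nid() on an empty table with NID 192.168.2.38@tcp2, discovered primary 192.168.2.38@tcp99 and relookup set to 0 returns 192.168.2.38@tcp2, which matches the final log line above. With relookup set to 1 the same call returns 192.168.2.38@tcp99.&lt;/p&gt;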

&lt;p&gt;I can think of a couple of solutions:&lt;br/&gt;
1. Modify lnet_peer_add_nid() so that it keeps an existing lpni if it finds one for the NID it is trying to add. This might not be workable because lnet_create_reply_msg() currently dereferences the lpni-&amp;gt;lpni_peer_net-&amp;gt;lpn_peer hierarchy without holding a net lock, and with this proposed change that hierarchy would not be safe to read without holding the net lock. I don&apos;t think we want to introduce the net lock into that function.&lt;/p&gt;

&lt;p&gt;2. Modify LNetPrimaryNID() (and probably lnet_discover_peer_locked() &lt;span class=&quot;error&quot;&gt;&amp;#91;Edit: It looks like all callers of lnet_discover_peer_locked() would also need to be adjusted&amp;#93;&lt;/span&gt;) so that they do not assume the lpni persists across discovery.&lt;/p&gt;</description>
                <environment></environment>
        <key id="60313">LU-13883</key>
            <summary>LNetPrimaryNID assumes lpni for a particular NID will not change through discovery but lnet_peer_add_nid() may allocate a new one for it.</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="ssmirnov">Serguei Smirnov</assignee>
                                    <reporter username="hornc">Chris Horn</reporter>
                        <labels>
                            <label>lnet</label>
                            <label>multi-rail</label>
                    </labels>
                <created>Thu, 6 Aug 2020 15:25:55 +0000</created>
                <updated>Fri, 26 Aug 2022 16:30:56 +0000</updated>
                            <resolved>Thu, 11 Mar 2021 05:04:03 +0000</resolved>
                                                    <fixVersion>Lustre 2.15.0</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>5</watches>
                                                                            <comments>
                            <comment id="276974" author="hornc" created="Fri, 7 Aug 2020 18:39:59 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=ssmirnov&quot; class=&quot;user-hover&quot; rel=&quot;ssmirnov&quot;&gt;ssmirnov&lt;/a&gt; I&apos;ve been working on a series of patches related to this issue. Do you already have any code in progress for it, or have you only just started looking at it?&lt;/p&gt;</comment>
                            <comment id="276997" author="gerrit" created="Fri, 7 Aug 2020 21:28:41 +0000"  >&lt;p&gt;Chris Horn (chris.horn@hpe.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/39606&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/39606&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13883&quot; title=&quot;LNetPrimaryNID assumes lpni for a particular NID will not change through discovery but lnet_peer_add_nid() may allocate a new one for it.&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13883&quot;&gt;&lt;del&gt;LU-13883&lt;/del&gt;&lt;/a&gt; lnet: Lookup lpni after discovery&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: ea64792a4e55685fdaa0b9b5260d0fbdcdb43587&lt;/p&gt;</comment>
                            <comment id="276999" author="hornc" created="Fri, 7 Aug 2020 21:46:35 +0000"  >&lt;p&gt;The patch implements the idea in #2 of the description, but only for LNetPrimaryNID() and lnet_discover(). I don&apos;t &lt;em&gt;think&lt;/em&gt; we need to adjust lnet_initiate_peer_discovery()... but I could be mistaken.&lt;/p&gt;</comment>
                            <comment id="277000" author="hornc" created="Fri, 7 Aug 2020 21:48:05 +0000"  >&lt;p&gt;Also note that there are other issues with this particular use case that can still cause LNetPrimaryNID() to return the wrong NID. The patch in this ticket is part of a series that addresses those other issues.&lt;/p&gt;</comment>
                            <comment id="277226" author="gerrit" created="Tue, 11 Aug 2020 19:41:09 +0000"  >&lt;p&gt;Chris Horn (chris.horn@hpe.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/39650&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/39650&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13883&quot; title=&quot;LNetPrimaryNID assumes lpni for a particular NID will not change through discovery but lnet_peer_add_nid() may allocate a new one for it.&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13883&quot;&gt;&lt;del&gt;LU-13883&lt;/del&gt;&lt;/a&gt; lnet: Refactor lnet_discover_peer_locked&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: a9759f16cf72e87c7ae9400ce40c4d13f9c3b5b6&lt;/p&gt;</comment>
                            <comment id="277235" author="ssmirnov" created="Tue, 11 Aug 2020 21:50:46 +0000"  >&lt;p&gt;Hi Chris,&lt;/p&gt;

&lt;p&gt;I only got around to this now, so I don&apos;t have anything that would be in conflict with your proposal.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;/p&gt;

&lt;p&gt;Serguei.&lt;/p&gt;</comment>
                            <comment id="278183" author="eaujames" created="Thu, 27 Aug 2020 07:26:01 +0000"  >&lt;p&gt;Hi,&lt;/p&gt;

&lt;p&gt;Am I right to assume that this LU is related to LNet&apos;s multirail feature?&lt;br/&gt;
Is it OK with everyone if I update the ticket&apos;s tags to reflect this?&lt;/p&gt;</comment>
                            <comment id="278232" author="gerrit" created="Thu, 27 Aug 2020 19:56:16 +0000"  >&lt;p&gt;Chris Horn (chris.horn@hpe.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/39747&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/39747&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13883&quot; title=&quot;LNetPrimaryNID assumes lpni for a particular NID will not change through discovery but lnet_peer_add_nid() may allocate a new one for it.&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13883&quot;&gt;&lt;del&gt;LU-13883&lt;/del&gt;&lt;/a&gt; lnet: Lookup lpni after discovery&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 07f38014d1aebcff351fca6df358825ba045087e&lt;/p&gt;</comment>
                            <comment id="294467" author="gerrit" created="Wed, 10 Mar 2021 08:01:59 +0000"  >&lt;p&gt;Oleg Drokin (green@whamcloud.com) merged in patch &lt;a href=&quot;https://review.whamcloud.com/39747/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/39747/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13883&quot; title=&quot;LNetPrimaryNID assumes lpni for a particular NID will not change through discovery but lnet_peer_add_nid() may allocate a new one for it.&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13883&quot;&gt;&lt;del&gt;LU-13883&lt;/del&gt;&lt;/a&gt; lnet: Lookup lpni after discovery&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 584d9e46053234d02a3290822317552785e44e76&lt;/p&gt;</comment>
                            <comment id="294614" author="pjones" created="Thu, 11 Mar 2021 05:04:03 +0000"  >&lt;p&gt;Landed for 2.15&lt;/p&gt;</comment>
                            <comment id="299508" author="gerrit" created="Thu, 22 Apr 2021 16:57:10 +0000"  >&lt;p&gt;Etienne AUJAMES (eaujames@ddn.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/43413&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/43413&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13883&quot; title=&quot;LNetPrimaryNID assumes lpni for a particular NID will not change through discovery but lnet_peer_add_nid() may allocate a new one for it.&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13883&quot;&gt;&lt;del&gt;LU-13883&lt;/del&gt;&lt;/a&gt; lnet: Lookup lpni after discovery&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_12&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 11b9f06751f9e9cfa1cc2e568b2e5d9904592adc&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i017af:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>