<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:34:58 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-17379] try MGS NIDs more quickly at initial mount</title>
                <link>https://jira.whamcloud.com/browse/LU-17379</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;The MGC should try all of the MGS NIDs provided on the command line fairly rapidly the &lt;b&gt;first&lt;/b&gt; time after mount (without RPC retry) so that the client can detect the case of the MGS running on a backup node quickly.  Otherwise, if the mount command-line has many NIDs (4 is very typical, but may be 8 or even 16 in some cases) then the mount can be stuck for several minutes trying to find the MGS running on the right node.&lt;/p&gt;

&lt;p&gt;The MGC should preferably try one NID per node first, then the second NID on each node, and so on.  This can be done efficiently, assuming the NIDs are provided properly as colon-separated &lt;tt&gt;&amp;lt;MGSNODE&amp;gt;:&amp;lt;MGSNODE&amp;gt;&lt;/tt&gt; blocks, with comma-separated &quot;&lt;tt&gt;&amp;lt;MGSNID&amp;gt;,&amp;lt;MGSNID&amp;gt;&lt;/tt&gt;&quot; entries within each block.&lt;/p&gt;

&lt;p&gt;However, the NIDs are often not listed correctly with &quot;&lt;tt&gt;:&lt;/tt&gt;&quot; separators, so if there are more than 4 MGS NIDs on the command line, then it would be better to do a &quot;bisection&quot; of the NIDs to best handle the case of 2/4/8 interfaces per node vs. 2/4/8 separate nodes.  For example, try NIDs in order &lt;tt&gt;0, nids/2, nids/4, nids*3/4&lt;/tt&gt; for 4 NIDs, then &lt;tt&gt;nids/8, nids*5/8, nids*3/8, nids*7/8&lt;/tt&gt; for 8 NIDs, then &lt;tt&gt;nids/16, nids*9/16, nids*5/16, nids*13/16, nids*3/16, nids*11/16, nids*7/16, nids*15/16&lt;/tt&gt; for 16 NIDs, and similarly for 32 NIDs (the maximum).&lt;/p&gt;

&lt;p&gt;It should be fairly quick to determine if the MGS is &lt;b&gt;not&lt;/b&gt; responding on a particular NID, because the client will get a rapid error response (e.g. &lt;tt&gt;-ENODEV&lt;/tt&gt; or &lt;tt&gt;-ENOTCONN&lt;/tt&gt; or &lt;tt&gt;-EHOSTUNREACH&lt;/tt&gt; with a short RPC timeout) so in that case it should try all of the NIDs once quickly.  If it gets &lt;tt&gt;ETIMEDOUT&lt;/tt&gt; that might mean the node is unavailable or overloaded, or it might mean the MGS is not running yet, so the client should back off and retry the NIDs with a longer timeout after the initial burst.  &lt;/p&gt;

&lt;p&gt;However, in most cases the MGS should be running on &lt;em&gt;some&lt;/em&gt; node and it just needs to avoid going into slow &quot;backoff&quot; mode until &lt;b&gt;after&lt;/b&gt; it has tried all of the NIDs at least once.&lt;/p&gt;

&lt;p&gt;It would make sense in this case to quiet the &quot;connecting to MGS not running on this node&quot; message for MGS connections so that it doesn&apos;t spam the console. &lt;/p&gt;</description>
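The bisection ordering described above is the classic bit-reversal sequence over the NID indices: attempt k tries the NID whose index is k with its bits reversed. A minimal sketch of that mapping (a hypothetical helper, not Lustre code), assuming the NID count n is a nonzero power of two:

```c
/* Map attempt number k to a NID index by reversing the bits of k
 * across log2(n) positions, so successive attempts land as far as
 * possible from already-tried NIDs: 0 maps to 0, 1 maps to n/2,
 * 2 maps to n/4, 3 maps to n*3/4, and so on.
 * Assumes n is a nonzero power of two. */
static unsigned int bisect_index(unsigned int k, unsigned int n)
{
    unsigned int rev = 0;

    for (unsigned int width = 1; width != n; width *= 2) {
        rev = rev * 2 + k % 2;
        k /= 2;
    }
    return rev;
}
```

For n = 8 this yields the attempt order 0, 4, 2, 6, 1, 5, 3, 7, matching the fractions 0, 1/2, 1/4, 3/4, 1/8, 5/8, 3/8, 7/8 given in the description.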
                <environment></environment>
        <key id="79660">LU-17379</key>
            <summary>try MGS NIDs more quickly at initial mount</summary>
                <type id="7" iconUrl="https://jira.whamcloud.com/images/icons/issuetypes/task_agile.png">Technical task</type>
                            <parent id="75582">LU-16738</parent>
                                    <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="3" iconUrl="https://jira.whamcloud.com/images/icons/statuses/inprogress.png" description="This issue is being actively worked on at the moment by the assignee.">In Progress</status>
                    <statusCategory id="4" key="indeterminate" colorName="inprogress"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="tappro">Mikhail Pershin</assignee>
                                    <reporter username="adilger">Andreas Dilger</reporter>
                        <labels>
                    </labels>
                <created>Tue, 19 Dec 2023 22:57:44 +0000</created>
                <updated>Wed, 7 Feb 2024 19:57:05 +0000</updated>
                                            <version>Lustre 2.14.0</version>
                    <version>Lustre 2.15.3</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>8</watches>
                                                                            <comments>
<comment id="400501" author="tappro" created="Mon, 22 Jan 2024 14:00:00 +0000"  >&lt;p&gt;While investigating this more closely I&apos;ve found that it is not as trivial as it seems. First of all, there is a time limit for &lt;tt&gt;mgc_enqueue()&lt;/tt&gt; set inside the function itself:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;  &#160; /* Limit how long we will wait for the enqueue to complete */
&#160; &#160; req-&amp;gt;rq_delay_limit = short_limit ? 5 : MGC_ENQUEUE_LIMIT(exp-&amp;gt;exp_obd);&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;and it is the main reason why only a limited number of NIDs are checked - that is how many attempts fit within that limit. Note that the limit is only enough time to find the MGS on the &lt;b&gt;second&lt;/b&gt; node; basically that is all we can guarantee. I suppose at the time it was introduced we were not expecting 4 failover nodes for an MGS, let alone 16.&lt;/p&gt;

&lt;p&gt;Interestingly, with the patch from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-17357&quot; title=&quot;Client can use incorrect sec flavor when MGT is relocated&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-17357&quot;&gt;LU-17357&lt;/a&gt; we would always wait for the sptlrpc config with a timeout long enough to scan all MGS nodes, regardless of the send-limit values above. That in turn means there is no need to set that limit to shorter values, as doing so only causes more frequent re-enqueue attempts.&lt;/p&gt;

&lt;p&gt;So the problem of exiting without scanning all MGS nodes should be resolved by that sufficiently long wait for the sptlrpc config, and this ticket is now targeted more at reducing the total amount of that waiting time.&lt;/p&gt;

&lt;p&gt;A first important note about NID scanning: each MGS node separated by &apos;:&apos; is added to the connection list of the import and can be managed in &lt;tt&gt;import_select_connection()&lt;/tt&gt; in any way we choose, either time-based as now, by the bisection proposed by Andreas, or by any other scheme. Meanwhile, the MGS NIDs within a single node separated by &apos;,&apos; are out of reach, because they are added to the connection at LNET level and only LNET manages them; they are not even visible via &lt;tt&gt;lctl get_param mgc.*.import&lt;/tt&gt;, unlike connections. That makes it a non-trivial task to try just the first NID on each node, then the second, and so on. Basically it means &lt;tt&gt;import_select_connection()&lt;/tt&gt; would need to be able to tell LNET to try just a single NID out of many.&lt;/p&gt;

&lt;p&gt;Another way to improve: we can reduce the timeout for the first round of connection attempts based on the number of entries in the connection list. Now it is always 5s for the first attempt. It could be reduced to a smaller value when there are many connections, linearly or by other means, though it probably makes no sense to use a value less than &lt;tt&gt;obd_get_at_min&lt;/tt&gt;.&lt;/p&gt;

&lt;p&gt;As for the current working scheme of &lt;tt&gt;import_select_connection()&lt;/tt&gt;: it uses the &lt;tt&gt;last_attempt&lt;/tt&gt; time per connection to choose the least recently used one when there is no other preference. For any new connection the &lt;tt&gt;last_attempt&lt;/tt&gt; value is 0, and such a connection is used preferentially. That means the first round is linear: it goes from the head of the connection list to its end, picks a connection with &lt;tt&gt;last_attempt&lt;/tt&gt; of &lt;tt&gt;0&lt;/tt&gt; and tries it, then the next one in the list is tried as it also has &lt;tt&gt;0&lt;/tt&gt;, and so on.&lt;/p&gt;

&lt;p&gt;So the proposed bisection approach (or any other) can be applied to the connections still having &lt;tt&gt;0&lt;/tt&gt; (not yet tried) while any remain in the list.&lt;/p&gt;

&lt;p&gt;When all have non-zero &lt;tt&gt;last_attempt&lt;/tt&gt; values we again have options: use the least recent one (which roughly preserves the initial connection order, not exactly, due to 1s time granularity, but still), or think about other approaches, e.g. try just &lt;tt&gt;imp_last_success_conn&lt;/tt&gt; if it exists, or maybe remember the primary nodes and always try them first on each new round, as HA would try to fail back to those nodes after all.&lt;/p&gt;

</comment>
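The least-recently-used selection described in the comment above can be sketched as follows. This is a simplified model over a plain array, assuming last_attempt is a seconds timestamp with 0 meaning "never tried"; it is not the actual import_select_connection() code:

```c
struct conn {
    long last_attempt;   /* seconds; 0 means this connection was never tried */
};

/* Pick the entry with the smallest last_attempt.  Untried connections
 * (value 0) always win and, because the comparison is strict, they are
 * taken in list order; once everything has been tried, the least
 * recently attempted connection is chosen. */
static int select_connection(const struct conn *conns, int n)
{
    int best = 0;

    for (int i = 1; i != n; i++)
        if (conns[best].last_attempt > conns[i].last_attempt)
            best = i;
    return best;
}
```

This reproduces the behaviour described above: the first round walks the list head to tail, and later rounds fall back to least-recently-used.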
                            <comment id="400880" author="adilger" created="Tue, 23 Jan 2024 20:13:36 +0000"  >&lt;p&gt;Mike, thanks for digging into the details here.&lt;/p&gt;

&lt;p&gt;It looks like in &lt;tt&gt;import_select_connection()&lt;/tt&gt; the client could try to &lt;em&gt;connect&lt;/em&gt; to the different NIDs to see which one is alive, rather than waiting on each one separately?  That could potentially be done &quot;semi-parallel&quot;, like:&lt;/p&gt;
&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;send connect to first NID with 30s or longer RPC timeout (in case server is busy)&lt;/li&gt;
	&lt;li&gt;wait 5s for reply, check if any NID has connected&lt;/li&gt;
	&lt;li&gt;if no connect yet, send to next NID&lt;/li&gt;
	&lt;li&gt;when connect is completed to some NID, set flag in export to indicate no more connections to be tried&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;We would probably want to quiet the &quot;initial connect&quot; messages from the clients, maybe with &lt;tt&gt;MSG_CONNECT_INITIAL&lt;/tt&gt; so that they don&apos;t spam the server logs when trying all the servers with &quot;&lt;tt&gt;LustreError: 137-5: lfs00-MDT0005_UUID: not available for connect from 10.89.104.111@tcp (no target)&lt;/tt&gt;&quot; when all of the clients are trying to connect to different servers.&lt;/p&gt;</comment>
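The staggered "semi-parallel" connect idea in the comment above can be modelled with a toy function that, given per-NID reply latencies, reports which attempt would win. The interval parameter and the -1 "never replies" convention are illustrative assumptions, not Lustre code:

```c
/* Toy model of the semi-parallel connect: a new connect attempt is
 * launched every `interval` seconds while earlier attempts stay
 * outstanding, and the winner is whichever attempt's reply arrives
 * first in absolute time.  latency[i] == -1 means NID i never replies.
 * Returns the winning NID index, or -1 if nothing ever connects. */
static int first_to_connect(const int *latency, int n, int interval)
{
    int best = -1;
    int best_time = 0;

    for (int i = 0; i != n; i++) {
        if (latency[i] == -1)
            continue;
        int t = i * interval + latency[i];   /* launch time + reply latency */
        if (best == -1 || best_time > t) {
            best = i;
            best_time = t;
        }
    }
    return best;
}
```

For example, with a dead first NID the second attempt (launched 5s later) wins as soon as its reply arrives, without waiting out the first attempt's full RPC timeout.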
<comment id="402380" author="tappro" created="Fri, 2 Feb 2024 14:30:21 +0000"  >&lt;p&gt;So far the most troublesome case is when some NIDs in the NID list are unavailable, e.g. a bad address or a missing network:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# time mount -t lustre 192.168.56.11@tcp:192.168.6.12@tcp:192.168.56.13@tcp:192.168.56.14@tcp:192.168.56.101@tcp:/lustre /mnt/lustre

real &#160; &#160;0m59.289s
user &#160; &#160;0m0.003s
sys &#160; &#160;0m0.051s
 &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Note the second node is &lt;tt&gt;192.168.6.12&lt;/tt&gt;, which is on a network with no interface on the current node. If all addresses are on the x.x.56.x network then the mount takes about 12s to reach the correct address, which is the last one in the list. So, as expected, it takes about 4s per node to try.&lt;/p&gt;

&lt;p&gt;But the situation gets bad if any NID is from a missing network.&#160; The request is expired by the ptlrpc timeout as expected in about 10s, in accordance with its deadline:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000100:00100000:2.0:1706879770.016570:0:6197:0:(client.c:2337:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1706879760/real 0] &#160;req@ffff8800a5fb93c0 x1789792650076096/t0(0) o250-&amp;gt;MGC192.168.56.11@tcp@192.168.6.12@tcp:26/25 lens 520/544 e 0 to 1 dl 1706879770 ref 2 fl Rpc:XNr/200/ffffffff rc 0/-1 job:&apos;kworker.0&apos; uid:0 gid:0&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;but that has no further effect, as the request starts waiting for LNet to unlink it, which happens only after about 40s:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000100:00000040:1.0:1706879811.687504:0:6197:0:(lustre_net.h:2443:ptlrpc_rqphase_move()) @@@ move request phase from UnregRPC to Rpc &#160;req@ffff8800a5fb93c0 x1789792650076096/t0(0) o250-&amp;gt;MGC192.168.56.11@tcp@192.168.6.12@tcp:26/25 lens 520/544 e 0 to 1 dl 1706879770 ref 1 fl UnregRPC:EeXNQU/200/ffffffff rc -110/-1 job:&apos;kworker.0&apos; uid:0 gid:0&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;And the reason for this behavior is LNet peer discovery, which was started when the RPC was sent:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000100:00000200:2.0:1706879760.019547:0:6197:0:(niobuf.c:86:ptl_send_buf()) Sending 520 bytes to portal 26, xid 1789792650076096, offset 0
00000400:00000200:2.0:1706879760.019570:0:6197:0:(lib-move.c:5284:LNetPut()) LNetPut -&amp;gt; 12345-192.168.6.12@tcp
00000400:00000200:2.0:1706879760.019605:0:6197:0:(peer.c:2385:lnet_peer_queue_for_discovery()) Queue peer 192.168.6.12@tcp: -114&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;it fails and exits, but discovery continues in the background. And this RPC can proceed only when discovery has timed out, regardless of all RPC deadlines and timeouts:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000400:00000200:1.0:1706879811.687440:0:6196:0:(peer.c:3061:lnet_discovery_event_handler()) Received event: 6
00000400:00000200:1.0:1706879811.687442:0:6196:0:(peer.c:2385:lnet_peer_queue_for_discovery()) Queue peer 192.168.6.12@tcp: -114
00000400:00000010:1.0:1706879811.687443:0:6196:0:(api-ni.c:1821:lnet_ping_buffer_free()) kfreed &apos;pbuf&apos;: 281 at ffff88011151fba8.
00000400:00000200:1.0:1706879811.687447:0:6196:0:(peer.c:3717:lnet_peer_ping_failed()) peer 192.168.6.12@tcp:-110
00000400:00000200:1.0:1706879811.687449:0:6196:0:(peer.c:4092:lnet_peer_discovery()) peer 192.168.6.12@tcp(ffff8800a5ea06a8) state 0x102060 rc -110
00000400:00000200:1.0:1706879811.687450:0:6196:0:(peer.c:2401:lnet_peer_discovery_complete()) Discovery complete. Dequeue peer 192.168.6.12@tcp&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;So in general we need some way either to prevent peer discovery for particular RPCs at least (and &lt;tt&gt;lnetctl set discovery 0&lt;/tt&gt; doesn&apos;t work), or to keep the discovery timeout no longer than the RPC deadline. Also, we could consider forcing discovery to stop when the request has expired; there is no need to discover anything at that point, we are just waiting for nothing.&#160;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=ssmirnov&quot; class=&quot;user-hover&quot; rel=&quot;ssmirnov&quot;&gt;ssmirnov&lt;/a&gt; , can you assist with that and propose possible solutions maybe?&lt;/p&gt;

</comment>
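The effect described in this comment, where an RPC with a 10s deadline is held until the roughly 40s discovery timeout elapses, amounts to the observed delay being the larger of the two timeouts. A trivial model of that relationship (illustrative numbers, not Lustre constants):

```c
/* The RPC expires at its own deadline, but ptlrpc cannot finalize it
 * until LNet discovery releases the MD, so the delay an admin actually
 * observes is whichever of the two timeouts is larger. */
static int observed_rpc_delay(int rpc_deadline, int discovery_timeout)
{
    if (discovery_timeout > rpc_deadline)
        return discovery_timeout;
    return rpc_deadline;
}
```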
<comment id="402478" author="ssmirnov" created="Fri, 2 Feb 2024 23:46:59 +0000"  >&lt;p&gt;Yes, I can see the same even with just two NIDs in the mount command, of which the first is unreachable.&#160;&lt;/p&gt;

&lt;p&gt;The default lnet_transaction_timeout of 150 is long enough in this case to make the mount fail. Reducing lnet_transaction_timeout to 30 allows the mount to succeed. We&apos;re going to be hearing about this from the field, I&apos;m sure.&lt;/p&gt;

&lt;p&gt;I haven&apos;t tried a mount with lots of &quot;:&quot;-separated NIDs, but it looks like the supplied NIDs are being discovered in the background in parallel. In my test, discovery for the second (reachable) NID completed almost immediately. However, lustre probably just didn&apos;t know it could actually talk to it.&lt;/p&gt;

&lt;p&gt;Aside from timeout manipulation, it seems that lustre could benefit from knowing that the peer is reachable before firing off a request to it. Not sure what would be the best way to accomplish this within the current architecture. Maybe registering some sort of (optional) callback so that after calling LnetAddPeer lustre gets notified that the peer got added successfully, i.e. discovery is done? If there&apos;s a lustre thread waiting on these events and acting on them in the background, this could work. There may be possibly unwanted effects from this: for example, if the first listed server is slower than the second, the second one will be picked for the mount ahead of the first.&lt;/p&gt;

</comment>
<comment id="402501" author="tappro" created="Sat, 3 Feb 2024 14:37:53 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=ssmirnov&quot; class=&quot;user-hover&quot; rel=&quot;ssmirnov&quot;&gt;ssmirnov&lt;/a&gt; , can we avoid this behavior somehow? The current problem is that Lustre gets the RPC failure in 10s but can&apos;t proceed further until LNET discovery times out. That is unexpected behavior; we shouldn&apos;t wait about 40s for an RPC with a 10s deadline. So my question is: is it possible somehow to notify LNET that we don&apos;t need to wait for that particular lnet_libmd? Right now, when a request times out, inside &lt;tt&gt;ptlrpc_unregister_reply()&lt;/tt&gt; we call &lt;tt&gt;LNetMDUnlink()&lt;/tt&gt; for the reply MD, and for the request MD if it is not unlinked yet (which means it has not yet been sent, as in our case). Technically that unlink of the request MD should invoke &lt;tt&gt;request_out_callback()&lt;/tt&gt; and finalize RPC processing, as I&apos;d expect. But in practice that doesn&apos;t work, because the MD is referenced:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000400:00000200:0.0:1706967574.984808:0:30492:0:(lib-md.c:64:lnet_md_unlink()) Queueing unlink of md ffff8800934b2f78&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;and that unlink happens only when discovery is done:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000400:00000200:0.0:1706967618.538695:0:30491:0:(peer.c:3061:lnet_discovery_event_handler()) Received event: 6
00000400:00000200:0.0:1706967618.538696:0:30491:0:(peer.c:2385:lnet_peer_queue_for_discovery()) Queue peer 192.168.6.12@tcp: -114
00000400:00000200:0.0:1706967618.538703:0:30491:0:(peer.c:4092:lnet_peer_discovery()) peer 192.168.6.12@tcp(ffff8800a1fc9dd8) state 0x102060 rc -110
00000400:00000200:0.0:1706967618.538704:0:30491:0:(peer.c:2401:lnet_peer_discovery_complete()) Discovery complete. Dequeue peer 192.168.6.12@tcp
00000400:00000200:0.0:1706967618.538706:0:30491:0:(lib-msg.c:1020:lnet_is_health_check()) msg ffff8800ba545448 not committed for send or receive
00000400:00000200:0.0:1706967618.538706:0:30491:0:(lib-md.c:68:lnet_md_unlink()) Unlinking md ffff8800934b2f7&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;and only now is the RPC finalized. Is there any way to avoid delaying the MD unlink and instead abort sending that MD immediately? I mean purely technically, at the LNET level.&lt;/p&gt;


&lt;p&gt;As for the idea of not even trying peers which are not yet discovered: can we just check the peer status from ptlrpc in some way? Probably we can add a new primitive similar to LNetAddPeer() or LNetDebugPeer(), say LNetDiscoverPeer(), which would return the current discovery status (up to date, discovering, disabled, etc.). In that case we could simply skip undiscovered peers, try first those that have been discovered, and avoid being stuck on dead peers.&lt;/p&gt;</comment>
                            <comment id="402503" author="adilger" created="Sat, 3 Feb 2024 15:32:58 +0000"  >&lt;p&gt;Mike, in some cases the LNet layer is unable to complete the MD and deregister until a timeout finishes, because the RDMA address may have been given out to a remote node.  That said, if the host is unreachable or returns an error immediately then that shouldn&apos;t happen. &lt;/p&gt;

&lt;p&gt;Is it possible for the MGC to send separate RPCs asynchronously, so that it doesn&apos;t care what happens at the LNet layer?  That way, the client can have short timeouts, and try the different MGS NIDs quickly (e.g. a few seconds apart), then wait for the first one to successfully reply.  We do the same with STATFS and GETATTR RPCs on the OST side. &lt;/p&gt;

&lt;p&gt;One important point is to silence the console errors on the servers for this case, so the logs are not spammed with &quot;refusing connect from XXXX for MGS&quot; errors (though they should still be printed for other targets, since that has been very useful for debugging network issues recently). &lt;/p&gt;</comment>
<comment id="402509" author="tappro" created="Sat, 3 Feb 2024 16:02:48 +0000"  >&lt;p&gt;Andreas, I don&apos;t see how that can be done: all RPCs work through an import, and an import uses only one current connection. That is what &lt;tt&gt;import_select_connection()&lt;/tt&gt; does: it chooses just one from many, and then we try to send the RPC over it. For your idea we would need to set up an import for each NID to send RPCs to them in parallel, i.e. instead of one MGC import we would need 16 if there are 16 NIDs, and we would have to organize them somehow, marking only one as real and the others as &apos;potential&apos;. We could think about how to send an RPC with one import but over a particular connection; e.g. pings could be sent over each connection listed in the import to update its status, and the import could use that status while choosing a new connection. But that would need to be implemented from scratch, and it would end up being much the same as LNET peer discovery, just at the ptlrpc level. Right now I think peer discovery already does in the background almost the same thing you describe, so I&apos;d try to get peer info from LNET and use it to choose at least alive peers.&lt;/p&gt;</comment>
                            <comment id="402511" author="simmonsja" created="Sat, 3 Feb 2024 16:09:43 +0000"  >&lt;p&gt;Mikhail what you described is very similar to &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-10360&quot; title=&quot;use Imperative Recovery logs for client-&amp;gt;MDT/OST connections&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-10360&quot;&gt;LU-10360&lt;/a&gt; work.&lt;/p&gt;</comment>
                            <comment id="402513" author="adilger" created="Sat, 3 Feb 2024 17:04:02 +0000"  >&lt;p&gt;Mike, you are right, I wasn&apos;t thinking about this side of things. Doing this in parallel at the LNet level would be better. Can the MGC pass all of the peer NIDs to LNet directly to speed up discovery?  That would require the MGS NIDs to be specified correctly (&quot;,&quot; vs. &quot;:&quot; separators), or possibly have LNet not &quot;trust&quot; the NIDs given on the command-line as all being from the same host. &lt;/p&gt;

&lt;p&gt;James, I don&apos;t see how IR can help with the initial MGS connection, since &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-10360&quot; title=&quot;use Imperative Recovery logs for client-&amp;gt;MDT/OST connections&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-10360&quot;&gt;LU-10360&lt;/a&gt; is all about getting the current target NID(s) &lt;b&gt;from&lt;/b&gt; the MGS.  That would be a chicken-and-egg problem.&lt;/p&gt;</comment>
<comment id="402514" author="ssmirnov" created="Sat, 3 Feb 2024 17:20:32 +0000"  >&lt;p&gt;LNet does appear to be discovering the provided NIDs in parallel: at least in my two-NID &quot;:&quot;-separated test, with the first of the two NIDs unreachable, the second NID was discovered immediately. (This may be different with &quot;,&quot;-separated NIDs.) I don&apos;t know exactly what happens at the lustre layer, but it looks like it needs to wait for confirmation of which NID can be used before trying to establish a (lustre-level) connection. That&apos;s why I was proposing a thread waiting on &quot;discovery complete&quot; events from LNet. To avoid using the provided peers in random order, the thread could wait (for a shorter time) before picking the first available peer ahead of the first listed one. That said, it is definitely possible to add a peer status checker to the LNet API if the lustre layer prefers polling.&lt;/p&gt;</comment>
<comment id="402517" author="adilger" created="Sat, 3 Feb 2024 18:33:46 +0000"  >&lt;p&gt;Ideally, the MGC code could just give the full list of NIDs to ptlrpc and/or LNet along with the RPC, and LNet would sanity-check them in the right place. I&apos;m totally fine with changing the MGC and/or ptlrpc to notify LNet about all of the MGS NIDs in advance of sending the RPC, so that the right layer can handle it best.  In all likelihood we should do the same thing for other connections as well, but those happen in the background and are less noticeable.&lt;/p&gt;

&lt;p&gt;I think it would be better to have the MGC and other connections just wait for the RPC reply, rather than polling. That depends (AFAIK) on LNet knowing all of the possible NIDs to try for the connection, and I don&apos;t &lt;em&gt;think&lt;/em&gt; that happens today. It is currently the MGC code that tries all of the NIDs in sequence, and that seems redundant. Also, getting NID handling out of Lustre and into LNet would be a good thing all around. &lt;/p&gt;</comment>
                            <comment id="402519" author="adilger" created="Sat, 3 Feb 2024 18:47:02 +0000"  >&lt;p&gt;Mike, is this something you could work on?  &lt;/p&gt;

&lt;p&gt;Serguei, can you provide some input on what kind of LNet API you would like for passing multiple NIDs from the MGC down to speed up peer discovery?  There is a risk that the mount command line contains NIDs that are not correctly specified as belonging to the same host, so there would have to be some defense against that. &lt;/p&gt;

&lt;p&gt;I was considering a console error to provide feedback to the admin. However, if we get MGS NIDs from round-robin DNS (per &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-16738&quot; title=&quot;Improve mount.lustre with many MGS NIDs&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-16738&quot;&gt;LU-16738&lt;/a&gt;) then we wouldn&apos;t have any way to distinguish which NIDs belong to the same or different hosts. So if LNet can handle this ambiguity during discovery automatically, then printing a console message is pointless, as is the need to properly separate the NIDs on the mount command line, and we could deprecate the use of &quot;:&quot; to separate NIDs and simplify IPv6 address parsing (though using DNS is probably still better than specifying IP addresses directly). &lt;/p&gt;</comment>
<comment id="402523" author="tappro" created="Sat, 3 Feb 2024 19:57:43 +0000"  >&lt;p&gt;I don&apos;t think we really need to pass NIDs to LNET; it already has all of them added and does discovery in the background when they are added and each time a new RPC is sent, so we can rely on the current discovery status. Note that upon mount all nodes listed in the mount command have just been added to LNET and all are already being discovered.&lt;/p&gt;

&lt;p&gt;Right now I think the easier approach would be to keep the current scheme, where &lt;tt&gt;import_select_connection()&lt;/tt&gt; chooses one connection from many based on the info about it. So far that is just the last_attempt time; we can also get the peer discovery status (as I see it, that is just passing the status up from LNET, which is easy to implement), and it could be useful to remember how often a connection has been attempted and actually used (which would give us stats on how often HA uses that node). To me that looks sufficient to select an alive and previously most-used node to connect to, and it is generic for either the MGC or other imports.&#160;&lt;/p&gt;

&lt;p&gt;Not sure about the other details, but it looks like this can be done incrementally.&lt;/p&gt;</comment>
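The selection policy sketched in this comment, prefer peers that LNET reports as discovered and, among those, the node most used in the past, could look roughly like this. The structure and function names here are hypothetical illustrations, not Lustre code:

```c
struct mgs_conn {
    int discovered;            /* 1 if LNET reports the peer as discovered */
    unsigned int times_used;   /* how often this node was actually connected */
};

/* Prefer discovered peers; among them pick the one most used in the
 * past (a proxy for "where HA usually runs this target").
 * Returns -1 if no peer is discovered yet. */
static int select_alive_connection(const struct mgs_conn *conns, int n)
{
    int best = -1;

    for (int i = 0; i != n; i++) {
        if (conns[i].discovered == 0)
            continue;
        if (best == -1 || conns[i].times_used > conns[best].times_used)
            best = i;
    }
    return best;
}
```

A caller falling back to the existing time-based selection when this returns -1 would keep the current behaviour for the all-undiscovered case.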
                            <comment id="402524" author="ssmirnov" created="Sat, 3 Feb 2024 21:37:13 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=tappro&quot; class=&quot;user-hover&quot; rel=&quot;tappro&quot;&gt;tappro&lt;/a&gt;, I think I can add something like LNetGetPeerStatus which would return current status of the peer provided any of its NIDs. Is that something you could use for starters?&lt;/p&gt;</comment>
                            <comment id="402674" author="tappro" created="Mon, 5 Feb 2024 13:16:22 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=ssmirnov&quot; class=&quot;user-hover&quot; rel=&quot;ssmirnov&quot;&gt;ssmirnov&lt;/a&gt; , yes, that would be helpful&lt;/p&gt;</comment>
                            <comment id="402736" author="gerrit" created="Mon, 5 Feb 2024 20:28:36 +0000"  >&lt;p&gt;&quot;Serguei Smirnov &amp;lt;ssmirnov@whamcloud.com&amp;gt;&quot; uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/c/fs/lustre-release/+/53926&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/c/fs/lustre-release/+/53926&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-17379&quot; title=&quot;try MGS NIDs more quickly at initial mount&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-17379&quot;&gt;LU-17379&lt;/a&gt; lnet: add LNetPeerDiscovered to LNet API&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 0140bec6cfe4bfd25fbf4088c510867daab3ebb7&lt;/p&gt;</comment>
                            <comment id="402745" author="ssmirnov" created="Mon, 5 Feb 2024 21:10:37 +0000"  >&lt;p&gt;LNetPeerDiscovered may be useful in the case of &quot;:&quot;-separated NIDs, but not with &quot;,&quot;-separated NIDs in the mount string.&lt;/p&gt;

&lt;p&gt;As far as I can see, the issue with &quot;,&quot;-separated NIDs is that LNetPrimaryNID is called just once - it initiates discovery using the first listed NID (primary?) as a target, but doesn&apos;t do anything with the knowledge of the non-primary NIDs until the discovery issued to the peer&apos;s primary NID fails.&lt;/p&gt;

&lt;p&gt;I&apos;m going to experiment with modifying LNetPrimaryNID so it may handle this case better, so there may be another patch addressing that.&lt;/p&gt;</comment>
                            <comment id="402777" author="ssmirnov" created="Tue, 6 Feb 2024 00:09:48 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=tappro&quot; class=&quot;user-hover&quot; rel=&quot;tappro&quot;&gt;tappro&lt;/a&gt;&#160;&lt;/p&gt;

&lt;p&gt;Modifying LNetPrimaryNID is going to take a little longer, but this change:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://review.whamcloud.com/#/c/fs/lustre-release/+/53930/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/fs/lustre-release/+/53930/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;may also be useful in your testing if you use socklnd.&lt;/p&gt;</comment>
                            <comment id="402785" author="gerrit" created="Tue, 6 Feb 2024 03:34:06 +0000"  >&lt;p&gt;&quot;Serguei Smirnov &amp;lt;ssmirnov@whamcloud.com&amp;gt;&quot; uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/c/fs/lustre-release/+/53933&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/c/fs/lustre-release/+/53933&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-17379&quot; title=&quot;try MGS NIDs more quickly at initial mount&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-17379&quot;&gt;LU-17379&lt;/a&gt; lnet: parallelize peer discovery via LNetAddPeer&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: a63ead3388b4bcac48f8cf1c9489092af0e92a46&lt;/p&gt;</comment>
                            <comment id="402829" author="gerrit" created="Tue, 6 Feb 2024 10:46:05 +0000"  >&lt;p&gt;&quot;Mikhail Pershin &amp;lt;mpershin@whamcloud.com&amp;gt;&quot; uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/c/fs/lustre-release/+/53937&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/c/fs/lustre-release/+/53937&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-17379&quot; title=&quot;try MGS NIDs more quickly at initial mount&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-17379&quot;&gt;LU-17379&lt;/a&gt; ptlrpc: fix check for callback discard&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: eef6974270086c09c67914257e10732a94db5c9b&lt;/p&gt;</comment>
                            <comment id="402830" author="tappro" created="Tue, 6 Feb 2024 10:53:38 +0000"  >&lt;p&gt;While testing mount with unavailable NIDs, I found that the attempt to invoke the request-out callback while unlinking the reply doesn&apos;t work. The reason is that the check for &lt;tt&gt;rq_reply_unlinked&lt;/tt&gt; is done too early: that flag is set in the reply callback from LNetMDUnlink(), which is called after the discard check. So I made the patch above in the context of this ticket. It doesn&apos;t look like a major issue, and I don&apos;t expect it to have a noticeable outcome, but at least it makes the original idea work.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="79533">LU-17357</issuekey>
        </issuelink>
                            </outwardlinks>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="80384">LU-17476</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="80658">LU-17505</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="75582">LU-16738</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i045ef:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>