<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:39:09 UTC 2024

-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-4043] clients unable to reconnect after OST failover </title>
                <link>https://jira.whamcloud.com/browse/LU-4043</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;BP has been running into an issue on their 2.4.1 system where clients are not able to connect to the OSTs after a failover.&lt;/p&gt;

&lt;p&gt;Looking at the debug logs on the MDS, the problem appears to be that when the OSTs register, both service node NIDs are assigned to a single UUID, which is named after nid[0]. The imperative recovery (IR) code, however, uses the UUID name rather than the NID when creating a connection. As a result, imperative recovery keeps trying to connect to the first service node, even when the MGS tells it to connect to the second.&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;20000000:01000000:0.0:1380683267.395792:0:14496:0:(mgs_handler.c:344:mgs_handle_target_reg()) Server pfs-OST0006 is running on 10.10.160.26@tcp1
...
00000100:00000040:0.0:1380683274.383862:0:14498:0:(lustre_peer.c:200:class_check_uuid()) check if uuid 10.10.160.25@tcp1 has 10.10.160.26@tcp1.
10000000:00000040:0.0:1380683274.383865:0:14498:0:(mgc_request.c:1408:mgc_apply_recover_logs()) Find uuid 10.10.160.25@tcp1 by nid 10.10.160.26@tcp1
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
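
&lt;p&gt;To make the log lines above concrete, here is a minimal Python sketch of the lookup as I understand it. It is a toy model, not Lustre code: find_uuid_by_nid and both tables are hypothetical names made up for illustration. The only assumption is that the client resolves the NID it is given to a UUID and then connects using that UUID&apos;s name.&lt;/p&gt;

```python
# Toy model of the UUID lookup seen in the debug log. Not Lustre code:
# find_uuid_by_nid and the two tables below are hypothetical.

def find_uuid_by_nid(uuid_table, nid):
    """Return the first UUID whose NID list contains nid, or None."""
    for uuid, nids in uuid_table.items():
        if nid in nids:
            return uuid
    return None

# What registration appears to produce today: one UUID, named after the
# first service node NID, holding the NIDs of both service nodes.
shared_uuid = {
    "10.10.160.25@tcp1": ["10.10.160.25@tcp1", "10.10.160.26@tcp1"],
}

# Assumed alternative: one UUID per service node.
per_node_uuid = {
    "10.10.160.25@tcp1": ["10.10.160.25@tcp1"],
    "10.10.160.26@tcp1": ["10.10.160.26@tcp1"],
}

# The MGS reports that pfs-OST0006 is now running on the second node.
active_nid = "10.10.160.26@tcp1"

print(find_uuid_by_nid(shared_uuid, active_nid))
print(find_uuid_by_nid(per_node_uuid, active_nid))
```

&lt;p&gt;With the shared table the lookup returns 10.10.160.25@tcp1, so a connection made by UUID name goes to the first service node. With one UUID per node it returns 10.10.160.26@tcp1.&lt;/p&gt;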

&lt;p&gt;Here is the tunefs.lustre command used to configure the OST:&lt;br/&gt;
tunefs.lustre --erase-params --writeconf --mgsnode=10.10.160.21@tcp1 --mgsnode=10.10.160.22@tcp1 --servicenode=10.10.160.25@tcp1 --servicenode=10.10.160.26@tcp1 --param ost.quota_type=ug /dev/mapper/ost_pfs_6&lt;/p&gt;

&lt;p&gt;Does the import really need to use the UUID, or can it use the NID? Alternatively, should the registration code really be assigning all of the servicenode NIDs to the same UUID?&lt;/p&gt;

&lt;p&gt;This also leads into a concern I have with the fix for &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3445&quot; title=&quot;Specifying multiple networks in NIDs does no longer work&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3445&quot;&gt;&lt;del&gt;LU-3445&lt;/del&gt;&lt;/a&gt;. There doesn&apos;t appear to be any way to distinguish between multiple NIDs for the same node and NIDs for different nodes.&lt;/p&gt;

&lt;p&gt;Also, I&apos;m confused as to why this wasn&apos;t caught in failover testing. Is the servicenode parameter not part of the test suite?&lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
Kit&lt;/p&gt;</description>
                <environment></environment>
        <key id="21228">LU-4043</key>
            <summary>clients unable to reconnect after OST failover </summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="2" iconUrl="https://jira.whamcloud.com/images/icons/priorities/critical.svg">Critical</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="3">Duplicate</resolution>
                                        <assignee username="jay">Jinshan Xiong</assignee>
                                    <reporter username="kitwestneat">Kit Westneat</reporter>
                        <labels>
                    </labels>
                <created>Wed, 2 Oct 2013 03:41:09 +0000</created>
                <updated>Tue, 3 Jun 2014 12:44:49 +0000</updated>
                            <resolved>Thu, 13 Mar 2014 18:40:50 +0000</resolved>
                                    <version>Lustre 2.4.1</version>
                                    <fixVersion>Lustre 2.4.2</fixVersion>
                    <fixVersion>Lustre 2.5.1</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>9</watches>
                                                                            <comments>
                            <comment id="68142" author="kitwestneat" created="Wed, 2 Oct 2013 12:46:29 +0000"  >&lt;p&gt;Here&apos;s the code that adds all the failnodes to a single uuid:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;mgs_write_log_failnids:
...
&lt;span class=&quot;code-keyword&quot;&gt;while&lt;/span&gt; (class_find_param(ptr, PARAM_FAILNODE, &amp;amp;ptr) == 0) {
                &lt;span class=&quot;code-keyword&quot;&gt;while&lt;/span&gt; (class_parse_nid(ptr, &amp;amp;nid, &amp;amp;ptr) == 0) {
                        &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (failnodeuuid == NULL) {
                                /* We don&apos;t know the failover node name,
                                   so just use the first nid as the uuid */
                                rc = name_create(&amp;amp;failnodeuuid,
                                                 libcfs_nid2str(nid), &quot;&quot;);
                                &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (rc)
                                        &lt;span class=&quot;code-keyword&quot;&gt;return&lt;/span&gt; rc;
                        }
                        CDEBUG(D_MGS, &lt;span class=&quot;code-quote&quot;&gt;&quot;add nid %s for failover uuid %s, &quot;&lt;/span&gt;
                               &lt;span class=&quot;code-quote&quot;&gt;&quot;client %s\n&quot;&lt;/span&gt;, libcfs_nid2str(nid),
                               failnodeuuid, cliname);
                        rc = record_add_uuid(env, llh, nid, failnodeuuid);
                }
                &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (failnodeuuid)
                        rc = record_add_conn(env, llh, cliname, failnodeuuid);
        }
...
&lt;/pre&gt;
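&lt;p&gt;The effect of that loop can be sketched in Python. This is a simplified model and not the MGS code: write_failnids, its reset_per_param flag, and the record tuples are hypothetical. I am also assuming that each --servicenode option arrives as its own failnode parameter, with comma-separated NIDs inside one option.&lt;/p&gt;

```python
# Simplified model of the loop in mgs_write_log_failnids. Hypothetical
# names throughout; records are plain tuples, not real llog records.

def write_failnids(failnode_params, reset_per_param=False):
    """Return the (kind, ...) records the loop would emit.

    failnode_params: one string per failnode parameter, each holding
    comma-separated NIDs (assumed format).
    reset_per_param: if True, forget the UUID after each parameter.
    This is an assumed variation, not a confirmed fix.
    """
    records = []
    failnode_uuid = None
    for param in failnode_params:
        for nid in param.split(","):
            if failnode_uuid is None:
                # As in the original: the first NID seen names the UUID.
                failnode_uuid = nid
            records.append(("add_uuid", nid, failnode_uuid))
        if failnode_uuid is not None:
            records.append(("add_conn", failnode_uuid))
        if reset_per_param:
            failnode_uuid = None
    return records

params = ["10.10.160.25@tcp1", "10.10.160.26@tcp1"]

for record in write_failnids(params):
    print("as posted:", record)
for record in write_failnids(params, reset_per_param=True):
    print("with reset:", record)
```

&lt;p&gt;In the model of the loop as posted, both service node NIDs end up under UUID 10.10.160.25@tcp1 and both connection records name that UUID. With the reset, each option gets its own UUID, while NIDs listed together in one option still share one.&lt;/p&gt;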
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="68167" author="kitwestneat" created="Wed, 2 Oct 2013 17:01:40 +0000"  >&lt;p&gt;I tested this with the patch for &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3829&quot; title=&quot;MDT mount fails if mkfs.lustre is run with multiple mgsnode arguments on MDSs where MGS is not running&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3829&quot;&gt;&lt;del&gt;LU-3829&lt;/del&gt;&lt;/a&gt; and it seems to work around the issue, so I think the priority on this ticket can be lowered. IR with service nodes still appears to be broken, though: it still attempts to connect to the first service node rather than the currently active service node.&lt;/p&gt;</comment>
                            <comment id="68175" author="jay" created="Wed, 2 Oct 2013 17:52:44 +0000"  >&lt;p&gt;It looks like the failover NIDs were wrongly added to the first connection.&lt;/p&gt;

&lt;p&gt;Please try patch &lt;a href=&quot;http://review.whamcloud.com/7835&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/7835&lt;/a&gt;. Test it before applying it in production, because I have not tested it myself.&lt;/p&gt;</comment>
                            <comment id="78561" author="schamp" created="Thu, 6 Mar 2014 03:24:02 +0000"  >&lt;p&gt;This bug appears to have been introduced by &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-1301&quot; title=&quot;MGS over OSD&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-1301&quot;&gt;&lt;del&gt;LU-1301&lt;/del&gt;&lt;/a&gt;, commit d9d27cadcab87d242072f3ff9d8fd4625415fc10&lt;/p&gt;

&lt;p&gt;It should be fixed by &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4243&quot; title=&quot;multiple servicenodes or failnids: wrong client llog registration&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4243&quot;&gt;&lt;del&gt;LU-4243&lt;/del&gt;&lt;/a&gt;:&lt;br/&gt;
  b2_4 commit d61cd9a816f8e9ce959b7e32e73ad2871a8ffe7f&lt;br/&gt;
  b2_5 commit 79dd406672d87f8c14a14cfe6a3715d562b8838a&lt;br/&gt;
  master commit f157ebea33edb656b123bb34d4cb06f970bb4bb6&lt;br/&gt;
which appears functionally identical to &lt;a href=&quot;http://review.whamcloud.com/7835&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/7835&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So this can probably be closed.&lt;/p&gt;</comment>
                            <comment id="79270" author="adilger" created="Thu, 13 Mar 2014 18:40:50 +0000"  >&lt;p&gt;This was fixed via &lt;a href=&quot;http://review.whamcloud.com/8372&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/8372&lt;/a&gt; from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4243&quot; title=&quot;multiple servicenodes or failnids: wrong client llog registration&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4243&quot;&gt;&lt;del&gt;LU-4243&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10010">
                    <name>Duplicate</name>
                                            <outwardlinks description="duplicates">
                                        <issuelink>
            <issuekey id="21972">LU-4243</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzw4lj:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>10853</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>