<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:19:11 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-15539] clients report mds_mds_connection in connect_flags after lustre update on servers</title>
                <link>https://jira.whamcloud.com/browse/LU-15539</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;We upgraded a lustre server cluster from lustre-2.12.7_2.llnl to lustre-2.12.8_6.llnl.&#160;&lt;/p&gt;

&lt;p&gt;The node on which the MGS runs, copper1, began reporting &quot;new MDS connections&quot; from NIDs that are assigned to client nodes:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Lustre: MGS: Received new MDS connection from 192.168.128.68@o2ib38, keep former export from same NID
Lustre: MGS: Received new MDS connection from 192.168.128.8@o2ib42, keep former export from same NID
Lustre: MGS: Received new MDS connection from 192.168.131.78@o2ib39, keep former export from same NID
Lustre: MGS: Received new MDS connection from 192.168.132.204@o2ib39, keep former export from same NID
Lustre: MGS: Received new MDS connection from 192.168.134.127@o2ib27, keep former export from same NID
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The clients&apos; connect_flags include &quot;mds_mds_connection&quot;:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@quartz7:lustre]# head */*/connect_flags
==&amp;gt; mgc/MGC172.19.3.1@o2ib600/connect_flags &amp;lt;==
flags=0x2000011005002020
flags2=0x0
version
barrier
adaptive_timeouts
mds_mds_connection
full20
imp_recov
bulk_mbits
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
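&lt;p&gt;For reference, the flags value can be decoded by hand. A minimal sketch, assuming the OBD_CONNECT_* bit positions from the Lustre 2.12 headers (and that OBD_CONNECT_MNE_SWAB is an alias for the OBD_CONNECT_MDS_MDS bit, which is why an unpatched client prints it as &quot;mds_mds_connection&quot;):&lt;/p&gt;

```python
# Sketch: decode a Lustre connect_flags value into flag names.
# Bit positions are taken from the OBD_CONNECT_* defines in the
# Lustre 2.12 headers and should be treated as assumptions here.
FLAG_BITS = {
    5:  "version",            # OBD_CONNECT_VERSION
    13: "barrier",            # OBD_CONNECT_BARRIER
    24: "adaptive_timeouts",  # OBD_CONNECT_AT
    26: "mds_mds_connection", # OBD_CONNECT_MDS_MDS (bit shared with OBD_CONNECT_MNE_SWAB)
    36: "full20",             # OBD_CONNECT_FULL20
    40: "imp_recov",          # OBD_CONNECT_IMP_RECOV
    61: "bulk_mbits",         # OBD_CONNECT_BULK_MBITS
}

def decode(flags):
    # "(flags >> bit) % 2" tests whether a single bit is set
    return [name for bit, name in sorted(FLAG_BITS.items()) if (flags >> bit) % 2]

print(decode(0x2000011005002020))
```

&lt;p&gt;Run against the flags value shown above, this reproduces the seven flag names the client prints.&lt;/p&gt;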
&lt;p&gt;The clients are running lustre-2.12.7_2.llnl, which does not have &quot;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13356&quot; title=&quot;lctl conf_param hung on the MGS node&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13356&quot;&gt;&lt;del&gt;LU-13356&lt;/del&gt;&lt;/a&gt; client: don&apos;t use OBD_CONNECT_MNE_SWAB&quot;.&lt;/p&gt;

&lt;p&gt;Shutting down the servers and restoring them to lustre-2.12.7_2.llnl did not change the symptoms.&lt;/p&gt;

&lt;p&gt;Patch stacks are:&lt;br/&gt;
&lt;a href=&quot;https://github.com/LLNL/lustre/releases/tag/2.12.8_6.llnl&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://github.com/LLNL/lustre/releases/tag/2.12.8_6.llnl&lt;/a&gt;&lt;br/&gt;
&lt;a href=&quot;https://github.com/LLNL/lustre/releases/tag/2.12.7_2.llnl&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://github.com/LLNL/lustre/releases/tag/2.12.7_2.llnl&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Seen during the same lustre server update where we saw &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-15541&quot; title=&quot;Soft lockups in LNetPrimaryNID() and lnet_discover_peer_locked()&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-15541&quot;&gt;LU-15541&lt;/a&gt; but appears to be a separate issue&lt;/p&gt;</description>
                <environment>lustre-2.12.8_6.llnl&lt;br/&gt;
3.10.0-1160.53.1.1chaos.ch6.x86_64&lt;br/&gt;
RHEL7.9&lt;br/&gt;
zfs-0.7.11-9.8llnl</environment>
        <key id="68591">LU-15539</key>
            <summary>clients report mds_mds_connection in connect_flags after lustre update on servers</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="laisiyao">Lai Siyao</assignee>
                                    <reporter username="ofaaland">Olaf Faaland</reporter>
                        <labels>
                            <label>llnl</label>
                    </labels>
                <created>Wed, 9 Feb 2022 02:08:19 +0000</created>
                <updated>Fri, 1 Apr 2022 22:50:48 +0000</updated>
                            <resolved>Mon, 21 Mar 2022 20:05:38 +0000</resolved>
                                    <version>Lustre 2.12.8</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>6</watches>
                                                                            <comments>
                            <comment id="325654" author="ofaaland" created="Wed, 9 Feb 2022 02:10:20 +0000"  >&lt;p&gt;For my records, my local ticket is TOSS5543&lt;/p&gt;</comment>
                            <comment id="325655" author="ofaaland" created="Wed, 9 Feb 2022 02:14:47 +0000"  >&lt;p&gt;When we attempted to revert to lustre-2.12.7_2.llnl after observing the issue, we did &lt;em&gt;not&lt;/em&gt; restore any server-side data; we just powered the servers down, booted them into the image with lustre-2.12.7_2.llnl, imported pools, and mounted the Lustre targets.&lt;/p&gt;

&lt;p&gt;After doing this, we continued to see the same symptoms.&lt;/p&gt;

&lt;p&gt;The clients which were still mounted (about 4,000) still had the incorrect connect flag from the earlier connection.  This was &lt;em&gt;not&lt;/em&gt; reset when they reconnected (or attempted to reconnect).  I believe this connect flag not being reset may be why the symptoms did not change after the reboot.&lt;/p&gt;</comment>
                            <comment id="325656" author="ofaaland" created="Wed, 9 Feb 2022 02:16:08 +0000"  >&lt;p&gt;Perhaps, does &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-15453&quot; class=&quot;external-link&quot; rel=&quot;nofollow&quot;&gt;https://jira.whamcloud.com/browse/LU-15453&lt;/a&gt; need to be rolled out to the clients before it is rolled out to the servers?&lt;/p&gt;</comment>
                            <comment id="325657" author="ofaaland" created="Wed, 9 Feb 2022 02:23:25 +0000"  >&lt;p&gt;I am going to force-umount all the clients, before I try bringing the file system up on lustre-2.12.7_2.llnl, because I don&apos;t see another way to force the connect flags on the clients to change.&lt;/p&gt;

&lt;p&gt;I don&apos;t see evidence that this flag is recorded in the config logs somehow, and so I believe that a force umount on the client and then a mount of the file system running lustre-2.12.7_2.llnl will succeed in removing the mds_mds_connection flag.  But if I&apos;m wrong about that, please let me know ASAP.&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;</comment>
                            <comment id="325660" author="ofaaland" created="Wed, 9 Feb 2022 03:10:28 +0000"  >&lt;p&gt;Most of these clients mount one or more other Lustre file systems (whose servers are running lustre-2.12.7_2.llnl and have not yet been upgraded to 2.12.8_6.llnl).  On the example node I&apos;m looking at, &quot;lustre1&quot; is the file system we tried to update, and &quot;lustre2&quot; is the other one which stayed at 2.12.7_2.llnl the whole time.&lt;/p&gt;

&lt;p&gt;On this client node, the connect_flags for the other file system, &quot;lustre2&quot;, &lt;em&gt;also&lt;/em&gt; shows &quot;mds_mds_connection&quot;. &lt;/p&gt;

&lt;p&gt;Will I need to umount lustre2 and remount it to get the client to forget that connection flag?&lt;/p&gt;

&lt;p&gt;(There are two separate MGSs - one in the cluster that hosts &quot;lustre1&quot;, and one in the cluster that hosts &quot;lustre2&quot;, and each MGS knows only about the file system in the cluster it lives in.  So I didn&apos;t expect this.)&lt;/p&gt;</comment>
                            <comment id="325661" author="paf0186" created="Wed, 9 Feb 2022 03:11:45 +0000"  >&lt;p&gt;Olaf,&lt;/p&gt;

&lt;p&gt;Yes, I believe that should work. &#160;(Re: the unmount plan)&lt;/p&gt;</comment>
                            <comment id="325662" author="ofaaland" created="Wed, 9 Feb 2022 03:23:28 +0000"  >&lt;p&gt;Thanks, Patrick.&lt;/p&gt;</comment>
                            <comment id="325663" author="ofaaland" created="Wed, 9 Feb 2022 03:37:07 +0000"  >&lt;p&gt;Patrick, I&apos;d like to clarify which actions I&apos;ve mentioned are still uncertain/need to be looked into, and which I should proceed with.  So please confirm:&lt;/p&gt;

&lt;p&gt;(1) I should &quot;umount -f /p/lustre1&quot; (the file system that was briefly at 2.12.8_6.llnl) from every client that has it mounted?&lt;br/&gt;
(2) Do I need to umount /p/lustre2 at the same time?  &lt;br/&gt;
(3) Since /p/lustre2 has been up the whole time, has jobs running against it, and was always at the &quot;good&quot; version, is there a less disruptive way you can think of?&lt;/p&gt;

&lt;p&gt;thanks&lt;/p&gt;</comment>
                            <comment id="325664" author="laisiyao" created="Wed, 9 Feb 2022 03:37:08 +0000"  >&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Perhaps, does https://jira.whamcloud.com/browse/LU-15453 need to be rolled out to the clients before it is rolled out to the servers?
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Do you mean &lt;a href=&quot;https://review.whamcloud.com/37880/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/37880/&lt;/a&gt; from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13356&quot; title=&quot;lctl conf_param hung on the MGS node&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13356&quot;&gt;&lt;del&gt;LU-13356&lt;/del&gt;&lt;/a&gt;? If so, yes, it should be rolled out to clients.&lt;/p&gt;</comment>
                            <comment id="325665" author="ofaaland" created="Wed, 9 Feb 2022 03:51:42 +0000"  >&lt;p&gt;Thank you, Lai&lt;/p&gt;</comment>
                            <comment id="325667" author="paf0186" created="Wed, 9 Feb 2022 04:22:04 +0000"  >&lt;p&gt;Olaf,&lt;/p&gt;

&lt;p&gt;I would think (1) should be sufficient - the issue is an overlapping connect flag so servers without the new one should be unaffected.&lt;/p&gt;</comment>
                            <comment id="325668" author="ofaaland" created="Wed, 9 Feb 2022 04:36:17 +0000"  >&lt;p&gt;OK, thanks Patrick&lt;/p&gt;</comment>
                            <comment id="325671" author="ofaaland" created="Wed, 9 Feb 2022 05:37:11 +0000"  >&lt;p&gt;The mds_mds_connection flag is appearing in connect_flags on a client running lustre-2.12.7_2.llnl, which does not have patch 37880, on which I&apos;ve umounted all lustre file systems, stopped lnet, verified that libcfs is not loaded (nor any other dependent modules), started lnet, and mounted lustre2, whose servers are also all running lustre-2.12.7_2.llnl.&lt;/p&gt;

&lt;p&gt;This is incorrect, right?&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@zinci:~]# pdsh -a rpm -q lustre | dshbak -c
----------------
ezinc[1-52]
----------------
lustre-2.12.7_2.llnl-2.ch6.x86_64

[root@zinci:~]# pdsh -w e1 lctl list_nids
e1: 172.19.3.1@o2ib600
[root@zinci:~]# pdsh -w e1 df -h -t lustre
e1: Filesystem      Size  Used Avail Use% Mounted on
e1: zinc1/mgs       2.7T   21M  2.7T   1% /mnt/lustre/MGS
e1: zinc1/mdt1      2.8T  183G  2.7T   7% /mnt/lustre/lsh-MDT0000
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;and&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@quartz187:~]# pdsh -w e10 umount -a -t lustre
[root@quartz187:~]# pdsh -w e10 systemctl stop lnet
[root@quartz187:~]# pdsh -w e10 lsmod | grep libcfs
[root@quartz187:~]# pdsh -w e10 systemctl start lnet
[root@quartz187:~]# pdsh -w e10 mount -T /etc/fstab.d/ /p/lustre2
[root@quartz187:~]# pdsh -w e10 cat /sys/kernel/debug/lustre/mgc/MGC172.19.3.1@o2ib600/connect_flags
e10: flags=0x2000011005002020
e10: flags2=0x0
e10: version
e10: barrier
e10: adaptive_timeouts
e10: mds_mds_connection
e10: full20
e10: imp_recov
e10: bulk_mbits
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="325676" author="laisiyao" created="Wed, 9 Feb 2022 07:23:19 +0000"  >&lt;p&gt;It looks correct to see this in 2.12.7, though it means MNE_SWAB here. After you land 37880, it should be gone.&lt;/p&gt;</comment>
                            <comment id="325746" author="ofaaland" created="Wed, 9 Feb 2022 19:33:31 +0000"  >&lt;p&gt;Lai, it seems as if the soft lockups in the description above are an LNet issue unrelated to the flag.&#160; Do you agree?&#160; If so, I&apos;ll create a separate Jira issue for it.&lt;/p&gt;</comment>
                            <comment id="325785" author="laisiyao" created="Thu, 10 Feb 2022 01:43:57 +0000"  >&lt;p&gt;Yes, it looks so.&lt;/p&gt;</comment>
                            <comment id="325791" author="ofaaland" created="Thu, 10 Feb 2022 02:46:20 +0000"  >&lt;p&gt;We umounted all the clients while the server cluster was down, then brought the server cluster back up in 2.12.7_2.llnl.&#160; We did not see the &quot;Received new MDS connection&quot; messages on bringup.&lt;/p&gt;

&lt;p&gt;We are proceeding with client cluster updates, and will update server clusters in about 2 weeks.&lt;/p&gt;</comment>
                            <comment id="329790" author="ofaaland" created="Mon, 21 Mar 2022 20:03:45 +0000"  >&lt;p&gt;In retrospect, the issue that kept the lustre file system from coming up was the LNet issue documented in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-15541&quot; class=&quot;external-link&quot; rel=&quot;nofollow&quot;&gt;https://jira.whamcloud.com/browse/LU-15541&lt;/a&gt;, so I reduced this issue&apos;s priority to &quot;Minor&quot;.&lt;/p&gt;</comment>
                            <comment id="329791" author="ofaaland" created="Mon, 21 Mar 2022 20:05:38 +0000"  >&lt;p&gt;Updated all the clients to 2.12.8_6.llnl (or later) with patch &quot;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13356&quot; title=&quot;lctl conf_param hung on the MGS node&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13356&quot;&gt;&lt;del&gt;LU-13356&lt;/del&gt;&lt;/a&gt; client: don&apos;t use OBD_CONNECT_MNE_SWAB&quot;.&lt;br/&gt;
Then updated the servers to 2.12.8_6.llnl (or later) with that patch.&lt;br/&gt;
No longer seeing inappropriate &quot;Received new MDS connection&quot; messages on bringup.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="58344">LU-13356</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i02hvr:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>