<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:18:27 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-15453] MDT shutdown hangs on mutex_lock, possibly cld_lock</title>
                <link>https://jira.whamcloud.com/browse/LU-15453</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;LNet issues (See &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-15234&quot; title=&quot;LNet high peer reference counts inconsistent with queue&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-15234&quot;&gt;&lt;del&gt;LU-15234&lt;/del&gt;&lt;/a&gt; and &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14026&quot; title=&quot;symptoms of message loss or corruption after upgrading routers to lustre 2.12.5&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14026&quot;&gt;LU-14026&lt;/a&gt;) result in clients and lustre servers reporting via console logs that they lost connection to the MGS.&lt;/p&gt;

&lt;p&gt;We are working on solving the LNet issues, but this may also be revealing error-path issues that should be fixed.&lt;/p&gt;

&lt;p&gt;MDT0, which is usually running on the same server as the MGS, is one of the targets which reports a lost connection (they are separate devices, stored in distinct datasets, started/stopped separately):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;MGC172.19.3.98@o2ib600: Connection to MGS (at 0@lo) was lost &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Attempting to shutdown the MDT hangs, with this stack reported by the watchdog:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt; schedule_preempt_disabled+0x39/0x90
 __mutex_lock_slowpath+0x10f/0x250
 mutex_lock+0x32/0x42
 mgc_process_config+0x21a/0x1420 [mgc]
 obd_process_config.constprop.14+0x75/0x210 [obdclass]
 ? lprocfs_counter_add+0xf9/0x160 [obdclass]
 lustre_end_log+0x1ff/0x550 [obdclass]
 server_put_super+0x82e/0xd00 [obdclass]
 generic_shutdown_super+0x6d/0x110
 kill_anon_super+0x12/0x20
 lustre_kill_super+0x32/0x50 [obdclass]
 deactivate_locked_super+0x4e/0x70
 deactivate_super+0x46/0x60
 cleanup_mnt+0x3f/0x80
 __cleanup_mnt+0x12/0x20
 task_work_run+0xbb/0xf0
 do_notify_resume+0xa5/0xc0
 int_signal+0x12/0x17
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The server was crashed and a dump collected. &#160;The stacks for the umount process and the ll_cfg_requeue process both have pointers to the &quot;ls1-mdtir&quot; config_llog_data structure; I believe cld-&amp;gt;cld_lock is held by ll_cfg_requeue and umount is waiting on it.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;PID: 4504   TASK: ffff8e8c9edc8000  CPU: 24  COMMAND: &quot;ll_cfg_requeue&quot;
 #0 [ffff8e8ac474f970] __schedule at ffffffff9d3b6788
 #1 [ffff8e8ac474f9d8] schedule at ffffffff9d3b6ce9
 #2 [ffff8e8ac474f9e8] schedule_timeout at ffffffff9d3b4528
 #3 [ffff8e8ac474fa98] ldlm_completion_ast at ffffffffc14ac650 [ptlrpc]
 #4 [ffff8e8ac474fb40] ldlm_cli_enqueue_fini at ffffffffc14ae83f [ptlrpc]
 #5 [ffff8e8ac474fbf0] ldlm_cli_enqueue at ffffffffc14b10d1 [ptlrpc]
 #6 [ffff8e8ac474fca8] mgc_enqueue at ffffffffc0fb94cf [mgc]
 #7 [ffff8e8ac474fd70] mgc_process_log at ffffffffc0fbf393 [mgc]
 #8 [ffff8e8ac474fe30] mgc_requeue_thread at ffffffffc0fc1b10 [mgc]
 #9 [ffff8e8ac474fec8] kthread at ffffffff9cccb221
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;I can provide console logs and the crash dump.&#160; I do not have lustre debug logs.&lt;/p&gt;</description>
                <environment>lustre-2.12.7_2.llnl-2.ch6.x86_64&lt;br/&gt;
zfs-0.7.11-9.8llnl.ch6.x86_64&lt;br/&gt;
3.10.0-1160.45.1.1chaos.ch6.x86_64</environment>
        <key id="68047">LU-15453</key>
            <summary>MDT shutdown hangs on mutex_lock, possibly cld_lock</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="tappro">Mikhail Pershin</assignee>
                                    <reporter username="ofaaland">Olaf Faaland</reporter>
                        <labels>
                            <label>llnl</label>
                    </labels>
                <created>Fri, 14 Jan 2022 19:43:49 +0000</created>
                <updated>Fri, 22 Jul 2022 22:42:56 +0000</updated>
                                                                                <due></due>
                            <votes>0</votes>
                                    <watches>7</watches>
                                                                            <comments>
                            <comment id="322828" author="ofaaland" created="Fri, 14 Jan 2022 21:23:41 +0000"  >&lt;p&gt;For my records, my internal ticket is TOSS5512&lt;/p&gt;</comment>
                            <comment id="322928" author="pjones" created="Mon, 17 Jan 2022 17:15:27 +0000"  >&lt;p&gt;Serguei&lt;/p&gt;

&lt;p&gt;Could you please advise&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="322929" author="pjones" created="Mon, 17 Jan 2022 17:20:05 +0000"  >&lt;p&gt;Actually, perhaps Mike is a more appropriate candidate...&lt;/p&gt;</comment>
                            <comment id="322930" author="adilger" created="Mon, 17 Jan 2022 17:20:50 +0000"  >&lt;p&gt;Hi Olaf, could you please attach the stack traces of running processes at the time of the hang (&quot;bt&quot; from the crashdump).&lt;/p&gt;</comment>
                            <comment id="323218" author="ofaaland" created="Wed, 19 Jan 2022 22:15:30 +0000"  >&lt;p&gt;Hi, sorry for the delay.&#160; I&apos;ve attached:&lt;br/&gt;
&quot;bt -a&quot; output in bt.a.txt (stack traces of the active task on each CPU)&lt;br/&gt;
&quot;foreach bt&quot; output in foreach.bt.txt (stack traces of all processes)&lt;/p&gt;</comment>
                            <comment id="323377" author="tappro" created="Thu, 20 Jan 2022 21:29:40 +0000"  >&lt;p&gt;Symptoms remind me of ticket &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13356&quot; title=&quot;lctl conf_param hung on the MGS node&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13356&quot;&gt;&lt;del&gt;LU-13356&lt;/del&gt;&lt;/a&gt;; the related patch is not yet landed in b2_12: &lt;a href=&quot;https://review.whamcloud.com/41309&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/41309&lt;/a&gt;&#160;&lt;/p&gt;

&lt;p&gt;Another thought is &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-15020&quot; title=&quot;OSP_DISCONNECT blocking MDT unmount&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-15020&quot;&gt;&lt;del&gt;LU-15020&lt;/del&gt;&lt;/a&gt;, which is about waiting for OST_DISCONNECT, but the first one looks closer to what we have here&lt;/p&gt;</comment>
                            <comment id="323741" author="ofaaland" created="Mon, 24 Jan 2022 21:53:20 +0000"  >&lt;p&gt;Hi Mikhail,&lt;/p&gt;

&lt;p&gt;Yes, it does look a lot like &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13356&quot; title=&quot;lctl conf_param hung on the MGS node&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13356&quot;&gt;&lt;del&gt;LU-13356&lt;/del&gt;&lt;/a&gt;.&#160; I see Etienne&apos;s comment about change #41309 removing interop support with v2.2 clients and servers, and that the patch therefore cannot be landed to b2_12.&#160;&lt;/p&gt;
&lt;ol&gt;
	&lt;li&gt;At our site, we have only Lustre 2.10.8 routers and Lustre (2.12.8, 2.14) clients/servers/routers.&#160; We do not have v2.2 running anywhere.&#160; Can we safely add that patch to our stack?&#160; It would be useful to hear back about this today, if possible.&lt;/li&gt;
	&lt;li&gt;If change #41309 cannot be landed to b2_12, what are some other options?&#160; This question is not as urgent.&lt;/li&gt;
	&lt;li&gt;If we see this symptom again before we have any patches landed to address it, is there other information I can gather that would help confirm this theory?&lt;/li&gt;
&lt;/ol&gt;


&lt;p&gt;thanks&lt;/p&gt;</comment>
                            <comment id="323791" author="tappro" created="Tue, 25 Jan 2022 09:27:24 +0000"  >&lt;p&gt;Olaf, the patch can be added to your stack if there is no need for 2.2 interop. As for question #2 - do you mean will there be an alternative solution in b2_12?&lt;/p&gt;

&lt;p&gt;As for other information to collect, it seems we can only rely on symptoms here, since the related code has no debug messages directly connected with this situation&lt;/p&gt;</comment>
                            <comment id="323855" author="ofaaland" created="Tue, 25 Jan 2022 17:25:07 +0000"  >&lt;p&gt;Mikhail,&lt;/p&gt;

&lt;p&gt;&amp;gt; As for question #2 - do you mean will there be an alternative solution in b2_12?&lt;/p&gt;

&lt;p&gt;Yes, that was my question.&lt;/p&gt;

&lt;p&gt;thanks!&lt;/p&gt;</comment>
                            <comment id="323907" author="sthiell" created="Tue, 25 Jan 2022 21:53:51 +0000"  >&lt;p&gt;Honestly it is a bit ridiculous to not land change 41309 to b2_12 at this time because of compat issue with old Lustre 2.2. Without this patch, the MGS on 2.12.x is not stable, even in a full 2.12 environment. We have patched all our clients and servers with it (we&apos;re running 2.12.x everywhere now, mostly 2.12.7 and now deploying 2.12.8 that also requires patching). Just saying. &lt;img class=&quot;emoticon&quot; src=&quot;https://jira.whamcloud.com/images/icons/emoticons/smile.png&quot; height=&quot;16&quot; width=&quot;16&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&#160;&lt;/p&gt;</comment>
                            <comment id="323913" author="adilger" created="Tue, 25 Jan 2022 23:16:51 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=sthiell&quot; class=&quot;user-hover&quot; rel=&quot;sthiell&quot;&gt;sthiell&lt;/a&gt;, I don&apos;t think anyone is &lt;em&gt;against&lt;/em&gt; landing 41309 on b2_12 because of 2.2 interop, just that it hasn&apos;t landed yet.&lt;/p&gt;</comment>
                            <comment id="323923" author="sthiell" created="Wed, 26 Jan 2022 01:33:33 +0000"  >&lt;p&gt;OK! Thanks Andreas!&lt;/p&gt;</comment>
                            <comment id="325208" author="ofaaland" created="Thu, 3 Feb 2022 23:48:43 +0000"  >&lt;p&gt;Stephane,&lt;/p&gt;

&lt;p&gt;Do you have any other patches in your stack related to recovery?&lt;/p&gt;

&lt;p&gt;thanks&lt;/p&gt;</comment>
                            <comment id="325210" author="sthiell" created="Fri, 4 Feb 2022 00:07:15 +0000"  >&lt;p&gt;Hi Olaf,&lt;/p&gt;

&lt;p&gt;I don&apos;t think so. Our servers are running 2.12.7 with:&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13356&quot; title=&quot;lctl conf_param hung on the MGS node&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13356&quot;&gt;&lt;del&gt;LU-13356&lt;/del&gt;&lt;/a&gt; client: don&apos;t use OBD_CONNECT_MNE_SWAB  (41309)&lt;/li&gt;
	&lt;li&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14688&quot; title=&quot;Changelog cancel improvement&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14688&quot;&gt;&lt;del&gt;LU-14688&lt;/del&gt;&lt;/a&gt; mdt: changelog purge deletes plain llog (43990)&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Our clients are now slowly moving to 2.12.8 + &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13356&quot; title=&quot;lctl conf_param hung on the MGS node&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13356&quot;&gt;&lt;del&gt;LU-13356&lt;/del&gt;&lt;/a&gt;&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="61195">LU-14026</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="58344">LU-13356</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="67186">LU-15234</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="41939" name="bt.a.txt" size="52281" author="ofaaland" created="Wed, 19 Jan 2022 21:56:31 +0000"/>
                            <attachment id="41938" name="foreach.bt.txt" size="585213" author="ofaaland" created="Wed, 19 Jan 2022 21:56:31 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i02f5j:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>