<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:02:18 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-6678] calling ldlm_revoke_export_locks repeatedly while running racer seems to cause deadlock</title>
                <link>https://jira.whamcloud.com/browse/LU-6678</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;John Hammond and Nathan Lavender recently reported an issue where making nodemap configuration changes while running racer can run into a deadlock situation. After reproducing it on my machine, the code seems to get stuck in ldlm_revoke_export_locks.&lt;/p&gt;

&lt;p&gt;Looking at the trace debug, it never leaves cfs_hash_for_each_empty:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;...
00000001:00000040:2.0:1433251090.069004:0:24913:0:(hash.c:1704:cfs_hash_for_each_empty()) Try to empty hash: a9afca12-80f8-a, loop: 4390396
00000001:00000001:2.0:1433251090.069004:0:24913:0:(hash.c:1600:cfs_hash_for_each_relax()) &lt;span class=&quot;code-object&quot;&gt;Process&lt;/span&gt; entered
00010000:00000001:2.0:1433251090.069006:0:24913:0:(ldlm_lock.c:189:ldlm_lock_put()) &lt;span class=&quot;code-object&quot;&gt;Process&lt;/span&gt; entered
00010000:00000001:2.0:1433251090.069006:0:24913:0:(ldlm_lock.c:222:ldlm_lock_put()) &lt;span class=&quot;code-object&quot;&gt;Process&lt;/span&gt; leaving
00000001:00000040:2.0:1433251090.069007:0:24913:0:(hash.c:1704:cfs_hash_for_each_empty()) Try to empty hash: a9afca12-80f8-a, loop: 4390397
00000001:00000001:2.0:1433251090.069008:0:24913:0:(hash.c:1600:cfs_hash_for_each_relax()) &lt;span class=&quot;code-object&quot;&gt;Process&lt;/span&gt; entered
00010000:00000001:2.0:1433251090.069010:0:24913:0:(ldlm_lock.c:189:ldlm_lock_put()) &lt;span class=&quot;code-object&quot;&gt;Process&lt;/span&gt; entered
00010000:00000001:2.0:1433251090.069010:0:24913:0:(ldlm_lock.c:222:ldlm_lock_put()) &lt;span class=&quot;code-object&quot;&gt;Process&lt;/span&gt; leaving
00000001:00000040:2.0:1433251090.069011:0:24913:0:(hash.c:1704:cfs_hash_for_each_empty()) Try to empty hash: a9afca12-80f8-a, loop: 4390398
...
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;It looks like some locks eventually time out:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;00010000:00000001:0.0:1433298387.132153:0:2831:0:(ldlm_request.c:97:ldlm_expired_completion_wait()) &lt;span class=&quot;code-object&quot;&gt;Process&lt;/span&gt; entered
00010000:02000400:0.0:1433298387.132154:0:2831:0:(ldlm_request.c:105:ldlm_expired_completion_wait()) lock timed out (enqueued at 1433298087, 300s ago)
00010000:00010000:0.0:1433298387.132162:0:2831:0:(ldlm_request.c:111:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1433298087, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-test-MDT0000_UUID lock: ffff880061a31540/0xf4d215ae26653a36 lrc: 3/1,0 mode: --/PR res: [0x20000e690:0x1:0x0].0 bits 0x13 rrc: 25 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 2831 timeout: 0 lvb_type: 0
00010000:00000001:0.0:1433298387.132167:0:2831:0:(ldlm_request.c:120:ldlm_expired_completion_wait()) &lt;span class=&quot;code-object&quot;&gt;Process&lt;/span&gt; leaving (rc=0 : 0 : 0)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
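
&lt;p&gt;To see which locks might be holding up a timed-out request like the one above, the lock lines in a debug or namespace dump can be grouped by resource. The following is only a minimal sketch. It assumes every lock line uses the same field layout as the single line quoted above, and that a granted mode of -- marks a lock that is still waiting. The names parse_locks and candidate_blockers are made up for illustration. It lists candidates only: it does not check mode compatibility or inodebits overlap.&lt;/p&gt;

```python
import re
from collections import defaultdict

# Assumed field layout, copied from the one lock line in the log above:
#   lock: ADDR/HANDLE ... mode: GRANTED/REQUESTED res: RESOURCE ... pid: PID
# A granted mode of "--" is taken to mean the lock is still waiting.
LOCK_RE = re.compile(
    r"lock: (\S+) .*?mode: ([^/\s]+)/(\S+) res: (\S+) .*?pid: (\d+)"
)


def parse_locks(text):
    """Return one dict per line of text that contains a lock description."""
    locks = []
    for line in text.splitlines():
        m = LOCK_RE.search(line)
        if m is None:
            continue  # not a lock line, skip it
        lock, granted, requested, res, pid = m.groups()
        locks.append({
            "lock": lock,
            "granted": granted,
            "requested": requested,
            "res": res,
            "pid": int(pid),
        })
    return locks


def candidate_blockers(locks):
    """Map each waiting lock to the granted locks on the same resource."""
    by_res = defaultdict(list)
    for lock in locks:
        by_res[lock["res"]].append(lock)
    waiting = {}
    for lock in locks:
        if lock["granted"] == "--":
            waiting[lock["lock"]] = [
                other["lock"]
                for other in by_res[lock["res"]]
                if other["granted"] != "--"
            ]
    return waiting
```

&lt;p&gt;Feeding the text of a dump to parse_locks and passing the result to candidate_blockers gives, for each waiting lock, the granted locks on the same resource. Those are the ones worth checking first, together with the pid recorded for each.&lt;/p&gt;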

&lt;p&gt;On one test run, I crashed the system and took backtraces of the processes to see what was up. Here are some bits I found interesting:&lt;/p&gt;
&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;A lot of racer commands are stuck in do_lookup, like:
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt; #2 [ffff880061c53be0] mutex_lock at ffffffff8152a2ab
 #3 [ffff880061c53c00] do_lookup at ffffffff811988eb
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;4 or 5 mdt threads look similar to these two:
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt; #2 [ffff88007bf1b8c8] ldlm_completion_ast at ffffffffa1259459 [ptlrpc]
 #3 [ffff88007bf1b978] ldlm_cli_enqueue_local at ffffffffa125885e [ptlrpc]
 #4 [ffff88007bf1b9f8] mdt_object_local_lock at ffffffffa07f9b2c [mdt]
 #5 [ffff88007bf1baa8] mdt_object_lock_internal at ffffffffa07fa775 [mdt]
 #6 [ffff88007bf1baf8] mdt_object_lock at ffffffffa07fab34 [mdt]
 #7 [ffff88007bf1bb08] mdt_getattr_name_lock at ffffffffa080930c [mdt]

 #2 [ffff88006c0c19c8] ldlm_completion_ast at ffffffffa1259459 [ptlrpc]
 #3 [ffff88006c0c1a78] ldlm_cli_enqueue_local at ffffffffa125885e [ptlrpc]
 #4 [ffff88006c0c1af8] mdt_object_local_lock at ffffffffa07f9b2c [mdt]
 #5 [ffff88006c0c1ba8] mdt_object_lock_internal at ffffffffa07fa775 [mdt]
 #6 [ffff88006c0c1bf8] mdt_object_lock at ffffffffa07fab34 [mdt]
 #7 [ffff88006c0c1c08] mdt_reint_link at ffffffffa0815876 [mdt]
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;a couple &quot;ln&quot; commands look like:
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt; #2 [ffff880061e21ba0] ptlrpc_set_wait at ffffffffa1279979 [ptlrpc]
 #3 [ffff880061e21c60] ptlrpc_queue_wait at ffffffffa1279fe1 [ptlrpc]
 #4 [ffff880061e21c80] mdc_reint at ffffffffa038c4b1 [mdc]
 #5 [ffff880061e21cb0] mdc_link at ffffffffa038d2a3 [mdc]
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;a couple lfs calls look like:
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt; #2 [ffff880063b17660] ptlrpc_set_wait at ffffffffa1279979 [ptlrpc]
 #3 [ffff880063b17720] ptlrpc_queue_wait at ffffffffa1279fe1 [ptlrpc]
 #4 [ffff880063b17740] ldlm_cli_enqueue at ffffffffa1253dae [ptlrpc]
 #5 [ffff880063b177f0] mdc_enqueue at ffffffffa039380d [mdc]
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;I&apos;m not sure what information is most useful for figuring this out. Is it a matter of enabling dlmtrace and then dumping the namespaces when it hangs? It&apos;s relatively easy to trigger the hang while racer is running:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;i=0; lctl nodemap_add nm0; &lt;span class=&quot;code-keyword&quot;&gt;while&lt;/span&gt; &lt;span class=&quot;code-keyword&quot;&gt;true&lt;/span&gt;; &lt;span class=&quot;code-keyword&quot;&gt;do&lt;/span&gt; echo $i; lctl nodemap_add_range --name nm0 --range 0@lo; lctl nodemap_del_range --name nm0 --range 0@lo; ((i++)); done
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I&apos;ve attached a namespace dump in case that&apos;s enough to figure it out. I&apos;ll keep digging, but I thought I&apos;d post this in case anyone had any ideas, or in case I am way off track.&lt;/p&gt;</description>
                <environment></environment>
        <key id="30479">LU-6678</key>
            <summary>calling ldlm_revoke_export_locks repeatedly while running racer seems to cause deadlock</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="5">Cannot Reproduce</resolution>
                                        <assignee username="wc-triage">WC Triage</assignee>
                                    <reporter username="kit.westneat">Kit Westneat</reporter>
                        <labels>
                    </labels>
                <created>Wed, 3 Jun 2015 03:08:27 +0000</created>
                <updated>Sun, 10 Oct 2021 21:22:35 +0000</updated>
                            <resolved>Sun, 10 Oct 2021 21:22:35 +0000</resolved>
                                    <version>Lustre 2.7.0</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>3</watches>
                                                                            <comments>
                            <comment id="117324" author="green" created="Wed, 3 Jun 2015 18:18:47 +0000"  >&lt;p&gt;Is this with &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6409&quot; title=&quot;sleeping while atomic in nodemap_destroy&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6409&quot;&gt;&lt;del&gt;LU-6409&lt;/del&gt;&lt;/a&gt; patches too?&lt;/p&gt;

&lt;p&gt;Have you seen my additional comment in there?&lt;/p&gt;</comment>
                            <comment id="117334" author="kit.westneat" created="Wed, 3 Jun 2015 18:59:45 +0000"  >&lt;p&gt;I think this is a different bug; it looks like it&apos;s related to LDLM locks rather than mutex/spinlock problems. It also doesn&apos;t seem to be related to nodemap per se, just triggered by it. But I might be misreading the debug logs.&lt;/p&gt;

&lt;p&gt;I did, and I&apos;m working on a fix for the new issues in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6409&quot; title=&quot;sleeping while atomic in nodemap_destroy&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6409&quot;&gt;&lt;del&gt;LU-6409&lt;/del&gt;&lt;/a&gt;, but I&apos;m not sure how to trigger the error that you are seeing. Is your kernel compiled with CONFIG_DEBUG_SPINLOCK_SLEEP or something like that?&lt;/p&gt;</comment>
                            <comment id="117390" author="green" created="Thu, 4 Jun 2015 05:00:40 +0000"  >&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;00000001:00000040:2.0:1433251090.069011:0:24913:0:(hash.c:1704:cfs_hash_for_each_empty()) Try to empty hash: a9afca12-80f8-a, loop: 4390398
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;This looks pretty much like a nodemap bug to me.&lt;br/&gt;
This may or may not be related to the hang at hand, of course.&lt;br/&gt;
Basically, the ldlm hang you see is (in your example) due to a thread that is stuck on the server while holding this lock. Its pid is 2831, and you need to hunt it down and see what it is doing.&lt;/p&gt;</comment>
                            <comment id="117391" author="green" created="Thu, 4 Jun 2015 05:02:56 +0000"  >&lt;p&gt;Actually no, 2831 is the thread that&apos;s hung. What we need to do is find out which lock is blocking the granting of this lock, and then hunt that down.&lt;br/&gt;
Usually it&apos;s printed somewhere nearby.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="18052" name="dk_locks" size="3178305" author="kit.westneat" created="Wed, 3 Jun 2015 03:08:27 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzxesf:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>