<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:11:00 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-7681] Deadlock on MDS around dqptr_sem</title>
                <link>https://jira.whamcloud.com/browse/LU-7681</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;An MDS has hit, several times, a deadlock in which a process appears to have acquired the superblock dqptr_sem semaphore and exited without releasing it.&lt;/p&gt;

&lt;p&gt;When looking at a dump taken during this deadlock, we can see this:&lt;/p&gt;
&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;most of the processes are in jbd2_journal_start() -&amp;gt; start_this_handle() -&amp;gt; schedule(), waiting for the current transaction to be finished.&lt;/li&gt;
	&lt;li&gt;the process initiating the journal_stop is found in this state:&lt;/li&gt;
&lt;/ul&gt;


&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;PID: 1479   TASK: ffff880cefe9e7f0  CPU: 5   COMMAND: &quot;mdt_279&quot;
 #0 [ffff8810093538a0] schedule at ffffffff814965a5
 #1 [ffff881009353968] jbd2_log_wait_commit at ffffffffa00b9e55 [jbd2]
 #2 [ffff8810093539f8] jbd2_journal_stop at ffffffffa00b1b6b [jbd2]
 #3 [ffff881009353a58] __ldiskfs_journal_stop at ffffffffa05c7808 [ldiskfs]
 #4 [ffff881009353a88] osd_trans_stop at ffffffffa0e86b35 [osd_ldiskfs]
 #5 [ffff881009353ab8] mdd_trans_stop at ffffffffa0d8b4aa [mdd]
 #6 [ffff881009353ac8] mdd_attr_set at ffffffffa0d6aa5f [mdd]
 #7 [ffff881009353ba8] cml_attr_set at ffffffffa0ec3a86 [cmm]
 #8 [ffff881009353bd8] mdt_attr_set at ffffffffa0dfe418 [mdt]
 #9 [ffff881009353c28] mdt_reint_setattr at ffffffffa0dfea65 [mdt]
#10 [ffff881009353cb8] mdt_reint_rec at ffffffffa0df7cb1 [mdt]
#11 [ffff881009353cd8] mdt_reint_internal at ffffffffa0deeed4 [mdt]
#12 [ffff881009353d28] mdt_reint at ffffffffa0def2b4 [mdt]
#13 [ffff881009353d48] mdt_handle_common at ffffffffa0de3762 [mdt]
#14 [ffff881009353d98] mdt_regular_handle at ffffffffa0de4655 [mdt]
#15 [ffff881009353da8] ptlrpc_main at ffffffffa082e4e6 [ptlrpc]
#16 [ffff881009353f48] kernel_thread at ffffffff8100412a
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;This process is itself waiting for another process to commit the current transaction. That process is in this state:
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;PID: 6539   TASK: ffff880869e910c0  CPU: 14  COMMAND: &quot;mdt_47&quot;
 #0 [ffff880834bbb490] schedule at ffffffff814965a5
 #1 [ffff880834bbb558] rwsem_down_failed_common at ffffffff81498ba5
 #2 [ffff880834bbb5b8] rwsem_down_write_failed at ffffffff81498d23
 #3 [ffff880834bbb5f8] call_rwsem_down_write_failed at ffffffff812689c3
 #4 [ffff880834bbb658] dquot_initialize at ffffffff811c6cfb
 #5 [ffff880834bbb6c8] ldiskfs_dquot_initialize at ffffffffa05c7c44 [ldiskfs]
 #6 [ffff880834bbb6f8] osd_oi_iam_refresh at ffffffffa0e8ef3f [osd_ldiskfs]
 #7 [ffff880834bbb758] osd_oi_insert at ffffffffa0e8f532 [osd_ldiskfs]
 #8 [ffff880834bbb7d8] __osd_oi_insert at ffffffffa0e86e31 [osd_ldiskfs]
 #9 [ffff880834bbb828] osd_object_ea_create at ffffffffa0e87a5c [osd_ldiskfs]
#10 [ffff880834bbb888] mdd_object_create_internal at ffffffffa0d61ed0 [mdd]
#11 [ffff880834bbb8e8] mdd_create at ffffffffa0d831de [mdd]
#12 [ffff880834bbba28] cml_create at ffffffffa0ec4407 [cmm]
#13 [ffff880834bbba78] mdt_reint_open at ffffffffa0e0fd7f [mdt]
#14 [ffff880834bbbb58] mdt_reint_rec at ffffffffa0df7cb1 [mdt]
#15 [ffff880834bbbb78] mdt_reint_internal at ffffffffa0deeed4 [mdt]
#16 [ffff880834bbbbc8] mdt_intent_reint at ffffffffa0def53d [mdt]
#17 [ffff880834bbbc18] mdt_intent_policy at ffffffffa0dedc09 [mdt]
#18 [ffff880834bbbc58] ldlm_lock_enqueue at ffffffffa07d93c1 [ptlrpc]
#19 [ffff880834bbbcb8] ldlm_handle_enqueue0 at ffffffffa07ff3cd [ptlrpc]
#20 [ffff880834bbbd28] mdt_enqueue at ffffffffa0dee586 [mdt]
#21 [ffff880834bbbd48] mdt_handle_common at ffffffffa0de3762 [mdt]
#22 [ffff880834bbbd98] mdt_regular_handle at ffffffffa0de4655 [mdt]
#23 [ffff880834bbbda8] ptlrpc_main at ffffffffa082e4e6 [ptlrpc]
#24 [ffff880834bbbf48] kernel_thread at ffffffff8100412a
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;At this point we are waiting in dquot_initialize() for the superblock dqptr_sem semaphore to be released. &lt;br/&gt;
Unfortunately, I could not find any process in a code path where this semaphore is held, and (as expected) every code block that acquires the semaphore releases it before exiting.&lt;/p&gt;

&lt;p&gt;For reference, the semaphore is seen as follows:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;crash&amp;gt; struct rw_semaphore ffff88086b5c1180
struct rw_semaphore {
  count = -4294967296,   # == 0xffffffff00000000
  wait_lock = {
    raw_lock = {
      slock = 2653658667
    }
  }, 
  wait_list = {
    next = 0xffff880834bbb5c0, 
    prev = 0xffff880834bbb5c0
  }
}
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Given the definition and comments for rw_semaphore in include/linux/rwsem-spinlock.h below:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;/*
 * the rw-semaphore definition
 * - if activity is 0 then there are no active readers or writers
 * - if activity is +ve then that is the number of active readers
 * - if activity is -1 then there is one active writer
 * - if wait_list is not empty, then there are processes waiting for the semaphore
 */
struct rw_semaphore {
	__s32			activity;
	spinlock_t		wait_lock;
	struct list_head	wait_list;
#ifdef CONFIG_DEBUG_LOCK_ALLOC
	struct lockdep_map dep_map;
#endif
};
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;this would mean that activity is -1 (one active writer).&lt;/p&gt;
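&lt;p&gt;As a hedged cross-check, assuming the kernel actually uses the x86-64 atomic-count rw_semaphore implementation (arch/x86/include/asm/rwsem.h) rather than the spinlock-based layout quoted above, the raw count can be decoded as follows. The bias constants are assumptions taken from mainline 2.6.32-era headers, not verified against the Bull kernel:&lt;/p&gt;

```python
# Hedged sketch: decode a raw rw_semaphore count the way the x86-64
# atomic-count implementation (CONFIG_RWSEM_XCHGADD_ALGORITHM) encodes it.
# Bias constants are assumptions from mainline 2.6.32-era arch/x86 headers.
RWSEM_ACTIVE_BIAS = 1               # added per active reader or writer
RWSEM_WAITING_BIAS = -0x100000000   # contributed by sleeping waiters

def decode(count):
    """Split a signed 64-bit count into (waiting_part, active_count)."""
    active = count % 0x100000000     # low 32 bits (non-negative in Python)
    if active >= 0x80000000:         # the low word is itself signed
        active -= 0x100000000
    return (count - active, active)

# The value seen in the dump: count = -4294967296 == 0xffffffff00000000
print(decode(-4294967296))   # prints (-4294967296, 0)
```

&lt;p&gt;Under that assumed layout, count = 0xffffffff00000000 would decode to the waiting bias with zero active holders, rather than one active writer.&lt;/p&gt;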

&lt;p&gt;What did I miss there?&lt;/p&gt;</description>
                <environment>RHEL 6 with Bull kernel based on 2.6.32-279.5.2</environment>
        <key id="34155">LU-7681</key>
            <summary>Deadlock on MDS around dqptr_sem</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="5">Cannot Reproduce</resolution>
                                        <assignee username="bfaccini">Bruno Faccini</assignee>
                                    <reporter username="spiechurski">Sebastien Piechurski</reporter>
                        <labels>
                            <label>p4b</label>
                    </labels>
                <created>Mon, 18 Jan 2016 21:25:35 +0000</created>
                <updated>Wed, 7 Jun 2017 12:01:15 +0000</updated>
                            <resolved>Wed, 7 Jun 2017 12:01:15 +0000</resolved>
                                    <version>Lustre 2.1.6</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>4</watches>
                                                                            <comments>
                            <comment id="139206" author="bfaccini" created="Mon, 18 Jan 2016 22:27:54 +0000"  >&lt;p&gt;Assigning this to me since I have already started working with Seb on this problem from the Bull office.&lt;/p&gt;</comment>
                            <comment id="139207" author="bfaccini" created="Mon, 18 Jan 2016 22:29:30 +0000"  >&lt;p&gt;Suspecting a deadlock problem around dqptr_sem in this quite old kernel ...&lt;/p&gt;</comment>
                            <comment id="139228" author="bfaccini" created="Tue, 19 Jan 2016 10:18:32 +0000"  >&lt;p&gt;Seb, is the crash-dump for this problem still available? If so, can you upload it along with the kernel-&lt;span class=&quot;error&quot;&gt;&amp;#91;common-&amp;#93;&lt;/span&gt;debuginfo and lustre-debuginfo RPMs?&lt;/p&gt;</comment>
                            <comment id="139229" author="bfaccini" created="Tue, 19 Jan 2016 11:50:38 +0000"  >&lt;p&gt;BTW, later Lustre versions (2.4 and above) use a kernel patch to avoid dqptr_sem usage entirely, so it is very likely that this problem no longer exists there.&lt;/p&gt;</comment>
                            <comment id="139366" author="spiechurski" created="Wed, 20 Jan 2016 00:06:28 +0000"  >&lt;p&gt;A bundle with all the debuginfo packages and sources is currently being uploaded to ftp.whamcloud.com.&lt;/p&gt;</comment>
                            <comment id="139546" author="bfaccini" created="Thu, 21 Jan 2016 08:29:15 +0000"  >&lt;p&gt;Seb,&lt;br/&gt;
Can you check? The bundle transfer appears to have finished, but the file is incomplete.&lt;/p&gt;</comment>
                            <comment id="139550" author="spiechurski" created="Thu, 21 Jan 2016 11:15:22 +0000"  >&lt;p&gt;Yes, the transfer failed; I had not noticed.&lt;br/&gt;
A new file is currently transferring, but I don&apos;t have much bandwidth, so it will probably only finish tonight.&lt;/p&gt;</comment>
                            <comment id="139825" author="spiechurski" created="Sat, 23 Jan 2016 09:07:54 +0000"  >&lt;p&gt;The transfer finally succeeded with file &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7681&quot; title=&quot;Deadlock on MDS around dqptr_sem&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7681&quot;&gt;&lt;del&gt;LU-7681&lt;/del&gt;&lt;/a&gt;-bundle3.tar.xz.&lt;/p&gt;</comment>
                            <comment id="142454" author="bfaccini" created="Wed, 17 Feb 2016 17:32:07 +0000"  >&lt;p&gt;Hello Seb, and sorry for the delay on this.&lt;/p&gt;

&lt;p&gt;I have spent more time analyzing the crash-dump you provided. BTW, this looks like a different occurrence/crash-dump from the one we already worked on together, which means the same problem has re-occurred ...&lt;/p&gt;

&lt;p&gt;Also, can you give me a hint on how to use the lustre&lt;span class=&quot;error&quot;&gt;&amp;#91;-ldiskfs&amp;#93;&lt;/span&gt;-core RPMs you provided (and their embedded sets of patches), so that I can reconstruct the full, exact source tree used for this Lustre version?&lt;/p&gt;

&lt;p&gt;Then as a first thought, even though I still have not identified the thread that currently owns dqptr_sem and the blocked situation looks a bit different, I wonder whether my patch for &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4271&quot; title=&quot;mds load goes very high and filesystem hangs after mounting mdt&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4271&quot;&gt;&lt;del&gt;LU-4271&lt;/del&gt;&lt;/a&gt; might also help avoid this deadlock. Would it be possible to give it a try?&lt;/p&gt;</comment>
                            <comment id="198413" author="spiechurski" created="Wed, 7 Jun 2017 08:19:40 +0000"  >&lt;p&gt;I have not heard about this problem for quite a while, so I think it was solved by moving away from 2.1.&lt;/p&gt;

&lt;p&gt;Please close.&lt;/p&gt;</comment>
                            <comment id="198429" author="pjones" created="Wed, 7 Jun 2017 12:01:15 +0000"  >&lt;p&gt;Thanks&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10490" key="com.atlassian.jira.plugin.system.customfieldtypes:datepicker">
                        <customfieldname>End date</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>Wed, 17 Feb 2016 21:25:35 +0000</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                            <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzxyg7:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                        <customfield id="customfield_10493" key="com.atlassian.jira.plugin.system.customfieldtypes:datepicker">
                        <customfieldname>Start date</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>Mon, 18 Jan 2016 21:25:35 +0000</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                    </customfields>
    </item>
</channel>
</rss>