<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:46:30 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary, append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-4861] App hung - deadlock in cl_lock_mutex_get along cl_glimpse_lock path</title>
                <link>https://jira.whamcloud.com/browse/LU-4861</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;We have several occurrences of applications hanging. Stack traces show the application processes waiting in cl_lock_mutex_get/mutex_lock on the code path through cl_glimpse_lock. All the dumps I&apos;ve looked at show one of the processes calling osc_ldlm_completion_ast along the way. Two processes are deadlocked on two cl_lock.cll_guard mutexes; all other app processes are waiting for one of these two mutexes.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;&amp;gt; crash&amp;gt; bt -F | grep -A 1 &apos;#2&apos;
&amp;gt;  #2 [ffff88083f505c40] mutex_lock at ffffffff8144f533
&amp;gt;     ffff88083f505c48: [ccc_object_kmem] [cl_lock_kmem]

                                         addr(cl_lock)
&amp;gt; crash&amp;gt; foreach growfiles bt -f | grep -A 1 &apos;#2&apos; | grep -v mutex_lock
&amp;gt;     ffff88083f533c28: ffff8808350cdef8 ffff88083ba19ed0
&amp;gt;     ffff88083c4899a8: ffff88083f684d98 ffff88083bbf5b70
&amp;gt;     ffff8807eb48fd08: ffff8808350cdef8 ffff88083ba19ed0
&amp;gt;     ffff88083b4cfb98: ffff88083f684d98 ffff88083bbf5b70
&amp;gt;     ffff8807ea2e1aa8: ffff88083f684d98 ffff88083bbf5b70
&amp;gt;     ffff88083f505c48: ffff8808350cdef8 ffff88083ba19ed0
&amp;gt;     ffff88083ff5fc28: ffff8808350cdef8 ffff88083ba19ed0
&amp;gt;     ffff880833821d08: ffff8808350cdef8 ffff88083ba19ed0
&amp;gt;     ffff880833751c28: ffff8808350cdef8 ffff88083ba19ed0
&amp;gt;     ffff88083f5f1c28: ffff8808350cdef8 ffff88083ba19ed0
&amp;gt;     ffff88083e157c28: ffff8808350cdef8 ffff88083ba19ed0
&amp;gt;     ffff880833749c28: ffff8808350cdef8 ffff88083ba19ed0
&amp;gt;     ffff88083dfcbc28: ffff8808350cdef8 ffff88083ba19ed0
&amp;gt;     ffff88083bd65c28: ffff8808350cdef8 ffff88083ba19ed0
&amp;gt;     ffff880833755c28: ffff8808350cdef8 ffff88083ba19ed0
&amp;gt;     ffff880833801c28: ffff8808350cdef8 ffff88083ba19ed0
&amp;gt;     ffff88083fd31c28: ffff8808350cdef8 ffff88083ba19ed0
&amp;gt;     ffff8807ed5a3c28: ffff8808350cdef8 ffff88083ba19ed0
&amp;gt;     ffff8807e0117c28: ffff8808350cdef8 ffff88083ba19ed0

crash&amp;gt; struct cl_lock.cll_guard ffff88083bbf5b70 | grep owner
    owner = 0xffff88083fe0c7f0
crash&amp;gt; ps | grep ffff88083fe0c7f0
   5548      1    3 ffff88083fe0c7f0  UN   0.0    3120   1576  growfiles

crash&amp;gt; struct cl_lock.cll_guard ffff88083ba19ed0 | grep owner
    owner = 0xffff8808336497f0
crash&amp;gt; ps | grep ffff8808336497f0
   5543      1   12 ffff8808336497f0  UN   0.0    3120   1576  growfiles

&amp;gt; crash&amp;gt; for 5543 bt -f | grep -A 1 &apos;#2&apos;
&amp;gt;  #2 [ffff88083c4899a0] mutex_lock at ffffffff8144f533
&amp;gt;     ffff88083c4899a8: ffff88083f684d98 ffff88083bbf5b70 
&amp;gt; crash&amp;gt; for 5548 bt -f | grep -A 1 &apos;#2&apos;
&amp;gt;  #2 [ffff88083f505c40] mutex_lock at ffffffff8144f533
&amp;gt;     ffff88083f505c48: ffff8808350cdef8 ffff88083ba19ed0 

So a deadlock exists between pids 5543 and 5548. All other growfiles tasks are waiting for one of these two pids.
                           Owner     Waiter
cl_lock ffff88083bbf5b70   5548      5543
cl_lock ffff88083ba19ed0   5543      5548

&amp;gt; crash&amp;gt; bt
&amp;gt; PID: 5548   TASK: ffff88083fe0c7f0  CPU: 3   COMMAND: &quot;growfiles&quot;
&amp;gt;  #0 [ffff88083f505a68] schedule at ffffffff8144e6b7
&amp;gt;  #1 [ffff88083f505bd0] __mutex_lock_slowpath at ffffffff8144fb0e
&amp;gt;  #2 [ffff88083f505c40] mutex_lock at ffffffff8144f533
&amp;gt;  #3 [ffff88083f505c60] cl_lock_mutex_get at ffffffffa03aa046 [obdclass]
&amp;gt;  #4 [ffff88083f505c90] lov_lock_enqueue at ffffffffa07c077f [lov]
&amp;gt;  #5 [ffff88083f505d30] cl_enqueue_try at ffffffffa03abffb [obdclass]
&amp;gt;  #6 [ffff88083f505d80] cl_enqueue_locked at ffffffffa03aceef [obdclass]
&amp;gt;  #7 [ffff88083f505dc0] cl_lock_request at ffffffffa03adb0e [obdclass]
&amp;gt;  #8 [ffff88083f505e20] cl_glimpse_lock at ffffffffa089089f [lustre]
&amp;gt;  #9 [ffff88083f505e80] cl_glimpse_size0 at ffffffffa0890d4d [lustre]
&amp;gt; #10 [ffff88083f505ed0] ll_file_seek at ffffffffa083d988 [lustre]
&amp;gt; #11 [ffff88083f505f30] vfs_llseek at ffffffff81155eea
&amp;gt; #12 [ffff88083f505f40] sys_lseek at ffffffff8115604e
&amp;gt; #13 [ffff88083f505f80] system_call_fastpath at ffffffff814589ab

&amp;gt; PID: 5543   TASK: ffff8808336497f0  CPU: 12  COMMAND: &quot;growfiles&quot;
&amp;gt;  #0 [ffff88083c4897c8] schedule at ffffffff8144e6b7
&amp;gt;  #1 [ffff88083c489930] __mutex_lock_slowpath at ffffffff8144fb0e
&amp;gt;  #2 [ffff88083c4899a0] mutex_lock at ffffffff8144f533
&amp;gt;  #3 [ffff88083c4899c0] cl_lock_mutex_get at ffffffffa03aa046 [obdclass]
&amp;gt;  #4 [ffff88083c4899f0] osc_ldlm_completion_ast at ffffffffa072ea6f [osc]
&amp;gt;  #5 [ffff88083c489a40] ldlm_lock_match at ffffffffa04a1477 [ptlrpc]
&amp;gt;  #6 [ffff88083c489b20] osc_enqueue_base at ffffffffa07128f0 [osc]
&amp;gt;  #7 [ffff88083c489bb0] osc_lock_enqueue at ffffffffa072ccb6 [osc]
&amp;gt;  #8 [ffff88083c489c40] cl_enqueue_try at ffffffffa03abffb [obdclass]
&amp;gt;  #9 [ffff88083c489c90] lov_lock_enqueue at ffffffffa07c01d2 [lov]
&amp;gt; #10 [ffff88083c489d30] cl_enqueue_try at ffffffffa03abffb [obdclass]
&amp;gt; #11 [ffff88083c489d80] cl_enqueue_locked at ffffffffa03aceef [obdclass]
&amp;gt; #12 [ffff88083c489dc0] cl_lock_request at ffffffffa03adb0e [obdclass]
&amp;gt; #13 [ffff88083c489e20] cl_glimpse_lock at ffffffffa089089f [lustre]
&amp;gt; #14 [ffff88083c489e80] cl_glimpse_size0 at ffffffffa0890d4d [lustre]
&amp;gt; #15 [ffff88083c489ed0] ll_file_seek at ffffffffa083d988 [lustre]
&amp;gt; #16 [ffff88083c489f30] vfs_llseek at ffffffff81155eea
&amp;gt; #17 [ffff88083c489f40] sys_lseek at ffffffff8115604e
&amp;gt; #18 [ffff88083c489f80] system_call_fastpath at ffffffff814589ab
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The version of Lustre is 2.5.1 with some additional patches; in particular, the LU-3027 patch (change 7841) has been reverted. The patch from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4558&quot; title=&quot;Crash in cl_lock_put on racer&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4558&quot;&gt;&lt;del&gt;LU-4558&lt;/del&gt;&lt;/a&gt; (change 9876) is NOT included.&lt;/p&gt;</description>
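
The two stacks in the description show a classic lock-ordering (ABBA) deadlock: each growfiles task holds one cl_lock.cll_guard mutex and blocks acquiring the other. As a minimal sketch only (plain Python threading, not Lustre code; the "pid" names are just labels for the two tasks), acquiring both mutexes in a single global order makes the cycle impossible even when callers pass the locks in opposite order:

```python
import threading

lock_a = threading.Lock()   # stands in for cl_lock ffff88083bbf5b70
lock_b = threading.Lock()   # stands in for cl_lock ffff88083ba19ed0

def acquire_ordered(x, y):
    """Take both locks in one fixed global order (here: by object id),
    so two threads grabbing the same pair can never deadlock."""
    first, second = sorted((x, y), key=id)
    first.acquire()
    second.acquire()
    return first, second

def release_ordered(first, second):
    # Release in reverse order of acquisition.
    second.release()
    first.release()

results = []

def worker(x, y, name):
    # The two workers receive the locks in opposite order, mirroring the
    # two growfiles tasks above; ordered acquisition still makes progress.
    held = acquire_ordered(x, y)
    results.append(name)
    release_ordered(*held)

t1 = threading.Thread(target=worker, args=(lock_a, lock_b, "pid-5548"))
t2 = threading.Thread(target=worker, args=(lock_b, lock_a, "pid-5543"))
t1.start(); t2.start()
t1.join(); t2.join()
print(sorted(results))   # ['pid-5543', 'pid-5548']
```

The actual fix landed via http://review.whamcloud.com/10581 (see comments below); the sketch only illustrates why the opposite-order acquisition seen in the two stack traces deadlocks, not what that patch does.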
                <environment>Lustre 2.5.1 on both clients and servers.</environment>
        <key id="24064">LU-4861</key>
            <summary>App hung - deadlock in cl_lock_mutex_get along cl_glimpse_lock path</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="2" iconUrl="https://jira.whamcloud.com/images/icons/priorities/critical.svg">Critical</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="jay">Jinshan Xiong</assignee>
                                    <reporter username="amk">Ann Koehler</reporter>
                        <labels>
                    </labels>
                <created>Thu, 3 Apr 2014 19:06:59 +0000</created>
                <updated>Tue, 12 Aug 2014 19:42:16 +0000</updated>
                            <resolved>Fri, 20 Jun 2014 05:08:55 +0000</resolved>
                                    <version>Lustre 2.5.1</version>
                                    <fixVersion>Lustre 2.6.0</fixVersion>
                    <fixVersion>Lustre 2.5.3</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>9</watches>
                <comments>
                            <comment id="80975" author="amk" created="Thu, 3 Apr 2014 19:22:33 +0000"  >&lt;p&gt;Uploaded dump to ftp.whamcloud.com:/uploads/&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4861&quot; title=&quot;App hung - deadlock in cl_lock_mutex_get along cl_glimpse_lock path&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4861&quot;&gt;&lt;del&gt;LU-4861&lt;/del&gt;&lt;/a&gt;/bug810344_cllock_deadlock.tgz&lt;/p&gt;

&lt;p&gt;This ticket is primarily &quot;for your information&quot;. Seems like there&apos;s a reasonable chance that the cause of the deadlock is related to the root cause of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4591&quot; title=&quot;Related cl_lock failures on master/2.5&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4591&quot;&gt;&lt;del&gt;LU-4591&lt;/del&gt;&lt;/a&gt;. I&apos;m passing on the dump in case it helps solve &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4591&quot; title=&quot;Related cl_lock failures on master/2.5&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4591&quot;&gt;&lt;del&gt;LU-4591&lt;/del&gt;&lt;/a&gt;. And in case it&apos;s not, we&apos;re working the issue through Xyratex in parallel.&lt;/p&gt;
</comment>
                            <comment id="81015" author="jay" created="Fri, 4 Apr 2014 01:21:04 +0000"  >&lt;p&gt;Thanks for the coredump, I will take a look.&lt;/p&gt;</comment>
                            <comment id="82185" author="amk" created="Tue, 22 Apr 2014 18:49:38 +0000"  >&lt;p&gt;FYI, we&apos;ve seen this same deadlock again with the patch  &lt;a href=&quot;http://review.whamcloud.com/9881&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/9881&lt;/a&gt; from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4558&quot; title=&quot;Crash in cl_lock_put on racer&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4558&quot;&gt;&lt;del&gt;LU-4558&lt;/del&gt;&lt;/a&gt; applied.&lt;/p&gt;</comment>
                            <comment id="84447" author="paf" created="Tue, 20 May 2014 14:51:25 +0000"  >&lt;p&gt;I hit this bug last night during stress testing of current, unmodified master on SLES11SP3.  &lt;/p&gt;

&lt;p&gt;Most recent commit on the branch being tested:&lt;br/&gt;
commit 864fc9daac267819f5e3bdebef6cdac4c6325626&lt;br/&gt;
Author: Mikhail Pershin &amp;lt;mike.pershin@intel.com&amp;gt;&lt;br/&gt;
Date:   Tue May 13 18:34:03 2014 +0400&lt;/p&gt;

&lt;p&gt;    &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-2059&quot; title=&quot;mgc to backup configuration on osd-based llogs&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-2059&quot;&gt;&lt;del&gt;LU-2059&lt;/del&gt;&lt;/a&gt; mgs: don&apos;t fail on missing params log&lt;br/&gt;
-----------------------------------------&lt;/p&gt;

&lt;p&gt;Uploaded dump to ftp.whamcloud.com:/uploads/&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4861&quot; title=&quot;App hung - deadlock in cl_lock_mutex_get along cl_glimpse_lock path&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4861&quot;&gt;&lt;del&gt;LU-4861&lt;/del&gt;&lt;/a&gt;/master_140520_dump_upload.tar.gz&lt;/p&gt;

&lt;p&gt;Note there is also a master_140520_dump.tar.gz in that same directory, which is a failed transfer.&lt;/p&gt;</comment>
                            <comment id="85203" author="jay" created="Thu, 29 May 2014 23:58:48 +0000"  >&lt;p&gt;I will take a look at this issue.&lt;/p&gt;</comment>
                            <comment id="85675" author="jay" created="Wed, 4 Jun 2014 02:20:55 +0000"  >&lt;p&gt;please try patch: &lt;a href=&quot;http://review.whamcloud.com/10581&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/10581&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="85678" author="paf" created="Wed, 4 Jun 2014 03:10:29 +0000"  >&lt;p&gt;We&apos;ll try the patch.  Separately, because Jinshan asked, the logs from the dump I uploaded are here:&lt;br/&gt;
ftp.whamcloud.com:/uploads/&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4861&quot; title=&quot;App hung - deadlock in cl_lock_mutex_get along cl_glimpse_lock path&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4861&quot;&gt;&lt;del&gt;LU-4861&lt;/del&gt;&lt;/a&gt;/logs.tar.gz&lt;/p&gt;</comment>
                            <comment id="85925" author="paf" created="Thu, 5 Jun 2014 21:52:19 +0000"  >&lt;p&gt;Jinshan - We&apos;ve given this patch a bit of testing, and seen no issues.  We don&apos;t have a clear reproducer, so we can&apos;t say the bug is fixed for sure, but we&apos;re pulling this in to our branch of Lustre so it will see more exposure over the next few weeks.&lt;/p&gt;</comment>
                            <comment id="87136" author="pjones" created="Fri, 20 Jun 2014 05:08:55 +0000"  >&lt;p&gt;Landed for 2.6&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="25201">LU-5225</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzwj9r:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>13411</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>