<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:05:46 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-308] Hang and eventual ASSERT after mdc_enqueue()) ldlm_cli_enqueue: -4</title>
                <link>https://jira.whamcloud.com/browse/LU-308</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;On a production lustre client node, we hit an ASSERT.  The first sign of trouble on the console is this:&lt;/p&gt;

&lt;p&gt;2011-05-11 08:55:44 LustreError: ... (mdc_locks.c:648:mdc_enqueue())&lt;br/&gt;
ldlm_cli_enqueue: -4&lt;/p&gt;

&lt;p&gt;I believe that is under an emacs process.&lt;/p&gt;

&lt;p&gt;Ten seconds later we start getting &quot;soft lockup&quot; &quot;stuck for 10s&quot; warnings&lt;br/&gt;
about the same process.  The messages pop up every 10s until we finally get an&lt;br/&gt;
assertion later on.  Backtrace looks like:&lt;/p&gt;

&lt;p&gt;:mdc:mdc_enter_request&lt;br/&gt;
:ptlrpc:ldlm_lock_addref_internal_nolock&lt;br/&gt;
:mdc:mdc_enqueue&lt;br/&gt;
dequeue_task&lt;br/&gt;
thread_return&lt;br/&gt;
:ptlrpc:ldlm_lock_add_to_lru_nolock&lt;br/&gt;
:mdc:mdc_intent_lock&lt;br/&gt;
:ptlrpc:ldlm_lock_decref&lt;br/&gt;
:mdc:mdc_set_lock_data&lt;br/&gt;
:lustre:ll_mdc_blocking_ast&lt;br/&gt;
:ptlrpc:ldlm_completion_ast&lt;br/&gt;
:lustre:ll_prepare_mdc_op_data&lt;br/&gt;
:lustre:ll_lookup_it&lt;br/&gt;
:lustre:ll_mdc_blocking_ast&lt;br/&gt;
:lov:lov_fini_enqueue_set&lt;br/&gt;
:lustre:ll_lookup_nd&lt;br/&gt;
list_add&lt;br/&gt;
d_alloc&lt;br/&gt;
do_lookup&lt;br/&gt;
__link_path_walk&lt;br/&gt;
link_path_walk&lt;br/&gt;
do_path_lookup&lt;br/&gt;
__user_walk_fd&lt;br/&gt;
vfs_stat_fd&lt;br/&gt;
sys_rt_sigreturn&lt;br/&gt;
sys_rt_sigreturn&lt;br/&gt;
sys_newstat&lt;br/&gt;
sys_setitimer&lt;br/&gt;
stub_rt_sigreturn&lt;br/&gt;
system_call&lt;/p&gt;

&lt;p&gt;Later a different process throws these errors:&lt;/p&gt;

&lt;p&gt;2011-05-11 09:06:07 Lustre: ... Request mdc_close sent 106s ago has failed due&lt;br/&gt;
to network error (limit 106s)&lt;br/&gt;
2011-05-11 09:06:07 LustreError: ... ll_close_inode_openhandle()) inode X mdc&lt;br/&gt;
close failed: -4&lt;br/&gt;
2011-05-11 09:06:07 Skipped 4 previous messages&lt;/p&gt;

&lt;p&gt;And then three seconds later the original stuck thread does:&lt;/p&gt;

&lt;p&gt;2011-05-11 09:06:10 ldlm_lock.c:189:ldlm_lock_remove_from_lru_nolock ASSERT(ns-&amp;gt;ns_nr_unused &amp;gt; 0) failed&lt;/p&gt;

&lt;p&gt;Backtrace looks like:&lt;/p&gt;

&lt;p&gt;ldlm_lock_remove_from_lru_nolock&lt;br/&gt;
ldlm_lock_remove_from_lru&lt;br/&gt;
ldlm_lock_addref_internal_nolock&lt;br/&gt;
search_queue&lt;br/&gt;
ldlm_lock_match&lt;br/&gt;
ldlm_resource_get&lt;br/&gt;
mdc_revalidate_lock&lt;br/&gt;
ldlm_lock_addref_internal_nolock&lt;br/&gt;
mdc_intent_lock&lt;br/&gt;
ll_i2gids&lt;br/&gt;
ll_prepare_mdc_op_data&lt;br/&gt;
__ll_inode_revalidate_it&lt;br/&gt;
ll_mdc_blocking_ast&lt;br/&gt;
ll_inode_permission&lt;br/&gt;
dput&lt;br/&gt;
permission&lt;br/&gt;
vfs_permission&lt;br/&gt;
__link_path_walk&lt;br/&gt;
link_path_walk&lt;br/&gt;
do_path_lookup&lt;br/&gt;
__path_lookup_intent_open&lt;br/&gt;
path_lookup_open&lt;br/&gt;
open_namei&lt;br/&gt;
do_filp_open&lt;br/&gt;
get_unused_fd&lt;br/&gt;
do_sys_open&lt;br/&gt;
sys_open&lt;/p&gt;

&lt;p&gt;Apologies for any typos.  That all had to be hand copied.&lt;/p&gt;

&lt;p&gt;Since this all appears to have started with an EINTR in mdc_enqueue(), it may be that this bug is related:&lt;/p&gt;

&lt;p&gt;  &lt;a href=&quot;https://bugzilla.lustre.org/show_bug.cgi?id=18213&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://bugzilla.lustre.org/show_bug.cgi?id=18213&lt;/a&gt;&lt;br/&gt;
  &lt;a href=&quot;http://jira.whamcloud.com/browse/LU-234&quot; class=&quot;external-link&quot; rel=&quot;nofollow&quot;&gt;http://jira.whamcloud.com/browse/LU-234&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We are running 1.8.5+, so we should have the fix that was applied to 1.8.5 in bug 18213.&lt;/p&gt;</description>
                <environment>RHEL5.5ish (CHAOS4.4-2), lustre 1.8.5.0-3chaos</environment>
        <key id="10786">LU-308</key>
            <summary>Hang and eventual ASSERT after mdc_enqueue()) ldlm_cli_enqueue: -4</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="laisiyao">Lai Siyao</assignee>
                                    <reporter username="morrone">Christopher Morrone</reporter>
                        <labels>
                    </labels>
                <created>Wed, 11 May 2011 17:37:25 +0000</created>
                <updated>Tue, 28 Jun 2011 15:01:39 +0000</updated>
                            <resolved>Mon, 13 Jun 2011 15:00:21 +0000</resolved>
                                    <version>Lustre 1.8.6</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>4</watches>
                                                                            <comments>
                            <comment id="14263" author="pjones" created="Thu, 12 May 2011 15:32:19 +0000"  >&lt;p&gt;Lai&lt;/p&gt;

&lt;p&gt;Could you please look into this one?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="14421" author="laisiyao" created="Mon, 16 May 2011 22:44:39 +0000"  >&lt;p&gt;Johann, could you take a look into this? I can&apos;t find a use case that will trigger the ldlm_lock_remove_from_lru_nolock ASSERT(ns-&amp;gt;ns_nr_unused &amp;gt; 0).&lt;/p&gt;</comment>
                            <comment id="14423" author="johann" created="Tue, 17 May 2011 01:34:29 +0000"  >&lt;p&gt;The LASSERT might just be a side effect of the initial soft lockup.&lt;br/&gt;
We actually found &amp;amp; fixed a problem in 1.8.5 with mdc_enter_request().&lt;br/&gt;
See &lt;a href=&quot;https://bugzilla.lustre.org/show_bug.cgi?id=24508#c1&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://bugzilla.lustre.org/show_bug.cgi?id=24508#c1&lt;/a&gt;&lt;br/&gt;
A patch was landed to Whamcloud&apos;s b1_8 as part of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-286&quot; title=&quot;racer: general protection fault: 0000 [1] SMP RIP: __wake_up_common+60}&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-286&quot;&gt;&lt;del&gt;LU-286&lt;/del&gt;&lt;/a&gt;, see &lt;a href=&quot;http://review.whamcloud.com/506&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/506&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="14426" author="laisiyao" created="Tue, 17 May 2011 02:08:57 +0000"  >&lt;p&gt;Thank you, Johann. This looks reasonable.&lt;br/&gt;
Chris, could you verify the patch for &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-286&quot; title=&quot;racer: general protection fault: 0000 [1] SMP RIP: __wake_up_common+60}&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-286&quot;&gt;&lt;del&gt;LU-286&lt;/del&gt;&lt;/a&gt; is not included in your test code?&lt;/p&gt;</comment>
                            <comment id="14463" author="morrone" created="Tue, 17 May 2011 13:16:29 +0000"  >&lt;p&gt;Correct, we do not have that patch.&lt;/p&gt;

&lt;p&gt;And our code is not &quot;test&quot; code; we saw this in production.&lt;/p&gt;</comment>
                            <comment id="14494" author="pjones" created="Wed, 18 May 2011 07:57:02 +0000"  >&lt;p&gt;Chris&lt;/p&gt;

&lt;p&gt;Will you be trying this patch in production or are some additional steps required first?&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="14699" author="morrone" created="Thu, 19 May 2011 17:34:44 +0000"  >&lt;p&gt;I&apos;ll pull it into our 1.8.5-llnl branch.&lt;/p&gt;

&lt;p&gt;As for when it goes into production...our local testing infrastructure has almost completely moved to RHEL6 and lustre 2.1.  We have a 1.8 server cluster left over for testing, but no 1.8 clients.  The first window we have to get 1.8 clients and test a release is probably mid June, with a target for installation in late June if there are no surprises.&lt;/p&gt;</comment>
                            <comment id="16107" author="pjones" created="Mon, 13 Jun 2011 15:00:21 +0000"  >&lt;p&gt;Let&apos;s close this ticket for now and reopen if the issue reoccurs with the patch applied&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzw1gv:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>10319</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>