<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:51:44 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-5467] process stuck in cl_locks_prune()</title>
                <link>https://jira.whamcloud.com/browse/LU-5467</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;User processes are stuck in &lt;tt&gt;cl_locks_prune()&lt;/tt&gt;.  The system is classified, so files from it can&apos;t be uploaded.  We currently have two Lustre clients in this state.&lt;/p&gt;

&lt;p&gt;Stack trace from stuck process:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;cfs_waitq_wait
cl_locks_prune
lov_delete_raid0
lov_object_delete
lu_object_free
lu_object_put
cl_object_put
cl_inode_fini
ll_clear_inode
clear_inode
ll_delete_inode
generic_delete_inode
generic_drop_inode
...
sys_unlink
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;They are waiting for the lock&apos;s user count (&lt;tt&gt;cll_users&lt;/tt&gt;) to drop to 0:&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;2063 again:
2064                 cl_lock_mutex_get(env, lock);
2065                 &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (lock-&amp;gt;cll_state &amp;lt; CLS_FREEING) {
2066                         LASSERT(lock-&amp;gt;cll_users &amp;lt;= 1);
2067                         &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (unlikely(lock-&amp;gt;cll_users == 1)) {
2068                                 struct l_wait_info lwi = { 0 };
2069                                                                                 
2070                                 cl_lock_mutex_put(env, lock);
2071                                 l_wait_event(lock-&amp;gt;cll_wq,
2072                                              lock-&amp;gt;cll_users == 0, 
2073                                              &amp;amp;lwi);
2074                                 &lt;span class=&quot;code-keyword&quot;&gt;goto&lt;/span&gt; again; 
2075                         }
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;On one node I also found a user process stuck in &lt;tt&gt;osc_io_setattr_end()&lt;/tt&gt; line 500:&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;489 &lt;span class=&quot;code-keyword&quot;&gt;static&lt;/span&gt; void osc_io_setattr_end(&lt;span class=&quot;code-keyword&quot;&gt;const&lt;/span&gt; struct lu_env *env,
490                                &lt;span class=&quot;code-keyword&quot;&gt;const&lt;/span&gt; struct cl_io_slice *slice)
491 { 
492         struct cl_io     *io  = slice-&amp;gt;cis_io;
493         struct osc_io    *oio = cl2osc_io(env, slice);
494         struct cl_object *obj = slice-&amp;gt;cis_obj;
495         struct osc_async_cbargs *cbargs = &amp;amp;oio-&amp;gt;oi_cbarg;
496         &lt;span class=&quot;code-object&quot;&gt;int&lt;/span&gt; result = 0;
497
498         &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (cbargs-&amp;gt;opc_rpc_sent) {
499                 wait_for_completion(&amp;amp;cbargs-&amp;gt;opc_sync);
500                 result = io-&amp;gt;ci_result = cbargs-&amp;gt;opc_rc;
501         } 
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;On both stuck nodes, I also noticed the &lt;tt&gt;ptlrpcd_rcv&lt;/tt&gt; thread blocked with this backtrace:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;sync_page
__lock_page
vvp_page_own
cl_page_own0
cl_page_own
check_and_discard_cb
cl_page_gang_lookup
cl_lock_discard_pages
osc_lock_flush
osc_lock_cancel
cl_lock_cancel0
cl_lock_cancel
osc_ldlm_blocking_ast
ldlm_cancel_callback
ldlm_lock_cancel
ldlm_cli_cancel_list_local
ldlm_cancel_lru_local
ldlm_replay_locks
ptlrpc_import_recov_state_machine
ptlrpc_connect_interpret
ptlrpc_check_set
ptlrpcd_check
ptlrpcd
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I haven&apos;t checked anything on the server side yet.  Please let us know ASAP if you want any more debug data from the clients before we reboot them.&lt;/p&gt;


</description>
                <environment>&lt;a href=&quot;https://github.com/chaos/lustre/commits/2.4.2-13chaos&quot;&gt;https://github.com/chaos/lustre/commits/2.4.2-13chaos&lt;/a&gt;</environment>
        <key id="25934">LU-5467</key>
            <summary>process stuck in cl_locks_prune()</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="bobijam">Zhenyu Xu</assignee>
                                    <reporter username="nedbass">Ned Bass</reporter>
                        <labels>
                            <label>llnl</label>
                    </labels>
                <created>Fri, 8 Aug 2014 21:46:37 +0000</created>
                <updated>Tue, 7 Jun 2016 15:38:29 +0000</updated>
                                            <version>Lustre 2.4.2</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>4</watches>
                                                                            <comments>
                            <comment id="91239" author="pjones" created="Fri, 8 Aug 2014 23:08:04 +0000"  >&lt;p&gt;Bobijam&lt;/p&gt;

&lt;p&gt;What do you advise here?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="91245" author="jay" created="Sat, 9 Aug 2014 02:21:59 +0000"  >&lt;p&gt;What&apos;s the status of the corresponding OST? Can you please show me the output of dmesg?&lt;/p&gt;</comment>
                            <comment id="91358" author="nedbass" created="Tue, 12 Aug 2014 01:14:45 +0000"  >&lt;p&gt;Jinshan, I can&apos;t get full dmesg output because the system is classified.&lt;/p&gt;

&lt;p&gt;I noticed &apos;lfs check servers&apos; shows resource temporarily unavailable for the same 5 OSTs on both affected clients.&lt;/p&gt;

&lt;p&gt;The &apos;imports&apos; file under /proc/fs/lustre/osc/... shows state &apos;REPLAY_LOCKS&apos;.  Also, the import&apos;s current_connection shows the failover partner&apos;s NID, not the NID of the active server.  There is no export for the client under /proc/fs/lustre/obdfilter on the OST.&lt;/p&gt;</comment>
                            <comment id="91360" author="jay" created="Tue, 12 Aug 2014 02:23:51 +0000"  >&lt;p&gt;From the stack trace, I guess this is the same issue as &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4300&quot; title=&quot;ptlrpcd threads deadlocked in cl_lock_mutex_get&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4300&quot;&gt;&lt;del&gt;LU-4300&lt;/del&gt;&lt;/a&gt; and &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4786&quot; title=&quot;Apparent denial of service from client to mdt&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4786&quot;&gt;&lt;del&gt;LU-4786&lt;/del&gt;&lt;/a&gt;. I&apos;d like to port those patches back to b2_4.&lt;/p&gt;</comment>
                            <comment id="91421" author="jay" created="Tue, 12 Aug 2014 16:41:20 +0000"  >&lt;p&gt;I backported the patch to b2_4 at: &lt;a href=&quot;http://review.whamcloud.com/11418&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/11418&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="93492" author="morrone" created="Tue, 9 Sep 2014 00:02:20 +0000"  >&lt;p&gt;It looks like we hit a similar problem on a BGQ I/O Node (Lustre client).  The backtrace for the ptlrpcd_rcv thread is identical to the backtrace that Ned listed above.  There are two OSCs stuck in the REPLAY_LOCKS state, as Ned reported in the earlier instance on x86_64.&lt;/p&gt;

&lt;p&gt;There is no thread in cl_locks_prune() this time.&lt;/p&gt;

&lt;p&gt;The OSTs appear to be fine.  Other nodes can use them.&lt;/p&gt;

&lt;p&gt;Many other threads are stuck waiting under an open():&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;cfs_waitq_timedwait
ptlrpc_set_wait
ptlrpc_queue_wait
ldlm_cli_enqueue
mdc_enqueue
mdc_intent_lock
lmv_intent_lookup
lmv_intent_lock
ll_lookup_it
ll_lookup_nd
do_lookup
__link_path_walk
path_walk
filename_lookup
do_filp_open
do_sys_open
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;One thread had a stack nearly identical to the open() ones, but got there through fstat():&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[see open() stack for the rest]
filename_lookup
user_path_at
vfs_fstatat
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Finally, a couple of threads were in this backtrace:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;cfs_waitq_timedwait
ptlrpc_set_wait
ptlrpc_queue_wait
mdc_close
lmv_close
ll_close_inode_openhandle
ll_md_real_close
ll_file_release
ll_dir_release
__fput
filp_close
put_files_struct
do_exit
do_group_exit
get_signal_to_deliver
do_signal_pending_clone
do_signal
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Do you still think that &lt;a href=&quot;http://review.whamcloud.com/11418&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/11418&lt;/a&gt; will address this problem?  We have not yet pulled in that patch.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10490" key="com.atlassian.jira.plugin.system.customfieldtypes:datepicker">
                        <customfieldname>End date</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>Tue, 9 Sep 2014 21:46:37 +0000</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                            <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzwtaf:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>15233</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                        <customfield id="customfield_10493" key="com.atlassian.jira.plugin.system.customfieldtypes:datepicker">
                        <customfieldname>Start date</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>Fri, 8 Aug 2014 21:46:37 +0000</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                    </customfields>
    </item>
</channel>
</rss>