<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:59:20 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-6337] threads stuck at ldlm_completion_ast</title>
                <link>https://jira.whamcloud.com/browse/LU-6337</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Looks like we have hit this issue 3 times in the past 48 hours. Lots of threads are stuck at ldlm_completion_ast. We are running 2.4.3.&lt;/p&gt;

&lt;p&gt;See the attached console logs.&lt;/p&gt;
</description>
                <environment></environment>
        <key id="28991">LU-6337</key>
            <summary>threads stuck at ldlm_completion_ast</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="3">Duplicate</resolution>
                                        <assignee username="green">Oleg Drokin</assignee>
                                    <reporter username="mhanafi">Mahmoud Hanafi</reporter>
                        <labels>
                    </labels>
                <created>Thu, 5 Mar 2015 22:35:52 +0000</created>
                <updated>Fri, 16 Oct 2015 18:18:53 +0000</updated>
                            <resolved>Fri, 16 Oct 2015 18:18:53 +0000</resolved>
                                    <version>Lustre 2.4.3</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>5</watches>
                                                                            <comments>
                            <comment id="109000" author="jaylan" created="Thu, 5 Mar 2015 23:42:24 +0000"  >&lt;p&gt;This ticket looks like &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5497&quot; title=&quot;Many MDS service threads blocked in ldlm_completion_ast()&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5497&quot;&gt;&lt;del&gt;LU-5497&lt;/del&gt;&lt;/a&gt;. It seems like moving to 2.5.3 is a possible solution, but we had problems upgrading yesterday. While we investigate the upgrade problems (and, honestly, we need more testing before putting it in production), we need working patches for 2.4.3 from Intel.&lt;/p&gt;</comment>
                            <comment id="109007" author="green" created="Fri, 6 Mar 2015 01:17:40 +0000"  >&lt;p&gt;The symptoms you are seeing are too broad. The most likely cause is a lock that is not being released by some party.&lt;br/&gt;
In the past, the major contributor to it was LU-2827, also seen as &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5497&quot; title=&quot;Many MDS service threads blocked in ldlm_completion_ast()&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5497&quot;&gt;&lt;del&gt;LU-5497&lt;/del&gt;&lt;/a&gt; at LLNL, but in order for it to manifest you need to have over 45 OSTs in your system OR your network must regularly drop RPCs to/from the MDS.&lt;br/&gt;
If either of those two is true, then applying &lt;a href=&quot;http://review.whamcloud.com/#/c/6511/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/6511/&lt;/a&gt; and &lt;a href=&quot;http://review.whamcloud.com/#/c/9488/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/9488/&lt;/a&gt; should help you.&lt;/p&gt;

&lt;p&gt;Also please note that 2.5.3 does not contain fixes for this problem (but the tip of b2_5 does).&lt;/p&gt;</comment>
                            <comment id="109015" author="jaylan" created="Fri, 6 Mar 2015 02:07:01 +0000"  >&lt;p&gt;Our nas-2.5.3 branch as of today is very close to the tip of b2_5. We are at &lt;br/&gt;
&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5912&quot; title=&quot;locking flaw generates logged errors&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5912&quot;&gt;&lt;del&gt;LU-5912&lt;/del&gt;&lt;/a&gt; libcfs: use vfs api for fsync calls&lt;/p&gt;

&lt;p&gt;However, I do not see &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5497&quot; title=&quot;Many MDS service threads blocked in ldlm_completion_ast()&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5497&quot;&gt;&lt;del&gt;LU-5497&lt;/del&gt;&lt;/a&gt; patch in b2_5. Which commit should it be if the tip of b2_5 contains the fix?&lt;/p&gt;

&lt;p&gt;Now, if I want to cherry pick #6511 and #9488 to nas-2.4.3, do I also need &lt;br/&gt;
&lt;a href=&quot;http://review.whamcloud.com/#/c/10601/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/10601/&lt;/a&gt; in addition to those two?&lt;/p&gt;</comment>
                            <comment id="109017" author="green" created="Fri, 6 Mar 2015 02:27:59 +0000"  >&lt;p&gt;In b2_5 there is a proper fix to this issue under the banner of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-2827&quot; title=&quot;mdt_intent_fixup_resent() cannot find the proper lock in hash&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-2827&quot;&gt;&lt;del&gt;LU-2827&lt;/del&gt;&lt;/a&gt; and then a number of follow-on patches, from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-2827&quot; title=&quot;mdt_intent_fixup_resent() cannot find the proper lock in hash&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-2827&quot;&gt;&lt;del&gt;LU-2827&lt;/del&gt;&lt;/a&gt; to &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5579&quot; title=&quot;MDS crashed by &amp;quot;mdt_check_resent_lock()) ASSERTION( lock != NULL ) failed&amp;quot;&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5579&quot;&gt;&lt;del&gt;LU-5579&lt;/del&gt;&lt;/a&gt; and everything in between.&lt;/p&gt;

&lt;p&gt;Patch 10601 is purely informational and does not really improve the actual hanging situation, so it&apos;s OK to skip it for 2.4.3, but it also might be a good idea to add it should this matter require more investigation.&lt;/p&gt;</comment>
                            <comment id="109019" author="jaylan" created="Fri, 6 Mar 2015 02:35:33 +0000"  >&lt;p&gt;I had a conflict in applying #9488/9:&lt;/p&gt;

&lt;pre&gt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt; HEAD
        /* If the client does not require open lock, it does not need to
         * search lock in exp_lock_hash, since the server thread will
         * make sure the lock will be released, and the resend request
         * can always re-enqueue the lock */
        if ((opcode != MDT_IT_OPEN) || (opcode == MDT_IT_OPEN &amp;amp;&amp;amp;
            info-&amp;gt;mti_spec.sp_cr_flags &amp;amp; MDS_OPEN_LOCK)) {
                /* In the function below, .hs_keycmp resolves to
                 * ldlm_export_lock_keycmp() */
                /* coverity[overrun-buffer-val] */
                lock = cfs_hash_lookup(exp-&amp;gt;exp_lock_hash, &amp;amp;remote_hdl);
                if (lock) {
                        lock_res_and_lock(lock);
                        if (lock != new_lock) {
                                lh-&amp;gt;mlh_reg_lh.cookie = lock-&amp;gt;l_handle.h_cookie;
                                lh-&amp;gt;mlh_reg_mode = lock-&amp;gt;l_granted_mode;

                                LDLM_DEBUG(lock, &quot;Restoring lock cookie&quot;);
                                DEBUG_REQ(D_DLMTRACE, req,
                                          &quot;restoring lock cookie &quot;LPX64,
                                          lh-&amp;gt;mlh_reg_lh.cookie);
                                if (old_lock)
                                        *old_lock = LDLM_LOCK_GET(lock);
                                cfs_hash_put(exp-&amp;gt;exp_lock_hash,
                                             &amp;amp;lock-&amp;gt;l_exp_hash);
                                unlock_res_and_lock(lock);
                                return;
                        }
                        cfs_hash_put(exp-&amp;gt;exp_lock_hash, &amp;amp;lock-&amp;gt;l_exp_hash);
                        unlock_res_and_lock(lock);
                }
        }
=======
        /* In the function below, .hs_keycmp resolves to
         * ldlm_export_lock_keycmp() */
        /* coverity[overrun-buffer-val] */

        /* Look for first lock found in hash for key that is not new_lock.
           There should only be 2 upon resend, new_lock and the first/original
           one.
        */
        data.skip_lock = new_lock;
        cfs_hash_for_each_key(exp-&amp;gt;exp_lock_hash, &amp;amp;remote_hdl,
                                     not_skip_lock, &amp;amp;data);
        lock = data.found_lock;
        if (lock != NULL) {
                lh-&amp;gt;mlh_reg_lh.cookie = lock-&amp;gt;l_handle.h_cookie;
                lh-&amp;gt;mlh_reg_mode = lock-&amp;gt;l_granted_mode;

                LDLM_DEBUG(lock, &quot;Restoring lock cookie&quot;);
                DEBUG_REQ(D_DLMTRACE, req,
                          &quot;restoring lock cookie &quot;LPX64,
                          lh-&amp;gt;mlh_reg_lh.cookie);
                if (old_lock)
                        *old_lock = LDLM_LOCK_GET(lock);
                cfs_hash_put(exp-&amp;gt;exp_lock_hash, &amp;amp;lock-&amp;gt;l_exp_hash);
                return;
        }

&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; c695980... &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4584&quot; title=&quot;Lock revocation process fails consistently&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4584&quot;&gt;&lt;del&gt;LU-4584&lt;/del&gt;&lt;/a&gt; mdt: ensure orig lock is found in hash upon resend&lt;/pre&gt;

&lt;p&gt;Does this ring any bell to you? I guess the code in HEAD came from another patch we cherry-picked before or I missed another patch.&lt;/p&gt;</comment>
                            <comment id="109020" author="jaylan" created="Fri, 6 Mar 2015 02:36:57 +0000"  >&lt;p&gt;Geez, the formatting really screwed up in displaying &quot;*&quot; and indentation.&lt;/p&gt;

&lt;p&gt;The display will be correct when you enter edit mode.&lt;/p&gt;</comment>
                            <comment id="109021" author="green" created="Fri, 6 Mar 2015 03:19:21 +0000"  >&lt;p&gt;Patch 9488 is against b2_4, so there should not really be a conflict. (By the way, you can use (code)....(code) tags, but with curly brackets, to disable formatting on a piece of code when posting it in a comment.)&lt;/p&gt;

&lt;p&gt;I see that in your case it&apos;s the &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4403&quot; title=&quot;ASSERTION( lock-&amp;gt;l_readers &amp;gt; 0 )&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4403&quot;&gt;&lt;del&gt;LU-4403&lt;/del&gt;&lt;/a&gt; patch you are carrying that messed things up.&lt;/p&gt;

&lt;p&gt;Sadly it has a ton of whitespace changes, but if you do git show -b 08b397f5bf2561f2294315a9039b1930ce0695d5 on it, then you can see the real change.&lt;br/&gt;
Also, I have a passing memory that this patch was not really needed and only arose due to a bug fixed by patch 6511 (it was backed out in b2_5 as part of the LU-2827 series of patches).&lt;/p&gt;</comment>
                            <comment id="130668" author="mhanafi" created="Fri, 16 Oct 2015 18:14:24 +0000"  >&lt;p&gt;Please close this issue&lt;/p&gt;</comment>
                            <comment id="130671" author="pjones" created="Fri, 16 Oct 2015 18:18:53 +0000"  >&lt;p&gt;ok - thanks Mahmoud&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="17227" name="service100" size="1296013" author="mhanafi" created="Thu, 5 Mar 2015 22:35:52 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzx7t3:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>17749</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>