<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:03:19 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-59] call traces on MDS for ldlm_expired_completion_wait()</title>
                <link>https://jira.whamcloud.com/browse/LU-59</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;The call traces happened on MDS for ldlm_expired_completion_wait() and status was changed to LBUG.&lt;br/&gt;
This seems to be similar to bug 21967, but there are no patches for lustre-1.8.x right now.&lt;br/&gt;
I&apos;m attaching the console log from the MDS. Could you please find out whether this is the same bug as 21967? If yes, please also backport the patch from 21967 to 1.8.x.&lt;/p&gt;

&lt;p&gt;Thanks&lt;br/&gt;
Ihara &lt;/p&gt;</description>
                <environment></environment>
        <key id="10329">LU-59</key>
            <summary>call traces on MDS for ldlm_expired_completion_wait()</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="niu">Niu Yawei</assignee>
                                    <reporter username="ihara">Shuichi Ihara</reporter>
                        <labels>
                    </labels>
                <created>Fri, 4 Feb 2011 02:47:17 +0000</created>
                <updated>Tue, 28 Jun 2011 15:01:37 +0000</updated>
                            <resolved>Mon, 13 Jun 2011 19:32:00 +0000</resolved>
                                    <version>Lustre 1.8.6</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>1</watches>
                                                                            <comments>
                            <comment id="10526" author="pjones" created="Fri, 4 Feb 2011 08:25:48 +0000"  >&lt;p&gt;Niu&lt;/p&gt;

&lt;p&gt;Could you please look into this one?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="10548" author="niu" created="Mon, 7 Feb 2011 19:42:34 +0000"  >&lt;p&gt;Hi, Ihara&lt;/p&gt;

&lt;p&gt;It looks similar to bug 21967; the difference is that the timeout value in this trace is much longer than 21967&apos;s. But I didn&apos;t see any patches on bug 21967, so it seems to be an unresolved issue.&lt;/p&gt;

&lt;p&gt;BTW: In your test, is the dynamic timeout feature enabled?&lt;/p&gt;</comment>
                            <comment id="10549" author="ihara" created="Mon, 7 Feb 2011 20:40:12 +0000"  >&lt;p&gt;Ah, 21967 was closed for some reason. I thought the problem was related to bug 22598, for which there are no patches on the 1.8.x branch.&lt;/p&gt;

&lt;p&gt;By the dynamic timeout feature, do you mean Adaptive Timeouts? If so, yes: I didn&apos;t disable AT intentionally, so it should still be enabled by default.&lt;/p&gt;</comment>
                            <comment id="10550" author="niu" created="Mon, 7 Feb 2011 22:53:23 +0000"  >&lt;p&gt;There are some statistic/diagnostic patches and one &apos;disable COS by default&apos; patch in bug 22598, and I don&apos;t think 1.8 has COS, so the patches in 22598 might not be helpful for this issue.&lt;/p&gt;

&lt;p&gt;Yes, I meant adaptive timeout. Thank you.&lt;/p&gt;</comment>
                            <comment id="10563" author="niu" created="Tue, 8 Feb 2011 23:38:45 +0000"  >&lt;p&gt;Hi, Ihara&lt;/p&gt;

&lt;p&gt;&quot;The call traces happened on MDS for ldlm_expired_completion_wait() and status was changed to LBUG&quot;&lt;br/&gt;
I don&apos;t quite understand the &quot;status was changed to LBUG&quot; part; could you explain further?&lt;/p&gt;

&lt;p&gt;The log shows that many server threads were blocked on local lock enqueue for a long time, which triggered the watchdog to dump the stack traces. I suspect that some client holding locks was evicted by the server (perhaps a dead or hung client), and the server&apos;s local lock enqueue sent a blocking AST to the evicted client. The blocking AST should have timed out quickly, but for some reason it didn&apos;t expire in time, leaving the server threads waiting for a long time until the watchdog finally fired.&lt;/p&gt;

&lt;p&gt;Is it easy to reproduce? If so, could you turn off Adaptive Timeouts to see whether the problem goes away? Thanks.&lt;/p&gt;</comment>
                            <comment id="10607" author="ihara" created="Thu, 10 Feb 2011 05:58:10 +0000"  >&lt;p&gt;Niu, sorry; MDS crashes (or hangs) happen frequently at the customer site. One of the reasons might be &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-27&quot; title=&quot;(mds_open.c:1667:mds_close()) @@@ no handle for file close ino&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-27&quot;&gt;&lt;del&gt;LU-27&lt;/del&gt;&lt;/a&gt;, which caused an LBUG. Also, we might be hitting bug 23352 on TCP clients. I saw these problems sometimes happen at the same time. So I thought this case was also an LBUG, but you are right: there is no LBUG in these console messages.&lt;/p&gt;</comment>

&lt;p&gt;We applied the patches from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-27&quot; title=&quot;(mds_open.c:1667:mds_close()) @@@ no handle for file close ino&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-27&quot;&gt;&lt;del&gt;LU-27&lt;/del&gt;&lt;/a&gt; and bug 23352 and are keeping an eye on whether we still see the same problem after applying them. Do you think these error messages could be caused by &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-27&quot; title=&quot;(mds_open.c:1667:mds_close()) @@@ no handle for file close ino&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-27&quot;&gt;&lt;del&gt;LU-27&lt;/del&gt;&lt;/a&gt; or bug 23352?&lt;/p&gt;</comment>
                            <comment id="10608" author="niu" created="Thu, 10 Feb 2011 06:42:57 +0000"  >&lt;p&gt;Thank you, Ihara. I think the fix in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-27&quot; title=&quot;(mds_open.c:1667:mds_close()) @@@ no handle for file close ino&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-27&quot;&gt;&lt;del&gt;LU-27&lt;/del&gt;&lt;/a&gt; doesn&apos;t help much with this issue, whereas the at_min fix in b23352 can avoid unnecessary client evictions, which might reduce the chance of seeing such error messages. I think you can try that patch to see if things get better.&lt;/p&gt;

&lt;p&gt;What I don&apos;t understand is why the lock enqueue didn&apos;t time out promptly (and eventually triggered the watchdog). I will investigate the Adaptive Timeout further and get back to you later.&lt;/p&gt;</comment>
                            <comment id="10695" author="niu" created="Sun, 20 Feb 2011 22:03:35 +0000"  >&lt;p&gt;Hi, Ihara&lt;/p&gt;

&lt;p&gt;What about the test results after the b23352 fix was applied?&lt;/p&gt;</comment>
                            <comment id="10696" author="ihara" created="Mon, 21 Feb 2011 00:22:57 +0000"  >&lt;p&gt;Niu, &lt;/p&gt;

&lt;p&gt;We haven&apos;t seen the same issue since applying the patch from b23352.&lt;br/&gt;
But are there any fixes needed in the adaptive timeout area?&lt;/p&gt;</comment>
                            <comment id="10697" author="niu" created="Mon, 21 Feb 2011 00:37:33 +0000"  >&lt;p&gt;With adaptive timeout, the lock callback timeout could be very long (the default maximum is 600 seconds), and consequently the server working thread might wait in ldlm_cli_enqueue_local() for a very long time, which triggered the watchdog to dump the stack trace in the end. So I think it&apos;s not necessarily a bug.&lt;/p&gt;</comment>
                            <comment id="16090" author="pjones" created="Mon, 13 Jun 2011 13:55:03 +0000"  >&lt;p&gt;Ihara&lt;/p&gt;

&lt;p&gt;Do you have any further questions or can we close out this ticket?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="16133" author="ihara" created="Mon, 13 Jun 2011 19:26:56 +0000"  >&lt;p&gt;Fine. As far as we can see, the problem seems to be fixed by the patch in b23352.&lt;/p&gt;

&lt;p&gt;thanks!&lt;/p&gt;</comment>
                            <comment id="16135" author="pjones" created="Mon, 13 Jun 2011 19:32:00 +0000"  >&lt;p&gt;Great - thanks Ihara!&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="10103" name="t2s007019.console_log" size="377138" author="ihara" created="Fri, 4 Feb 2011 02:47:17 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                    <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzw01z:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>10088</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>