<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:22:05 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-2067] ldlm_resource_complain()) Namespace MGC resource refcount nonzero after lock cleanup</title>
                <link>https://jira.whamcloud.com/browse/LU-2067</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;During some tests I notice the following debug message on the console, and I suspect it is a sign of a resource leak in some code path that should be cleaned up.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;LustreError: 22089:0:(ldlm_resource.c:761:ldlm_resource_complain()) Namespace MGC192.168.20.154@tcp resource refcount nonzero (1) after lock cleanup; forcing cleanup.
LustreError: 22089:0:(ldlm_resource.c:767:ldlm_resource_complain()) Resource: ffff880058917200 (126883877578100/0/0/0) (rc: 0)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;On my system, the LDLM resource ID is always the same - 126883877578100 = 0x736674736574, which happens to be the ASCII encoding (in reverse byte order) of the Lustre fsname of the filesystem being tested, &quot;testfs&quot;.&lt;/p&gt;

&lt;p&gt;I don&apos;t know for certain when the problem started, but it appears in my /var/log/messages file as far back as I have records of Lustre testing on this machine, 2012/09/02.&lt;/p&gt;

&lt;p&gt;The tests that report this include:&lt;/p&gt;
&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;replay-single:
	&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
		&lt;li&gt;test_0c&lt;/li&gt;
		&lt;li&gt;test_10&lt;/li&gt;
		&lt;li&gt;test_13&lt;/li&gt;
		&lt;li&gt;test_14&lt;/li&gt;
		&lt;li&gt;test_15&lt;/li&gt;
		&lt;li&gt;test_17&lt;/li&gt;
		&lt;li&gt;test_19&lt;/li&gt;
		&lt;li&gt;test_22&lt;/li&gt;
		&lt;li&gt;test_24&lt;/li&gt;
		&lt;li&gt;test_28&lt;/li&gt;
		&lt;li&gt;test_53b&lt;/li&gt;
		&lt;li&gt;test_59&lt;/li&gt;
	&lt;/ul&gt;
	&lt;/li&gt;
	&lt;li&gt;replay-dual
	&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
		&lt;li&gt;test_5&lt;/li&gt;
		&lt;li&gt;test_6&lt;/li&gt;
		&lt;li&gt;test_9&lt;/li&gt;
	&lt;/ul&gt;
	&lt;/li&gt;
	&lt;li&gt;insanity
	&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
		&lt;li&gt;test_0&lt;/li&gt;
	&lt;/ul&gt;
	&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Note that in my older runs (2012-09-10) the list of tests is very similar, but not exactly the same.  I don&apos;t know if this indicates that the failure is due to a race condition (so it only hits on a percentage of tests), or if the leak happens differently in the newer code.&lt;/p&gt;</description>
                <environment>Single node test system (MGS, MDS, OSS, client on one node), x86_64, 2GB RAM, Lustre v2_3_51_0-2-g0810df3</environment>
        <key id="16203">LU-2067</key>
            <summary>ldlm_resource_complain()) Namespace MGC resource refcount nonzero after lock cleanup</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="5">Cannot Reproduce</resolution>
                                        <assignee username="wc-triage">WC Triage</assignee>
                                    <reporter username="adilger">Andreas Dilger</reporter>
                        <labels>
                            <label>llnl</label>
                            <label>shh</label>
                    </labels>
                <created>Mon, 1 Oct 2012 16:05:35 +0000</created>
                <updated>Mon, 29 May 2017 06:22:12 +0000</updated>
                            <resolved>Mon, 29 May 2017 06:22:12 +0000</resolved>
                                                                        <due></due>
                            <votes>0</votes>
                                    <watches>9</watches>
                                                                            <comments>
                            <comment id="45818" author="adilger" created="Mon, 1 Oct 2012 16:06:30 +0000"  >&lt;p&gt;Added Jinshan since I suspect this is related to IR, and Liang since he added the original checking code.&lt;/p&gt;</comment>
                            <comment id="45823" author="jay" created="Mon, 1 Oct 2012 18:15:28 +0000"  >&lt;p&gt;it looks like there is a race somewhere. The first message showed the refcount was 2, and it had dropped to 1 (expected) by the 2nd message.&lt;/p&gt;</comment>
                            <comment id="48506" author="prakash" created="Wed, 28 Nov 2012 18:26:39 +0000"  >&lt;p&gt;We see this constantly on our production 2.1 based systems.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# zgrep resource /var/log/lustre.log                                            
Nov 28 03:19:12 cab1173 kernel: LustreError: 68932:0:(ldlm_resource.c:749:ldlm_resource_complain()) Namespace lsc-OST00a9-osc-ffff880823503800 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
Nov 28 03:19:12 cab1173 kernel: LustreError: 68932:0:(ldlm_resource.c:755:ldlm_resource_complain()) Resource: ffff88042fefb300 (13829255/0/0/0) (rc: 1)
Nov 28 03:33:49 cab994 kernel: LustreError: 22038:0:(ldlm_resource.c:749:ldlm_resource_complain()) Namespace lsc-OST0052-osc-ffff880823f84000 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
Nov 28 03:33:49 cab994 kernel: LustreError: 22038:0:(ldlm_resource.c:755:ldlm_resource_complain()) Resource: ffff8802e18b3d40 (14157121/0/0/0) (rc: 1)
Nov 28 04:37:50 cab623 kernel: LustreError: 65860:0:(ldlm_resource.c:749:ldlm_resource_complain()) Namespace lsc-OST0057-osc-ffff88040742d800 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
Nov 28 04:37:50 cab623 kernel: LustreError: 65860:0:(ldlm_resource.c:755:ldlm_resource_complain()) Resource: ffff88035bf2ce40 (13850406/0/0/0) (rc: 1)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="54150" author="morrone" created="Fri, 15 Mar 2013 19:05:01 +0000"  >&lt;p&gt;We need to silence this message.  Until the race is fixed, I am redirecting it to D_DLMTRACE in change &lt;a href=&quot;http://review.whamcloud.com/5736&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;5736&lt;/a&gt;.&lt;/p&gt;</comment>
                            <comment id="54508" author="adilger" created="Wed, 20 Mar 2013 20:25:42 +0000"  >&lt;p&gt;Marking this ALWAYS_EXCEPT, since we aren&apos;t really fixing the problem, just silencing the warning message.  I do NOT want this bug closed if Chris&apos; patch is landed.&lt;/p&gt;</comment>
                            <comment id="61287" author="niu" created="Tue, 25 Jun 2013 05:33:37 +0000"  >&lt;p&gt;I investigated this a bit when I was working on &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3460&quot; title=&quot;recovery-small test_51 timeout: lqe_iter_cb(): Inuse quota entry&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3460&quot;&gt;&lt;del&gt;LU-3460&lt;/del&gt;&lt;/a&gt;; it looks possible that locks still have a reader/writer reference when ldlm_namespace_cleanup() is called. The following is a comment from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3460&quot; title=&quot;recovery-small test_51 timeout: lqe_iter_cb(): Inuse quota entry&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3460&quot;&gt;&lt;del&gt;LU-3460&lt;/del&gt;&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;It is possible that lock reader/writer isn&apos;t dropped to zero when ldlm_namespace_cleanup() is called, imagine following scenario:&lt;/p&gt;

&lt;p&gt;&#9632;ldlm_cli_enqueue() is called to create the lock, and increases the lock reader/writer count;&lt;br/&gt;
&#9632;before the enqueue request is added to imp_sending_list or imp_delayed_list, shutdown happens;&lt;br/&gt;
&#9632;the shutdown procedure aborts inflight RPCs, but the enqueue request can&apos;t be aborted since it&apos;s neither on the sending list nor the delayed list;&lt;br/&gt;
&#9632;the shutdown procedure moves on to obd_import_event(IMP_EVENT_ACTIVE)-&amp;gt;ldlm_namespace_cleanup() to clean up all locks;&lt;br/&gt;
&#9632;ldlm_namespace_cleanup() finds that the lock just created still has 1 reader/writer, because the interpret callback for this lock enqueue hasn&apos;t been called yet (where the reader/writer is dropped);&lt;/p&gt;

&lt;p&gt;That&apos;s why we can see the warning message from ldlm_namespace_cleanup(), though the lock will eventually be released.&lt;/p&gt;</comment>
                            <comment id="88582" author="lixi" created="Wed, 9 Jul 2014 13:14:46 +0000"  >&lt;p&gt;Hi Yawei,&lt;/p&gt;

&lt;p&gt;Your analysis is really interesting. Is it possible to make a patch to fix that race problem? ldlm_resource_dump() didn&apos;t print any granted/converting/waiting locks, which means the locks holding the resource were still being enqueued, right?&lt;/p&gt;</comment>
                            <comment id="88676" author="niu" created="Thu, 10 Jul 2014 07:28:06 +0000"  >&lt;blockquote&gt;
&lt;p&gt;Your analysis is really interesting. Is it possible to make a patch to fix that race problem? ldlm_resource_dump() didn&apos;t print any granted/converting/waiting locks, which means the locks holding the resource were still being enqueued, right?&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;Yes, I suppose the lock hadn&apos;t been put onto any list yet. I haven&apos;t figured out a good solution so far.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="41253">LU-8792</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzv4u7:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>4317</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>