<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:18:47 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-1682] Memory allocation failure in mgc_enqueue() tends to cause LBUG and refcount issue</title>
                <link>https://jira.whamcloud.com/browse/LU-1682</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Hi, I believe I found some bugs in mgc_enqueue().&lt;br/&gt;
I&apos;m sure about the 1.8.x case because I&apos;ve often run into it in practice,&lt;br/&gt;
but I&apos;m not so sure about the 2.x case because it&apos;s based only on my source code&lt;br/&gt;
inspection, not on actually facing the problem.&lt;/p&gt;

&lt;p&gt;So could someone verify whether or not it&apos;s true in the 2.x case?&lt;br/&gt;
An overview of the bugs is given below.&lt;/p&gt;

&lt;p&gt;1) &lt;br/&gt;
mgc_blocking_ast() is never called in the case where mgc_enqueue()&lt;br/&gt;
fails to allocate memory for a new ldlm_lock for the cld, because there&lt;br/&gt;
is no ldlm_lock holding the callback functions due to the allocation failure.&lt;br/&gt;
That&apos;s why the cld remains, which causes the MGC&apos;s obd_device to remain&lt;br/&gt;
forever, even after the umount command is executed.&lt;/p&gt;

&lt;p&gt;IMHO, in order to fix the problem, we should call config_log_put() when&lt;br/&gt;
the above allocation failure happens. But we have no way to detect just&lt;br/&gt;
that failure, because ldlm_cli_enqueue() returns only -ENOMEM to the caller&lt;br/&gt;
and there are some other functions which also return -ENOMEM.&lt;/p&gt;

&lt;p&gt;Locally, I&apos;ve fixed it by adding one more argument, though my case&lt;br/&gt;
is FEFS, based on Lustre-1.8.5, and I don&apos;t think that approach is very good.&lt;/p&gt;

&lt;p&gt;So I&apos;d like to discuss the way to fix it.&lt;/p&gt;


&lt;p&gt;2) &lt;br/&gt;
In the case where mgc_enqueue() fails to allocate memory for a new&lt;br/&gt;
ptlrpc_request for the cld, failed_lock_cleanup() is called, and the call&lt;br/&gt;
path follows the list below ...&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;failed_lock_cleanup()
    ldlm_lock_decref_internal()
        ldlm_handle_bl_callback()
            mgc_blocking_ast()
                ldlm_cli_cancel()
                    ldlm_cli_cancel_local()
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Then it reaches ldlm_cli_cancel_local() and takes the else&lt;br/&gt;
branch below, because lock-&amp;gt;l_conn_export hasn&apos;t been&lt;br/&gt;
set yet.&lt;br/&gt;
But it&apos;s caught by LBUG because it&apos;s a client-side lock.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;static int ldlm_cli_cancel_local(struct ldlm_lock *lock)
{
        int rc = LDLM_FL_LOCAL_ONLY;
        ENTRY;

        if (lock-&amp;gt;l_conn_export) { &amp;lt;---- NULL in the case
                ....
        } else {
                if (ns_is_client(ldlm_lock_to_ns(lock))) { &amp;lt;---- This is a client side lock, then it will be True
                        LDLM_ERROR(lock, &quot;Trying to cancel local lock&quot;); &amp;lt;---- caught here
                        LBUG();
                }
                LDLM_DEBUG(lock, &quot;server-side local cancel&quot;);
                ldlm_lock_cancel(lock);
                ldlm_reprocess_all(lock-&amp;gt;l_resource);
        }

        RETURN(rc);
} 
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;So, I believe ldlm_cli_enqueue() should be modified as in the pseudocode below.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;+        lock-&amp;gt;l_conn_export = exp;
+        lock-&amp;gt;l_export = NULL;
+        lock-&amp;gt;l_blocking_ast = einfo-&amp;gt;ei_cb_bl;

        if (reqp == NULL || *reqp == NULL) {
                req = ptlrpc_request_alloc_pack(class_exp2cliimp(exp),
                                                &amp;amp;RQF_LDLM_ENQUEUE,
                                                LUSTRE_DLM_VERSION,
                                                LDLM_ENQUEUE);
                if (req == NULL) {
                        failed_lock_cleanup(ns, lock, einfo-&amp;gt;ei_mode);
                        LDLM_LOCK_RELEASE(lock);
                        RETURN(-ENOMEM);
                }
                req_passed_in = 0;
                if (reqp)
                        *reqp = req;
        } else {
                int len;

                req = *reqp;
                len = req_capsule_get_size(&amp;amp;req-&amp;gt;rq_pill, &amp;amp;RMF_DLM_REQ,
                                           RCL_CLIENT);
                LASSERTF(len &amp;gt;= sizeof(*body), &quot;buflen[%d] = %d, not %d\n&quot;,
                         DLM_LOCKREQ_OFF, len, (int)sizeof(*body));
        }

-        lock-&amp;gt;l_conn_export = exp;
-        lock-&amp;gt;l_export = NULL;
-        lock-&amp;gt;l_blocking_ast = einfo-&amp;gt;ei_cb_bl;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</description>
                <environment>Originally it happened with FEFS, based on Lustre-1.8.5 in RIKEN K computer environment &lt;br/&gt;
MDSx1, OSSx2592, OSTx5184, Clientx84672</environment>
        <key id="15330">LU-1682</key>
            <summary>Memory allocation failure in mgc_enqueue() tends to cause LBUG and refcount issue</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="6" iconUrl="https://jira.whamcloud.com/images/icons/statuses/closed.png" description="The issue is considered finished, the resolution is correct. Issues which are closed can be reopened.">Closed</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="2">Won&apos;t Fix</resolution>
                                        <assignee username="jay">Jinshan Xiong</assignee>
                                    <reporter username="nozaki">Hiroya Nozaki</reporter>
                        <labels>
                            <label>patch</label>
                    </labels>
                <created>Fri, 27 Jul 2012 00:35:00 +0000</created>
                <updated>Thu, 8 Feb 2018 18:16:09 +0000</updated>
                            <resolved>Thu, 8 Feb 2018 18:16:09 +0000</resolved>
                                    <version>Lustre 2.1.0</version>
                    <version>Lustre 2.3.0</version>
                    <version>Lustre 1.8.8</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>2</watches>
                                                                            <comments>
                            <comment id="42368" author="green" created="Fri, 27 Jul 2012 01:09:42 +0000"  >&lt;p&gt;Thanks for uncovering this problem.&lt;/p&gt;

&lt;p&gt;I see that the code in 2.x is different from that in 1.8: the cld get is done after the mgc_enqueue call in 2.x.&lt;br/&gt;
In this case what needs to be done is, in mgc_process_log(), to check that lockh is still 0 (there is a function to check that the lock handle is valid), and if so skip the get.&lt;/p&gt;

&lt;p&gt;In 1.8 you can check that the lockh is not 0 after ldlm_cli_enqueue in mgc_enqueue and if it is still 0 (well, again, the not valid function), drop the cld refcount.&lt;/p&gt;

&lt;p&gt;I think this approach should help for issue #1.&lt;/p&gt;

&lt;p&gt;For issue #2, 2.x is unaffected in this particular case because we allocate the request before calling ldlm_cli_enqueue.&lt;/p&gt;

&lt;p&gt;But otherwise the bug is real and could hit in some other scenarios. As such please submit a separate patch just following your pseudocode for master (and b1_8 too).&lt;/p&gt;

&lt;p&gt;Thanks again!&lt;/p&gt;</comment>
                            <comment id="42370" author="nozaki" created="Fri, 27 Jul 2012 01:51:53 +0000"  >&lt;p&gt;Hi, Oleg. It&apos;s been a while and thank you for your quick response!!&lt;/p&gt;

&lt;p&gt;1)&lt;br/&gt;
Oh, sorry. Apparently I had read slightly older code, Lustre-2.1.1 or so.&lt;br/&gt;
As you said, config_log_get() is called only in the case where&lt;br/&gt;
ldlm_cli_enqueue() returns successfully.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In 1.8 you can check that the lockh is not 0 after ldlm_cli_enqueue&lt;br/&gt;
in mgc_enqueue and if it is still 0 (well, again, the not valid function),&lt;br/&gt;
drop the cld refcount.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;I got it, it sounds good. I&apos;ll try it from now on. Thanks a lot!!&lt;/p&gt;

&lt;p&gt;2) &lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;But otherwise the bug is real and could hit in some other scenarios. &lt;br/&gt;
As such please submit a separate patch just following your pseudocode&lt;br/&gt;
for master (and b1_8 too).&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;OK, so I&apos;ll create and submit the new patch soon.&lt;/p&gt;</comment>
                            <comment id="42373" author="green" created="Fri, 27 Jul 2012 02:02:45 +0000"  >&lt;blockquote&gt;
&lt;p&gt;Oh, sorry. Apparently I had read slightly older code, Lustre-2.1.1 or so.&lt;br/&gt;
As you said, config_log_get() is called only in the case where&lt;br/&gt;
ldlm_cli_enqueue() returns successfully&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;Sorry, what I mean is while the config_log_get() is called after mgc_enqueue in master, it&apos;s still called regardless of success&lt;br/&gt;
(you can see that the call is there in both branches of that if), so the bug is there and it still needs to be fixed, just in a different place.&lt;/p&gt;</comment>
                            <comment id="42374" author="nozaki" created="Fri, 27 Jul 2012 02:38:42 +0000"  >&lt;blockquote&gt;
&lt;p&gt;Sorry, what I mean is while the config_log_get() is called after mgc_enqueue in master, it&apos;s still called regardless of success&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;Wow, that&apos;s true. I seem to have jumped to conclusions ...&lt;/p&gt;

&lt;p&gt;OK, so what I&apos;m supposed to do is try to fix the #1 bug in both the 1.8&lt;br/&gt;
and 2.x cases, right? If so, I&apos;ll try to fix it from now on.&lt;/p&gt;</comment>
                            <comment id="42375" author="green" created="Fri, 27 Jul 2012 02:42:59 +0000"  >&lt;p&gt;Yes, it would be great if you can submit patches for both issues for both branches.&lt;/p&gt;</comment>
                            <comment id="42381" author="nozaki" created="Fri, 27 Jul 2012 04:10:09 +0000"  >&lt;p&gt;for the #2 bug, b1_8 branch&lt;br/&gt;
&lt;a href=&quot;http://review.whamcloud.com/#change,3487&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#change,3487&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="42382" author="nozaki" created="Fri, 27 Jul 2012 05:09:34 +0000"  >&lt;p&gt;for the #2 bug, master branch&lt;br/&gt;
&lt;a href=&quot;http://review.whamcloud.com/#change,3488&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#change,3488&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="42388" author="nozaki" created="Fri, 27 Jul 2012 07:22:38 +0000"  >&lt;p&gt;For the #1 bug, b1_8 branch.&lt;br/&gt;
&lt;a href=&quot;http://review.whamcloud.com/#change,3489&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#change,3489&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="42389" author="nozaki" created="Fri, 27 Jul 2012 07:56:40 +0000"  >&lt;p&gt;For the #1 bug, master branch&lt;br/&gt;
&lt;a href=&quot;http://review.whamcloud.com/#change,3490&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#change,3490&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I especially want you all to review this patch,&lt;br/&gt;
because I&apos;m not so experienced with Lustre-2.x code yet.&lt;/p&gt;</comment>
                            <comment id="42804" author="nozaki" created="Tue, 7 Aug 2012 06:56:05 +0000"  >&lt;p&gt;Now I&apos;ve been investigating a weird dump of FEFS, based on Lustre-1.8.5.&lt;br/&gt;
In it, there are two config_llog_datas in config_llog_list, and each cld has&lt;br/&gt;
cld_refcount=1 and cld_lostlock=1, whereas rq_state is 0xf (RQ_RUNNING|RQ_NOW|RQ_LATER|RQ_STOP),&lt;br/&gt;
but there is no mgc_requeue_thread.&lt;/p&gt;

&lt;p&gt;Do you know of any similar phenomena? Any information would be helpful for me to proceed.&lt;br/&gt;
By the way, we haven&apos;t added anything to the mgc_enqueue logic.&lt;/p&gt;</comment>
                            <comment id="220449" author="jay" created="Thu, 8 Feb 2018 18:16:09 +0000"  >&lt;p&gt;close old tickets&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzv62f:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>4517</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>