<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:23:01 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-2177] ldlm_flock_completion_ast causes LBUG because of a race</title>
                <link>https://jira.whamcloud.com/browse/LU-2177</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;I believe I have found a possible race condition between ldlm_cli_enqueue_fini() and cleanup_resource() when they handle flocks. The scenario is as follows:&lt;/p&gt;

&lt;p&gt;thread A: flock&lt;br/&gt;
1) In ldlm_cli_enqueue_fini(), ldlm_lock_enqueue() has returned successfully. This means that the lock being handled has been registered on one of the lists in an ldlm_resource, lr_granted or lr_waiting. It also means that the spinlock protecting the flock&apos;s ldlm_resource has already been released, because ldlm_lock_enqueue() calls unlock_res_and_lock() right before returning.&lt;/p&gt;

&lt;p&gt;thread B: evict&lt;br/&gt;
2) For some reason an eviction has been triggered, and it will call cleanup_resource() on the same lock that ldlm_cli_enqueue_fini() is handling, because the lock is already registered on the list in the flock&apos;s ldlm_resource and the spinlock protecting that ldlm_resource has already been released.&lt;/p&gt;

&lt;p&gt;thread B: evict&lt;br/&gt;
3) Since the lock has been granted successfully, l_writers or l_readers must be at least 1. So, after checking these counters and releasing the resource lock, cleanup_resource() will call ldlm_flock_completion_ast().&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeHeader panelHeader&quot; style=&quot;border-bottom-width: 1px;&quot;&gt;&lt;b&gt;cleanup_resource()&lt;/b&gt;&lt;/div&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;&lt;span class=&quot;code-keyword&quot;&gt;static&lt;/span&gt; void cleanup_resource(struct ldlm_resource *res, cfs_list_t *q,
                             &lt;span class=&quot;code-object&quot;&gt;int&lt;/span&gt; flags)
{

                ...

                /* Set CBPENDING so nothing in the cancellation path
                 * can match &lt;span class=&quot;code-keyword&quot;&gt;this&lt;/span&gt; lock */
                lock-&amp;gt;l_flags |= LDLM_FL_CBPENDING;
                lock-&amp;gt;l_flags |= LDLM_FL_FAILED;
                lock-&amp;gt;l_flags |= flags;

                &lt;span class=&quot;code-comment&quot;&gt;/* ... without sending a CANCEL message &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; local_only. */&lt;/span&gt;
                &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (local_only)
                        lock-&amp;gt;l_flags |= LDLM_FL_LOCAL_ONLY;

                &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (local_only &amp;amp;&amp;amp; (lock-&amp;gt;l_readers || lock-&amp;gt;l_writers)) {
                        /* This is a little bit gross, but much better than the
                         * alternative: pretend that we got a blocking AST from
                         * the server, so that when the lock is decref&apos;d, it
                         * will go away ... */
                        unlock_res(res);
                        LDLM_DEBUG(lock, &lt;span class=&quot;code-quote&quot;&gt;&quot;setting FL_LOCAL_ONLY&quot;&lt;/span&gt;);
                        &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (lock-&amp;gt;l_completion_ast)
                                lock-&amp;gt;l_completion_ast(lock, 0, NULL);
                        LDLM_LOCK_RELEASE(lock);
                        &lt;span class=&quot;code-keyword&quot;&gt;continue&lt;/span&gt;;
                }

                ...
}
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
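
&lt;p&gt;The decref branch of ldlm_flock_completion_ast() is taken only when both LDLM_FL_FAILED and LDLM_FL_LOCAL_ONLY are set, which is exactly the combination cleanup_resource() applies above. A minimal model of that flag test, using a set of flag names rather than the real LDLM_FL_* bitmask (an illustration only, not Lustre code):&lt;/p&gt;

```python
# Model of the flag test in ldlm_flock_completion_ast(). Flag names stand
# in for the real LDLM_FL_* bit values (hypothetical representation).

DECREF_FLAGS = frozenset(["FAILED", "LOCAL_ONLY"])

def takes_decref_branch(l_flags):
    """True when both FAILED and LOCAL_ONLY are set, i.e. after
    cleanup_resource() has marked the lock for local-only teardown."""
    return DECREF_FLAGS.issubset(l_flags)

flags = set()
print(takes_decref_branch(flags))        # False: normal completion path

# cleanup_resource() sets CBPENDING, FAILED and, for local_only, LOCAL_ONLY:
flags.update(["CBPENDING", "FAILED", "LOCAL_ONLY"])
print(takes_decref_branch(flags))        # True: decref branch taken
```

&lt;p&gt;Once the evict path has set these flags, any subsequent caller of the completion AST will also fall into the decref branch, which is what makes the double call below harmful.&lt;/p&gt;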


&lt;p&gt;thread A: flock&lt;br/&gt;
4) After ldlm_lock_enqueue() returns, ldlm_cli_enqueue_fini() will also call ldlm_flock_completion_ast(), because there is no conditional between ldlm_lock_enqueue() and the completion-AST call.&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeHeader panelHeader&quot; style=&quot;border-bottom-width: 1px;&quot;&gt;&lt;b&gt;ldlm_cli_enqueue_fini()&lt;/b&gt;&lt;/div&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;&lt;span class=&quot;code-object&quot;&gt;int&lt;/span&gt; ldlm_cli_enqueue_fini(struct obd_export *exp, struct ptlrpc_request *req,
                          ldlm_type_t type, __u8 with_policy, ldlm_mode_t mode,
                          &lt;span class=&quot;code-object&quot;&gt;int&lt;/span&gt; *flags, void *lvb, __u32 lvb_len,
                          struct lustre_handle *lockh,&lt;span class=&quot;code-object&quot;&gt;int&lt;/span&gt; rc)
{

        ...

        &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (!is_replay) {
                rc = ldlm_lock_enqueue(ns, &amp;amp;lock, NULL, flags);
                &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (lock-&amp;gt;l_completion_ast != NULL) {
                        &lt;span class=&quot;code-object&quot;&gt;int&lt;/span&gt; err = lock-&amp;gt;l_completion_ast(lock, *flags, NULL);
                        &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (!rc)
                                rc = err;
                        &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (rc)
                                cleanup_phase = 1;
                }
        }

        ...

}
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Result: both threads reach the decref path&lt;br/&gt;
5) Since cleanup_resource() has set the lock&apos;s l_flags so that ldlm_flock_completion_ast() calls ldlm_lock_decref_internal(), ldlm_lock_decref_internal() ends up being called twice, and this race triggers the LBUG() in ldlm_lock_decref_internal().&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeHeader panelHeader&quot; style=&quot;border-bottom-width: 1px;&quot;&gt;&lt;b&gt;ldlm_flock_completion_ast()&lt;/b&gt;&lt;/div&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;ldlm_flock_completion_ast(struct ldlm_lock *lock, &lt;span class=&quot;code-object&quot;&gt;int&lt;/span&gt; flags, void *data)
{
        cfs_flock_t                    *getlk = lock-&amp;gt;l_ast_data;
        struct obd_device              *obd;
        struct obd_import              *imp = NULL;
        struct ldlm_flock_wait_data     fwd;
        struct l_wait_info              lwi;
        ldlm_error_t                    err;
        &lt;span class=&quot;code-object&quot;&gt;int&lt;/span&gt;                             rc = 0;
        ENTRY;

        CDEBUG(D_DLMTRACE, &lt;span class=&quot;code-quote&quot;&gt;&quot;flags: 0x%x data: %p getlk: %p\n&quot;&lt;/span&gt;,
               flags, data, getlk);

        /* Import invalidation. We need to actually release the lock
         * references being held, so that it can go away. No point in
         * holding the lock even &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; app still believes it has it, since
         * server already dropped it anyway. Only &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; granted locks too. */
        &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; ((lock-&amp;gt;l_flags &amp;amp; (LDLM_FL_FAILED|LDLM_FL_LOCAL_ONLY)) ==
            (LDLM_FL_FAILED|LDLM_FL_LOCAL_ONLY)) {
                &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (lock-&amp;gt;l_req_mode == lock-&amp;gt;l_granted_mode &amp;amp;&amp;amp;
                    lock-&amp;gt;l_granted_mode != LCK_NL &amp;amp;&amp;amp;
                    NULL == data)
                        ldlm_lock_decref_internal(lock, lock-&amp;gt;l_req_mode);

                &lt;span class=&quot;code-comment&quot;&gt;/* Need to wake up the waiter &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; we were evicted */&lt;/span&gt;
                cfs_waitq_signal(&amp;amp;lock-&amp;gt;l_waitq);
                RETURN(0);
        }
        ....
}
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
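
&lt;p&gt;The net effect of steps 1)-5) is that both paths drop the same lock reference. A minimal, single-process model of the double decref, together with one way to serialize it using a once-only guard (hypothetical names; this illustrates the hazard, not the actual Lustre fix):&lt;/p&gt;

```python
# Minimal model of the double-decref race (not Lustre code). Both the
# evict path and the enqueue-fini path run the completion AST, and each
# one performs a decref on the same lock.
import threading

class MiniLock:
    def __init__(self):
        self.refs = 1            # models l_readers + l_writers
        self.decref_done = False
        self._mutex = threading.Lock()

    def decref_unguarded(self):
        """The second caller underflows the refcount: the LBUG condition."""
        self.refs -= 1
        if self.refs == 0:
            return 0
        return -1                # refcount went negative: LBUG

    def decref_guarded(self):
        """Let exactly one caller perform the decref."""
        with self._mutex:
            if self.decref_done:
                return 0         # the other path already dropped the ref
            self.decref_done = True
        return self.decref_unguarded()

a = MiniLock()
print(a.decref_unguarded())      # 0: thread A (flock) drops the ref
print(a.decref_unguarded())      # -1: thread B (evict) underflows, i.e. LBUG

b = MiniLock()
print(b.decref_guarded())        # 0: first caller drops the ref
print(b.decref_guarded())        # 0: second call is a harmless no-op
```

&lt;p&gt;Any fix along these lines must ensure that only one of the two paths actually performs the decref, however the two threads interleave.&lt;/p&gt;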


&lt;p&gt;Below is the dlmtrace. It shows that ldlm_lock_decref_internal_nolock() was applied to the same lock twice, although its rw refcount had already reached zero.&lt;/p&gt;

&lt;p&gt;PID 5827: flock thread&lt;br/&gt;
PID 5829: evict thread&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00010000:00000001:13:1350018852.706543:0:5827:0:(ldlm_request.c:409:ldlm_cli_enqueue_fini()) Process entered
00010000:00000001:13:1350018852.706543:0:5827:0:(ldlm_lock.c:451:__ldlm_handle2lock()) Process entered
00000020:00000001:13:1350018852.706544:0:5827:0:(lustre_handles.c:172:class_handle2object()) Process entered
00000020:00000001:13:1350018852.706545:0:5827:0:(lustre_handles.c:195:class_handle2object()) Process leaving (rc=18446604488316264960 : -139585393286656 : ffff810c3e37a200)
00010000:00000001:13:1350018852.706548:0:5827:0:(ldlm_resource.c:1105:ldlm_resource_putref_internal()) Process entered
00010000:00000001:13:1350018852.706549:0:5827:0:(ldlm_resource.c:1118:ldlm_resource_putref_internal()) Process leaving (rc=0 : 0 : 0)
00010000:00000001:13:1350018852.706551:0:5827:0:(ldlm_lock.c:502:__ldlm_handle2lock()) Process leaving
00010000:00000001:13:1350018852.706552:0:5827:0:(ldlm_lock.c:1217:ldlm_lock_enqueue()) Process entered
00010000:00000001:13:1350018861.719601:0:5827:0:(ldlm_lock.c:895:ldlm_grant_lock()) Process entered
00010000:00000001:13:1350018861.719602:0:5827:0:(ldlm_lock.c:914:ldlm_grant_lock()) Process leaving
00010000:00000001:13:1350018861.719603:0:5827:0:(ldlm_lock.c:1329:ldlm_lock_enqueue()) Process leaving via out (rc=0 : 0 : 0)
00010000:00000001:13:1350018861.719604:0:5827:0:(ldlm_flock.c:525:ldlm_flock_completion_ast()) Process entered
00010000:00010000:4:1350018861.719604:0:5829:0:(ldlm_resource.c:574:cleanup_resource()) ### setting FL_LOCAL_ONLY ns: fefs-MDT0000-mdc-ffff8105e9bbbc00 lock: ffff810c3e37a200/0xee16fc0aef26bb17 lrc: 5/0,1 mode: PW/PW res: 12/706300508 rrc: 2 type: FLK pid: 5827 [0-&amp;gt;16777215] flags: 0x2000c10 remote: 0xa520aeaebdefc517 expref: -99 pid: 5827 timeout: 0
00010000:00010000:13:1350018861.719605:0:5827:0:(ldlm_flock.c:528:ldlm_flock_completion_ast()) flags: 0x40000 data: 0000000000000000 getlk: ffff810c3ecc1ad8
00010000:00000001:13:1350018861.719607:0:5827:0:(ldlm_lock.c:659:ldlm_lock_decref_internal()) Process entered

1
==================================================
00010000:00010000:13:1350018861.719607:0:5827:0:(ldlm_lock.c:637:ldlm_lock_decref_internal_nolock()) ### ldlm_lock_decref(PW) ns: fefs-MDT0000-mdc-ffff8105e9bbbc00 lock: ffff810c3e37a200/0xee16fc0aef26bb17 lrc: 5/0,1 mode: PW/PW res: 12/706300508 rrc: 2 type: FLK pid: 5827 [0-&amp;gt;16777215] flags: 0x2000c10 remote: 0xa520aeaebdefc517 expref: -99 pid: 5827 timeout: 0
==================================================

00010000:00000001:4:1350018861.719611:0:5829:0:(ldlm_flock.c:525:ldlm_flock_completion_ast()) Process entered
00010000:00000001:13:1350018861.719612:0:5827:0:(ldlm_lock.c:152:ldlm_lock_put()) Process entered
00010000:00010000:4:1350018861.719612:0:5829:0:(ldlm_flock.c:528:ldlm_flock_completion_ast()) flags: 0x0 data: 0000000000000000 getlk: ffff810c3ecc1ad8
00010000:00000001:13:1350018861.719613:0:5827:0:(ldlm_lock.c:182:ldlm_lock_put()) Process leaving
00010000:00010000:13:1350018861.719613:0:5827:0:(ldlm_lock.c:683:ldlm_lock_decref_internal()) ### final decref done on cbpending lock ns: fefs-MDT0000-mdc-ffff8105e9bbbc00 lock: ffff810c3e37a200/0xee16fc0aef26bb17 lrc: 4/0,0 mode: PW/PW res: 12/706300508 rrc: 2 type: FLK pid: 5827 [0-&amp;gt;16777215] flags: 0x2000c10 remote: 0xa520aeaebdefc517 expref: -99 pid: 5827 timeout: 0
00010000:00000001:13:1350018861.719617:0:5827:0:(ldlm_lock.c:206:ldlm_lock_remove_from_lru()) Process entered
00010000:00000001:13:1350018861.719618:0:5827:0:(ldlm_lock.c:210:ldlm_lock_remove_from_lru()) Process leaving
00010000:00000001:13:1350018861.719619:0:5827:0:(ldlm_lockd.c:1763:ldlm_bl_to_thread()) Process entered
00010000:00000001:4:1350018861.719619:0:5829:0:(ldlm_lock.c:659:ldlm_lock_decref_internal()) Process entered

2
==================================================
00010000:00010000:4:1350018861.719619:0:5829:0:(ldlm_lock.c:637:ldlm_lock_decref_internal_nolock()) ### ldlm_lock_decref(PW) ns: fefs-MDT0000-mdc-ffff8105e9bbbc00 lock: ffff810c3e37a200/0xee16fc0aef26bb17 lrc: 5/0,0 mode: PW/PW res: 12/706300508 rrc: 2 type: FLK pid: 5827 [0-&amp;gt;16777215] flags: 0x2000c10 remote: 0xa520aeaebdefc517 expref: -99 pid: 5827 timeout: 0
==================================================
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</description>
                <environment>MDSx1, OSSx1(OSTx3), Clientx1, running an application program using flock</environment>
        <key id="16360">LU-2177</key>
            <summary>ldlm_flock_completion_ast causes LBUG because of a race</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="emoly.liu">Emoly Liu</assignee>
                                    <reporter username="nozaki">Hiroya Nozaki</reporter>
                        <labels>
                            <label>patch</label>
                    </labels>
                <created>Mon, 15 Oct 2012 00:33:01 +0000</created>
                <updated>Mon, 4 Jan 2016 19:18:54 +0000</updated>
                            <resolved>Thu, 12 Jun 2014 06:19:03 +0000</resolved>
                                    <version>Lustre 2.1.3</version>
                    <version>Lustre 1.8.8</version>
                                    <fixVersion>Lustre 2.6.0</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>3</watches>
                                                                            <comments>
                            <comment id="46712" author="nozaki" created="Thu, 18 Oct 2012 03:53:59 +0000"  >&lt;p&gt;patch for b1_8&lt;br/&gt;
&lt;a href=&quot;http://review.whamcloud.com/#change,4290&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#change,4290&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="46716" author="nozaki" created="Thu, 18 Oct 2012 05:07:19 +0000"  >&lt;p&gt;patch for master&lt;br/&gt;
&lt;a href=&quot;http://review.whamcloud.com/#change,4291&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#change,4291&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="46717" author="nozaki" created="Thu, 18 Oct 2012 06:21:13 +0000"  >&lt;p&gt;I uploaded patches to fix the problem for the b1_8 and master branches.&lt;br/&gt;
Could someone please check and review them? Thank you.&lt;/p&gt;</comment>
                            <comment id="81904" author="vitaly_fertman" created="Fri, 18 Apr 2014 00:50:05 +0000"  >&lt;p&gt;CODE: &lt;a href=&quot;http://review.whamcloud.com/10005&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/10005&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="86408" author="emoly.liu" created="Thu, 12 Jun 2014 06:19:03 +0000"  >&lt;p&gt;The patch &lt;a href=&quot;http://review.whamcloud.com/10005&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/10005&lt;/a&gt; landed to 2.6.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="33904">LU-7626</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzvadb:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>5215</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>