<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:51:47 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-12347] lustre write: do not enqueue rpc holding osc/mdc ldlm lock held</title>
                <link>https://jira.whamcloud.com/browse/LU-12347</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Lustre&#8217;s write should not send an enqueue rpc to the mds while holding an osc or mdc ldlm lock. This may currently happen via:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;    cl_io_loop
      cl_io_lock                    &amp;lt;- ldlm lock is taken here
      cl_io_start
        vvp_io_write_start
        ...
          __generic_file_aio_write
            file_remove_privs
              security_inode_need_killpriv
              ...
                ll_xattr_get_common
                ...
                  mdc_intent_lock   &amp;lt;- enqueue rpc is sent here
      cl_io_unlock                  &amp;lt;- ldlm lock is released
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;That may lead to client eviction. The following scenario has been observed during write load with DoM involved:&lt;/p&gt;
&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;a write holds mdc ldlm lock (L1) and is waiting for a free rpc slot in&lt;br/&gt;
      obd_get_request_slot trying to do ll_xattr_get_common().&lt;/li&gt;
	&lt;li&gt;all the rpc slots are occupied by write processes which wait for enqueue&lt;br/&gt;
      rpc completion.&lt;/li&gt;
	&lt;li&gt;the mds, in order to serve the enqueue requests, has sent a blocking ast for&lt;br/&gt;
      lock L1 and eventually evicts the client, as the client does not cancel&lt;br/&gt;
      L1.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Another, more complex scenario caused by this problem has also been observed: clients get evicted by osts during mdtest+ior+failover hw testing.&lt;/p&gt;</description>
                <environment></environment>
        <key id="55764">LU-12347</key>
            <summary>lustre write: do not enqueue rpc holding osc/mdc ldlm lock held</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="vsaveliev">Vladimir Saveliev</assignee>
                                    <reporter username="vsaveliev">Vladimir Saveliev</reporter>
                        <labels>
                            <label>patch</label>
                    </labels>
                <created>Tue, 28 May 2019 13:22:38 +0000</created>
                <updated>Wed, 5 Jul 2023 18:03:47 +0000</updated>
                            <resolved>Tue, 30 Nov 2021 13:41:32 +0000</resolved>
                                                    <fixVersion>Lustre 2.15.0</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>6</watches>
                                                                            <comments>
                            <comment id="247836" author="gerrit" created="Tue, 28 May 2019 13:23:11 +0000"  >&lt;p&gt;Vladimir Saveliev (c17830@cray.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/34977&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/34977&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12347&quot; title=&quot;lustre write: do not enqueue rpc holding osc/mdc ldlm lock held&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12347&quot;&gt;&lt;del&gt;LU-12347&lt;/del&gt;&lt;/a&gt; llite: call file_remove_privs before taking ldlm&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 4b7265b8a06ec5dced501b2c25c7fd74894a7f45&lt;/p&gt;</comment>
                            <comment id="256991" author="adilger" created="Thu, 24 Oct 2019 03:02:02 +0000"  >&lt;p&gt;The full trace looks like:&lt;/p&gt;
 &lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt; PID: 20755  TASK: ffff8ff1f07fa080  CPU: 1   COMMAND: &quot;xdd&quot;
 #0 [ffff8ff1e3c47328] __schedule at ffffffffab367747
 #1 [ffff8ff1e3c473b0] schedule at ffffffffab367c49
 #2 [ffff8ff1e3c473c0] obd_get_request_slot at ffffffffc0692c24 [obdclass]
 #3 [ffff8ff1e3c47470] ldlm_cli_enqueue at ffffffffc083a7d0 [ptlrpc]
 #4 [ffff8ff1e3c47528] mdc_enqueue_base at ffffffffc0a10e81 [mdc]
 #5 [ffff8ff1e3c47640] mdc_intent_lock at ffffffffc0a12f15 [mdc]
 #6 [ffff8ff1e3c47718] lmv_intent_lock at ffffffffc09824f2 [lmv]
 #7 [ffff8ff1e3c477c8] ll_xattr_cache_refill at ffffffffc0aea395 [lustre]
 #8 [ffff8ff1e3c478a8] ll_xattr_cache_get at ffffffffc0aeb2ab [lustre]
 #9 [ffff8ff1e3c47900] ll_xattr_list at ffffffffc0ae7cdc [lustre]
#10 [ffff8ff1e3c47968] ll_xattr_get_common at ffffffffc0ae83ef [lustre]
#11 [ffff8ff1e3c479a8] ll_xattr_get_common_3_11 at ffffffffc0ae88d8 [lustre]
#12 [ffff8ff1e3c479b8] generic_getxattr at ffffffffaae693d2
#13 [ffff8ff1e3c479e8] cap_inode_need_killpriv at ffffffffaaef6b9f
#14 [ffff8ff1e3c479f8] security_inode_need_killpriv at ffffffffaaef938c
#15 [ffff8ff1e3c47a08] dentry_needs_remove_privs at ffffffffaae5e6bf
#16 [ffff8ff1e3c47a28] file_remove_privs at ffffffffaae5e8f8
#17 [ffff8ff1e3c47aa0] __generic_file_aio_write at ffffffffaadb88a8
#18 [ffff8ff1e3c47b20] __generic_file_write_iter at ffffffffc0afa36b [lustre]
#19 [ffff8ff1e3c47b90] vvp_io_write_start at ffffffffc0afe6ab [lustre]
#20 [ffff8ff1e3c47c00] cl_io_start at ffffffffc06d4828 [obdclass]
#21 [ffff8ff1e3c47c28] cl_io_loop at ffffffffc06d69fc [obdclass]
#22 [ffff8ff1e3c47c58] ll_file_io_generic at ffffffffc0ab4c1b [lustre]
#23 [ffff8ff1e3c47d60] ll_file_aio_write at ffffffffc0ab58b2 [lustre]
#24 [ffff8ff1e3c47dd8] ll_file_write at ffffffffc0ab5aa4 [lustre]
#25 [ffff8ff1e3c47ec0] vfs_write at ffffffffaae41650
#26 [ffff8ff1e3c47f00] sys_pwrite64 at ffffffffaae42632
#27 [ffff8ff1e3c47f50] system_call_fastpath at ffffffffab374ddb
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="261420" author="vsaveliev" created="Fri, 17 Jan 2020 13:41:21 +0000"  >&lt;blockquote&gt;
&lt;p&gt;So I have two suggestions for different ways to fix this that would be less complicated, and I think (2) should definitely work.  1. I am less sure about.&lt;br/&gt;
1. See if getxattr truly needs to be modifying.  obd_skip_mod_rpc_slot considers it a modifying RPC, but it seems weird that it is.  Perhaps it doesn&apos;t need to be - It would be good to know why getxattr is considered modifying.  If it&apos;s not modifying, problem is solved.&lt;br/&gt;
2. Alternately, looking at:&lt;br/&gt;
obd_mod_rpc_slot_avail_locked()&lt;br/&gt;
We see that one extra close request is allowed to avoid a deadlock.&lt;br/&gt;
&quot; * On the MDC client, to avoid a potential deadlock (see Bugzilla 3462),&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;one close request is allowed above the maximum.&quot;&lt;br/&gt;
If we did the same for getxattr, that should fix this as well.&lt;br/&gt;
Either is much simpler than this, and I think we should be able to do one of those...?&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;Patchset 3 is an attempt to implement Patrick&apos;s proposal. Unfortunately, it ran into complications in determining whether the extra slot is to be used.&lt;/p&gt;

&lt;p&gt;Below are dumps of runs of the tests included in the patch, with the fix disabled:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;== sanity test 820: write and lfs setstripe race ===================================================== 07:42:51 (1579236171)
fail_loc=0x80000329
lt-lfs setstripe setstripe: cannot read layout from &apos;/mnt/lustre/d820.sanity/f820.sanity&apos;: Input/output error
error: lt-lfs setstripe: invalid layout
lt-lfs setstripe setstripe: cannot read layout from &apos;/mnt/lustre/d820.sanity/f820.sanity&apos;: Input/output error
error: lt-lfs setstripe: invalid layout
lt-lfs setstripe setstripe: cannot read layout from &apos;/mnt/lustre/d820.sanity/f820.sanity&apos;: Input/output error
error: lt-lfs setstripe: invalid layout
lt-lfs setstripe setstripe: cannot read layout from &apos;/mnt/lustre/d820.sanity/f820.sanity&apos;: Input/output error
error: lt-lfs setstripe: invalid layout
lt-lfs setstripe setstripe: cannot read layout from &apos;/mnt/lustre/d820.sanity/f820.sanity&apos;: Input/output error
error: lt-lfs setstripe: invalid layout
write: Input/output error
 sanity test_820: @@@@@@ FAIL: multiop failed 
lt-lfs setstripe setstripe: cannot read layout from &apos;/mnt/lustre/d820.sanity/f820.sanity&apos;: Input/output error
error: lt-lfs setstripe: invalid layout
lt-lfs setstripe setstripe: cannot read layout from &apos;/mnt/lustre/d820.sanity/f820.sanity&apos;: Transport endpoint is not connected
error: lt-lfs setstripe: invalid layout
  Trace dump:
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;and&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;== sanity test 821: write race ======================================================================= 07:52:43 (1579236763)
fail_loc=0x80000329
write: Input/output error
write: Input/output error
write: Input/output error
write: Input/output error
write: Input/output error
write: Input/output error
write: Input/output error
write: Input/output error
 sanity test_821: @@@@@@ FAIL: multiop failed 
write: Input/output error
  Trace dump:
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="261793" author="tappro" created="Fri, 24 Jan 2020 09:15:02 +0000"  >&lt;p&gt;Vladimir, in the description it says: &quot;all the rpc slots are busy by write processes which wait for enqueue rpc completion&quot; -&#160; could you explain this a bit - are these &apos;write processes&apos; writing the same file, and what enqueue are they waiting for? I don&apos;t yet see who saturates all the rpc slots&lt;/p&gt;</comment>
                            <comment id="261796" author="vsaveliev" created="Fri, 24 Jan 2020 09:38:27 +0000"  >&lt;blockquote&gt;
&lt;p&gt;could you explain this a bit - are these &apos;write processes&apos; about the same file and what enqueue they are waiting for? &lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Yes.&lt;/p&gt;

&lt;p&gt;1. clientA:write1 writes file F:&lt;br/&gt;
   it gets to the point of entering file_remove_privs. It holds DLM lock L.&lt;/p&gt;

&lt;p&gt;2. clientB:write1 writes file F:&lt;br/&gt;
   it enqueues for lock, server sees conflict with L owned by clientA, so it sends blocking ast to clientA.&lt;/p&gt;

&lt;p&gt;3. clientA receives the blocking ast and marks the lock CB_PENDING, but it cannot cancel the lock because it is in use by clientA:write1.&lt;/p&gt;

&lt;p&gt;4. &apos;other writes&apos; to file F arrive on clientA. They have to enqueue (somewhere in cl_io_lock) because lock L is already CB_PENDING. Those writes will be granted the lock only after clientB gets and releases it, so they are blocked on the server while occupying rpc slots, possibly all of them.&lt;/p&gt;

&lt;p&gt;5. clientA:write1 continues file_remove_privs()-&amp;gt;getxattr and gets blocked, as all rpc slots are occupied.&lt;/p&gt;
</comment>
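[Editor's note] The five steps above can be condensed into a toy model of the slot exhaustion. This is an illustration only: Slots stands in for the obd_get_request_slot() accounting, and the numbers are arbitrary, not real Lustre defaults:

```python
MAX_RPCS_IN_FLIGHT = 8  # arbitrary stand-in for cl_max_rpcs_in_flight

class Slots:
    """Counting limiter standing in for the rpc slot accounting."""
    def __init__(self, limit):
        self.limit = limit
        self.in_use = 0

    def try_get(self):
        # returns False where the real code would block in
        # obd_get_request_slot()
        if self.in_use >= self.limit:
            return False
        self.in_use += 1
        return True

def scenario():
    slots = Slots(MAX_RPCS_IN_FLIGHT)
    # step 4: 'other writes' on clientA each enqueue their own lock;
    # every enqueue takes a slot and then waits on the server, which
    # grants nothing until lock L is cancelled
    blocked = sum(slots.try_get() for _ in range(MAX_RPCS_IN_FLIGHT))
    # step 5: the holder of L now needs a slot for its getxattr
    # enqueue inside file_remove_privs(), but none is left, so the
    # cancel of L can never be sent: deadlock, ending in eviction
    holder_progresses = slots.try_get()
    return blocked, holder_progresses
```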
                            <comment id="261800" author="tappro" created="Fri, 24 Jan 2020 10:10:09 +0000"  >&lt;p&gt;OK, but I wonder why there are so many of those writes on A that they block all the slots; if a new lock is needed, that is one slot for one new enqueue. Yet it seems many slots are in use, and I am trying to understand why. Also, by &apos;write using a slot&apos; do you mean the write lock enqueue or a real BRW?&lt;/p&gt;</comment>
                            <comment id="261810" author="vsaveliev" created="Fri, 24 Jan 2020 13:30:02 +0000"  >&lt;blockquote&gt;
&lt;p&gt;Also by &apos;write using slot&apos; do you mean write lock enqueue or real BRW?&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;lock enqueue in the following code path:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
ll_file_io_generic
  cl_io_loop
    cl_io_lock
      cl_lockset_lock
        cl_lock_request
          cl_lock_enqueue
            lov_lock_enqueue
              mdc_lock_enqueue
                mdc_enqueue_send
                  ldlm_lock_match            &amp;lt;-- here we get miss
                  ldlm_cli_enqueue
                    obd_get_request_slot  &amp;lt;-- slot is occupied
                    ptlrpc_queue_wait        &amp;lt;-- here all writes get stuck, as the server does not get a cancel from clientA:write1 because that one waits for free slots
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;OK, but I wonder why there are many of those write on A so they are blocking all slots, if new lock is needed then that is one slot for new enqueue. But it seems there are many slots are being used, I am trying to understand why.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;Each of those additional writes enqueues its own lock; I believe this is because mdc_enqueue_send()-&amp;gt;ldlm_lock_match() fails.&lt;/p&gt;

&lt;p&gt;All those cli-&amp;gt;cl_max_rpcs_in_flight writes have the following:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000080:00000001:0.0:1579840865.602638:0:30118:0:(file.c:1995:ll_file_write()) Process entered
...
00000002:00000001:0.0:1579840865.602825:0:30118:0:(mdc_dev.c:842:mdc_lock_enqueue()) Process entered
...
00010000:00010000:0.0:1579840865.602838:0:30118:0:(ldlm_lock.c:1505:ldlm_lock_match_with_skip()) ### not matched ns ffff8c2607875c00 type 13 mode 2 res 8589935618/4 (0 0)
...
grab rpc slot via ldlm_cli_enqueue ()-&amp;gt;obd_get_request_slot()
...
00010000:00010000:0.0:1579840865.602912:0:30118:0:(ldlm_request.c:1121:ldlm_cli_enqueue()) ### sending request ns: lustre-MDT0000-mdc-ffff8c2618072000 lock: ffff8c263c315e00/0xdd4ac26a917dc529 lrc: 3/0,1 mode: --/PW res: [0x200000402:0x4:0x0].0x0 bits 0x40/0x0 rrc: 6 type: IBT flags: 0x0 nid: local remote: 0x0 expref: -99 pid: 30118 timeout: 0 lvb_type: 0
...
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
</comment>
                            <comment id="261819" author="tappro" created="Fri, 24 Jan 2020 15:56:37 +0000"  >&lt;p&gt;Yes, that is what I thought; this is probably the reason. I would think that if a lock is already being enqueued there is no need to do the same again and again, especially considering that DOM has a single range for all locks. I will check that part of the code.&lt;/p&gt;</comment>
                            <comment id="261823" author="vsaveliev" created="Fri, 24 Jan 2020 16:19:00 +0000"  >&lt;blockquote&gt;
&lt;p&gt;I would think that if lock is enqueued then no need to do the same again and again, especially considering that DOM has single range for all locks. I will check that part of code.&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Ok, that makes sense. However, even if you fix that, we still need a fix for the rpc slot lockup on ll_file_write-&amp;gt;remove_file_suid, because the lockup may happen during concurrent writes to different files as well.&lt;/p&gt;</comment>
                            <comment id="261824" author="tappro" created="Fri, 24 Jan 2020 16:30:54 +0000"  >&lt;p&gt;could you describe that lockup with an example, please? There were several related scenarios, I&apos;ve lost track a bit.&lt;/p&gt;</comment>
                            <comment id="262177" author="tappro" created="Thu, 30 Jan 2020 05:06:56 +0000"  >&lt;p&gt;FYI, I&apos;ve found why all slots are filled with BRW/glimpse enqueue RPCs. There is ldlm_lock_match() in mdc_enqueue_send() which tries to find any granted or waiting lock so as not to enqueue a similar new lock, but the problem is that we have one CB_PENDING lock which can&apos;t be matched, and each new enqueue RPC gets stuck on the server waiting for it. Meanwhile, a new lock is put into the waiting queue on the client side only when it gets the reply from the server, i.e. when the enqueue RPC finishes. So there are no such locks to match: every one stays in an RPC slot waiting for the server response and is not yet added to the waiting queue, so each new enqueue matches no lock and also goes to the server, consuming slots.&lt;br/&gt;
I have some observations and proposals about that.&lt;br/&gt;
1) ldlm_request_slot_needed() takes a slot only for FLOCK and IBITS locks but not EXTENT; I suppose that is because IO locks need no flow control, as they are usually the result of a file operation whose other RPCs are sent under flow control already. Maybe there are other reasons too. Anyway, the DOM enqueue RPC can similarly be excluded from taking an RPC slot.&lt;br/&gt;
2) All MDT locks are ATOMIC on the server, so it waits for the lock to be granted before replying to the client. That keeps the enqueue RPC in a slot for quite a long time, and that can also be a reason for MDC RPC flow control to limit the number of outgoing locks and not overload the client import. OSC IO locks are asynchronous and the server replies without waiting for the lock to be granted. DOM locks are also &apos;atomic&apos; right now, so they wait for the lock to be granted on the server. If they were done in an async manner, they would not be stuck in RPC slots forever waiting for blocking locks. I have such a patch here: &lt;a href=&quot;https://review.whamcloud.com/36903&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/36903&lt;/a&gt; and I think it will help with the current issue. &lt;/p&gt;
</comment>
                            <comment id="262178" author="tappro" created="Thu, 30 Jan 2020 05:15:29 +0000"  >&lt;p&gt;So I would propose not to add an extra slot here but to keep all the other changes: adding &lt;tt&gt;IT_GETXATTR&lt;/tt&gt;, removing &lt;tt&gt;ols_has_ref&lt;/tt&gt;, and passing einfo in &lt;tt&gt;ldlm_cli_enqueue_fini()&lt;/tt&gt;. Also I&apos;d consider excluding DOM locks from consuming RPC slots, similarly to EXTENT locks&lt;/p&gt;</comment>
                            <comment id="262297" author="vsaveliev" created="Fri, 31 Jan 2020 14:32:04 +0000"  >&lt;blockquote&gt;&lt;p&gt;could you describe that lockup with an example, please? There were several related scenarios, I&apos;ve lost track a bit.&lt;/p&gt;&lt;/blockquote&gt; 
&lt;p&gt;1. Have max_rpcs_in_flight writes to max_rpcs_in_flight files; have them pause somewhere at file_remove_suid-&amp;gt;ll_xattr_cache_refill.&lt;br/&gt;
2. Have max_rpcs_in_flight writes to the same files from another client. The server will notice max_rpcs_in_flight conflicts and send blocking asts to the first client.&lt;br/&gt;
3. The first client is unable to cancel the locks, as ll_xattr_cache_refill has to complete first.&lt;br/&gt;
4. Have max_rpcs_in_flight new writes enqueue dlm locks (because the locks are callback pending). Those new writes occupy rpc slots, as their enqueues will complete only after the enqueues from client2 have completed.&lt;br/&gt;
5.&#160;The first writes want to enqueue in ll_xattr_find_get_lock, but all slots are occupied.&lt;/p&gt;

&lt;p&gt;Patchset 8 of &lt;a href=&quot;https://review.whamcloud.com/#/c/34977/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/34977/&lt;/a&gt; contains this test: sanityn:105c.&lt;/p&gt;</comment>
                            <comment id="306295" author="gerrit" created="Tue, 6 Jul 2021 14:24:39 +0000"  >&lt;p&gt;Vladimir Saveliev (vlaidimir.saveliev@hpe.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/44151&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/44151&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12347&quot; title=&quot;lustre write: do not enqueue rpc holding osc/mdc ldlm lock held&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12347&quot;&gt;&lt;del&gt;LU-12347&lt;/del&gt;&lt;/a&gt; llite: do not take mod rpc slot for getxattr&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: c68d11fcbdd03113b5618ea93b7662a5e5790dce&lt;/p&gt;</comment>
                            <comment id="306299" author="vsaveliev" created="Tue, 6 Jul 2021 14:35:45 +0000"  >&lt;blockquote&gt;&lt;p&gt;could you describe that lockup with an example, please? There were several related scenarios, I&apos;ve lost track a bit.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;With &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13645&quot; title=&quot;Various data corruptions possible in lustre.&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13645&quot;&gt;&lt;del&gt;LU-13645&lt;/del&gt;&lt;/a&gt; these scenarios became impossible.&lt;/p&gt;

&lt;p&gt;The following however is still possible:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;- clientA:write1 writes to file F and holds mdc ldlm lock (L1) and
  runs somewhere on the way to
  file_remove_privs()-&amp;gt;ll_xattr_get_common()

- clientB:write is going to write file F and enqueues DoM lock. mds
  handles conflict on L1 and sends blocking ast to clientA

- clientA: max_mod_rpcs_in_flight simultaneous creates occupy all mod
  rpc slots and get delayed on mds side waiting for preallocated
  objects. Preallocation is delayed by ost failover.

- clientA:write1 tries to get a mod rpc slot to enqueue the xattr
  request; all slots are busy, so lock L1 cannot be cancelled until
  one of the creates completes its rpc, which is stuck on
  preallocation.

- the lock callback timer expires on the mds first and it evicts clientA.
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;This can be fixed by adding IT_GETXATTR to mdc_skip_mod_rpc_slot().&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;but keep all other changes - IT_GETXATTR adding, removal of ols_has_ref&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;ok. see&#160; &lt;a href=&quot;https://review.whamcloud.com/44151&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/44151&lt;/a&gt;&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;and passing einfo in ldlm_cli_enqueue_fini(). Also I&apos;d consider exclusion of DOM locks from consuming RPC slots similarly to EXTENT locks&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;These are in already as part of &lt;a href=&quot;https://review.whamcloud.com/36903&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/36903&lt;/a&gt;.&lt;/p&gt;</comment>
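[Editor's note] The direction of the patch under review, skipping the mod rpc slot for getxattr intents, can be sketched like this. The intent names and the helper are simplified illustrations; see mdc_skip_mod_rpc_slot() in the actual patch for the real logic:

```python
# Toy classification: intents that do not modify the namespace do not
# consume a mod rpc slot, so a getxattr issued while holding a ldlm
# lock can always proceed and the lock can be cancelled in time.
# The set below is illustrative, not the exact list from the patch.
NON_MODIFYING_INTENTS = frozenset({"getattr", "layout", "readdir", "getxattr"})

def needs_mod_rpc_slot(intent):
    return intent not in NON_MODIFYING_INTENTS
```

Under this model a getxattr never competes for the slots that the blocked creates occupy, so the eviction scenario above cannot arise.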
                            <comment id="319453" author="gerrit" created="Tue, 30 Nov 2021 03:45:01 +0000"  >&lt;p&gt;&quot;Oleg Drokin &amp;lt;green@whamcloud.com&amp;gt;&quot; merged in patch &lt;a href=&quot;https://review.whamcloud.com/44151/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/44151/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12347&quot; title=&quot;lustre write: do not enqueue rpc holding osc/mdc ldlm lock held&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12347&quot;&gt;&lt;del&gt;LU-12347&lt;/del&gt;&lt;/a&gt; llite: do not take mod rpc slot for getxattr&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: eb64594e4473af859e74a0e831316cead0f5c49b&lt;/p&gt;</comment>
                            <comment id="319531" author="pjones" created="Tue, 30 Nov 2021 13:41:32 +0000"  >&lt;p&gt;Landed for 2.15&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="69072">LU-15639</issuekey>
        </issuelink>
                            </outwardlinks>
                                                                <inwardlinks description="is related to">
                                                        </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i00h1z:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>