<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:28:10 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-16571] &quot;lfs migrate -b&quot; can cause thread starvation on OSS</title>
                <link>https://jira.whamcloud.com/browse/LU-16571</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;I am not 100% sure, but I think we hit the following issue in production with parallel &quot;lfs migrate -b&quot; instances on files with several hard links:&lt;/p&gt;

&lt;p&gt;&lt;b&gt;OSS setup:&lt;/b&gt;&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@oss ~]# lctl get_param ost.OSS.ost.threads_* cpu_partition_table
ost.OSS.ost.threads_max=78
ost.OSS.ost.threads_min=3
ost.OSS.ost.threads_started=4
cpu_partition_table=0   : 0 1
[root@oss ~]# mount -tlustre
/dev/mapper/ost1_flakey on /...
/dev/mapper/ost2_flakey on /...
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;b&gt;Reproducer with 1 client:&lt;/b&gt;&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@client test]# lfs setstripe  -c2 .
[root@client test]# dd if=/dev/zero of=test1 bs=1M count=100                                   
[root@client test]# mkdir links                                              
[root@client test]# printf &quot;%s\n&quot; link{1..100} | xargs -I{} ln test1 links/{}
[root@client test]# find -type f | xargs -P100 -I{} lfs migrate -c2 {}
# --&amp;gt; the command hangs
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;b&gt;Node states&lt;/b&gt;&lt;br/&gt;
messages queued:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@oss ~]# lctl get_param -n ost.OSS.ost.nrs_policies    

regular_requests:
  - name: fifo
    state: started
    fallback: yes
    queued: 125                 
    active: 76                  
...
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;OSS dmesg:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[369238.270582] LustreError: 13117:0:(ldlm_request.c:124:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1676651767, 300s ago); not entering recovery in server code, just going back to sleep ns: 
filter-lustrefs-OST0000_UUID lock: ffff967687a826c0/0xdb3bda27f476b2cd lrc: 3/1,0 mode: --/PR res: [0xc46:0x0:0x0].0x0 rrc: 40 type: EXT [0-&amp;gt;18446744073709551615] (req 0-&amp;gt;18446744073709551615) gid 0 flags: 0x400
10000000000 nid: local remote: 0x0 expref: -99 pid: 13117 timeout: 0 lvb_type: 0
[369238.288701] LustreError: 13117:0:(ldlm_request.c:124:ldlm_expired_completion_wait()) Skipped 72 previous similar messages
[369688.315515] Lustre: 13182:0:(service.c:1453:ptlrpc_at_send_early_reply()) @@@ Could not add any time (5/-150), not sending early reply  req@ffff9676632edf80 x1758012803008832/t0(0) o101-&amp;gt;d58673c6-592c-4241-b
a90-fd6a89dece56@10.0.2.6@tcp:617/0 lens 328/0 e 0 to 0 dl 1676652522 ref 2 fl New:/0/ffffffff rc 0/-1 job:&apos;client.29583&apos;
[369688.315528] Lustre: 13182:0:(service.c:1453:ptlrpc_at_send_early_reply()) Skipped 1 previous similar message
[369693.860227] Lustre: 13182:0:(service.c:1453:ptlrpc_at_send_early_reply()) @@@ Could not add any time (5/-150), not sending early reply  req@ffff96769dcb5b00 x1758005042515584/t0(0) o6-&amp;gt;lustrefs-MDT0000-mdtlo
v_UUID@10.0.2.4@tcp:622/0 lens 544/0 e 0 to 0 dl 1676652527 ref 2 fl New:/0/ffffffff rc 0/-1 job:&apos;osp-syn-0-0.0&apos;
[369693.860239] Lustre: 13182:0:(service.c:1453:ptlrpc_at_send_early_reply()) Skipped 163 previous similar messages
[369694.392072] Lustre: lustrefs-OST0000: Client d58673c6-592c-4241-ba90-fd6a89dece56 (at 10.0.2.6@tcp) reconnecting
[369694.392084] Lustre: Skipped 1 previous similar message
[369699.869588] Lustre: lustrefs-OST0000: Client lustrefs-MDT0000-mdtlov_UUID (at 10.0.2.4@tcp) reconnecting
[369699.869591] Lustre: Skipped 1 previous similar message
[370140.452612] ptlrpc_watchdog_fire: 72 callbacks suppressed
[370140.452620] Lustre: ll_ost00_025: service thread pid 13134 was inactive for 1202.140 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging p
urposes:
[370140.452623] Lustre: Skipped 2 previous similar messages
[370140.452626] Pid: 13134, comm: ll_ost00_025 3.10.0-1160.59.1.el7.centos.plus.x86_64 #1 SMP Wed Feb 23 17:40:21 UTC 2022
[370140.452631] Call Trace:
[370140.452705] [&amp;lt;0&amp;gt;] ldlm_completion_ast+0x62d/0xa40 [ptlrpc]
[370140.452767] [&amp;lt;0&amp;gt;] ldlm_cli_enqueue_local+0x25c/0x880 [ptlrpc]
[370140.452818] [&amp;lt;0&amp;gt;] tgt_extent_lock+0xea/0x2a0 [ptlrpc]
[370140.452827] [&amp;lt;0&amp;gt;] ofd_getattr_hdl+0x385/0x750 [ofd]
[370140.452874] [&amp;lt;0&amp;gt;] tgt_request_handle+0x92f/0x19c0 [ptlrpc]
[370140.452929] [&amp;lt;0&amp;gt;] ptlrpc_server_handle_request+0x253/0xc00 [ptlrpc]
[370140.452963] [&amp;lt;0&amp;gt;] ptlrpc_main+0xc3c/0x15f0 [ptlrpc]
[370140.452968] [&amp;lt;0&amp;gt;] kthread+0xd1/0xe0
[370140.452972] [&amp;lt;0&amp;gt;] ret_from_fork_nospec_begin+0x21/0x21
[370140.452993] [&amp;lt;0&amp;gt;] 0xfffffffffffffffe
....
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Client dmesg (reconnections):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[373336.926476] Lustre: lustrefs-OST0000-osc-ffff9bf4bb70f000: Connection restored to 10.0.2.5@tcp (at 10.0.2.5@tcp)
[373336.926481] Lustre: Skipped 1 previous similar message
[374092.440057] Lustre: 3473:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1676654035/real 1676654035]  req@ffff9bf48c66e400 x1758012803097024/t0(0) o17-&amp;gt;lustrefs-OST0001-osc-ffff9bf4bb70f000@10.0.2.5@tcp:28/4 lens 456/432 e 0 to 1 dl 1676654791 ref 1 fl Rpc:XQr/2/ffffffff rc -11/-1 job:&apos;&apos;
[374092.440068] Lustre: 3473:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 164 previous similar messages
[374092.440098] Lustre: lustrefs-OST0000-osc-ffff9bf4bb70f000: Connection to lustrefs-OST0000 (at 10.0.2.5@tcp) was lost; in progress operations using this service will wait for recovery to complete
[374092.440106] Lustre: Skipped 2 previous similar messages
[374092.450347] Lustre: lustrefs-OST0000-osc-ffff9bf4bb70f000: Connection restored to 10.0.2.5@tcp (at 10.0.2.5@tcp)
[374092.450354] Lustre: Skipped 1 previous similar message
[374848.233487] Lustre: 3473:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1676654791/real 1676654791]  req@ffff9bf44b5b9f80 x1758012803195712/t0(0) o17-&amp;gt;lustrefs-OST0000-osc-ffff9bf4bb70f000@10.0.2.5@tcp:28/4 lens 456/432 e 0 to 1 dl 1676655547 ref 1 fl Rpc:XQr/2/ffffffff rc -11/-1 job:&apos;&apos;
[374848.233499] Lustre: 3473:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 164 previous similar messages
[374848.233550] Lustre: lustrefs-OST0000-osc-ffff9bf4bb70f000: Connection to lustrefs-OST0000 (at 10.0.2.5@tcp) was lost; in progress operations using this service will wait for recovery to complete
[374848.246634] Lustre: lustrefs-OST0000-osc-ffff9bf4bb70f000: Connection restored to 10.0.2.5@tcp (at 10.0.2.5@tcp)
[374848.246645] Lustre: Skipped 1 previous similar message
[375031.816746] Lustre: 3473:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1676655547/real 1676655547]  req@ffff9bf44b5bb180 x1758012803283392/t0(0) o17-&amp;gt;lustrefs-OST0001-osc-ffff9bf4bb70f000@10.0.2.5@tcp:28/4 lens 456/432 e 7 to 1 dl 1676655730 ref 1 fl Rpc:XQr/2/ffffffff rc -11/-1 job:&apos;&apos;
[375031.816758] Lustre: 3473:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 168 previous similar messages
[375031.816775] Lustre: lustrefs-OST0000-osc-ffff9bf4bb70f000: Connection to lustrefs-OST0000 (at 10.0.2.5@tcp) was lost; in progress operations using this service will wait for recovery to complete
[375031.816785] Lustre: Skipped 2 previous similar messages
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;At this point, the OSS is not able to reply to requests of the OST service.&lt;/p&gt;

&lt;p&gt;All the threads on the OSS are waiting for an extent lock on the OST objects:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;crash&amp;gt; foreach &apos;ll_ost00&apos; bt | grep -c &quot;ldlm_completion_ast&quot;                                  
76
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;OST resources:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;crash&amp;gt; ldlm_resource.lr_name -x 0xffff96768edc7600,
  lr_name = {
    name = {0x7fc, 0x0, 0x0, 0x0}
  }
crash&amp;gt; ldlm_resource.lr_name -x 0xffff96768edc7300  
  lr_name = {
    name = {0xd06, 0x0, 0x0, 0x0}
  }


crash&amp;gt; ldlm_resource.lr_granted -o 0xffff96768edc7600
struct ldlm_resource {
  [ffff96768edc7620] struct list_head lr_granted;
}
crash&amp;gt; list -H ffff96768edc7620 -s ldlm_lock.l_granted_mode  ldlm_lock.l_res_link        
ffff967646789200
  l_granted_mode = LCK_GROUP
crash&amp;gt; ldlm_resource.lr_waiting -o 0xffff96768edc7600
struct ldlm_resource {
  [ffff96768edc7630] struct list_head lr_waiting;
}
crash&amp;gt; list -H ffff96768edc7630 -s ldlm_lock.l_req_mode  ldlm_lock.l_res_link         
ffff96766a297d40
  l_req_mode = LCK_PR
...
crash&amp;gt; list -H ffff96768edc7630 | wc -l
38

Same thing for 0xffff96768edc7300 resource:
....
crash&amp;gt; ldlm_resource.lr_waiting -o 0xffff96768edc7300 
struct ldlm_resource {
  [ffff96768edc7330] struct list_head lr_waiting;
}
crash&amp;gt; list -H ffff96768edc7330 | wc -l
38
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;38 + 38 = 76 locks are waiting for the 2 group locks on the 2 OST objects of the file.&lt;/p&gt;

&lt;p&gt;On the client, the &quot;lfs migrate&quot; process that has the group lock is waiting for a reply from the OSS:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;PID: 6136   TASK: ffff9bf44e0b2100  CPU: 0   COMMAND: &quot;lfs&quot;                
 #0 [ffff9bf4450a7b38] __schedule at ffffffffa619d018                      
 #1 [ffff9bf4450a7ba0] schedule at ffffffffa619d3e9                        
 #2 [ffff9bf4450a7bb0] cl_sync_io_wait at ffffffffc0b32d55 [obdclass]      
 #3 [ffff9bf4450a7c30] cl_lock_request at ffffffffc0b2e783 [obdclass]      
 #4 [ffff9bf4450a7c68] cl_glimpse_lock at ffffffffc11b4339 [lustre]        
 #5 [ffff9bf4450a7cb0] cl_glimpse_size0 at ffffffffc11b4705 [lustre]       
 #6 [ffff9bf4450a7d08] ll_file_aio_read at ffffffffc1163706 [lustre]       
 #7 [ffff9bf4450a7de8] ll_file_read at ffffffffc1163fa1 [lustre]           
 #8 [ffff9bf4450a7ed0] vfs_read at ffffffffa5c4e3ff                        
 #9 [ffff9bf4450a7f00] sys_pread64 at ffffffffa5c4f472                     
#10 [ffff9bf4450a7f50] system_call_fastpath at ffffffffa61aaf92            
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The other &quot;lfs migrate&quot; processes are calling stat() or llapi_get_data_version() (before taking the group lock):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;PID: 6137   TASK: ffff9bf44e0b3180  CPU: 1   COMMAND: &quot;lfs&quot;                    
 #0 [ffff9bf4ba1d7ac0] __schedule at ffffffffa619d018                          
 #1 [ffff9bf4ba1d7b28] schedule at ffffffffa619d3e9                            
 #2 [ffff9bf4ba1d7b38] schedule_timeout at ffffffffa619b0b1                    
 #3 [ffff9bf4ba1d7be8] wait_for_completion at ffffffffa619d79d                 
 #4 [ffff9bf4ba1d7c48] osc_io_data_version_end at ffffffffc10311e9 [osc]       
 #5 [ffff9bf4ba1d7c80] cl_io_end at ffffffffc0b3075f [obdclass]                
 #6 [ffff9bf4ba1d7cb0] lov_io_end_wrapper at ffffffffc10864eb [lov]            
 #7 [ffff9bf4ba1d7cd0] lov_io_data_version_end at ffffffffc10874a8 [lov]       
 #8 [ffff9bf4ba1d7cf8] cl_io_end at ffffffffc0b3075f [obdclass]                
 #9 [ffff9bf4ba1d7d28] cl_io_loop at ffffffffc0b334dd [obdclass]               
#10 [ffff9bf4ba1d7d60] ll_ioc_data_version at ffffffffc1159f0b [lustre]        
#11 [ffff9bf4ba1d7da8] ll_file_ioctl at ffffffffc11728c3 [lustre]              
#12 [ffff9bf4ba1d7e80] do_vfs_ioctl at ffffffffa5c63ad0                        
#13 [ffff9bf4ba1d7f00] sys_ioctl at ffffffffa5c63d81                           
#14 [ffff9bf4ba1d7f50] system_call_fastpath at ffffffffa61aaf92                

PID: 6158   TASK: ffff9bf4b9145280  CPU: 0   COMMAND: &quot;lfs&quot;                           
 #0 [ffff9bf444cd3bb8] __schedule at ffffffffa619d018                                 
 #1 [ffff9bf444cd3c20] schedule at ffffffffa619d3e9                                   
 #2 [ffff9bf444cd3c30] cl_sync_io_wait at ffffffffc0b32d55 [obdclass]                 
 #3 [ffff9bf444cd3cb0] cl_lock_request at ffffffffc0b2e783 [obdclass]                 
 #4 [ffff9bf444cd3ce8] cl_glimpse_lock at ffffffffc11b4339 [lustre]                   
 #5 [ffff9bf444cd3d30] cl_glimpse_size0 at ffffffffc11b4705 [lustre]                  
 #6 [ffff9bf444cd3d88] ll_getattr_dentry at ffffffffc116c5ee [lustre]                 
 #7 [ffff9bf444cd3e38] ll_getattr at ffffffffc116cafe [lustre]                        
 #8 [ffff9bf444cd3e48] vfs_getattr at ffffffffa5c53ee9                                
 #9 [ffff9bf444cd3e78] vfs_fstat at ffffffffa5c53f65                                  
#10 [ffff9bf444cd3eb8] SYSC_newfstat at ffffffffa5c544d4                              
#11 [ffff9bf444cd3f40] sys_newfstat at ffffffffa5c548ae                               
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;b&gt;Conclusion:&lt;/b&gt;&lt;br/&gt;
The requests from the &quot;lfs migrate&quot; process holding the group lock are stuck in the NRS policy of the OST service, because all the OSS threads are waiting for that group lock.&lt;/p&gt;
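This starvation can be illustrated with a toy model (a sketch with the numbers from this reproducer, not Lustre code): each conflicting PR lock request pins one OST service thread, and the group-lock holder needs one free thread for its read RPC to be serviced so it can release the lock.

```python
# Toy model of the OST service-thread starvation (illustrative sketch only).

def holder_can_release(threads_max, blocked_lock_waiters):
    """Can the group-lock holder's read RPC still be serviced?

    blocked_lock_waiters: conflicting PR lock requests, each sleeping in
    ldlm_completion_ast() and pinning one service thread until the group
    lock is released.
    """
    busy = min(blocked_lock_waiters, threads_max)
    # The holder needs one free thread; without it the group lock is
    # never released and the waiters never wake up: deadlock.
    return threads_max - busy > 0

# 100 parallel "lfs migrate" on hard links vs. threads_max=78: hang.
assert not holder_can_release(threads_max=78, blocked_lock_waiters=100)
# Workaround: raise threads_max above the number of blocked requests.
assert holder_can_release(threads_max=200, blocked_lock_waiters=100)
```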

&lt;p&gt;&lt;b&gt;Workaround&lt;/b&gt;&lt;br/&gt;
Temporarily increase the number of threads to dequeue all the requests:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@oss ~]# lctl set_param ost.OSS.ost.threads_max=200
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</description>
                <environment>Reproduced on VMs with ldiskfs Lustre 2.15.54:&lt;br/&gt;
1 client&lt;br/&gt;
1 MDS: 2 MDTs&lt;br/&gt;
1 OSS: 2 OSTs</environment>
        <key id="74733">LU-16571</key>
            <summary>&quot;lfs migrate -b&quot; can cause thread starvation on OSS</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="eaujames">Etienne Aujames</assignee>
                                    <reporter username="eaujames">Etienne Aujames</reporter>
                        <labels>
                    </labels>
                <created>Fri, 17 Feb 2023 19:31:38 +0000</created>
                <updated>Wed, 24 May 2023 09:28:52 +0000</updated>
                                                                                <due></due>
                            <votes>0</votes>
                                    <watches>6</watches>
                                                                            <comments>
                            <comment id="363301" author="paf0186" created="Fri, 17 Feb 2023 20:00:42 +0000"  >&lt;p&gt;So, let&apos;s see...&#160; This is very good analysis.&lt;/p&gt;

&lt;p&gt;Now, the key issue here is someone is holding a group lock, and they&apos;re making a request which can&apos;t complete, so everyone is backing up behind them.&#160; So the competition is between lock requests - which are waiting indefinitely - and an I/O request.&lt;/p&gt;

&lt;p&gt;This seems like it could be possible without a group lock, actually, since the thread holding a write lock could make an I/O request while holding the lock, and that I/O request could get stuck in the same way.&#160; The main thing the group lock does is make this more likely because it can&apos;t be cancelled even if it&apos;s not in use (so the window for other locks waiting for it is much larger), and there&apos;s no eviction possible either, so you&apos;re guaranteed to see a hang rather than pause, get evicted and maybe not notice the problem.&lt;/p&gt;

&lt;p&gt;Normally, we might solve something like this by moving one of the kinds of request to another portal, but lock requests and I/O requests are both things we&apos;d like to keep on a single, large set of OST threads, since they&apos;re primary functions of the OSS (and linked to each other).&lt;/p&gt;

&lt;p&gt;So, a few further thoughts.&lt;/p&gt;

&lt;p&gt;I think I know why this doesn&apos;t &apos;normally&apos; happen:&lt;br/&gt;
1. The OSS can normally spawn more threads to handle requests up to maxthreads, which is normally very high.&#160; Why did you set it so low?&#160; In my experience - which may be out of date - the OSS maxthreads is normally set very high (hundreds and hundreds), and we let the OSS and the scheduler manage itself.&#160; Threads are only used if there&apos;s work for them to do, but also they spend a lot of time asleep waiting for things, so it&apos;s normal to have maxthreads &amp;gt;&amp;gt; CPU count.&lt;/p&gt;

&lt;p&gt;Basically, this wouldn&apos;t happen unless number of hard links &amp;gt; maxthreads, which is a &lt;b&gt;lot&lt;/b&gt; of hardlinks if your maxthreads isn&apos;t configured low like this.&lt;/p&gt;

&lt;p&gt;2. Do you know what file the group-lock-holding lfs is trying to access?&#160; If it has a group lock on that file, it shouldn&apos;t need to ask for a lock to do the glimpse like that.&#160; This is less important since other things could still cause the deadlock, but I noticed it.&lt;/p&gt;</comment>
                            <comment id="363354" author="adilger" created="Sat, 18 Feb 2023 01:30:00 +0000"  >&lt;p&gt;I would &lt;em&gt;think&lt;/em&gt; that one &quot;&lt;tt&gt;lfs migrate -b&lt;/tt&gt;&quot; thread would get a group lock on the file, and the other threads would block &lt;b&gt;on the client&lt;/b&gt; waiting to get the group lock (or any lock, really).  &lt;b&gt;However&lt;/b&gt;, I realize that each &quot;&lt;tt&gt;lfs migrate -b&lt;/tt&gt;&quot; thread is using a &lt;b&gt;different&lt;/b&gt; group ID, so these will all be conflicting on the OSS and not getting some &lt;tt&gt;LDLM_FL_CBPENDING&lt;/tt&gt; reply that puts the client back to sleep and doesn&apos;t keep the OST threads blocked?  I don&apos;t recall the exact details of group locks, but it might be that they are not treated like &quot;normal&quot; locks because there are normally so few of them used at one time.&lt;/p&gt;

&lt;p&gt;The second (orthogonal) comment is that having threads doing parallel migration of the same file is of course sub-optimal.  I guess if there are a lot of hard links to the same file from different directory trees that this may happen, but I definitely wouldn&apos;t recommend it as something to do intentionally.  IFF you have a smart parallel migration tool (&lt;b&gt;not&lt;/b&gt; &quot;&lt;tt&gt;lfs migrate&lt;/tt&gt;&quot;) then it should use the &lt;b&gt;same&lt;/b&gt; group ID across all threads to allow them to access the file at the same time.&lt;/p&gt;

&lt;p&gt;To avoid this kind of situation (independently of fixing how OSS handles conflicting group locks) it might make sense to change &quot;&lt;tt&gt;lfs migrate -b&lt;/tt&gt;&quot; to use a &quot;trylock&quot; for the group lock and bail out if there is a conflicting group lock on the file already (assuming this is possible with current semantics?).  It doesn&apos;t make sense for the different &quot;&lt;tt&gt;lfs migrate -b&lt;/tt&gt;&quot; instances to repeatedly migrate the same file, even after they eventually get the lock, that is just a waste of bandwidth.&lt;/p&gt;</comment>
                            <comment id="363490" author="eaujames" created="Mon, 20 Feb 2023 17:06:42 +0000"  >&lt;p&gt;I think that using &quot;trylock&quot; will not resolve the issue, because the &quot;lfs migrate&quot; processes hang before trying to get the group lock.&lt;/p&gt;

&lt;p&gt;&quot;lfs migrate&quot; does something like this on the target file:&lt;/p&gt;
&lt;ol&gt;
	&lt;li&gt;stat the file&lt;/li&gt;
	&lt;li&gt;get data version&lt;/li&gt;
	&lt;li&gt;take group lock&lt;/li&gt;
	&lt;li&gt;read the file to copy it into the volatile file&lt;/li&gt;
	&lt;li&gt;swap the layouts&lt;/li&gt;
	&lt;li&gt;release the group lock&lt;/li&gt;
&lt;/ol&gt;


&lt;p&gt;So, on the OSS, the first request that gets the group lock will hang the other &quot;stat&quot; and &quot;get data version&quot; requests that are pending inside the NRS policy. This is because &quot;stat&quot; and &quot;get_data_version&quot; try to take a PR extent lock on the OST object, which conflicts with the group lock.&lt;/p&gt;
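A sketch of that ordering (hypothetical Python stubs, not the real liblustreapi calls): steps 1 and 2 issue their conflicting PR requests before the group lock is even requested, which is why a trylock at step 3 would not help.

```python
# Ordering sketch of the "lfs migrate" steps above (stand-in stubs only).
trace = []

def stat_file():        trace.append("stat")          # needs a PR extent lock
def get_data_version(): trace.append("data_version")  # PR extent lock, server side
def group_lock():       trace.append("group_lock")    # where a trylock would sit
def copy_and_swap():    trace.append("copy_swap")     # read into volatile, swap layouts
def group_unlock():     trace.append("group_unlock")

for step in (stat_file, get_data_version, group_lock, copy_and_swap, group_unlock):
    step()

# The hang happens in the first two steps, i.e. before any trylock could bail out.
assert trace.index("data_version") < trace.index("group_lock")
```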

&lt;p&gt;Each time, the resource&apos;s granted list holds one group lock and the waiting list holds only &apos;PR&apos; locks.&lt;/p&gt;</comment>
                            <comment id="363505" author="paf0186" created="Mon, 20 Feb 2023 21:26:27 +0000"  >&lt;p&gt;Etienne,&lt;/p&gt;

&lt;p&gt;Can you respond to my comment?&#160; Why is the max thread count so low - is that configuration correct?&lt;/p&gt;

&lt;p&gt;Also, what is the first lfs - the one which holds the group lock - blocked on?&#160; It shouldn&apos;t need a glimpse lock if it holds a group lock.&lt;/p&gt;</comment>
                            <comment id="363512" author="adilger" created="Tue, 21 Feb 2023 02:49:28 +0000"  >&lt;p&gt;Could a glimpse still return attributes for an object with a group lock?  I don&apos;t see why that isn&apos;t possible...  The glimpse PR lock should fail because the lock is held, so all that would be returned are the attributes. &lt;/p&gt;</comment>
                            <comment id="363513" author="adilger" created="Tue, 21 Feb 2023 02:52:13 +0000"  >&lt;p&gt;Patrick, I suspect the low thread count is just to demonstrate the problem more easily with fewer client threads. &lt;/p&gt;</comment>
                            <comment id="363517" author="paf0186" created="Tue, 21 Feb 2023 04:02:09 +0000"  >&lt;p&gt;I guess I&apos;m making the point that I&apos;m not sure it&apos;s a real problem in practice, and I&apos;m curious about if it was seen with a normal configuration?&#160; Or what the trigger was?&#160; Etc.&#160; Typical max thread counts on an OSS - a few years ago, and I think the ClusterStor OSSes are on the larger side, but still - was something like 768.&lt;/p&gt;</comment>
                            <comment id="363547" author="eaujames" created="Tue, 21 Feb 2023 13:00:06 +0000"  >&lt;p&gt;Patrick,&lt;br/&gt;
I used the default configuration for OST threads.&lt;br/&gt;
The number of threads is low because of the VM used to reproduce this issue: 2 cores, 1 CPT, 1.8 GB of RAM:&lt;br/&gt;
OSS_THR_FACTOR = 7&lt;br/&gt;
OSS_NTHRS_BASE = 64&lt;br/&gt;
nbt_threads = 64 + 64 * 0 / 2 + 2 * 7 = 78&lt;/p&gt;
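&lt;p&gt;Reproducing that arithmetic (a sketch of the sizing as given above; the constant names follow the Lustre source, but the exact in-tree formula may differ):&lt;/p&gt;

```python
# OSS service-thread sizing for this 2-core, 1-CPT VM (as computed above).
OSS_NTHRS_BASE = 64
OSS_THR_FACTOR = 7
ncpts, ncores = 1, 2

threads_max = (OSS_NTHRS_BASE
               + OSS_NTHRS_BASE * (ncpts - 1) // 2
               + ncores * OSS_THR_FACTOR)
assert threads_max == 78  # matches ost.OSS.ost.threads_max on the VM
```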

&lt;p&gt;But I think this can be reproduced on an OSS with a lot of CPTs: it can starve all the threads of a particular CPT, because client NIDs are bound to CPTs.&lt;br/&gt;
e.g.:&lt;br/&gt;
On a production OSS (DDN 18KX VM), we have 9 CPTs with default threads_max=576, so only 64 threads per CPT before hanging a CPT.&lt;/p&gt;

&lt;p&gt;The &quot;lfs&quot; process that has the group lock is in:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt; #0 [ffff9bf4450a7b38] __schedule at ffffffffa619d018                      
 #1 [ffff9bf4450a7ba0] schedule at ffffffffa619d3e9                        
 #2 [ffff9bf4450a7bb0] cl_sync_io_wait at ffffffffc0b32d55 [obdclass]      
 #3 [ffff9bf4450a7c30] cl_lock_request at ffffffffc0b2e783 [obdclass]      
 #4 [ffff9bf4450a7c68] cl_glimpse_lock at ffffffffc11b4339 [lustre]        
 #5 [ffff9bf4450a7cb0] cl_glimpse_size0 at ffffffffc11b4705 [lustre]       
 #6 [ffff9bf4450a7d08] ll_file_aio_read at ffffffffc1163706 [lustre]       
 #7 [ffff9bf4450a7de8] ll_file_read at ffffffffc1163fa1 [lustre]           
 #8 [ffff9bf4450a7ed0] vfs_read at ffffffffa5c4e3ff                        
 #9 [ffff9bf4450a7f00] sys_pread64 at ffffffffa5c4f472                     
#10 [ffff9bf4450a7f50] system_call_fastpath at ffffffffa61aaf92  
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;It tries to read the file. The glimpse request is stuck on the OSS in the FIFO policy before ptlrpc_server_handle_request() (no more threads to handle the request). So the process is not able to release its group lock and hangs the whole CPT.&lt;/p&gt;

&lt;p&gt;This has the same behavior as &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-15132&quot; title=&quot;Parallel data accesses on a release file could hang a MDT&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-15132&quot;&gt;&lt;del&gt;LU-15132&lt;/del&gt;&lt;/a&gt;, but for an OSS.&lt;/p&gt;</comment>
                            <comment id="363587" author="paf0186" created="Tue, 21 Feb 2023 16:02:01 +0000"  >&lt;p&gt;OK, that makes sense, especially re the CPT concern.&#160; I&apos;m still a little skeptical about this as an issue in practice, but that does make more sense.&lt;/p&gt;

&lt;p&gt;RE: the lfs process.&#160; You&apos;re saying &quot;the file&quot;, is that the same file it has a group lock on?&#160; Like I said, it shouldn&apos;t need to do a glimpse if it holds a group lock, the client should see it has a write lock on the whole file and not glimpse.&#160; That&apos;s some sort of bug, if it is for sure trying to glimpse a file it has group locked.&lt;/p&gt;

&lt;p&gt;So if that&apos;s indeed the case, we should be able to solve that, &lt;b&gt;but&lt;/b&gt; it would be possible for this client to get stuck doing I/O instead of requesting a glimpse lock.&#160; If it were trying to do a read or a write RPC, it would have the same issue with thread exhaustion because the other threads are waiting for this lock.&lt;/p&gt;

&lt;p&gt;We have a dependency loop where the other threads/clients are waiting for this lock, and this lock is waiting for &lt;b&gt;either&lt;/b&gt; a lock request or an I/O request.&#160; If the lock request is on the same file, then that&apos;s a bug - it has a group lock, which should let the client do anything.&#160; If the lock request is on a different file, then we have a problem because the group lock can&apos;t be called back.&#160; We could group lock that second file first, but it would be possible to get stuck if the other client requests arrived between the two group lock requests, exhausting the OST threads so the second group lock request can&apos;t be serviced.&lt;/p&gt;

&lt;p&gt;If we moved group lock requests to their own portal, the problem with &lt;b&gt;group locks&lt;/b&gt; goes away if any program using group locks &lt;b&gt;always uses group locks&lt;/b&gt; to access any files.&lt;/p&gt;

&lt;p&gt;But the basic problem of thread exhaustion under a write lock is still possible - client 1 has a PW lock on a file and is going to do a write (or a read, the key part is the type of lock), and other clients/client threads all try to access the same file.&#160; They achieve OST thread exhaustion with &lt;b&gt;lock requests&lt;/b&gt; while client 1 is holding the lock but before it can do the write.&lt;/p&gt;

&lt;p&gt;So group locks can create group lock &amp;lt;-&amp;gt; regular lock dependency chains, and regular locks can create lock &amp;lt;-&amp;gt; I/O dependency chains.&#160; In both cases, the chains can be turned into loops by thread exhaustion, so the second item - the second lock request while holding the first lock, or I/O while holding a lock - cannot be acquired.&lt;/p&gt;

&lt;p&gt;In both cases, the &apos;canonical&apos; solution used on the MDT side is to put the different request types into different portals.&#160; So in our case, that would mean group locks to their own portal, and I/O and lock requests to separate portals.&#160; Yuck, but I don&apos;t see another way around it in the general case.&lt;/p&gt;</comment>
                            <comment id="363590" author="paf0186" created="Tue, 21 Feb 2023 16:05:57 +0000"  >&lt;p&gt;Note also:&lt;br/&gt;
Since group locks can&apos;t be called back by the server, including for timeout (they are not subject to timeouts), it&apos;s always possible to use them to cause a hang.&#160; Request a group lock on a file and try to access it from other threads.&#160; Accesses to that file hang indefinitely.&lt;/p&gt;

&lt;p&gt;There are also more complicated scenarios, like group locking one file, then accessing another file &lt;b&gt;without a group lock&lt;/b&gt;, in that case you&apos;ve created a group lock - regular lock dependency, where you could get thread exhaustion &lt;b&gt;even if group locks have their own portal&lt;/b&gt;, because that regular lock request happens while the group lock is being held, so we can get a loop where the group lock causes thread exhaustion and the regular lock request can&apos;t be serviced.&#160; (That&apos;s what we&apos;re seeing here, actually - Though the glimpse might not be on a separate file?)&lt;/p&gt;</comment>
                            <comment id="363591" author="eaujames" created="Tue, 21 Feb 2023 16:06:00 +0000"  >&lt;p&gt;Andreas,&lt;br/&gt;
I think the issue here is &quot;llapi_get_data_version()&quot; being called outside the group lock.&lt;/p&gt;

&lt;p&gt;For &quot;data version&quot;, the LDLM lock is taken server-side:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;static int osc_io_data_version_start(const struct lu_env *env,
                                     const struct cl_io_slice *slice)
{
.......
        if (dv-&amp;gt;dv_flags &amp;amp; (LL_DV_RD_FLUSH | LL_DV_WR_FLUSH)) {
                oa-&amp;gt;o_valid |= OBD_MD_FLFLAGS;
                oa-&amp;gt;o_flags |= OBD_FL_SRVLOCK;                                       &amp;lt;--------------
                if (dv-&amp;gt;dv_flags &amp;amp; LL_DV_WR_FLUSH)
                        oa-&amp;gt;o_flags |= OBD_FL_FLUSH;
        }


enum obdo_flags {
....
        OBD_FL_CREATE_CROW  = 0x00000400, /* object should be create on write */
        OBD_FL_SRVLOCK      = 0x00000800, /* delegate DLM locking to server */          &amp;lt;--------------
        OBD_FL_CKSUM_CRC32  = 0x00001000, /* CRC32 checksum type */
....
}
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Therefore, it will monopolize an OSS thread waiting for the lock if there is a conflict, and the server-side lock has no timeout:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;static int ofd_getattr_hdl(struct tgt_session_info *tsi)                      
{                                                                             
......                  
                                                                              
        srvlock = tsi-&amp;gt;tsi_ost_body-&amp;gt;oa.o_valid &amp;amp; OBD_MD_FLFLAGS &amp;amp;&amp;amp;           
                  tsi-&amp;gt;tsi_ost_body-&amp;gt;oa.o_flags &amp;amp; OBD_FL_SRVLOCK;             
                                                                              
        if (srvlock) {                                                        
                if (unlikely(tsi-&amp;gt;tsi_ost_body-&amp;gt;oa.o_flags &amp;amp; OBD_FL_FLUSH))   
                        lock_mode = LCK_PW;                                   
                                                                              
                rc = tgt_extent_lock(tsi-&amp;gt;tsi_env,                            
                                     tsi-&amp;gt;tsi_tgt-&amp;gt;lut_obd-&amp;gt;obd_namespace,    
                                     &amp;amp;tsi-&amp;gt;tsi_resid, 0, OBD_OBJECT_EOF, &amp;amp;lh, 
                                     lock_mode, &amp;amp;flags);                      
                if (rc != 0)                                                  
                        RETURN(rc);                                           
        }    
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;From crash output on the OSS:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;crash&amp;gt; bt 17443                                                                  
PID: 17443  TASK: ffff8c4ef5841080  CPU: 1   COMMAND: &quot;ll_ost00_010&quot;
 #0 [ffff8c4f17c9b9a0] __schedule at ffffffff94b9d018
 #1 [ffff8c4f17c9ba08] schedule at ffffffff94b9d3e9
 #2 [ffff8c4f17c9ba18] ldlm_completion_ast at ffffffffc12ff2cd [ptlrpc]
 #3 [ffff8c4f17c9bab8] ldlm_cli_enqueue_local at ffffffffc12fd50c [ptlrpc]
 #4 [ffff8c4f17c9bb58] tgt_extent_lock at ffffffffc1349f4a [ptlrpc]
 #5 [ffff8c4f17c9bc20] ofd_getattr_hdl at ffffffffc1917315 [ofd]
 #6 [ffff8c4f17c9bc98] tgt_request_handle at ffffffffc135277f [ptlrpc]
 #7 [ffff8c4f17c9bd28] ptlrpc_server_handle_request at ffffffffc129d5d3 [ptlrpc]
 #8 [ffff8c4f17c9bde0] ptlrpc_main at ffffffffc129f25c [ptlrpc]
 #9 [ffff8c4f17c9bec8] kthread at ffffffff944c5e61

crash&amp;gt; p ((struct tgt_session_info*)0xffff8c4f17848900)-&amp;gt;tsi_jobid      
$5 = 0xffff8c4ec2e535c0 &quot;lfs.0&quot;
crash&amp;gt; p/x ((struct tgt_session_info*)0xffff8c4f17848900)-&amp;gt;tsi_ost_body-&amp;gt;oa.o_flags  
$2 = 0x800
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="363718" author="gerrit" created="Wed, 22 Feb 2023 11:34:14 +0000"  >&lt;p&gt;&quot;Etienne AUJAMES &amp;lt;eaujames@ddn.com&amp;gt;&quot; uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/c/fs/lustre-release/+/50113&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/c/fs/lustre-release/+/50113&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-16571&quot; title=&quot;&amp;quot;lfs migrate -b&amp;quot; can cause thread starvation on OSS&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-16571&quot;&gt;LU-16571&lt;/a&gt; utils: fix parallel &quot;lfs migrate -b&quot; on hard links&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 1891b6e3a86159359e55855525868bf061f02615&lt;/p&gt;</comment>
                            <comment id="365182" author="gerrit" created="Wed, 8 Mar 2023 03:26:22 +0000"  >&lt;p&gt;&quot;Oleg Drokin &amp;lt;green@whamcloud.com&amp;gt;&quot; merged in patch &lt;a href=&quot;https://review.whamcloud.com/c/fs/lustre-release/+/50113/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/c/fs/lustre-release/+/50113/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-16571&quot; title=&quot;&amp;quot;lfs migrate -b&amp;quot; can cause thread starvation on OSS&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-16571&quot;&gt;LU-16571&lt;/a&gt; utils: fix parallel &quot;lfs migrate -b&quot; on hard links&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 2310f4b8a6b6050cccedd4982ce80aa1cfbd3fe1&lt;/p&gt;</comment>
                            <comment id="373282" author="gerrit" created="Wed, 24 May 2023 09:09:37 +0000"  >&lt;p&gt;&quot;Etienne AUJAMES &amp;lt;eaujames@ddn.com&amp;gt;&quot; uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/c/fs/lustre-release/+/51112&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/c/fs/lustre-release/+/51112&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-16571&quot; title=&quot;&amp;quot;lfs migrate -b&amp;quot; can cause thread starvation on OSS&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-16571&quot;&gt;LU-16571&lt;/a&gt; utils: fix parallel &quot;lfs migrate -b&quot; on hard links&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_15&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: b708deae6307eea61527ecf1dfac23ca4c901a64&lt;/p&gt;</comment>
                            <comment id="373284" author="gerrit" created="Wed, 24 May 2023 09:28:52 +0000"  >&lt;p&gt;&quot;Etienne AUJAMES &amp;lt;eaujames@ddn.com&amp;gt;&quot; uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/c/fs/lustre-release/+/51114&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/c/fs/lustre-release/+/51114&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-16571&quot; title=&quot;&amp;quot;lfs migrate -b&amp;quot; can cause thread starvation on OSS&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-16571&quot;&gt;LU-16571&lt;/a&gt; utils: fix parallel &quot;lfs migrate -b&quot; on hard links&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_12&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: aee0ce61a24ff4f965b6e0e5b263e0c3038cc8c1&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i03efz:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>