[LU-16571] "lfs migrate -b" can cause thread starvation on OSS

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version: Lustre 2.16.0
    • Severity: 3
    • Environment: reproduced on VMs with ldiskfs Lustre 2.15.54:
      1 client
      1 MDS: 2 MDTs
      1 OSS: 2 OSTs

    Description

      I am not 100% sure, but I think we hit the following issue in production with parallel "lfs migrate -b"
      instances running on files with several hard links:

      OSS setup:

      [root@oss ~]# lctl get_param ost.OSS.ost.threads_* cpu_partition_table
      ost.OSS.ost.threads_max=78
      ost.OSS.ost.threads_min=3
      ost.OSS.ost.threads_started=4
      cpu_partition_table=0   : 0 1
      [root@oss ~]# mount -tlustre
      /dev/mapper/ost1_flakey on /...
      /dev/mapper/ost2_flakey on /...
      

      Reproducer with 1 client:

      [root@client test]# lfs setstripe  -c2 .
      [root@client test]# dd if=/dev/zero of=test1 bs=1M count=100                                   
      [root@client test]# mkdir links                                              
      [root@client test]# printf "%s\n" link{1..100} | xargs -I{} ln test1 links/{}
      [root@client test]# find -type f | xargs -P100 -I{} lfs migrate -c2 {}
      # --> the command hangs
      

      Node states
      Requests queued on the OSS OST service:

      [root@oss ~]# lctl get_param -n ost.OSS.ost.nrs_policies    
      
      regular_requests:
        - name: fifo
          state: started
          fallback: yes
          queued: 125                 
          active: 76                  
      ...
      

      OSS dmesg:

      [369238.270582] LustreError: 13117:0:(ldlm_request.c:124:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1676651767, 300s ago); not entering recovery in server code, just going back to sleep ns: 
      filter-lustrefs-OST0000_UUID lock: ffff967687a826c0/0xdb3bda27f476b2cd lrc: 3/1,0 mode: --/PR res: [0xc46:0x0:0x0].0x0 rrc: 40 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) gid 0 flags: 0x400
      10000000000 nid: local remote: 0x0 expref: -99 pid: 13117 timeout: 0 lvb_type: 0
      [369238.288701] LustreError: 13117:0:(ldlm_request.c:124:ldlm_expired_completion_wait()) Skipped 72 previous similar messages
      [369688.315515] Lustre: 13182:0:(service.c:1453:ptlrpc_at_send_early_reply()) @@@ Could not add any time (5/-150), not sending early reply  req@ffff9676632edf80 x1758012803008832/t0(0) o101->d58673c6-592c-4241-b
      a90-fd6a89dece56@10.0.2.6@tcp:617/0 lens 328/0 e 0 to 0 dl 1676652522 ref 2 fl New:/0/ffffffff rc 0/-1 job:'client.29583'
      [369688.315528] Lustre: 13182:0:(service.c:1453:ptlrpc_at_send_early_reply()) Skipped 1 previous similar message
      [369693.860227] Lustre: 13182:0:(service.c:1453:ptlrpc_at_send_early_reply()) @@@ Could not add any time (5/-150), not sending early reply  req@ffff96769dcb5b00 x1758005042515584/t0(0) o6->lustrefs-MDT0000-mdtlo
      v_UUID@10.0.2.4@tcp:622/0 lens 544/0 e 0 to 0 dl 1676652527 ref 2 fl New:/0/ffffffff rc 0/-1 job:'osp-syn-0-0.0'
      [369693.860239] Lustre: 13182:0:(service.c:1453:ptlrpc_at_send_early_reply()) Skipped 163 previous similar messages
      [369694.392072] Lustre: lustrefs-OST0000: Client d58673c6-592c-4241-ba90-fd6a89dece56 (at 10.0.2.6@tcp) reconnecting
      [369694.392084] Lustre: Skipped 1 previous similar message
      [369699.869588] Lustre: lustrefs-OST0000: Client lustrefs-MDT0000-mdtlov_UUID (at 10.0.2.4@tcp) reconnecting
      [369699.869591] Lustre: Skipped 1 previous similar message
      [370140.452612] ptlrpc_watchdog_fire: 72 callbacks suppressed
      [370140.452620] Lustre: ll_ost00_025: service thread pid 13134 was inactive for 1202.140 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging p
      urposes:
      [370140.452623] Lustre: Skipped 2 previous similar messages
      [370140.452626] Pid: 13134, comm: ll_ost00_025 3.10.0-1160.59.1.el7.centos.plus.x86_64 #1 SMP Wed Feb 23 17:40:21 UTC 2022
      [370140.452631] Call Trace:
      [370140.452705] [<0>] ldlm_completion_ast+0x62d/0xa40 [ptlrpc]
      [370140.452767] [<0>] ldlm_cli_enqueue_local+0x25c/0x880 [ptlrpc]
      [370140.452818] [<0>] tgt_extent_lock+0xea/0x2a0 [ptlrpc]
      [370140.452827] [<0>] ofd_getattr_hdl+0x385/0x750 [ofd]
      [370140.452874] [<0>] tgt_request_handle+0x92f/0x19c0 [ptlrpc]
      [370140.452929] [<0>] ptlrpc_server_handle_request+0x253/0xc00 [ptlrpc]
      [370140.452963] [<0>] ptlrpc_main+0xc3c/0x15f0 [ptlrpc]
      [370140.452968] [<0>] kthread+0xd1/0xe0
      [370140.452972] [<0>] ret_from_fork_nospec_begin+0x21/0x21
      [370140.452993] [<0>] 0xfffffffffffffffe
      ....
      

      Client dmesg (reconnections):

      [373336.926476] Lustre: lustrefs-OST0000-osc-ffff9bf4bb70f000: Connection restored to 10.0.2.5@tcp (at 10.0.2.5@tcp)
      [373336.926481] Lustre: Skipped 1 previous similar message
      [374092.440057] Lustre: 3473:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1676654035/real 1676654035]  req@ffff9bf48c66e400 x1758012803097024/t0(0) o17->lustrefs-OST0001-osc-ffff9bf4bb70f000@10.0.2.5@tcp:28/4 lens 456/432 e 0 to 1 dl 1676654791 ref 1 fl Rpc:XQr/2/ffffffff rc -11/-1 job:''
      [374092.440068] Lustre: 3473:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 164 previous similar messages
      [374092.440098] Lustre: lustrefs-OST0000-osc-ffff9bf4bb70f000: Connection to lustrefs-OST0000 (at 10.0.2.5@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [374092.440106] Lustre: Skipped 2 previous similar messages
      [374092.450347] Lustre: lustrefs-OST0000-osc-ffff9bf4bb70f000: Connection restored to 10.0.2.5@tcp (at 10.0.2.5@tcp)
      [374092.450354] Lustre: Skipped 1 previous similar message
      [374848.233487] Lustre: 3473:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1676654791/real 1676654791]  req@ffff9bf44b5b9f80 x1758012803195712/t0(0) o17->lustrefs-OST0000-osc-ffff9bf4bb70f000@10.0.2.5@tcp:28/4 lens 456/432 e 0 to 1 dl 1676655547 ref 1 fl Rpc:XQr/2/ffffffff rc -11/-1 job:''
      [374848.233499] Lustre: 3473:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 164 previous similar messages
      [374848.233550] Lustre: lustrefs-OST0000-osc-ffff9bf4bb70f000: Connection to lustrefs-OST0000 (at 10.0.2.5@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [374848.246634] Lustre: lustrefs-OST0000-osc-ffff9bf4bb70f000: Connection restored to 10.0.2.5@tcp (at 10.0.2.5@tcp)
      [374848.246645] Lustre: Skipped 1 previous similar message
      [375031.816746] Lustre: 3473:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1676655547/real 1676655547]  req@ffff9bf44b5bb180 x1758012803283392/t0(0) o17->lustrefs-OST0001-osc-ffff9bf4bb70f000@10.0.2.5@tcp:28/4 lens 456/432 e 7 to 1 dl 1676655730 ref 1 fl Rpc:XQr/2/ffffffff rc -11/-1 job:''
      [375031.816758] Lustre: 3473:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 168 previous similar messages
      [375031.816775] Lustre: lustrefs-OST0000-osc-ffff9bf4bb70f000: Connection to lustrefs-OST0000 (at 10.0.2.5@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [375031.816785] Lustre: Skipped 2 previous similar messages
      

      At this point, the OSS is not able to reply to requests on the OST service.

      All the threads on the OSS are waiting for an extent lock on the OST objects:

      crash> foreach 'll_ost00' bt | grep -c "ldlm_completion_ast"                                  
      76
      

      OST resources:

      crash> ldlm_resource.lr_name -x 0xffff96768edc7600,
        lr_name = {
          name = {0x7fc, 0x0, 0x0, 0x0}
        }
      crash> ldlm_resource.lr_name -x 0xffff96768edc7300  
        lr_name = {
          name = {0xd06, 0x0, 0x0, 0x0}
        }
      
      
      crash> ldlm_resource.lr_granted -o 0xffff96768edc7600
      struct ldlm_resource {
        [ffff96768edc7620] struct list_head lr_granted;
      }
      crash> list -H ffff96768edc7620 -s ldlm_lock.l_granted_mode  ldlm_lock.l_res_link        
      ffff967646789200
        l_granted_mode = LCK_GROUP
      crash> ldlm_resource.lr_waiting -o 0xffff96768edc7600
      struct ldlm_resource {
        [ffff96768edc7630] struct list_head lr_waiting;
      }
      crash> list -H ffff96768edc7630 -s ldlm_lock.l_req_mode  ldlm_lock.l_res_link         
      ffff96766a297d40
        l_req_mode = LCK_PR
      ...
      crash> list -H ffff96768edc7630 | wc -l
      38
      
      Same thing for 0xffff96768edc7300 resource:
      ....
      crash> ldlm_resource.lr_waiting -o 0xffff96768edc7300 
      struct ldlm_resource {
        [ffff96768edc7330] struct list_head lr_waiting;
      }
      crash> list -H ffff96768edc7330 | wc -l
      38
      

      38 + 38 = 76 locks are waiting for 2 group locks on the 2 OST objects of the file.

      On the client, the "lfs migrate" process that has the group lock is waiting for a reply from the OSS:

      PID: 6136   TASK: ffff9bf44e0b2100  CPU: 0   COMMAND: "lfs"                
       #0 [ffff9bf4450a7b38] __schedule at ffffffffa619d018                      
       #1 [ffff9bf4450a7ba0] schedule at ffffffffa619d3e9                        
       #2 [ffff9bf4450a7bb0] cl_sync_io_wait at ffffffffc0b32d55 [obdclass]      
       #3 [ffff9bf4450a7c30] cl_lock_request at ffffffffc0b2e783 [obdclass]      
       #4 [ffff9bf4450a7c68] cl_glimpse_lock at ffffffffc11b4339 [lustre]        
       #5 [ffff9bf4450a7cb0] cl_glimpse_size0 at ffffffffc11b4705 [lustre]       
       #6 [ffff9bf4450a7d08] ll_file_aio_read at ffffffffc1163706 [lustre]       
       #7 [ffff9bf4450a7de8] ll_file_read at ffffffffc1163fa1 [lustre]           
       #8 [ffff9bf4450a7ed0] vfs_read at ffffffffa5c4e3ff                        
       #9 [ffff9bf4450a7f00] sys_pread64 at ffffffffa5c4f472                     
      #10 [ffff9bf4450a7f50] system_call_fastpath at ffffffffa61aaf92            
      

      The other "lfs migrate" processes are calling stat() or llapi_get_data_version() (before taking the group lock):

      PID: 6137   TASK: ffff9bf44e0b3180  CPU: 1   COMMAND: "lfs"                    
       #0 [ffff9bf4ba1d7ac0] __schedule at ffffffffa619d018                          
       #1 [ffff9bf4ba1d7b28] schedule at ffffffffa619d3e9                            
       #2 [ffff9bf4ba1d7b38] schedule_timeout at ffffffffa619b0b1                    
       #3 [ffff9bf4ba1d7be8] wait_for_completion at ffffffffa619d79d                 
       #4 [ffff9bf4ba1d7c48] osc_io_data_version_end at ffffffffc10311e9 [osc]       
       #5 [ffff9bf4ba1d7c80] cl_io_end at ffffffffc0b3075f [obdclass]                
       #6 [ffff9bf4ba1d7cb0] lov_io_end_wrapper at ffffffffc10864eb [lov]            
       #7 [ffff9bf4ba1d7cd0] lov_io_data_version_end at ffffffffc10874a8 [lov]       
       #8 [ffff9bf4ba1d7cf8] cl_io_end at ffffffffc0b3075f [obdclass]                
       #9 [ffff9bf4ba1d7d28] cl_io_loop at ffffffffc0b334dd [obdclass]               
      #10 [ffff9bf4ba1d7d60] ll_ioc_data_version at ffffffffc1159f0b [lustre]        
      #11 [ffff9bf4ba1d7da8] ll_file_ioctl at ffffffffc11728c3 [lustre]              
      #12 [ffff9bf4ba1d7e80] do_vfs_ioctl at ffffffffa5c63ad0                        
      #13 [ffff9bf4ba1d7f00] sys_ioctl at ffffffffa5c63d81                           
      #14 [ffff9bf4ba1d7f50] system_call_fastpath at ffffffffa61aaf92                
      
      PID: 6158   TASK: ffff9bf4b9145280  CPU: 0   COMMAND: "lfs"                           
       #0 [ffff9bf444cd3bb8] __schedule at ffffffffa619d018                                 
       #1 [ffff9bf444cd3c20] schedule at ffffffffa619d3e9                                   
       #2 [ffff9bf444cd3c30] cl_sync_io_wait at ffffffffc0b32d55 [obdclass]                 
       #3 [ffff9bf444cd3cb0] cl_lock_request at ffffffffc0b2e783 [obdclass]                 
       #4 [ffff9bf444cd3ce8] cl_glimpse_lock at ffffffffc11b4339 [lustre]                   
       #5 [ffff9bf444cd3d30] cl_glimpse_size0 at ffffffffc11b4705 [lustre]                  
       #6 [ffff9bf444cd3d88] ll_getattr_dentry at ffffffffc116c5ee [lustre]                 
       #7 [ffff9bf444cd3e38] ll_getattr at ffffffffc116cafe [lustre]                        
       #8 [ffff9bf444cd3e48] vfs_getattr at ffffffffa5c53ee9                                
       #9 [ffff9bf444cd3e78] vfs_fstat at ffffffffa5c53f65                                  
      #10 [ffff9bf444cd3eb8] SYSC_newfstat at ffffffffa5c544d4                              
      #11 [ffff9bf444cd3f40] sys_newfstat at ffffffffa5c548ae                               
      

      Conclusion:
      The requests from the "lfs migrate" process holding the group lock are stuck in the NRS policy queue of the OST service because all the OSS service threads are waiting for its group lock.

      Workaround
      Temporarily increase the number of threads to dequeue all the requests:

      [root@oss ~]# lctl set_param ost.OSS.ost.threads_max=200
      
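      To detect this situation automatically, a hypothetical watchdog could poll the same parameters shown above. The sketch below is only an illustration: the popen()/system() approach, the threads_started == threads_max heuristic and the hard-coded value 200 mirror the manual workaround above and are assumptions, not a supported tool.

      #include <stdio.h>
      #include <stdlib.h>

      /* read a single integer parameter via "lctl get_param -n <param>" */
      static long read_param(const char *param)
      {
          char cmd[256], line[256];
          long val = -1;
          FILE *p;

          snprintf(cmd, sizeof(cmd), "lctl get_param -n %s", param);
          p = popen(cmd, "r");
          if (p && fgets(line, sizeof(line), p))
              val = strtol(line, NULL, 10);
          if (p)
              pclose(p);
          return val;
      }

      int main(void)
      {
          long started = read_param("ost.OSS.ost.threads_started");
          long max = read_param("ost.OSS.ost.threads_max");

          /* threads_started == threads_max while requests keep queueing in
           * the NRS (see nrs_policies above) is the signature of the hang;
           * only the thread counters are checked here */
          if (started >= 0 && max > 0 && started >= max) {
              fprintf(stderr, "OST threads exhausted (%ld/%ld), raising threads_max\n",
                      started, max);
              system("lctl set_param ost.OSS.ost.threads_max=200");  /* manual workaround */
          }
          return 0;
      }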

      Attachments

        Issue Links

          Activity

            paf0186 Patrick Farrell added a comment - - edited

            OK, that makes sense, especially re the CPT concern.  I'm still a little skeptical about this as an issue in practice, but that does make more sense.

            RE: the lfs process.  You're saying "the file", is that the same file it has a group lock on?  Like I said, it shouldn't need to do a glimpse if it holds a group lock, the client should see it has a write lock on the whole file and not glimpse.  That's some sort of bug, if it is for sure trying to glimpse a file it has group locked.

            So if that's indeed the case, we should be able to solve that, but it would be possible for this client to get stuck doing I/O instead of requesting a glimpse lock.  If it were trying to do a read or a write RPC, it would have the same issue with thread exhaustion because the other threads are waiting for this lock.

            We have a dependency loop where the other threads/clients are waiting for this lock, and this lock is waiting for either a lock request or an I/O request.  If the lock request is on the same file, then that's a bug - it has a group lock, which should let the client do anything.  If the lock request is on a different file, then we have a problem because the group lock can't be called back.  We could group lock that second file first, but it would be possible to get stuck if the other client requests arrived between the two group lock requests, exhausting the OST threads so the second group lock request can't be serviced.

            If we moved group lock requests to their own portal, the problem with group locks goes away if any program using group locks always uses group locks to access any files.

            But the basic problem of thread exhaustion under a write lock is still possible - client 1 has a PW lock on a file and is going to do a write (or a read, the key part is the type of lock), and other clients/client threads all try to access the same file.  They achieve OST thread exhaustion with lock requests while client 1 is holding the lock but before it can do the write.

            So group locks can create group lock <-> regular lock dependency chains, and regular locks can create lock <-> I/O dependency chains.  In both cases, the chains can be turned into loops by thread exhaustion, so the second item - the second lock request while holding the first lock, or I/O while holding a lock - cannot be acquired.

            In both cases, the 'canonical' solution used on the MDT side is to put the different request types into different portals.  So in our case, that would mean group locks to their own portal, and I/O and lock requests to separate portals.  Yuck, but I don't see another way around it in the general case.


            eaujames Etienne Aujames added a comment -

            Patrick,
            I use the default configuration for OST threads.
            The number of threads is low because of the VM used to reproduce this issue: 2 cores, 1 CPT, 1.8 GB of RAM:
            OSS_THR_FACTOR = 7
            OSS_NTHRS_BASE = 64
            nbt_threads = 64 + 64 * 0 / 2 + 2 * 7 = 78

            But I think this can be reproduced on an OSS with a lot of CPTs. It can starve all the threads of a particular CPT because client NIDs are bound to CPTs.
            e.g.:
            On a production OSS (DDN 18KX VM), we have 9 CPTs with the default threads_max=576, so only 64 threads per CPT before hanging a CPT.
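
            As a worked example of that arithmetic (the constants and the formula are transcribed from the numbers above, so this is only illustrative, not the exact server code):

            #include <stdio.h>

            /* values quoted above for the 2-core, 1-CPT reproducer VM */
            #define OSS_NTHRS_BASE  64
            #define OSS_THR_FACTOR  7

            int main(void)
            {
                int ncpts = 1, ncores = 2;

                /* nbt_threads = 64 + 64 * 0 / 2 + 2 * 7 = 78 */
                int threads_max = OSS_NTHRS_BASE +
                                  OSS_NTHRS_BASE * (ncpts - 1) / 2 +
                                  ncores * OSS_THR_FACTOR;
                printf("reproducer VM: threads_max = %d\n", threads_max);   /* 78 */

                /* production example: 9 CPTs sharing threads_max=576 leaves
                 * only 576 / 9 = 64 threads per CPT before it can be starved */
                printf("DDN 18KX OSS: threads per CPT = %d\n", 576 / 9);     /* 64 */
                return 0;
            }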

            The "lfs" process that has the group lock is in:

             #0 [ffff9bf4450a7b38] __schedule at ffffffffa619d018                      
             #1 [ffff9bf4450a7ba0] schedule at ffffffffa619d3e9                        
             #2 [ffff9bf4450a7bb0] cl_sync_io_wait at ffffffffc0b32d55 [obdclass]      
             #3 [ffff9bf4450a7c30] cl_lock_request at ffffffffc0b2e783 [obdclass]      
             #4 [ffff9bf4450a7c68] cl_glimpse_lock at ffffffffc11b4339 [lustre]        
             #5 [ffff9bf4450a7cb0] cl_glimpse_size0 at ffffffffc11b4705 [lustre]       
             #6 [ffff9bf4450a7d08] ll_file_aio_read at ffffffffc1163706 [lustre]       
             #7 [ffff9bf4450a7de8] ll_file_read at ffffffffc1163fa1 [lustre]           
             #8 [ffff9bf4450a7ed0] vfs_read at ffffffffa5c4e3ff                        
             #9 [ffff9bf4450a7f00] sys_pread64 at ffffffffa5c4f472                     
            #10 [ffff9bf4450a7f50] system_call_fastpath at ffffffffa61aaf92  
            

            It tries to read the file. The glimpse request is stuck on the OSS in the FIFO policy before ptlrpc_server_handle_request() (no more threads to handle the request). So the process is not able to release its group lock and hangs the whole CPT.

            This has the same behavior as LU-15132, but for an OSS.


            paf0186 Patrick Farrell added a comment -

            I guess I'm making the point that I'm not sure it's a real problem in practice, and I'm curious whether it was seen with a normal configuration, or what the trigger was, etc.  Typical max thread counts on an OSS - a few years ago, and I think the ClusterStor OSSes are on the larger side, but still - were something like 768.


            adilger Andreas Dilger added a comment -

            Patrick, I suspect the low thread count is just to demonstrate the problem more easily with fewer client threads.


            adilger Andreas Dilger added a comment -

            Could a glimpse still return attributes for an object with a group lock? I don't see why that isn't possible... The glimpse PR lock should fail because the lock is held, so all that would be returned are the attributes.


            paf0186 Patrick Farrell added a comment -

            Etienne,

            Can you respond to my comment?  Why is the max thread count so low - is that configuration correct?

            Also, what is the first lfs - the one which holds the group lock - blocked on?  It shouldn't need a glimpse lock if it holds a group lock.

            eaujames Etienne Aujames added a comment - - edited

            I think that using "trylock" will not resolve the issue because the "lfs migrate" processes hang before trying to get the group lock.

            "lfs migrate" does something like this on the target file (a rough C sketch of the sequence follows below):

            1. stat the file
            2. get data version
            3. take group lock
            4. read the files to copy it in the volatile
            5. swap the layout
            6. release the group lock

            So, on the OSS, the first request that gets the group lock will hang the other requests for "stat" and "get data version" that are pending inside the NRS policy. This is because the "stat" and "get_data_version" requests try to take a PR extent lock on the OST object, which conflicts with the group lock.

            Each time, there is one group lock on the resource's granted list and only 'PR' locks on the waiting list.
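
            For reference, here is a rough C sketch of that sequence built on the public liblustreapi calls (llapi_get_data_version(), llapi_group_lock(), llapi_fswap_layouts(), llapi_group_unlock()). The migrate_one() wrapper and the scratch file standing in for the volatile file are made up for illustration, and error handling is trimmed, so this is not the actual lfs code:

            /*
             * Sketch only: error handling trimmed, plain scratch file instead
             * of a Lustre volatile file.  Build roughly: cc sketch.c -llustreapi
             */
            #include <fcntl.h>
            #include <unistd.h>
            #include <sys/stat.h>
            #include <linux/types.h>
            #include <lustre/lustreapi.h>

            int migrate_one(const char *path, const char *scratch, int gid)
            {
                struct stat st;
                __u64 dv_before = 0, dv_after = 0;
                char buf[64 * 1024];
                ssize_t n;
                int fd, vfd, rc = -1;

                /* 1. stat the file (the client glimpses the size, i.e. asks
                 *    for a PR extent lock on the OST objects) */
                fd = open(path, O_RDONLY);
                if (fd < 0 || fstat(fd, &st) < 0)
                    return -1;

                /* 2. get the data version (handled as an OST getattr, which
                 *    also takes a PR extent lock on the OSS side) */
                if (llapi_get_data_version(fd, &dv_before, 0) < 0)
                    goto out;

                /* 3. take the group lock: the PR locks of the other instances
                 *    queue up behind this on the OST object */
                if (llapi_group_lock(fd, gid) < 0)
                    goto out;

                /* 4. copy the data; this read is where the hung instance sits
                 *    (ll_file_read -> cl_glimpse_size0 in the trace above) */
                vfd = open(scratch, O_WRONLY | O_CREAT | O_TRUNC, st.st_mode);
                while ((n = read(fd, buf, sizeof(buf))) > 0)
                    if (write(vfd, buf, n) != n)
                        break;

                /* 5. swap the layouts only if the file did not change meanwhile */
                llapi_get_data_version(fd, &dv_after, 0);
                if (dv_after == dv_before)
                    rc = llapi_fswap_layouts(fd, vfd, 0, 0, 0);

                /* 6. release the group lock */
                llapi_group_unlock(fd, gid);
                close(vfd);
            out:
                close(fd);
                return rc;
            }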


            adilger Andreas Dilger added a comment -

            I would think that one "lfs migrate -b" thread would get a group lock on the file, and the other threads would block on the client waiting to get the group lock (or any lock, really). However, I realize that each "lfs migrate -b" thread is using a different group ID, so these will all conflict on the OSS rather than getting some LDLM_FL_CBPENDING reply that puts the client back to sleep and doesn't keep the OST threads blocked? I don't recall the exact details of group locks, but it might be that they are not treated like "normal" locks because there are normally so few of them used at one time.

            The second (orthogonal) comment is that having threads doing parallel migration of the same file is of course sub-optimal. I guess if there are a lot of hard links to the same file from different directory trees that this may happen, but I definitely wouldn't recommend it as something to do intentionally. IFF you have a smart parallel migration tool (not "lfs migrate") then it should use the same group ID across all threads to allow them to access the file at the same time.

            To avoid this kind of situation (independently of fixing how OSS handles conflicting group locks) it might make sense to change "lfs migrate -b" to use a "trylock" for the group lock and bail out if there is a conflicting group lock on the file already (assuming this is possible with current semantics?). It doesn't make sense for the different "lfs migrate -b" instances to repeatedly migrate the same file, even after they eventually get the lock, that is just a waste of bandwidth.


            paf0186 Patrick Farrell added a comment -

            So, let's see...  This is very good analysis.

            Now, the key issue here is someone is holding a group lock, and they're making a request which can't complete, so everyone is backing up behind them.  So the competition is between lock requests - which are waiting indefinitely - and an I/O request.

            This seems like it could be possible without a group lock, actually, since the thread holding a write lock could make an I/O request while holding the lock, and that I/O request could get stuck in the same way.  The main thing the group lock does is make this more likely because it can't be cancelled even if it's not in use (so the window for other locks waiting for it is much larger), and there's no eviction possible either, so you're guaranteed to see a hang rather than pause, get evicted and maybe not notice the problem.

            Normally, we might solve something like this by moving one of the kinds of request to another portal, but lock requests and I/O requests are both things we'd like to keep on a single, large set of OST threads, since they're primary functions of the OSS (and linked to each other).

            So, a few further thoughts.

            I think I know why this doesn't 'normally' happen:
            1. The OSS can normally spawn more threads to handle requests up to maxthreads, which is normally very high.  Why did you set it so low?  In my experience - which may be out of date - the OSS maxthreads is normally set very high (hundreds and hundreds), and we let the OSS and the scheduler manage itself.  Threads are only used if there's work for them to do, but also they spend a lot of time asleep waiting for things, so it's normal to have maxthreads >> CPU count.

            Basically, this wouldn't happen unless the number of hard links > maxthreads, which is a lot of hard links if your maxthreads isn't configured low like this.

            2. Do you know what file the group-lock-holding lfs is trying to access?  If it has a group lock on that file, it shouldn't need to ask for a lock to do the glimpse like that.  This is less important since other things could still cause the deadlock, but I noticed it.


            People

              Assignee: eaujames Etienne Aujames
              Reporter: eaujames Etienne Aujames
              Votes: 0
              Watchers: 6

              Dates

                Created:
                Updated:
                Resolved: