  Lustre / LU-16571

"lfs migrate -b" can cause thread starvation on OSS


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.16.0
    • Affects Version/s: None
    • Component/s: None
    • Environment: Reproduced on VMs with ldiskfs Lustre 2.15.54:
      1 client
      1 MDS: 2 MDTs
      1 OSS: 2 OSTs
    • Severity: 3

    Description

      I am not 100% sure, but I think we hit the following issue in production with parallel "lfs migrate -b"
      instances on files with several hard links:

      OSS setup:

      [root@oss ~]# lctl get_param ost.OSS.ost.threads_* cpu_partition_table
      ost.OSS.ost.threads_max=78
      ost.OSS.ost.threads_min=3
      ost.OSS.ost.threads_started=4
      cpu_partition_table=0   : 0 1
      [root@oss ~]# mount -tlustre
      /dev/mapper/ost1_flakey on /...
      /dev/mapper/ost2_flakey on /...
      

      Reproducer with 1 client:

      [root@client test]# lfs setstripe  -c2 .
      [root@client test]# dd if=/dev/zero of=test1 bs=1M count=100                                   
      [root@client test]# mkdir links                                              
      [root@client test]# printf "%s\n" link{1..100} | xargs -I{} ln test1 links/{}
      [root@client test]# find -type f | xargs -P100 -I{} lfs migrate -c2 {}
      # --> the command hangs
      

      Node states

      Requests queued on the OST service:

      [root@oss ~]# lctl get_param -n ost.OSS.ost.nrs_policies    
      
      regular_requests:
        - name: fifo
          state: started
          fallback: yes
          queued: 125                 
          active: 76                  
      ...
      
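      To see which clients and opcodes are sitting in that queue, the per-service request history can also be dumped (an optional cross-check; req_history availability and output format depend on the Lustre version):

      [root@oss ~]# lctl get_param ost.OSS.ost.req_history | head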

      OSS dmesg:

      [369238.270582] LustreError: 13117:0:(ldlm_request.c:124:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1676651767, 300s ago); not entering recovery in server code, just going back to sleep ns: 
      filter-lustrefs-OST0000_UUID lock: ffff967687a826c0/0xdb3bda27f476b2cd lrc: 3/1,0 mode: --/PR res: [0xc46:0x0:0x0].0x0 rrc: 40 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) gid 0 flags: 0x400
      10000000000 nid: local remote: 0x0 expref: -99 pid: 13117 timeout: 0 lvb_type: 0
      [369238.288701] LustreError: 13117:0:(ldlm_request.c:124:ldlm_expired_completion_wait()) Skipped 72 previous similar messages
      [369688.315515] Lustre: 13182:0:(service.c:1453:ptlrpc_at_send_early_reply()) @@@ Could not add any time (5/-150), not sending early reply  req@ffff9676632edf80 x1758012803008832/t0(0) o101->d58673c6-592c-4241-b
      a90-fd6a89dece56@10.0.2.6@tcp:617/0 lens 328/0 e 0 to 0 dl 1676652522 ref 2 fl New:/0/ffffffff rc 0/-1 job:'client.29583'
      [369688.315528] Lustre: 13182:0:(service.c:1453:ptlrpc_at_send_early_reply()) Skipped 1 previous similar message
      [369693.860227] Lustre: 13182:0:(service.c:1453:ptlrpc_at_send_early_reply()) @@@ Could not add any time (5/-150), not sending early reply  req@ffff96769dcb5b00 x1758005042515584/t0(0) o6->lustrefs-MDT0000-mdtlo
      v_UUID@10.0.2.4@tcp:622/0 lens 544/0 e 0 to 0 dl 1676652527 ref 2 fl New:/0/ffffffff rc 0/-1 job:'osp-syn-0-0.0'
      [369693.860239] Lustre: 13182:0:(service.c:1453:ptlrpc_at_send_early_reply()) Skipped 163 previous similar messages
      [369694.392072] Lustre: lustrefs-OST0000: Client d58673c6-592c-4241-ba90-fd6a89dece56 (at 10.0.2.6@tcp) reconnecting
      [369694.392084] Lustre: Skipped 1 previous similar message
      [369699.869588] Lustre: lustrefs-OST0000: Client lustrefs-MDT0000-mdtlov_UUID (at 10.0.2.4@tcp) reconnecting
      [369699.869591] Lustre: Skipped 1 previous similar message
      [370140.452612] ptlrpc_watchdog_fire: 72 callbacks suppressed
      [370140.452620] Lustre: ll_ost00_025: service thread pid 13134 was inactive for 1202.140 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging p
      urposes:
      [370140.452623] Lustre: Skipped 2 previous similar messages
      [370140.452626] Pid: 13134, comm: ll_ost00_025 3.10.0-1160.59.1.el7.centos.plus.x86_64 #1 SMP Wed Feb 23 17:40:21 UTC 2022
      [370140.452631] Call Trace:
      [370140.452705] [<0>] ldlm_completion_ast+0x62d/0xa40 [ptlrpc]
      [370140.452767] [<0>] ldlm_cli_enqueue_local+0x25c/0x880 [ptlrpc]
      [370140.452818] [<0>] tgt_extent_lock+0xea/0x2a0 [ptlrpc]
      [370140.452827] [<0>] ofd_getattr_hdl+0x385/0x750 [ofd]
      [370140.452874] [<0>] tgt_request_handle+0x92f/0x19c0 [ptlrpc]
      [370140.452929] [<0>] ptlrpc_server_handle_request+0x253/0xc00 [ptlrpc]
      [370140.452963] [<0>] ptlrpc_main+0xc3c/0x15f0 [ptlrpc]
      [370140.452968] [<0>] kthread+0xd1/0xe0
      [370140.452972] [<0>] ret_from_fork_nospec_begin+0x21/0x21
      [370140.452993] [<0>] 0xfffffffffffffffe
      ....
      

      Client dmesg (reconnections):

      [373336.926476] Lustre: lustrefs-OST0000-osc-ffff9bf4bb70f000: Connection restored to 10.0.2.5@tcp (at 10.0.2.5@tcp)
      [373336.926481] Lustre: Skipped 1 previous similar message
      [374092.440057] Lustre: 3473:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1676654035/real 1676654035]  req@ffff9bf48c66e400 x1758012803097024/t0(0) o17->lustrefs-OST0001-osc-ffff9bf4bb70f000@10.0.2.5@tcp:28/4 lens 456/432 e 0 to 1 dl 1676654791 ref 1 fl Rpc:XQr/2/ffffffff rc -11/-1 job:''
      [374092.440068] Lustre: 3473:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 164 previous similar messages
      [374092.440098] Lustre: lustrefs-OST0000-osc-ffff9bf4bb70f000: Connection to lustrefs-OST0000 (at 10.0.2.5@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [374092.440106] Lustre: Skipped 2 previous similar messages
      [374092.450347] Lustre: lustrefs-OST0000-osc-ffff9bf4bb70f000: Connection restored to 10.0.2.5@tcp (at 10.0.2.5@tcp)
      [374092.450354] Lustre: Skipped 1 previous similar message
      [374848.233487] Lustre: 3473:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1676654791/real 1676654791]  req@ffff9bf44b5b9f80 x1758012803195712/t0(0) o17->lustrefs-OST0000-osc-ffff9bf4bb70f000@10.0.2.5@tcp:28/4 lens 456/432 e 0 to 1 dl 1676655547 ref 1 fl Rpc:XQr/2/ffffffff rc -11/-1 job:''
      [374848.233499] Lustre: 3473:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 164 previous similar messages
      [374848.233550] Lustre: lustrefs-OST0000-osc-ffff9bf4bb70f000: Connection to lustrefs-OST0000 (at 10.0.2.5@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [374848.246634] Lustre: lustrefs-OST0000-osc-ffff9bf4bb70f000: Connection restored to 10.0.2.5@tcp (at 10.0.2.5@tcp)
      [374848.246645] Lustre: Skipped 1 previous similar message
      [375031.816746] Lustre: 3473:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1676655547/real 1676655547]  req@ffff9bf44b5bb180 x1758012803283392/t0(0) o17->lustrefs-OST0001-osc-ffff9bf4bb70f000@10.0.2.5@tcp:28/4 lens 456/432 e 7 to 1 dl 1676655730 ref 1 fl Rpc:XQr/2/ffffffff rc -11/-1 job:''
      [375031.816758] Lustre: 3473:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 168 previous similar messages
      [375031.816775] Lustre: lustrefs-OST0000-osc-ffff9bf4bb70f000: Connection to lustrefs-OST0000 (at 10.0.2.5@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [375031.816785] Lustre: Skipped 2 previous similar messages
      

      At this point, the OSS is not able to reply to requests on the OST service.

      All the threads on the OSS are waiting for an extent lock on the OST objects:

      crash> foreach 'll_ost00' bt | grep -c "ldlm_completion_ast"                                  
      76
      

      OST resources:

      crash> ldlm_resource.lr_name -x 0xffff96768edc7600,
        lr_name = {
          name = {0x7fc, 0x0, 0x0, 0x0}
        }
      crash> ldlm_resource.lr_name -x 0xffff96768edc7300  
        lr_name = {
          name = {0xd06, 0x0, 0x0, 0x0}
        }
      
      
      crash> ldlm_resource.lr_granted -o 0xffff96768edc7600
      struct ldlm_resource {
        [ffff96768edc7620] struct list_head lr_granted;
      }
      crash> list -H ffff96768edc7620 -s ldlm_lock.l_granted_mode  ldlm_lock.l_res_link        
      ffff967646789200
        l_granted_mode = LCK_GROUP
      crash> ldlm_resource.lr_waiting -o 0xffff96768edc7600
      struct ldlm_resource {
        [ffff96768edc7630] struct list_head lr_waiting;
      }
      crash> list -H ffff96768edc7630 -s ldlm_lock.l_req_mode  ldlm_lock.l_res_link         
      ffff96766a297d40
        l_req_mode = LCK_PR
      ...
      crash> list -H ffff96768edc7630 | wc -l
      38
      
      Same thing for 0xffff96768edc7300 resource:
      ....
      crash> ldlm_resource.lr_waiting -o 0xffff96768edc7300 
      struct ldlm_resource {
        [ffff96768edc7330] struct list_head lr_waiting;
      }
      crash> list -H ffff96768edc7330 | wc -l
      38
      

      38 + 38 = 76 locks are waiting for 2 group locks on the 2 OST objects of the file.
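
      The same granted/waiting picture can also be obtained on a live OSS, without a crash dump, by checking the LDLM lock counters and dumping the namespaces to the Lustre debug log (a sketch; parameter names assumed from this 2.15.x setup):

      [root@oss ~]# lctl get_param ldlm.namespaces.filter-lustrefs-OST*.lock_count
      [root@oss ~]# lctl set_param ldlm.dump_namespaces=1
      [root@oss ~]# lctl dk /tmp/ldlm_namespaces.log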

      On the client, the "lfs migrate" process that has the group lock is waiting for a reply from the OSS:

      PID: 6136   TASK: ffff9bf44e0b2100  CPU: 0   COMMAND: "lfs"                
       #0 [ffff9bf4450a7b38] __schedule at ffffffffa619d018                      
       #1 [ffff9bf4450a7ba0] schedule at ffffffffa619d3e9                        
       #2 [ffff9bf4450a7bb0] cl_sync_io_wait at ffffffffc0b32d55 [obdclass]      
       #3 [ffff9bf4450a7c30] cl_lock_request at ffffffffc0b2e783 [obdclass]      
       #4 [ffff9bf4450a7c68] cl_glimpse_lock at ffffffffc11b4339 [lustre]        
       #5 [ffff9bf4450a7cb0] cl_glimpse_size0 at ffffffffc11b4705 [lustre]       
       #6 [ffff9bf4450a7d08] ll_file_aio_read at ffffffffc1163706 [lustre]       
       #7 [ffff9bf4450a7de8] ll_file_read at ffffffffc1163fa1 [lustre]           
       #8 [ffff9bf4450a7ed0] vfs_read at ffffffffa5c4e3ff                        
       #9 [ffff9bf4450a7f00] sys_pread64 at ffffffffa5c4f472                     
      #10 [ffff9bf4450a7f50] system_call_fastpath at ffffffffa61aaf92            
      
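      The stuck RPCs can also be seen from the client side by checking the OSC import state (a sketch; the exact fields in the import dump vary between versions):

      [root@client ~]# lctl get_param osc.lustrefs-OST*.import | grep -A3 rpcs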

      The other "lfs migrate" processes are calling stat() or llapi_get_data_version() (before taking the group lock):

      PID: 6137   TASK: ffff9bf44e0b3180  CPU: 1   COMMAND: "lfs"                    
       #0 [ffff9bf4ba1d7ac0] __schedule at ffffffffa619d018                          
       #1 [ffff9bf4ba1d7b28] schedule at ffffffffa619d3e9                            
       #2 [ffff9bf4ba1d7b38] schedule_timeout at ffffffffa619b0b1                    
       #3 [ffff9bf4ba1d7be8] wait_for_completion at ffffffffa619d79d                 
       #4 [ffff9bf4ba1d7c48] osc_io_data_version_end at ffffffffc10311e9 [osc]       
       #5 [ffff9bf4ba1d7c80] cl_io_end at ffffffffc0b3075f [obdclass]                
       #6 [ffff9bf4ba1d7cb0] lov_io_end_wrapper at ffffffffc10864eb [lov]            
       #7 [ffff9bf4ba1d7cd0] lov_io_data_version_end at ffffffffc10874a8 [lov]       
       #8 [ffff9bf4ba1d7cf8] cl_io_end at ffffffffc0b3075f [obdclass]                
       #9 [ffff9bf4ba1d7d28] cl_io_loop at ffffffffc0b334dd [obdclass]               
      #10 [ffff9bf4ba1d7d60] ll_ioc_data_version at ffffffffc1159f0b [lustre]        
      #11 [ffff9bf4ba1d7da8] ll_file_ioctl at ffffffffc11728c3 [lustre]              
      #12 [ffff9bf4ba1d7e80] do_vfs_ioctl at ffffffffa5c63ad0                        
      #13 [ffff9bf4ba1d7f00] sys_ioctl at ffffffffa5c63d81                           
      #14 [ffff9bf4ba1d7f50] system_call_fastpath at ffffffffa61aaf92                
      
      PID: 6158   TASK: ffff9bf4b9145280  CPU: 0   COMMAND: "lfs"                           
       #0 [ffff9bf444cd3bb8] __schedule at ffffffffa619d018                                 
       #1 [ffff9bf444cd3c20] schedule at ffffffffa619d3e9                                   
       #2 [ffff9bf444cd3c30] cl_sync_io_wait at ffffffffc0b32d55 [obdclass]                 
       #3 [ffff9bf444cd3cb0] cl_lock_request at ffffffffc0b2e783 [obdclass]                 
       #4 [ffff9bf444cd3ce8] cl_glimpse_lock at ffffffffc11b4339 [lustre]                   
       #5 [ffff9bf444cd3d30] cl_glimpse_size0 at ffffffffc11b4705 [lustre]                  
       #6 [ffff9bf444cd3d88] ll_getattr_dentry at ffffffffc116c5ee [lustre]                 
       #7 [ffff9bf444cd3e38] ll_getattr at ffffffffc116cafe [lustre]                        
       #8 [ffff9bf444cd3e48] vfs_getattr at ffffffffa5c53ee9                                
       #9 [ffff9bf444cd3e78] vfs_fstat at ffffffffa5c53f65                                  
      #10 [ffff9bf444cd3eb8] SYSC_newfstat at ffffffffa5c544d4                              
      #11 [ffff9bf444cd3f40] sys_newfstat at ffffffffa5c548ae                               
      

      Conclusion:
      The requests from the "lfs migrate" process holding the group lock are stuck in the NRS queue of the OST service, because all the OST service threads are blocked waiting on that group lock.
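
      As an illustration only (not a fix): with threads_max=78 on this OSS, keeping the migration fan-out well below the number of available OST service threads avoids the starvation, e.g. running the reproducer with a smaller xargs parallelism:

      [root@client test]# find -type f | xargs -P8 -I{} lfs migrate -c2 {}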

      Workaround
      Temporarily increase the number of OST service threads to dequeue all the pending requests:

      [root@oss ~]# lctl set_param ost.OSS.ost.threads_max=200
      
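      Once the queue has drained (queued goes back to 0 in nrs_policies), the original limit can be restored; a possible follow-up, assuming the default of 78 threads from the setup above:

      [root@oss ~]# lctl get_param ost.OSS.ost.nrs_policies ost.OSS.ost.threads_started
      [root@oss ~]# lctl set_param ost.OSS.ost.threads_max=78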

People

    Assignee: Etienne Aujames
    Reporter: Etienne Aujames
    Votes: 0
    Watchers: 6