
"lfs migrate -b" can cause thread starvation on OSS

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.16.0
    • Environment: Reproduced on VMs with ldiskfs Lustre 2.15.54:
      1 client
      1 MDS: 2 MDTs
      1 OSS: 2 OSTs

    Description

      I am not 100% sure, but I think we hit the following issue in production with parallel
      "lfs migrate -b" instances on files with several hard links:

      OSS setup:

      [root@oss ~]# lctl get_param ost.OSS.ost.threads_* cpu_partition_table
      ost.OSS.ost.threads_max=78
      ost.OSS.ost.threads_min=3
      ost.OSS.ost.threads_started=4
      cpu_partition_table=0   : 0 1
      [root@oss ~]# mount -tlustre
      /dev/mapper/ost1_flakey on /...
      /dev/mapper/ost2_flakey on /...
      

      Reproducer with 1 client:

      [root@client test]# lfs setstripe  -c2 .
      [root@client test]# dd if=/dev/zero of=test1 bs=1M count=100                                   
      [root@client test]# mkdir links                                              
      [root@client test]# printf "%s\n" link{1..100} | xargs -I{} ln test1 links/{}
      [root@client test]# find -type f | xargs -P100 -I{} lfs migrate -c2 {}
      # --> the command hangs
      

      Node states
      Requests queued in the OSS NRS policy:

      [root@oss ~]# lctl get_param -n ost.OSS.ost.nrs_policies    
      
      regular_requests:
        - name: fifo
          state: started
          fallback: yes
          queued: 125                 
          active: 76                  
      ...
      

      OSS dmesg:

      [369238.270582] LustreError: 13117:0:(ldlm_request.c:124:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1676651767, 300s ago); not entering recovery in server code, just going back to sleep ns: 
      filter-lustrefs-OST0000_UUID lock: ffff967687a826c0/0xdb3bda27f476b2cd lrc: 3/1,0 mode: --/PR res: [0xc46:0x0:0x0].0x0 rrc: 40 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) gid 0 flags: 0x400
      10000000000 nid: local remote: 0x0 expref: -99 pid: 13117 timeout: 0 lvb_type: 0
      [369238.288701] LustreError: 13117:0:(ldlm_request.c:124:ldlm_expired_completion_wait()) Skipped 72 previous similar messages
      [369688.315515] Lustre: 13182:0:(service.c:1453:ptlrpc_at_send_early_reply()) @@@ Could not add any time (5/-150), not sending early reply  req@ffff9676632edf80 x1758012803008832/t0(0) o101->d58673c6-592c-4241-b
      a90-fd6a89dece56@10.0.2.6@tcp:617/0 lens 328/0 e 0 to 0 dl 1676652522 ref 2 fl New:/0/ffffffff rc 0/-1 job:'client.29583'
      [369688.315528] Lustre: 13182:0:(service.c:1453:ptlrpc_at_send_early_reply()) Skipped 1 previous similar message
      [369693.860227] Lustre: 13182:0:(service.c:1453:ptlrpc_at_send_early_reply()) @@@ Could not add any time (5/-150), not sending early reply  req@ffff96769dcb5b00 x1758005042515584/t0(0) o6->lustrefs-MDT0000-mdtlo
      v_UUID@10.0.2.4@tcp:622/0 lens 544/0 e 0 to 0 dl 1676652527 ref 2 fl New:/0/ffffffff rc 0/-1 job:'osp-syn-0-0.0'
      [369693.860239] Lustre: 13182:0:(service.c:1453:ptlrpc_at_send_early_reply()) Skipped 163 previous similar messages
      [369694.392072] Lustre: lustrefs-OST0000: Client d58673c6-592c-4241-ba90-fd6a89dece56 (at 10.0.2.6@tcp) reconnecting
      [369694.392084] Lustre: Skipped 1 previous similar message
      [369699.869588] Lustre: lustrefs-OST0000: Client lustrefs-MDT0000-mdtlov_UUID (at 10.0.2.4@tcp) reconnecting
      [369699.869591] Lustre: Skipped 1 previous similar message
      [370140.452612] ptlrpc_watchdog_fire: 72 callbacks suppressed
      [370140.452620] Lustre: ll_ost00_025: service thread pid 13134 was inactive for 1202.140 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging p
      urposes:
      [370140.452623] Lustre: Skipped 2 previous similar messages
      [370140.452626] Pid: 13134, comm: ll_ost00_025 3.10.0-1160.59.1.el7.centos.plus.x86_64 #1 SMP Wed Feb 23 17:40:21 UTC 2022
      [370140.452631] Call Trace:
      [370140.452705] [<0>] ldlm_completion_ast+0x62d/0xa40 [ptlrpc]
      [370140.452767] [<0>] ldlm_cli_enqueue_local+0x25c/0x880 [ptlrpc]
      [370140.452818] [<0>] tgt_extent_lock+0xea/0x2a0 [ptlrpc]
      [370140.452827] [<0>] ofd_getattr_hdl+0x385/0x750 [ofd]
      [370140.452874] [<0>] tgt_request_handle+0x92f/0x19c0 [ptlrpc]
      [370140.452929] [<0>] ptlrpc_server_handle_request+0x253/0xc00 [ptlrpc]
      [370140.452963] [<0>] ptlrpc_main+0xc3c/0x15f0 [ptlrpc]
      [370140.452968] [<0>] kthread+0xd1/0xe0
      [370140.452972] [<0>] ret_from_fork_nospec_begin+0x21/0x21
      [370140.452993] [<0>] 0xfffffffffffffffe
      ....
      

      Client dmesg (reconnections):

      [373336.926476] Lustre: lustrefs-OST0000-osc-ffff9bf4bb70f000: Connection restored to 10.0.2.5@tcp (at 10.0.2.5@tcp)
      [373336.926481] Lustre: Skipped 1 previous similar message
      [374092.440057] Lustre: 3473:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1676654035/real 1676654035]  req@ffff9bf48c66e400 x1758012803097024/t0(0) o17->lustrefs-OST0001-osc-ffff9bf4bb70f000@10.0.2.5@tcp:28/4 lens 456/432 e 0 to 1 dl 1676654791 ref 1 fl Rpc:XQr/2/ffffffff rc -11/-1 job:''
      [374092.440068] Lustre: 3473:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 164 previous similar messages
      [374092.440098] Lustre: lustrefs-OST0000-osc-ffff9bf4bb70f000: Connection to lustrefs-OST0000 (at 10.0.2.5@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [374092.440106] Lustre: Skipped 2 previous similar messages
      [374092.450347] Lustre: lustrefs-OST0000-osc-ffff9bf4bb70f000: Connection restored to 10.0.2.5@tcp (at 10.0.2.5@tcp)
      [374092.450354] Lustre: Skipped 1 previous similar message
      [374848.233487] Lustre: 3473:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1676654791/real 1676654791]  req@ffff9bf44b5b9f80 x1758012803195712/t0(0) o17->lustrefs-OST0000-osc-ffff9bf4bb70f000@10.0.2.5@tcp:28/4 lens 456/432 e 0 to 1 dl 1676655547 ref 1 fl Rpc:XQr/2/ffffffff rc -11/-1 job:''
      [374848.233499] Lustre: 3473:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 164 previous similar messages
      [374848.233550] Lustre: lustrefs-OST0000-osc-ffff9bf4bb70f000: Connection to lustrefs-OST0000 (at 10.0.2.5@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [374848.246634] Lustre: lustrefs-OST0000-osc-ffff9bf4bb70f000: Connection restored to 10.0.2.5@tcp (at 10.0.2.5@tcp)
      [374848.246645] Lustre: Skipped 1 previous similar message
      [375031.816746] Lustre: 3473:0:(client.c:2305:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1676655547/real 1676655547]  req@ffff9bf44b5bb180 x1758012803283392/t0(0) o17->lustrefs-OST0001-osc-ffff9bf4bb70f000@10.0.2.5@tcp:28/4 lens 456/432 e 7 to 1 dl 1676655730 ref 1 fl Rpc:XQr/2/ffffffff rc -11/-1 job:''
      [375031.816758] Lustre: 3473:0:(client.c:2305:ptlrpc_expire_one_request()) Skipped 168 previous similar messages
      [375031.816775] Lustre: lustrefs-OST0000-osc-ffff9bf4bb70f000: Connection to lustrefs-OST0000 (at 10.0.2.5@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [375031.816785] Lustre: Skipped 2 previous similar messages
      

      At this point, the OSS is not able to reply to requests for the OST service.

      All the threads on the OSS are waiting for an extent lock on the OST objects:

      crash> foreach 'll_ost00' bt | grep -c "ldlm_completion_ast"                                  
      76
      

      OST resources:

      crash> ldlm_resource.lr_name -x 0xffff96768edc7600,
        lr_name = {
          name = {0x7fc, 0x0, 0x0, 0x0}
        }
      crash> ldlm_resource.lr_name -x 0xffff96768edc7300  
        lr_name = {
          name = {0xd06, 0x0, 0x0, 0x0}
        }
      
      
      crash> ldlm_resource.lr_granted -o 0xffff96768edc7600
      struct ldlm_resource {
        [ffff96768edc7620] struct list_head lr_granted;
      }
      crash> list -H ffff96768edc7620 -s ldlm_lock.l_granted_mode  ldlm_lock.l_res_link        
      ffff967646789200
        l_granted_mode = LCK_GROUP
      crash> ldlm_resource.lr_waiting -o 0xffff96768edc7600
      struct ldlm_resource {
        [ffff96768edc7630] struct list_head lr_waiting;
      }
      crash> list -H ffff96768edc7630 -s ldlm_lock.l_req_mode  ldlm_lock.l_res_link         
      ffff96766a297d40
        l_req_mode = LCK_PR
      ...
      crash> list -H ffff96768edc7630 | wc -l
      38
      
      Same thing for the 0xffff96768edc7300 resource:
      ....
      crash> ldlm_resource.lr_waiting -o 0xffff96768edc7300 
      struct ldlm_resource {
        [ffff96768edc7330] struct list_head lr_waiting;
      }
      crash> list -H ffff96768edc7330 | wc -l
      38
      

      38 + 38 = 76 locks are waiting for 2 group locks on the 2 OST objects of the file.

      On the client, the "lfs migrate" process that has the group lock is waiting for a reply from the OSS:

      PID: 6136   TASK: ffff9bf44e0b2100  CPU: 0   COMMAND: "lfs"                
       #0 [ffff9bf4450a7b38] __schedule at ffffffffa619d018                      
       #1 [ffff9bf4450a7ba0] schedule at ffffffffa619d3e9                        
       #2 [ffff9bf4450a7bb0] cl_sync_io_wait at ffffffffc0b32d55 [obdclass]      
       #3 [ffff9bf4450a7c30] cl_lock_request at ffffffffc0b2e783 [obdclass]      
       #4 [ffff9bf4450a7c68] cl_glimpse_lock at ffffffffc11b4339 [lustre]        
       #5 [ffff9bf4450a7cb0] cl_glimpse_size0 at ffffffffc11b4705 [lustre]       
       #6 [ffff9bf4450a7d08] ll_file_aio_read at ffffffffc1163706 [lustre]       
       #7 [ffff9bf4450a7de8] ll_file_read at ffffffffc1163fa1 [lustre]           
       #8 [ffff9bf4450a7ed0] vfs_read at ffffffffa5c4e3ff                        
       #9 [ffff9bf4450a7f00] sys_pread64 at ffffffffa5c4f472                     
      #10 [ffff9bf4450a7f50] system_call_fastpath at ffffffffa61aaf92            
      

      The other "lfs migrate" are calling stat() or llapi_get_data_version() (before taking the group lock):

      PID: 6137   TASK: ffff9bf44e0b3180  CPU: 1   COMMAND: "lfs"                    
       #0 [ffff9bf4ba1d7ac0] __schedule at ffffffffa619d018                          
       #1 [ffff9bf4ba1d7b28] schedule at ffffffffa619d3e9                            
       #2 [ffff9bf4ba1d7b38] schedule_timeout at ffffffffa619b0b1                    
       #3 [ffff9bf4ba1d7be8] wait_for_completion at ffffffffa619d79d                 
       #4 [ffff9bf4ba1d7c48] osc_io_data_version_end at ffffffffc10311e9 [osc]       
       #5 [ffff9bf4ba1d7c80] cl_io_end at ffffffffc0b3075f [obdclass]                
       #6 [ffff9bf4ba1d7cb0] lov_io_end_wrapper at ffffffffc10864eb [lov]            
       #7 [ffff9bf4ba1d7cd0] lov_io_data_version_end at ffffffffc10874a8 [lov]       
       #8 [ffff9bf4ba1d7cf8] cl_io_end at ffffffffc0b3075f [obdclass]                
       #9 [ffff9bf4ba1d7d28] cl_io_loop at ffffffffc0b334dd [obdclass]               
      #10 [ffff9bf4ba1d7d60] ll_ioc_data_version at ffffffffc1159f0b [lustre]        
      #11 [ffff9bf4ba1d7da8] ll_file_ioctl at ffffffffc11728c3 [lustre]              
      #12 [ffff9bf4ba1d7e80] do_vfs_ioctl at ffffffffa5c63ad0                        
      #13 [ffff9bf4ba1d7f00] sys_ioctl at ffffffffa5c63d81                           
      #14 [ffff9bf4ba1d7f50] system_call_fastpath at ffffffffa61aaf92                
      
      PID: 6158   TASK: ffff9bf4b9145280  CPU: 0   COMMAND: "lfs"                           
       #0 [ffff9bf444cd3bb8] __schedule at ffffffffa619d018                                 
       #1 [ffff9bf444cd3c20] schedule at ffffffffa619d3e9                                   
       #2 [ffff9bf444cd3c30] cl_sync_io_wait at ffffffffc0b32d55 [obdclass]                 
       #3 [ffff9bf444cd3cb0] cl_lock_request at ffffffffc0b2e783 [obdclass]                 
       #4 [ffff9bf444cd3ce8] cl_glimpse_lock at ffffffffc11b4339 [lustre]                   
       #5 [ffff9bf444cd3d30] cl_glimpse_size0 at ffffffffc11b4705 [lustre]                  
       #6 [ffff9bf444cd3d88] ll_getattr_dentry at ffffffffc116c5ee [lustre]                 
       #7 [ffff9bf444cd3e38] ll_getattr at ffffffffc116cafe [lustre]                        
       #8 [ffff9bf444cd3e48] vfs_getattr at ffffffffa5c53ee9                                
       #9 [ffff9bf444cd3e78] vfs_fstat at ffffffffa5c53f65                                  
      #10 [ffff9bf444cd3eb8] SYSC_newfstat at ffffffffa5c544d4                              
      #11 [ffff9bf444cd3f40] sys_newfstat at ffffffffa5c548ae                               
      

      Conclusion:
      The requests from the "lfs migrate" process that holds the group lock are stuck in the NRS policy of the OST service, because all the OSS threads are waiting for that group lock.

      Workaround
      Temporarily increase the number of threads to dequeue all the requests:

      [root@oss ~]# lctl set_param ost.OSS.ost.threads_max=200
      

          Activity

            [LU-16571] "lfs migrate -b" can cause thread starvation on OSS

            adilger Andreas Dilger added a comment -
            Patch was included in 2.15.55.

            "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51114
            Subject: LU-16571 utils: fix parallel "lfs migrate -b" on hard links
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: aee0ce61a24ff4f965b6e0e5b263e0c3038cc8c1

            gerrit Gerrit Updater added a comment - "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51114 Subject: LU-16571 utils: fix parallel "lfs migrate -b" on hard links Project: fs/lustre-release Branch: b2_12 Current Patch Set: 1 Commit: aee0ce61a24ff4f965b6e0e5b263e0c3038cc8c1

            "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51112
            Subject: LU-16571 utils: fix parallel "lfs migrate -b" on hard links
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: b708deae6307eea61527ecf1dfac23ca4c901a64

            gerrit Gerrit Updater added a comment - "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51112 Subject: LU-16571 utils: fix parallel "lfs migrate -b" on hard links Project: fs/lustre-release Branch: b2_15 Current Patch Set: 1 Commit: b708deae6307eea61527ecf1dfac23ca4c901a64

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50113/
            Subject: LU-16571 utils: fix parallel "lfs migrate -b" on hard links
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 2310f4b8a6b6050cccedd4982ce80aa1cfbd3fe1

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50113/ Subject: LU-16571 utils: fix parallel "lfs migrate -b" on hard links Project: fs/lustre-release Branch: master Current Patch Set: Commit: 2310f4b8a6b6050cccedd4982ce80aa1cfbd3fe1

            "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50113
            Subject: LU-16571 utils: fix parallel "lfs migrate -b" on hard links
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 1891b6e3a86159359e55855525868bf061f02615

            gerrit Gerrit Updater added a comment - "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50113 Subject: LU-16571 utils: fix parallel "lfs migrate -b" on hard links Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 1891b6e3a86159359e55855525868bf061f02615

            eaujames Etienne Aujames added a comment -
            Andreas,
            I think the issue here is "llapi_get_data_version()" outside the group lock.

            For "data version" the LDLM lock is taken server side:

            static int osc_io_data_version_start(const struct lu_env *env,
                                                 const struct cl_io_slice *slice)
            {
            .......
                    if (dv->dv_flags & (LL_DV_RD_FLUSH | LL_DV_WR_FLUSH)) {
                            oa->o_valid |= OBD_MD_FLFLAGS;
                            oa->o_flags |= OBD_FL_SRVLOCK;                                       <--------------
                            if (dv->dv_flags & LL_DV_WR_FLUSH)
                                    oa->o_flags |= OBD_FL_FLUSH;
                    }
            
            
            enum obdo_flags {
            ....
                    OBD_FL_CREATE_CROW  = 0x00000400, /* object should be create on write */
                    OBD_FL_SRVLOCK      = 0x00000800, /* delegate DLM locking to server */          <--------------
                    OBD_FL_CKSUM_CRC32  = 0x00001000, /* CRC32 checksum type */
            ....
            }
            

            Therefore, it will monopolize an OSS thread to wait for the lock if it conflicts, and the server-side lock wait has no timeout:

            static int ofd_getattr_hdl(struct tgt_session_info *tsi)                      
            {                                                                             
            ......                  
                                                                                          
                    srvlock = tsi->tsi_ost_body->oa.o_valid & OBD_MD_FLFLAGS &&           
                              tsi->tsi_ost_body->oa.o_flags & OBD_FL_SRVLOCK;             
                                                                                          
                    if (srvlock) {                                                        
                            if (unlikely(tsi->tsi_ost_body->oa.o_flags & OBD_FL_FLUSH))   
                                    lock_mode = LCK_PW;                                   
                                                                                          
                            rc = tgt_extent_lock(tsi->tsi_env,                            
                                                 tsi->tsi_tgt->lut_obd->obd_namespace,    
                                                 &tsi->tsi_resid, 0, OBD_OBJECT_EOF, &lh, 
                                                 lock_mode, &flags);                      
                            if (rc != 0)                                                  
                                    RETURN(rc);                                           
                    }    
            

            From crash output on the OSS:

            crash> bt 17443                                                                  
            PID: 17443  TASK: ffff8c4ef5841080  CPU: 1   COMMAND: "ll_ost00_010"
             #0 [ffff8c4f17c9b9a0] __schedule at ffffffff94b9d018
             #1 [ffff8c4f17c9ba08] schedule at ffffffff94b9d3e9
             #2 [ffff8c4f17c9ba18] ldlm_completion_ast at ffffffffc12ff2cd [ptlrpc]
             #3 [ffff8c4f17c9bab8] ldlm_cli_enqueue_local at ffffffffc12fd50c [ptlrpc]
             #4 [ffff8c4f17c9bb58] tgt_extent_lock at ffffffffc1349f4a [ptlrpc]
             #5 [ffff8c4f17c9bc20] ofd_getattr_hdl at ffffffffc1917315 [ofd]
             #6 [ffff8c4f17c9bc98] tgt_request_handle at ffffffffc135277f [ptlrpc]
             #7 [ffff8c4f17c9bd28] ptlrpc_server_handle_request at ffffffffc129d5d3 [ptlrpc]
             #8 [ffff8c4f17c9bde0] ptlrpc_main at ffffffffc129f25c [ptlrpc]
             #9 [ffff8c4f17c9bec8] kthread at ffffffff944c5e61
            
            crash> p ((struct tgt_session_info*)0xffff8c4f17848900)->tsi_jobid      
            $5 = 0xffff8c4ec2e535c0 "lfs.0"
            crash> p/x ((struct tgt_session_info*)0xffff8c4f17848900)->tsi_ost_body->oa.o_flags  
            $2 = 0x800
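
            To make the ordering concrete, here is a minimal userspace sketch - not the actual patch - of the sequence being discussed. It assumes liblustreapi's llapi_group_lock()/llapi_group_unlock() and llapi_get_data_version(): the group lock is taken first (a conflicting enqueue then waits on the client side instead of pinning an OSS service thread), and the data version is queried without LL_DV_RD_FLUSH/LL_DV_WR_FLUSH, so no OBD_FL_SRVLOCK getattr is sent to the OST.

            /*
             * Hedged sketch only, not the LU-16571 patch. Link with -llustreapi.
             * The group id (1234) is an arbitrary value for the example.
             */
            #include <fcntl.h>
            #include <stdio.h>
            #include <unistd.h>
            #include <lustre/lustreapi.h>

            int migrate_like_sequence(const char *path)
            {
                    __u64 dv;
                    int gid = 1234; /* arbitrary group lock id */
                    int fd, rc;

                    fd = open(path, O_RDONLY);
                    if (fd < 0)
                            return -1;

                    /* 1. group lock first: a conflict waits on the client side */
                    rc = llapi_group_lock(fd, gid);
                    if (rc < 0)
                            goto out_close;

                    /* 2. data version without flush flags: no OBD_FL_SRVLOCK */
                    rc = llapi_get_data_version(fd, &dv, 0);
                    if (rc == 0)
                            printf("%s: data_version=%llu\n", path,
                                   (unsigned long long)dv);

                    llapi_group_unlock(fd, gid);
            out_close:
                    close(fd);
                    return rc;
            }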
            
            paf0186 Patrick Farrell added a comment - - edited

            Note also:
            Since group locks can't be called back by the server, including for timeout (they are not subject to timeouts), it's always possible to use them to cause a hang.  Request a group lock on a file and try to access it from other threads.  Accesses to that file hang indefinitely.

            There are also more complicated scenarios, like group locking one file and then accessing another file without a group lock. In that case you've created a group lock -> regular lock dependency, where you could get thread exhaustion even if group locks have their own portal, because that regular lock request happens while the group lock is being held. So we can get a loop where the group lock causes thread exhaustion and the regular lock request can't be serviced.  (That's what we're seeing here, actually - though the glimpse might not be on a separate file?)
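
            As a concrete illustration of the first scenario above (a sketch, not taken from the ticket): one process takes a group lock with llapi_group_lock() and simply sleeps; a plain read() of the same file from another process (without the group id) then blocks until the lock is dropped, with no server-side timeout to break the wait.

            /*
             * Sketch of the "hold a group lock, other accesses hang" case.
             * The group id (42) is an arbitrary value for the example.
             */
            #include <fcntl.h>
            #include <unistd.h>
            #include <lustre/lustreapi.h>

            int hold_group_lock(const char *path, unsigned int seconds)
            {
                    int gid = 42; /* arbitrary group id */
                    int fd, rc;

                    fd = open(path, O_RDWR);
                    if (fd < 0)
                            return -1;

                    rc = llapi_group_lock(fd, gid);
                    if (rc == 0) {
                            /* readers/writers of 'path' without gid block here */
                            sleep(seconds);
                            llapi_group_unlock(fd, gid);
                    }
                    close(fd);
                    return rc;
            }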

            paf0186 Patrick Farrell added a comment - - edited

            OK, that makes sense, especially re the CPT concern.  I'm still a little skeptical about this as an issue in practice, but that does make more sense.

            RE: the lfs process.  You're saying "the file", is that the same file it has a group lock on?  Like I said, it shouldn't need to do a glimpse if it holds a group lock, the client should see it has a write lock on the whole file and not glimpse.  That's some sort of bug, if it is for sure trying to glimpse a file it has group locked.

            So if that's indeed the case, we should be able to solve that, but it would be possible for this client to get stuck doing I/O instead of requesting a glimpse lock.  If it were trying to do a read or a write RPC, it would have the same issue with thread exhaustion because the other threads are waiting for this lock.

            We have a dependency loop where the other threads/clients are waiting for this lock, and this lock is waiting for either a lock request or an I/O request.  If the lock request is on the same file, then that's a bug - it has a group lock, which should let the client do anything.  If the lock request is on a different file, then we have a problem because the group lock can't be called back.  We could group lock that second file first, but it would be possible to get stuck if the other client requests arrived between the two group lock requests, exhausting the OST threads so the second group lock request can't be serviced.

            If we moved group lock requests to their own portal, the problem with group locks goes away, as long as any program using group locks always uses group locks for all of its file accesses.

            But the basic problem of thread exhaustion under a write lock is still possible - client 1 has a PW lock on a file and is going to do a write (or a read, the key part is the type of lock), and other clients/client threads all try to access the same file.  They achieve OST thread exhaustion with lock requests while client 1 is holding the lock but before it can do the write.

            So group locks can create group lock <-> regular lock dependency chains, and regular locks can create lock <-> I/O dependency chains.  In both cases, the chains can be turned into loops by thread exhaustion, so the second item - the second lock request while holding the first lock, or I/O while holding a lock - cannot be acquired.

            In both cases, the 'canonical' solution used on the MDT side is to put the different request types into different portals.  So in our case, that would mean group locks to their own portal, and I/O and lock requests to separate portals.  Yuck, but I don't see another way around it in the general case.


            eaujames Etienne Aujames added a comment -
            Patrick,
            I use the default configuration for OST threads.
            The number of threads is low because of the VM used to reproduce this issue: 2 cores, 1 CPT, 1.8 GB of RAM:
            OSS_THR_FACTOR = 7
            OSS_NTHRS_BASE = 64
            nbt_threads = 64 + 64 * 0 / 2 + 2 * 7 = 78

            But I think this can be reproduced on an OSS with a lot of CPTs: it can starve all the threads of a particular CPT, because client NIDs are bound to CPTs.
            e.g.:
            On a production OSS (DDN 18KX VM), we have 9 CPTs with the default threads_max=576, so only 64 threads per CPT before hanging that CPT.
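
            For reference, a small sketch reproducing the arithmetic above. The shape of the formula (base + base * (ncpts - 1) / 2 + ncores * factor) is reconstructed from the numbers quoted in this comment, not copied from the ptlrpc service code, so treat it as an assumption.

            /* Assumed reconstruction of the thread-count arithmetic above. */
            #include <stdio.h>

            #define OSS_NTHRS_BASE  64
            #define OSS_THR_FACTOR   7

            static int oss_threads_max(int ncpts, int ncores)
            {
                    /* 64 + 64 * 0 / 2 + 2 * 7 = 78 for 1 CPT and 2 cores */
                    return OSS_NTHRS_BASE +
                           OSS_NTHRS_BASE * (ncpts - 1) / 2 +
                           ncores * OSS_THR_FACTOR;
            }

            int main(void)
            {
                    printf("threads_max = %d\n", oss_threads_max(1, 2));
                    return 0;
            }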

            The "lfs" process that has the group lock is in:

             #0 [ffff9bf4450a7b38] __schedule at ffffffffa619d018                      
             #1 [ffff9bf4450a7ba0] schedule at ffffffffa619d3e9                        
             #2 [ffff9bf4450a7bb0] cl_sync_io_wait at ffffffffc0b32d55 [obdclass]      
             #3 [ffff9bf4450a7c30] cl_lock_request at ffffffffc0b2e783 [obdclass]      
             #4 [ffff9bf4450a7c68] cl_glimpse_lock at ffffffffc11b4339 [lustre]        
             #5 [ffff9bf4450a7cb0] cl_glimpse_size0 at ffffffffc11b4705 [lustre]       
             #6 [ffff9bf4450a7d08] ll_file_aio_read at ffffffffc1163706 [lustre]       
             #7 [ffff9bf4450a7de8] ll_file_read at ffffffffc1163fa1 [lustre]           
             #8 [ffff9bf4450a7ed0] vfs_read at ffffffffa5c4e3ff                        
             #9 [ffff9bf4450a7f00] sys_pread64 at ffffffffa5c4f472                     
            #10 [ffff9bf4450a7f50] system_call_fastpath at ffffffffa61aaf92  
            

            It tries to read the file. The glimpse request is stuck on the OSS in the FIFO policy before ptlrpc_server_handle_request() (no more threads to handle the request). So the process is not able to release its group lock, and this hangs the whole CPT.

            This has the same behavior as LU-15132, but for an OSS.


            paf0186 Patrick Farrell added a comment -
            I guess I'm making the point that I'm not sure it's a real problem in practice, and I'm curious whether it was seen with a normal configuration, what the trigger was, etc.  Typical max thread counts on an OSS - a few years ago, and I think the ClusterStor OSSes are on the larger side, but still - were something like 768.

            People

              Assignee: eaujames Etienne Aujames
              Reporter: eaujames Etienne Aujames
              Votes: 0
              Watchers: 6
