Lustre / LU-2622

All CPUs spinning on cl_envs_guard lock under ll_releasepage during memory reclaim

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • Affects Version/s: Lustre 2.4.0
    • Fix Version/s: Lustre 2.4.0
    • 3
    • 6133

    Description

      During testing on Sequoia, we've seen specific IO nodes heavily contending on the cl_envs_guard lock while memory reclaim is happening. As a result, performance on the affected client is severely degraded.

      An example stack trace of a thread spinning on the lock is below:

      CPU56:                                                                             
      Call Trace:                                                                        
      [c00000000fe3bb30] [c000000000008d1c] .show_stack+0x7c/0x184 (unreliable)          
      [c00000000fe3bbe0] [c00000000027604c] .showacpu+0x64/0x94                          
      [c00000000fe3bc70] [c000000000068b30] .generic_smp_call_function_interrupt+0x10c/0x230
      [c00000000fe3bd40] [c00000000001d11c] .smp_message_recv+0x34/0x78                  
      [c00000000fe3bdc0] [c00000000002526c] .bgq_ipi_dispatch+0x118/0x18c                
      [c00000000fe3be50] [c00000000007b20c] .handle_IRQ_event+0x88/0x18c                 
      [c00000000fe3bf00] [c00000000007dc90] .handle_percpu_irq+0x8c/0x100                
      [c00000000fe3bf90] [c00000000001b808] .call_handle_irq+0x1c/0x2c                   
      [c0000003e1c4a4c0] [c0000000000059f0] .do_IRQ+0x154/0x1e0                          
      [c0000003e1c4a570] [c0000000000144dc] exc_external_input_book3e+0x110/0x114        
      --- Exception: 501 at ._raw_spin_lock+0xd8/0x1a8                                   
          LR = ._raw_spin_lock+0x104/0x1a8                                               
      [c0000003e1c4a860] [8000000000b04f38] libcfs_nidstrings+0x2acc/0xfffffffffffe5824 [libcfs] (unreliable)
      [c0000003e1c4a910] [c00000000042d4cc] ._spin_lock+0x10/0x24                        
      [c0000003e1c4a980] [80000000024c2f4c] .cl_env_get+0xec/0x480 [obdclass]            
      [c0000003e1c4aa60] [80000000024c336c] .cl_env_nested_get+0x8c/0xf0 [obdclass]   
      [c0000003e1c4aaf0] [800000000692070c] .ll_releasepage+0xbc/0x200 [lustre]          
      [c0000003e1c4aba0] [c000000000094110] .try_to_release_page+0x68/0x8c               
      [c0000003e1c4ac10] [c0000000000a4190] .shrink_page_list.clone.0+0x3d8/0x63c        
      [c0000003e1c4adc0] [c0000000000a47d8] .shrink_inactive_list+0x3e4/0x690            
      [c0000003e1c4af90] [c0000000000a4f54] .shrink_zone+0x4d0/0x4d4                     
      [c0000003e1c4b0c0] [c0000000000a5a68] .try_to_free_pages+0x204/0x3d0               
      [c0000003e1c4b220] [c00000000009d044] .__alloc_pages_nodemask+0x460/0x738          
      [c0000003e1c4b3a0] [c000000000095af4] .grab_cache_page_write_begin+0x7c/0xec       
      [c0000003e1c4b450] [8000000006920964] .ll_write_begin+0x94/0x270 [lustre]          
      [c0000003e1c4b520] [c0000000000968c8] .generic_file_buffered_write+0x148/0x374  
      [c0000003e1c4b660] [c000000000097050] .__generic_file_aio_write+0x374/0x3d8        
      [c0000003e1c4b760] [c00000000009712c] .generic_file_aio_write+0x78/0xe8            
      [c0000003e1c4b810] [800000000693ed4c] .vvp_io_write_start+0xfc/0x3e0 [lustre]   
      [c0000003e1c4b8e0] [80000000024d9c6c] .cl_io_start+0xcc/0x220 [obdclass]           
      [c0000003e1c4b980] [80000000024e1a84] .cl_io_loop+0x194/0x2c0 [obdclass]           
      [c0000003e1c4ba30] [80000000068ba1d8] .ll_file_io_generic+0x498/0x670 [lustre]  
      [c0000003e1c4bb30] [80000000068ba834] .ll_file_aio_write+0x1d4/0x3a0 [lustre]   
      [c0000003e1c4bc00] [80000000068bab50] .ll_file_write+0x150/0x320 [lustre]          
      [c0000003e1c4bce0] [c0000000000d1ba8] .vfs_write+0xd0/0x1c4                        
      [c0000003e1c4bd80] [c0000000000d1d98] .SyS_write+0x54/0x98                         
      [c0000003e1c4be30] [c000000000000580] syscall_exit+0x0/0x2c
      

      I also see this for calls in cl_env_put:

      CPU63:                                                                          
      Call Trace:                                                                     
      [c00000000fe03b30] [c000000000008d1c] .show_stack+0x7c/0x184 (unreliable)       
      [c00000000fe03be0] [c00000000027604c] .showacpu+0x64/0x94                       
      [c00000000fe03c70] [c000000000068b30] .generic_smp_call_function_interrupt+0x10c/0x230
      [c00000000fe03d40] [c00000000001d11c] .smp_message_recv+0x34/0x78               
      [c00000000fe03dc0] [c00000000002526c] .bgq_ipi_dispatch+0x118/0x18c             
      [c00000000fe03e50] [c00000000007b20c] .handle_IRQ_event+0x88/0x18c              
      [c00000000fe03f00] [c00000000007dc90] .handle_percpu_irq+0x8c/0x100             
      [c00000000fe03f90] [c00000000001b808] .call_handle_irq+0x1c/0x2c                
      [c0000003c4f0a510] [c0000000000059f0] .do_IRQ+0x154/0x1e0                       
      [c0000003c4f0a5c0] [c0000000000144dc] exc_external_input_book3e+0x110/0x114     
      --- Exception: 501 at ._raw_spin_lock+0xdc/0x1a8                                
          LR = ._raw_spin_lock+0x104/0x1a8                                             
      [c0000003c4f0a8b0] [800000000697a578] msgdata.87439+0x20/0xfffffffffffccf88 [lustre] (unreliable)
      [c0000003c4f0a960] [c00000000042d4cc] ._spin_lock+0x10/0x24                     
      [c0000003c4f0a9d0] [80000000024c17e8] .cl_env_put+0x178/0x420 [obdclass]        
      [c0000003c4f0aa70] [80000000024c1ab0] .cl_env_nested_put+0x20/0x40 [obdclass]   
      [c0000003c4f0aaf0] [8000000006920794] .ll_releasepage+0x144/0x200 [lustre]      
      [c0000003c4f0aba0] [c000000000094110] .try_to_release_page+0x68/0x8c            
      [c0000003c4f0ac10] [c0000000000a4190] .shrink_page_list.clone.0+0x3d8/0x63c     
      [c0000003c4f0adc0] [c0000000000a47d8] .shrink_inactive_list+0x3e4/0x690         
      [c0000003c4f0af90] [c0000000000a4f54] .shrink_zone+0x4d0/0x4d4                  
      [c0000003c4f0b0c0] [c0000000000a5a68] .try_to_free_pages+0x204/0x3d0            
      [c0000003c4f0b220] [c00000000009d044] .__alloc_pages_nodemask+0x460/0x738       
      [c0000003c4f0b3a0] [c000000000095af4] .grab_cache_page_write_begin+0x7c/0xec    
      [c0000003c4f0b450] [8000000006920964] .ll_write_begin+0x94/0x270 [lustre]       
      [c0000003c4f0b520] [c0000000000968c8] .generic_file_buffered_write+0x148/0x374  
      [c0000003c4f0b660] [c000000000097050] .__generic_file_aio_write+0x374/0x3d8     
      [c0000003c4f0b760] [c00000000009712c] .generic_file_aio_write+0x78/0xe8         
      [c0000003c4f0b810] [800000000693ed4c] .vvp_io_write_start+0xfc/0x3e0 [lustre]   
      [c0000003c4f0b8e0] [80000000024d9c6c] .cl_io_start+0xcc/0x220 [obdclass]        
      [c0000003c4f0b980] [80000000024e1a84] .cl_io_loop+0x194/0x2c0 [obdclass]        
      [c0000003c4f0ba30] [80000000068ba1d8] .ll_file_io_generic+0x498/0x670 [lustre]  
      [c0000003c4f0bb30] [80000000068ba834] .ll_file_aio_write+0x1d4/0x3a0 [lustre]   
      [c0000003c4f0bc00] [80000000068bab50] .ll_file_write+0x150/0x320 [lustre]       
      [c0000003c4f0bce0] [c0000000000d1ba8] .vfs_write+0xd0/0x1c4                     
      [c0000003c4f0bd80] [c0000000000d1d98] .SyS_write+0x54/0x98                      
      [c0000003c4f0be30] [c000000000000580] syscall_exit+0x0/0x2c
      

      Listing the cl_env_* addresses:

      cl_env_get+0xec
      (gdb) l *cl_env_get+0xec
      0xa2f4c is in cl_env_get (/builddir/build/BUILD/lustre-2.3.58/lustre/obdclass/cl_object.c:804).
      
       803         ENTRY;                                                                  
       804         spin_lock(&cl_envs_guard);                                              
       805         LASSERT(equi(cl_envs_cached_nr == 0, cfs_list_empty(&cl_envs))); 
      
      cl_env_put+0x178
      (gdb) l *cl_env_put+0x178
      0xa17e8 is in cl_env_put (/builddir/build/BUILD/lustre-2.3.58/lustre/obdclass/cl_object.c:979).
      
       978                     (env->le_ses->lc_tags & ~LCT_HAS_EXIT) == LCT_SESSION) {    
       979                         spin_lock(&cl_envs_guard);                              
       980                         cfs_list_add(&cle->ce_linkage, &cl_envs);
      

      Dumping the stacks for all active CPUs shows nearly every one of them contending on this lock.
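
      To make the serialization obvious, here is a simplified sketch of the pattern the two snippets above point at: both the get and put paths funnel every CPU through the single, node-wide cl_envs_guard spinlock that protects the global cl_envs cache list. This is an illustration only, not the actual Lustre code; apart from cl_envs_guard, cl_envs, cl_envs_cached_nr and ce_linkage, the names are made up.

      #include <linux/spinlock.h>
      #include <linux/list.h>

      struct cl_env_item {
              struct list_head ce_linkage;       /* linkage into the global cache */
              /* ... environment payload ... */
      };

      static DEFINE_SPINLOCK(cl_envs_guard);     /* one lock for the whole node */
      static LIST_HEAD(cl_envs);                 /* global cache of idle envs */
      static unsigned int cl_envs_cached_nr;

      /* cl_env_get()-like cache path: every CPU in reclaim spins here. */
      static struct cl_env_item *env_cache_get(void)
      {
              struct cl_env_item *cle = NULL;

              spin_lock(&cl_envs_guard);
              if (!list_empty(&cl_envs)) {
                      cle = list_first_entry(&cl_envs, struct cl_env_item,
                                             ce_linkage);
                      list_del(&cle->ce_linkage);
                      cl_envs_cached_nr--;
              }
              spin_unlock(&cl_envs_guard);
              return cle;
      }

      /* cl_env_put()-like cache path: takes the same global lock again. */
      static void env_cache_put(struct cl_env_item *cle)
      {
              spin_lock(&cl_envs_guard);
              list_add(&cle->ce_linkage, &cl_envs);
              cl_envs_cached_nr++;
              spin_unlock(&cl_envs_guard);
      }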

      Attachments

        Issue Links

          Activity

            [LU-2622] All CPUs spinning on cl_envs_guard lock under ll_releasepage during memory reclaim

            prakash Prakash Surya (Inactive) added a comment -

            I don't think it's ready to be closed, but I'm OK dropping the priority.

            pjones Peter Jones added a comment -

            Prakash

            Does the remaining work left to land on this ticket warrant it remaining as a blocker?

            Peter


            jay Jinshan Xiong (Inactive) added a comment -

            Hi Prakash, I'd like to have a LCT_ATOMIC one. Sorry for the delayed response.

            prakash Prakash Surya (Inactive) added a comment -

            The patch to remove the list has already landed. Should there be a follow-up patch to add the LCT_ATOMIC flag when allocating cl_env's? Or should that go into the preallocated per-CPU cl_env patch?

            jay Jinshan Xiong (Inactive) added a comment -

            Hi Prakash, can you please add one more tag, LCT_ATOMIC, for cl_env. This tag will be used in ll_releasepage() so that keys will be allocated with GFP_ATOMIC; then a small number of preallocated cl_envs will be used in case the allocation fails.

            We make this effort so that ll_releasepage() can make progress even in the worst case.
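
            As a rough illustration of the LCT_ATOMIC idea described above (only the LCT_ATOMIC tag name and the GFP flags come from the comment; the bit value, structures, and helpers below are hypothetical):

            #include <linux/gfp.h>
            #include <linux/slab.h>
            #include <linux/spinlock.h>
            #include <linux/list.h>

            #define LCT_ATOMIC 0x0100                    /* hypothetical tag bit */

            struct env_key {
                    struct list_head ek_link;
                    /* ... key payload ... */
            };

            static DEFINE_SPINLOCK(env_reserve_lock);
            static LIST_HEAD(env_reserve);               /* small preallocated pool */

            static struct env_key *env_key_alloc(unsigned int ctx_tags)
            {
                    /* in atomic context (e.g. ll_releasepage) never block */
                    gfp_t gfp = (ctx_tags & LCT_ATOMIC) ? GFP_ATOMIC : GFP_NOFS;
                    struct env_key *key = kmalloc(sizeof(*key), gfp);

                    if (key == NULL && (ctx_tags & LCT_ATOMIC)) {
                            /* atomic allocation failed; dip into the reserve */
                            spin_lock(&env_reserve_lock);
                            if (!list_empty(&env_reserve)) {
                                    key = list_first_entry(&env_reserve,
                                                           struct env_key, ek_link);
                                    list_del(&key->ek_link);
                            }
                            spin_unlock(&env_reserve_lock);
                    }
                    return key;
            }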
            prakash Prakash Surya (Inactive) added a comment - edited

            I have also pushed this patch which completely removes the list, as I now think it is unnecessary: http://review.whamcloud.com/5204

            prakash Prakash Surya (Inactive) added a comment -

            I pushed a preliminary patch of the per-CPU lists idea: http://review.whamcloud.com/5215

            It's still far from being complete, but it can be a starting point for discussion on implementation details.

            morrone Christopher Morrone (Inactive) added a comment -

            Yes, yesterday ior reproduced it quite well under 2.3.58-7chaos on Sequoia. The command was:

            ior -F -e -g -t1m -b512m -o /p/lsfull/morrone/f

            But smaller transfer sizes might be even better:

            ior -F -e -g -t32k -b128m -o /p/lsfull/morrone/f

            That was 96K MPI tasks of ior, which means 128 writers writing to 128 files per Lustre client.

            jay Jinshan Xiong (Inactive) added a comment -

            No, no. I meant we will disable preemption on our own.

            I'm pretty interested in the backtrace of the sleep in osc_enter_cache(). Do you have a use case to reproduce it?

            morrone Christopher Morrone (Inactive) added a comment -

            I don't think that you should rely on kernel preemption being disabled for client code. We're already patchless on the client side, so that assumption may be false.

            jay Jinshan Xiong (Inactive) added a comment -

            There may be lots of related calls behind cl_page_put() which need an env. I would suggest using a per-CPU preallocated env array for ll_releasepage(). We just need to disable preemption (it should be disabled by default for server kernels) to call it.
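
            A minimal sketch of that per-CPU preallocated env idea, assuming one reserved env per CPU with explicit preempt_disable()/preempt_enable() around its use (names are made up; this is not actual Lustre code):

            #include <linux/percpu.h>
            #include <linux/preempt.h>
            #include <linux/stddef.h>

            struct env_reserve {
                    void *er_env;                /* preallocated at module init */
                    int   er_busy;               /* non-zero while borrowed */
            };

            static DEFINE_PER_CPU(struct env_reserve, env_reserved);

            /* Borrow this CPU's reserved env; preemption stays off until put. */
            static void *env_percpu_get(void)
            {
                    struct env_reserve *r;

                    preempt_disable();
                    r = this_cpu_ptr(&env_reserved);
                    if (r->er_busy) {            /* nested use on this CPU */
                            preempt_enable();
                            return NULL;
                    }
                    r->er_busy = 1;
                    return r->er_env;
            }

            static void env_percpu_put(void *env)
            {
                    struct env_reserve *r = this_cpu_ptr(&env_reserved);

                    (void)env;
                    r->er_busy = 0;
                    preempt_enable();            /* pairs with env_percpu_get() */
            }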

            People

              jay Jinshan Xiong (Inactive)
              prakash Prakash Surya (Inactive)
              Votes: 0
              Watchers: 9

              Dates

                Created:
                Updated:
                Resolved: