
LU-2622: All CPUs spinning on cl_envs_guard lock under ll_releasepage during memory reclaim

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • Affects Version: Lustre 2.4.0
    • Fix Version: Lustre 2.4.0

    Description

      During testing on Sequoia, we've seen specific I/O nodes contend heavily on the cl_envs_guard lock while memory reclaim is in progress, which severely degrades the performance of the affected client.

      An example stack trace of a thread spinning on the lock is below:

      CPU56:                                                                             
      Call Trace:                                                                        
      [c00000000fe3bb30] [c000000000008d1c] .show_stack+0x7c/0x184 (unreliable)          
      [c00000000fe3bbe0] [c00000000027604c] .showacpu+0x64/0x94                          
      [c00000000fe3bc70] [c000000000068b30] .generic_smp_call_function_interrupt+0x10c/0x230
      [c00000000fe3bd40] [c00000000001d11c] .smp_message_recv+0x34/0x78                  
      [c00000000fe3bdc0] [c00000000002526c] .bgq_ipi_dispatch+0x118/0x18c                
      [c00000000fe3be50] [c00000000007b20c] .handle_IRQ_event+0x88/0x18c                 
      [c00000000fe3bf00] [c00000000007dc90] .handle_percpu_irq+0x8c/0x100                
      [c00000000fe3bf90] [c00000000001b808] .call_handle_irq+0x1c/0x2c                   
      [c0000003e1c4a4c0] [c0000000000059f0] .do_IRQ+0x154/0x1e0                          
      [c0000003e1c4a570] [c0000000000144dc] exc_external_input_book3e+0x110/0x114        
      --- Exception: 501 at ._raw_spin_lock+0xd8/0x1a8                                   
          LR = ._raw_spin_lock+0x104/0x1a8                                               
      [c0000003e1c4a860] [8000000000b04f38] libcfs_nidstrings+0x2acc/0xfffffffffffe5824 [libcfs] (unreliable)
      [c0000003e1c4a910] [c00000000042d4cc] ._spin_lock+0x10/0x24                        
      [c0000003e1c4a980] [80000000024c2f4c] .cl_env_get+0xec/0x480 [obdclass]            
      [c0000003e1c4aa60] [80000000024c336c] .cl_env_nested_get+0x8c/0xf0 [obdclass]   
      [c0000003e1c4aaf0] [800000000692070c] .ll_releasepage+0xbc/0x200 [lustre]          
      [c0000003e1c4aba0] [c000000000094110] .try_to_release_page+0x68/0x8c               
      [c0000003e1c4ac10] [c0000000000a4190] .shrink_page_list.clone.0+0x3d8/0x63c        
      [c0000003e1c4adc0] [c0000000000a47d8] .shrink_inactive_list+0x3e4/0x690            
      [c0000003e1c4af90] [c0000000000a4f54] .shrink_zone+0x4d0/0x4d4                     
      [c0000003e1c4b0c0] [c0000000000a5a68] .try_to_free_pages+0x204/0x3d0               
      [c0000003e1c4b220] [c00000000009d044] .__alloc_pages_nodemask+0x460/0x738          
      [c0000003e1c4b3a0] [c000000000095af4] .grab_cache_page_write_begin+0x7c/0xec       
      [c0000003e1c4b450] [8000000006920964] .ll_write_begin+0x94/0x270 [lustre]          
      [c0000003e1c4b520] [c0000000000968c8] .generic_file_buffered_write+0x148/0x374  
      [c0000003e1c4b660] [c000000000097050] .__generic_file_aio_write+0x374/0x3d8        
      [c0000003e1c4b760] [c00000000009712c] .generic_file_aio_write+0x78/0xe8            
      [c0000003e1c4b810] [800000000693ed4c] .vvp_io_write_start+0xfc/0x3e0 [lustre]   
      [c0000003e1c4b8e0] [80000000024d9c6c] .cl_io_start+0xcc/0x220 [obdclass]           
      [c0000003e1c4b980] [80000000024e1a84] .cl_io_loop+0x194/0x2c0 [obdclass]           
      [c0000003e1c4ba30] [80000000068ba1d8] .ll_file_io_generic+0x498/0x670 [lustre]  
      [c0000003e1c4bb30] [80000000068ba834] .ll_file_aio_write+0x1d4/0x3a0 [lustre]   
      [c0000003e1c4bc00] [80000000068bab50] .ll_file_write+0x150/0x320 [lustre]          
      [c0000003e1c4bce0] [c0000000000d1ba8] .vfs_write+0xd0/0x1c4                        
      [c0000003e1c4bd80] [c0000000000d1d98] .SyS_write+0x54/0x98                         
      [c0000003e1c4be30] [c000000000000580] syscall_exit+0x0/0x2c
      

      I also see the same contention on calls into cl_env_put():

      CPU63:                                                                          
      Call Trace:                                                                     
      [c00000000fe03b30] [c000000000008d1c] .show_stack+0x7c/0x184 (unreliable)       
      [c00000000fe03be0] [c00000000027604c] .showacpu+0x64/0x94                       
      [c00000000fe03c70] [c000000000068b30] .generic_smp_call_function_interrupt+0x10c/0x230
      [c00000000fe03d40] [c00000000001d11c] .smp_message_recv+0x34/0x78               
      [c00000000fe03dc0] [c00000000002526c] .bgq_ipi_dispatch+0x118/0x18c             
      [c00000000fe03e50] [c00000000007b20c] .handle_IRQ_event+0x88/0x18c              
      [c00000000fe03f00] [c00000000007dc90] .handle_percpu_irq+0x8c/0x100             
      [c00000000fe03f90] [c00000000001b808] .call_handle_irq+0x1c/0x2c                
      [c0000003c4f0a510] [c0000000000059f0] .do_IRQ+0x154/0x1e0                       
      [c0000003c4f0a5c0] [c0000000000144dc] exc_external_input_book3e+0x110/0x114     
      --- Exception: 501 at ._raw_spin_lock+0xdc/0x1a8                                
          LR = ._raw_spin_lock+0x104/0x1a8                                             
      [c0000003c4f0a8b0] [800000000697a578] msgdata.87439+0x20/0xfffffffffffccf88 [lustre] (unreliable)
      [c0000003c4f0a960] [c00000000042d4cc] ._spin_lock+0x10/0x24                     
      [c0000003c4f0a9d0] [80000000024c17e8] .cl_env_put+0x178/0x420 [obdclass]        
      [c0000003c4f0aa70] [80000000024c1ab0] .cl_env_nested_put+0x20/0x40 [obdclass]   
      [c0000003c4f0aaf0] [8000000006920794] .ll_releasepage+0x144/0x200 [lustre]      
      [c0000003c4f0aba0] [c000000000094110] .try_to_release_page+0x68/0x8c            
      [c0000003c4f0ac10] [c0000000000a4190] .shrink_page_list.clone.0+0x3d8/0x63c     
      [c0000003c4f0adc0] [c0000000000a47d8] .shrink_inactive_list+0x3e4/0x690         
      [c0000003c4f0af90] [c0000000000a4f54] .shrink_zone+0x4d0/0x4d4                  
      [c0000003c4f0b0c0] [c0000000000a5a68] .try_to_free_pages+0x204/0x3d0            
      [c0000003c4f0b220] [c00000000009d044] .__alloc_pages_nodemask+0x460/0x738       
      [c0000003c4f0b3a0] [c000000000095af4] .grab_cache_page_write_begin+0x7c/0xec    
      [c0000003c4f0b450] [8000000006920964] .ll_write_begin+0x94/0x270 [lustre]       
      [c0000003c4f0b520] [c0000000000968c8] .generic_file_buffered_write+0x148/0x374  
      [c0000003c4f0b660] [c000000000097050] .__generic_file_aio_write+0x374/0x3d8     
      [c0000003c4f0b760] [c00000000009712c] .generic_file_aio_write+0x78/0xe8         
      [c0000003c4f0b810] [800000000693ed4c] .vvp_io_write_start+0xfc/0x3e0 [lustre]   
      [c0000003c4f0b8e0] [80000000024d9c6c] .cl_io_start+0xcc/0x220 [obdclass]        
      [c0000003c4f0b980] [80000000024e1a84] .cl_io_loop+0x194/0x2c0 [obdclass]        
      [c0000003c4f0ba30] [80000000068ba1d8] .ll_file_io_generic+0x498/0x670 [lustre]  
      [c0000003c4f0bb30] [80000000068ba834] .ll_file_aio_write+0x1d4/0x3a0 [lustre]   
      [c0000003c4f0bc00] [80000000068bab50] .ll_file_write+0x150/0x320 [lustre]       
      [c0000003c4f0bce0] [c0000000000d1ba8] .vfs_write+0xd0/0x1c4                     
      [c0000003c4f0bd80] [c0000000000d1d98] .SyS_write+0x54/0x98                      
      [c0000003c4f0be30] [c000000000000580] syscall_exit+0x0/0x2c
      

      Resolving the cl_env_* return addresses to source lines with gdb:

      cl_env_get+0xec
      (gdb) l *cl_env_get+0xec
      0xa2f4c is in cl_env_get (/builddir/build/BUILD/lustre-2.3.58/lustre/obdclass/cl_object.c:804).
      
       803         ENTRY;                                                                  
       804         spin_lock(&cl_envs_guard);                                              
       805         LASSERT(equi(cl_envs_cached_nr == 0, cfs_list_empty(&cl_envs))); 
      
      cl_env_put+0x178
      (gdb) l *cl_env_put+0x178
      0xa17e8 is in cl_env_put (/builddir/build/BUILD/lustre-2.3.58/lustre/obdclass/cl_object.c:979).
      
       978                     (env->le_ses->lc_tags & ~LCT_HAS_EXIT) == LCT_SESSION) {    
       979                         spin_lock(&cl_envs_guard);                              
       980                         cfs_list_add(&cle->ce_linkage, &cl_envs);
      

      Dumping the stacks for all active CPUs shows nearly every one of them contending on this lock.
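
      To make the pattern concrete, here is a simplified sketch of what those excerpts point at (cl_envs, cl_envs_guard, cl_envs_cached_nr, and ce_linkage are taken from the listings above; the struct layout and helper below are assumptions, not verbatim Lustre source): a single global list of cached cl_envs, protected by one spinlock that every CPU performing reclaim has to take.

      static cfs_list_t cl_envs;           /* global cache of idle cl_envs  */
      static unsigned   cl_envs_cached_nr; /* entries currently on the list */
      static spinlock_t cl_envs_guard;     /* one lock serializing them all */

      /* Roughly what cl_env_get() does when it checks the cache: under
       * heavy reclaim, every CPU funnels through cl_envs_guard and they
       * all end up spinning here. */
      static struct cl_env *cl_env_cache_take(void)
      {
              struct cl_env *cle = NULL;

              spin_lock(&cl_envs_guard);
              if (!cfs_list_empty(&cl_envs)) {
                      cle = cfs_list_entry(cl_envs.next, struct cl_env,
                                           ce_linkage);
                      cfs_list_del_init(&cle->ce_linkage);
                      cl_envs_cached_nr--;
              }
              spin_unlock(&cl_envs_guard);
              return cle;
      }

      cl_env_put() takes the same lock to return an env to the list (the spin_lock at cl_object.c:979 above), so both halves of every page release serialize on the one lock.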

          Activity


            jfilizetti Jeremy Filizetti added a comment -

            While testing the patch for this bug (http://review.whamcloud.com/#change,5446, for LU-3321) I came across an LBUG. I think it happened when dropping the caches manually (echo 1 > /proc/sys/vm/drop_caches):

            kernel:LustreError: 28127:0:(cl_object.c:1104:cl_env_percpu_put()) ASSERTION( cle == &cl_env_percpu[cpu] ) failed:

            jay Jinshan Xiong (Inactive) added a comment - edited

            > So I don't think just making steps 1 and 2 atomic will provide us any benefit. We also need to ensure steps 3+ are done atomically as well, right? With that said, I'm not sure making these allocations atomic buys us anything. Shouldn't we just pass in the correct gfp flags to ensure we don't loop back on ourselves (e.g. GFP_NOIO)?

            We should define a new context tag, say LCT_ATOMIC, which would be set when calling cl_env_alloc() in ll_releasepage(). Of course, the key allocation functions would need to be changed to use GFP_ATOMIC when the LCT_ATOMIC tag is set.
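
            A minimal sketch of how that could look (LCT_ATOMIC and the helper below are hypothetical, not landed code); the env already carries its tag bits in lc_tags, as the cl_env_put() excerpt above shows:

            /* Hypothetical: LCT_ATOMIC is a new bit in the lu_context tag
             * space, set on envs created from ll_releasepage().  Key value
             * allocations consult it to pick a non-sleeping gfp mask. */
            static void *key_value_alloc(struct lu_context *ctx, size_t size)
            {
                    gfp_t gfp = (ctx->lc_tags & LCT_ATOMIC) ?
                                GFP_ATOMIC : GFP_KERNEL;

                    return kmalloc(size, gfp);
            }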

            > Also, I made a debug patch to simply drop the lu_env altogether from ll_releasepage, and pushed that through Maloo: http://review.whamcloud.com/5847. I didn't get any failures, so that is probably a better approach to spend time on, IMO, although merging a change like that will involve changes to the debug infrastructure. I opened LU-3033 to facilitate discussion on that topic.

            The major problem is that cl_page_put() can sleep; the call path is: cl_page_put -> cl_page_free -> cl_object_put -> lu_object_put -> lu_object_free -> lov_delete_raid0 -> cl_locks_prune. I'm quite sure the env will be used in this call path.

            Update: after taking a further look, I realize ll_releasepage() may not sleep in the above code path. I will write a patch to make sure.


            prakash Prakash Surya (Inactive) added a comment -

            Jinshan, I briefly looked into making the memory allocations atomic under ll_releasepage, but it doesn't look as simple as I hoped. From what I can tell there will potentially be 3 or more memory allocations when creating a new lu_env structure: 1. the lu_env itself, 2. the lu_env's keys, 3. the lu_global_key, 4. any allocations for other keys. So I don't think just making steps 1 and 2 atomic will provide us any benefit. We also need to ensure steps 3+ are done atomically as well, right? With that said, I'm not sure making these allocations atomic buys us anything. Shouldn't we just pass in the correct gfp flags to ensure we don't loop back on ourselves (e.g. GFP_NOIO)?
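
            For reference, the gfp-flag approach amounts to something like this sketch (not the actual Lustre allocation path): GFP_NOIO still lets the allocator reclaim memory, but forbids it from issuing I/O, so an allocation made under ll_releasepage cannot recurse into I/O paths and deadlock on itself.

            #include <linux/slab.h>

            /* Sketch: allocations on the releasepage/reclaim path would use
             * GFP_NOIO (or GFP_ATOMIC, to also forbid sleeping) instead of
             * GFP_KERNEL. */
            static void *env_buf_alloc(size_t size)
            {
                    return kmalloc(size, GFP_NOIO);
            }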

            Also, I made a debug patch to simply drop the lu_env altogether from ll_releasepage, and pushed that through Maloo: http://review.whamcloud.com/5847. I didn't get any failures, so that is probably a better approach to spend time on, IMO, although merging a change like that will involve changes to the debug infrastructure. I opened LU-3033 to facilitate discussion on that topic.

            pjones Peter Jones added a comment -

            OK, thanks Prakash!


            prakash Prakash Surya (Inactive) added a comment -

            I don't think it's ready to be closed, but I'm OK with dropping the priority.

            pjones Peter Jones added a comment -

            Prakash

            Does the work still left to land on this ticket warrant keeping it as a blocker?

            Peter


            jay Jinshan Xiong (Inactive) added a comment -

            Hi Prakash, I'd like to have an LCT_ATOMIC one. Sorry for the delayed response.


            prakash Prakash Surya (Inactive) added a comment -

            The patch to remove the list has already landed. Should there be a follow-up patch to add the LCT_ATOMIC flag when allocating cl_envs? Or should that be added to the preallocated per-CPU cl_env patch?


            jay Jinshan Xiong (Inactive) added a comment -

            Hi Prakash, can you please add one more tag, LCT_ATOMIC, for cl_env? This tag will be used in ll_releasepage() so that keys are allocated with GFP_ATOMIC; if that allocation fails, a small list of preallocated cl_envs will be used instead.

            We're making this effort so that ll_releasepage() can make progress even in the worst situation.
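
            A sketch of the shape of that fallback (the function names here are hypothetical, not the eventual patch): try the GFP_ATOMIC allocation first, and only dip into the small preallocated pool when it fails, so the release path always has an env to work with.

            /* Hypothetical: how ll_releasepage() could obtain an env
             * without ever sleeping. */
            static struct lu_env *cl_env_get_atomic(void)
            {
                    struct lu_env *env;

                    env = cl_env_alloc_atomic();          /* GFP_ATOMIC keys */
                    if (env == NULL)
                            env = cl_env_prealloc_take(); /* small reserve   */
                    return env;
            }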

            prakash Prakash Surya (Inactive) added a comment - edited

            I have also pushed this patch, which completely removes the list, as I now think it is unnecessary: http://review.whamcloud.com/5204


            prakash Prakash Surya (Inactive) added a comment -

            I pushed a preliminary patch of the per-CPU lists idea: http://review.whamcloud.com/5215

            It's still far from complete, but it can serve as a starting point for discussing implementation details.
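
            For discussion, the rough shape of the per-CPU idea (names are assumptions here, not the code in review 5215): each CPU keeps its own small cache of idle cl_envs, so cl_env_get()/cl_env_put() in reclaim never touch a shared lock.

            /* Hypothetical per-CPU replacement for the global cl_envs list. */
            struct cl_env_cache {
                    spinlock_t cec_guard;   /* effectively uncontended      */
                    cfs_list_t cec_list;    /* idle cl_envs for this CPU    */
                    unsigned   cec_nr;
            };

            static struct cl_env_cache cl_env_caches[NR_CPUS];

            static struct cl_env *cl_env_cache_get(void)
            {
                    struct cl_env_cache *cec = &cl_env_caches[get_cpu()];
                    struct cl_env *cle = NULL;

                    spin_lock(&cec->cec_guard);
                    if (!cfs_list_empty(&cec->cec_list)) {
                            cle = cfs_list_entry(cec->cec_list.next,
                                                 struct cl_env, ce_linkage);
                            cfs_list_del_init(&cle->ce_linkage);
                            cec->cec_nr--;
                    }
                    spin_unlock(&cec->cec_guard);
                    put_cpu();
                    return cle;
            }

            The interesting design questions are what to do when the local list is empty, and whether an env may migrate between CPUs; the cl_env_percpu assertion Jeremy hit above shows why the latter needs care.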


            People

              Assignee: jay Jinshan Xiong (Inactive)
              Reporter: prakash Prakash Surya (Inactive)
              Votes: 0
              Watchers: 9
