  Lustre / LU-2622

All CPUs spinning on cl_envs_guard lock under ll_releasepage during memory reclaim

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • Affects Version: Lustre 2.4.0
    • Fix Version: Lustre 2.4.0
    • 3
    • 6133

    Description

      During testing on Sequoia, we've seen specific IO nodes contend heavily on the cl_envs_guard lock while memory reclaim is in progress. This severely degrades the performance of the affected client.

      An example stack trace of a thread spinning on the lock is below:

      CPU56:                                                                             
      Call Trace:                                                                        
      [c00000000fe3bb30] [c000000000008d1c] .show_stack+0x7c/0x184 (unreliable)          
      [c00000000fe3bbe0] [c00000000027604c] .showacpu+0x64/0x94                          
      [c00000000fe3bc70] [c000000000068b30] .generic_smp_call_function_interrupt+0x10c/0x230
      [c00000000fe3bd40] [c00000000001d11c] .smp_message_recv+0x34/0x78                  
      [c00000000fe3bdc0] [c00000000002526c] .bgq_ipi_dispatch+0x118/0x18c                
      [c00000000fe3be50] [c00000000007b20c] .handle_IRQ_event+0x88/0x18c                 
      [c00000000fe3bf00] [c00000000007dc90] .handle_percpu_irq+0x8c/0x100                
      [c00000000fe3bf90] [c00000000001b808] .call_handle_irq+0x1c/0x2c                   
      [c0000003e1c4a4c0] [c0000000000059f0] .do_IRQ+0x154/0x1e0                          
      [c0000003e1c4a570] [c0000000000144dc] exc_external_input_book3e+0x110/0x114        
      --- Exception: 501 at ._raw_spin_lock+0xd8/0x1a8                                   
          LR = ._raw_spin_lock+0x104/0x1a8                                               
      [c0000003e1c4a860] [8000000000b04f38] libcfs_nidstrings+0x2acc/0xfffffffffffe5824 [libcfs] (unreliable)
      [c0000003e1c4a910] [c00000000042d4cc] ._spin_lock+0x10/0x24                        
      [c0000003e1c4a980] [80000000024c2f4c] .cl_env_get+0xec/0x480 [obdclass]            
      [c0000003e1c4aa60] [80000000024c336c] .cl_env_nested_get+0x8c/0xf0 [obdclass]   
      [c0000003e1c4aaf0] [800000000692070c] .ll_releasepage+0xbc/0x200 [lustre]          
      [c0000003e1c4aba0] [c000000000094110] .try_to_release_page+0x68/0x8c               
      [c0000003e1c4ac10] [c0000000000a4190] .shrink_page_list.clone.0+0x3d8/0x63c        
      [c0000003e1c4adc0] [c0000000000a47d8] .shrink_inactive_list+0x3e4/0x690            
      [c0000003e1c4af90] [c0000000000a4f54] .shrink_zone+0x4d0/0x4d4                     
      [c0000003e1c4b0c0] [c0000000000a5a68] .try_to_free_pages+0x204/0x3d0               
      [c0000003e1c4b220] [c00000000009d044] .__alloc_pages_nodemask+0x460/0x738          
      [c0000003e1c4b3a0] [c000000000095af4] .grab_cache_page_write_begin+0x7c/0xec       
      [c0000003e1c4b450] [8000000006920964] .ll_write_begin+0x94/0x270 [lustre]          
      [c0000003e1c4b520] [c0000000000968c8] .generic_file_buffered_write+0x148/0x374  
      [c0000003e1c4b660] [c000000000097050] .__generic_file_aio_write+0x374/0x3d8        
      [c0000003e1c4b760] [c00000000009712c] .generic_file_aio_write+0x78/0xe8            
      [c0000003e1c4b810] [800000000693ed4c] .vvp_io_write_start+0xfc/0x3e0 [lustre]   
      [c0000003e1c4b8e0] [80000000024d9c6c] .cl_io_start+0xcc/0x220 [obdclass]           
      [c0000003e1c4b980] [80000000024e1a84] .cl_io_loop+0x194/0x2c0 [obdclass]           
      [c0000003e1c4ba30] [80000000068ba1d8] .ll_file_io_generic+0x498/0x670 [lustre]  
      [c0000003e1c4bb30] [80000000068ba834] .ll_file_aio_write+0x1d4/0x3a0 [lustre]   
      [c0000003e1c4bc00] [80000000068bab50] .ll_file_write+0x150/0x320 [lustre]          
      [c0000003e1c4bce0] [c0000000000d1ba8] .vfs_write+0xd0/0x1c4                        
      [c0000003e1c4bd80] [c0000000000d1d98] .SyS_write+0x54/0x98                         
      [c0000003e1c4be30] [c000000000000580] syscall_exit+0x0/0x2c
      

      I also see this for calls into cl_env_put():

      CPU63:                                                                          
      Call Trace:                                                                     
      [c00000000fe03b30] [c000000000008d1c] .show_stack+0x7c/0x184 (unreliable)       
      [c00000000fe03be0] [c00000000027604c] .showacpu+0x64/0x94                       
      [c00000000fe03c70] [c000000000068b30] .generic_smp_call_function_interrupt+0x10c/0x230
      [c00000000fe03d40] [c00000000001d11c] .smp_message_recv+0x34/0x78               
      [c00000000fe03dc0] [c00000000002526c] .bgq_ipi_dispatch+0x118/0x18c             
      [c00000000fe03e50] [c00000000007b20c] .handle_IRQ_event+0x88/0x18c              
      [c00000000fe03f00] [c00000000007dc90] .handle_percpu_irq+0x8c/0x100             
      [c00000000fe03f90] [c00000000001b808] .call_handle_irq+0x1c/0x2c                
      [c0000003c4f0a510] [c0000000000059f0] .do_IRQ+0x154/0x1e0                       
      [c0000003c4f0a5c0] [c0000000000144dc] exc_external_input_book3e+0x110/0x114     
      --- Exception: 501 at ._raw_spin_lock+0xdc/0x1a8                                
          LR = ._raw_spin_lock+0x104/0x1a8                                             
      [c0000003c4f0a8b0] [800000000697a578] msgdata.87439+0x20/0xfffffffffffccf88 [lustre] (unreliable)
      [c0000003c4f0a960] [c00000000042d4cc] ._spin_lock+0x10/0x24                     
      [c0000003c4f0a9d0] [80000000024c17e8] .cl_env_put+0x178/0x420 [obdclass]        
      [c0000003c4f0aa70] [80000000024c1ab0] .cl_env_nested_put+0x20/0x40 [obdclass]   
      [c0000003c4f0aaf0] [8000000006920794] .ll_releasepage+0x144/0x200 [lustre]      
      [c0000003c4f0aba0] [c000000000094110] .try_to_release_page+0x68/0x8c            
      [c0000003c4f0ac10] [c0000000000a4190] .shrink_page_list.clone.0+0x3d8/0x63c     
      [c0000003c4f0adc0] [c0000000000a47d8] .shrink_inactive_list+0x3e4/0x690         
      [c0000003c4f0af90] [c0000000000a4f54] .shrink_zone+0x4d0/0x4d4                  
      [c0000003c4f0b0c0] [c0000000000a5a68] .try_to_free_pages+0x204/0x3d0            
      [c0000003c4f0b220] [c00000000009d044] .__alloc_pages_nodemask+0x460/0x738       
      [c0000003c4f0b3a0] [c000000000095af4] .grab_cache_page_write_begin+0x7c/0xec    
      [c0000003c4f0b450] [8000000006920964] .ll_write_begin+0x94/0x270 [lustre]       
      [c0000003c4f0b520] [c0000000000968c8] .generic_file_buffered_write+0x148/0x374  
      [c0000003c4f0b660] [c000000000097050] .__generic_file_aio_write+0x374/0x3d8     
      [c0000003c4f0b760] [c00000000009712c] .generic_file_aio_write+0x78/0xe8         
      [c0000003c4f0b810] [800000000693ed4c] .vvp_io_write_start+0xfc/0x3e0 [lustre]   
      [c0000003c4f0b8e0] [80000000024d9c6c] .cl_io_start+0xcc/0x220 [obdclass]        
      [c0000003c4f0b980] [80000000024e1a84] .cl_io_loop+0x194/0x2c0 [obdclass]        
      [c0000003c4f0ba30] [80000000068ba1d8] .ll_file_io_generic+0x498/0x670 [lustre]  
      [c0000003c4f0bb30] [80000000068ba834] .ll_file_aio_write+0x1d4/0x3a0 [lustre]   
      [c0000003c4f0bc00] [80000000068bab50] .ll_file_write+0x150/0x320 [lustre]       
      [c0000003c4f0bce0] [c0000000000d1ba8] .vfs_write+0xd0/0x1c4                     
      [c0000003c4f0bd80] [c0000000000d1d98] .SyS_write+0x54/0x98                      
      [c0000003c4f0be30] [c000000000000580] syscall_exit+0x0/0x2c
      

      Listing the source at the cl_env_* return addresses with gdb:

      cl_env_get+0xec
      (gdb) l *cl_env_get+0xec
      0xa2f4c is in cl_env_get (/builddir/build/BUILD/lustre-2.3.58/lustre/obdclass/cl_object.c:804).
      
       803         ENTRY;                                                                  
       804         spin_lock(&cl_envs_guard);                                              
       805         LASSERT(equi(cl_envs_cached_nr == 0, cfs_list_empty(&cl_envs))); 
      
      cl_env_put+0x178
      (gdb) l *cl_env_put+0x178
      0xa17e8 is in cl_env_put (/builddir/build/BUILD/lustre-2.3.58/lustre/obdclass/cl_object.c:979).
      
       978                     (env->le_ses->lc_tags & ~LCT_HAS_EXIT) == LCT_SESSION) {    
       979                         spin_lock(&cl_envs_guard);                              
       980                         cfs_list_add(&cle->ce_linkage, &cl_envs);
      

      Dumping the stacks for all active CPUs shows nearly every one of them contending on this lock.
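
      To make the contention point concrete, the pattern the gdb listing points at looks roughly like the simplified sketch below. This is not the actual Lustre source: only cl_envs_guard, cl_envs, cl_envs_cached_nr, and ce_linkage are names taken from the listing above; struct cl_env_sketch and the two helper functions are hypothetical. The point is that cl_env_get() and cl_env_put() both funnel through one node-wide spinlock, so every ll_releasepage() call during reclaim serializes on it.

      #include <linux/spinlock.h>
      #include <linux/list.h>

      /* Illustrative stand-in for the cached client env; only ce_linkage
       * mirrors the real structure referenced in the gdb listing. */
      struct cl_env_sketch {
              struct list_head ce_linkage;
              /* ... per-thread lu_env state ... */
      };

      static DEFINE_SPINLOCK(cl_envs_guard);    /* one lock for the whole node */
      static LIST_HEAD(cl_envs);                /* global cache of idle envs */
      static unsigned int cl_envs_cached_nr;

      /* cl_env_get() path: every caller, including ll_releasepage() during
       * memory reclaim, takes the same global spinlock to check the cache. */
      static struct cl_env_sketch *cl_env_cache_get(void)
      {
              struct cl_env_sketch *cle = NULL;

              spin_lock(&cl_envs_guard);
              if (!list_empty(&cl_envs)) {
                      cle = list_first_entry(&cl_envs, struct cl_env_sketch,
                                             ce_linkage);
                      list_del_init(&cle->ce_linkage);
                      cl_envs_cached_nr--;
              }
              spin_unlock(&cl_envs_guard);
              return cle;                       /* NULL: allocate a fresh env */
      }

      /* cl_env_put() path: returning the env to the cache hits the same lock. */
      static void cl_env_cache_put(struct cl_env_sketch *cle)
      {
              spin_lock(&cl_envs_guard);
              list_add(&cle->ce_linkage, &cl_envs);
              cl_envs_cached_nr++;
              spin_unlock(&cl_envs_guard);
      }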


          Activity


            adilger Andreas Dilger added a comment - Fixed via LU-3321.

            jay Jinshan Xiong (Inactive) added a comment -

            Hi Prakash,

            I reposted the percpu patch at http://review.whamcloud.com/8174. I have fixed the reschedule problem, so it should be working now.
            prakash Prakash Surya (Inactive) added a comment - - edited

            Sure, looking at the history, I think patch set 2 was where I left off. I'll move that to the front of the queue sometime today; in the meantime, Jeremy should be able to cherry-pick from the specific patch set the same way he would have gotten patch set 5. Specifically, I think this command will do it:

                git fetch http://review.whamcloud.com/p/fs/lustre-release refs/changes/46/5446/2 && git cherry-pick FETCH_HEAD


            jay Jinshan Xiong (Inactive) added a comment -

            Hi Prakash, it turns out that your original patch is correct. I realized that a reschedule can be caused by cl_page_put(). Can you please reinstate your patch so that Jeremy can give it a try?

            I will try to fix this problem in the CLIO simplification project.

            Thanks in advance,

            prakash Prakash Surya (Inactive) added a comment -

            Hmm... I'm assuming you're running with patch set 5. Is there a backtrace associated with the crash that you can post? That assertion makes me think the process was rescheduled to a different CPU between the get() and put() calls, but preemption should have been disabled, so that shouldn't have happened.

            jfilizetti Jeremy Filizetti added a comment -

            While testing the patch for this bug (http://review.whamcloud.com/#change,5446, for LU-3321), I came across an LBUG. I think it happened while dropping the caches manually (echo 1 > /proc/sys/vm/drop_caches).

            kernel:LustreError: 28127:0:(cl_object.c:1104:cl_env_percpu_put()) ASSERTION( cle == &cl_env_percpu[cpu] ) failed:
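
            For context, a minimal sketch of the per-CPU approach discussed in the two comments above (this is not the actual patch from change 5446; only the cl_env_percpu name and the asserted invariant come from the LBUG message, the rest is illustrative). The idea is to hand out one pre-allocated env per CPU with preemption disabled, so ll_releasepage() never touches cl_envs_guard; the assertion fires if the task migrated to another CPU between get() and put().

            #include <linux/smp.h>
            #include <linux/threads.h>
            #include <linux/bug.h>

            struct cl_env_sketch {
                    int ce_in_use;          /* placeholder for per-CPU env state */
            };

            /* one pre-allocated env slot per possible CPU */
            static struct cl_env_sketch cl_env_percpu[NR_CPUS];

            static struct cl_env_sketch *cl_env_percpu_get_sketch(void)
            {
                    int cpu = get_cpu();    /* disables preemption, pins this CPU */

                    return &cl_env_percpu[cpu];
            }

            static void cl_env_percpu_put_sketch(struct cl_env_sketch *cle)
            {
                    int cpu = smp_processor_id();

                    /* the ASSERTION in the LBUG above checks this invariant: it
                     * fails if a reschedule moved us to another CPU in between */
                    BUG_ON(cle != &cl_env_percpu[cpu]);
                    put_cpu();              /* re-enables preemption */
            }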
            jay Jinshan Xiong (Inactive) added a comment - - edited

            > So I don't think just making steps 1 and 2 atomic will provide us any benefit. We also need to ensure steps 3+ are atomically done as well, right? With that said, I'm not sure making these allocations atomic buys us anything. Shouldn't we just pass in the correct gfp flags to ensure we don't loop back on ourselves (e.g. GFP_NOIO)?

            We should define a new context tag, say LCT_ATOMIC, which would be set when calling cl_env_alloc() in ll_releasepage(). Of course, the key allocation functions would have to be changed to use GFP_ATOMIC when the LCT_ATOMIC tag is set.

            > Also, I made a debug patch to simply drop the lu_env altogether from ll_releasepage, and pushed that through Maloo, http://review.whamcloud.com/5847. I didn't get any failures, so that is probably a better approach to spend time on, IMO, although merging a change like that will involve changes to the debug infrastructure. I opened LU-3033 to facilitate discussion on that topic.

            The major problem is that cl_page_put() can sleep; the calling path is: cl_page_put -> cl_page_free -> cl_object_put -> lu_object_put -> lu_object_free -> lov_delete_raid0 -> cl_locks_prune. I'm quite sure the env will be used in this calling path.

            Update: after taking a further look, I realize ll_releasepage() may not sleep in the above code path. I will write a patch to make sure.

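            As an aside, a rough sketch of what the LCT_ATOMIC suggestion above could look like, assuming a new tag checked by the key allocation helpers (LCT_ATOMIC is the name proposed in the comment; the tag value, structure, and helper below are purely illustrative):

            #include <linux/slab.h>
            #include <linux/gfp.h>

            #define LCT_ATOMIC 0x1000               /* hypothetical new context tag */

            struct lu_context_sketch {
                    unsigned int lc_tags;           /* tag word, as lc_tags above */
            };

            /* key allocation helper: pick GFP flags from the context tags, so a
             * context created in ll_releasepage() can demand atomic allocations */
            static void *lu_key_alloc_sketch(struct lu_context_sketch *ctx, size_t size)
            {
                    gfp_t gfp = (ctx->lc_tags & LCT_ATOMIC) ? GFP_ATOMIC : GFP_NOFS;

                    return kzalloc(size, gfp);
            }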

            prakash Prakash Surya (Inactive) added a comment -

            Jinshan, I briefly looked into making the memory allocations atomic under ll_releasepage, but it doesn't look as simple as I hoped. From what I can tell, there will potentially be 3 or more memory allocations when creating a new lu_env structure: (1) the lu_env itself, (2) the lu_env's keys, (3) the lu_global_key, and (4) any allocations for other keys. So I don't think just making steps 1 and 2 atomic will provide us any benefit. We also need to ensure steps 3+ are atomically done as well, right? With that said, I'm not sure making these allocations atomic buys us anything. Shouldn't we just pass in the correct gfp flags to ensure we don't loop back on ourselves (e.g. GFP_NOIO)?

            Also, I made a debug patch to simply drop the lu_env altogether from ll_releasepage, and pushed that through Maloo, http://review.whamcloud.com/5847. I didn't get any failures, so that is probably a better approach to spend time on, IMO, although merging a change like that will involve changes to the debug infrastructure. I opened LU-3033 to facilitate discussion on that topic.
            pjones Peter Jones added a comment -

            OK, thanks Prakash!

            prakash Prakash Surya (Inactive) added a comment -

            I don't think it's ready to be closed, but I'm OK with dropping the priority.

            People

              Assignee: jay Jinshan Xiong (Inactive)
              Reporter: prakash Prakash Surya (Inactive)
              Votes: 0
              Watchers: 9
