Lustre / LU-2622

All CPUs spinning on cl_envs_guard lock under ll_releasepage during memory reclaim

Details

    • Bug
    • Resolution: Duplicate
    • Major
    • Lustre 2.4.0
    • Lustre 2.4.0
    • 3
    • 6133

    Description

      During testing on Sequoia, we've seen specific IO nodes contend heavily on the cl_envs_guard lock while memory reclaim is happening. As a result, performance on the affected client is severely degraded.

      An example stack trace of a thread spinning on the lock is below:

      CPU56:                                                                             
      Call Trace:                                                                        
      [c00000000fe3bb30] [c000000000008d1c] .show_stack+0x7c/0x184 (unreliable)          
      [c00000000fe3bbe0] [c00000000027604c] .showacpu+0x64/0x94                          
      [c00000000fe3bc70] [c000000000068b30] .generic_smp_call_function_interrupt+0x10c/0x230
      [c00000000fe3bd40] [c00000000001d11c] .smp_message_recv+0x34/0x78                  
      [c00000000fe3bdc0] [c00000000002526c] .bgq_ipi_dispatch+0x118/0x18c                
      [c00000000fe3be50] [c00000000007b20c] .handle_IRQ_event+0x88/0x18c                 
      [c00000000fe3bf00] [c00000000007dc90] .handle_percpu_irq+0x8c/0x100                
      [c00000000fe3bf90] [c00000000001b808] .call_handle_irq+0x1c/0x2c                   
      [c0000003e1c4a4c0] [c0000000000059f0] .do_IRQ+0x154/0x1e0                          
      [c0000003e1c4a570] [c0000000000144dc] exc_external_input_book3e+0x110/0x114        
      --- Exception: 501 at ._raw_spin_lock+0xd8/0x1a8                                   
          LR = ._raw_spin_lock+0x104/0x1a8                                               
      [c0000003e1c4a860] [8000000000b04f38] libcfs_nidstrings+0x2acc/0xfffffffffffe5824 [libcfs] (unreliable)
      [c0000003e1c4a910] [c00000000042d4cc] ._spin_lock+0x10/0x24                        
      [c0000003e1c4a980] [80000000024c2f4c] .cl_env_get+0xec/0x480 [obdclass]            
      [c0000003e1c4aa60] [80000000024c336c] .cl_env_nested_get+0x8c/0xf0 [obdclass]   
      [c0000003e1c4aaf0] [800000000692070c] .ll_releasepage+0xbc/0x200 [lustre]          
      [c0000003e1c4aba0] [c000000000094110] .try_to_release_page+0x68/0x8c               
      [c0000003e1c4ac10] [c0000000000a4190] .shrink_page_list.clone.0+0x3d8/0x63c        
      [c0000003e1c4adc0] [c0000000000a47d8] .shrink_inactive_list+0x3e4/0x690            
      [c0000003e1c4af90] [c0000000000a4f54] .shrink_zone+0x4d0/0x4d4                     
      [c0000003e1c4b0c0] [c0000000000a5a68] .try_to_free_pages+0x204/0x3d0               
      [c0000003e1c4b220] [c00000000009d044] .__alloc_pages_nodemask+0x460/0x738          
      [c0000003e1c4b3a0] [c000000000095af4] .grab_cache_page_write_begin+0x7c/0xec       
      [c0000003e1c4b450] [8000000006920964] .ll_write_begin+0x94/0x270 [lustre]          
      [c0000003e1c4b520] [c0000000000968c8] .generic_file_buffered_write+0x148/0x374  
      [c0000003e1c4b660] [c000000000097050] .__generic_file_aio_write+0x374/0x3d8        
      [c0000003e1c4b760] [c00000000009712c] .generic_file_aio_write+0x78/0xe8            
      [c0000003e1c4b810] [800000000693ed4c] .vvp_io_write_start+0xfc/0x3e0 [lustre]   
      [c0000003e1c4b8e0] [80000000024d9c6c] .cl_io_start+0xcc/0x220 [obdclass]           
      [c0000003e1c4b980] [80000000024e1a84] .cl_io_loop+0x194/0x2c0 [obdclass]           
      [c0000003e1c4ba30] [80000000068ba1d8] .ll_file_io_generic+0x498/0x670 [lustre]  
      [c0000003e1c4bb30] [80000000068ba834] .ll_file_aio_write+0x1d4/0x3a0 [lustre]   
      [c0000003e1c4bc00] [80000000068bab50] .ll_file_write+0x150/0x320 [lustre]          
      [c0000003e1c4bce0] [c0000000000d1ba8] .vfs_write+0xd0/0x1c4                        
      [c0000003e1c4bd80] [c0000000000d1d98] .SyS_write+0x54/0x98                         
      [c0000003e1c4be30] [c000000000000580] syscall_exit+0x0/0x2c
      

      I also see the same contention for calls into cl_env_put:

      CPU63:                                                                          
      Call Trace:                                                                     
      [c00000000fe03b30] [c000000000008d1c] .show_stack+0x7c/0x184 (unreliable)       
      [c00000000fe03be0] [c00000000027604c] .showacpu+0x64/0x94                       
      [c00000000fe03c70] [c000000000068b30] .generic_smp_call_function_interrupt+0x10c/0x230
      [c00000000fe03d40] [c00000000001d11c] .smp_message_recv+0x34/0x78               
      [c00000000fe03dc0] [c00000000002526c] .bgq_ipi_dispatch+0x118/0x18c             
      [c00000000fe03e50] [c00000000007b20c] .handle_IRQ_event+0x88/0x18c              
      [c00000000fe03f00] [c00000000007dc90] .handle_percpu_irq+0x8c/0x100             
      [c00000000fe03f90] [c00000000001b808] .call_handle_irq+0x1c/0x2c                
      [c0000003c4f0a510] [c0000000000059f0] .do_IRQ+0x154/0x1e0                       
      [c0000003c4f0a5c0] [c0000000000144dc] exc_external_input_book3e+0x110/0x114     
      --- Exception: 501 at ._raw_spin_lock+0xdc/0x1a8                                
          LR = ._raw_spin_lock+0x104/0x1a8                                             
      [c0000003c4f0a8b0] [800000000697a578] msgdata.87439+0x20/0xfffffffffffccf88 [lustre] (unreliable)
      [c0000003c4f0a960] [c00000000042d4cc] ._spin_lock+0x10/0x24                     
      [c0000003c4f0a9d0] [80000000024c17e8] .cl_env_put+0x178/0x420 [obdclass]        
      [c0000003c4f0aa70] [80000000024c1ab0] .cl_env_nested_put+0x20/0x40 [obdclass]   
      [c0000003c4f0aaf0] [8000000006920794] .ll_releasepage+0x144/0x200 [lustre]      
      [c0000003c4f0aba0] [c000000000094110] .try_to_release_page+0x68/0x8c            
      [c0000003c4f0ac10] [c0000000000a4190] .shrink_page_list.clone.0+0x3d8/0x63c     
      [c0000003c4f0adc0] [c0000000000a47d8] .shrink_inactive_list+0x3e4/0x690         
      [c0000003c4f0af90] [c0000000000a4f54] .shrink_zone+0x4d0/0x4d4                  
      [c0000003c4f0b0c0] [c0000000000a5a68] .try_to_free_pages+0x204/0x3d0            
      [c0000003c4f0b220] [c00000000009d044] .__alloc_pages_nodemask+0x460/0x738       
      [c0000003c4f0b3a0] [c000000000095af4] .grab_cache_page_write_begin+0x7c/0xec    
      [c0000003c4f0b450] [8000000006920964] .ll_write_begin+0x94/0x270 [lustre]       
      [c0000003c4f0b520] [c0000000000968c8] .generic_file_buffered_write+0x148/0x374  
      [c0000003c4f0b660] [c000000000097050] .__generic_file_aio_write+0x374/0x3d8     
      [c0000003c4f0b760] [c00000000009712c] .generic_file_aio_write+0x78/0xe8         
      [c0000003c4f0b810] [800000000693ed4c] .vvp_io_write_start+0xfc/0x3e0 [lustre]   
      [c0000003c4f0b8e0] [80000000024d9c6c] .cl_io_start+0xcc/0x220 [obdclass]        
      [c0000003c4f0b980] [80000000024e1a84] .cl_io_loop+0x194/0x2c0 [obdclass]        
      [c0000003c4f0ba30] [80000000068ba1d8] .ll_file_io_generic+0x498/0x670 [lustre]  
      [c0000003c4f0bb30] [80000000068ba834] .ll_file_aio_write+0x1d4/0x3a0 [lustre]   
      [c0000003c4f0bc00] [80000000068bab50] .ll_file_write+0x150/0x320 [lustre]       
      [c0000003c4f0bce0] [c0000000000d1ba8] .vfs_write+0xd0/0x1c4                     
      [c0000003c4f0bd80] [c0000000000d1d98] .SyS_write+0x54/0x98                      
      [c0000003c4f0be30] [c000000000000580] syscall_exit+0x0/0x2c
      

      Listing the cl_env_* addresses:

      cl_env_get+0xec
      (gdb) l *cl_env_get+0xec
      0xa2f4c is in cl_env_get (/builddir/build/BUILD/lustre-2.3.58/lustre/obdclass/cl_object.c:804).
      
       803         ENTRY;                                                                  
       804         spin_lock(&cl_envs_guard);                                              
       805         LASSERT(equi(cl_envs_cached_nr == 0, cfs_list_empty(&cl_envs))); 
      
      cl_env_put+0x178
      (gdb) l *cl_env_put+0x178
      0xa17e8 is in cl_env_put (/builddir/build/BUILD/lustre-2.3.58/lustre/obdclass/cl_object.c:979).
      
       978                     (env->le_ses->lc_tags & ~LCT_HAS_EXIT) == LCT_SESSION) {    
       979                         spin_lock(&cl_envs_guard);                              
       980                         cfs_list_add(&cle->ce_linkage, &cl_envs);
      

      Dumping the stacks for all active CPUs shows nearly every one of them contending on this lock.
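
      To make the pattern in the gdb listing above easier to see, here is a simplified sketch of a cached-env get/put protected by a single global spinlock. This is not the exact Lustre code (the structure and helper names are illustrative stand-ins), but it shows why every CPU that enters reclaim ends up serializing on cl_envs_guard:

      #include <linux/spinlock.h>
      #include <linux/list.h>

      /* Illustrative stand-in for the cached environment object. */
      struct cl_env_sketch {
              struct list_head ce_linkage;    /* links into the global cl_envs list */
              /* ... per-thread scratch state lives here ... */
      };

      static DEFINE_SPINLOCK(cl_envs_guard);  /* one lock for the whole node */
      static LIST_HEAD(cl_envs);              /* global cache of idle envs   */
      static unsigned int cl_envs_cached_nr;

      static struct cl_env_sketch *cl_env_cache_get(void)
      {
              struct cl_env_sketch *cle = NULL;

              spin_lock(&cl_envs_guard);      /* every cl_env_get() takes this lock */
              if (!list_empty(&cl_envs)) {
                      cle = list_first_entry(&cl_envs, struct cl_env_sketch,
                                             ce_linkage);
                      list_del_init(&cle->ce_linkage);
                      cl_envs_cached_nr--;
              }
              spin_unlock(&cl_envs_guard);
              return cle;                     /* NULL: caller allocates a new env */
      }

      static void cl_env_cache_put(struct cl_env_sketch *cle)
      {
              spin_lock(&cl_envs_guard);      /* and every cl_env_put() takes it too */
              list_add(&cle->ce_linkage, &cl_envs);
              cl_envs_cached_nr++;
              spin_unlock(&cl_envs_guard);
      }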

      Attachments

      Issue Links

      Activity

            [LU-2622] All CPUs spinning on cl_envs_guard lock under ll_releasepage during memory reclaim

            jay Jinshan Xiong (Inactive) added a comment:

            There may be lots of related calls behind cl_page_put() which need an env. I would suggest using a per-CPU preallocated env array for the ll_releasepage() path. We just need to disable preemption (it should be disabled by default for server kernels) to call it.
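
            A minimal sketch of what that suggestion could look like, assuming a dedicated per-CPU env reserved for the release path. The names (struct reclaim_env, ll_release_env) are hypothetical and this is not an actual Lustre patch:

            #include <linux/percpu.h>
            #include <linux/preempt.h>

            struct reclaim_env {
                    int in_use;
                    /* ... whatever scratch state ll_releasepage() needs ... */
            };

            /* One preallocated env per CPU, reserved for the reclaim/release path. */
            static DEFINE_PER_CPU(struct reclaim_env, ll_release_env);

            static struct reclaim_env *ll_release_env_get(void)
            {
                    struct reclaim_env *env;

                    preempt_disable();              /* pin the caller to this CPU */
                    env = this_cpu_ptr(&ll_release_env);
                    env->in_use = 1;
                    return env;                     /* no lock and no allocation needed */
            }

            static void ll_release_env_put(struct reclaim_env *env)
            {
                    env->in_use = 0;
                    preempt_enable();               /* paired with ll_release_env_get() */
            }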

            prakash Prakash Surya (Inactive) added a comment:

                The main purpose of the env is that some local variables are stored there. We used to have problems with stack usage.

            Right. But can we simply remove this dependency on the env in this call path? For example, pass in NULL and fix up the call path to accept that as a valid value. There is a comment in ll_releasepage which makes it sound like we need an env for cl_page_put. I'm curious whether we can remove that need and refrain from taking an env from the cache altogether in ll_releasepage.
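
            For illustration, a hypothetical sketch of that shape (the real cl_page_put() does considerably more than this, and these names are made up): the callee tolerates a NULL env, so the release path never has to touch the env cache at all.

            /* Hypothetical stand-ins; the real structures are much richer. */
            struct lu_env_sketch;
            struct cl_page_sketch {
                    int cp_ref;
            };

            static void cl_page_put_sketch(struct lu_env_sketch *env,
                                           struct cl_page_sketch *pg)
            {
                    /* env may be NULL on the release path; only dereference it on
                     * branches that actually need per-thread scratch state. */
                    if (--pg->cp_ref == 0) {
                            /* ... free the page wrapper without requiring an env ... */
                    }
            }

            static int ll_releasepage_sketch(struct cl_page_sketch *pg)
            {
                    cl_page_put_sketch(NULL, pg);   /* no cl_env_get(), no cl_envs_guard */
                    return 1;                       /* page can be released */
            }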

            jay Jinshan Xiong (Inactive) added a comment:

                There would be no need to lock the cache during insertions and deletions, but it could lead to an uneven distribution of objects across the per-CPU caches (i.e. unmatched numbers of gets and puts on any one CPU's cache). Although that might not be much of a problem in practice.

            I think an uneven distribution shouldn't be a problem. I think it's really worth trying.

                Jinshan, do we absolutely need to use an env in the ll_releasepage call path? Rather than making the code more complex, it would be great if we could just eliminate the need for the env structure at all in this call path.

            The main purpose of the env is that some local variables are stored there. We used to have problems with stack usage.

            To Chris: I suspect the current grant algorithm is broken; once I saw -ENOSPC returned from writing back cached pages. Johann should be the right person to ask about what was changed recently in this piece of code, so I will forward this question to him tomorrow.

            morrone Christopher Morrone (Inactive) added a comment:

            This does seem to be a very serious performance problem for Sequoia. I don't know the reason, but under 2.3.58-7chaos the problem appears to be even worse. Basically all IONs easily get into a state where many threads are thrashing on a spin lock. While oprofile on ppc64 won't give me a backtrace, the next highest CPU users are cl_env_get and cl_env_put. At the same time, the sysrq-l backtraces show many active tasks spinning under either cl_env_get or cl_env_put.

            The remaining writers that are not currently active on a CPU tend to be in either this spot:

            2013-01-29 12:40:05.157376 {DefaultControlEventListener} [mmcs]{760}.0.0: sysiod        S 00000fffa956633c     0  3926   3075 0x00000000
            2013-01-29 12:40:05.157427 {DefaultControlEventListener} [mmcs]{760}.0.0: Call Trace:
            2013-01-29 12:40:05.157478 {DefaultControlEventListener} [mmcs]{760}.0.0: [c0000003e06de340] [c0000003e06de400] 0xc0000003e06de400 (unreliable)
            2013-01-29 12:40:05.157534 {DefaultControlEventListener} [mmcs]{760}.0.0: [c0000003e06de510] [c000000000009b2c] .__switch_to+0xc4/0x100
            2013-01-29 12:40:05.157586 {DefaultControlEventListener} [mmcs]{760}.0.0: [c0000003e06de5a0] [c00000000042a418] .schedule+0x7d4/0x944
            2013-01-29 12:40:05.157636 {DefaultControlEventListener} [mmcs]{760}.0.0: [c0000003e06de850] [c00000000042ab04] .schedule_timeout+0x1e0/0x228
            2013-01-29 12:40:05.157689 {DefaultControlEventListener} [mmcs]{760}.0.0: [c0000003e06de930] [8000000000ab2034] .cfs_waitq_timedwait+0x14/0x30 [libcfs]
            2013-01-29 12:40:05.157751 {DefaultControlEventListener} [mmcs]{760}.0.0: [c0000003e06de9a0] [80000000046a39c8] .osc_enter_cache+0xc78/0x16a0 [osc]
            2013-01-29 12:40:05.157809 {DefaultControlEventListener} [mmcs]{760}.0.0: [c0000003e06debd0] [80000000046ab440] .osc_queue_async_io+0xd10/0x1a20 [osc]
            2013-01-29 12:40:05.157860 {DefaultControlEventListener} [mmcs]{760}.0.0: [c0000003e06dedf0] [8000000004686d48] .osc_page_cache_add+0xf8/0x2a0 [osc]
            2013-01-29 12:40:05.160971 {DefaultControlEventListener} [mmcs]{760}.0.0: [c0000003e06deeb0] [80000000024c59c8] .cl_page_cache_add+0xe8/0x3b0 [obdclass]
            2013-01-29 12:40:05.161028 {DefaultControlEventListener} [mmcs]{760}.0.0: [c0000003e06defe0] [800000000511c7f8] .lov_page_cache_add+0xc8/0x340 [lov]
            2013-01-29 12:40:05.161079 {DefaultControlEventListener} [mmcs]{760}.0.0: [c0000003e06df0b0] [80000000024c59c8] .cl_page_cache_add+0xe8/0x3b0 [obdclass]
            2013-01-29 12:40:05.161130 {DefaultControlEventListener} [mmcs]{760}.0.0: [c0000003e06df1e0] [800000000694bb54] .vvp_io_commit_write+0x474/0x8a0 [lustre]
            2013-01-29 12:40:05.161181 {DefaultControlEventListener} [mmcs]{760}.0.0: [c0000003e06df300] [80000000024df71c] .cl_io_commit_write+0x11c/0x2d0 [obdclass]
            2013-01-29 12:40:05.161232 {DefaultControlEventListener} [mmcs]{760}.0.0: [c0000003e06df3c0] [800000000690eaa0] .ll_commit_write+0x120/0x3e0 [lustre]
            2013-01-29 12:40:05.161283 {DefaultControlEventListener} [mmcs]{760}.0.0: [c0000003e06df490] [8000000006930884] .ll_write_end+0x34/0x80 [lustre]
            2013-01-29 12:40:05.161334 {DefaultControlEventListener} [mmcs]{760}.0.0: [c0000003e06df520] [c00000000009696c] .generic_file_buffered_write+0x1ec/0x374
            2013-01-29 12:40:05.161384 {DefaultControlEventListener} [mmcs]{760}.0.0: [c0000003e06df660] [c000000000097050] .__generic_file_aio_write+0x374/0x3d8
            2013-01-29 12:40:05.161435 {DefaultControlEventListener} [mmcs]{760}.0.0: [c0000003e06df760] [c00000000009712c] .generic_file_aio_write+0x78/0xe8
            2013-01-29 12:40:05.161486 {DefaultControlEventListener} [mmcs]{760}.0.0: [c0000003e06df810] [800000000694ed4c] .vvp_io_write_start+0xfc/0x3e0 [lustre]
            2013-01-29 12:40:05.161537 {DefaultControlEventListener} [mmcs]{760}.0.0: [c0000003e06df8e0] [80000000024d9c6c] .cl_io_start+0xcc/0x220 [obdclass]
            2013-01-29 12:40:05.161588 {DefaultControlEventListener} [mmcs]{760}.0.0: [c0000003e06df980] [80000000024e1a84] .cl_io_loop+0x194/0x2c0 [obdclass]
            2013-01-29 12:40:05.161638 {DefaultControlEventListener} [mmcs]{760}.0.0: [c0000003e06dfa30] [80000000068ca1d8] .ll_file_io_generic+0x498/0x670 [lustre]
            2013-01-29 12:40:05.161689 {DefaultControlEventListener} [mmcs]{760}.0.0: [c0000003e06dfb30] [80000000068ca834] .ll_file_aio_write+0x1d4/0x3a0 [lustre]
            2013-01-29 12:40:05.161740 {DefaultControlEventListener} [mmcs]{760}.0.0: [c0000003e06dfc00] [80000000068cab50] .ll_file_write+0x150/0x320 [lustre]
            2013-01-29 12:40:05.161791 {DefaultControlEventListener} [mmcs]{760}.0.0: [c0000003e06dfce0] [c0000000000d1ba8] .vfs_write+0xd0/0x1c4
            2013-01-29 12:40:05.161842 {DefaultControlEventListener} [mmcs]{760}.0.0: [c0000003e06dfd80] [c0000000000d1d98] .SyS_write+0x54/0x98
            2013-01-29 12:40:05.161893 {DefaultControlEventListener} [mmcs]{760}.0.0: [c0000003e06dfe30] [c000000000000580] syscall_exit+0x0/0x2c
            

            or waiting here:

            2013-01-29 12:40:05.155420 {DefaultControlEventListener} [mmcs]{760}.0.0: sysiod        R  running task        0  3957   3075 0x00000000
            2013-01-29 12:40:05.155470 {DefaultControlEventListener} [mmcs]{760}.0.0: Call Trace:
            2013-01-29 12:40:05.155522 {DefaultControlEventListener} [mmcs]{760}.0.0: [c0000003e083ef90] [80000000024c148c] .cl_object_attr_unlock+0x1c/0x30 [obdclass] (unreliable)
            2013-01-29 12:40:05.155572 {DefaultControlEventListener} [mmcs]{760}.0.0: [c0000003e083f160] [c000000000009b2c] .__switch_to+0xc4/0x100
            2013-01-29 12:40:05.155624 {DefaultControlEventListener} [mmcs]{760}.0.0: [c0000003e083f1f0] [c00000000042a418] .schedule+0x7d4/0x944
            2013-01-29 12:40:05.155675 {DefaultControlEventListener} [mmcs]{760}.0.0: [c0000003e083f4a0] [c00000000042a890] ._cond_resched+0x38/0x64
            2013-01-29 12:40:05.155726 {DefaultControlEventListener} [mmcs]{760}.0.0: [c0000003e083f520] [c00000000009699c] .generic_file_buffered_write+0x21c/0x374
            2013-01-29 12:40:05.155776 {DefaultControlEventListener} [mmcs]{760}.0.0: [c0000003e083f660] [c000000000097050] .__generic_file_aio_write+0x374/0x3d8
            2013-01-29 12:40:05.155835 {DefaultControlEventListener} [mmcs]{760}.0.0: [c0000003e083f760] [c00000000009712c] .generic_file_aio_write+0x78/0xe8
            2013-01-29 12:40:05.155897 {DefaultControlEventListener} [mmcs]{760}.0.0: [c0000003e083f810] [800000000694ed4c] .vvp_io_write_start+0xfc/0x3e0 [lustre]
            2013-01-29 12:40:05.155952 {DefaultControlEventListener} [mmcs]{760}.0.0: [c0000003e083f8e0] [80000000024d9c6c] .cl_io_start+0xcc/0x220 [obdclass]
            2013-01-29 12:40:05.156003 {DefaultControlEventListener} [mmcs]{760}.0.0: [c0000003e083f980] [80000000024e1a84] .cl_io_loop+0x194/0x2c0 [obdclass]
            2013-01-29 12:40:05.156055 {DefaultControlEventListener} [mmcs]{760}.0.0: [c0000003e083fa30] [80000000068ca1d8] .ll_file_io_generic+0x498/0x670 [lustre]
            2013-01-29 12:40:05.156105 {DefaultControlEventListener} [mmcs]{760}.0.0: [c0000003e083fb30] [80000000068ca834] .ll_file_aio_write+0x1d4/0x3a0 [lustre]
            2013-01-29 12:40:05.156156 {DefaultControlEventListener} [mmcs]{760}.0.0: [c0000003e083fc00] [80000000068cab50] .ll_file_write+0x150/0x320 [lustre]
            2013-01-29 12:40:05.156206 {DefaultControlEventListener} [mmcs]{760}.0.0: [c0000003e083fce0] [c0000000000d1ba8] .vfs_write+0xd0/0x1c4
            2013-01-29 12:40:05.156257 {DefaultControlEventListener} [mmcs]{760}.0.0: [c0000003e083fd80] [c0000000000d1d98] .SyS_write+0x54/0x98
            2013-01-29 12:40:05.156308 {DefaultControlEventListener} [mmcs]{760}.0.0: [c0000003e083fe30] [c000000000000580] syscall_exit+0x0/0x2c
            

            The net effect of the thrashing clients is that we tend to get less than 200GB/s aggregate throughput as seen by the servers.

            Under 2.3.57-2chaos, we would see 850GB/s at the beginning of ior runs. However, 2.3.57-2chaos had its own problem: after the bulk of the IO was finished, there was a very long straggler period during which a relatively small number of the servers would be handling only a few MB/s. Today I tracked down the cause of that, and it seems to be this same bug. Under 2.3.57-2chaos it was perhaps just less likely, so only a couple of client nodes would begin thrashing.

            I think that back around 2.3.54 we hardly saw this at all, but I can't remember exactly which version had neither the straggler nor the constant thrashing problem.

            But all signs seem to point to this cl_env_put/get lock being a major problem.


            prakash Prakash Surya (Inactive) added a comment:

            Jinshan, do we absolutely need to use an env in the ll_releasepage call path? Rather than making the code more complex, it would be great if we could just eliminate the need for the env structure at all in this call path.

            prakash Prakash Surya (Inactive) added a comment:

            I believe the workload was just a simple IOR. We somehow got into a low-memory situation where every CPU was trying to call ll_releasepage, causing the contention on cl_env_get/put.

            I also thought about using per-CPU caches, but how do you suppose we handle threads being rescheduled to different CPUs? The first scenario that comes to mind (a sketch of it follows below):

            1. Thread A on CPU1 cl_env_get's from CPU1's cache
            2. Thread A sleeps and is rescheduled to CPU2
            3. Thread A on CPU2 cl_env_put's to CPU2's cache

            There would be no need to lock the cache during insertions and deletions, but it could lead to an uneven distribution of objects across the per-CPU caches (i.e. unmatched numbers of gets and puts on any one CPU's cache). Although that might not be much of a problem in practice.
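
            A rough sketch of that variant, assuming each CPU keeps its own small free list (all names made up, not Lustre code): gets and puts touch only the local CPU's list, so no shared lock is needed, but a thread that migrates between the get and the put returns the env to a different CPU's list, which is where the uneven distribution would come from.

            #include <linux/percpu.h>
            #include <linux/preempt.h>
            #include <linux/list.h>
            #include <linux/cpumask.h>

            struct env_pcpu_cache {
                    struct list_head free;          /* idle envs owned by this CPU */
                    unsigned int     nr;
            };

            static DEFINE_PER_CPU(struct env_pcpu_cache, env_cache);

            static void env_cache_init(void)
            {
                    int cpu;

                    for_each_possible_cpu(cpu)
                            INIT_LIST_HEAD(&per_cpu(env_cache, cpu).free);
            }

            static struct list_head *env_cache_try_get(void)
            {
                    struct env_pcpu_cache *c;
                    struct list_head *entry = NULL;

                    preempt_disable();              /* no spinlock, just stay on this CPU */
                    c = this_cpu_ptr(&env_cache);
                    if (!list_empty(&c->free)) {
                            entry = c->free.next;
                            list_del_init(entry);
                            c->nr--;
                    }
                    preempt_enable();
                    return entry;                   /* NULL: fall back to allocation */
            }

            static void env_cache_put(struct list_head *entry)
            {
                    struct env_pcpu_cache *c;

                    preempt_disable();
                    c = this_cpu_ptr(&env_cache);   /* may be a different CPU than the get */
                    list_add(entry, &c->free);
                    c->nr++;                        /* per-CPU counts can drift apart */
                    preempt_enable();
            }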

            jay Jinshan Xiong (Inactive) added a comment:

            The shrinker should be okay, because it is only called when memory is under pressure.

            For this problem, I think we should use a per-CPU data structure to manage the cached cl_envs. However, I'm wondering what kind of job you were running, because usually there should not be so many cl_env_get/put calls.

            prakash Prakash Surya (Inactive) added a comment:

            OK, I'm glad you agree.

            Slightly related: is it even safe to call cl_env_get from within shrinker context? I worry about the case where there are no available env objects in the cache, causing an allocation to occur via cl_env_new while we're trying to free memory. Is that a valid concern?
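
            To illustrate the concern (again just a sketch with made-up names, building on the cl_env_cache_get()/cl_env_cache_put() sketch in the description): a reclaim-safe variant would take an env only from the cache and let ll_releasepage() refuse the release rather than allocate while try_to_free_pages() is on the stack.

            /* Prototypes from the earlier sketch; nothing here is actual Lustre code. */
            struct cl_env_sketch;
            struct cl_env_sketch *cl_env_cache_get(void);
            void cl_env_cache_put(struct cl_env_sketch *cle);

            static int ll_releasepage_noalloc_sketch(void)
            {
                    struct cl_env_sketch *env = cl_env_cache_get();

                    if (env == NULL)
                            return 0;       /* cache empty: refuse to release the page
                                             * rather than allocate inside memory reclaim */
                    /* ... drop the cl_page reference using env ... */
                    cl_env_cache_put(env);
                    return 1;               /* page released */
            }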

            jay Jinshan Xiong (Inactive) added a comment:

            You're right. I thought the cache shrinker was linked to cl_env_cache_purge(), but that patch is an optimization anyway. I will work out a new patch soon.

            prakash Prakash Surya (Inactive) added a comment:

            Jinshan, why do you think optimizing cl_env_cache_purge() will fix our issue? I didn't see any threads down in that function, so I'm skeptical. I think it's the cl_env_{put|get} functions that we need to take a closer look at.

            Also, in my opinion we don't have enough evidence to claim that we haven't seen this previously. We may have seen it on a much larger portion of the IO nodes than before, but there's a chance we've been seeing it sporadically all along and didn't notice.
            jay Jinshan Xiong (Inactive) added a comment (edited):

            This sounds like recent patches degraded the performance. It will be great if you can do a comparison and collect performance data.

            Anyway, I pushed a patch to reduce contention on this lock at http://review.whamcloud.com/5034; please give it a try.


            People

              jay Jinshan Xiong (Inactive)
              prakash Prakash Surya (Inactive)
              Votes: 0
              Watchers: 9
