  Lustre / LU-2622

All CPUs spinning on cl_envs_guard lock under ll_releasepage during memory reclaim

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • Affects Version: Lustre 2.4.0
    • Fix Version: Lustre 2.4.0
    • 3
    • 6133

    Description

      During testing on Sequoia, we've seen specific IO nodes contend heavily on the cl_envs_guard lock while memory reclaim is in progress. This severely degrades the performance of the affected client.

      An example stack trace of a thread spinning on the lock is below:

      CPU56:                                                                             
      Call Trace:                                                                        
      [c00000000fe3bb30] [c000000000008d1c] .show_stack+0x7c/0x184 (unreliable)          
      [c00000000fe3bbe0] [c00000000027604c] .showacpu+0x64/0x94                          
      [c00000000fe3bc70] [c000000000068b30] .generic_smp_call_function_interrupt+0x10c/0x230
      [c00000000fe3bd40] [c00000000001d11c] .smp_message_recv+0x34/0x78                  
      [c00000000fe3bdc0] [c00000000002526c] .bgq_ipi_dispatch+0x118/0x18c                
      [c00000000fe3be50] [c00000000007b20c] .handle_IRQ_event+0x88/0x18c                 
      [c00000000fe3bf00] [c00000000007dc90] .handle_percpu_irq+0x8c/0x100                
      [c00000000fe3bf90] [c00000000001b808] .call_handle_irq+0x1c/0x2c                   
      [c0000003e1c4a4c0] [c0000000000059f0] .do_IRQ+0x154/0x1e0                          
      [c0000003e1c4a570] [c0000000000144dc] exc_external_input_book3e+0x110/0x114        
      --- Exception: 501 at ._raw_spin_lock+0xd8/0x1a8                                   
          LR = ._raw_spin_lock+0x104/0x1a8                                               
      [c0000003e1c4a860] [8000000000b04f38] libcfs_nidstrings+0x2acc/0xfffffffffffe5824 [libcfs] (unreliable)
      [c0000003e1c4a910] [c00000000042d4cc] ._spin_lock+0x10/0x24                        
      [c0000003e1c4a980] [80000000024c2f4c] .cl_env_get+0xec/0x480 [obdclass]            
      [c0000003e1c4aa60] [80000000024c336c] .cl_env_nested_get+0x8c/0xf0 [obdclass]   
      [c0000003e1c4aaf0] [800000000692070c] .ll_releasepage+0xbc/0x200 [lustre]          
      [c0000003e1c4aba0] [c000000000094110] .try_to_release_page+0x68/0x8c               
      [c0000003e1c4ac10] [c0000000000a4190] .shrink_page_list.clone.0+0x3d8/0x63c        
      [c0000003e1c4adc0] [c0000000000a47d8] .shrink_inactive_list+0x3e4/0x690            
      [c0000003e1c4af90] [c0000000000a4f54] .shrink_zone+0x4d0/0x4d4                     
      [c0000003e1c4b0c0] [c0000000000a5a68] .try_to_free_pages+0x204/0x3d0               
      [c0000003e1c4b220] [c00000000009d044] .__alloc_pages_nodemask+0x460/0x738          
      [c0000003e1c4b3a0] [c000000000095af4] .grab_cache_page_write_begin+0x7c/0xec       
      [c0000003e1c4b450] [8000000006920964] .ll_write_begin+0x94/0x270 [lustre]          
      [c0000003e1c4b520] [c0000000000968c8] .generic_file_buffered_write+0x148/0x374  
      [c0000003e1c4b660] [c000000000097050] .__generic_file_aio_write+0x374/0x3d8        
      [c0000003e1c4b760] [c00000000009712c] .generic_file_aio_write+0x78/0xe8            
      [c0000003e1c4b810] [800000000693ed4c] .vvp_io_write_start+0xfc/0x3e0 [lustre]   
      [c0000003e1c4b8e0] [80000000024d9c6c] .cl_io_start+0xcc/0x220 [obdclass]           
      [c0000003e1c4b980] [80000000024e1a84] .cl_io_loop+0x194/0x2c0 [obdclass]           
      [c0000003e1c4ba30] [80000000068ba1d8] .ll_file_io_generic+0x498/0x670 [lustre]  
      [c0000003e1c4bb30] [80000000068ba834] .ll_file_aio_write+0x1d4/0x3a0 [lustre]   
      [c0000003e1c4bc00] [80000000068bab50] .ll_file_write+0x150/0x320 [lustre]          
      [c0000003e1c4bce0] [c0000000000d1ba8] .vfs_write+0xd0/0x1c4                        
      [c0000003e1c4bd80] [c0000000000d1d98] .SyS_write+0x54/0x98                         
      [c0000003e1c4be30] [c000000000000580] syscall_exit+0x0/0x2c
      

      I also see this for calls into cl_env_put():

      CPU63:                                                                          
      Call Trace:                                                                     
      [c00000000fe03b30] [c000000000008d1c] .show_stack+0x7c/0x184 (unreliable)       
      [c00000000fe03be0] [c00000000027604c] .showacpu+0x64/0x94                       
      [c00000000fe03c70] [c000000000068b30] .generic_smp_call_function_interrupt+0x10c/0x230
      [c00000000fe03d40] [c00000000001d11c] .smp_message_recv+0x34/0x78               
      [c00000000fe03dc0] [c00000000002526c] .bgq_ipi_dispatch+0x118/0x18c             
      [c00000000fe03e50] [c00000000007b20c] .handle_IRQ_event+0x88/0x18c              
      [c00000000fe03f00] [c00000000007dc90] .handle_percpu_irq+0x8c/0x100             
      [c00000000fe03f90] [c00000000001b808] .call_handle_irq+0x1c/0x2c                
      [c0000003c4f0a510] [c0000000000059f0] .do_IRQ+0x154/0x1e0                       
      [c0000003c4f0a5c0] [c0000000000144dc] exc_external_input_book3e+0x110/0x114     
      --- Exception: 501 at ._raw_spin_lock+0xdc/0x1a8                                
          LR = ._raw_spin_lock+0x104/0x1a8                                             
      [c0000003c4f0a8b0] [800000000697a578] msgdata.87439+0x20/0xfffffffffffccf88 [lustre] (unreliable)
      [c0000003c4f0a960] [c00000000042d4cc] ._spin_lock+0x10/0x24                     
      [c0000003c4f0a9d0] [80000000024c17e8] .cl_env_put+0x178/0x420 [obdclass]        
      [c0000003c4f0aa70] [80000000024c1ab0] .cl_env_nested_put+0x20/0x40 [obdclass]   
      [c0000003c4f0aaf0] [8000000006920794] .ll_releasepage+0x144/0x200 [lustre]      
      [c0000003c4f0aba0] [c000000000094110] .try_to_release_page+0x68/0x8c            
      [c0000003c4f0ac10] [c0000000000a4190] .shrink_page_list.clone.0+0x3d8/0x63c     
      [c0000003c4f0adc0] [c0000000000a47d8] .shrink_inactive_list+0x3e4/0x690         
      [c0000003c4f0af90] [c0000000000a4f54] .shrink_zone+0x4d0/0x4d4                  
      [c0000003c4f0b0c0] [c0000000000a5a68] .try_to_free_pages+0x204/0x3d0            
      [c0000003c4f0b220] [c00000000009d044] .__alloc_pages_nodemask+0x460/0x738       
      [c0000003c4f0b3a0] [c000000000095af4] .grab_cache_page_write_begin+0x7c/0xec    
      [c0000003c4f0b450] [8000000006920964] .ll_write_begin+0x94/0x270 [lustre]       
      [c0000003c4f0b520] [c0000000000968c8] .generic_file_buffered_write+0x148/0x374  
      [c0000003c4f0b660] [c000000000097050] .__generic_file_aio_write+0x374/0x3d8     
      [c0000003c4f0b760] [c00000000009712c] .generic_file_aio_write+0x78/0xe8         
      [c0000003c4f0b810] [800000000693ed4c] .vvp_io_write_start+0xfc/0x3e0 [lustre]   
      [c0000003c4f0b8e0] [80000000024d9c6c] .cl_io_start+0xcc/0x220 [obdclass]        
      [c0000003c4f0b980] [80000000024e1a84] .cl_io_loop+0x194/0x2c0 [obdclass]        
      [c0000003c4f0ba30] [80000000068ba1d8] .ll_file_io_generic+0x498/0x670 [lustre]  
      [c0000003c4f0bb30] [80000000068ba834] .ll_file_aio_write+0x1d4/0x3a0 [lustre]   
      [c0000003c4f0bc00] [80000000068bab50] .ll_file_write+0x150/0x320 [lustre]       
      [c0000003c4f0bce0] [c0000000000d1ba8] .vfs_write+0xd0/0x1c4                     
      [c0000003c4f0bd80] [c0000000000d1d98] .SyS_write+0x54/0x98                      
      [c0000003c4f0be30] [c000000000000580] syscall_exit+0x0/0x2c
      

      Listing the source at the cl_env_* return addresses with gdb:

      cl_env_get+0xec
      (gdb) l *cl_env_get+0xec
      0xa2f4c is in cl_env_get (/builddir/build/BUILD/lustre-2.3.58/lustre/obdclass/cl_object.c:804).
      
       803         ENTRY;                                                                  
       804         spin_lock(&cl_envs_guard);                                              
       805         LASSERT(equi(cl_envs_cached_nr == 0, cfs_list_empty(&cl_envs))); 
      
      cl_env_put+0x178
      (gdb) l *cl_env_put+0x178
      0xa17e8 is in cl_env_put (/builddir/build/BUILD/lustre-2.3.58/lustre/obdclass/cl_object.c:979).
      
       978                     (env->le_ses->lc_tags & ~LCT_HAS_EXIT) == LCT_SESSION) {    
       979                         spin_lock(&cl_envs_guard);                              
       980                         cfs_list_add(&cle->ce_linkage, &cl_envs);
      

      Dumping the stacks for all active CPUs shows nearly every one of them contending on this lock.
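
      To make the contention point concrete, the pattern the gdb listing points at looks roughly like the simplified sketch below. This is not the actual Lustre source: only cl_envs_guard, cl_envs, cl_envs_cached_nr, and ce_linkage are names taken from the listing above; struct cl_env_sketch and the two helper functions are hypothetical. The point is that cl_env_get() and cl_env_put() both funnel through one node-wide spinlock, so every ll_releasepage() call during reclaim serializes on it.

      #include <linux/spinlock.h>
      #include <linux/list.h>

      /* Illustrative stand-in for the cached client env; only ce_linkage
       * mirrors the real structure referenced in the gdb listing. */
      struct cl_env_sketch {
              struct list_head ce_linkage;
              /* ... per-thread lu_env state ... */
      };

      static DEFINE_SPINLOCK(cl_envs_guard);    /* one lock for the whole node */
      static LIST_HEAD(cl_envs);                /* global cache of idle envs */
      static unsigned int cl_envs_cached_nr;

      /* cl_env_get() path: every caller, including ll_releasepage() during
       * memory reclaim, takes the same global spinlock to check the cache. */
      static struct cl_env_sketch *cl_env_cache_get(void)
      {
              struct cl_env_sketch *cle = NULL;

              spin_lock(&cl_envs_guard);
              if (!list_empty(&cl_envs)) {
                      cle = list_first_entry(&cl_envs, struct cl_env_sketch,
                                             ce_linkage);
                      list_del_init(&cle->ce_linkage);
                      cl_envs_cached_nr--;
              }
              spin_unlock(&cl_envs_guard);
              return cle;                       /* NULL: allocate a fresh env */
      }

      /* cl_env_put() path: returning the env to the cache hits the same lock. */
      static void cl_env_cache_put(struct cl_env_sketch *cle)
      {
              spin_lock(&cl_envs_guard);
              list_add(&cle->ce_linkage, &cl_envs);
              cl_envs_cached_nr++;
              spin_unlock(&cl_envs_guard);
      }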


          Activity


            adilger Andreas Dilger added a comment - Fixed via LU-3321.

            jay Jinshan Xiong (Inactive) added a comment -

            Hi Prakash,

            I reposted the percpu patch at http://review.whamcloud.com/8174. I have fixed the reschedule problem, so it should be working now.
            prakash Prakash Surya (Inactive) added a comment - - edited

            Sure, looking at the history, I think patch set 2 was where I left off. I'll move that to the front of the queue sometime today; in the meantime, Jeremy should be able to cherry-pick from the specific patch set the same way he would have gotten patch set 5. Specifically, I think this command will do it:

                git fetch http://review.whamcloud.com/p/fs/lustre-release refs/changes/46/5446/2 && git cherry-pick FETCH_HEAD


            jay Jinshan Xiong (Inactive) added a comment -

            Hi Prakash, it turns out that your original patch is correct. I realized that a reschedule can be caused by cl_page_put(). Can you please reinstate your patch so that Jeremy can give it a try?

            I will try to fix this problem in the CLIO simplification project.

            Thanks in advance,

            prakash Prakash Surya (Inactive) added a comment -

            Hmm... I'm assuming you're running with patch set 5. Is there a backtrace associated with the crash that you can post? That assertion makes me think the process was rescheduled to a different CPU between the get() and put() calls, but preemption should have been disabled, so that shouldn't have happened.

            jfilizetti Jeremy Filizetti added a comment -

            While testing the patch for this bug (http://review.whamcloud.com/#change,5446, for LU-3321), I came across an LBUG. I think it happened while dropping the caches manually (echo 1 > /proc/sys/vm/drop_caches).

            kernel:LustreError: 28127:0:(cl_object.c:1104:cl_env_percpu_put()) ASSERTION( cle == &cl_env_percpu[cpu] ) failed:
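
            For context, a minimal sketch of the per-CPU approach discussed in the two comments above (this is not the actual patch from change 5446; only the cl_env_percpu name and the asserted invariant come from the LBUG message, the rest is illustrative). The idea is to hand out one pre-allocated env per CPU with preemption disabled, so ll_releasepage() never touches cl_envs_guard; the assertion fires if the task migrated to another CPU between get() and put().

            #include <linux/smp.h>
            #include <linux/threads.h>
            #include <linux/bug.h>

            struct cl_env_sketch {
                    int ce_in_use;          /* placeholder for per-CPU env state */
            };

            /* one pre-allocated env slot per possible CPU */
            static struct cl_env_sketch cl_env_percpu[NR_CPUS];

            static struct cl_env_sketch *cl_env_percpu_get_sketch(void)
            {
                    int cpu = get_cpu();    /* disables preemption, pins this CPU */

                    return &cl_env_percpu[cpu];
            }

            static void cl_env_percpu_put_sketch(struct cl_env_sketch *cle)
            {
                    int cpu = smp_processor_id();

                    /* the ASSERTION in the LBUG above checks this invariant: it
                     * fails if a reschedule moved us to another CPU in between */
                    BUG_ON(cle != &cl_env_percpu[cpu]);
                    put_cpu();              /* re-enables preemption */
            }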
            jay Jinshan Xiong (Inactive) added a comment - - edited

            > So I don't think just making steps 1 and 2 atomic will provide us any benefit. We also need to ensure steps 3+ are atomically done as well, right? With that said, I'm not sure making these allocations atomic buys us anything. Shouldn't we just pass in the correct gfp flags to ensure we don't loop back on ourselves (e.g. GFP_NOIO)?

            We should define a new context tag, say LCT_ATOMIC, which would be set when calling cl_env_alloc() in ll_releasepage(). Of course, the key allocation functions would have to be changed to use GFP_ATOMIC when the LCT_ATOMIC tag is set.

            > Also, I made a debug patch to simply drop the lu_env altogether from ll_releasepage, and pushed that through Maloo, http://review.whamcloud.com/5847. I didn't get any failures, so that is probably a better approach to spend time on, IMO, although merging a change like that will involve changes to the debug infrastructure. I opened LU-3033 to facilitate discussion on that topic.

            The major problem is that cl_page_put() can sleep; the calling path is: cl_page_put -> cl_page_free -> cl_object_put -> lu_object_put -> lu_object_free -> lov_delete_raid0 -> cl_locks_prune. I'm quite sure the env will be used in this calling path.

            Update: after taking a further look, I realize ll_releasepage() may not sleep in the above code path. I will write a patch to make sure.

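            As an aside, a rough sketch of what the LCT_ATOMIC suggestion above could look like, assuming a new tag checked by the key allocation helpers (LCT_ATOMIC is the name proposed in the comment; the tag value, structure, and helper below are purely illustrative):

            #include <linux/slab.h>
            #include <linux/gfp.h>

            #define LCT_ATOMIC 0x1000               /* hypothetical new context tag */

            struct lu_context_sketch {
                    unsigned int lc_tags;           /* tag word, as lc_tags above */
            };

            /* key allocation helper: pick GFP flags from the context tags, so a
             * context created in ll_releasepage() can demand atomic allocations */
            static void *lu_key_alloc_sketch(struct lu_context_sketch *ctx, size_t size)
            {
                    gfp_t gfp = (ctx->lc_tags & LCT_ATOMIC) ? GFP_ATOMIC : GFP_NOFS;

                    return kzalloc(size, gfp);
            }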

            prakash Prakash Surya (Inactive) added a comment -

            Jinshan, I briefly looked into making the memory allocations atomic under ll_releasepage, but it doesn't look as simple as I hoped. From what I can tell, there will potentially be 3 or more memory allocations when creating a new lu_env structure: (1) the lu_env itself, (2) the lu_env's keys, (3) the lu_global_key, and (4) any allocations for other keys. So I don't think just making steps 1 and 2 atomic will provide us any benefit. We also need to ensure steps 3+ are atomically done as well, right? With that said, I'm not sure making these allocations atomic buys us anything. Shouldn't we just pass in the correct gfp flags to ensure we don't loop back on ourselves (e.g. GFP_NOIO)?

            Also, I made a debug patch to simply drop the lu_env altogether from ll_releasepage, and pushed that through Maloo, http://review.whamcloud.com/5847. I didn't get any failures, so that is probably a better approach to spend time on, IMO, although merging a change like that will involve changes to the debug infrastructure. I opened LU-3033 to facilitate discussion on that topic.
            pjones Peter Jones added a comment -

            OK, thanks Prakash!

            prakash Prakash Surya (Inactive) added a comment -

            I don't think it's ready to be closed, but I'm OK with dropping the priority.

            People

              Assignee: jay Jinshan Xiong (Inactive)
              Reporter: prakash Prakash Surya (Inactive)
              Votes: 0
              Watchers: 9
