    Description

      Using "perf" and "oprofile" on Grove's MDS (768 OSTs in the FS), I see an absurd amount of CPU time being spent contending on the kernel's vmlist_lock.

      I ran oprofile for 10 mins on grove-mds1 to collect some CPU profiling information and it shows the top 5 CPU consumers being:

                                                                            
      %          symbol_name
      50.2404    __write_lock_failed
       7.5581    __vmalloc_node
       6.9957    remove_vm_area
       5.2425    cfs_percpt_lock
       4.9054    intel_idle
      

      Digging into the details of __vmalloc_node and remove_vm_area, most of their time is spent at vmalloc.c:1285 and vmalloc.c:1434, respectively. Both lines correspond to a linear traversal of the global vmlist while holding vmlist_lock for writing.
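
To make the pattern concrete, here is a minimal userspace sketch of a global, address-sorted list whose insert and remove operations both walk the list linearly while holding one exclusive lock. It is not the kernel code: `struct area`, `area_insert`, `area_remove`, and the `atomic_flag` spin lock are hypothetical stand-ins for the vm area descriptor, the two kernel functions above, and the write side of vmlist_lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a vm area descriptor. */
struct area {
    uintptr_t addr;
    struct area *next;
};

static struct area *area_list;                    /* global list, sorted by addr */
static atomic_flag area_lock = ATOMIC_FLAG_INIT;  /* models the write side only */

static void area_lock_write(void)
{
    /* Spin until the flag is acquired; every caller is serialized here. */
    while (atomic_flag_test_and_set_explicit(&area_lock, memory_order_acquire))
        ;
}

static void area_unlock_write(void)
{
    atomic_flag_clear_explicit(&area_lock, memory_order_release);
}

/* Insert in sorted position. Returns the number of nodes stepped over
 * while the lock was held, which grows with the list length. */
size_t area_insert(struct area *a)
{
    struct area **p, *tmp;
    size_t steps = 0;

    assert(a != NULL);
    area_lock_write();
    for (p = &area_list; (tmp = *p) != NULL; p = &tmp->next) {
        if (tmp->addr >= a->addr)
            break;
        steps++;
    }
    a->next = *p;
    *p = a;
    area_unlock_write();
    return steps;
}

/* Unlink a node. Returns the number of nodes stepped over before it was
 * found, or (size_t)-1 if it was not on the list. */
size_t area_remove(struct area *a)
{
    struct area **p, *tmp;
    size_t steps = 0;
    int found = 0;

    area_lock_write();
    for (p = &area_list; (tmp = *p) != NULL; p = &tmp->next) {
        if (tmp == a) {
            *p = tmp->next;
            found = 1;
            break;
        }
        steps++;
    }
    area_unlock_write();
    return found ? steps : (size_t)-1;
}
```

In this model the lock hold time of every insert and remove scales with the number of live areas, and all callers queue behind a single lock. That would be consistent with the profile above, where time is dominated by waiting for the lock rather than by the allocation work itself.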

      Using "perf" over a 1 min interval, I collected stack traces of the threads to better understand the code paths leading to the contention. I've copied a few of these traces below:

                                                                            
      * 1.11% - kiblnd_sd_01_00                                                       
                __write_lock_failed                                                   
                100% - remove_vm_area                                                 
                       __vunmap                                                       
                       vfree                                                          
                       cfs_free_large                                                 
                       null_free_rs                                                   
                       sptlrpc_svc_free_rs                                            
                       lustre_free_reply_state                                        
                       reply_out_callback                                             
                       lnet_eq_enqueue_event                                          
                       lnet_msg_detach_md                                             
                       lnet_finalize                                                  
                       kiblnd_tx_done                                                 
                       kiblnd_complete                                                
                       kiblnd_scheduler                                               
                       child_rip                                                      
      * 0.65% - kiblnd_sd_01_01                                                       
                _spin_lock                                                            
                98.97% - cfs_percpt_lock                                              
                         55.89% - lnet_finalize                                       
                                  kiblnd_tx_done                                      
                                  kiblnd_tx_complete                                  
                                  kiblnd_complete                                     
                                  kiblnd_scheduler                                    
                                  child_rip                                           
                         43.93% - lnet_ptl_match_md                                   
                                  lnet_parse                                          
                                  kiblnd_handle_rx                                    
                                  kiblnd_rx_complete                                  
                                  kiblnd_complete                                     
                                  kiblnd_scheduler                                    
                                  child_rip                                           
      * 0.46% - mdt_rdpg01_001                                                        
                __write_lock_failed                                                   
                69.34% - __vmalloc_node                                               
                         vmalloc                                                      
                         cfs_alloc_large                                              
                         null_alloc_rs                                                
                         sptlrpc_svc_alloc_rs                                         
                         lustre_pack_reply_v2                                         
                         lustre_pack_reply_flags                                      
                         lustre_pack_reply                                            
                         req_capsule_server_pack                                      
                         mdt_close                                                    
                         mdt_handle_common                                            
                         mds_readpage_handle                                          
                         ptlrpc_server_handle_request                                 
                         ptlrpc_main                                                  
                         child_rip                                                    
                30.66% - remove_vm_area                                               
                         __vunmap                                                     
                         vfree                                                        
                         cfs_free_large                                               
                         null_free_rs                                                 
                         sptlrpc_svc_free_rs                                          
                         lustre_free_reply_state                                      
                         ptlrpc_server_free_request                                   
                         ptlrpc_server_drop_request                                   
                         ptlrpc_server_finish_request                                 
                         ptlrpc_server_finish_active_request                          
                         ptlrpc_server_handle_request                                 
                         ptlrpc_main                                                  
                         child_rip
      * 0.35% - mdt00_047                                                             
                __write_lock_failed                                                   
                68.66% - __vmalloc_node                                               
                         vmalloc                                                      
                         cfs_alloc_large                                              
                         null_alloc_rs                                                
                         sptlrpc_svc_alloc_rs                                         
                         lustre_pack_reply_v2                                         
                         lustre_pack_reply_flags                                      
                         lustre_pack_reply                                            
                         req_capsule_server_pack                                      
                         70.48% - mdt_unpack_req_pack_rep                             
                                  mdt_intent_policy                                   
                                  ldlm_lock_enqueue                                   
                                  ldlm_handle_enqueue0                                
                                  mdt_enqueue                                         
                                  mdt_handle_common                                   
                                  mds_regular_handle                                  
                                  ptlrpc_server_handle_request                        
                                  ptlrpc_main                                         
                                  child_rip                                           
                         29.52% - mdt_reint_internal                                  
                                  mdt_intent_reint                                    
                                  mdt_intent_policy                                   
                                  ldlm_lock_enqueue                                   
                                  ldlm_handle_enqueue0                                
                                  mdt_enqueue                                         
                                  mdt_handle_common                                   
                                  mds_regular_handle                                  
                                  ptlrpc_server_handle_request                        
                                  ptlrpc_main                                         
                                  child_rip                                           
                31.34% - remove_vm_area                                               
                         __vunmap                                                     
                         vfree                                                        
                         cfs_free_large                                               
                         null_free_rs                                                 
                         sptlrpc_svc_free_rs                                          
                         lustre_free_reply_state                                      
                         ptlrpc_server_free_request                                   
                         ptlrpc_server_drop_request                                   
                         ptlrpc_server_finish_request                                 
                         ptlrpc_server_finish_active_request                          
                         ptlrpc_server_handle_request                                 
                         ptlrpc_main                                                  
                         child_rip                                                    
      

      These vmalloc and vfree calls need to be eliminated from the common code paths.
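
One way to keep the expensive allocator off the common path, sketched here in userspace C under stated assumptions, is to route requests by size so that only unusually large buffers take the slow path. This is an illustration of the idea, not the fix applied to Lustre: `SMALL_LIMIT`, `struct buf`, `buf_alloc`, `buf_free`, and the counting `large_alloc`/`large_free` helpers are all hypothetical, and `malloc` stands in for both the cheap and the expensive allocator.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical threshold below which the cheap allocator is used. */
#define SMALL_LIMIT 4096

enum buf_path { BUF_SMALL, BUF_LARGE };

struct buf {
    void *data;
    size_t size;
    enum buf_path path;   /* remembered so the free side takes the same path */
};

/* Counts trips through the stand-in for the expensive allocator. */
static size_t large_path_calls;

static void *large_alloc(size_t size)
{
    large_path_calls++;
    return malloc(size);
}

static void large_free(void *p)
{
    large_path_calls++;
    free(p);
}

/* Allocate a buffer, using the slow path only above SMALL_LIMIT.
 * Returns 0 on success, -1 on allocation failure. */
int buf_alloc(struct buf *b, size_t size)
{
    assert(b != NULL && size > 0);
    if (size <= SMALL_LIMIT) {
        b->data = malloc(size);
        b->path = BUF_SMALL;
    } else {
        b->data = large_alloc(size);
        b->path = BUF_LARGE;
    }
    b->size = b->data ? size : 0;
    return b->data ? 0 : -1;
}

/* Release a buffer through the same path that allocated it. */
void buf_free(struct buf *b)
{
    if (b->path == BUF_LARGE)
        large_free(b->data);
    else
        free(b->data);
    b->data = NULL;
    b->size = 0;
}

size_t buf_large_path_calls(void)
{
    return large_path_calls;
}
```

With this routing, a typical small reply buffer never touches the slow path on either allocation or free. Whether the reply states in the traces above are small enough to benefit is an assumption that would need to be checked against the actual allocation sizes.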

          People

            green Oleg Drokin
            prakash Prakash Surya (Inactive)