Details
- Type: Bug
- Resolution: Fixed
- Priority: Critical
- Affects Version: Lustre 2.4.1
- 3
- 10736
Description
Using "perf" and "oprofile" on Grove's MDS (768 OSTs in the FS), I see an absurd amount of CPU time being spent contending on the kernel's vmlist_lock.
I ran oprofile for 10 minutes on grove-mds1 to collect CPU profiling information; it shows the top 5 CPU consumers to be:
%         symbol_name
50.2404   __write_lock_failed
7.5581    __vmalloc_node
6.9957    remove_vm_area
5.2425    cfs_percpt_lock
4.9054    intel_idle
Digging into the details of __vmalloc_node and remove_vm_area, most of the time is spent on two lines, vmalloc.c:1285 and vmalloc.c:1434 respectively. Both lines correspond to a linear traversal of the global vmlist while holding the vmlist_lock for writing.
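To make the bottleneck concrete, here is a minimal userspace sketch (my own simplification for illustration, not the actual vmalloc.c code) of the pattern those two lines implement: one global region list, one rwlock, and an O(n) walk under the write lock on both the allocation and free paths, so every concurrent alloc/free from the service threads funnels through the same lock and traversal.

/*
 * Userspace model of the vmlist bottleneck (illustration only).
 * Both insert and remove take the write lock and walk the whole list,
 * which is roughly what __vmalloc_node and remove_vm_area do here.
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct vm_region {
	unsigned long addr, size;
	struct vm_region *next;
};

static struct vm_region *vmlist;                 /* single global list */
static pthread_rwlock_t vmlist_lock = PTHREAD_RWLOCK_INITIALIZER;

/* analogous to the insertion walk on the allocation path */
static void vm_region_insert(struct vm_region *new)
{
	struct vm_region **p;

	pthread_rwlock_wrlock(&vmlist_lock);     /* writer excludes everyone */
	for (p = &vmlist; *p; p = &(*p)->next)   /* O(n) walk to find a slot */
		if ((*p)->addr > new->addr)
			break;
	new->next = *p;
	*p = new;
	pthread_rwlock_unlock(&vmlist_lock);
}

/* analogous to the removal walk on the free path */
static struct vm_region *vm_region_remove(unsigned long addr)
{
	struct vm_region **p, *found = NULL;

	pthread_rwlock_wrlock(&vmlist_lock);
	for (p = &vmlist; *p; p = &(*p)->next)   /* O(n) walk to find entry */
		if ((*p)->addr == addr) {
			found = *p;
			*p = found->next;
			break;
		}
	pthread_rwlock_unlock(&vmlist_lock);
	return found;
}

int main(void)
{
	struct vm_region *r = malloc(sizeof(*r));

	r->addr = 0x1000;
	r->size = 4096;
	vm_region_insert(r);
	free(vm_region_remove(0x1000));
	printf("insert and remove both took the global write lock\n");
	return 0;
}

With 768 OSTs worth of service threads allocating and freeing reply buffers concurrently, the write lock plus the linear walk is what shows up as __write_lock_failed at the top of the profile.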
Using "perf" during a 1 min interval, I collected stack traces of the threads to try and better understand to code path leading to the contention. I've copied a few of these traces below:
* 1.11% - kiblnd_sd_01_00
    __write_lock_failed
      100% - remove_vm_area __vunmap vfree cfs_free_large null_free_rs sptlrpc_svc_free_rs lustre_free_reply_state reply_out_callback lnet_eq_enqueue_event lnet_msg_detach_md lnet_finalize kiblnd_tx_done kiblnd_complete kiblnd_scheduler child_rip

* 0.65% - kiblnd_sd_01_01
    _spin_lock
      98.97% - cfs_percpt_lock
        55.89% - lnet_finalize kiblnd_tx_done kiblnd_tx_complete kiblnd_complete kiblnd_scheduler child_rip
        43.93% - lnet_ptl_match_md lnet_parse kiblnd_handle_rx kiblnd_rx_complete kiblnd_complete kiblnd_scheduler child_rip

* 0.46% - mdt_rdpg01_001
    __write_lock_failed
      69.34% - __vmalloc_node vmalloc cfs_alloc_large null_alloc_rs sptlrpc_svc_alloc_rs lustre_pack_reply_v2 lustre_pack_reply_flags lustre_pack_reply req_capsule_server_pack mdt_close mdt_handle_common mds_readpage_handle ptlrpc_server_handle_request ptlrpc_main child_rip
      30.66% - remove_vm_area __vunmap vfree cfs_free_large null_free_rs sptlrpc_svc_free_rs lustre_free_reply_state ptlrpc_server_free_request ptlrpc_server_drop_request ptlrpc_server_finish_request ptlrpc_server_finish_active_request ptlrpc_server_handle_request ptlrpc_main child_rip

* 0.35% - mdt00_047
    __write_lock_failed
      68.66% - __vmalloc_node vmalloc cfs_alloc_large null_alloc_rs sptlrpc_svc_alloc_rs lustre_pack_reply_v2 lustre_pack_reply_flags lustre_pack_reply req_capsule_server_pack
        70.48% - mdt_unpack_req_pack_rep mdt_intent_policy ldlm_lock_enqueue ldlm_handle_enqueue0 mdt_enqueue mdt_handle_common mds_regular_handle ptlrpc_server_handle_request ptlrpc_main child_rip
        29.52% - mdt_reint_internal mdt_intent_reint mdt_intent_policy ldlm_lock_enqueue ldlm_handle_enqueue0 mdt_enqueue mdt_handle_common mds_regular_handle ptlrpc_server_handle_request ptlrpc_main child_rip
      31.34% - remove_vm_area __vunmap vfree cfs_free_large null_free_rs sptlrpc_svc_free_rs lustre_free_reply_state ptlrpc_server_free_request ptlrpc_server_drop_request ptlrpc_server_finish_request ptlrpc_server_finish_active_request ptlrpc_server_handle_request ptlrpc_main child_rip
These vmalloc and vfree calls need to be eliminated from the common code paths.
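One possible direction (a sketch of the idea only, not the actual patch; the helper names below are hypothetical) is to stop routing reply-state buffers through vmalloc() whenever the physically contiguous allocator can satisfy the request, so the common path never touches the global vmlist_lock, and to keep vmalloc() only as a rare fallback for oversize or fragmented cases:

#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>

/*
 * Hypothetical helpers: prefer kmalloc() for reply-state buffers so the
 * common path avoids vmalloc()/vfree() and the vmlist_lock entirely;
 * fall back to vmalloc() only when a contiguous allocation is not
 * practical.  The 4-page threshold is an arbitrary example value.
 */
static void *rs_alloc_buf(size_t size)
{
	void *buf = NULL;

	if (size <= 4 * PAGE_SIZE)                /* small enough: try kmalloc */
		buf = kmalloc(size, GFP_NOFS | __GFP_NOWARN);
	if (buf == NULL)
		buf = vmalloc(size);              /* rare slow-path fallback */
	return buf;
}

static void rs_free_buf(void *buf)
{
	if (is_vmalloc_addr(buf))                 /* only vmalloc'd buffers */
		vfree(buf);                       /* pay the vmlist_lock cost */
	else
		kfree(buf);
}

Preallocating or caching reply buffers per service thread would achieve the same goal by taking allocation out of the request-handling path altogether; either way the point is that the O(n) vmlist traversal stops being per-RPC work.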