Details
- Type: Bug
- Resolution: Fixed
- Priority: Critical
- Fix Version: Lustre 2.4.1
Description
Using "perf" and "oprofile" on Grove's MDS (768 OSTs in the FS), I see an absurd amount of CPU time being spent contending on the kernel's vmlist_lock.
I ran oprofile for 10 minutes on grove-mds1 to collect CPU profiling information; it shows the top 5 CPU consumers as:
%        symbol_name
50.2404  __write_lock_failed
 7.5581  __vmalloc_node
 6.9957  remove_vm_area
 5.2425  cfs_percpt_lock
 4.9054  intel_idle
Digging into the details of __vmalloc_node and remove_vm_area, most of the time is spent on two lines, vmalloc.c:1285 and vmalloc.c:1434 respectively. Both of these lines correspond to a linear traversal of the global vmlist while holding the vmlist_lock for writing.
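For context, here is a minimal userspace model of that pattern (simplified, hypothetical names; this is not the kernel source): every allocation inserts into, and every free unlinks from, a single sorted singly linked list, and both operations take the same global rwlock for writing and walk the list linearly.

/*
 * Userspace model (not kernel code) of the contended pattern: one global
 * rwlock protecting a sorted singly linked list, taken for writing and
 * walked linearly on every allocation and every free.
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct vm_area {
        void            *addr;
        struct vm_area  *next;
};

static pthread_rwlock_t vmlist_lock = PTHREAD_RWLOCK_INITIALIZER;
static struct vm_area *vmlist;          /* sorted by addr */

/* "vmalloc" side: O(n) insert under the global write lock */
static void vmlist_insert(struct vm_area *area)
{
        struct vm_area **p, *tmp;

        pthread_rwlock_wrlock(&vmlist_lock);
        for (p = &vmlist; (tmp = *p) != NULL; p = &tmp->next)
                if (tmp->addr >= area->addr)
                        break;
        area->next = tmp;
        *p = area;
        pthread_rwlock_unlock(&vmlist_lock);
}

/* "vfree" side: O(n) search-and-unlink under the same write lock */
static struct vm_area *vmlist_remove(void *addr)
{
        struct vm_area **p, *tmp;

        pthread_rwlock_wrlock(&vmlist_lock);
        for (p = &vmlist; (tmp = *p) != NULL; p = &tmp->next) {
                if (tmp->addr == addr) {
                        *p = tmp->next;
                        break;
                }
        }
        pthread_rwlock_unlock(&vmlist_lock);
        return tmp;
}

int main(void)
{
        struct vm_area *a = malloc(sizeof(*a)), *b = malloc(sizeof(*b));

        a->addr = (void *)0x1000;
        b->addr = (void *)0x2000;
        vmlist_insert(a);
        vmlist_insert(b);
        printf("removed %p\n", vmlist_remove((void *)0x1000)->addr);
        free(a);
        free(b);
        return 0;
}

With hundreds of service threads allocating and freeing reply buffers per RPC, every caller funnels through the analogous insert and remove in __vmalloc_node and remove_vm_area while the other CPUs spin in __write_lock_failed waiting for the same lock.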
Using "perf" during a 1 min interval, I collected stack traces of the threads to try and better understand to code path leading to the contention. I've copied a few of these traces below:
* 1.11% - kiblnd_sd_01_00
    __write_lock_failed
      100% - remove_vm_area
        __vunmap
        vfree
        cfs_free_large
        null_free_rs
        sptlrpc_svc_free_rs
        lustre_free_reply_state
        reply_out_callback
        lnet_eq_enqueue_event
        lnet_msg_detach_md
        lnet_finalize
        kiblnd_tx_done
        kiblnd_complete
        kiblnd_scheduler
        child_rip
* 0.65% - kiblnd_sd_01_01
    _spin_lock
      98.97% - cfs_percpt_lock
        55.89% - lnet_finalize
          kiblnd_tx_done
          kiblnd_tx_complete
          kiblnd_complete
          kiblnd_scheduler
          child_rip
        43.93% - lnet_ptl_match_md
          lnet_parse
          kiblnd_handle_rx
          kiblnd_rx_complete
          kiblnd_complete
          kiblnd_scheduler
          child_rip
* 0.46% - mdt_rdpg01_001
    __write_lock_failed
      69.34% - __vmalloc_node
        vmalloc
        cfs_alloc_large
        null_alloc_rs
        sptlrpc_svc_alloc_rs
        lustre_pack_reply_v2
        lustre_pack_reply_flags
        lustre_pack_reply
        req_capsule_server_pack
        mdt_close
        mdt_handle_common
        mds_readpage_handle
        ptlrpc_server_handle_request
        ptlrpc_main
        child_rip
      30.66% - remove_vm_area
        __vunmap
        vfree
        cfs_free_large
        null_free_rs
        sptlrpc_svc_free_rs
        lustre_free_reply_state
        ptlrpc_server_free_request
        ptlrpc_server_drop_request
        ptlrpc_server_finish_request
        ptlrpc_server_finish_active_request
        ptlrpc_server_handle_request
        ptlrpc_main
        child_rip
* 0.35% - mdt00_047
    __write_lock_failed
      68.66% - __vmalloc_node
        vmalloc
        cfs_alloc_large
        null_alloc_rs
        sptlrpc_svc_alloc_rs
        lustre_pack_reply_v2
        lustre_pack_reply_flags
        lustre_pack_reply
        req_capsule_server_pack
          70.48% - mdt_unpack_req_pack_rep
            mdt_intent_policy
            ldlm_lock_enqueue
            ldlm_handle_enqueue0
            mdt_enqueue
            mdt_handle_common
            mds_regular_handle
            ptlrpc_server_handle_request
            ptlrpc_main
            child_rip
          29.52% - mdt_reint_internal
            mdt_intent_reint
            mdt_intent_policy
            ldlm_lock_enqueue
            ldlm_handle_enqueue0
            mdt_enqueue
            mdt_handle_common
            mds_regular_handle
            ptlrpc_server_handle_request
            ptlrpc_main
            child_rip
      31.34% - remove_vm_area
        __vunmap
        vfree
        cfs_free_large
        null_free_rs
        sptlrpc_svc_free_rs
        lustre_free_reply_state
        ptlrpc_server_free_request
        ptlrpc_server_drop_request
        ptlrpc_server_finish_request
        ptlrpc_server_finish_active_request
        ptlrpc_server_handle_request
        ptlrpc_main
        child_rip
These vmalloc and vfree calls need to be eliminated from the common code paths.
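One possible way to do that, sketched below with hypothetical helper names (this is an illustration, not the actual patch): try kmalloc() with __GFP_NOWARN first and fall back to vmalloc() only when physically contiguous memory is unavailable, so the common reply-state alloc/free never touches vmlist_lock. The trade-off is that large, physically contiguous allocations can fail under memory fragmentation, which is why the vmalloc() fallback is retained.

/*
 * Illustrative sketch only (hypothetical helpers, not the actual
 * libcfs/ptlrpc change): prefer kmalloc() for "large" buffers and fall
 * back to vmalloc() only when contiguous pages cannot be found.
 */
#include <linux/slab.h>
#include <linux/vmalloc.h>
#include <linux/mm.h>

static inline void *alloc_large_try_kmalloc(size_t size)
{
        void *ptr;

        /* Suppress allocation-failure warnings; failure here is expected
         * under fragmentation and simply triggers the vmalloc fallback. */
        ptr = kmalloc(size, GFP_NOFS | __GFP_NOWARN);
        if (ptr == NULL)
                ptr = vmalloc(size);
        return ptr;
}

static inline void free_large(void *ptr)
{
        /* Only buffers that actually came from vmalloc() pay the
         * vmlist_lock cost on free. */
        if (is_vmalloc_addr(ptr))
                vfree(ptr);
        else
                kfree(ptr);
}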