[LU-4008] Reduce vmalloc usage on MDS Created: 25/Sep/13 Updated: 29/Apr/14 Resolved: 15/Apr/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.1 |
| Fix Version/s: | Lustre 2.6.0, Lustre 2.5.2 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Prakash Surya (Inactive) | Assignee: | Oleg Drokin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | mn4 | ||
| Severity: | 3 |
| Rank (Obsolete): | 10736 |
| Description |
|
Using "perf" and "oprofile" on Grove's MDS (768 OSTs in the FS), I see an absurd amount of CPU time being spent contending on the kernel's vmlist_lock. I ran oprofile for 10 mins on grove-mds1 to collect some CPU profiling information and it shows the top 5 CPU consumers being: % symbol_name 50.2404 __write_lock_failed 7.5581 __vmalloc_node 6.9957 remove_vm_area 5.2425 cfs_percpt_lock 4.9054 intel_idle If I dig into the details of __vmalloc_node and remove_vm_area, most of the time is being spent on these two lines vmalloc.c:1285 and vmalloc.c:1434 respectively. Both of these lines correspond to a linear traversal of the global vmlist while holding write privileges of the vmlist_lock. Using "perf" during a 1 min interval, I collected stack traces of the threads to try and better understand to code path leading to the contention. I've copied a few of these traces below:
* 1.11% - kiblnd_sd_01_00
__write_lock_failed
100% - remove_vm_area
__vunmap
vfree
cfs_free_large
null_free_rs
sptlrpc_svc_free_rs
lustre_free_reply_state
reply_out_callback
lnet_eq_enqueue_event
lnet_msg_detach_md
lnet_finalize
kiblnd_tx_done
kiblnd_complete
kiblnd_scheduler
child_rip
* 0.65% - kiblnd_sd_01_01
_spin_lock
98.97% - cfs_percpt_lock
55.89% - lnet_finalize
kiblnd_tx_done
kiblnd_tx_complete
kiblnd_complete
kiblnd_scheduler
child_rip
43.93% - lnet_ptl_match_md
lnet_parse
kiblnd_handle_rx
kiblnd_rx_complete
kiblnd_complete
kiblnd_scheduler
child_rip
* 0.46% - mdt_rdpg01_001
__write_lock_failed
69.34% - __vmalloc_node
vmalloc
cfs_alloc_large
null_alloc_rs
sptlrpc_svc_alloc_rs
lustre_pack_reply_v2
lustre_pack_reply_flags
lustre_pack_reply
req_capsule_server_pack
mdt_close
mdt_handle_common
mds_readpage_handle
ptlrpc_server_handle_request
ptlrpc_main
child_rip
30.66% - remove_vm_area
__vunmap
vfree
cfs_free_large
null_free_rs
sptlrpc_svc_free_rs
lustre_free_reply_state
ptlrpc_server_free_request
ptlrpc_server_drop_request
ptlrpc_server_finish_request
ptlrpc_server_finish_active_request
ptlrpc_server_handle_request
ptlrpc_main
child_rip
* 0.35% - mdt00_047
__write_lock_failed
68.66% - __vmalloc_node
vmalloc
cfs_alloc_large
null_alloc_rs
sptlrpc_svc_alloc_rs
lustre_pack_reply_v2
lustre_pack_reply_flags
lustre_pack_reply
req_capsule_server_pack
70.48% - mdt_unpack_req_pack_rep
mdt_intent_policy
ldlm_lock_enqueue
ldlm_handle_enqueue0
mdt_enqueue
mdt_handle_common
mds_regular_handle
ptlrpc_server_handle_request
ptlrpc_main
child_rip
29.52% - mdt_reint_internal
mdt_intent_reint
mdt_intent_policy
ldlm_lock_enqueue
ldlm_handle_enqueue0
mdt_enqueue
mdt_handle_common
mds_regular_handle
ptlrpc_server_handle_request
ptlrpc_main
child_rip
31.34% - remove_vm_area
__vunmap
vfree
cfs_free_large
null_free_rs
sptlrpc_svc_free_rs
lustre_free_reply_state
ptlrpc_server_free_request
ptlrpc_server_drop_request
ptlrpc_server_finish_request
ptlrpc_server_finish_active_request
ptlrpc_server_handle_request
ptlrpc_main
child_rip
These vmalloc and vfree calls need to be eliminated from the common code paths. |
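For context on why these calls are so expensive: in the kernels in question the vmap areas are kept on a single global list protected by vmlist_lock, and both allocation and removal walk that list linearly under the write lock. The fragment below is a simplified illustration of that pattern, not the real mm/vmalloc.c code; the _sketch names are invented.

#include <linux/list.h>
#include <linux/spinlock.h>

/* Illustration only -- one global list, one rwlock, a linear walk on
 * every insert (vmalloc) and remove (vfree). */
struct vm_area_sketch {
	struct list_head	list;
	void			*addr;
	unsigned long		size;
};

static DEFINE_RWLOCK(vmlist_lock_sketch);
static LIST_HEAD(vmlist_sketch);

static void vmlist_insert_sketch(struct vm_area_sketch *area)
{
	struct vm_area_sketch *p;

	write_lock(&vmlist_lock_sketch);	/* every vmalloc/vfree serializes here */
	list_for_each_entry(p, &vmlist_sketch, list)
		if (p->addr > area->addr)	/* linear, O(number of vmap areas) */
			break;
	list_add_tail(&area->list, &p->list);	/* insert before p (or at the tail) */
	write_unlock(&vmlist_lock_sketch);
}

With roughly 100K entries in /proc/vmallocinfo on a ZFS-backed MDS, every vmalloc and vfree walks a list of that length while holding the write lock, which is consistent with __write_lock_failed dominating the profile above.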
| Comments |
| Comment by Bruno Faccini (Inactive) [ 26/Sep/13 ] |
|
Hello Prakash, thanks for reporting this and for the very interesting profiling work! |
| Comment by Prakash Surya (Inactive) [ 26/Sep/13 ] |
No, unfortunately Grove is on the SCF so the best I can do is hand copy relevant information.
I don't know the specific workload; this was taken from a "random" point in time under a production user workload. With that said, we can see from the stacks that ldlm_enqueue's and mdt_close's were occurring. Here's what I think the problem is (please correct me if I'm wrong).

Grove's filesystem has 768 OSTs in it. This means any buffer allocated based on the maximum stripe count on the system will be sizeof(struct lov_mds_md) + 768 * sizeof(struct lov_ost_data) bytes, or about 18K. This is bigger than the 4 * PAGE_SIZE limit for kmallocs. The clients are PPC64 with 64K pages, so even though we're running with http://review.whamcloud.com/#/c/6339 , the "default" size used for EA buffers can (and did) grow to the "max" size of 18K (18K is still less than the 64K limit on PPC64). When this buffer is sent to the x86_64 MDS, it results in a vmalloc because the server only has 4K pages (18K is bigger than the 16K kmalloc limit in OBD_ALLOC_LARGE). Thus, for every enqueue a client sends, a vmalloc is performed on the MDS, causing contention on the global kernel vmlist_lock rwlock.

That covers enqueues; the story for closes is similar but slightly different. On an mdt_close call, a buffer is allocated based on the mdt_device's mdt_max_mdsize field. This appears to be sized based on the maximum stripe count encountered (similar to how the client's max_easize is maintained). Thus, the MDS must have seen a file striped over all 768 OSTs, now causing all closes to perform an 18K vmalloc.

To compound the issue, these vmalloc calls are made even more expensive on Grove because it is using ZFS as a backend, which currently makes heavy use of vmalloc. Dumping /proc/vmallocinfo, I see over 85K entries attributed to the SPL and about 100K overall. So the linear traversal of the vmlist is more expensive than it would be on an ldiskfs FS, which only has about 3K entries listed in /proc/vmallocinfo (based on our test systems using ldiskfs).

So, to summarize a little, the issue comes from the subtle interaction between 64K page clients and 4K page servers, a large OST count, and vmalloc performance degradation due to its heavier usage with ZFS backends. I see a few different ways to solve this (a quick sketch of the size arithmetic follows the list below): 1. Change the client to limit the "default" easize to something else:
1.1. Maybe a hardcoded 4K?
1.2. Maybe a "page size" the server and client agree on at connect time
1.3. Something else entirely?
2. Change the server to ignore the buffer size given by the client entirely.
2.1. The server already determines the "correct" size to use if the buffer is too small. This can become the default case.
2.2. If the client allocated a buffer too big, how do we handle this? A "too small" buffer is already handled.
2.3. Prevents a malicious client from endangering the server
3. Make a more dramatic change to use a scatter-gather type implementation for the stripe information
3.1. This would probably be the "best" fix as it removes vmalloc's for even large striped files, but would be the most intrusive.
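To make the sizes in the analysis above concrete, here is a small stand-alone sketch of the arithmetic and of the kmalloc/vmalloc cutoff. It is not the actual Lustre OBD_ALLOC_LARGE code; the 24-byte lov_ost_data size is the one quoted later in this ticket, and the 32-byte lov_mds_md header size is an assumption for illustration.

#include <stdio.h>

#define SERVER_PAGE_SIZE	4096UL			/* x86_64 MDS */
#define KMALLOC_CUTOFF		(4 * SERVER_PAGE_SIZE)	/* 16K: above this, vmalloc */

#define LOV_MDS_MD_HDR		32UL	/* assumed header size */
#define LOV_OST_DATA_SIZE	24UL	/* per-stripe entry, per comment below */

int main(void)
{
	int counts[] = { 4, 160, 640, 768, 1008 };
	unsigned int i;

	for (i = 0; i < sizeof(counts) / sizeof(counts[0]); i++) {
		unsigned long easize = LOV_MDS_MD_HDR +
				       counts[i] * LOV_OST_DATA_SIZE;

		printf("%4d stripes -> %6lu bytes -> %s\n", counts[i], easize,
		       easize > KMALLOC_CUTOFF ? "vmalloc" : "kmalloc");
	}
	return 0;
}

Under these assumptions the crossover sits at roughly 680 stripes, which lines up with both the 768-OST case here and the "over 650-ish stripes" observation later in the ticket.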
|
| Comment by Andreas Dilger [ 16/Oct/13 ] |
|
Prakash, one of my review comments for http://review.whamcloud.com/#/c/6339/4..10/lustre/include/lustre_mdc.h was that the client is only tracking the maximum request size, and not the average request size. I see that you made the same comment on a later version of the patch as well. I suspect that if the client tracked an average (decaying mean is probably easiest) and that average is low, it would resolve the problem being seen here. If the average is actually so large (e.g. default stripe count is 768) then there really isn't anything to be done about it. Having the server track the average size would also make sense. |
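A hedged sketch of the decaying-mean idea (an illustration only, not the code in patch 6339): keep a cheap integer EWMA of observed reply sizes and size the default buffer from it, falling back to the worst-case maximum only when a reply overflows.

/* Illustration only. A 1/8-weight integer EWMA of observed reply sizes;
 * buffer sizing would use this instead of the worst-case max_easize. */
static inline unsigned int easize_decaying_mean(unsigned int mean,
						unsigned int sample)
{
	/* new_mean = 7/8 * old_mean + 1/8 * sample, in integer math */
	return mean - (mean >> 3) + (sample >> 3);
}

If the mean stays near the common stripe counts (a few stripes, a few hundred bytes), the default allocation remains a kmalloc even on a 768- or 1008-OST filesystem, and only genuinely wide-striped files pay the vmalloc cost.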
| Comment by Matt Ezell [ 18/Feb/14 ] |
|
ORNL is seeing similar oprofile data. We have 1008 OSTs/fs. It looks like the patch passed tests in Maloo; can we get it re-reviewed and possibly landed? |
| Comment by Matt Ezell [ 18/Feb/14 ] |
|
Sorry, I mean the |
| Comment by James A Simmons [ 19/Feb/14 ] |
|
At ORNL we have been doing more analysis of what is causing the problem. From the lctl dk output we collected, we mostly see:

00000100:00100000:6.0:1392825143.344295:0:8600:0:(service.c:2011:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt_rdpg01_049:15b455e6-de55-2b3a-2739-59d55e9a8b55+228882:45709:x1458489733340076:12345-10.36.205.208@o2ib:35

and, less often:

00000100:00100000:6.0:1392825143.352790:0:15055:0:(service.c:2011:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt01_354:fbd7b694-dc61-6f11-a2ce-dd71afe2b28d+1242:11359:x1459714658729504:12345-10.36.226.81@o2ib:101

From the opcodes we see that sometimes LDLM_ENQUEUE is the offender, but its allocations are a mixture of vmallocs and kmallocs. The constant big offender is MDS_CLOSE. As a test we created |
| Comment by Prakash Surya (Inactive) [ 19/Feb/14 ] |
IMO, you most certainly should not increase the vmalloc/kmalloc threshold to 32K. That would mean on a 4K page system you would need 8 contiguous pages to satisfy the request, which just isn't acceptable. On some of our servers, we've often seen order-2 page requests (i.e. 4 contiguous pages) go unfulfilled (i.e. fail) after the server has been running for a while and memory has become sufficiently fragmented. The real solution is to move away from vmalloc entirely, but honestly, I don't see upstream buying into that idea. The way vmalloc is hidden away is troubling at best. This being a performance-critical piece of systems software, I don't really understand why such an expensive call would be brushed under the rug like it is. |
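For reference, the page-order arithmetic behind that objection, assuming 4K pages on the server (a stand-alone illustration, not kernel code):

#include <stdio.h>

int main(void)
{
	unsigned long page = 4096, request = 32 * 1024;
	unsigned long pages = (request + page - 1) / page;	/* 8 pages */
	int order = 0;

	while ((1UL << order) < pages)
		order++;					/* order 3 */

	printf("a %lu-byte kmalloc is an order-%d allocation (%lu contiguous pages)\n",
	       request, order, pages);
	return 0;
}

Higher-order allocations fail increasingly often as memory fragments, which is why raising the threshold would just trade vmalloc contention for allocation failures.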
| Comment by Patrick Farrell (Inactive) [ 19/Feb/14 ] |
|
Prakash - Move away from vmalloc to use what instead? |
| Comment by James A Simmons [ 19/Feb/14 ] |
|
I bet he is thinking of a scatter-gather list. |
| Comment by Prakash Surya (Inactive) [ 19/Feb/14 ] |
That's kind of a loaded question which really depends on the use case, which highlights my gripes with OBD_ALLOC in the first place. Memory allocations, and the method used to allocate, are delicate in the Linux kernel and should be chosen with care. Use __get_free_page(), or kmalloc, or slabs. If you really, really need a large allocation, use a vector of pages. If you really, really need a large allocation and you really, really need it to have a contiguous address space, use vmalloc (but you're probably doing something "wrong"; the Linux kernel frowns heavily on vmalloc usage AFAIK). |
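A hedged sketch of the "vector of pages" alternative mentioned above (kernel-flavoured and illustrative only; alloc_page_vector is a made-up helper, not an existing Lustre or kernel function): allocate N independent pages instead of one virtually contiguous region, so there is no vmalloc and no need for physically contiguous memory.

#include <linux/gfp.h>
#include <linux/slab.h>

/* Illustration only: allocate npages discontiguous pages. */
static struct page **alloc_page_vector(int npages)
{
	struct page **pages;
	int i;

	pages = kcalloc(npages, sizeof(*pages), GFP_KERNEL);
	if (!pages)
		return NULL;

	for (i = 0; i < npages; i++) {
		pages[i] = alloc_page(GFP_KERNEL);
		if (!pages[i])
			goto err;
	}
	return pages;

err:
	while (i-- > 0)
		__free_page(pages[i]);
	kfree(pages);
	return NULL;
}

The cost is that callers can no longer treat the buffer as one flat address range: anything that packs or walks the LOV EA would have to work page-by-page, which is why option 3 in the earlier comment is called the most intrusive fix.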
| Comment by Oleg Drokin [ 20/Feb/14 ] |
|
I guess it's really unfortunate that we allocate a full-sized buffer for close. And in master, since unlinks are now done by the MDT anyway, there's no need to even allocate this, as it's not being sent back. So this patch does not seem to have any ill effects in my light sanity testing:

diff --git a/lustre/mdt/mdt_open.c b/lustre/mdt/mdt_open.c
index f5a244a..645f069 100644
--- a/lustre/mdt/mdt_open.c
+++ b/lustre/mdt/mdt_open.c
@@ -2275,7 +2275,7 @@ int mdt_close(struct tgt_session_info *tsi)
req_capsule_set_size(info->mti_pill, &RMF_MDT_MD, RCL_SERVER,
info->mti_mdt->mdt_max_mdsize);
req_capsule_set_size(info->mti_pill, &RMF_LOGCOOKIES, RCL_SERVER,
- info->mti_mdt->mdt_max_cookiesize);
+ 0);
rc = req_capsule_server_pack(info->mti_pill);
if (mdt_check_resent(info, mdt_reconstruct_generic, NULL)) {
mdt_client_compatibility(info);
In fact all other places like mdt_reint_internal() already set it to 0, so this must be just one forgotten place. |
| Comment by Oleg Drokin [ 20/Feb/14 ] |
|
Patch is at http://review.whamcloud.com/9322 |
| Comment by Matt Ezell [ 20/Feb/14 ] |
|
Jason turned on +malloc for our mdstrace utility and I extracted the following statistics:

# At 201402201540
$ bzgrep vmalloc rpctrace-mds1.bz2 |awk '{print $4}'|sort -n|uniq -c|sort -nr
7023 25440
3240 25184
5 24832
$ bzgrep "Handling RPC" rpctrace-mds1.bz2 | awk -F":" '{print $20}'|sort -n|uniq -c|sort -nr
19734 400
11101 101
3240 35
901 103
702 36
625 33
109 37
45 41
12 44
11 49
# At 201402201550
$ bzgrep vmalloc rpctrace-mds1.bz2 |awk '{print $4}'|sort -n|uniq -c|sort -nr
5674 25440
3278 25184
2 24832
$ bzgrep "Handling RPC" rpctrace-mds1.bz2 | awk -F":" '{print $20}'|sort -n|uniq -c|sort -nr
19537 400
9423 101
3278 35
940 36
889 103
385 33
174 37
31 41
3 44
2 49
$ bzgrep vmalloc rpctrace-mds1.bz2 |awk '{print $4}'|sort -n|uniq -c|sort -nr
3362 25440
2813 25184
$ bzgrep "Handling RPC" rpctrace-mds1.bz2 | awk -F":" '{print $20}'|sort -n|uniq -c|sort -nr
19551 400
6903 101
2813 35
408 33
342 103
200 36
62 37
32 41
10 44
This is telling me that:
|
| Comment by Matt Ezell [ 20/Feb/14 ] |
|
Quick analysis doesn't show anything obvious that points to a particular lock mode, flag, etc. Here are some samples of lock creates immediately followed by a vmalloc of size 25440: 00010000:00010000:3.0:1392931201.154293:0:15472:0:(ldlm_lockd.c:1333:ldlm_handle_enqueue0()) ### server-side enqueue handler, new lock created ns: mdt-atlas1-MDT0000_UUID lock: ffff882657d60b40/0xfc2e1fc34eb83ce5 lrc: 2/0,0 mode: --/CR res: [0x200030c2a:0x10588:0x0].0 bits 0x0 rrc: 3 type: IBT flags: 0x0 nid: local remote: 0x8e120b6c361d690b expref: -99 pid: 15472 timeout: 0 lvb_type: 0 used 0 02000000:00000010:3.0:1392931201.154501:0:15472:0:(sec_null.c:320:null_alloc_rs()) vmalloced 'rs': 25440 at ffffc9006a82c000. -- 00010000:00010000:3.0:1392931201.155705:0:15472:0:(ldlm_lockd.c:1333:ldlm_handle_enqueue0()) ### server-side enqueue handler, new lock created ns: mdt-atlas1-MDT0000_UUID lock: ffff883ebe0a0d80/0xfc2e1fc34eb83d01 lrc: 2/0,0 mode: --/CR res: [0x200030c2a:0x10588:0x0].0 bits 0x0 rrc: 3 type: IBT flags: 0x0 nid: local remote: 0x8e120b6c361d6935 expref: -99 pid: 15472 timeout: 0 lvb_type: 0 used 0 02000000:00000010:3.0:1392931201.155917:0:15472:0:(sec_null.c:320:null_alloc_rs()) vmalloced 'rs': 25440 at ffffc9006a834000. -- 00010000:00010000:3.0:1392931201.156706:0:15472:0:(ldlm_lockd.c:1333:ldlm_handle_enqueue0()) ### server-side enqueue handler, new lock created ns: mdt-atlas1-MDT0000_UUID lock: ffff883eaf3fc900/0xfc2e1fc34eb83d1d lrc: 2/0,0 mode: --/CR res: [0x200030c2a:0x10588:0x0].0 bits 0x0 rrc: 3 type: IBT flags: 0x0 nid: local remote: 0x8e120b6c361d6966 expref: -99 pid: 15472 timeout: 0 lvb_type: 0 used 0 02000000:00000010:3.0:1392931201.156926:0:15472:0:(sec_null.c:320:null_alloc_rs()) vmalloced 'rs': 25440 at ffffc9006a83c000. -- 00010000:00010000:3.0:1392931201.157814:0:15472:0:(ldlm_lockd.c:1333:ldlm_handle_enqueue0()) ### server-side enqueue handler, new lock created ns: mdt-atlas1-MDT0000_UUID lock: ffff88253b373d80/0xfc2e1fc34eb83d39 lrc: 2/0,0 mode: --/CR res: [0x200030c2a:0x10588:0x0].0 bits 0x0 rrc: 3 type: IBT flags: 0x0 nid: local remote: 0x8e120b6c361d6990 expref: -99 pid: 15472 timeout: 0 lvb_type: 0 used 0 02000000:00000010:3.0:1392931201.158021:0:15472:0:(sec_null.c:320:null_alloc_rs()) vmalloced 'rs': 25440 at ffffc9006a844000. -- 00010000:00010000:3.0:1392931201.158715:0:15472:0:(ldlm_lockd.c:1333:ldlm_handle_enqueue0()) ### server-side enqueue handler, new lock created ns: mdt-atlas1-MDT0000_UUID lock: ffff88259560ed80/0xfc2e1fc34eb83d55 lrc: 2/0,0 mode: --/CR res: [0x200030c2a:0x10588:0x0].0 bits 0x0 rrc: 3 type: IBT flags: 0x0 nid: local remote: 0x8e120b6c361d69ba expref: -99 pid: 15472 timeout: 0 lvb_type: 0 used 0 02000000:00000010:3.0:1392931201.158931:0:15472:0:(sec_null.c:320:null_alloc_rs()) vmalloced 'rs': 25440 at ffffc9006a84c000. -- 00010000:00010000:3.0:1392931201.159783:0:15472:0:(ldlm_lockd.c:1333:ldlm_handle_enqueue0()) ### server-side enqueue handler, new lock created ns: mdt-atlas1-MDT0000_UUID lock: ffff883f65bbcb40/0xfc2e1fc34eb83d71 lrc: 2/0,0 mode: --/CR res: [0x200030c2a:0x10588:0x0].0 bits 0x0 rrc: 3 type: IBT flags: 0x0 nid: local remote: 0x8e120b6c361d69eb expref: -99 pid: 15472 timeout: 0 lvb_type: 0 used 0 02000000:00000010:3.0:1392931201.159995:0:15472:0:(sec_null.c:320:null_alloc_rs()) vmalloced 'rs': 25440 at ffffc9006a854000. 
-- 00010000:00010000:3.0:1392931201.160900:0:15472:0:(ldlm_lockd.c:1333:ldlm_handle_enqueue0()) ### server-side enqueue handler, new lock created ns: mdt-atlas1-MDT0000_UUID lock: ffff883fc4e0f6c0/0xfc2e1fc34eb83d8d lrc: 2/0,0 mode: --/CR res: [0x200030c2a:0x10588:0x0].0 bits 0x0 rrc: 3 type: IBT flags: 0x0 nid: local remote: 0x8e120b6c361d6a15 expref: -99 pid: 15472 timeout: 0 lvb_type: 0 used 0 02000000:00000010:3.0:1392931201.161112:0:15472:0:(sec_null.c:320:null_alloc_rs()) vmalloced 'rs': 25440 at ffffc9006a85c000. -- 00010000:00010000:3.0:1392931201.162075:0:15472:0:(ldlm_lockd.c:1333:ldlm_handle_enqueue0()) ### server-side enqueue handler, new lock created ns: mdt-atlas1-MDT0000_UUID lock: ffff88266ac5b000/0xfc2e1fc34eb83da9 lrc: 2/0,0 mode: --/CR res: [0x200030c2a:0x10588:0x0].0 bits 0x0 rrc: 3 type: IBT flags: 0x0 nid: local remote: 0x8e120b6c361d6a3f expref: -99 pid: 15472 timeout: 0 lvb_type: 0 used 0 02000000:00000010:3.0:1392931201.162290:0:15472:0:(sec_null.c:320:null_alloc_rs()) vmalloced 'rs': 25440 at ffffc9006a864000. -- 00010000:00010000:3.0:1392931201.163049:0:15472:0:(ldlm_lockd.c:1333:ldlm_handle_enqueue0()) ### server-side enqueue handler, new lock created ns: mdt-atlas1-MDT0000_UUID lock: ffff88251fe1a6c0/0xfc2e1fc34eb83dc5 lrc: 2/0,0 mode: --/CR res: [0x200030c2a:0x10588:0x0].0 bits 0x0 rrc: 3 type: IBT flags: 0x0 nid: local remote: 0x8e120b6c361d6a70 expref: -99 pid: 15472 timeout: 0 lvb_type: 0 used 0 02000000:00000010:3.0:1392931201.163261:0:15472:0:(sec_null.c:320:null_alloc_rs()) vmalloced 'rs': 25440 at ffffc9006a86c000. -- 00010000:00010000:3.0:1392931201.164044:0:15472:0:(ldlm_lockd.c:1333:ldlm_handle_enqueue0()) ### server-side enqueue handler, new lock created ns: mdt-atlas1-MDT0000_UUID lock: ffff88251fe1a240/0xfc2e1fc34eb83dfd lrc: 2/0,0 mode: --/CR res: [0x200030c2a:0x10588:0x0].0 bits 0x0 rrc: 3 type: IBT flags: 0x0 nid: local remote: 0x8e120b6c361d6a9a expref: -99 pid: 15472 timeout: 0 lvb_type: 0 used 0 02000000:00000010:3.0:1392931201.164257:0:15472:0:(sec_null.c:320:null_alloc_rs()) vmalloced 'rs': 25440 at ffffc9006a874000. -- 00010000:00010000:3.0:1392931201.165015:0:15472:0:(ldlm_lockd.c:1333:ldlm_handle_enqueue0()) ### server-side enqueue handler, new lock created ns: mdt-atlas1-MDT0000_UUID lock: ffff88263ffaeb40/0xfc2e1fc34eb83e19 lrc: 2/0,0 mode: --/CR res: [0x200030c2a:0x10588:0x0].0 bits 0x0 rrc: 3 type: IBT flags: 0x0 nid: local remote: 0x8e120b6c361d6acb expref: -99 pid: 15472 timeout: 0 lvb_type: 0 used 0 02000000:00000010:3.0:1392931201.165228:0:15472:0:(sec_null.c:320:null_alloc_rs()) vmalloced 'rs': 25440 at ffffc9006a87c000. -- 00010000:00010000:3.0:1392931201.165992:0:15472:0:(ldlm_lockd.c:1333:ldlm_handle_enqueue0()) ### server-side enqueue handler, new lock created ns: mdt-atlas1-MDT0000_UUID lock: ffff883fbc84bd80/0xfc2e1fc34eb83e35 lrc: 2/0,0 mode: --/CR res: [0x200030c2a:0x10588:0x0].0 bits 0x0 rrc: 3 type: IBT flags: 0x0 nid: local remote: 0x8e120b6c361d6af5 expref: -99 pid: 15472 timeout: 0 lvb_type: 0 used 0 02000000:00000010:3.0:1392931201.166204:0:15472:0:(sec_null.c:320:null_alloc_rs()) vmalloced 'rs': 25440 at ffffc9006a884000. 
-- 00010000:00010000:22.0:1392931201.186975:0:14836:0:(ldlm_lockd.c:1333:ldlm_handle_enqueue0()) ### server-side enqueue handler, new lock created ns: mdt-atlas1-MDT0000_UUID lock: ffff881d41d7b240/0xfc2e1fc34eb83eba lrc: 2/0,0 mode: --/CR res: [0x2000346c3:0x1e3e1:0x0].0 bits 0x0 rrc: 2 type: IBT flags: 0x0 nid: local remote: 0xb81ed23ed695939d expref: -99 pid: 14836 timeout: 0 lvb_type: 0 used 0 02000000:00000010:22.0:1392931201.187670:0:14836:0:(sec_null.c:320:null_alloc_rs()) vmalloced 'rs': 25440 at ffffc9006a8a4000. -- 00010000:00010000:22.0:1392931201.206093:0:14836:0:(ldlm_lockd.c:1333:ldlm_handle_enqueue0()) ### server-side enqueue handler, new lock created ns: mdt-atlas1-MDT0000_UUID lock: ffff881b61d52000/0xfc2e1fc34eb83ef9 lrc: 2/0,0 mode: --/CR res: [0x2000346c3:0x1e3e1:0x0].0 bits 0x0 rrc: 2 type: IBT flags: 0x0 nid: local remote: 0xb81ed23ed69593e3 expref: -99 pid: 14836 timeout: 0 lvb_type: 0 used 0 02000000:00000010:22.0:1392931201.206463:0:14836:0:(sec_null.c:320:null_alloc_rs()) vmalloced 'rs': 25440 at ffffc9006a8b4000. -- 00010000:00010000:22.0:1392931201.231631:0:14836:0:(ldlm_lockd.c:1333:ldlm_handle_enqueue0()) ### server-side enqueue handler, new lock created ns: mdt-atlas1-MDT0000_UUID lock: ffff881b8febc000/0xfc2e1fc34eb83f38 lrc: 2/0,0 mode: --/CR res: [0x2000346c3:0x164a2:0x0].0 bits 0x0 rrc: 2 type: IBT flags: 0x0 nid: local remote: 0xb81ed23ed6959429 expref: -99 pid: 14836 timeout: 0 lvb_type: 0 used 0 02000000:00000010:22.0:1392931201.231850:0:14836:0:(sec_null.c:320:null_alloc_rs()) vmalloced 'rs': 25440 at ffffc9006a8cc000. |
| Comment by Andreas Dilger [ 21/Feb/14 ] |
|
The http://review.whamcloud.com/6339 patch will not directly help on the server. That patch only fixes the reply buffer allocation on the client. It might help indirectly, in that the server would no longer allocate its own reply buffer as large just to match a client that allocates the full-sized buffer. |
| Comment by Matt Ezell [ 21/Feb/14 ] |
|
We just checked on a smaller system (280 OSTs) and we didn't see any vmallocs. I commented in patch 9322 that I don't think it will alleviate our issue here. The problem, at least for the MDS_CLOSE RPCs, seems to be using mdt_max_mdsize to size the buffer. Since sizeof(lov_ost_data) is 24, the max LOV EA on our 1008-OST systems is 24K (greater than the 16K boundary, so it forces vmallocs). Like in patch 6339, can the server allocate a smaller buffer and have ptlrpc expand it if it overflows? Alex Zhuravlev made the following comment in the patch:
How do we move forward? |
| Comment by Oleg Drokin [ 22/Feb/14 ] |
|
I am reworking the patch into two. |
| Comment by Oleg Drokin [ 24/Feb/14 ] |
|
OK, I updated the patch; it should hopefully reduce the allocation burden during closes, unlinks and lockless getattrs (those are rarer than locked ones, though). Lightly tested. Now the more complicated dynamic management one is next, I guess. |
| Comment by Oleg Drokin [ 27/Feb/14 ] |
|
First stab at dynamic reply allocation is in: http://review.whamcloud.com/#/c/9409/ Please try it on your many-OST testbed, also with files striped more than 100-way, and see in which cases you still have any vmallocs left. |
| Comment by Matt Ezell [ 27/Feb/14 ] |
|
Thanks, Oleg. After applying 9322 PS3 today we no longer see vmallocs associated with MDS_CLOSE opcodes, but we do see them with LDLM_ENQUEUE. I briefly looked at patch 9409 and I don't see how it would help enqueues. I plan on looking into it more tomorrow, but my current train of thought is:
Is this used to supply the layout to the client along with the layout lock? If we could assign this size based on the actual stripe data length it would certainly help. Or if we could default to something smaller and have the lower layers reallocate for wide-striped files. We don't currently have a test setup with > 650 OSTs, which has made testing this a little more difficult. |
| Comment by Oleg Drokin [ 27/Feb/14 ] |
|
Yes, this is used to supply a layout to the client, I guess we can try to tame this as a next step. There's also an extra path to this if mdt_intent_layout is called, but old clients have no way of triggering this. Also note that currently there's only one kind of lock that could be enqueued by the client with no intent: the PR UPDATE lock to guard pages during readdir. Everything else comes in as intent. |
| Comment by Matt Ezell [ 27/Feb/14 ] |
|
We have about 300 2.4+ clients, so it may be possible that they are adding to the problem through the mdt_intent_layout code path. |
| Comment by Matt Ezell [ 07/Mar/14 ] |
|
Thanks for the help, Oleg. We deployed the second patch to production today and it has greatly reduced vmalloc calls. The following data is the number of vmallocs from our RPC trace report data collection. Each report covers approximately 60 seconds. We took the MDS down at 201403061330 and it took around an hour to go through recovery.

report.201403061000/rpctrace-mds1.bz2:3204
report.201403061010/rpctrace-mds1.bz2:7230
report.201403061020/rpctrace-mds1.bz2:5056
report.201403061030/rpctrace-mds1.bz2:1676
report.201403061040/rpctrace-mds1.bz2:7014
report.201403061050/rpctrace-mds1.bz2:679
report.201403061100/rpctrace-mds1.bz2:2916
report.201403061110/rpctrace-mds1.bz2:3578
report.201403061120/rpctrace-mds1.bz2:5015
report.201403061130/rpctrace-mds1.bz2:2832
report.201403061140/rpctrace-mds1.bz2:5216
report.201403061150/rpctrace-mds1.bz2:2981
report.201403061200/rpctrace-mds1.bz2:1902
report.201403061210/rpctrace-mds1.bz2:2454
report.201403061220/rpctrace-mds1.bz2:12722
report.201403061230/rpctrace-mds1.bz2:2017
report.201403061240/rpctrace-mds1.bz2:2571
report.201403061250/rpctrace-mds1.bz2:2891
report.201403061300/rpctrace-mds1.bz2:3474
report.201403061310/rpctrace-mds1.bz2:2544
report.201403061320/rpctrace-mds1.bz2:3009
report.201403061330/rpctrace-mds1.bz2:3039
report.201403061550/rpctrace-mds1.bz2:139
report.201403061600/rpctrace-mds1.bz2:99
report.201403061610/rpctrace-mds1.bz2:140
report.201403061620/rpctrace-mds1.bz2:127
report.201403061630/rpctrace-mds1.bz2:91
report.201403061640/rpctrace-mds1.bz2:58
report.201403061650/rpctrace-mds1.bz2:13
report.201403061700/rpctrace-mds1.bz2:12
report.201403061710/rpctrace-mds1.bz2:0
report.201403061720/rpctrace-mds1.bz2:0
report.201403061730/rpctrace-mds1.bz2:0
report.201403061740/rpctrace-mds1.bz2:0
report.201403061750/rpctrace-mds1.bz2:0
report.201403061800/rpctrace-mds1.bz2:0
report.201403061810/rpctrace-mds1.bz2:0
report.201403061820/rpctrace-mds1.bz2:0
report.201403061830/rpctrace-mds1.bz2:0
report.201403061840/rpctrace-mds1.bz2:0
report.201403061850/rpctrace-mds1.bz2:2
report.201403061900/rpctrace-mds1.bz2:1
report.201403061910/rpctrace-mds1.bz2:0

A typical lock enqueue for this looks like:

00010000:00010000:13.0:1394140205.643177:0:13482:0:(ldlm_lock.c:1090:ldlm_granted_list_add_lock()) ### About to add lock: ns: mdt-atlas1-MDT0000_UUID lock: ffff883ec0a0f900/0xdc79d3b3eb5bf392 lrc: 3/0,0 mode: CR/CR res: [0x20008e3a7:0x8592:0x0].0 bits 0x8 rrc: 1 type: IBT flags: 0x0 nid: 10.36.202.130@o2ib remote: 0x7096ce8691f14c60 expref: 143537 pid: 13482 timeout: 0 lvb_type: 3 used 0

I looked into it, and it seems like most of the ones after we changed the RPMs are coming from the same node. It's a 2.5 client used for data transfer. I looked at the FIDs and determined that they were all stripe-count 4 in a stripe-count 4 directory. Looking at the process table, I was able to figure out what process was causing them; it is a data transfer app called 'bbcp'. It will take a couple of days before we get user feedback about file system responsiveness, but CPU load seems to have been reduced nicely. |
| Comment by Oleg Drokin [ 10/Mar/14 ] |
|
When you say "A typical lock enqueue for this looks like" - what do you mean by this? Remaining vmalloc enqueue? |
| Comment by Matt Ezell [ 11/Mar/14 ] |
|
Yes, that is an example of the lock data related to an enqueue that caused a vmalloc. You are correct, bits 0x8 is for layout:

#define MDS_INODELOCK_LAYOUT 0x000008 /* for layout */
I pushed http://review.whamcloud.com/9571/ , which I want to do some testing with this week on our thin-lun wide-stripe file system. Does this look safe? The only reason that the actual size is available there is because of
Can/should these patches be marked for additional failover tests to ensure they are safe in those situations? |
| Comment by Matt Ezell [ 29/Mar/14 ] |
|
Just an update - we are running the first two patches in production, and we haven't been seeing ridiculous numbers of vmallocs. When we added in some 2.5 clients, we saw vmallocs again. Patch 9571 appears to successfully avoid those; it will go into production next Tuesday. We do still see vmallocs for wide-striped files (those over 650-ish stripes). This is expected, but we anticipate use of wide-stripe files to be somewhat rare. The exception to this is users that specify '-1' for the stripe count, resulting in 1008-way striped files. We need all three of these patches to land to master and then be backported to 2.5. |
| Comment by Jodi Levi (Inactive) [ 07/Apr/14 ] |
|
I see 2 patches for this ticket. Can you point me to the third so I can track? |
| Comment by Matt Ezell [ 07/Apr/14 ] |
|
The three are 9322, 9409, and 9571. 9409 was abandoned and its functionality was merged into 9322. So you only need to track 9322 and 9571. |
| Comment by Jodi Levi (Inactive) [ 15/Apr/14 ] |
|
Both the 9322 and 9571 patches have landed to master. Please reopen the ticket if more work is needed. |
| Comment by Matt Ezell [ 15/Apr/14 ] |
|
These need to be landed to b2_5. Patch 9322 for b2_5 is 9942 - it should be ready to land. The b2_5 version of 9571 has not yet been cherry-picked, since it will depend on 9942. |
| Comment by Peter Jones [ 15/Apr/14 ] |
|
It's ok Matt - we already have this one tracked to be included in 2.5.2 and 2.4.4. |
| Comment by James Nunez (Inactive) [ 18/Apr/14 ] |
|
b2_5 patches at |