
Toward smaller memory allocations on wide-stripe file systems

Details

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Minor
    • Affects Version: Lustre 2.8.0
    • Fix Version: Lustre 2.8.0
    • Labels: None
    • Environment: Test nodes with 2.7.57-gea38322
    • Severity: 3

    Description

      I'm testing on a fairly recent master build that includes the patch from LU-6587 (refactor OBD_ALLOC_LARGE to always do kmalloc first). That band-aid has been great at improving performance on our wide-stripe file systems, but in the face of memory pressure/fragmentation, it will still rely on vmalloc to satisfy memory requests. Since users tend to use RAM, I'd like to see if there are any opportunities to reduce allocation sizes.

      Anywhere we need to allocate sizeof(something) * num_stripes, we should check to see if there's any way to avoid per-stripe information or at least reduce sizeof(something).

      Attachments

        Issue Links

          Activity


            gerrit Gerrit Updater added a comment -

            Yang Sheng (yang.sheng@intel.com) uploaded a new patch: http://review.whamcloud.com/17476
            Subject: LU-7085 lov: trying smaller memory allocations
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 8c7f009755ef03080c599347cf0452a9bd7cf5f9
            ys Yang Sheng added a comment -

            I'll try the second way first, and then fall back to the first way if the effect is still not satisfactory.

            Thanks,
            YangSheng

            yujian Jian Yu added a comment -

            Thank you, Yang Sheng. Which way would you prefer to implement?

            ys Yang Sheng added a comment -

            After a few tests and a discussion with a clio expert, it looks like the lis_subs buffer allocation cannot be avoided entirely. One way is to allocate each osc separately, so we can skip any osc that doesn't need to sync data, but that really brings some complexity. The other way is just to try to reduce the struct size.

            Thanks,
            YangSheng

            yujian Jian Yu added a comment -

            I think we can skip invoking cl_sync_file_range while unlinking a file.

            Hi Yang Sheng, are you going to create a patch for this?

            ys Yang Sheng added a comment - edited

            The 'lio->lis_subs' big buffer is allocated from cl_io_init, which is invoked by cl_sync_file_range. The code path is as below:
            iput => iput_final => drop_inode => ll_delete_inode => cl_sync_file_range

            I think we can skip invoking cl_sync_file_range while unlinking a file.

            yujian Jian Yu added a comment -

            There should not be a large buffer allocation for unlinks; that would be a bug, I think.

            Hi Yang Sheng,

            Could you please look into the above issue? Thank you.


            adilger Andreas Dilger added a comment -

            There should not be a large buffer allocation for unlinks; that would be a bug, I think.

            As for the max reply size decay, that was something I proposed during patch review but was never implemented. I agree that if access to wide-striped files is rare, it may make sense to reduce the allocations again, but the hard part is figuring out what "rarely" means so that there are not continual resends.
            ezell Matt Ezell added a comment -

            Oleg, thanks for the explanations. I had forgotten that kmalloc was order-based, and I agree that once you have to vmalloc it doesn't make much difference if you have rounded up or not.

            The node I tested on had previously done wide-striping, but I don't know how recently. I don't see any time-based decay, so I guess "recently" just means "ever". So once anyone uses a wide-striped file, all replies to opens will be large?

            Do you think the changes to mdc_intent_getxattr_pack() are worthwhile, assuming we add a safety net like LU-4847?

            I'm still curious about why unlink needs to allocate a large reply buffer since object deletion is handled by the MDT now. What is returned in an unlink reply?

            green Oleg Drokin added a comment -

            Also, to address your other concern of non-wide file reads doing large allocations: did you access a wide-striped file from this client recently? The client caches a "largest striping I saw from the server lately" value and allocates bigger buffers for some time, to avoid a performance hit from resends if you do access a wide-striped file and allocated too small a buffer. This was fixed by http://review.whamcloud.com/11614 and was broken before, causing allocations to always be too small.
            This pessimistic allocation only happens on opens (see in mdc_intent_open_pack how obddev->u.cli.cl_max_mds_easize is used).

            I guess the usage there is incorrect, though, and should have been cl_default_mds_easize instead, like everywhere else and consistent with other users of this, but we need to be careful of LU-4847, I guess.

            green Oleg Drokin added a comment -

            The requests are rounded up to the next power of two in size. The kernel does it anyway internally, so it was decided it's best if we do it ourselves and use all the available buffer space rather than losing some of it.

            The code is

            static
            int null_alloc_repbuf(struct ptlrpc_sec *sec,
                                  struct ptlrpc_request *req,
                                  int msgsize)
            {
                    /* add space for early replied */
                    msgsize += lustre_msg_early_size();
            
                    msgsize = size_roundup_power2(msgsize);
            
                    OBD_ALLOC_LARGE(req->rq_repbuf, msgsize);
                    /* ... rest of the function elided ... */
            }

            Now, it changed a little bit once ALLOC_LARGE started to do vmalloc for all requests beyond 16K, because vmalloc can round up to the next page boundary, but that was considered unimportant.

            Now that we are always doing kmalloc first again (before falling back to vmalloc on error), this code is again correct in the sense that if you try to allocate 16k+1 bytes, 17k bytes or 31k bytes, internally the kernel would still do a 32k allocation. 32k+1 would result in a 64k allocation, and so on.

            If your desire is to reduce the vmalloc allocation once kmalloc has failed, that's a bit tricky, since right now it is all hidden in OBD_ALLOC_LARGE, which has no idea of a possible reduced allocation size if kmalloc fails.
            We probably can unwind it just in the reply buffer allocation and do a smaller one if we use vmalloc, but it sounds like a lot of trouble for very little gain in a heavily fragmented system corner case. If you are in that corner case, your allocations are already slow due to vmalloc.


            People

              Yang Sheng
              Matt Ezell
              Votes: 0
              Watchers: 12
