[LU-13134] try to use slab allocation for cl_page Created: 14/Jan/20 Updated: 26/May/21 Resolved: 04/Jul/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.14.0 |
| Type: | Improvement | Priority: | Minor |
| Reporter: | Wang Shilong (Inactive) | Assignee: | Wang Shilong (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||||||
| Description |
|
Currently we use kmalloc() for cl_page allocation. In most cases cl_page is 400+ bytes, so kmalloc() ends up using the 512-byte slab behind the scenes anyway. Instead of relying on that, we should use a private slab pool (512 bytes) directly. This makes a lot of sense especially for Direct IO, which allocates and frees cl_page very frequently. |
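Conceptually, the change looks like the following minimal sketch. This is not the actual patch; the cache name cl_page_kmem, the helper names, and the fixed 512-byte size are illustrative only.

```c
/* Minimal sketch: back cl_page allocation with a dedicated slab cache
 * instead of plain kmalloc(). Names and sizes here are illustrative. */
#include <linux/slab.h>
#include <linux/errno.h>

struct cl_page;                          /* real definition lives in Lustre's cl_object.h */

static struct kmem_cache *cl_page_kmem;  /* private 512-byte pool */

int cl_page_cache_init(void)
{
	cl_page_kmem = kmem_cache_create("cl_page_kmem", 512, 0,
					 SLAB_HWCACHE_ALIGN, NULL);
	return cl_page_kmem != NULL ? 0 : -ENOMEM;
}

void cl_page_cache_fini(void)
{
	kmem_cache_destroy(cl_page_kmem);
}

static inline struct cl_page *cl_page_alloc_slab(gfp_t gfp)
{
	return kmem_cache_zalloc(cl_page_kmem, gfp);  /* was: kzalloc(size, gfp) */
}

static inline void cl_page_free_slab(struct cl_page *pg)
{
	kmem_cache_free(cl_page_kmem, pg);
}
```

A dedicated cache avoids the generic kmalloc-512 pool, gives a per-cache view in /proc/slabinfo, and keeps frequently recycled Direct IO pages hot in the slab's per-CPU caches. |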
| Comments |
| Comment by Gerrit Updater [ 14/Jan/20 ] |
|
Wang Shilong (wshilong@ddn.com) uploaded a new patch: https://review.whamcloud.com/37225 |
| Comment by Gerrit Updater [ 14/Jan/20 ] |
|
Wang Shilong (wshilong@ddn.com) uploaded a new patch: https://review.whamcloud.com/37227 |
| Comment by Patrick Farrell [ 15/Jan/20 ] |
|
These patches seem reasonable, but do we have any benchmarks showing improvement or perf data showing contention/time lost in these areas? |
| Comment by Wang Shilong (Inactive) [ 16/Jan/20 ] |
|
Hi Patrick, not yet. I am looking for some benchmark numbers from Ihara. The motivation for this is that when we ran an fio (DIO) benchmark using GPUDirect integrated with Lustre, we saw 10% overhead from cl_page allocation/free. |
| Comment by Andreas Dilger [ 03/Feb/20 ] |
|
I think this patch is a good start, but I'm wondering if there is something more dramatic that we can do to reduce the size of cl_page (in a separate patch)? If you consider that we are allocating a whopping 408 bytes per 4096-byte page, that is still 10% of the size of ALL the Lustre cached data. While that is down from the previous 512 bytes per 4096-byte page (12%), there is definitely more that could be done. While I'm not an expert with CLIO, it seems like there is a lot of complexity in this code that could be removed, and although the flexibility of abstraction in cl_page is interesting, it definitely has a lot of overhead.

For example, since we already have fixed-size cl_page allocations, do we really need the cpl_linkage linked list in cl_page_slice (which is 16 bytes per slice)? Would it be possible to use an 8-bit offset within cl_page (that would still be enough for 256 bytes per slice), similar to offsetof() and container_of(), for accessing the various slices? That alone would save 48 bytes per cl_page, depending on how many slices there are (normally 3, I think), and would likely also be much more CPU efficient than walking a list to get the slice for each layer. We could potentially also remove the cpl_page pointer from each cl_page_slice, since the layers would always be passed the actual top-level cl_page pointer, and use the slice offset within cl_page to access their slice.

Another possibility would be to remove/reduce the size of cpl_ops. There are only 6 different cl_page_operations types in the code, so storing a 64-bit pointer to identify them is a lot of redundancy. If, for example, each layer called a simple cl_register_ops() function at startup to register its cl_page_operations struct in a global array and got an index back, then only an 8-bit (or even 3-bit) index would need to be stored in each slice to reference the operations pointer, and all of the layers could still access the global operations array to call those methods when needed. |
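A hedged sketch of the offset-based slice lookup described above; the field and helper names (cp_slice_off, cl_page_slice_at, CP_MAX_LAYERS) are hypothetical, not the current Lustre API.

```c
/* Sketch: find a layer's slice by a stored byte offset inside the
 * fixed-size cl_page allocation, instead of walking cpl_linkage. */
#include <stddef.h>
#include <stdint.h>

struct cl_page_slice;                    /* per-layer part, defined elsewhere */

#define CP_MAX_LAYERS 4                  /* illustrative upper bound */

struct cl_page {
	/* ... other cl_page fields ... */
	uint8_t cp_slice_off[CP_MAX_LAYERS]; /* offset of each slice within the
					      * allocation; 8 bits covers 256 bytes */
};

/* Return the slice for a given layer index without walking a list. */
static inline struct cl_page_slice *
cl_page_slice_at(struct cl_page *pg, int layer)
{
	return (struct cl_page_slice *)((char *)pg + pg->cp_slice_off[layer]);
}
```

With the offsets known, a slice can also recover its owning cl_page by subtracting its own offset, which is what would allow dropping the cpl_page back-pointer. |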
| Comment by Patrick Farrell [ 04/Feb/20 ] |
|
Interesting, yeah, moving all of that to fixed offsets is a good idea. Note that if we remove cpl_linkage then we would also remove cp_layers in the top-level cl_page struct, saving another 16 bytes (per cl_page, not per slice).

I spent a while with this just now and didn't find much else that can be dropped, but I did find two things.

One thing I considered and rejected: I think the cpl_obj pointer in the slice probably has to stay, because there can be more than one OSC object associated with a cl_object, so there is no simple way to get there from the top-level object (without doing an index calculation each time, which seems worse).

One thing we can get rid of, since I got inspired to poke around, is cpl_index for the vvp layer, as that is just the vmpage index. Also, the lov layer is already not using it; it has its own version of that field up in cl_page. So removing cpl_index from the slice saves us 8 bytes: 4 for the VVP layer, which we don't need, and 4 for the LOV layer, which was already duplicated. Nothing amazing, but if we're squeezing... In proving to myself that was doable, I wrote a patch for it, so I'll push that in a few.
One more I didn't do: looking at cl_page with pahole, we have two adjacent enums which are both being rendered as 4-byte fields. (I'm a bit surprised GCC can't evaluate the number of items in an enum and squash them accordingly, but so it goes.) The second enum (page type) is only set at creation and not modified later, so it would be trivial to switch the two enums to an int and use bits rather than the enum. We would just set the page_type bit (or not set it) at creation time. We already only set cl_page_state from a function, so just modify that and you're good. Note also that there's a 4-byte hole in cl_page right now as well. We could move the new enum there. So that's another potential 16 bytes. |
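The cpl_index removal for the vvp layer amounts to deriving the index from the backing VM page rather than caching it in the slice. A simplified sketch follows; the struct layout and names here are illustrative, not the real vvp_page definition.

```c
/* Sketch: the VM page already records its offset within the file's
 * mapping, so the vvp slice does not need to duplicate it. */
#include <linux/mm_types.h>

struct vvp_page_sketch {
	struct page *vpg_page;   /* backing VM page */
	/* no cpl_index needed here */
};

static inline pgoff_t vvp_page_index(const struct vvp_page_sketch *vpg)
{
	return vpg->vpg_page->index;   /* page-cache index of the vmpage */
}
```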
| Comment by Patrick Farrell [ 04/Feb/20 ] |
|
Ah, hey, here's a slightly trickier one. I think we may be able to do without both cp_owner and cp_sync_io. I think it's probably possible to... hm. I'm not 100% sure, but I think the structure of the code is such that we could stick a pointer to cp_sync_io in the owning cl_io. That assumes, though, that the cl_io has a lifetime that covers the lifetime of the anchor, which, well, I haven't checked. |
| Comment by Gerrit Updater [ 04/Feb/20 ] |
|
Patrick Farrell (farr0186@gmail.com) uploaded a new patch: https://review.whamcloud.com/37417 |
| Comment by Shuichi Ihara [ 04/Feb/20 ] |
|
Tested the two patches in conjunction. Single client (2 x Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz, 192GB memory, 2 x IB-EDR) and AI400 (20 x NVMe):

# fio -name=randread -ioengine=libaio -rw=randread -blocksize=4k -iodepth=128 -direct=1 -size=2g -runtime=60 -numjobs=256 -group_reporting -directory=/ai400/out -create_serialize=0 -filename_format='f.$jobnum.$filenum'

4K Random Read |
| Comment by Wang Shilong (Inactive) [ 04/Feb/20 ] |
|
Thanks Ihara for testing! At least we confirmed there is no regression and that those patches help reduce memory usage. |
| Comment by Wang Shilong (Inactive) [ 04/Feb/20 ] |
|
Thanks adilger and paf0186, good ideas for reducing the cl_page size. I'll take a look at how we could do more. |
| Comment by James A Simmons [ 04/Feb/20 ] |
|
There is also patch https://review.whamcloud.com/#/c/30161 which was forgotten. |
| Comment by Gerrit Updater [ 04/Feb/20 ] |
|
Wang Shilong (wshilong@ddn.com) uploaded a new patch: https://review.whamcloud.com/37428 |
| Comment by Patrick Farrell [ 04/Feb/20 ] |
|
This probably needs its own JIRA, etc., but the vvp, osc, and lov page structs could also get a quick look, and I'm going to note that here and let someone else spin off a Jira if they're interested...

Looking at vvp_page and lov_page quickly, they seem very minimal. The OSC page is rather larger. The only obvious thing to remove from osc_page is ops_submit_time - I have never used it, and never previously knew it was there. I'm not sure it really offers useful functionality above and beyond just using debug logs, etc., so I think that could go. The other members of osc_page all seem necessary, but the packing (from pahole) is... weird. Some padding, some strange bitfields, etc. I'm not sure how much it could be improved, but perhaps someone more familiar with packing rules could take a look and see. (Andreas...
osc_async_page is a little odd, but I think the 15-bit hole isn't a big deal - if we combined the shorts, I believe we'd just end up with a 2-byte hole for alignment and no gain in space. However, it also has an OSC object pointer, which I believe could be found from the osc_page and the associated slice...? I am not 100% sure that connects to the same object, but I think it does. So that's another few tens of bytes of potential space in these structs. |
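A hedged sketch of the last point: instead of storing an object pointer inside osc_async_page, recover it from the enclosing osc_page's slice. The struct layouts and names below are simplified/hypothetical, not the real Lustre definitions.

```c
/* Sketch: osc_async_page is embedded in osc_page, so container_of()
 * gets us back to the page and from there to the slice's object. */
#include <stddef.h>

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct cl_object;                        /* defined elsewhere */

struct cl_page_slice_sketch {
	struct cl_object *cpl_obj;       /* object this layer's slice belongs to */
};

struct osc_async_page_sketch {
	int oap_cmd;                     /* ... other fields, no object pointer ... */
};

struct osc_page_sketch {
	struct cl_page_slice_sketch ops_cl;   /* the OSC slice */
	struct osc_async_page_sketch ops_oap; /* embedded async page */
};

static inline struct cl_object *
oap2obj(struct osc_async_page_sketch *oap)
{
	struct osc_page_sketch *opg =
		container_of(oap, struct osc_page_sketch, ops_oap);

	/* the real code would then convert this cl_object to its osc_object */
	return opg->ops_cl.cpl_obj;
}
```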
| Comment by James A Simmons [ 04/Feb/20 ] |
|
We have also |
| Comment by Patrick Farrell [ 04/Feb/20 ] |
|
A quick note on my earlier idea of compacting the cp_type and cp_state enum... That's actually a fair bit of work. It's relatively simple work, but there are a lot of small moving parts that have to be tweaked. Just an FYI that it's not as easy as it appears, so unless it's really the difference between squeezing and not squeezing under a limit, those 8 bytes (4 from the removed enum + 4 from better packing) may be better skipped. |
| Comment by Andreas Dilger [ 06/Feb/20 ] |
|
Actually, this turns out to be surprisingly easy, and it is something I never knew before: it is possible to use "enum cp_type cp_type:3" and it will only use 3 bits for the enum. As with any bitfield, it can be racy if updated by multiple threads concurrently, but I don't think that is an issue for the cl_page structures. The "enum foo:N" construct seems very useful, and I'd like to start using it in other data structures, instead of having e.g. a "__u32" and just a comment that says "this is enum foo". |
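For reference, a toy userspace example of the construct described above; the enum values and struct here are illustrative, not the real cl_page definitions.

```c
/* Two enums packed as bitfields share one word instead of 2 x 4 bytes. */
#include <stdio.h>

enum cp_state { CPS_CACHED, CPS_OWNED, CPS_PAGEOUT, CPS_PAGEIN, CPS_FREEING };
enum cp_type  { CPT_CACHEABLE, CPT_TRANSIENT };

struct page_flags_demo {
	enum cp_state cp_state:3;   /* 5 values fit in 3 bits */
	enum cp_type  cp_type:2;    /* 2 values fit in 2 bits */
};

int main(void)
{
	struct page_flags_demo d = { .cp_state = CPS_OWNED,
				     .cp_type  = CPT_TRANSIENT };

	printf("sizeof = %zu, state = %d, type = %d\n",
	       sizeof(d), d.cp_state, d.cp_type);
	return 0;
}
```

Note that the signedness of enum bitfields is implementation-defined, so the field width should comfortably cover the largest enumerator value. |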
| Comment by Gerrit Updater [ 08/Feb/20 ] |
|
Wang Shilong (wshilong@ddn.com) uploaded a new patch: https://review.whamcloud.com/37480 |
| Comment by Gerrit Updater [ 08/Feb/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37227/ |
| Comment by Gerrit Updater [ 08/Feb/20 ] |
|
Wang Shilong (wshilong@ddn.com) uploaded a new patch: https://review.whamcloud.com/37487 |
| Comment by Gerrit Updater [ 07/Apr/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37225/ |
| Comment by Gerrit Updater [ 07/Apr/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37417/ |
| Comment by Gerrit Updater [ 19/Jun/20 ] |
|
Wang Shilong (wshilong@ddn.com) uploaded a new patch: https://review.whamcloud.com/39103 |
| Comment by Gerrit Updater [ 04/Jul/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37428/ |
| Comment by Gerrit Updater [ 04/Jul/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37480/ |
| Comment by Gerrit Updater [ 04/Jul/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37487/ |
| Comment by Gerrit Updater [ 04/Jul/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39103/ |