Details

    • Type: Technical task
    • Resolution: Fixed
    • Priority: Minor

    Description

      When running million-file runs of createmany with the '-o' (open) option, we are seeing a drop in performance from its starting rate of about 500 creates per second to about 150 creates per second. This drop in performance seems to hit at the same time as when the ZFS ARC's 'arc_meta_used' reaches and then exceeds its 'arc_meta_limit'. We believe that since Lustre is doing its own caching and holds a reference to all of its objects, the ARC is unable to limit its cache to 'arc_meta_limit'. Thus, the ARC spends useless effort trying to uphold its limit by dropping objects (which it cannot, because of Lustre's references), and this is causing the create rate to decrease.

      One method to slightly relieve this is to use the ARC prune callback feature that was recently added to the ZFS on Linux project in commit: https://github.com/zfsonlinux/zfs/commit/ab26409db753bb087842ab6f1af943f3386c764f

      This would allow the ARC to notify Lustre that it needs to release some of the objects it is holding, so the ARC can free up part of its cache and uphold its 'arc_meta_limit'.
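
      For reference, the interface added by that commit looks roughly like the following (signatures paraphrased from the ZFS on Linux arc.h of that era, not copied from this ticket):

        /* Consumer-supplied callback the ARC invokes when it wants metadata
         * released; 'bytes' is a hint for how much the ARC would like freed. */
        typedef void arc_prune_func_t(int64_t bytes, void *private);

        /* Register/unregister a prune callback.  The returned handle is kept
         * by the caller and passed back at teardown. */
        arc_prune_t *arc_add_prune_callback(arc_prune_func_t *func, void *private);
        void arc_remove_prune_callback(arc_prune_t *p);

      Callbacks registered this way are invoked from the ARC's metadata reclaim path when 'arc_meta_used' is over 'arc_meta_limit', giving a DMU consumer such as Lustre a chance to drop its references.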

      Attachments

        Activity

          [LU-2477] Poor MDS create performance due to ARC cache growth

          Integrated in lustre-dev » i686,client,el6,inkernel #340
          ORI-481 osd-zfs: Register prune function with ARC (Revision c29aef2356638d93b418f23f2835980c8a944703)

          Result = SUCCESS
          Mikhail Pershin : c29aef2356638d93b418f23f2835980c8a944703
          Files :

          • lustre/osd-zfs/osd_internal.h
          • lustre/osd-zfs/osd_handler.c
          hudson Build Master (Inactive) added a comment

          Integrated in lustre-dev » x86_64,client,el5,inkernel #340
          ORI-481 osd-zfs: Register prune function with ARC (Revision c29aef2356638d93b418f23f2835980c8a944703)

          Result = SUCCESS
          Mikhail Pershin : c29aef2356638d93b418f23f2835980c8a944703
          Files :

          • lustre/osd-zfs/osd_internal.h
          • lustre/osd-zfs/osd_handler.c
          hudson Build Master (Inactive) added a comment

          Is there a specific `env` variable I should pass to the `lu_site_purge` function? It's unclear to me exactly how the environments are used.

          My initial thinking is I can use a call to `lu_site_purge` similar to the one found in `osd_device_free` in the osd-zfs layer.

          prakash Prakash Surya (Inactive) added a comment
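
          For illustration, one plausible pattern, modeled on the lu_site_purge() call in osd_device_free() (the LCT_DT_THREAD tag, the helper name, and the od_site field name here are assumptions, not taken from this ticket):

            /* Hypothetical helper: drop up to 'nr' unreferenced objects from
             * this OSD's lu_site.  A short-lived private environment is set
             * up just for the purge and torn down immediately afterwards. */
            static int osd_site_purge(struct osd_device *osd, int nr)
            {
                    struct lu_env env;
                    int rc;

                    rc = lu_env_init(&env, LCT_DT_THREAD);
                    if (rc != 0)
                            return rc;

                    /* lu_site_purge() only frees objects with no active
                     * references, so anything still pinned stays cached. */
                    lu_site_purge(&env, &osd->od_site, nr);

                    lu_env_fini(&env);
                    return 0;
            }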

          Yes, it would be best to keep this code in the zfs-osd. It just wasn't clear to us whether it was possible to drop entries from the lu site safely from within the zfs-osd code. If we can use lu_site_purge() for this, that would be perfect.

          behlendorf Brian Behlendorf added a comment

          hmm. Lustre does not bypass kernel infrastructure related to memory management: it registers lu_cache_shrink() and it's called by the kernel as expected.

          given this callback is specific to ARC, it should be done within zfs-osd, which in turn can call lu_site_purge()
          on its own site (in contrast to all registered sites).

          bzzz Alex Zhuravlev added a comment
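
          A sketch of what such an osd-zfs-local hook could look like, combining the prune callback with lu_site_purge() on the device's own site (the names, the LCT_DT_THREAD tag, and the bytes-to-object-count conversion are illustrative; this is not the landed patch):

            /* Sketch: ARC prune callback owned by osd-zfs.  It purges only
             * this device's own lu_site rather than every registered site. */
            static void osd_arc_prune(int64_t bytes, void *private)
            {
                    struct osd_device *osd = private;
                    struct lu_env env;

                    if (lu_env_init(&env, LCT_DT_THREAD) != 0)
                            return;

                    /* Crude conversion from a byte target to an object
                     * count; the shift is an arbitrary placeholder. */
                    lu_site_purge(&env, &osd->od_site, (int)(bytes >> 10));

                    lu_env_fini(&env);
            }

          At device setup the handle returned by arc_add_prune_callback(osd_arc_prune, osd) would be saved in the osd_device, and released again with arc_remove_prune_callback() at teardown.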

          Here's a graph of our create rate performance. This is on a freshly rebooted MDS. We believe the first drop to ~300 creates per second corresponds to reaching the 'arc_meta_limit' while still having enough non-Lustre buffers to drop to stay at the limit, and the second drop to ~150 corresponds to all non-Lustre buffers having been dropped and the cache now exceeding the 'arc_meta_limit'.

          prakash Prakash Surya (Inactive) added a comment

          Alex, I believe the ARC is already doing this coordination with the kernel. From my understanding, when the ARC reaches this limit, it will try to use the existing shrinkers to have the kernel/ZPL drop any references it can, allowing the ARC to drop the buffers completely. The problem is that Lustre bypasses the ZPL and kernel infrastructure, talking directly with the DMU, and thus bypasses all the existing shrinker infrastructure.

          Without the callback, Lustre pins its data in the ARC by always holding a reference to it, and there is currently no functionality to tell Lustre to drop this reference. This callback is used to inform Lustre to drop what it can.

          Also, I don't think the kernel can sort this out, because it has no idea of the ARC's meta limit. As far as I understand it, the ARC provides functionality similar to Linux's page cache, but without using any of the Linux infrastructure. From the kernel's perspective, the ARC is using X bytes and there is no need to reclaim, since the system still has plenty of available memory. But from the ARC's perspective, it wants to limit itself to Y bytes, and when X surpasses Y, it needs to reclaim no matter how much memory is available on the system as a whole.

          prakash Prakash Surya (Inactive) added a comment

          Changing this bug summary to reflect issues with ARC cache shrinking. Other aspects of this performance include xattr overhead (addressed in ORI-361 - SA-based xattrs).

          adilger Andreas Dilger added a comment

          I tend to think the better approach would be to ask the kernel for more pages and let the kernel sort this problem out (using all shrinkers, pagecache, etc.) - it might be that a lot of memory is being consumed by something else in some cases.

          bzzz Alex Zhuravlev added a comment

          People

            bzzz Alex Zhuravlev
            prakash Prakash Surya (Inactive)
            Votes: 0
            Watchers: 6

            Dates

              Created:
              Updated:
              Resolved: