
Limit lu_object cache (ZFS and osd-zfs)


    Description

      For OSDs like ZFS to perform optimally, it's important that they be allowed to manage their own cache. This maximizes the likelihood that the ARC will prefetch and cache the right buffers. In the existing ZFS OSD code, a cached LU object pins buffers in the ARC, preventing them from being dropped. As the LU cache grows it can consume the entire ARC, preventing buffers for other objects, such as the OIs, from being cached and severely impacting the performance of FID lookups.

      The proposed patch addresses this by limiting the size of the lu_cache, but alternate approaches are welcome. We are carrying this patch in LLNL's tree and it does help considerably.

      Activity
            yujian Jian Yu added a comment -

            Here is the back-ported patch for Lustre b2_5 branch: http://review.whamcloud.com/12001


            bzzz Alex Zhuravlev added a comment -

            > Is there any expensive operation we want to avoid when creating an object from ARC buffers (i.e. no disk IO)?

            OI lookup, SA initialization.

            isaac Isaac Huang (Inactive) added a comment -

            > lu_object_put() calls ->loo_object_release() when the last reference to the object is gone, but this is not what we need, I guess: this won't work for a client accessing a directory exclusively, as every time an RPC completes we'll get ->loo_object_release(), while a few cycles later we get another RPC to the same directory.

            In this case, if the ARC is doing a decent job, the buffers should still be cached and no IO will be needed to create the object again. Is there any expensive operation we want to avoid when creating an object from ARC buffers (i.e. no disk IO)?

            isaac Isaac Huang (Inactive) added a comment -

            My knowledge of the Lustre server stack is very limited, so I'm not sure whether it's feasible or not. But here are my thoughts:

            1. Get rid of the LRU completely. Objects are freed once the last reference is dropped. Then it'd be equivalent to the ZPL way of holding on to DMU objects/buffers only for the duration of system calls. This also gives the ARC the freedom to decide which buffers to keep or evict. After all, the ARC is supposed to do a better job than a simple LRU.

            2. When osd-zfs has the knowledge that certain objects are frequently used or will be used soon, hold references to those objects proactively. For example:

            • If last_rcvd is used for most RPCs, hold a ref for the lifetime of the MDS kernel module.
            • When an RPC is queued, do some preprocessing, look at the objects that will be needed, and look them up in the lu_site cache:
              • If it's already there, add a ref to it so that it stays in the cache.
              • If it's not there already, we may do nothing if the cache size is near a threshold, or load the object into the cache aggressively.

            This way the ARC has the freedom it needs, and osd-zfs also contributes when it knows better what to cache. It should be able to handle the case Alex outlined where a client accesses a directory exclusively, because the queued RPCs will keep objects used by the current RPC in the cache.

            bzzz Alex Zhuravlev added a comment -

            > That sounds reasonable to me. Do we have an easy way to tell the difference between frequently accessed objects which should keep their SA cached and rarely accessed objects where it's less critical? I don't want to cache more than we have to.

            lu_object_put() calls ->loo_object_release() when the last reference to the object is gone, but this is not what we need, I guess: this won't work for a client accessing a directory exclusively, as every time an RPC completes we'll get ->loo_object_release(), while a few cycles later we get another RPC to the same directory.

            We could introduce yet another method, probably, to release resources from the objects at the tail of the LRU, but that adds complexity to the algorithm and additional overhead. This is why I like the idea of limiting the cache, but the limit I had in mind was in the millions (so the memory footprint isn't enormous), rather than literally a few objects.

            > Sure, but the MM system has code to deal with this. The dentry cache is always pruned before the inode cache which ensures some number of inodes can always be freed.

            Well, we do register lu_cache_shrink(), which is how the MM recycles the memory? Very similar, if not the same?

            behlendorf Brian Behlendorf added a comment -

            > IMHO, ideally we shouldn't pin SA for rarely used objects, but for frequently accessed ones

            That sounds reasonable to me. Do we have an easy way to tell the difference between frequently accessed objects which should keep their SA cached and rarely accessed objects where it's less critical? I don't want to cache more than we have to.

            > also, notice VFS does pin inode with dentry.

            Sure, but the MM system has code to deal with this. The dentry cache is always pruned before the inode cache, which ensures some number of inodes can always be freed.

            People

              utopiabound Nathaniel Clark
              behlendorf Brian Behlendorf