Details

    • Type: Technical task
    • Resolution: Fixed
    • Priority: Minor

    Description

      When running million-file runs of createmany with the '-o' (open) option, we see a drop in performance from its starting rate of about 500 creates per second to about 150 creates per second. This drop appears to hit at the same time as the ZFS ARC's 'arc_meta_used' reaches and then exceeds its 'arc_meta_limit'. We believe that, since Lustre does its own caching and holds a reference on all of its objects, the ARC is unable to keep its cache within 'arc_meta_limit'. As a result, the ARC spends useless effort trying to uphold its limit and drop objects (which it cannot, because of Lustre's references), and this is what causes the create rate to decrease.

      One method to slightly relieve this is to use the ARC prune callback feature that was recently added to the ZFS on Linux project in commit: https://github.com/zfsonlinux/zfs/commit/ab26409db753bb087842ab6f1af943f3386c764f

      This would allow the ARC to notify Lustre that it needs to release some of the objects it is holding, so the ARC can free up part of its cache and uphold its 'arc_meta_limit'.
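
      For illustration, here is a minimal sketch of how a consumer could register such a prune hook through the arc_add_prune_callback()/arc_remove_prune_callback() interface added by that commit. The osd structure, the shrink helper, and the exact meaning of the count argument are assumptions made for the example, not Lustre's actual implementation.

      #include <sys/arc.h>        /* arc_add_prune_callback(), arc_prune_t */

      struct my_osd_device;                                     /* hypothetical */
      void my_osd_shrink_object_cache(struct my_osd_device *, int64_t); /* hypothetical */

      /*
       * Hypothetical prune hook.  The ARC invokes it when arc_meta_used
       * exceeds arc_meta_limit, passing a hint of how much it would like
       * released (bytes or objects, depending on the ZFS version).
       */
      static void
      my_osd_arc_prune(int64_t nr_to_release, void *priv)
      {
              struct my_osd_device *osd = priv;

              /* Drop unreferenced entries from the osd object cache. */
              my_osd_shrink_object_cache(osd, nr_to_release);
      }

      static arc_prune_t *my_prune_cb;

      /* Register at device setup ... */
      static void
      my_osd_register_prune(struct my_osd_device *osd)
      {
              my_prune_cb = arc_add_prune_callback(my_osd_arc_prune, osd);
      }

      /* ... and remove at teardown. */
      static void
      my_osd_unregister_prune(void)
      {
              arc_remove_prune_callback(my_prune_cb);
      }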

      Attachments

        Activity

          [LU-2477] Poor MDS create performance due to ARC cache growth

          We have pretty much confirmed that the current LU object caching approach for ZFS has the side effect of preventing the ARC from caching any of the OIs. This makes FID lookups terribly expensive. The following patch disables the LU object cache for ZFS OSDs which allows the OIs to be properly cached. I don't have any solid performance numbers for specific workloads. But I wanted to post the patch to start a discussion on what the right way to fix this is.

          http://review.whamcloud.com/10237

          behlendorf Brian Behlendorf added a comment

          SAs are held for the period the lu_object stays in the cache, and any modifying operation usually touches some attributes (e.g. size, mtime, ctime). I wouldn't mind benchmarking this at some point (a bit busy at the moment) and will try to cook up a patch soon. From my previous experience, SAs are pretty expensive even in the current design, where we don't need to call sa_handle_get() on many objects. Say, to create a regular file we have to touch: the new MDT object, the directory, the object storing the last used OST ID, and last_rcvd, then potentially the changelog.

          bzzz Alex Zhuravlev added a comment

          we do pin the dbufs holding dnodes (e.g. osd_object_init() -> __osd_obj2dbuf() -> sa_buf_hold()), but at the moment there is another reason to do so: we cache a handle for the SA, which stores all the normal attributes (size/blocks/[acm]time/etc.) and which we access quite often. That handle, in turn, pins the dnode's dbuf anyway. If we did not cache the SA handle, we would have to initialize it every time, which I believe is quite expensive.

          That's very interesting. Keep in mind, each held dnode pins the dnode_t (~930 bytes, iirc) and the dnode's object block (a 16K block). So, worst case, each held SA will pin an additional ~16.5K. If the dnode_t's all belong to the same dnode object block (32 per 16K block), then best case it'll be an additional ~1.5K each ((32*930+16K)/32).

          How long are the SAs held, and for which operations? It'd be nice to have some benchmarks to say whether it's better for Lustre to manage them (e.g. keep holds) or to have the ARC manage them (e.g. via the MRU/MFU lists). I hope Lustre doesn't keep holds on them for long..?

          prakash Prakash Surya (Inactive) added a comment
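
          To make the arithmetic above concrete, here is a small standalone calculation using the approximate figures quoted in the comment (~930 bytes per in-core dnode_t, a 16K dnode object block, 32 dnodes per block). The sizes are the comment's estimates, not exact constants; the results land close to the ~16.5K and ~1.5K cited.

          #include <stdio.h>

          int main(void)
          {
                  const double dnode_size = 930.0;       /* ~size of in-core dnode_t  */
                  const double block_size = 16.0 * 1024; /* 16K dnode object block    */
                  const int per_block     = 32;          /* 512-byte dnodes per block */

                  /* Worst case: each held dnode is alone in its 16K block. */
                  double worst = dnode_size + block_size;

                  /* Best case: all 32 dnodes of a block are held, so the block
                   * cost is amortized across them: (32*930 + 16K) / 32. */
                  double best = (per_block * dnode_size + block_size) / per_block;

                  printf("worst case per held SA: ~%.1fK\n", worst / 1024); /* ~16.9K */
                  printf("best case per held SA:  ~%.1fK\n", best / 1024);  /* ~1.4K  */
                  return 0;
          }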

          we do pin the dbufs holding dnodes (e.g. osd_object_init() -> __osd_obj2dbuf() -> sa_buf_hold()), but at the moment there is another reason to do so: we cache a handle for the SA, which stores all the normal attributes (size/blocks/[acm]time/etc.) and which we access quite often. That handle, in turn, pins the dnode's dbuf anyway. If we did not cache the SA handle, we would have to initialize it every time, which I believe is quite expensive.

          bzzz Alex Zhuravlev added a comment
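
          As a rough illustration of the trade-off being discussed here - caching one SA handle per object (which keeps the dnode's dbuf held) versus re-acquiring a handle around every attribute access - the sketch below shows both variants. sa_handle_get(), sa_lookup() and sa_handle_destroy() are part of ZFS's system-attribute API; the surrounding structure and function names are hypothetical, not the actual osd-zfs code.

          #include <sys/sa.h>  /* sa_handle_get(), sa_lookup(), sa_handle_destroy() */

          /* Hypothetical per-object state, loosely modelled on an OSD object. */
          struct my_obj {
                  objset_t    *mo_os;      /* dataset the object lives in  */
                  uint64_t     mo_objnum;  /* dnode number                 */
                  sa_handle_t *mo_sa_hdl;  /* cached handle; pins the dbuf */
          };

          /*
           * Variant 1: cache the handle at object init.  Attribute access is
           * cheap afterwards, but the dnode's dbuf stays held until the handle
           * is destroyed, so the ARC cannot evict it.
           */
          static int
          my_obj_init_cached(struct my_obj *o)
          {
                  return sa_handle_get(o->mo_os, o->mo_objnum, o,
                                       SA_HDL_PRIVATE, &o->mo_sa_hdl);
          }

          static void
          my_obj_fini_cached(struct my_obj *o)
          {
                  sa_handle_destroy(o->mo_sa_hdl);  /* drops the dbuf hold */
          }

          /*
           * Variant 2: set up and tear down a handle around each access.
           * Nothing stays pinned between operations, but every attribute read
           * pays the handle-initialization cost again.
           */
          static int
          my_obj_get_size_uncached(struct my_obj *o, sa_attr_type_t size_attr,
                                   uint64_t *sizep)
          {
                  sa_handle_t *hdl;
                  int rc;

                  rc = sa_handle_get(o->mo_os, o->mo_objnum, NULL,
                                     SA_HDL_PRIVATE, &hdl);
                  if (rc != 0)
                          return rc;

                  rc = sa_lookup(hdl, size_attr, sizep, sizeof(*sizep));
                  sa_handle_destroy(hdl);
                  return rc;
          }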

          Thanks for the answers. I assumed that the callback was added because the OSD was pinning pages. Is there any reason for the callback to still be there, now that it's believed the OSD doesn't pin any pages?

          isaac Isaac Huang (Inactive) added a comment - edited

          There is no strong requirement to reference the dbuf from osd_object. We discussed this a bit with the ZFS team at Oracle (long ago): changing the DMU API so that we could pass a dbuf instead of a dnode# and save the dnode#->dbuf lookup. It looked like they were aware of the potential performance improvement, but that wasn't done.

          bzzz Alex Zhuravlev added a comment

          People

            bzzz Alex Zhuravlev
            prakash Prakash Surya (Inactive)
            Votes: 0
            Watchers: 6
