Details

    • Technical task
    • Resolution: Fixed
    • Minor
    • 2829

    Description

  When running million-file createmany runs with the '-o' (open) option, we are seeing a drop in performance from its starting rate of about 500 creates per second to about 150 creates per second. This drop in performance seems to hit at the same time as when the ZFS ARC's 'arc_meta_used' reaches and then exceeds its 'arc_meta_limit'. We believe that since Lustre is doing its own caching and holds a reference to all of its objects, the ARC is unable to limit its cache to 'arc_meta_limit'. Thus, the ARC is spending useless effort trying to uphold its limit by dropping objects (which it can't, because of Lustre's references), and this is causing the create-rate decrease.

  One method to slightly relieve this is to use the ARC prune callback feature that was recently added to the ZFS on Linux project in commit: https://github.com/zfsonlinux/zfs/commit/ab26409db753bb087842ab6f1af943f3386c764f

  This would allow the ARC to notify Lustre that it needs to release some of the objects it is holding, so the ARC can free up part of its cache and uphold its 'arc_meta_limit'.
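The interplay can be sketched as a toy user-space model (the class and method names below are illustrative only, not the actual ZFS/Lustre API; the real hook is the prune-callback registration added in the commit above). The point it demonstrates: the cache can only free objects whose references the upper layer has dropped, so eviction frees nothing until the prune callback runs.

```python
# Toy model of the ARC prune callback (all names are hypothetical).
# The "ARC" can only evict objects with no outstanding references; the
# upper layer (osd-zfs in the real system) holds references and releases
# them, oldest first, when its prune callback is invoked.

class ToyOsd:
    def __init__(self):
        self.lru = []                      # held object ids, oldest first

    def hold(self, obj):
        self.lru.append(obj)

    def prune(self, nr):
        """Callback invoked by the cache: release up to nr oldest objects."""
        released, self.lru = self.lru[:nr], self.lru[nr:]
        return released

class ToyArc:
    def __init__(self, osd):
        self.osd = osd
        self.cached = set()

    def insert(self, obj):
        self.cached.add(obj)
        self.osd.hold(obj)                 # upper layer pins the object

    def evict(self, want):
        # Without the callback, every object is pinned and nothing can go.
        # With it, ask the upper layer to drop references, then free those.
        freed = self.osd.prune(want)
        for obj in freed:
            self.cached.discard(obj)
        return len(freed)

osd = ToyOsd()
arc = ToyArc(osd)
for i in range(8):
    arc.insert(i)
print(arc.evict(3))        # frees the 3 oldest objects -> 3
print(sorted(arc.cached))  # [3, 4, 5, 6, 7]
```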

      Attachments

        Activity

          [LU-2477] Poor MDS create performance due to ARC cache growth
          isaac Isaac Huang (Inactive) added a comment - - edited

          Thanks for the answers. I had assumed that the callback was added because the osd was pinning pages. Is there any reason for the callback to still be there, now that it's believed the osd doesn't pin any pages?

          isaac Isaac Huang (Inactive) added a comment - - edited

          There is no strong requirement to reference the dbuf from osd_object. We discussed this a bit with the ZFS team at Oracle (long ago): changing the DMU API so that we can pass a dbuf instead of a dnode# and save on the dnode#->dbuf lookup. It looked like they were aware of the potential performance improvement, but that wasn't done.

          bzzz Alex Zhuravlev added a comment -

          Isaac, I'll try to elaborate some more if I get time (idk if that'll happen with LUG next week), but I wanted to give you a couple quick answers before I forget.

          1. We introduced the callback a while ago because I thought osd-zfs was pinning data in the cache, but after looking at the ARC in greater detail, I'm no longer convinced this is happening. So, while it's definitely possible for the OSD to pin ARC pages by holding references to objects, it shouldn't do this and I no longer think it is. I haven't looked at the OSD code much, though; do you have some evidence that it is holding references for an extended period of time?

          2. Yes, if only a single read occurred then it won't go onto the MFU. But is that bad? If it's being cached in the upper layers, then does it matter what the ARC thinks (i.e. MRU or MFU), since you won't be going to disk for it anyway?

          prakash Prakash Surya (Inactive) added a comment -

          I've been reading the relevant code lately and have been wondering about two things.

          1. My understanding is: the ARC can't free objects in the metadata part of its cache because osd-zfs holds references on them. So osd-zfs registers a callback, which is called by the ARC to tell it to release objects. In other words, the ARC can only free those objects which osd-zfs chooses to release. But osd-zfs's choice is based on a simple LRU. Therefore, it seems to me that the ARC (at least the metadata part) has essentially been turned into an LRU, because it's the LRU policy of osd-zfs that ultimately determines which objects can be freed. Is this the case, or have I missed something? If it is the case, then isn't it bad?

          2. If multiple accesses to the same object can be handled entirely in the osd-zfs layer (in its hashed/cached objects) without going into the ARC, then would that prevent the ARC from seeing the access pattern and thus from learning and adapting? E.g. two accesses to the same object would move it from MRU to MFU in the ARC, but it probably stays in the MRU if the ARC sees only one access due to caching in osd-zfs.

          Alex or Prakash, can you please comment?
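          Point 2 above can be made concrete with a toy model (class and function names are hypothetical, and this is a drastic simplification of the real ARC's MRU/MFU machinery): an ARC-like cache promotes an object from MRU to MFU on its second access, but if an upper-layer cache absorbs the repeat accesses, the ARC never sees them and nothing is ever promoted.

```python
# Toy MRU/MFU model (illustrative names, not the real ARC).
# Promotion rule: first access puts an object on MRU; a second access
# while on MRU promotes it to MFU. An upper-layer (osd-style) cache that
# absorbs repeat accesses hides the access pattern from the ARC.

class ToyArcLists:
    def __init__(self):
        self.mru, self.mfu = set(), set()

    def access(self, obj):
        if obj in self.mru:                # second access: promote
            self.mru.discard(obj)
            self.mfu.add(obj)
        elif obj not in self.mfu:          # first access: onto MRU
            self.mru.add(obj)

def workload(arc, osd_cache_enabled):
    osd_cache = set()
    for obj in [1, 2, 1, 1, 2]:            # objects 1 and 2 are hot
        if osd_cache_enabled and obj in osd_cache:
            continue                       # hit absorbed by the upper layer
        arc.access(obj)
        osd_cache.add(obj)
    return arc

hot = workload(ToyArcLists(), osd_cache_enabled=False)
print(sorted(hot.mfu))                     # [1, 2]: ARC saw the repeats

cold = workload(ToyArcLists(), osd_cache_enabled=True)
print(sorted(cold.mfu))                    # []: repeats never reached the ARC
```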

          isaac Isaac Huang (Inactive) added a comment -

          I think it would be safe to close this. The prune function was landed, although I'm not entirely sure if that fixed the drop in create rate. The code has changed considerably, so I'm fine with opening a new ticket with updated info if needed.

          prakash Prakash Surya (Inactive) added a comment -

          Any update here? IIRC, the prune function was landed and we fixed a memory leak in the SA-related code.

          bzzz Alex Zhuravlev added a comment -

          Integrated in lustre-dev » x86_64,client,el6,inkernel #340
          ORI-481 osd-zfs: Register prune function with ARC (Revision c29aef2356638d93b418f23f2835980c8a944703)

          Result = SUCCESS
          Mikhail Pershin : c29aef2356638d93b418f23f2835980c8a944703
          Files :

          • lustre/osd-zfs/osd_internal.h
          • lustre/osd-zfs/osd_handler.c
          hudson Build Master (Inactive) added a comment -

          Integrated in lustre-dev » x86_64,server,el5,inkernel #340
          ORI-481 osd-zfs: Register prune function with ARC (Revision c29aef2356638d93b418f23f2835980c8a944703)

          Result = SUCCESS
          Mikhail Pershin : c29aef2356638d93b418f23f2835980c8a944703
          Files :

          • lustre/osd-zfs/osd_handler.c
          • lustre/osd-zfs/osd_internal.h
          hudson Build Master (Inactive) added a comment -

          Integrated in lustre-dev » i686,client,el5,inkernel #340
          ORI-481 osd-zfs: Register prune function with ARC (Revision c29aef2356638d93b418f23f2835980c8a944703)

          Result = SUCCESS
          Mikhail Pershin : c29aef2356638d93b418f23f2835980c8a944703
          Files :

          • lustre/osd-zfs/osd_handler.c
          • lustre/osd-zfs/osd_internal.h
          hudson Build Master (Inactive) added a comment -

          Integrated in lustre-dev » x86_64,server,el6,inkernel #340
          ORI-481 osd-zfs: Register prune function with ARC (Revision c29aef2356638d93b418f23f2835980c8a944703)

          Result = SUCCESS
          Mikhail Pershin : c29aef2356638d93b418f23f2835980c8a944703
          Files :

          • lustre/osd-zfs/osd_handler.c
          • lustre/osd-zfs/osd_internal.h
          hudson Build Master (Inactive) added a comment -

          Integrated in lustre-dev » i686,server,el5,inkernel #340
          ORI-481 osd-zfs: Register prune function with ARC (Revision c29aef2356638d93b418f23f2835980c8a944703)

          Result = SUCCESS
          Mikhail Pershin : c29aef2356638d93b418f23f2835980c8a944703
          Files :

          • lustre/osd-zfs/osd_handler.c
          • lustre/osd-zfs/osd_internal.h
          hudson Build Master (Inactive) added a comment -

          People

            bzzz Alex Zhuravlev
            prakash Prakash Surya (Inactive)
            Votes: 0
            Watchers: 6
