  Lustre / LU-10938 Metadata writeback cache support / LU-13563

WBC: Reclaim mechanism for cached inodes and pages under limits in MemFS

Details

    • Type: Technical task
    • Resolution: Unresolved
    • Priority: Minor

    Description

      It would be better to design a reclaim mechanism that frees up reserved inodes for new file creation, or cached pages for later I/O, when the cache becomes saturated.
      We could use a kernel shrinker daemon that runs periodically:

      • Unreserve inodes cached in MemFS.
      • Commit the cached pages from MemFS into Lustre (the assimilation phase) and remove them from the MemFS page accounting.

      The cache shrinker starts working once the cache allocation grows above the upper watermark, and it evicts files until the allocation drops below the lower watermark.
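      A minimal sketch of such a watermark-driven reclaim thread is shown below. All wbc_* names, the watermark fields, and wbc_evict_one() are hypothetical placeholders rather than the actual patch code; the sketch only illustrates the evict-until-below-low-watermark policy described above.

      /*
       * Illustrative only: a periodic reclaim thread for the WBC cache.
       * The wbc_* structures and helpers are assumptions, not the real code.
       */
      #include <linux/kthread.h>
      #include <linux/atomic.h>
      #include <linux/sched.h>

      struct wbc_cache {
              atomic_long_t wbcc_used;  /* reserved inodes + cached pages */
              unsigned long wbcc_hiwm;  /* upper watermark: start reclaim */
              unsigned long wbcc_lowm;  /* lower watermark: stop reclaim  */
      };

      /* Evict one cached file: unreserve its inode and assimilate any cached
       * pages from MemFS into Lustre; returns the number of units freed. */
      long wbc_evict_one(struct wbc_cache *cache);

      static int wbc_reclaim_main(void *arg)
      {
              struct wbc_cache *cache = arg;

              while (!kthread_should_stop()) {
                      /* Only act once usage exceeds the upper watermark. */
                      if (atomic_long_read(&cache->wbcc_used) > cache->wbcc_hiwm) {
                              /* Evict files until usage drops below the lower one. */
                              while (atomic_long_read(&cache->wbcc_used) >
                                     cache->wbcc_lowm) {
                                      if (wbc_evict_one(cache) <= 0)
                                              break;  /* nothing reclaimable */
                              }
                      }
                      schedule_timeout_interruptible(5 * HZ);
              }
              return 0;
      }

      The thread itself could be started with kthread_run(wbc_reclaim_main, cache, "wbc_reclaim") at client mount time; the 5-second interval and the watermark values are arbitrary here.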


        Activity


          Gerrit Updater added a comment:

          Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38875
          Subject: LU-13563 mdt: ignore quota when creating slave stripe
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: d497a600c487fd62401d776cea7d18644a74d4e2

          Gerrit Updater added a comment:

          Yingjin Qian (qian@ddn.com) uploaded a new patch: https://review.whamcloud.com/38749
          Subject: LU-13563 wbc: reclaim mechanism for pages cached in MemFS
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 66944d44e32abbb8a8332746cfa2fa5961267515

          Gerrit Updater added a comment (edited):

          Yingjin Qian (qian@ddn.com) uploaded a new patch: https://review.whamcloud.com/38739
          Subject: LU-13563 wbc: reclaim mechanism for inodes cached in MemFS
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 6bc9b4310c573426e62166168e55c6cce972bbbb
          Qian Yingjin (qian_wc) added a comment:

          "

          One thing that we have to worry about is delaying writeback to the MDT/OST for too long, as that can cause memory pressure to increase significantly, and we will have wasted tens of seconds not sending RPCs, which could have written GBs of dirty data during that time. I think as much as possible it makes sense to have a "write early, free late" kind of policy that we have for dirty file data so that we don't waste the bandwidth/IOPS just waiting until we are short of memory.

          "

          Can we tune the kernel writeback parameters to achieve this goal?

          Linux Writeback Settings

          Variable                   Description
          dirty_background_ratio    As a percentage of total memory, the number of pages at which the flusher threads begin writeback of dirty data.
          dirty_expire_centisecs    In hundredths of a second, how old data must be to be written out the next time a flusher thread wakes to perform periodic writeback.
          dirty_ratio               As a percentage of total memory, the number of pages a process generates before it begins writeback of dirty data.
          dirty_writeback_centisecs In hundredths of a second, how often a flusher thread should wake up to write data back out to disk.
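          For example (values purely illustrative, and they would need benchmarking), these thresholds could be lowered in /etc/sysctl.conf so that periodic writeback starts earlier and dirty data is expired sooner:

          # Illustrative values only.
          # Start background flushing once 5% of RAM is dirty.
          vm.dirty_background_ratio = 5
          # Treat dirty data as expired (eligible for writeback) after 10 seconds.
          vm.dirty_expire_centisecs = 1000
          # Wake the flusher threads every 2 seconds.
          vm.dirty_writeback_centisecs = 200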

           

          Moreover, for data I/O pages, we can set a per-file limit on the number of cache pages kept in MemFS. If a file exceeds this threshold (e.g. max_pages_per_rpc: 16M? or only 1M, to allow caching many more small files), the client will assimilate the cached pages from MemFS into Lustre. After that, all data I/O on this file is directed to the Lustre OSTs via the normal Lustre I/O path.
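          A rough sketch of that per-file check is below; wbc_inode, wbci_cache_pages, wbc_max_pages_per_file and wbc_commit_data_pages() are made-up names standing in for whatever the patches actually use, and the 16M-vs-1M threshold question above is left as a tunable.

          #include <linux/fs.h>
          #include <linux/types.h>

          /* Illustrative per-inode WBC state; not the real structure. */
          struct wbc_inode {
                  unsigned long wbci_cache_pages;  /* pages cached in MemFS */
                  bool          wbci_assimilated;  /* data moved to OSTs    */
          };

          /* Tunable per-file cap, e.g. 4096 x 4 KiB pages = 16 MiB. */
          static unsigned long wbc_max_pages_per_file = 4096;

          struct wbc_inode *wbc_inode_get(struct inode *inode);
          void wbc_commit_data_pages(struct inode *inode);

          /* Called on the buffered-write path: once a file exceeds the per-file
           * limit, assimilate its pages into Lustre so that further I/O on the
           * file goes through the normal OST I/O path. */
          static void wbc_check_page_limit(struct inode *inode)
          {
                  struct wbc_inode *wbci = wbc_inode_get(inode);

                  if (!wbci->wbci_assimilated &&
                      wbci->wbci_cache_pages > wbc_max_pages_per_file) {
                          wbc_commit_data_pages(inode);
                          wbci->wbci_assimilated = true;
                  }
          }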


          Gerrit Updater added a comment:

          Yingjin Qian (qian@ddn.com) uploaded a new patch: https://review.whamcloud.com/38697
          Subject: LU-13563 wbc: lfs wbc unreserve command to reclaim inodes
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: be6a95ca03a262d9e7ec3330a94be7b4d086b921

          Andreas Dilger (adilger) added a comment:

          Could we just hook into the existing kernel inode/slab/page shrinkers to manage this? One thing that is important to remember is that these shrinkers are essentially a "notification method" from the kernel about memory pressure, but we should still be free to add/modify the inodes/pages that are being flushed at one time to be more IO/RPC friendly (e.g. selecting contiguous pages to write to the OST, though I'm not sure what would be best for MDT aggregation).

          One thing that we have to worry about is delaying writeback to the MDT/OST for too long, as that can cause memory pressure to increase significantly, and we will have wasted tens of seconds not sending RPCs, which could have written GBs of dirty data during that time. I think as much as possible it makes sense to have a "write early, free late" kind of policy that we have for dirty file data so that we don't waste the bandwidth/IOPS just waiting until we are short of memory.

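          For reference, hooking into the kernel's memory-pressure notification would look roughly like the register_shrinker() pattern below (the older single-argument form); the wbc_* names are placeholders, and the scan callback is where the WBC layer could batch inodes/pages into RPC-friendly groups rather than freeing them one at a time.

          #include <linux/shrinker.h>
          #include <linux/atomic.h>

          static atomic_long_t wbc_cached_count = ATOMIC_LONG_INIT(0);

          /* Placeholder: flush/unreserve up to @nr cached objects, ideally in
           * batches that map onto efficient RPCs; returns how many were freed. */
          unsigned long wbc_reclaim(unsigned long nr);

          static unsigned long wbc_shrink_count(struct shrinker *shrink,
                                                struct shrink_control *sc)
          {
                  /* Tell the VM how many objects we could free right now. */
                  return atomic_long_read(&wbc_cached_count);
          }

          static unsigned long wbc_shrink_scan(struct shrinker *shrink,
                                               struct shrink_control *sc)
          {
                  unsigned long freed = wbc_reclaim(sc->nr_to_scan);

                  return freed ? freed : SHRINK_STOP;
          }

          static struct shrinker wbc_shrinker = {
                  .count_objects = wbc_shrink_count,
                  .scan_objects  = wbc_shrink_scan,
                  .seeks         = DEFAULT_SEEKS,
          };

          /* Registered at WBC setup (e.g. client mount), unregistered on cleanup. */
          static int wbc_register_shrinker(void)
          {
                  return register_shrinker(&wbc_shrinker);
          }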

          People

            Assignee: Qian Yingjin (qian_wc)
            Reporter: Qian Yingjin (qian_wc)
            Votes: 0
            Watchers: 4
