Metadata writeback cache support (LU-10938)

[LU-13563] WBC: Reclaim mechanism for cached inodes and pages under limits in MemFS Created: 15/May/20  Updated: 13/Feb/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Technical task Priority: Minor
Reporter: Qian Yingjin Assignee: Qian Yingjin
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
Rank (Obsolete): 9223372036854775807

 Description   

It would be better to design a reclaim mechanism that, when the cache is saturated, frees up some reserved inodes for new file creation or some cache pages for later I/O.
We could use a kernel shrinker daemon that runs periodically:

  • Unreserve inodes cached in MemFS.
  • Commit the cache pages from MemFS into Lustre (assimilation phase) and unaccount all cached pages from the MemFS limits.

The cache shrinker starts working once the cache allocation exceeds the upper watermark, and it evicts files until the allocation drops below the lower watermark.
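
The two-watermark behaviour can be sketched as a small user-space model. This is only an illustration, not the WBC implementation: the class name `MemFSCacheModel`, the page-count accounting, and the oldest-first eviction order are all assumptions made for the example.

```python
from collections import OrderedDict


class MemFSCacheModel:
    """Toy model of watermark-driven reclaim (hypothetical, not Lustre code)."""

    def __init__(self, upper_watermark, lower_watermark):
        if lower_watermark >= upper_watermark:
            raise ValueError("lower watermark must be below upper watermark")
        self.upper = upper_watermark
        self.lower = lower_watermark
        self.files = OrderedDict()  # file name -> cached pages, oldest first
        self.allocated = 0          # total pages currently accounted

    def cache_file(self, name, pages):
        """Account `pages` for `name`, replacing any earlier entry."""
        self.allocated += pages - self.files.pop(name, 0)
        self.files[name] = pages

    def shrink(self):
        """Run one shrinker pass and return the names of evicted files."""
        evicted = []
        # Do nothing until the allocation has grown past the upper watermark.
        if self.allocated <= self.upper:
            return evicted
        # Evict (oldest first, an assumption) until below the lower watermark.
        while self.files and self.allocated >= self.lower:
            name, pages = self.files.popitem(last=False)
            self.allocated -= pages
            evicted.append(name)
        return evicted


cache = MemFSCacheModel(upper_watermark=100, lower_watermark=50)
for name in ("a", "b", "c"):
    cache.cache_file(name, 40)
evicted_first = cache.shrink()   # allocation was 120, above the upper mark
evicted_second = cache.shrink()  # allocation is now below the upper mark
```

The gap between the two watermarks gives hysteresis: one pass frees a batch of files, and the shrinker then stays idle until the cache fills up again.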



 Comments   
Comment by Andreas Dilger [ 15/May/20 ]

Could we just hook into the existing kernel inode/slab/page shrinkers to manage this? One thing that is important to remember is that these shrinkers are essentially a "notification method" from the kernel about memory pressure, but we should still be free to add/modify the inodes/pages that are being flushed at one time to be more IO/RPC friendly (e.g. selecting contiguous pages to write to the OST, though I'm not sure what would be best for MDT aggregation).
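
As a rough illustration of the "IO/RPC friendly" selection mentioned above, the sketch below groups dirty page indices into contiguous extents capped at a maximum size. The function name `contiguous_extents` and the cap parameter are hypothetical; how Lustre actually builds its RPCs is not described in this ticket.

```python
def contiguous_extents(page_indices, max_pages_per_extent):
    """Group page indices into runs of consecutive pages.

    Returns a list of (start_index, page_count) tuples, each covering at
    most `max_pages_per_extent` pages. Hypothetical helper for illustration.
    """
    if max_pages_per_extent < 1:
        raise ValueError("max_pages_per_extent must be at least 1")
    extents = []
    for index in sorted(set(page_indices)):
        if extents:
            start, count = extents[-1]
            # Extend the current run if this page directly follows it
            # and the run has not yet reached the size cap.
            if index == start + count and count < max_pages_per_extent:
                extents[-1] = (start, count + 1)
                continue
        extents.append((index, 1))
    return extents


dirty_pages = [7, 0, 1, 2, 3, 4, 9, 8]
extents = contiguous_extents(dirty_pages, max_pages_per_extent=4)
```

A shrinker callback could then flush whole extents rather than the exact pages the kernel happened to nominate.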

One thing that we have to worry about is delaying writeback to the MDT/OST for too long, as that can cause memory pressure to increase significantly, and we will have wasted tens of seconds not sending RPCs, which could have written GBs of dirty data during that time. I think as much as possible it makes sense to have a "write early, free late" kind of policy that we have for dirty file data so that we don't waste the bandwidth/IOPS just waiting until we are short of memory.

Comment by Gerrit Updater [ 22/May/20 ]

Yingjin Qian (qian@ddn.com) uploaded a new patch: https://review.whamcloud.com/38697
Subject: LU-13563 wbc: lfs wbc unreserve command to reclaim inodes
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: be6a95ca03a262d9e7ec3330a94be7b4d086b921

Comment by Qian Yingjin [ 22/May/20 ]

"

One thing that we have to worry about is delaying writeback to the MDT/OST for too long, as that can cause memory pressure to increase significantly, and we will have wasted tens of seconds not sending RPCs, which could have written GBs of dirty data during that time. I think as much as possible it makes sense to have a "write early, free late" kind of policy that we have for dirty file data so that we don't waste the bandwidth/IOPS just waiting until we are short of memory.

"

Can we tune the kernel writeback parameters to achieve this goal?

Linux Writeback Settings

|| Variable || Description ||
| dirty_background_ratio | As a percentage of total memory, the number of pages at which the flusher threads begin writeback of dirty data. |
| dirty_expire_centisecs | In hundredths of a second, how old data must be to be written out the next time a flusher thread wakes to perform periodic writeback. |
| dirty_ratio | As a percentage of total memory, the number of pages a process generates before it begins writeback of dirty data. |
| dirty_writeback_centisecs | In hundredths of a second, how often a flusher thread should wake up to write data back out to disk. |
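
To make the units concrete, the sketch below converts the four settings into byte thresholds and seconds. It is a simplification under stated assumptions: the function name `writeback_thresholds` is made up for this example, the memory size is passed in by the caller (the kernel documentation describes the ratios relative to available memory, which this sketch does not compute), and the `*_centisecs` values are treated as hundredths of a second, as their names indicate.

```python
def writeback_thresholds(memory_bytes, dirty_background_ratio, dirty_ratio,
                         dirty_expire_centisecs, dirty_writeback_centisecs):
    """Translate vm.dirty_* settings into bytes and seconds (illustrative)."""
    return {
        # Dirty memory at which flusher threads start background writeback.
        "background_bytes": memory_bytes * dirty_background_ratio // 100,
        # Dirty memory at which a writing process starts writeback itself.
        "throttle_bytes": memory_bytes * dirty_ratio // 100,
        # Age after which dirty data is eligible for periodic writeback.
        "expire_seconds": dirty_expire_centisecs / 100,
        # Interval between periodic flusher wakeups.
        "wakeup_seconds": dirty_writeback_centisecs / 100,
    }


# Example values only; check /proc/sys/vm/ on the client for the real ones.
thresholds = writeback_thresholds(
    memory_bytes=1_000_000,
    dirty_background_ratio=10,
    dirty_ratio=20,
    dirty_expire_centisecs=3000,
    dirty_writeback_centisecs=500,
)
```

Lowering dirty_expire_centisecs and dirty_background_ratio pushes writeback earlier, which is the direction the "write early, free late" policy asks for.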

 

Moreover, for data I/O pages, we can limit the number of pages cached in MemFS per file. If a file exceeds this threshold (e.g. max_pages_per_rpc: 16M? or only 1M, so that many more small files can be cached), the client will assimilate its cache pages from MemFS into Lustre. After that, all data I/O on this file is directed to the Lustre OSTs via the normal Lustre I/O path.
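
The per-file limit could behave roughly like the model below. This is a sketch under assumptions, not the patch itself: the `CachedFile` class and its fields are hypothetical, and the limit of 256 pages corresponds to the 1M option only if pages are 4 KiB.

```python
class CachedFile:
    """Toy model of a per-file MemFS page limit (hypothetical names)."""

    def __init__(self, max_cached_pages):
        self.max_cached_pages = max_cached_pages
        self.memfs_pages = 0      # pages currently cached in MemFS
        self.lustre_pages = 0     # pages handed to the normal Lustre I/O path
        self.assimilated = False  # True once the file has left MemFS

    def write_pages(self, count):
        """Account `count` new pages; return "memfs" or "lustre"."""
        within_limit = self.memfs_pages + count <= self.max_cached_pages
        if not self.assimilated and within_limit:
            self.memfs_pages += count
            return "memfs"
        if not self.assimilated:
            # Threshold exceeded: commit the cached pages into Lustre once.
            self.lustre_pages += self.memfs_pages
            self.memfs_pages = 0
            self.assimilated = True
        # From now on all data I/O for this file bypasses MemFS.
        self.lustre_pages += count
        return "lustre"


f = CachedFile(max_cached_pages=256)  # 1 MiB, assuming 4 KiB pages
first = f.write_pages(200)   # fits under the limit
second = f.write_pages(100)  # would exceed it, so the file is assimilated
third = f.write_pages(1)     # stays on the Lustre path afterwards
```

A smaller limit keeps each file's footprint low so that more small files fit in the cache, while a larger one lets medium-sized files stay in MemFS longer.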

Comment by Gerrit Updater [ 28/May/20 ]

Yingjin Qian (qian@ddn.com) uploaded a new patch: https://review.whamcloud.com/38739
Subject: LU-13563 wbc: reclaim mechanism for inodes cached in MemFS
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 6bc9b4310c573426e62166168e55c6cce972bbbb

Comment by Gerrit Updater [ 28/May/20 ]

Yingjin Qian (qian@ddn.com) uploaded a new patch: https://review.whamcloud.com/38749
Subject: LU-13563 wbc: reclaim mechanism for pages cached in MemFS
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 66944d44e32abbb8a8332746cfa2fa5961267515

Comment by Gerrit Updater [ 09/Jun/20 ]

Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38875
Subject: LU-13563 mdt: ignore quota when creating slave stripe
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d497a600c487fd62401d776cea7d18644a74d4e2

Generated at Sat Feb 10 03:02:20 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.