[LU-16680] add lowmem sync feature Created: 29/Mar/23  Updated: 25/Jul/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement
Priority: Minor
Reporter: Patrick Farrell
Assignee: Patrick Farrell
Resolution: Unresolved
Votes: 0
Labels: None

Issue Links:
Duplicate
is duplicated by LU-16696 Lustre memcg oom workaround for unpat... Resolved
Related
is related to LU-16671 Fixing unstable pages support in Lustre Resolved
is related to LU-16697 Lustre should set appropriate BDI_CAP... Resolved

 Description   

Currently, when the system is low on memory, it will start asking Lustre to release pages and force them out of the page cache.  However, when we do a buffered write, we pin the pages until they are committed on the OST.  There's currently no way for the kernel to get us to do anything about those pages - they just sit there, taking up memory, until the OST does a commit and the client gets an RPC that updates last_committed.  This can take several seconds, which means that if the memory limit is low and/or write speed is high, tasks doing IO will get OOM-killed for failing to free memory.
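
To give a rough sense of scale, here is a back-of-the-envelope sketch of how much memory can end up pinned between commits.  The write rate, commit delay and memory limit are illustrative assumptions, not measurements from this ticket, and the function names are hypothetical.

```python
# Illustrative estimate only: every figure here is an assumption.
PAGE_SIZE = 4096  # bytes; a common page size, assumed for this sketch
GiB = 1024 ** 3


def pinned_pages(write_rate_bytes_per_s, commit_delay_s, page_size=PAGE_SIZE):
    """Pages written but still uncommitted after one commit delay."""
    return int(write_rate_bytes_per_s * commit_delay_s) // page_size


def exceeds_limit(write_rate_bytes_per_s, commit_delay_s, mem_limit_bytes):
    """True if the pinned (uncommitted) pages alone exceed the memory limit."""
    pinned_bytes = pinned_pages(write_rate_bytes_per_s, commit_delay_s) * PAGE_SIZE
    return pinned_bytes > mem_limit_bytes


# Hypothetical case: 1 GiB/s of buffered writes, 5 s until the commit is
# reported back, and a 2 GiB memory limit on the writing task.
print(pinned_pages(1 * GiB, 5))
print(exceeds_limit(1 * GiB, 5, 2 * GiB))
```

Under these assumed numbers the uncommitted pages alone are more than twice the limit, and none of them can be reclaimed until last_committed moves forward.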

We've tried to solve this in the past by integrating NFS unstable pages tracking into Lustre, but this is fraught - it treats our uncommitted pages as dirty, which means we get rate limited on them.  The kernel's idea of an appropriate number of outstanding pages is based on local file systems and isn't enough for us, so this causes performance issues.  The SOFT_SYNC feature we created to work with unstable pages also just asks the OST nicely to do a commit, and includes no way for the client to be notified quickly.

This means it can't be responsive enough to avoid tasks getting OOM-killed.

This ticket is to track a patch using a simpler approach:
When we are doing IO and we detect memory pressure, force the client to do a sync RPC.  This does two things: it pauses client IO while we're in severe memory pressure - the user process stops accumulating new uncommitted pages while waiting for the sync RPC - and waiting for that RPC guarantees that last_committed is updated before we start adding new dirty data.
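
The idea can be sketched with a toy model.  This is not Lustre code and does not reflect the actual patch: the class, its methods and the 80% pressure threshold are all hypothetical, and the "sync RPC" is modelled as an immediate commit of everything sent so far.

```python
class ToyClient:
    """Toy model of a client that pins written pages until they are committed.

    All names and the threshold are hypothetical; this is not Lustre code.
    """

    def __init__(self, mem_limit_pages, pressure_threshold=0.8):
        self.mem_limit_pages = mem_limit_pages
        self.pressure_threshold = pressure_threshold  # assumed fraction of the limit
        self.next_transno = 1    # transaction number given to the next write
        self.last_committed = 0  # highest transaction known to be committed
        self.pinned = {}         # transno -> page count, still uncommitted
        self.syncs_issued = 0

    def pinned_pages(self):
        return sum(self.pinned.values())

    def under_pressure(self):
        limit = self.pressure_threshold * self.mem_limit_pages
        return self.pinned_pages() >= limit

    def update_last_committed(self, transno):
        self.last_committed = max(self.last_committed, transno)
        # Pages belonging to committed transactions can be unpinned.
        self.pinned = {t: n for t, n in self.pinned.items()
                       if t > self.last_committed}

    def sync_rpc(self):
        """Stand-in for a blocking sync: everything sent so far gets committed."""
        self.syncs_issued += 1
        self.update_last_committed(self.next_transno - 1)

    def write(self, npages):
        # Check before adding new dirty data: under pressure, wait for a sync.
        if self.under_pressure():
            self.sync_rpc()
        self.pinned[self.next_transno] = npages
        self.next_transno += 1


client = ToyClient(mem_limit_pages=1000)
peak = 0
for _ in range(50):
    client.write(100)
    peak = max(peak, client.pinned_pages())
print(peak, client.syncs_issued)
```

In this model the writer is the one that waits, so pinned pages never grow past the threshold no matter how fast writes are issued.  Without the check in write(), the pinned count would grow until some external event advanced last_committed.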



 Comments   
Comment by Patrick Farrell [ 29/Mar/23 ]

Patch transferred from LU-16671:
"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50460
Subject: LU-16680 llite: add lowmem_sync feature
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d9c169ffc6e64e6f188261f3991e641408f92b9a
