Details
-
Improvement
-
Resolution: Fixed
-
Minor
Description
This ticket is to explore fixing our unstable pages support in Lustre.
We pin write pages until the write transaction is committed on the server, and these pinned pages are not flushable in any way by the kernel.
This is a main driver of the rather high and variable memory requirements for buffered writes on Lustre. Pages in the page cache are limited by the kernel and reclaimed under memory pressure. Pages pinned waiting for commit are still counted in memory usage stats (for example, by cgroups), but the kernel cannot flush them. The normal memory pressure mechanisms therefore don't work on these pages: they are removed from the page cache under pressure, but not actually freed, because they are pinned.
As a result, there is no real restriction on how many of these pages we can create or hold on to. Once removed from the page cache under memory pressure, they just sit there unfreeable until they are unpinned. The only limit on how many of these pages we can have (as long as the client does not run out of memory entirely) is how much data the client can push before the server commits and the client learns about it, so it is a function of server and client speed.
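As a rough illustration of that last point, the amount of pinned data can be estimated as client write bandwidth multiplied by the time until the client learns of the commit. This is only a back-of-envelope sketch: the function names are hypothetical, and the bandwidth, commit delay, cgroup limit, and 4 KiB page size below are assumed example values, not measurements from any Lustre system.

```python
# Back-of-envelope estimate of pinned (written but uncommitted) data.
# All numbers here are hypothetical examples, not measured values.

PAGE_SIZE = 4096  # bytes; assumed page size
GIB = 1024 ** 3


def pinned_bytes_estimate(write_bw_bytes_per_s, commit_delay_s):
    """Data the client can push before it learns the server has committed."""
    return write_bw_bytes_per_s * commit_delay_s


def pinned_pages_estimate(write_bw_bytes_per_s, commit_delay_s,
                          page_size=PAGE_SIZE):
    """Same estimate, expressed as a whole number of pages."""
    return int(pinned_bytes_estimate(write_bw_bytes_per_s, commit_delay_s)
               // page_size)


if __name__ == "__main__":
    write_bw = 2 * GIB       # hypothetical client write rate: 2 GiB/s
    commit_delay = 5.0       # hypothetical seconds until commit is known
    cgroup_limit = 4 * GIB   # hypothetical cgroup memory limit

    pinned = pinned_bytes_estimate(write_bw, commit_delay)
    print("estimated pinned data: %.1f GiB" % (pinned / GIB))
    print("estimated pinned pages: %d"
          % pinned_pages_estimate(write_bw, commit_delay))
    print("exceeds cgroup limit: %s" % (pinned > cgroup_limit))
```

With these example inputs the estimate is well above the assumed cgroup limit, which is the situation described here: nothing on the client bounds the pinned pages except how fast it can write and how quickly commits are reported back.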
So when we hit memory pressure there can be a large number of unfreeable pages, and we can end up OOM killed because we can't free memory when asked to. This is particularly common with cgroups, because the cgroup memory limit is generally much lower than total system memory, so we're more likely to hit it. On systems not using cgroups we tend, in practice, to stay below global memory limits and avoid getting OOM killed. (There are various reasons; one is Lustre's default limit of 1/2 of RAM for page caching.)
This was supposed to be solved in part by the unstable pages mechanism, which we added some years ago but then disabled because it caused performance issues, primarily because of writeback limiting. The mechanism also has a notable lack of test coverage.
So this ticket is to look at what's required to get unstable pages working correctly, starting by turning them on and seeing what breaks. More detail in comments.