Details
Type: Improvement
Resolution: Unresolved
Priority: Minor
Description
We have seen some performance results which make us believe that the LRU cache slot reclaiming might need some improvement.
In one case, running a specific kind of application with more memory devoted to the Lustre cache (64 GB vs. 8 GB) made the application's performance worse, which should not happen if cache reclaiming works well.
In some other benchmarks, we found performance is very good while the cache is not full, but it drops immediately once the cache fills up. That should not happen either, because those benchmarks only do sequential reads and never read data back.
These results show that cache reclaiming needs improvement, especially when there is a large amount of cache memory. One possible cause is that as memory grows, osc_lru_reclaim() needs more time to scan the whole list to free slots. Note that even if the caller of osc_lru_reclaim() needs only one slot, osc_lru_reclaim() will try to reclaim cl_max_pages_per_rpc slots. And since the caller of osc_lru_reclaim() is always an I/O thread, reclaiming a batch of slots introduces overhead directly into the application, and that overhead grows as memory grows.
There may be a different cause, but we are testing a patch, and I am going to push it.
Is this app single threaded? Or is each thread somehow working on a different OSC...?
It’s hard to see how reducing the batch size could improve performance, even in the single threaded case, though. It will just result in more overhead for each page freed. Do you have any benchmarks showing that it helps?
Also, I agree that hitting the LRU limit has a performance cost, but that’s expected. Freeing pages is not free... Your description of it getting worse with more memory makes me wonder if we’re walking the list in the wrong order, but I don’t see how that could be the case.
I’d be interested in seeing basic perf traces of the application here. This is certainly a case that could be improved, but my experience with it suggests this is not the way to do that.