[LU-9920] Use pagevec for marking pages dirty Created: 25/Aug/17 Updated: 22/Oct/20 Resolved: 01/Oct/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.13.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Patrick Farrell (Inactive) | Assignee: | Dongyang Li |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
When doing i/o from multiple writers to a single file, the spinlock protecting the page cache (the old mapping->tree_lock) becomes heavily contended.

Most current uses of the page-dirtying code work a single page at a time. This converts the dirty-marking path to use a pagevec, so the lock is taken once per batch of pages rather than once per page.

This improves shared file write performance notably when many threads are writing to the same file.

[NB: Detailed numbers coming. Looks like no change in …] |
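As a rough illustration of the idea only (not the code from the patch under review), a batched dirty-marking helper might look like the sketch below. It assumes the caller already holds the page locks and that all pages in the pagevec belong to the same file's mapping, and it omits the memcg and writeback accounting the kernel's per-page __set_page_dirty_nobuffers() path performs; the name demo_set_pagevec_dirty() is made up for illustration.

```c
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/pagevec.h>

/*
 * Illustrative sketch only: mark every page in a pagevec dirty while taking
 * the mapping's i_pages (xarray) lock once for the whole batch, instead of
 * once per page as the per-page dirtying path does.
 */
static void demo_set_pagevec_dirty(struct pagevec *pvec)
{
	struct address_space *mapping = pvec->pages[0]->mapping;
	unsigned long flags;
	int i;

	/* Set the per-page dirty flag (no tree access needed for this). */
	for (i = 0; i < pagevec_count(pvec); i++)
		SetPageDirty(pvec->pages[i]);

	/* One lock round trip for the whole batch of pages. */
	xa_lock_irqsave(&mapping->i_pages, flags);
	for (i = 0; i < pagevec_count(pvec); i++)
		__xa_set_mark(&mapping->i_pages, pvec->pages[i]->index,
			      PAGECACHE_TAG_DIRTY);
	xa_unlock_irqrestore(&mapping->i_pages, flags);

	/* Let the VFS know the inode now has dirty pagecache pages. */
	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
}
```

With up to PAGEVEC_SIZE pages per batch, this turns one lock/unlock cycle per page into one per pagevec, which is where the shared-file write improvement comes from.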
| Comments |
| Comment by Gerrit Updater [ 25/Aug/17 ] |
|
Patrick Farrell (paf@cray.com) uploaded a new patch: https://review.whamcloud.com/28711 |
| Comment by Andreas Dilger [ 21/Sep/17 ] |
|
Patrick, a similar issue exists when pages are dropped from cache upon lock cancellation. It would be useful to clean this up to use invalidate_page_range() or similar to drop pages from cache (at least in stripe_size chunks) instead of doing it one page at a time as it does today. |
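For illustration of the chunked approach only (this is not Lustre's current lock-cancellation code), dropping the pages covering a cancelled extent in stripe_size-sized chunks with the stock kernel helper invalidate_inode_pages2_range() could look roughly like this; demo_drop_range() and its parameters are hypothetical:

```c
#include <linux/fs.h>
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/pagemap.h>

/*
 * Illustrative sketch: invalidate cached pages over [start, end] (byte
 * offsets, inclusive) one stripe-sized chunk at a time instead of one page
 * at a time.  Error handling (-EBUSY for pages that cannot be invalidated)
 * is ignored for brevity.
 */
static void demo_drop_range(struct address_space *mapping,
			    loff_t start, loff_t end, loff_t stripe_size)
{
	loff_t pos;

	for (pos = start; pos <= end; pos += stripe_size) {
		loff_t chunk_end = min_t(loff_t, pos + stripe_size - 1, end);

		/* Both page indices are inclusive. */
		invalidate_inode_pages2_range(mapping,
					      pos >> PAGE_SHIFT,
					      chunk_end >> PAGE_SHIFT);
	}
}
```

A real cancellation path would presumably also need to flush dirty pages in the range first (e.g. via filemap_write_and_wait_range()) before invalidating them.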
| Comment by Patrick Farrell (Inactive) [ 21/Sep/17 ] |
|
Andreas,

See […]. Unlike this patch (which is new and has not run in production), Cray has been using a version of that change in production for some time. |
| Comment by Patrick Farrell (Inactive) [ 21/Sep/17 ] |
|
Ah, sorry, you've already seen that. Still, I think it's the answer to your thought. |
| Comment by Andreas Dilger [ 21/Sep/17 ] |
|
You are right. I thought I'd seen something similar, but when searching Jira I couldn't find it. |
| Comment by Patrick Farrell (Inactive) [ 26/Mar/19 ] |
|
As discussed in the gerrit patch for this, the improvement isn't worth the amount of kernel code we have to copy (in slightly modified form). |
| Comment by Gerrit Updater [ 12/Jun/19 ] |
|
Patrick Farrell (pfarrell@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35206 |
| Comment by Andreas Dilger [ 15/Jun/19 ] |
|
Patrick, is it possible that the high lock contention is a result of something that Lustre is doing under the lock (e.g. per-page lock callbacks or something) or is it conclusive that all of the contention is in the pagecache itself? There is an ongoing discussion in linux-fsdevel about page cache scalability w.r.t. concurrent writers, so if there is something we could usefully contribute to the upstream kernel that would help us down the road, now would be a good time to do so. |
| Comment by Patrick Farrell (Inactive) [ 15/Jun/19 ] |
|
My word, that's quite a thread. I should've gotten involved a few days ago. It's also degraded a bit, since Linus and Dave are ... getting along well ... in their most recent notes. I'm going to gather my thoughts/explain a bit here; I know some of this should go on the list, but I've got a lot of catching up to do before I can comment there.

So our workload is lots of threads doing buffered i/o to one file*, and we're spending lots of time on spinlocks in two functions: the one that adds pages to the page cache, and the one that marks pages dirty.

The particular spinlock is the one that protects the page cache itself. It used to be mapping->tree_lock, but it's not called a tree any more, so the lock has a new name... Same lock, though.

These functions are both modifying the radix tree/xarray (the data structure used for the page cache was renamed and re-APIed in the last year [no functional changes]) that holds the pages for this particular file. The first is adding pages, and the second one is tagging radix tree entries as dirty. (Pages have dirty state, but the radix tree/xarray entries are also tagged as dirty, which is a separate thing, and which allows writeback (mostly writeback) to rapidly find dirty pages. It is this tagging that needs to write the tree.)

In upstream, both of these operations are done a single page at a time, going back several calls.

The dirty tagging side is quite simple to do with a pagevec of pages instead, without releasing the lock for each page. This is still tagging one page at a time, but at least the lock is only taken once per pagevec. That's what my patch does: hold the lock around the loop where the pages are tagged. (The latest version takes advantage of Lustre-specific behavior to move several operations outside of the lock; an upstream version would look more like the earlier one... But as Ihara proved, that didn't make much difference anyway.)

The xarray/radix tree has the ability to tag pages in batches, but, bizarrely, it's only exposed with a function that does "if it has TAG X, add TAG Y", which is used for marking dirty pages for writeback. There is no way to just batch-set TAG X (and you can't trick the function either). It's a slightly bizarre choice and would be very easy to fix with a modification of that function.

Briefly put, the adding side is notably more complex to do in batches, mostly because there are a bunch of operations that can potentially fail, and also because the write functions deal with one page at a time, so there's no obvious place to pull in the pages. The core write function in the kernel (generic_perform_write) is a big loop that's basically:

- write_begin: find or allocate the page in the page cache and lock it
- copy that page's worth of data from userspace
- write_end: mark the page dirty and up to date, and unlock it

This is the thing Kent open-coded. He doesn't use write_begin/write_end (though he's basically got versions of them).

Like I said, I'll have to pore over that thread carefully, but the straightforward things to do would be:

- Allow bulk tagging in the xarray interface

The basic problem with the rest is code complexity, and then getting anything else to benefit. Without multiple concurrent writers to a single file (so, range locking rather than the inode mutex), there's not really contention on managing the per-file page tracking... Any benefit experienced by anyone without that would be very small. |
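For reference, the "allow bulk tagging in the xarray interface" item above could be a small helper along these lines. The xas_* calls are the existing xarray API; the name demo_xa_set_mark_range() is made up, and a production version would also need to pause and reschedule periodically over large ranges, the way tag_pages_for_writeback() (the existing "if marked X, also mark Y" helper) does:

```c
#include <linux/xarray.h>

/*
 * Illustrative sketch: set a mark on every present entry in [first, last]
 * with a single lock round trip, rather than taking the lock per entry or
 * requiring the entries to already carry another mark.
 */
static void demo_xa_set_mark_range(struct xarray *xa, unsigned long first,
				   unsigned long last, xa_mark_t mark)
{
	XA_STATE(xas, xa, first);
	void *entry;

	xas_lock_irq(&xas);
	xas_for_each(&xas, entry, last)
		xas_set_mark(&xas, mark);
	xas_unlock_irq(&xas);
}
```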
| Comment by Gerrit Updater [ 30/Sep/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/28711/ |
| Comment by Peter Jones [ 01/Oct/19 ] |
|
Landed for 2.13 |
| Comment by Gerrit Updater [ 22/Oct/20 ] |
|
Jian Yu (yujian@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40347 |